805 research outputs found

    Scalable Similarity Search for Molecular Descriptors

    Similarity search over chemical compound databases is a fundamental task in the discovery and design of novel drug-like molecules. Such databases often encode molecules as non-negative integer vectors, called molecular descriptors, which represent rich information on various molecular properties. While efficient indexing structures exist for searching databases of binary vectors, solutions for more general integer vectors are in their infancy. In this paper we present a time- and space-efficient index for the problem, which we call the succinct intervals-splitting tree algorithm for molecular descriptors (SITAd). Our approach extends efficient methods for binary-vector databases and uses ideas from succinct data structures. Our experiments, on a large database of over 40 million compounds, show that SITAd significantly outperforms alternative approaches in practice. Comment: To appear in the Proceedings of SISAP'1
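    As a point of reference for the search problem described above, the following is a minimal sketch of a brute-force baseline, not the SITAd index itself. The abstract does not name the similarity measure, so the generalized Jaccard (min/max) similarity is used here as an assumption; for binary vectors it reduces to the Tanimoto coefficient. The names `generalized_jaccard` and `linear_scan` are hypothetical.

```python
from typing import List, Sequence


def generalized_jaccard(x: Sequence[int], y: Sequence[int]) -> float:
    """Generalized Jaccard similarity of two non-negative integer vectors.

    Assumed measure (not stated in the abstract): sum of element-wise
    minima divided by sum of element-wise maxima.
    """
    if len(x) != len(y):
        raise ValueError("vectors must have the same length")
    numerator = sum(min(a, b) for a, b in zip(x, y))
    denominator = sum(max(a, b) for a, b in zip(x, y))
    # Two all-zero vectors are treated as identical.
    return 1.0 if denominator == 0 else numerator / denominator


def linear_scan(db: List[Sequence[int]], query: Sequence[int],
                threshold: float) -> List[int]:
    """Return indices of all database vectors with similarity >= threshold.

    This scans every vector; an index such as the one described above
    exists precisely to avoid this cost on large databases.
    """
    return [i for i, v in enumerate(db)
            if generalized_jaccard(v, query) >= threshold]


# Toy descriptor database (made-up values, for illustration only).
toy_db = [[1, 2, 0], [0, 0, 3], [1, 1, 1]]
hits = linear_scan(toy_db, [1, 1, 1], threshold=0.5)
```

    A linear scan costs time proportional to the database size per query, which is why an index matters at the scale of tens of millions of compounds.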

    Critical assessment of QSAR models of environmental toxicity against Tetrahymena pyriformis: focusing on applicability domain and overfitting by variable selection

    The estimation of the accuracy of predictions is a critical problem in QSAR modeling. The "distance to model" can be defined as a metric that quantifies the similarity between the training set molecules and the test set compound for the given property in the context of a specific model. It can be expressed in many different ways, e.g., using the Tanimoto coefficient, leverage, correlation in the space of models, etc. In this paper we have used mixtures of Gaussian distributions as well as statistical tests to evaluate six types of distances to models with respect to their ability to discriminate compounds with small and large prediction errors. The analysis was performed for twelve QSAR models of aqueous toxicity against T. pyriformis obtained with different machine-learning methods and various types of descriptors. The distances to model based on the standard deviation of predicted toxicity calculated from the ensemble of models afforded the best results. This distance also successfully discriminated molecules with small and large prediction errors for a mechanism-based model developed using log P and the Maximum Acceptor Superdelocalizability descriptors. Thus, the distance to model metric could also be used to augment mechanistic QSAR models by estimating their prediction errors. Moreover, the accuracy of prediction is mainly determined by the training set data distribution in the chemistry and activity spaces but not by the QSAR approaches used to develop the models. We have shown that incorrect validation of a model may result in the wrong estimation of its performance and suggested how this problem could be circumvented. The toxicity of 3182 and 48774 molecules from the EPA High Production Volume (HPV) Challenge Program and EINECS (European chemical Substances Information System), respectively, was predicted, and the accuracy of prediction was estimated. The developed models are available online at http://www.qspr.org

    Corrected overlap weight and clustering coefficient

    We discuss two well-known network measures: the overlap weight of an edge and the clustering coefficient of a node. It turns out that neither is very useful for the data-analytic task of identifying important elements (nodes or links) of a given network. The reason is that they attain their largest values on maximal subgraphs of relatively small size, which are more likely to appear in a network than larger ones. We show how the definitions of these measures can be corrected so that they give the expected results. We illustrate the proposed corrected measures by applying them to the US Airports network using the program Pajek. Comment: The paper is a detailed and extended version of the talk presented at the CMStatistics (ERCIM) 2015 Conferenc
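    The behaviour described above can be seen in a small sketch of the standard, uncorrected local clustering coefficient. The paper's corrected definitions are not reproduced here; the toy graph and the function name `clustering_coefficient` are assumptions made for illustration.

```python
from itertools import combinations
from typing import Dict, Hashable, Set

Graph = Dict[Hashable, Set[Hashable]]


def clustering_coefficient(adj: Graph, v: Hashable) -> float:
    """Standard local clustering coefficient of node v.

    Fraction of pairs of neighbours of v that are themselves linked:
    2 * links / (k * (k - 1)), where k is the degree of v.
    """
    neighbours = adj[v]
    k = len(neighbours)
    if k < 2:
        return 0.0  # no neighbour pairs to compare
    links = sum(1 for a, b in combinations(neighbours, 2) if b in adj[a])
    return 2 * links / (k * (k - 1))


# Toy undirected graph: a triangle (a, b, c) and a hub h with four
# neighbours (p, q, r, s), of which only p and q are linked.
toy_graph: Graph = {
    "a": {"b", "c"}, "b": {"a", "c"}, "c": {"a", "b"},
    "h": {"p", "q", "r", "s"},
    "p": {"h", "q"}, "q": {"h", "p"}, "r": {"h"}, "s": {"h"},
}

cc_triangle_node = clustering_coefficient(toy_graph, "a")
cc_hub = clustering_coefficient(toy_graph, "h")
```

    In this toy graph a node of the small triangle reaches the maximum value, while the higher-degree hub scores far lower, which is the ranking problem the abstract points to.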

    Prediction of Hydrate and Solvate Formation Using Statistical Models

    Novel, knowledge-based models for the prediction of hydrate and solvate formation are introduced, which require only the molecular formula as input. A data set of more than 19 000 organic, nonionic, and nonpolymeric molecules was extracted from the Cambridge Structural Database. Molecules that formed solvates were compared with those that did not using molecular descriptors and statistical methods, which allowed the identification of chemical properties that contribute to solvate formation. The study was conducted for five types of solvates: ethanol, methanol, dichloromethane, chloroform, and water solvates. The identified properties were all related to the size and branching of the molecules and to their hydrogen bonding ability. The corresponding molecular descriptors were used to fit logistic regression models that predict the probability that a given molecule forms a solvate. The established models were able to predict the behavior of ∼80% of the data correctly using only two descriptors in the predictive model

    Predicting toxicity through computers: a changing world

    The computational approaches used to predict toxicity are evolving rapidly, a process hastened by the emergence of new ways of describing chemical information. Although this trend offers many opportunities, new regulations, such as the European Community's 'Registration, Evaluation, Authorisation and Restriction of Chemicals' (REACH), demand that models be ever more robust

    Phase II study of eribulin in combination with gemcitabine for the treatment of patients with locally advanced or metastatic triple negative breast cancer (ERIGE Trial). Clinical and pharmacogenetic results on behalf of the Gruppo Oncologico Italiano di Ricerca Clinica (GOIRC)

    Background: There are no well-established chemotherapy regimens for metastatic triple negative breast cancer. The combination of a microtubule inhibitor (eribulin) with a nucleoside analog (gemcitabine) may synergistically induce tumor cell death, especially in tumors like triple negative breast cancers (TNBC) characterized by high cell proliferation, aggressive tumor behavior, and chemo-resistance. Materials and Methods: This is an open-label, national multicenter phase II study evaluating the combination of eribulin (0.88 mg/m²) plus gemcitabine (1000 mg/m²) on days 1 and 8, q21, as either first- or second-line treatment of locally advanced or metastatic TNBC. The primary endpoint was the objective response rate (ORR) for evaluable patients (pts). The study was designed according to Simon's two-stage optimal design. We chose a lower activity level (p0) of 0.20 and a target activity level (p1) of 0.35. A prospective, molecular correlative study was carried out on germline DNA of the study population to assess the role of BRCA mutations and single nucleotide polymorphisms (SNPs) in predicting efficacy and toxicity of the combination regimen. Results: From July 2013 to September 2016, 83 evaluable pts (37 in the first stage, 46 in the second one) were enrolled. They received a median number of 6 cycles of treatment (range 1-24). The ORR (CR+PR) was 37.35% (90% CI: 28.47-46.93) and the clinical benefit rate (CR+PR+SD ≥ 24 wks) was 48.78% (90% CI: 39.24%-58.39%). The most common grade 3-4 adverse events (>10% of patients) were neutropenia and liver toxicity. With a median follow-up of 28.8 months, the median progression-free survival (PFS) and overall survival (OS) were 5.1 months (95% CI: 4.2-7.0) and 14.7 months (95% CI: 10.2-20.0), respectively. BRCA1/2 deleterious mutations were observed in 15 (22%) out of 68 genotyped pts. Women with BRCA1/2 mutations had worse ORR, PFS and OS than those with wild-type BRCA1/2. A panel of SNPs in genes of study drug metabolism pathways was evaluated. Among these, CYP3A4 392A>G and FGD4 2044236G>A SNPs were associated with greater liver toxicity by logistic regression analysis. Furthermore, CDA*2 79A>C, RRM1 2455A>G, and CYP2C8 416G>A SNPs were associated with poorer overall survival by Cox proportional hazards model. Conclusions: The combination of eribulin and gemcitabine shows promising activity and a moderate toxicity profile in metastatic TNBC. BRCA status and pharmacogenetics tests may help identify pts with high probability of response with negligible toxicity

    Walk your talk: Real-world adherence to guidelines on the use of MRI in multiple sclerosis

    (1) Although guidelines about the use of MRI sequences for Multiple Sclerosis (MS) diagnosis and follow-up are available, variability in acquisition protocols is not uncommon in everyday clinical practice. The aim of this study was to evaluate the real-world application of MS imaging guidelines in different settings to clarify the level of adherence to these guidelines. (2) Via an online anonymous survey, neuroradiologists (NR) were asked about MRI protocols and parameters routinely acquired when MS patients are evaluated in their center, both at diagnosis and follow-up. Furthermore, data about report content and personal opinions about emerging neuroimaging markers were also retrieved. (3) A total of 46 participants were included, mostly working in a hospital or university hospital (80.4%) and with more than 10 years of experience (47.9%). We found relatively good adherence to the suggested MRI protocols regarding the use of T2-weighted sequences, although almost 10% of the participants routinely acquired 2D sequences with a slice thickness greater than 3 mm. On the other hand, a wider degree of heterogeneity was found regarding gadolinium administration, almost routinely performed at follow-up examination (87.0% of cases) in contrast with the current guidelines, as well as a low use of a standardized reporting system (17.4% of cases). (4) Although the MS community is getting closer to a standardization of MRI protocols, there is still a relatively wide heterogeneity among NR, with particular reference to contrast administration, which must be overcome to guarantee an adequate quality of patient care in MS
