937 research outputs found

    Identification of a selective G1-phase benzimidazolone inhibitor by a senescence-targeted virtual screen using artificial neural networks

    Cellular senescence is a barrier to tumorigenesis in normal cells, and tumour cells undergo senescence responses to genotoxic stimuli, making senescence a potential target phenotype for cancer therapy. However, in this setting, mixed-mode responses are common, with apoptosis as the dominant effect. Hence, more selective senescence inducers are required. Here we report a machine learning-based in silico screen to identify potential senescence agonists. We built profiles of differentially affected biological process networks from expression data obtained under induced telomere dysfunction conditions in colorectal cancer cells and matched these to a panel of 17 protein targets with confirmatory screening data in PubChem. We trained a neural network using 3517 compounds identified as active or inactive against these targets. The resulting classification model was used to screen a virtual library of ~2M lead-like compounds. 147 virtual hits were acquired for validation in growth inhibition and senescence-associated β-galactosidase (SA-β-gal) assays. Among the hits, a benzimidazolone compound, CB-20903630, had a low micromolar IC50 for growth inhibition of HCT116 cells and selectively induced SA-β-gal activity in the entire treated cell population without cytotoxicity or apoptosis induction. Growth suppression was mediated by G1 blockade involving increased p21 expression and suppressed cyclin B1, CDK1 and CDC25C. Additionally, the compound inhibited growth of multicellular spheroids and caused severe retardation of population kinetics in long-term treatments. Preliminary structure-activity and structure clustering analyses are reported, and expression analysis of CB-20903630 against other cell cycle suppressor compounds suggested a PI3K/AKT-inhibitor-like profile in normal cells, with different pathways affected in cancer cells.

    Media optimization for biosurfactant production by Rhodococcus erythropolis MTCC 2794: artificial intelligence versus a statistical approach

    This paper presents a comprehensive study of the production of a biosurfactant from Rhodococcus erythropolis MTCC 2794. Two optimization techniques, (1) an artificial neural network (ANN) coupled with a genetic algorithm (GA) and (2) response surface methodology (RSM), were used for media optimization in order to enhance the biosurfactant yield. ANN and RSM models were developed, incorporating the quantities of four medium components (sucrose, yeast extract, meat peptone, and toluene) as independent input variables and biosurfactant yield [calculated in terms of percent emulsification index (% EI24)] as the output variable. ANN-GA and RSM were compared for their predictive and generalization ability using a separate data set of 16 experiments, for which the average quadratic errors were ~3% and ~6%, respectively. ANN-GA was found to be more accurate and consistent than RSM in predicting optimized conditions and maximum yield. For the ANN-GA model, the values of the correlation coefficient and average quadratic error were ~0.99 and ~3%, respectively. It was also shown that ANN-based models could be used accurately for sensitivity analysis. ANN-GA-optimized media gave about a 3.5-fold enhancement in biosurfactant yield.

    Impact of evaluation methods on decision tree accuracy

    Decision trees are one of the most powerful and commonly used supervised learning algorithms in the field of data mining. It is important that a decision tree performs accurately when employed on unseen data; therefore, evaluation methods are used to measure the predictive performance of a decision tree classifier. However, the predictive accuracy of a decision tree is also dependent on the evaluation method chosen, since the training and testing sets of decision tree models are selected according to the evaluation method. The aim of this thesis was to study and understand how using different evaluation methods might affect decision tree accuracies when they are applied to different decision tree algorithms. Consequently, comprehensive research was conducted on decision trees and evaluation methods. Additionally, an experiment was conducted using ten different datasets, five decision tree algorithms and five different evaluation methods in order to study the relationship between evaluation methods and decision tree accuracies. The decision tree inducers were tested with the Leave-one-out, 5-Fold Cross Validation, 10-Fold Cross Validation, Holdout 50 split and Holdout 66 split evaluation methods. According to the results, cross validation methods were superior to holdout methods overall. Moreover, the Holdout 50 split performed the poorest on most of the datasets. The possible reasons behind these results are also discussed in the thesis.
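
The evaluation methods named in this abstract differ only in how they divide the instances into training and testing sets. The following is a minimal sketch of that idea in plain Python, not the thesis's actual experimental code. The function names (`holdout_split`, `k_fold_splits`, `leave_one_out_splits`) and the fixed random seed are assumptions made for illustration.

```python
import random


def holdout_split(n, train_fraction, seed=0):
    """Shuffle indices 0..n-1 and cut them once into (train, test)."""
    indices = list(range(n))
    random.Random(seed).shuffle(indices)
    cut = int(round(n * train_fraction))
    return indices[:cut], indices[cut:]


def k_fold_splits(n, k, seed=0):
    """Yield k (train, test) pairs; every index is tested exactly once."""
    indices = list(range(n))
    random.Random(seed).shuffle(indices)
    # Deal the shuffled indices into k folds of near-equal size.
    folds = [indices[i::k] for i in range(k)]
    for i, test in enumerate(folds):
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test


def leave_one_out_splits(n):
    """Leave-one-out is k-fold with k equal to the number of instances."""
    return k_fold_splits(n, n)
```

Under these assumptions, a "Holdout 66 split" would correspond to `holdout_split(n, 0.66)`, which tests each model once on about a third of the data, whereas `k_fold_splits(n, 10)` tests on every instance exactly once across the ten folds.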

    Prediction of drug-drug interaction potential using machine learning approaches

    Drug discovery is a long, expensive, and complex, yet crucial process for the benefit of society. Selecting potential drug candidates requires an understanding of how well a compound will perform at its task and, more importantly, how safely the compound will act in patients. A key safety insight is understanding a molecule's potential for drug-drug interactions. The metabolism of many drugs is mediated by members of the cytochrome P450 superfamily, notably the CYP3A4 enzyme. Inhibition of these enzymes can alter the bioavailability of other drugs, potentially increasing their levels to toxic amounts. Four models were developed to predict CYP3A4 inhibition: logistic regression, random forests, support vector machine, and neural network. Two novel convolutional approaches were explored for data featurization: SMILES string auto-extraction and 2D structure auto-extraction. The logistic regression model achieved an accuracy of 83.2%, the random forests model 83.4%, the support vector machine model 81.9%, and the neural network model 82.3%. Additionally, the model built with SMILES string auto-extraction had an accuracy of 82.3%, and the model with 2D structure auto-extraction 76.4%. The advantage of the novel featurization methods is their ability to learn relevant features from compound SMILES strings, eliminating feature engineering. The developed methodologies can be extended towards predicting any structure-activity relationship and adapted for other areas of drug discovery and development.

    New statistical learning methods for chemical toxicity data analysis

    In the first part of the dissertation, we introduce the change-line classification and regression method to study latent subgroups. The proposed method finds a line which optimally divides a feature space into two heterogeneous subgroups, each of which yields a response having a different probability distribution or a different regression model. The procedure is useful for classifying biochemicals on the basis of toxicity, where the feature space consists of chemical descriptors and the response is toxicity activity. In this setting, the goal is to identify subgroups of chemicals with different toxicity profiles. The split-line algorithm is utilized to reduce computational complexity. A two-step estimation procedure, using either least squares or maximum likelihood for implementation, is described. Two sets of simulation studies and a data analysis applying our method to rat acute toxicity data are presented to demonstrate the utility of the proposed method. Second, the asymptotic properties of the change-line regression model are studied, including consistency and the rates of convergence of M-estimators, through empirical process techniques. We proved that the estimators of the regression parameters achieve square-root-n consistency while the estimators of the change-line parameters achieve n-consistency. Last, we introduce the Interactive Decision Committee method for classification when high-dimensional feature variables are grouped into feature categories. The proposed method uses the interactive relationships among feature categories to build base classifiers, which are combined using decision committees. The proposed procedure is useful for classifying biochemicals on the basis of toxicity activity, where the feature space consists of chemical descriptors belonging to at least one feature category, and the responses are binary indicators of toxicity activity. The support vector machine, random forests, and tree-based AdaBoost algorithms are utilized as classifier inducers. To combine base classifiers, a voting method with forward selection, with the number of base classifiers chosen by 5-fold CV, and a stacked generalization with two different learning algorithms were utilized. We applied the proposed method to two chemical toxicity data sets. For these data sets, the proposed method improved the classification performance, in terms of average prediction accuracy, compared to a single classifier.
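
The voting step used to combine base classifiers can be illustrated with a minimal sketch. This shows plain majority voting only; it does not implement the forward selection, the 5-fold CV choice of committee size, or the stacked generalization described in the abstract. The function name `majority_vote` and the input layout are assumptions made for illustration.

```python
from collections import Counter


def majority_vote(predictions):
    """Combine per-classifier label lists into one label per instance.

    predictions: list of equal-length lists, one list per base classifier.
    With an odd number of binary classifiers no ties can occur.
    """
    combined = []
    # zip(*predictions) groups the votes cast for each instance.
    for votes in zip(*predictions):
        combined.append(Counter(votes).most_common(1)[0][0])
    return combined
```

For example, three hypothetical base classifiers predicting `[1, 0, 1, 1]`, `[1, 1, 0, 1]` and `[0, 0, 0, 1]` would be combined by taking, for each instance, the label that at least two of them agree on.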

    Towards more reliable feature evaluations for classification

    In this thesis we study feature subset selection and feature weighting algorithms. Our aim is to make their output more stable and more useful when used to train a classifier. We begin by defining the concept of stability and selecting a measure to assess the output of the feature selection process. Then we study different sources of instability and propose modifications of classic algorithms that improve their stability. We propose a modification of wrapper algorithms that takes otherwise unused information into account to overcome an intrinsic source of instability for these algorithms: the feature assessment being a random variable that depends on the particular training subsample. Our version accumulates the evaluation results of each feature at each iteration to average out the effect of the randomness. Another novel proposal is to make wrappers evaluate the remainder set of features at each step to overcome another source of instability: the randomness of the algorithms themselves. In this case, by evaluating the non-selected set of features, the initial choice of variables is better informed. These modifications do not add much computational overhead and deliver better results, both in terms of stability and predictive power. We finally tackle another source of instability: the differential contribution of the instances to feature assessment. We present a framework to combine almost any instance weighting algorithm with any feature weighting one. Our combination of algorithms delivers more stable results for the various feature weighting algorithms we have tested. Finally, we present a deeper integration of instance weighting with feature weighting by modifying the Simba algorithm, which delivers even better results in terms of stability.
    The focus of this thesis is to measure, study and improve the stability of feature subset selection (FSS) and feature weighting (FW) algorithms in a supervised learning context. The general purpose of FSS in a classification context is to improve prediction accuracy. We argue that there is another major challenge in FSS and FW: the stability of the results. Having chosen a stability measure from among those studied, we propose improvements to a very popular algorithm: Relief. We analyse different distance measures in addition to the original one and study their effect on accuracy, redundancy detection and stability. We also test different ways of using the weights computed at each step to influence the distance calculation, in a manner similar to another FW algorithm: Simba. We also improve its stability by increasing the contribution of the feature weights to the distance calculation as time progresses, to minimise the impact of the random selection of the first instances. As for wrapper algorithms, we modify them to take into account information that was previously ignored, in order to overcome an intrinsic source of instability: the fact that the feature assessment is a random variable that depends on the data subset used. Our version accumulates the results at each iteration to compensate for the random effect, whereas the original algorithms discard all the information gathered about each feature in a given iteration and start afresh at the next one, leading to more unstable results. Another proposal is to make these wrappers evaluate the non-selected subset of features at each iteration to avoid another source of instability. These modifications do not entail a large increase in computational cost, and their results are more stable and more useful to a classifier. Finally, we propose weighting the contribution of each instance in FW. There may be atypical observations that should not be given as much weight as the others; if we are trying to predict cancer using information from genetic analyses, we should give less credibility to data obtained from people exposed to high levels of radiation, even though we have no information about that exposure. Instance weighting (IW) methods aim to identify these cases and assign them lower weights. Several authors have worked on IW schemes to improve FSS, but there is no prior work on combining IW with FW. We present a framework for combining IW algorithms with FW algorithms. We also propose a new IW algorithm based on the decision margin concept used by some FW algorithms. With this framework we tested the modifications against the original versions using several data sets from the UCI repository, DNA microarray data sets and those used in the NIPS-2003 feature selection challenge. Our combinations of instance and feature weighting algorithms give more stable results for several of the feature weighting algorithms we studied. Finally, we present a deeper integration of instance weighting with the Simba feature selection algorithm, which consists of using the instance weights to weight the distance calculation, and with which we obtain even better results in terms of stability. The main contributions of this thesis are: (i) a framework for combining IW with FW, (ii) a review of FSS stability measures, (iii) several modifications of FSS and FW algorithms that improve their stability and the predictive power of the selected feature subset without a significant increase in computational cost, (iv) a theoretical definition of feature importance, and (v) a study of the relationship between the stability of FSS and feature redundancy.
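
The idea of accumulating each feature's evaluation results across iterations, rather than discarding them, can be sketched as a running mean. This is only an illustration of the averaging principle under assumed inputs, not the thesis's wrapper algorithm. The class name `FeatureScoreAccumulator`, its methods, and the score values in the usage note are hypothetical.

```python
class FeatureScoreAccumulator:
    """Keep a running mean of each feature's evaluation score."""

    def __init__(self):
        self._total = {}
        self._count = {}

    def update(self, scores):
        """Add one iteration's scores, given as {feature_name: score}."""
        for feature, score in scores.items():
            self._total[feature] = self._total.get(feature, 0.0) + score
            self._count[feature] = self._count.get(feature, 0) + 1

    def mean(self, feature):
        """Average score over all iterations in which the feature was seen."""
        return self._total[feature] / self._count[feature]

    def ranking(self):
        """Features ordered from highest to lowest mean score."""
        return sorted(self._count, key=self.mean, reverse=True)
```

If a feature scores 0.9 on one training subsample and 0.1 and 0.2 on the next two, a method that looks only at a single iteration could rank it first or last depending on the subsample, whereas the accumulated mean of 0.4 varies less with any one draw.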

    Advances and applications in Ensemble Learning


    Computational prediction of metabolism: sites, products, SAR, P450 enzyme dynamics, and mechanisms.

    Metabolism of xenobiotics remains a central challenge for the discovery and development of drugs, cosmetics, nutritional supplements, and agrochemicals. Metabolic transformations are frequently related to the incidence of toxic effects that may result from the emergence of reactive species, the systemic accumulation of metabolites, or the induction of metabolic pathways. Experimental investigation of the metabolism of small organic molecules is particularly resource demanding; hence, computational methods are of considerable interest to complement experimental approaches. This review provides a broad overview of structure- and ligand-based computational methods for the prediction of xenobiotic metabolism. Current computational approaches to address xenobiotic metabolism are discussed from three major perspectives: (i) prediction of sites of metabolism (SOMs), (ii) elucidation of potential metabolites and their chemical structures, and (iii) prediction of direct and indirect effects of xenobiotics on metabolizing enzymes, where the focus is on the cytochrome P450 (CYP) superfamily of enzymes, the cardinal xenobiotic-metabolizing enzymes. For each of these domains, a variety of approaches and their applications are systematically reviewed, including expert systems, data mining approaches, quantitative structure-activity relationships (QSARs), machine learning-based methods, pharmacophore-based algorithms, shape-focused techniques, molecular interaction fields (MIFs), reactivity-focused techniques, protein-ligand docking, molecular dynamics (MD) simulations, and combinations of methods. Predictive metabolism is a developing area, and there is still enormous potential for improvement. However, it is clear that the combination of rapidly increasing amounts of available ligand- and structure-related experimental data (in particular, quantitative data) with novel and diverse simulation and modeling approaches is accelerating the development of effective tools for prediction of in vivo metabolism, which is reflected by the diverse and comprehensive data sources and methods for metabolism prediction reviewed here. This review attempts to survey the range and scope of computational methods applied to metabolism prediction and also to compare and contrast their applicability and performance. JK, MJW, JT, PJB, AB and RCG thank Unilever for funding.