642 research outputs found

    Interpreting linear support vector machine models with heat map molecule coloring

    Background: Model-based virtual screening plays an important role in the early drug discovery stage. The outcomes of high-throughput screenings are a valuable source for machine learning algorithms to infer such models. Besides strong performance, the interpretability of a machine learning model is a desirable property for guiding the optimization of a compound in later drug discovery stages. Linear support vector machines have shown convincing performance on large-scale data sets. The goal of this study is to present a heat map molecule coloring technique to interpret linear support vector machine models. Based on the weights of a linear model, the visualization approach colors each atom and bond of a compound according to its importance for activity.
    Results: We evaluated our approach on a toxicity data set, a chromosome aberration data set, and the maximum unbiased validation data sets. The experiments show that our method sensibly visualizes structure-property and structure-activity relationships of a linear support vector machine model. The coloring of ligands in the binding pocket of several crystal structures of a maximum unbiased validation data set target indicates that our approach helps to determine the correct ligand orientation in the binding pocket. Additionally, the heat map coloring enables the identification of substructures important for the binding of an inhibitor.
    Conclusions: In combination with heat map coloring, linear support vector machine models can help to guide the modification of a compound in later stages of drug discovery. In particular, substructures identified as important by our method might be a starting point for the optimization of a lead compound. The heat map coloring should be considered complementary to structure-based modeling approaches. As such, it helps to get a better understanding of the binding mode of an inhibitor.
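The idea of coloring atoms by linear model weights can be illustrated with a minimal sketch. It assumes, hypothetically, that each model feature is a substructure with one weight and that the atoms covered by each feature in a given molecule are already known; the names `atom_scores`, `normalize`, `feature_weights`, and `feature_atoms` are illustrative and not taken from the paper, and the summation rule is only one plausible way to distribute weights over atoms.

```python
def atom_scores(feature_weights, feature_atoms, n_atoms):
    """Sum, for every atom, the weights of all features that cover it.

    feature_weights: dict feature id -> weight of the linear model (assumed given)
    feature_atoms:   dict feature id -> atom indices the feature covers in this molecule
    """
    scores = [0.0] * n_atoms
    for feature, atoms in feature_atoms.items():
        weight = feature_weights.get(feature, 0.0)  # unknown features contribute nothing
        for atom in atoms:
            scores[atom] += weight
    return scores


def normalize(scores):
    """Scale scores to [-1, 1] by the largest absolute value, for a diverging color map."""
    largest = max((abs(s) for s in scores), default=0.0)
    if largest == 0.0:
        return [0.0] * len(scores)
    return [s / largest for s in scores]


# Toy molecule with three atoms and two overlapping substructure features.
weights = {"f1": 2.0, "f2": -1.0}
covered = {"f1": [0, 1], "f2": [1, 2]}
raw = atom_scores(weights, covered, n_atoms=3)
scaled = normalize(raw)
```

Positive scaled values would map to one end of the color scale and negative values to the other; the actual color map and the treatment of bonds are left open here.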

    Comparison of the Predictive Performance and Interpretability of Random Forest and Linear Models on Benchmark Datasets

    The ability to interpret the predictions made by quantitative structure activity relationships (QSARs) offers a number of advantages. Whilst QSARs built using non-linear modelling approaches, such as the popular Random Forest algorithm, might sometimes be more predictive than those built using linear modelling approaches, their predictions have been perceived as difficult to interpret. However, a growing number of approaches have been proposed for interpreting non-linear QSAR models in general and Random Forest in particular. In the current work, we compare the performance of Random Forest to two widely used linear modelling approaches: linear Support Vector Machines (SVM), or Support Vector Regression (SVR), and Partial Least Squares (PLS). We compare their performance in terms of their predictivity as well as the chemical interpretability of the predictions, using novel scoring schemes for assessing Heat Map images of substructural contributions. We critically assess different approaches to interpreting Random Forest models as well as for obtaining predictions from the forest. We assess the models on a large number of widely employed, public domain benchmark datasets corresponding to regression and binary classification problems of relevance to hit identification and toxicology. We conclude that Random Forest typically yields comparable or possibly better predictive performance than the linear modelling approaches and that its predictions may also be interpreted in a chemically and biologically meaningful way. In contrast to earlier work looking at interpreting non-linear QSAR models, we directly compare two methodologically distinct approaches for interpreting Random Forest models. The approaches for interpreting Random Forest assessed in our article were implemented using Open Source programs, which we have made available to the community. 
    These programs are the rfFC package [https://r-forge.r-project.org/R/?group_id=1725] for the R Statistical Programming Language, along with a Python program HeatMapWrapper [https://doi.org/10.5281/zenodo.495163] for Heat Map generation.

    Improving the expressiveness of black-box models for predicting student performance

    Systems for the early prediction of student performance can be very useful to guide student learning. To be an effective learning aid, a prediction model must provide tools to adequately interpret progress, to detect trends and behaviour patterns, and to identify the causes of learning problems. White-box and black-box techniques have been described in the literature to implement prediction models. White-box techniques require a priori models to explore, which makes them easy to interpret but difficult to generalize and unable to detect unexpected relationships in the data. Black-box techniques are easier to generalize and suitable for discovering unsuspected relationships, but they are cryptic and difficult for most teachers to interpret. In this paper a black-box technique is proposed to take advantage of the power and versatility of these methods, while making some decisions about the input data and the design of the classifier that provide a rich output data set. A set of graphical tools is also proposed to exploit the output information and provide a meaningful guide to teachers and students. Based on our experience, a set of tips on how to design a prediction system and how to represent the output information is also provided.

    Development and Interpretation of Machine Learning Models for Drug Discovery

    In drug discovery, domain experts from different fields such as medicinal chemistry, biology, and computer science often collaborate to develop novel pharmaceutical agents. Computational models developed in this process must be correct and reliable, but at the same time interpretable. Their findings have to be accessible to experts from fields other than computer science, so that they can be validated and improved with domain knowledge. Only then can interdisciplinary teams communicate their scientific results both precisely and intuitively. This work is concerned with the development and interpretation of machine learning models for drug discovery. To this end, it describes the design and application of computational models for specialized use cases, such as compound profiling and hit expansion. Novel insights into machine learning for ligand-based virtual screening are presented, and limitations in the modeling of compound potency values are highlighted. It is shown that compound activity can be predicted based on high-dimensional target profiles, without the presence of molecular structures. Moreover, support vector regression for potency prediction is carefully analyzed, and a systematic misprediction of highly potent ligands is discovered. Furthermore, a key aspect is the interpretation and chemically accessible representation of the models. Therefore, this thesis focuses especially on methods to better understand and communicate modeling results. To this end, two interactive visualizations for the assessment of naive Bayes and support vector machine models on molecular fingerprints are presented. These visual representations of virtual screening models are designed to provide an intuitive chemical interpretation of the results.

    Interpreting random forest classification models using a feature contribution method

    Model interpretation is one of the key aspects of the model evaluation process. The explanation of the relationship between model variables and outputs is relatively easy for statistical models, such as linear regressions, thanks to the availability of model parameters and their statistical significance. For “black box” models, such as random forest, this information is hidden inside the model structure. This work presents an approach for computing feature contributions for random forest classification models. It allows for the determination of the influence of each variable on the model prediction for an individual instance. By analysing feature contributions for a training dataset, the most significant variables can be determined and their typical contributions towards predictions made for individual classes, i.e., class-specific feature contribution “patterns”, can be discovered. These patterns represent a standard behaviour of the model and allow for an additional assessment of the model reliability for new data. Interpretation of feature contributions for two UCI benchmark datasets shows the potential of the proposed methodology. The robustness of results is demonstrated through an extensive analysis of feature contributions calculated for a large number of generated random forest models.
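A per-instance feature contribution of this kind can be sketched as follows. This is an illustrative reading, not the paper's implementation: it assumes each tree node stores the fraction of positive-class training samples (`value`), credits the change in that fraction along the decision path to the feature used at each split, and averages over trees. The dict-based tree layout and the function names are hypothetical.

```python
def path_contributions(tree, x, n_features):
    """Walk one tree for instance x and credit each split feature
    with the change in node value between parent and chosen child."""
    contrib = [0.0] * n_features
    node = tree
    while "feature" in node:  # leaves carry only a "value"
        go_left = x[node["feature"]] <= node["threshold"]
        child = node["left"] if go_left else node["right"]
        contrib[node["feature"]] += child["value"] - node["value"]
        node = child
    return tree["value"], contrib, node["value"]


def forest_contributions(trees, x, n_features):
    """Average the root value (bias) and the contributions over all trees."""
    bias = 0.0
    total = [0.0] * n_features
    for tree in trees:
        root_value, contrib, _ = path_contributions(tree, x, n_features)
        bias += root_value / len(trees)
        for j in range(n_features):
            total[j] += contrib[j] / len(trees)
    return bias, total


# Hand-built toy tree: values are positive-class fractions at each node.
toy_tree = {
    "value": 0.5, "feature": 0, "threshold": 1.0,
    "left": {"value": 0.2},
    "right": {
        "value": 0.7, "feature": 1, "threshold": 0.0,
        "left": {"value": 0.4},
        "right": {"value": 0.9},
    },
}
bias, contrib, leaf_value = path_contributions(toy_tree, [2.0, 1.0], n_features=2)
forest_bias, forest_contrib = forest_contributions([toy_tree, toy_tree], [2.0, 1.0], 2)
```

By construction the bias plus the summed contributions equals the value of the leaf the instance reaches, so the prediction decomposes additively over features.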

    Nonlinear Dimensionality Reduction for the Thermodynamics of Small Clusters of Particles

    This work employs tools and methods from computer science to study clusters comprising a small number N of interacting particles, which are of interest in science, engineering, and nanotechnology. Specifically, the thermodynamics of such clusters is studied using techniques from spectral graph theory (SGT) and machine learning (ML). SGT is used to define the structure of the clusters and ML is used on ensembles of cluster configurations to detect state variables that can be used to model the thermodynamic properties of the system. While the most fundamental description of a cluster is in 3N dimensions, i.e., the Cartesian coordinates of the particles, the ML results demonstrate that sub-spaces of much lower dimension can describe the observed structural motifs. Furthermore, these sub-spaces correlate with meaningful physical variables such as radius of gyration r_g and discrete connectivity c, which can be used as state variables in thermodynamic property descriptions. The overarching theme of this thesis is to develop the practice of utilizing data-driven computational techniques to solve problems in natural sciences. Code for this project can be found at https://github.com/AdityaDendukuri/DimReductionThermodynamics
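As a small illustration of one of the physical variables named above, the radius of gyration of a cluster can be computed directly from the 3N Cartesian coordinates. The sketch assumes equal particle masses and uses the standard definition (root mean squared distance from the centre of mass); the function name is illustrative and is not taken from the linked repository.

```python
import math


def radius_of_gyration(coords):
    """Root mean squared distance of N equal-mass particles from their centre of mass.

    coords: list of (x, y, z) tuples, one per particle.
    """
    n = len(coords)
    centre = [sum(p[k] for p in coords) / n for k in range(3)]
    squared = sum(
        sum((p[k] - centre[k]) ** 2 for k in range(3)) for p in coords
    )
    return math.sqrt(squared / n)


# Two particles placed symmetrically about the origin on the x axis.
rg_pair = radius_of_gyration([(-1.0, 0.0, 0.0), (1.0, 0.0, 0.0)])
```

A single scalar like this compresses 3N coordinates into one state variable, which is the kind of low-dimensional description the abstract refers to.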