145 research outputs found

    Machine Learning for Kinase Drug Discovery

    Cancer is one of the major public health issues, causing several million deaths every year. Although anti-cancer drugs have been developed and are globally administered, mild to severe side effects are known to occur during treatment. Computer-aided drug discovery has become a cornerstone for unveiling treatments of existing as well as emerging diseases. Computational methods aim not only to speed up the drug design process but also to reduce time-consuming, costly experiments and in vivo animal testing. In this context, over the last decade especially, deep learning began to play a prominent role in the prediction of molecular activity, property and toxicity. However, there are still major challenges when applying deep learning models in drug discovery. Those challenges include data scarcity for physicochemical tasks, the difficulty of interpreting the predictions made by deep neural networks, and the necessity of open-source and robust workflows to ensure reproducibility and reusability. In this thesis, after reviewing the state of the art in deep learning applied to virtual screening, we address the previously mentioned challenges as follows. Regarding data scarcity in the context of deep learning applied to small molecules, we developed data augmentation techniques based on the SMILES encoding. This linear string notation enumerates the atoms present in a compound by following a path along the molecule graph. Multiple SMILES strings for a single compound can be obtained by traversing the graph along different paths. We applied the developed augmentation techniques to three different deep learning models, including convolutional and recurrent neural networks, and to four property and activity data sets. The results show that augmentation improves the model accuracy independently of the deep learning model, as well as of the data set size. Moreover, we computed the uncertainty of a model by using augmentation at inference time.
In this regard, we have shown that the more confident the model is in its prediction, the smaller the error, implying that a given prediction can be trusted and is close to the target value. The software and associated documentation, which allow predictions to be made for novel compounds, have been made freely available. Trusting predictions blindly from algorithms may have serious consequences in areas of healthcare. In this context, better understanding how a neural network classifies a compound based on its input features is highly beneficial, as it helps to de-risk and optimize compounds. In this research project, we decomposed the inner layers of a deep neural network to identify the toxic substructures, the toxicophores, of a compound that led to the toxicity classification. Using molecular fingerprints (vectors that indicate the presence or absence of a particular atomic environment), we were able to map a toxicity score to each of these substructures. Moreover, we developed a method to visualize in 2D the toxicophores within a compound, the so-called cytotoxicity maps, which could be of great use to medicinal chemists in identifying ways to modify molecules to eliminate toxicity. Not only does the deep learning model reach state-of-the-art results, but the identified toxicophores also confirm known toxic substructures and suggest new potential candidates. In order to speed up the drug discovery process, access to robust and modular workflows is extremely advantageous. In this context, the fully open-source TeachOpenCADD project was developed. Significant tasks in both cheminformatics and bioinformatics are implemented in a pedagogical fashion, allowing the material to be used for teaching as well as the starting point for novel research. In this framework, a special pipeline is dedicated to kinases, a family of proteins which are known to be involved in diseases such as cancer. The aim is to gain insights into off-targets, i.e. proteins that are unintentionally affected by a compound and that can cause adverse effects in treatments. Four measures of kinase similarity are implemented, taking into account sequence and structural information as well as protein-ligand interaction and ligand profiling data. The workflow provides clustering of a set of kinases, which can be further analyzed to understand off-target effects of inhibitors. Results show that analyzing kinases from several perspectives is crucial for gaining insight into off-target prediction and a global perspective of the kinome. These novel methods can be exploited in the discovery of new drugs, more specifically for diseases that involve the dysregulation of kinases, such as cancer.
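
The SMILES augmentation idea in this abstract (one compound, several strings, depending on the traversal path) can be illustrated with a minimal sketch. It is not the thesis software: it assumes a toy molecule with only single bonds and no rings, and the names `to_smiles`, `enumerate_smiles` and `augmented_prediction` are hypothetical. The last function shows one plausible way to summarise predictions over several SMILES variants at inference time, with `predict` standing in for any trained model.

```python
import statistics


def to_smiles(atoms, bonds, start):
    """Write a SMILES string for an acyclic, single-bonded molecule.

    The string is produced by a depth-first traversal that begins at
    atom index `start` (assumption: no rings, no bond orders, no charges).
    """
    neighbours = {i: [] for i in range(len(atoms))}
    for a, b in bonds:
        neighbours[a].append(b)
        neighbours[b].append(a)

    def visit(atom, parent):
        children = [n for n in neighbours[atom] if n != parent]
        text = atoms[atom]
        # Every child except the last is written as a parenthesised branch.
        for child in children[:-1]:
            text += "(" + visit(child, atom) + ")"
        if children:
            text += visit(children[-1], atom)
        return text

    return visit(start, None)


def enumerate_smiles(atoms, bonds):
    """Collect the distinct strings obtained by starting at each atom."""
    return sorted({to_smiles(atoms, bonds, s) for s in range(len(atoms))})


def augmented_prediction(smiles_variants, predict):
    """Mean and spread of a model's predictions over SMILES variants.

    `predict` is a hypothetical callable mapping a SMILES string to a number.
    """
    predictions = [predict(s) for s in smiles_variants]
    return statistics.mean(predictions), statistics.pstdev(predictions)
```

For ethanol, given as `atoms = ["C", "C", "O"]` and `bonds = [(0, 1), (1, 2)]`, `enumerate_smiles` returns `["C(C)O", "CCO", "OCC"]`: three strings for the same compound. A small spread from `augmented_prediction` would correspond to the "confident" case described in the abstract.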

    Multivariate Models and Algorithms for Systems Biology

    Rapid advances in high-throughput data acquisition technologies, such as microarrays and next-generation sequencing, have enabled scientists to interrogate the expression levels of tens of thousands of genes simultaneously. However, challenges remain in developing effective computational methods for analyzing data generated from such platforms. In this dissertation, we address some of these challenges. We divide our work into two parts. In the first part, we present a suite of multivariate approaches for the reliable discovery of gene clusters, often interpreted as pathway components, from molecular profiling data with replicated measurements. We translate our goal into learning an optimal correlation structure from replicated complete and incomplete measurements. In the second part, we focus on the reconstruction of signal transduction mechanisms in the signaling pathway components. We propose gene-set-based approaches for inferring the structure of a signaling pathway. First, we present a constrained multivariate Gaussian model, referred to as the informed-case model, for estimating the correlation structure from replicated and complete molecular profiling data. The informed-case model generalizes the previously known blind-case model by accommodating prior knowledge of replication mechanisms. Second, we generalize the blind-case model by designing a two-component mixture model. Our idea is to strike an optimal balance between a fully constrained correlation structure and an unconstrained one. Third, we develop an Expectation-Maximization algorithm to infer the underlying correlation structure from replicated molecular profiling data with missing (incomplete) measurements. We utilize our correlation estimators for clustering real-world replicated complete and incomplete molecular profiling data sets. The above three components constitute the first part of the dissertation.
For the structural inference of signaling pathways, we hypothesize a directed signal pathway structure as an ensemble of overlapping and linear signal transduction events. We then propose two algorithms to reverse engineer the underlying signaling pathway structure using unordered gene sets corresponding to signal transduction events. Throughout, we treat gene sets as variables and the associated gene orderings as random. The first algorithm has been developed under the Gibbs sampling framework and the second algorithm utilizes the framework of simulated annealing. Finally, we summarize our findings and discuss possible future directions.
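
The abstract does not give the objective function or the move set of its simulated annealing algorithm, so the following is only a generic sketch of simulated annealing over gene orderings. Everything specific in it is an assumption made for illustration: the swap move, the geometric cooling schedule, and the hypothetical cost `violated`, which counts upstream/downstream pairs that an ordering gets the wrong way round.

```python
import math
import random


def violated(ordering, constraints):
    """Hypothetical cost: number of (upstream, downstream) pairs out of order."""
    position = {gene: i for i, gene in enumerate(ordering)}
    return sum(1 for up, down in constraints if position[up] > position[down])


def anneal_ordering(items, cost, steps=2000, t_start=1.0, t_end=0.01, seed=0):
    """Search for a low-cost ordering of `items` (at least two) by annealing."""
    rng = random.Random(seed)
    current = list(items)
    current_cost = cost(current)
    best, best_cost = current[:], current_cost
    for step in range(steps):
        # Geometric cooling from t_start down to t_end.
        temperature = t_start * (t_end / t_start) ** (step / max(steps - 1, 1))
        # Propose a neighbouring ordering by swapping two positions.
        i, j = rng.sample(range(len(current)), 2)
        candidate = current[:]
        candidate[i], candidate[j] = candidate[j], candidate[i]
        delta = cost(candidate) - current_cost
        # Metropolis rule: keep improvements, occasionally keep worse moves.
        if delta <= 0 or rng.random() < math.exp(-delta / temperature):
            current, current_cost = candidate, current_cost + delta
            if current_cost < best_cost:
                best, best_cost = current[:], current_cost
    return best, best_cost
```

A call such as `anneal_ordering(["E", "D", "C", "B", "A"], lambda o: violated(o, [("A", "B"), ("B", "C")]))` returns the best ordering visited and its cost. Tracking the best state separately means the result is never worse than the starting ordering.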

    Change detection and landscape similarity comparison using computer vision methods

    Human-induced disturbances of terrestrial and aquatic ecosystems continue at alarming rates. With the advent of both raw sensor and analysis-ready datasets, the need to monitor ecosystem disturbances is now more imperative than ever; yet the task is becoming increasingly complex with increasing sources and varieties of earth observation data. In this research, computer vision methods and tools are interrogated to understand their capability for comparing spatial patterns. A critical survey of the literature provides evidence that computer vision methods are relatively robust to scale and highlights issues involved in the parameterization of computer vision models for characterizing significant pattern information in a geographic context. Utilizing two widely used pattern indices to compare spatial patterns in simulated and real-world datasets revealed their potential to detect subtle changes in spatial patterns, which would not otherwise be feasible using traditional pixel-level techniques. A texture-based CNN model was developed to extract spatially relevant information for landscape similarity comparison; the CNN feature maps proved to be effective in distinguishing agricultural landscapes from other landscape types (e.g., forest and mountainous landscapes). For real-world human disturbance monitoring, a U-Net CNN was developed and compared with a random forest (RF) model. Both modeling frameworks exhibit promising potential to map placer mining disturbance; however, random forests proved simple to train and deploy for placer mapping, while the U-Net may be used to augment RF, as it is capable of reducing misclassification errors and will benefit from the increasing availability of detailed training data.

    CLADAG 2021 BOOK OF ABSTRACTS AND SHORT PAPERS

    The book collects the short papers presented at the 13th Scientific Meeting of the Classification and Data Analysis Group (CLADAG) of the Italian Statistical Society (SIS). The meeting has been organized by the Department of Statistics, Computer Science and Applications of the University of Florence, under the auspices of the Italian Statistical Society and the International Federation of Classification Societies (IFCS). CLADAG is a member of the IFCS, a federation of national, regional, and linguistically-based classification societies. It is a non-profit, non-political scientific organization, whose aims are to further classification research

    EVALUATING THE USE OF UNMANNED AERIAL SYSTEMS (UAS) FOR COLLECTING THEMATIC MAPPING ACCURACY ASSESSMENT REFERENCE DATA IN NEW ENGLAND FOREST COMMUNITIES

    To overcome the main drivers of global environmental change, such as land use and land cover change, evolving technologies must be adopted to rapidly and accurately capture, process, analyze, and display a multitude of high-resolution spatial variables. Remote sensing technologies continue to advance at an ever-increasing rate to meet end-user needs, now in the form of unmanned aerial systems (UAS or drones). UAS have bridged the gap left by satellite imagery, aerial photography, and even ground measurements in data collection potential for all manner of information. This new platform has already been deployed in many data collection scenarios, being modified to the needs of the end user. With modern remote sensing optics and computer technologies, thematic mapping of complex communities presents a wide variety of classification methods, including both pixel-based and object-based classifiers. One essential component of using the derived thematic data as decision-making information is first validating its accuracy. The process of assessing thematic accuracy has come a long way over the years, with site-specific multivariate analysis error matrices now being the premier evaluation mechanism. In order to perform any evaluation of certainty, or correctness, a comparison to a known standard must be made, this being reference data. Methods for reference data collection in both pixel-based and object-based classification assessments are indeterminate, but can all become quite limiting due to their immense costs. This research project set out to evaluate whether the new, low-cost UAS platform could collect reference data for use in thematic mapping accuracy assessments. We also evaluated several collection process methods for their efficiency and effectiveness, as the ability of UAS to acquire data in densely vegetated landscapes is still relatively unknown.
Collected imagery was calibrated and stitched together by way of structure-from-motion (SfM), attempting calibration and configuration in both Agisoft PhotoScan and Pix4DMapper Pro to form orthomosaic models. Our results showed that flying heights below 100 m above the focus area surface, while acquiring ultra-high-detail imagery, only resulted in a maximum of 62% image calibration when generating spatial models. Flying at our legal maximum flying height of 120 m above the surface (just below 400 ft), we averaged 97.49% image calibration and a ground sampling distance (GSD) of 3.23 cm/pixel over the 398 ha sampled. Using a classification scheme based on judging the percent coniferous composition of the sampled units, our results during optimal UAS sampling showed a maximum overall accuracy of 71.43% for pixel-based and 85.71% for object-based thematic accuracy assessments, in direct comparison to ground-sampled locations. Other randomly sampled procedures for each approach achieved slightly less agreement with ground data classifications. Despite the minor drawbacks brought about by the complexity of the environment, the classification results demonstrated that object-based image analysis (OBIA) acquired exceptional accuracy in reference data collection. Future expansion of the project across more study areas and larger forest landscapes could uncover increased agreement and efficiency of the UAS platform.
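
The overall accuracy figures quoted in this abstract come from error matrices. As a rough illustration of how such a matrix is summarised, the sketch below computes overall, user's and producer's accuracy. It assumes the common convention that rows hold the mapped (classified) classes and columns hold the reference classes, and that every class appears at least once in both. The function name `accuracy_summary` and the counts in the usage note are hypothetical and are not the study's data.

```python
def accuracy_summary(matrix):
    """Summarise a square error matrix.

    matrix[i][j] is the number of sample units mapped as class i whose
    reference class is j (assumed convention: rows = map, columns = reference).
    Returns (overall accuracy, user's accuracies, producer's accuracies).
    """
    size = len(matrix)
    total = sum(sum(row) for row in matrix)
    correct = sum(matrix[i][i] for i in range(size))
    overall = correct / total
    # User's accuracy: correct units divided by the row (map) total.
    users = [matrix[i][i] / sum(matrix[i]) for i in range(size)]
    # Producer's accuracy: correct units divided by the column (reference) total.
    producers = [
        matrix[i][i] / sum(matrix[r][i] for r in range(size)) for i in range(size)
    ]
    return overall, users, producers
```

With the made-up matrix `[[15, 3, 2], [2, 12, 1], [3, 5, 7]]`, 34 of the 50 sample units lie on the diagonal, so the overall accuracy is 0.68.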

    Utvikling av ikke-invasiv overvåking av rovdyr ved hjelp av hierarkiske modeller (Development of non-invasive carnivore monitoring using hierarchical models)

    The development of non-invasive approaches for monitoring wildlife populations has made it feasible to obtain ecological parameters across landscapes and populations, rather than at a few locations or for a few individuals. The two most popular and widespread non-invasive monitoring methods are camera trapping and genetic sampling. The technical development associated with data collection has been impressive, whilst analytical capabilities have lagged behind. Only recently have we come close to exploiting the potential of non-invasively obtained data. The objective of my thesis is to apply modern hierarchical analytical models to several sets of carnivore monitoring data to address a series of conceptually and methodologically connected problems faced by applied ecologists. The thesis consists of four articles. Two of these include simulations, and all four articles involve model fitting and case studies. The latter target a range of species including wolverine and mesocarnivores in Scandinavia and the Himalayan brown bear. Article I quantifies detectability of mesocarnivores by camera traps and sheds light on the behavioural responses of focal species to detection devices and to olfactory lures as an important aspect of detectability. Article II incorporates multiple data sources with varying levels of information in a data-sparse situation and introduces a multiple observation process model in the spatial capture-recapture framework to estimate population parameters. This model is applied to multi-method monitoring data of a Himalayan brown bear population in Pakistan. Article III focuses on heterogeneity in the environment and uncovers sex-specific patterns in wolverine home range size across the species’ range in Norway using solely non-invasively collected DNA data and spatial capture-recapture models.
Article IV presents and evaluates an extension of the open-population spatial capture-recapture model to improve inferences on population parameters and showcases its application to wolverine data in central Norway. Hierarchical modelling offers ecologists an intuitive multi-level approach to disentangle observation and ecological processes. All chapters of this thesis include hierarchical models that account for imperfect detection. Depending on the research question, I use these models to estimate time-to-detection of species, population abundance and density, survival, variation in home range size and inter-annual movement. The monitoring methods used in this thesis are often applied to studies of rare or elusive species, and data sparsity is another important challenge addressed here. The Bayesian inference Using Gibbs Sampling (BUGS) language facilitates the construction of flexible models that make the incorporation of multiple types of data into one comprehensive analysis comparatively straightforward. The articles included in this thesis showcase how hierarchical models help us use non-invasively collected data to answer a range of questions in applied ecology. Tackling the associated challenges increases our ability to draw inferences that more closely describe the complexity of real-world ecological systems.
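
To make "imperfect detection" in spatial capture-recapture concrete, here is a small sketch of the observation layer only. It assumes a half-normal detection function, a common choice in spatial capture-recapture that this abstract does not itself specify, and independent detections across detectors and occasions. The function names and the parameters `p0` (baseline detection probability) and `sigma` (spatial scale) are illustrative assumptions, not the thesis models.

```python
import math


def detection_probability(activity_centre, detector, p0, sigma):
    """Half-normal detection: p0 at distance zero, decaying with distance."""
    dx = activity_centre[0] - detector[0]
    dy = activity_centre[1] - detector[1]
    squared_distance = dx * dx + dy * dy
    return p0 * math.exp(-squared_distance / (2.0 * sigma ** 2))


def detected_at_least_once(activity_centre, detectors, p0, sigma, occasions):
    """Probability that an individual is detected at least once.

    Assumes detections are independent across detectors and occasions.
    """
    p_never = 1.0
    for detector in detectors:
        p = detection_probability(activity_centre, detector, p0, sigma)
        p_never *= (1.0 - p) ** occasions
    return 1.0 - p_never
```

An individual whose activity centre lies far from every detector has a low probability of ever being recorded, which is why raw counts of detected animals underestimate abundance and a hierarchical model separates the observation process from the ecological one.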