Opportunities and obstacles for deep learning in biology and medicine
Deep learning describes a class of machine learning algorithms that are capable of combining raw inputs into layers of intermediate features. These algorithms have recently shown impressive results across a variety of domains. Biology and medicine are data-rich disciplines, but the data are complex and often ill-understood. Hence, deep learning techniques may be particularly well suited to problems in these fields. We examine applications of deep learning to a variety of biomedical problems (patient classification, fundamental biological processes, and treatment of patients) and discuss whether deep learning will be able to transform these tasks or whether the biomedical sphere poses unique challenges. Following an extensive literature review, we find that deep learning has yet to revolutionize biomedicine or definitively resolve any of the most pressing challenges in the field, but promising advances have been made on the prior state of the art. Even though improvements over previous baselines have generally been modest, the recent progress indicates that deep learning methods will provide valuable means for speeding up or aiding human investigation. Though progress has been made in linking a specific neural network's prediction to input features, understanding how users should interpret these models to make testable hypotheses about the system under study remains an open challenge. Furthermore, the limited amount of labelled data for training presents problems in some domains, as do legal and privacy constraints on work with sensitive health records. Nonetheless, we foresee deep learning enabling changes at both bench and bedside, with the potential to transform several areas of biology and medicine.
NOVEL APPLICATIONS OF MACHINE LEARNING IN BIOINFORMATICS
Technological advances in next-generation sequencing and biomedical imaging have led to a rapid increase in biomedical data dimensionality and acquisition rate, which is challenging conventional data analysis strategies. Modern machine learning techniques promise to leverage large data sets for finding hidden patterns within them and for making accurate predictions. This dissertation aims to design novel machine learning-based models to transform biomedical big data into valuable biological insights. The research presented in this dissertation focuses on three bioinformatics domains: splice junction classification, gene regulatory network reconstruction, and lesion detection in mammograms.
A critical step in defining gene structures and mRNA transcript variants is to accurately identify splice junctions. In the first work, we built the first deep learning-based splice junction classifier, DeepSplice. It outperforms state-of-the-art classification tools in terms of both classification accuracy and computational efficiency. In the second work, to uncover transcription factors governing metabolic reprogramming in non-small-cell lung cancer patients, we developed TFmeta, a machine learning approach that reconstructs relationships between transcription factors and their target genes. Our approach achieves the best performance on benchmark data sets. In the third work, we designed deep learning-based architectures to perform lesion detection in both 2D and 3D whole mammogram images.
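The abstract does not describe DeepSplice's input representation, but splice-junction classifiers of this kind typically begin by one-hot encoding the DNA flanking a candidate junction before feeding it to a network. A minimal sketch of that encoding step (window size and example sequence are hypothetical, not taken from the dissertation):

```python
# One-hot encode DNA flanking a candidate splice junction, the usual
# first step for a deep learning splice-junction classifier.
# The window and example below are illustrative assumptions.

def one_hot_encode(seq):
    """Map a DNA string to a list of 4-element one-hot vectors (A,C,G,T)."""
    table = {"A": [1, 0, 0, 0], "C": [0, 1, 0, 0],
             "G": [0, 0, 1, 0], "T": [0, 0, 0, 1]}
    # Unknown bases (e.g. N) become an all-zero vector.
    return [table.get(base, [0, 0, 0, 0]) for base in seq.upper()]

# A canonical donor site has GT on the intron side of the boundary.
donor_flank = "CAGGTAAGT"  # hypothetical exon|intron window
encoded = one_hot_encode(donor_flank)
assert len(encoded) == len(donor_flank)
assert encoded[3] == [0, 0, 1, 0]  # the G of the GT dinucleotide
```

The resulting matrix (sequence length x 4) is what a convolutional classifier would consume.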
Enabling high-throughput image analysis with deep learning-based tools
Microscopes are a valuable tool in biological research, facilitating information gathering at different magnification scales and across samples and markers in single-cell and whole-population studies. However, image acquisition and analysis are very time-consuming, so efficient solutions are needed to achieve the speed-up required for high-throughput microscopy.
Throughout the work presented in this thesis, I developed new computational methods and software packages to facilitate high-throughput microscopy. My work comprised not only the development of these methods themselves but also their integration into the workflow of the lab, starting from automating the microscopy acquisition to deploying scalable analysis services and providing user-friendly local user interfaces.
The main focus of my thesis was YeastMate, a tool for automatic detection and segmentation of yeast cells and sub-type classification of their life-cycle transitions. Development of YeastMate was mainly driven by research on quality control mechanisms of the mitochondrial genome in S. cerevisiae, where yeast cells are imaged during their sexual and asexual reproduction life-cycle stages. YeastMate can automatically detect both single cells and life-cycle transitions, perform segmentation and enable pedigree analysis by determining origin and offspring cells. I developed a novel adaptation of the Mask R-CNN object detection model to integrate the classification of inter-cell connections into the usual detection and segmentation analysis pipelines.
Another part of my work focused on automating microscopes themselves, using deep learning models to detect wings of D. melanogaster. A microscope was programmed to acquire large overview images and then to acquire detailed images at higher magnification at the detected coordinates of each wing. This workflow replaced the manual imaging of slides, which usually takes hours, with a fully automated, end-to-end solution.
EGFR and KRAS mutation prediction on lung cancer through medical image processing and artificial intelligence
Lung cancer causes more deaths globally than any other type of cancer. To determine the best treatment, detecting EGFR and KRAS mutations is of interest. However, non-invasive ways to obtain this information are not available. In this study, an ensemble approach is applied to increase the performance of EGFR and KRAS mutation prediction from CT images using a small dataset. A new voting scheme, Selective Class Average Voting (SCAV), is proposed and its performance is assessed for both machine learning models and convolutional neural networks (CNNs). For the EGFR mutation, the machine learning approach saw sensitivity increase from 0.66 to 0.75 and AUC from 0.68 to 0.70. With the deep learning approach, an AUC of 0.846 was obtained with custom CNNs, and with SCAV the accuracy of the model increased from 0.80 to 0.857. Finally, when combining the best custom and pre-trained CNNs using SCAV, an AUC of 0.914 was obtained. For the KRAS mutation, a significant increase in performance was found for both the machine learning models (0.65 to 0.71 AUC) and the deep learning models (0.739 to 0.778 AUC). This increase was even greater with ensembles of pre-trained CNNs (0.809 AUC). The results obtained in this work show how to learn effectively from small image datasets to predict EGFR and KRAS mutations, and that using ensembles with SCAV increases the performance of machine learning classifiers and CNNs.
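The abstract does not spell out how SCAV selects its voters, but it builds on class-average (soft) voting, where per-class probabilities are averaged across ensemble members. A sketch of that baseline (the "selective" step is omitted; here every model votes, which is an assumption):

```python
# Class-average (soft) voting over an ensemble of classifiers:
# average each class's probability across models, then take the argmax.
# SCAV's selection of which models vote per class is not specified in
# the abstract and is not implemented here.

def soft_vote(probabilities):
    """probabilities: one list per model, one probability per class.

    Returns the index of the winning class.
    """
    n_models = len(probabilities)
    n_classes = len(probabilities[0])
    avg = [sum(p[c] for p in probabilities) / n_models
           for c in range(n_classes)]
    return max(range(n_classes), key=avg.__getitem__)

# Three hypothetical models scoring (wild-type, EGFR-mutant):
probs = [[0.6, 0.4], [0.3, 0.7], [0.2, 0.8]]
assert soft_vote(probs) == 1  # the ensemble favours the mutant class
```

Soft voting often beats hard (majority) voting because confident models contribute more than uncertain ones.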
Modelling the structural, functional and phenotypic consequences of protein coding mutations
Proteins are integral to all cellular processes and underpin the function of all extant organisms, meaning variants impacting them are a primary cause of phenotypic variation. Protein coding variants are a key area of study in biology, with relevance from structural and molecular biology to population genetics. They are also medically important, impacting inherited genetic diseases, cancer and response to pathogens. Recent advances in high-throughput experimental techniques have opened the door to many new approaches in biology, and protein variants are no exception. Deep mutational scanning experiments exhaustively measure the fitness of variants in a protein, which gives us more experimentally validated mutational consequence measurements than ever before. Such advances, together with ever larger sequence and structure databases, have created an opportunity to apply large scale analyses to coding variation, studying the effect on protein structure, function and phenotype.
In this thesis I perform three large-scale variant analyses. First, I use the consequences of variation to learn about protein structure and function. I compile a dataset from 28 deep mutational scanning studies, covering 6291 positions in 30 proteins, and use the consequences of mutation at each position to define a mutational landscape. I show rich biophysical relationships in this landscape and identify functionally distinct positional subtypes of each amino acid. In the second analysis, I explore genotype to phenotype prediction using a dataset of 1011 S. cerevisiae strains, with genotypes, transcriptomics, proteomics and measured phenotypes, and comprehensive gene deletions in four strains. I show knowledge-based models of mutational consequences and pathway function can be used to associate genes with phenotypes and predict growth phenotypes across 34 growth conditions. However, genetic background is found to have a large effect on variant consequences, to such an extent that the same deletion can be highly significant in one strain and have no effect in another. Finally, I analyse computational variant effect prediction, benchmarking current predictors using deep mutational scanning data. I then develop a new end-to-end deep convolutional neural network predictor that predicts consequences directly from sequence and structure, and show it improves on current methods. Together these projects advance our knowledge of protein coding variation and enhance our capacity to link variation to impacts on structure, function and phenotype.
Advancing Personalized Medicine Through the Application of Whole Exome Sequencing and Big Data Analytics
There is growing attention toward personalized medicine, led by a fundamental shift from the ‘one size fits all’ paradigm for treating patients with conditions or predispositions to disease, to one that embraces novel approaches, such as tailored targeted therapies, to achieve the best possible outcomes. Driven by this shift, several national and international genome projects have been initiated to reap the benefits of personalized medicine. Exome and targeted sequencing provide a balance between cost and benefit, in contrast to whole genome sequencing (WGS). Whole exome sequencing (WES) targets approximately 3% of the whole genome, the portion that encodes protein-coding genes. Nonetheless, it takes on the characteristics of big data in large deployments. Herein, the application of WES and its relevance in advancing personalized medicine is reviewed. WES is mapped to the Big Data “10 Vs” and the resulting challenges discussed. Applications of existing biological databases and bioinformatics tools to address the bottleneck in data processing and analysis are presented, including the need for a new generation of big data analytics for the multi-omics challenges of personalized medicine. This includes the incorporation of artificial intelligence (AI) into the clinical utility landscape of genomic information, and future considerations for creating a new frontier toward advancing the field of personalized medicine.
Generalisable Methods for Improving CRISPR Efficiency and Outcome Specificity using Machine Learning Algorithms
CRISPR (clustered regularly interspaced short palindromic repeats) based genome editing has become a popular tool for a range of disciplines, including microbiology, agricultural science, and health. Driving these applications is the ability of the "programmable" system to target a predefined location in the genome. A single guide RNA (sgRNA) defines the target through Watson-Crick base pairing, and a class 2 type II CRISPR associated protein 9 (Cas9) nuclease cleaves the target, resulting in a double-strand break (DSB). This activates DNA repair, and depending on the repair pathway initiated, can result in arbitrary insertions/deletions or a predefined variant.
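The targeting rule described above, a protospacer matched by the sgRNA followed by a protospacer-adjacent motif (NGG for Cas9), can be sketched as a simple sequence scan. This is an illustrative simplification: it scans only the forward strand and ignores mismatches, and the 20-nt spacer length is the common default rather than a universal constant.

```python
import re

# Find candidate Cas9 target sites: a 20-nt protospacer immediately
# followed by an NGG PAM on the same strand. Forward strand only,
# exact matching only (a simplification for illustration).

def find_cas9_sites(genome, spacer_len=20):
    """Yield (start, protospacer, pam) for each NGG PAM with room
    for a full-length protospacer upstream of it."""
    # Zero-width lookahead so overlapping PAMs are all reported.
    for m in re.finditer(r"(?=([ACGT]GG))", genome):
        start = m.start() - spacer_len
        if start >= 0:
            yield start, genome[start:m.start()], m.group(1)

# Hypothetical toy sequence: padding + 20-nt protospacer + TGG PAM.
seq = "TTTT" + "ACGTACGTACGTACGTACGA" + "TGG" + "AAAA"
sites = list(find_cas9_sites(seq))
assert any(p == "ACGTACGTACGTACGTACGA" and pam == "TGG"
           for _, p, pam in sites)
```

A real design tool would also scan the reverse complement and score near-matches to flag off-target sites, which is exactly the specificity problem the thesis addresses.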
Despite the versatility and ease of design enabled by this RNA-guided nuclease, it lacks specificity (off-target effects occur) and efficiency (successful editing outcomes are not guaranteed). The overarching aim of my thesis is to address these disadvantages of CRISPR systems by using machine learning to train generalisable models on existing and novel datasets.
One pathway that demonstrates the need for prediction models is homology directed repair (HDR). HDR enables researchers to induce nearly any editing outcome; however, it is inefficient, and with an incomplete knowledge of its kinetics, no models existed for predicting its efficiency. I generated a novel dataset representing the efficiency of HDR. Using the Random Forests algorithm, I identified the sgRNA and the 3' region of the template as modulators of HDR efficiency. This novel finding relates to the kinetics of template interaction during HDR.
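A model like the one above needs numeric features extracted from each sgRNA/template design before a Random Forest can be trained. The thesis's actual feature set is not given in the abstract; the sketch below uses generic sequence features (GC content, lengths) as stand-ins, with a 3'-region feature included because the abstract highlights the template's 3' region as a modulator:

```python
# Illustrative feature extraction for an HDR-efficiency model.
# Feature choices here are assumptions, not the thesis's feature set.

def sequence_features(sgrna, template):
    """Return a small numeric feature dict for one HDR design."""
    def gc(seq):
        return (seq.count("G") + seq.count("C")) / len(seq)
    return {
        "sgrna_gc": gc(sgrna),
        "template_len": len(template),
        # GC content of the template's 3' end (last 20 nt), since the
        # 3' region is reported to modulate HDR efficiency:
        "template_3prime_gc": gc(template[-20:]),
    }

feats = sequence_features("GACGTACGTACGTACGTACG",
                          "A" * 40 + "GCGCGCGCGCGCGCGCGCGC")
assert feats["template_3prime_gc"] == 1.0
assert feats["template_len"] == 60
```

Vectors like these, paired with measured editing rates, are what a Random Forest regressor would be fitted on.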
Even with efficient gene editing, a potential problem is unwanted side effects, such as embryonic lethality. This can be addressed by using CRISPR to create conditional knockout alleles, controlling when and where knockouts occur. To investigate the efficiency of this process, I used statistical analyses and the Random Forest algorithm to analyse a dataset generated by a consortium of 19 laboratories. I found that the method's inherent inefficiency is defined by the requirement for two simultaneous HDR events. Other experimental variables, such as reagent concentrations or technician skill level, had no significant influence on efficiency. Because of the unrivalled versatility of this method, I created a statistical model for forecasting the efficiency of the technique from a low number of attempts, aiming to overcome its inherent inefficiency.
While Cas9 is the most cited CRISPR system, alternative CRISPR systems can further expand the gene editing repertoire. To support uptake of the more recent Cas12a, I performed a comprehensive comparison between the two nucleases and found evidence that Cas12a has superior specificity. Despite this, editing outcome and efficiency prediction tools for Cas12a were scarce. To address this, I trained a Cas12a cleavage efficiency prediction model on representative data. It outperformed the previous best model despite being trained on a dataset 300x smaller, demonstrating the importance of clean data.
Altogether, this thesis improves our knowledge of different CRISPR gene editing techniques. These findings can help researchers design efficient experiments and provide guidance where certain techniques may be inherently inefficient. In addition to producing CUNE (Computational Universal Nucleotide Editor) and Cas12aRF, the thesis demonstrates the generalisability of prediction models, given the strong influence of sgRNA and repair template design on efficiency.
The Pursuit of Hoppiness : Propelling Hop into the Genomic Era
Hop (Humulus lupulus L. var lupulus) is a plant of great cultural significance, used as a medicinal herb for thousands of years and valued in brewing beer for its flavor and preservative qualities. Studies of the medicinal effects of the unique compounds produced by hop have drawn interest from the pharmacy and healthcare industries. Although many industries have an interest in the plant and scientists are interested in the evolution of the Cannabaceae sex chromosomes, little effort has gone into developing genomic resources. H. lupulus has a highly heterozygous, repeat-rich genome of about 2.8 gigabases, a combination that presents an immense challenge to studying the genomics of hop. Here we present a web portal for studying hop genomics, a novel hop genome assembly, gene annotations for both draft genome assemblies, an evolutionary biology study of the hop sex chromosomes, and a novel way of modeling transcripts using deep learning. Together, these manuscripts provide a framework for the future of hop genomics.