106 research outputs found

    Opportunities and obstacles for deep learning in biology and medicine

    Get PDF
    Deep learning describes a class of machine learning algorithms that are capable of combining raw inputs into layers of intermediate features. These algorithms have recently shown impressive results across a variety of domains. Biology and medicine are data-rich disciplines, but the data are complex and often ill-understood. Hence, deep learning techniques may be particularly well suited to solve problems of these fields. We examine applications of deep learning to a variety of biomedical problems-patient classification, fundamental biological processes and treatment of patients-and discuss whether deep learning will be able to transform these tasks or if the biomedical sphere poses unique challenges. Following from an extensive literature review, we find that deep learning has yet to revolutionize biomedicine or definitively resolve any of the most pressing challenges in the field, but promising advances have been made on the prior state of the art. Even though improvements over previous baselines have been modest in general, the recent progress indicates that deep learning methods will provide valuable means for speeding up or aiding human investigation. Though progress has been made linking a specific neural network\u27s prediction to input features, understanding how users should interpret these models to make testable hypotheses about the system under study remains an open challenge. Furthermore, the limited amount of labelled data for training presents problems in some domains, as do legal and privacy constraints on work with sensitive health records. Nonetheless, we foresee deep learning enabling changes at both bench and bedside with the potential to transform several areas of biology and medicine

    NOVEL APPLICATIONS OF MACHINE LEARNING IN BIOINFORMATICS

    Get PDF
    Technological advances in next-generation sequencing and biomedical imaging have led to a rapid increase in biomedical data dimension and acquisition rate, which is challenging the conventional data analysis strategies. Modern machine learning techniques promise to leverage large data sets for finding hidden patterns within them, and for making accurate predictions. This dissertation aims to design novel machine learning-based models to transform biomedical big data into valuable biological insights. The research presented in this dissertation focuses on three bioinformatics domains: splice junction classification, gene regulatory network reconstruction, and lesion detection in mammograms. A critical step in defining gene structures and mRNA transcript variants is to accurately identify splice junctions. In the first work, we built the first deep learning-based splice junction classifier, DeepSplice. It outperforms the state-of-the-art classification tools in terms of both classification accuracy and computational efficiency. To uncover transcription factors governing metabolic reprogramming in non-small-cell lung cancer patients, we developed TFmeta, a machine learning approach to reconstruct relationships between transcription factors and their target genes in the second work. Our approach achieves the best performance on benchmark data sets. In the third work, we designed deep learning-based architectures to perform lesion detection in both 2D and 3D whole mammogram images

    Enabling high-throughput image analysis with deep learning-based tools

    Get PDF
    Microscopes are a valuable tool in biological research, facilitating information gathering with different magnification scales, samples and markers in single-cell and whole-population studies. However, image acquisition and analysis are very time-consuming, so efficient solutions are needed for the required speed-up to allow high-throughput microscopy. Throughout the work presented in this thesis, I developed new computational methods and software packages to facilitate high-throughput microscopy. My work comprised not only the development of these methods themselves but also their integration into the workflow of the lab, starting from automating the microscopy acquisition to deploying scalable analysis services and providing user-friendly local user interfaces. The main focus of my thesis was YeastMate, a tool for automatic detection and segmentation of yeast cells and sub-type classification of their life-cycle transitions. Development of YeastMate was mainly driven by research on quality control mechanisms of the mitochondrial genome in S. cerevisiae, where yeast cells are imaged during their sexual and asexual reproduction life-cycle stages. YeastMate can automatically detect both single cells and life-cycle transitions, perform segmentation and enable pedigree analysis by determining origin and offspring cells. I developed a novel adaptation of the Mask R-CNN object detection model to integrate the classification of inter-cell connections into the usual detection and segmentation analysis pipelines. Another part of my work focused on the automation of microscopes themselves using deep learning models to detect wings of D. melanogaster. A microscope was programmed to acquire large overview images and then to acquire detailed images at higher magnification on the detected coordinates of each wing. The implementation of this workflow replaced the process of manually imaging slides, usually taking hours to do so, with a fully automated, end-to-end solution

    EGFR and KRAS mutation prediction on lung cancer through medical image processing and artificial intelligence

    Get PDF
    Lung cancer causes more deaths globally than any other type of cancer. To determine the best treatment, detecting EGFR and KRAS mutations is of interest. However, non-invasive ways to obtain this information are not available. In this study, an ensemble approach is applied to increase the performance of EGFR and KRAS mutation prediction from CT images using a small dataset. A new voting scheme, Selective Class Average Voting (SCAV) is proposed and its performance is assessed both for machine learning models and Convolutional Neural Networks (CNNs). For the EGFR mutation, in the machine learning approach, there was an increase in the Sensitivity from 0.66 to 0.75, and an increase in AUC from 0.68 to 0.70. With the deep learning approach an AUC of 0.846 was obtained with custom CNNs, and with SCAV the Accuracy of the model was increased from 0.80 to 0.857. Finally, when combining the best Custom and Pre-trained CNNs using SCAV an AUC of 0.914 was obtained. For the KRAS mutation both in the machine learning models (0.65 to 0.71 AUC) and the deep learning models (0.739 to 0.778 AUC) a significant increase in performance was found. This increase was even greater with Ensembles of Pre-trained CNNs (0.809 AUC). The results obtained in this work show how to effectively learn from small image datasets to predict EGFR and KRAS mutations, and that using ensembles with SCAV increases the performance of machine learning classifiers and CNNs.DoctoradoDoctor en Ingeniería de Sistemas y Computació

    Advancing Personalized Medicine Through the Application of Whole Exome Sequencing and Big Data Analytics

    Get PDF
    There is a growing attention toward personalized medicine. This is led by a fundamental shift from the ‘one size fits all’ paradigm for treatment of patients with conditions or predisposition to diseases, to one that embraces novel approaches, such as tailored target therapies, to achieve the best possible outcomes. Driven by these, several national and international genome projects have been initiated to reap the benefits of personalized medicine. Exome and targeted sequencing provide a balance between cost and benefit, in contrast to whole genome sequencing (WGS). Whole exome sequencing (WES) targets approximately 3% of the whole genome, which is the basis for protein-coding genes. Nonetheless, it has the characteristics of big data in large deployment. Herein, the application of WES and its relevance in advancing personalized medicine is reviewed. WES is mapped to Big Data “10 Vs” and the resulting challenges discussed. Application of existing biological databases and bioinformatics tools to address the bottleneck in data processing and analysis are presented, including the need for new generation big data analytics for the multi-omics challenges of personalized medicine. This includes the incorporation of artificial intelligence (AI) in the clinical utility landscape of genomic information, and future consideration to create a new frontier toward advancing the field of personalized medicine

    Generalisable Methods for Improving CRISPR Efficiency and Outcome Specificity using Machine Learning Algorithms

    Get PDF
    CRISPR (clustered regularly interspaced short palindromic repeats) based genome editing has become a popular tool for a range of disciplines, including microbiology, agricultural science, and health. Driving these applications is the ability of the "programmable" system to target a predefined location in the genome. A single guide RNA (sgRNA) defines the target through Watson-Crick base pairing, and a class 2 type II CRISPR associated protein 9 (Cas9) nuclease cleaves the target, resulting in a double-strand break (DSB). This activates DNA repair, and depending on the repair pathway initiated, can result in arbitrary insertions/deletions or a predefined variant. Despite the versatility and ease of design enabled by this RNA-guided nuclease, it lacks specificity, regarding off-target effects, and efficiency, regarding the rate of successful editing outcomes. The overarching hypothesis of my thesis is to solve the disadvantages of CRISPR systems by using machine learning to train generalisable models on existing and novel datasets. One pathway that demonstrates the need for prediction models is homology directed repair (HDR). HDR enables researchers to induce nearly any editing outcome, however, it is inefficient. And with an incomplete knowledge of its kinetics, no models existed for predicting its efficiency. I generated a novel dataset representing the efficiency of HDR. Using the Random Forests algorithm, I identified the sgRNA and the 3' region of the template to modulate HDR efficiency. This novel finding relates to the kinetics of template interaction during HDR repair. Even with efficient gene editing, a potential problem is unwanted side effects, such as embryonic lethality. This can be solved by using CRISPR to create conditional knockout alleles, to control when and where knockouts occur. To investigate the efficiency of this process, I used statistical analyses and the Random Forest algorithm to analyse a dataset generated by a consortium of 19 laboratories. I identified the inherent inefficiency of this method as defined by the efficiency of two simultaneous HDR events. Other experimental variables, like reagent concentrations or technician skill level, had no significant influence on efficiency. Because of the unrivalled versatility of this method, I created a statistical model for forecasting the efficiency of this technique from a low number of attempts, aiming to overcome its inherent inefficiency. While Cas9 is the most cited CRISPR system, alternative CRISPR systems can further expand the gene editing repertoire. To support the uptake of the more-recent Cas12a, I performed a comprehensive comparison between the two nucleases. I found support for Cas12a having a superior specificity. Despite this, editing outcome and efficiency prediction tools for Cas12a were scarce. Aiming to address this, I trained a Cas12a cleavage efficiency prediction model on representative data. This outperformed the current top model despite the dataset being 300x smaller, demonstrating the importance of clean data. Altogether, this thesis improves the knowledge of different CRISPR gene editing techniques. These findings can enable researchers to design efficient experiments as well as provide researchers guidance where certain techniques may be inherently inefficient. As well as resulting in CUNE (Computational Universal Nucleotide Editor) and Cas12aRF, it also identifies the generalisability of prediction models due to the high degree of influence on efficiency by the sgRNA and repair template design
    corecore