Deep learning in mining biological data
Recent technological advancements in data acquisition tools have allowed life scientists to acquire multimodal data from different biological application domains. Categorised into three broad types (i.e., images, signals, and sequences), these data are huge in volume and complex in nature. Mining such an enormous amount of data for pattern recognition is a big challenge and requires sophisticated, data-intensive machine learning techniques. Artificial neural network-based learning systems are well known for their pattern recognition capabilities, and lately their deep architectures, known as deep learning (DL), have been successfully applied to solve many complex pattern recognition problems. To investigate how DL, especially its different architectures, has contributed to and been utilised in the mining of biological data pertaining to those three types, a meta-analysis has been performed and the resulting resources have been critically analysed. Focusing on the use of DL to analyse patterns in data from diverse biological domains, this work investigates the applications of different DL architectures to these data. This is followed by an exploration of available open access data sources pertaining to the three data types, along with popular open source DL tools applicable to these data. Comparative investigations of these tools from qualitative, quantitative, and benchmarking perspectives are also provided. Finally, some open research challenges in using DL to mine biological data are outlined and a number of possible future perspectives are put forward.
A novel deep mining model for effective knowledge discovery from omics data
Knowledge discovery from omics data has become a common goal of current approaches to personalised cancer medicine and to understanding cancer genotype and phenotype. However, high-throughput biomedical datasets are characterised by high dimensionality and relatively small sample sizes with small signal-to-noise ratios. Extracting and interpreting relevant knowledge from such complex datasets therefore remains a significant challenge for the fields of machine learning and data mining. In this paper, we exploit recent advances in deep learning to mitigate these limitations by automatically capturing enough of the meaningful abstractions latent within the available biological samples. Our deep feature learning model is based on a set of non-linear sparse auto-encoders that are deliberately constructed in an under-complete manner to detect a small proportion of molecules that can recover a large proportion of the variation underlying the data. However, since multiple projections are applied to the input signals, it is hard to interpret which phenotypes were responsible for deriving such predictions. Therefore, we also introduce a novel weight interpretation technique that helps to deconstruct the internal state of such deep learning models to reveal key determinants underlying their latent representations. The outcomes of our experiment provide strong evidence that the proposed deep mining model is able to discover robust biomarkers that are positively and negatively associated with cancers of interest. Since our deep mining model is problem-independent and data-driven, it provides further potential for this research to extend beyond its cognate disciplines.
Deep learning methods for mining genomic sequence patterns
With the growing availability of large-scale genomic datasets and advanced computational techniques, more and more data-driven computational methods have been developed to analyze genomic data and help solve incompletely understood biological problems. Among them, deep learning methods have been proposed to automatically learn and recognize the functional activity of DNA sequences from genomics data. Techniques for efficiently mining genomic sequence patterns will help to improve our understanding of gene regulation, and thus accelerate our progress toward using personal genomes in medicine.
This dissertation focuses on the development of deep learning methods for mining genomic sequences. First, we compare the performance of deep learning models with that of traditional machine learning methods in recognizing various genomic sequence patterns. Through extensive experiments on both simulated data and real genomic sequence data, we demonstrate that an appropriate deep learning model can generally be built to recognize various genomic sequence patterns successfully. Next, we develop deep learning methods to help solve two specific biological problems: (1) inference of the polyadenylation code and (2) tRNA gene detection and functional prediction. Polyadenylation is a pervasive mechanism used by eukaryotes for regulating mRNA transcription, localization, and translation efficiency. Polyadenylation signals in plants are particularly noisy and challenging to decipher. A deep convolutional neural network approach, DeepPolyA, is proposed to predict poly(A) sites from genomic sequences of the plant Arabidopsis thaliana. It employs various deep neural network architectures and demonstrates its superiority in comparison with competing methods, including classical machine learning algorithms and several popular deep learning models. Transfer RNAs (tRNAs) represent a highly complex class of genes and play a central role in protein translation.
tRNAscan-SE remains the de facto tool for identifying tRNA genes encoded in genomes. Despite its popularity and success, tRNAscan-SE is still not powerful enough to separate tRNAs from pseudo-tRNAs, and a significant number of false positives can be output as a result. To address this issue, tRNA-DL, a hybrid approach combining a convolutional neural network and a recurrent neural network, is proposed. It is shown that the proposed method can substantially reduce the false positive rate of the state-of-the-art tRNA prediction tool tRNAscan-SE. Coupled with tRNAscan-SE, tRNA-DL can serve as a useful complementary tool for tRNA annotation. Taken together, the experiments and applications demonstrate the superiority of deep learning in automatic feature generation for characterizing genomic sequence patterns.
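To make the idea of a convolutional network "recognizing genomic sequence patterns" concrete, the sketch below one-hot encodes a DNA string and slides a single hand-built filter along it. This is only an illustration under stated assumptions: the function names, the example sequence, and the use of the hexamer AATAAA as a toy motif are invented here, and nothing in it reflects the actual DeepPolyA or tRNA-DL architectures. In a trained network the filter weights would be learned and graded rather than exact 0/1 matches, which matters for noisy signals such as those described for plants.

```python
# Minimal sketch: one-hot encoding of DNA plus one convolution-style motif scan.
# All names, the sequence, and the motif are illustrative assumptions.

BASES = "ACGT"


def one_hot(seq):
    """Encode a DNA string as a list of 4-element vectors (order A, C, G, T).

    Characters outside ACGT (for example N) become all-zero vectors.
    """
    return [[1.0 if base == b else 0.0 for b in BASES] for base in seq.upper()]


def motif_filter(motif):
    """Build a filter whose weights are simply the one-hot encoding of the motif."""
    return one_hot(motif)


def conv1d_scan(encoded, kernel):
    """Slide the kernel along the sequence (valid padding, stride 1)."""
    width = len(kernel)
    scores = []
    for start in range(len(encoded) - width + 1):
        score = 0.0
        for offset in range(width):
            for channel in range(4):
                score += encoded[start + offset][channel] * kernel[offset][channel]
        scores.append(score)
    return scores


def best_hit(seq, motif):
    """Return (position, score) of the first highest-scoring window."""
    scores = conv1d_scan(one_hot(seq), motif_filter(motif))
    position = max(range(len(scores)), key=scores.__getitem__)
    return position, scores[position]


if __name__ == "__main__":
    # Toy sequence containing the motif once, starting at index 4.
    print(best_hit("GGCTAATAAAGCTTGCA", "AATAAA"))
```

The score at each position counts how many bases in the window match the motif, so a perfect match scores the motif length. Stacking many such filters and learning their weights from labelled sequences is the general pattern a convolutional layer follows.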
Graph embedding and geometric deep learning relevance to network biology and structural chemistry
Graphs have been used as a model of complex relationships among data in biological science since the advent of systems biology in the early 2000s. In particular, graph data analysis and graph data mining play an important role in biological interaction networks, where recent artificial intelligence techniques, usually employed in other types of networks (e.g., social, citation, and trademark networks), aim to implement various data mining tasks including classification, clustering, recommendation, anomaly detection, and link prediction. The commitment and efforts of artificial intelligence research in network biology are motivated by the fact that machine learning techniques are often prohibitively computationally demanding, poorly parallelizable, and ultimately inapplicable, since a biological network of realistic size is a large system characterised by a high density of interactions, often with non-linear dynamics and a non-Euclidean latent geometry. Currently, graph embedding is emerging as the new learning paradigm that shifts the task of building complex models for classification, clustering, and link prediction to learning an informative representation of the graph data in a vector space, so that many graph mining and learning tasks can be performed more easily by employing efficient non-iterative traditional models (e.g., a linear support vector machine for the classification task). The great potential of graph embedding is the main reason for the flourishing of studies in this area and, in particular, of artificial intelligence learning techniques. In this mini review, we give a comprehensive summary of the main graph embedding algorithms in light of the recent burgeoning interest in geometric deep learning.
Computationally Linking Chemical Exposure to Molecular Effects with Complex Data: Comparing Methods to Disentangle Chemical Drivers in Environmental Mixtures and Knowledge-based Deep Learning for Predictions in Environmental Toxicology
Chemical exposures affect the environment and may lead to adverse outcomes in its organisms. Omics-based approaches, like standardised microarray experiments, have expanded the toolbox to monitor the distribution of chemicals and assess the risk to organisms in the environment. The resulting complex data have extended the scope of toxicological knowledge bases and published literature. A plethora of computational approaches have been applied in environmental toxicology considering systems biology and data integration. Still, the complexity of environmental and biological systems captured in such data challenges investigations of exposure-related effects. This thesis aimed at computationally linking chemical exposure to biological effects on the molecular level considering sources of complex environmental data.
The first study employed data from an omics-based exposure study considering mixture effects in a freshwater environment. We compared three data-driven analyses in their suitability to disentangle mixture effects of chemical exposures on biological effects and their reliability in attributing potentially adverse outcomes to chemical drivers with toxicological databases on gene and pathway levels. Differential gene expression analysis and a network inference approach resulted in toxicologically meaningful outcomes and uncovered individual chemical effects, both stand-alone and in combination. We developed an integrative computational strategy to harvest exposure-related gene associations from environmental samples considering mixtures of low-concentration compounds. The applied approaches allowed assessing the hazard of chemicals more systematically with correlation-based compound groups.
This dissertation presents another achievement toward data-driven hypothesis generation for molecular exposure effects. The approach combined text mining and deep learning. The study was entirely data-driven and involved state-of-the-art computational methods of artificial intelligence. We employed literature-based relational data and curated toxicological knowledge to predict chemical-biomolecule interactions. A word embedding neural network with a subsequent feed-forward network was implemented. Data augmentation and recurrent neural networks were beneficial for training with curated toxicological knowledge. The trained models reached accuracies of up to 94% for unseen test data of the employed knowledge base.
However, we could not reliably confirm known chemical-gene interactions across selected data sources. Still, the predictive models might derive unknown information from toxicological knowledge sources, like literature, databases or omics-based exposure studies. Thus, the deep learning models might allow predicting hypotheses of exposure-related molecular effects.
Both achievements of this dissertation might support the prioritisation of chemicals for testing and an intelligent selection of chemicals for monitoring in future exposure studies.
Table of Contents
Abstract
Acknowledgements
Prelude
1 Introduction
1.1 An overview of environmental toxicology
1.1.1 Environmental toxicology
1.1.2 Chemicals in the environment
1.1.3 Systems biological perspectives in environmental toxicology
1.2 Computational toxicology
1.2.1 Omics-based approaches
1.2.2 Linking chemical exposure to transcriptional effects
1.2.3 Up-scaling from the gene level to higher biological organisation levels
1.2.4 Biomedical literature-based discovery
1.2.5 Deep learning with knowledge representation
1.3 Research question and approaches
2 Methods and Data
2.1 Linking environmental relevant mixture exposures to transcriptional effects
2.1.1 Exposure and microarray data
2.1.2 Preprocessing
2.1.3 Differential gene expression
2.1.4 Association rule mining
2.1.5 Weighted gene correlation network analysis
2.1.6 Method comparison
2.2 Predicting exposure-related effects on a molecular level
2.2.1 Input
2.2.2 Input preparation
2.2.3 Deep learning models
2.2.4 Toxicogenomic application
3 Method comparison to link complex stream water exposures to effects on the transcriptional level
3.1 Background and motivation
3.1.1 Workflow
3.2 Results
3.2.1 Data preprocessing
3.2.2 Differential gene expression analysis
3.2.3 Association rule mining
3.2.4 Network inference
3.2.5 Method comparison
3.2.6 Application case of method integration
3.3 Discussion
3.4 Conclusion
4 Deep learning prediction of chemical-biomolecule interactions
4.1 Motivation
4.1.1 Workflow
4.2 Results
4.2.1 Input preparation
4.2.2 Model selection
4.2.3 Model comparison
4.2.4 Toxicogenomic application
4.2.5 Horizontal augmentation without tail-padding
4.2.6 Four-class problem formulation
4.2.7 Training with CTD data
4.3 Discussion
4.3.1 Transferring biomedical knowledge towards toxicology
4.3.2 Deep learning with biomedical knowledge representation
4.3.3 Data integration
4.4 Conclusion
5 Conclusion and Future perspectives
5.1 Conclusion
5.1.1 Investigating complex mixtures in the environment
5.1.2 Complex knowledge from literature and curated databases predict chemical-biomolecule interactions
5.1.3 Linking chemical exposure to biological effects by integrating CTD
5.2 Future perspectives
S1 Supplement Chapter 1
S1.1 Example of an estrogen bioassay
S1.2 Types of mode of action
S1.3 The dogma of molecular biology
S1.4 Transcriptomics
S2 Supplement Chapter 3
S3 Supplement Chapter 4
S3.1 Hyperparameter tuning results
S3.2 Functional enrichment with predicted chemical-gene interactions and CTD reference pathway genesets
S3.3 Reduction of learning rate in a model with large word embedding vectors
S3.4 Horizontal augmentation without tail-padding
S3.5 Four-relationship classification
S3.6 Interpreting loss observations for SemMedDB trained models
List of Abbreviations
List of Figures
List of Tables
Bibliography
Curriculum scientiae
Declaration of Authorship
Machine Learning Based Applications for Data Visualization, Modeling, Control, and Optimization for Chemical and Biological Systems
This dissertation covers Yan Ma’s Ph.D. research on applications of machine learning in manufacturing and biological systems. The work focuses mainly on reaction modeling, optimization, and control using deep learning-based approaches, concentrating on deep reinforcement learning (DRL). The research also involves data mining in bioinformatics. Large-scale RNA-seq data are analyzed using linear and non-linear dimensionality reduction, namely Principal Component Analysis (PCA), t-Distributed Stochastic Neighbor Embedding (t-SNE), and Uniform Manifold Approximation and Projection (UMAP), followed by clustering analysis using k-Means and Hierarchical Density-Based Spatial Clustering of Applications with Noise (HDBSCAN). This report focuses on three case studies of DRL-based optimization and control: a polymerization reaction control, a bioreactor optimization, and a fed-batch reaction optimization for a reactor at Dow Inc. In the first study, a data-driven controller based on DRL is developed for a fed-batch polymerization reaction with multiple continuous manipulated variables under continuous control. The second case study is the modeling and optimization of a bioreactor. In this study, a data-driven reaction model is developed using an Artificial Neural Network (ANN) to simulate the growth curve and bio-product accumulation of the cyanobacterium Plectonema. A DRL control agent that optimizes the daily nutrient input is then applied to maximize the yield of the valuable bio-product C-phycocyanin. In experimental validation, C-phycocyanin yield is increased by 52.1% compared to a control group with the same total nutrient content. The third case study employs the data-driven control scheme to optimize a reactor from Dow Inc., where a DRL-based optimization framework is established for the Multi-Input, Multi-Output (MIMO) reaction system with reaction surrogate modeling.
Overall, Yan Ma’s research shows promising directions for employing emerging data-driven methods and deep learning in manufacturing and biological systems. DRL is demonstrated to be an efficient approach in the study of three different reaction systems with both stochastic and deterministic policies. The use of data-driven models in reaction simulation also shows promising results, owing to the non-linear nature and fast computational speed of neural network models.
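The clustering step mentioned in the abstract above (k-Means applied after dimensionality reduction) can be sketched as follows. This is a generic k-means loop on invented two-dimensional points standing in for already-reduced expression profiles; the function names, the data, and the fixed starting centroids are assumptions made for illustration and do not come from the dissertation.

```python
# Minimal sketch of k-means on 2-D points (e.g. samples after dimensionality
# reduction). The points and initial centroids are invented for illustration.


def sq_dist(p, q):
    """Squared Euclidean distance between two equal-length tuples."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def kmeans(points, centroids, max_iter=100):
    """Alternate assignment and centroid update until the centroids stop moving."""
    centroids = [tuple(float(x) for x in c) for c in centroids]
    labels = []
    for _ in range(max_iter):
        # Assignment step: each point goes to its nearest centroid.
        labels = [
            min(range(len(centroids)), key=lambda k: sq_dist(p, centroids[k]))
            for p in points
        ]
        # Update step: each centroid becomes the mean of its members.
        updated = []
        for k, old in enumerate(centroids):
            members = [p for p, lab in zip(points, labels) if lab == k]
            if members:
                updated.append(tuple(sum(x) / len(members) for x in zip(*members)))
            else:
                updated.append(old)  # keep an empty cluster's centroid in place
        if updated == centroids:
            break
        centroids = updated
    return labels, centroids


if __name__ == "__main__":
    pts = [(0, 0), (0, 1), (1, 0), (9, 9), (10, 9), (9, 10)]
    print(kmeans(pts, [(0, 0), (10, 10)]))
```

Starting centroids are passed in explicitly so the run is deterministic; in practice they are usually chosen at random or with a seeding heuristic, and density-based methods such as HDBSCAN do not need the number of clusters in advance.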
Heterogeneous Multi-Layered Network Model for Omics Data Integration and Analysis
Advances in next-generation sequencing and high-throughput techniques have enabled the generation of vast amounts of diverse omics data. These big data provide an unprecedented opportunity in biology, but impose great challenges in data integration, data mining, and knowledge discovery due to the complexity, heterogeneity, dynamics, uncertainty, and high dimensionality inherent in omics data. Networks have been widely used to represent relations between entities in biological systems, such as protein-protein interaction, gene regulation, and brain connectivity (i.e., network construction), as well as to infer novel relations given a reconstructed network (i.e., link prediction). In particular, the heterogeneous multi-layered network (HMLN) has proven successful in integrating diverse biological data to represent the hierarchy of a biological system. The HMLN provides unparalleled opportunities but imposes new computational challenges in establishing causal genotype-phenotype associations and understanding environmental impact on organisms. In this review, we focus on recent advances in developing novel computational methods for the inference of novel biological relations from the HMLN. First, we discuss the properties of biological HMLNs. Second, we survey four categories of state-of-the-art methods (matrix factorization, random walk, knowledge graph, and deep learning). Third, we demonstrate their applications to omics data integration and analysis. Finally, we outline strategies for future directions in the development of new HMLN models.
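Of the four method categories named in the abstract above, random walk is the simplest to show in a few lines. The sketch below runs a generic random walk with restart on a tiny two-layer network and ranks nodes of one layer by their proximity to a seed in the other, which is one common way to score candidate links. The node names, edges, restart probability, and function names are all invented for illustration; this is not a method taken from the review.

```python
# Minimal sketch of random walk with restart (RWR) on a toy two-layer network.
# Nodes and edges are invented placeholders, not real biological data.


def rwr(edges, seed, restart=0.3, tol=1e-10, max_iter=1000):
    """Return the stationary visiting probability of each node for a walker
    that jumps back to `seed` with probability `restart` at every step."""
    neighbours = {}
    for u, v in edges:  # treat every edge as undirected
        neighbours.setdefault(u, set()).add(v)
        neighbours.setdefault(v, set()).add(u)
    if seed not in neighbours:
        raise ValueError("seed node does not appear in any edge")
    nodes = sorted(neighbours)
    prob = {n: 1.0 if n == seed else 0.0 for n in nodes}
    for _ in range(max_iter):
        nxt = {n: (restart if n == seed else 0.0) for n in nodes}
        for u in nodes:
            # Spread the non-restarting mass evenly over u's neighbours.
            share = (1.0 - restart) * prob[u] / len(neighbours[u])
            for v in neighbours[u]:
                nxt[v] += share
        delta = sum(abs(nxt[n] - prob[n]) for n in nodes)
        prob = nxt
        if delta < tol:
            break
    return prob


def rank(prob, candidates):
    """Order candidate nodes from most to least reachable from the seed."""
    return sorted(candidates, key=lambda n: prob[n], reverse=True)


TOY_EDGES = [
    ("gene:A", "gene:B"),        # within the gene layer
    ("gene:B", "gene:C"),
    ("gene:B", "disease:X"),     # cross-layer links
    ("gene:C", "disease:Y"),
    ("disease:X", "disease:Y"),  # within the disease layer
]

if __name__ == "__main__":
    scores = rwr(TOY_EDGES, "gene:A")
    print(rank(scores, ["disease:Y", "disease:X"]))
```

Here disease:X sits closer to the seed than disease:Y, so it ranks first. A real heterogeneous network would typically weight edges by type and layer rather than treating all links equally, as this sketch does.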
Using Neural Networks for Relation Extraction from Biomedical Literature
Using different sources of information to support the automated extraction of
relations between biomedical concepts contributes to the development of our
understanding of biological systems. The primary comprehensive source of these
relations is biomedical literature. Several relation extraction approaches have
been proposed to identify relations between concepts in biomedical literature,
notably those using neural network algorithms. The use of multichannel
architectures composed of multiple data representations, as in deep neural
networks, is leading to state-of-the-art results. The right combination of data
representations can eventually lead to even higher evaluation scores in
relation extraction tasks. Thus, biomedical ontologies play a fundamental role
by providing semantic and ancestry information about an entity. The
incorporation of biomedical ontologies has already been proven to enhance
previous state-of-the-art results.
Comment: Artificial Neural Networks book (Springer) - Chapter 1