
    Deep learning methods for mining genomic sequence patterns

    With the growing availability of large-scale genomic datasets and advanced computational techniques, more and more data-driven computational methods have been developed to analyze genomic data and help solve incompletely understood biological problems. Among them, deep learning methods have been proposed to automatically learn and recognize the functional activity of DNA sequences from genomic data. Techniques for efficiently mining genomic sequence patterns will help improve our understanding of gene regulation and thus accelerate progress toward using personal genomes in medicine. This dissertation focuses on the development of deep learning methods for mining genomic sequences. First, we compare the performance of deep learning models and traditional machine learning methods in recognizing various genomic sequence patterns. Through extensive experiments on both simulated data and real genomic sequence data, we demonstrate that an appropriate deep learning model can generally be built to successfully recognize various genomic sequence patterns. Next, we develop deep learning methods to help solve two specific biological problems: (1) inference of the polyadenylation code and (2) tRNA gene detection and functional prediction. Polyadenylation is a pervasive mechanism used by eukaryotes to regulate mRNA transcription, localization, and translation efficiency. Polyadenylation signals in plants are particularly noisy and challenging to decipher. A deep convolutional neural network approach, DeepPolyA, is proposed to predict poly(A) sites from Arabidopsis thaliana genomic sequences. It employs various deep neural network architectures and outperforms competing methods, including classical machine learning algorithms and several popular deep learning models. Transfer RNAs (tRNAs) represent a highly complex class of genes and play a central role in protein translation.
tRNAscan-SE remains the de facto tool for identifying tRNA genes encoded in genomes. Despite its popularity and success, tRNAscan-SE is still not powerful enough to separate tRNAs from pseudo-tRNAs, and as a result it can output a significant number of false positives. To address this issue, tRNA-DL, a hybrid approach combining a convolutional neural network and a recurrent neural network, is proposed. The proposed method is shown to substantially reduce the false positive rate of the state-of-the-art tRNA prediction tool tRNAscan-SE. Coupled with tRNAscan-SE, tRNA-DL can serve as a useful complementary tool for tRNA annotation. Taken together, the experiments and applications demonstrate the superiority of deep learning in automatic feature generation for characterizing genomic sequence patterns.
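
    The abstract above does not say how sequences are presented to the networks. As a minimal sketch, assuming the common one-hot encoding of DNA, the code below shows how a convolutional filter can act as a motif scanner. The names `one_hot` and `motif_score` are hypothetical and are not taken from DeepPolyA or tRNA-DL.

```python
# Illustrative only: one-hot encoding is an assumed input format,
# not a detail confirmed by the abstract.
ALPHABET = "ACGT"

def one_hot(seq):
    """Encode a DNA string as one 4-element row per base.

    Bases outside ACGT (for example N) become an all-zero row.
    """
    return [[1 if base == letter else 0 for letter in ALPHABET]
            for base in seq.upper()]

def motif_score(encoded, weights):
    """Slide a (width x 4) weight matrix along the encoding.

    This is the operation a single 1-D convolutional filter performs;
    it returns one score per valid start offset.
    """
    width = len(weights)
    scores = []
    for start in range(len(encoded) - width + 1):
        total = 0.0
        for i in range(width):
            for j in range(4):
                total += encoded[start + i][j] * weights[i][j]
        scores.append(total)
    return scores

if __name__ == "__main__":
    # A hand-written filter that matches the dinucleotide "TA".
    ta_filter = one_hot("TA")
    print(motif_score(one_hot("ATAT"), ta_filter))
```

    In a trained network the filter weights are learned rather than written by hand, which is what the abstract means by automatic feature generation.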

    A neural network method for identification of prokaryotic and eukaryotic signal peptides and prediction of their cleavage sites

    We have developed a new method for identification of signal peptides and their cleavage sites based on neural networks trained on separate sets of prokaryotic and eukaryotic sequences. The method performs significantly better than previous prediction schemes, and can easily be applied to genome-wide data sets. Discrimination between cleaved signal peptides and uncleaved N-terminal signal-anchor sequences is also possible, though with lower precision. Predictions can be made on a publicly available WWW server. Present address: Novo Nordisk A/S, Scientific Computing, Building 9M1, Novo Alle, DK-2880 Bagsværd, Denmark. Introduction: Signal peptides control the entry of virtually all proteins to the secretory pathway, both in eukaryotes and prokaryotes (von Heijne, 1990; Gierasch, 1989; Rapoport, 1992). They comprise the N-terminal part of the amino acid chain, and are cleaved off while the protein is translocated through the membrane. The common structure of signal peptides from variou..

    CoBaltDB: Complete bacterial and archaeal orfeomes subcellular localization database and associated resources

    BACKGROUND: The functions of proteins are strongly related to their localization in cell compartments (for example the cytoplasm or membranes), but the experimental determination of the subcellular localization of proteomes is laborious and expensive. A fast and low-cost alternative approach is in silico prediction, based on features of the protein primary sequences. However, biologists are confronted with a very large number of computational tools that use different methods and address various localization features with diverse specificities and sensitivities. As a result, exploiting these computer resources to predict protein localization accurately involves querying all tools and comparing every prediction output; this is a painstaking task. Therefore, we developed a comprehensive database, called CoBaltDB, that gathers all prediction outputs concerning complete prokaryotic proteomes. DESCRIPTION: The current version of CoBaltDB integrates the results of 43 localization predictors for 784 complete bacterial and archaeal proteomes (2,548,292 proteins in total). CoBaltDB supplies a simple user-friendly interface for retrieving and exploring relevant information about predicted features (such as signal peptide cleavage sites and transmembrane segments). Data are organized into three work-sets ("specialized tools", "meta-tools" and "additional tools"). The database can be queried using the organism name, a locus tag or a list of locus tags and may be browsed using numerous graphical and text displays. CONCLUSIONS: With its new functionalities, CoBaltDB is a novel, powerful platform that provides easy access to the results of multiple localization tools and support for predicting prokaryotic protein localizations with higher confidence than previously possible. CoBaltDB is available at http://www.umr6026.univ-rennes1.fr/english/home/research/basic/software/cobalten

    Pushing the Boundaries of Biomolecule Characterization through Deep Learning

    The importance of studying biological molecules in living organisms can hardly be overstated, as they regulate crucial processes in living matter of all kinds. Their ubiquitous nature makes them relevant for disease diagnosis, drug development, and for our fundamental understanding of the complex systems of biology. However, due to their small size, they scatter too little light on their own to be directly visible and available for study. Thus, it is necessary to develop characterization methods which enable their elucidation even in the regime of very faint signals. Optical systems, utilizing the relatively low intrusiveness of visible light, constitute one such approach to characterization. However, the optical systems currently capable of analyzing single molecules in the nano-sized regime require the species of interest either to be tagged with visible labels, such as fluorescent ones, or to be chemically restrained on a surface. Ergo, there exist effectively no methods of characterizing very small biomolecules under naturally relevant conditions through unobtrusive probing. Nanofluidic Scattering Microscopy is a method introduced in this thesis which bridges this gap by enabling the real-time label-free size-and-weight determination of freely diffusing molecules directly in small nano-sized channels. However, the molecule signals are so faint, and the background noise so complex with high spatial and temporal variation, that standard methods of data analysis are incapable of elucidating the molecules' properties of relevance in any but the least challenging conditions. To remedy the weak signal, and realize the method's full potential, this thesis focuses on the development of a versatile deep-learning-based computer-vision platform to overcome the bottleneck of data analysis.
We find that said platform has considerably increased speed, accuracy, precision and limit of detection compared to standard methods, achieving a lower detection limit than any other method of label-free optical characterization currently available. In this regime, hitherto elusive species of biomolecules become accessible for study, potentially opening up entirely new avenues of biological research. These results, along with many others in the context of deep learning for optical microscopy in biological applications, suggest that deep learning is likely to be pivotal in solving the complex image analysis problems of the present and enabling new regimes of study within microscopy-based research in the near future.

    Improving nonlinear search with Self-Organizing Maps - Application to Magnetic Resonance Relaxometry

    Quantification of myelin in vivo is crucial for the understanding of neurological diseases such as multiple sclerosis (MS). Multi-Component Driven Equilibrium Single Pulse Observation of T1 and T2 (mcDESPOT) is a rapid and precise method for determining the longitudinal and transverse relaxation times in a voxel-wise fashion. Briefly, mcDESPOT couples sets of SPGR (spoiled gradient-recalled echo) and bSSFP (fully balanced steady-state free precession) data acquired over a range of flip angles (α) with constant interpulse spacing (TR) to derive 6 parameters (free-water T1 and T2, myelin-associated water T1 and T2, relative myelin-associated water volume fraction, and the myelin-associated water proton residence time) based on water exchange models. However, this procedure is computationally expensive and extremely difficult because the best fit to the 24 MRI signal volumes must be found by searching a nonlinear 6-dimensional space of model parameters. In this context, the aim of this work is to improve mcDESPOT efficiency and accuracy using tissue information contained in the acquired sets of signals (SPGR and bSSFP). The basic hypothesis is that similar acquired signals correspond to tissue portions with similar features, which translate into similar parameters. This similarity could be used to drive the nonlinear mcDESPOT fitting, leading the optimization algorithm (which is based on a stochastic region contraction approach) to look for a solution (i.e. the 6-parameter vector) also in regions defined by previously computed solutions of other voxels with similar signals. For this reason, we clustered the sets of SPGR and bSSFP signals using the neural network called the Self-Organizing Map (SOM), which uses a competitive learning technique to train itself in an unsupervised manner. The similarity information obtained from the SOM was then used to suggest solutions to the optimization algorithm.
A first validation phase with in silico data was performed to evaluate the performance of the SOM and of the modified method, SOM+mcDESPOT. The latter was further validated using real magnetic resonance images. The last step consisted of applying SOM+mcDESPOT to a group of healthy subjects ( ) and a group of MS patients ( ) to look for differences in myelin-associated water fraction values between the two groups. The validation phases with in silico data verified the initial hypothesis: in more than 74% of cases, the correct solution for a given voxel lies in the space dictated by the cluster to which that voxel is mapped. Adding the information of similar solutions extracted from that cluster helps to improve the signal fitting and the accuracy in the determination of the 7 parameters. This result holds even if the data are corrupted by a high level of noise (SNR=50). Using real images confirmed the power of SOM+mcDESPOT indicated by the in silico data. Applying SOM+mcDESPOT to the controls and to the MS patients first yielded more feasible results than the traditional mcDESPOT. Moreover, a statistically significant difference in the myelin-associated water fraction values in the normal-appearing white matter was found between the two groups: the MS patients show lower fraction values compared to the normal subjects, indicating an abnormal presence of myelin in the normal-appearing white matter of MS patients. In conclusion, we proposed the novel method SOM+mcDESPOT, which is able to extract and exploit the information contained in the MRI signals to drive the optimization algorithm implemented in mcDESPOT appropriately. In so doing, the overall accuracy of the method improves in both the signal fitting and the determination of the 7 parameters.
Thus, the outstanding potential of SOM+mcDESPOT could assume a crucial role in improving the indirect quantification of myelin in both healthy subjects and patients.
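
    The clustering-then-suggestion idea described above can be sketched as follows. This is a minimal illustration, assuming a small one-dimensional SOM with linearly decaying learning rate and neighbourhood radius; the map size, the schedules, and the function names (`train_som`, `best_unit`, `suggest_solutions`) are hypothetical and are not taken from the SOM+mcDESPOT implementation.

```python
import math
import random

def best_unit(weights, x):
    """Index of the SOM unit whose weight vector is closest to x."""
    return min(range(len(weights)),
               key=lambda u: sum((w - v) ** 2 for w, v in zip(weights[u], x)))

def train_som(signals, n_units=4, epochs=50, lr0=0.5, seed=0):
    """Train a 1-D SOM by competitive learning (assumed schedules)."""
    rng = random.Random(seed)
    dim = len(signals[0])
    # Initialise each unit from a randomly chosen training signal.
    weights = [list(rng.choice(signals)) for _ in range(n_units)]
    radius0 = max(n_units / 2.0, 1.0)
    for epoch in range(epochs):
        frac = epoch / epochs
        lr = lr0 * (1.0 - frac)                     # decaying learning rate
        radius = max(radius0 * (1.0 - frac), 0.5)   # shrinking neighbourhood
        for x in rng.sample(signals, len(signals)):
            bmu = best_unit(weights, x)
            for u in range(n_units):
                # Gaussian neighbourhood around the best-matching unit.
                h = math.exp(-((u - bmu) ** 2) / (2.0 * radius ** 2))
                for d in range(dim):
                    weights[u][d] += lr * h * (x[d] - weights[u][d])
    return weights

def suggest_solutions(weights, signals, solved, query):
    """Return parameter vectors already fitted for voxels whose signals
    map to the same SOM unit as the query signal."""
    target = best_unit(weights, query)
    return [params for sig, params in zip(signals, solved)
            if best_unit(weights, sig) == target]

if __name__ == "__main__":
    signals = [[0.0, 0.0], [0.1, 0.0], [5.0, 5.0], [5.1, 5.0]]
    som = train_som(signals, n_units=2)
    print(best_unit(som, [0.05, 0.0]), best_unit(som, [5.05, 5.0]))
```

    In this sketch the returned candidates would only seed or bias the search region of the optimizer; how the suggestions are combined with stochastic region contraction is not specified in the abstract and is left out here.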