
Deep learning methods for mining genomic sequence patterns

Author: Gao Xin
Publication venue: Digital Commons @ NJIT
Publication date: 31/12/2018

With the growing availability of large-scale genomic datasets and advanced computational techniques, more and more data-driven computational methods have been developed to analyze genomic data and help solve incompletely understood biological problems. Among them, deep learning methods have been proposed to automatically learn and recognize the functional activity of DNA sequences from genomic data. Techniques for efficiently mining genomic sequence patterns will help improve our understanding of gene regulation, and thus accelerate progress toward using personal genomes in medicine. This dissertation focuses on the development of deep learning methods for mining genomic sequences. First, we compare the performance of deep learning models and traditional machine learning methods in recognizing various genomic sequence patterns. Through extensive experiments on both simulated data and real genomic sequence data, we demonstrate that an appropriate deep learning model can generally be built to recognize various genomic sequence patterns successfully. Next, we develop deep learning methods to help solve two specific biological problems: (1) inference of the polyadenylation code and (2) tRNA gene detection and functional prediction. Polyadenylation is a pervasive mechanism used by eukaryotes to regulate mRNA transcription, localization, and translation efficiency. Polyadenylation signals in plants are particularly noisy and challenging to decipher. A deep convolutional neural network approach, DeepPolyA, is proposed to predict poly(A) sites from genomic sequences of the plant Arabidopsis thaliana. It employs various deep neural network architectures and demonstrates its superiority over competing methods, including classical machine learning algorithms and several popular deep learning models. Transfer RNAs (tRNAs) represent a highly complex class of genes and play a central role in protein translation. There remains a de facto tool, tRNAscan-SE, for identifying tRNA genes encoded in genomes. Despite its popularity and success, tRNAscan-SE is still not powerful enough to separate tRNAs from pseudo-tRNAs, and a significant number of false positives can be output as a result. To address this issue, tRNA-DL, a hybrid of a convolutional neural network and a recurrent neural network, is proposed. The proposed method is shown to substantially reduce the false positive rate of the state-of-the-art tRNA prediction tool tRNAscan-SE. Coupled with tRNAscan-SE, tRNA-DL can serve as a useful complementary tool for tRNA annotation. Taken together, the experiments and applications demonstrate the superiority of deep learning in automatic feature generation for characterizing genomic sequence patterns.
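As a rough illustration of how a convolutional model can read a DNA sequence, the sketch below one-hot encodes a sequence and slides a single filter over it. It is not the DeepPolyA architecture, which the abstract does not specify; the function names `one_hot` and `scan`, the example sequence, and the choice of the hexamer AATAAA as a filter are assumptions made for illustration only.

```python
import numpy as np

ALPHABET = "ACGT"


def one_hot(seq):
    """Encode a DNA string as a (len(seq), 4) matrix.

    Bases outside A/C/G/T (for example N) are left as all-zero rows.
    """
    index = {base: i for i, base in enumerate(ALPHABET)}
    encoded = np.zeros((len(seq), len(ALPHABET)), dtype=np.float32)
    for pos, base in enumerate(seq.upper()):
        if base in index:
            encoded[pos, index[base]] = 1.0
    return encoded


def scan(encoded, kernel):
    """Slide a (k, 4) filter over a (L, 4) sequence without padding.

    Each score is the sum of elementwise products in one window, which is
    what a single 1D convolutional filter computes before the nonlinearity.
    """
    k = kernel.shape[0]
    n_windows = encoded.shape[0] - k + 1
    return np.array([np.sum(encoded[i:i + k] * kernel) for i in range(n_windows)])


# Hypothetical hand-written filter: an exact-match detector for AATAAA.
# A trained network would learn its filter weights from data instead.
motif_filter = one_hot("AATAAA")
scores = scan(one_hot("GGCAATAAAGGC"), motif_filter)
best = int(np.argmax(scores))
print(best, scores[best])
```

Here the highest score falls at offset 3, where the hexamer occurs in the example sequence. In a learned model many such filters are trained jointly and followed by pooling and dense layers, which is how noisy signals that no single hand-written motif captures can still be recognized.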

Digital Commons @ New Jersey Institute of Technology (NJIT)

A neural network method for identification of prokaryotic and eukaryotic signal peptides and prediction of their cleavage sites

Author: Brunak Søren
Engelbrecht Jacob
Nielsen Henrik
von Heijne Gunnar
Publication date: 01/01/1997

We have developed a new method for identification of signal peptides and their cleavage sites based on neural networks trained on separate sets of prokaryotic and eukaryotic sequences. The method performs significantly better than previous prediction schemes, and can easily be applied to genome-wide data sets. Discrimination between cleaved signal peptides and uncleaved N-terminal signal-anchor sequences is also possible, though with lower precision. Predictions can be made on a publicly available WWW server. Present address: Novo Nordisk A/S, Scientific Computing, Building 9M1, Novo Alle, DK-2880 Bagsværd, Denmark. Introduction: Signal peptides control the entry of virtually all proteins to the secretory pathway, both in eukaryotes and prokaryotes (von Heijne, 1990; Gierasch, 1989; Rapoport, 1992). They comprise the N-terminal part of the amino acid chain, and are cleaved off while the protein is translocated through the membrane. The common structure of signal peptides from variou..

CiteSeerX

Online Research Database In Technology

CoBaltDB: Complete bacterial and archaeal orfeomes subcellular localization database and associated resources

Author: Avner Stéphane
Barloy-Hubler Frédérique
Goudenège David
Lucchetti-Miganeh Céline
Publication venue: BioMed Central
Publication date: 01/01/2010

BACKGROUND: The functions of proteins are strongly related to their localization in cell compartments (for example the cytoplasm or membranes), but the experimental determination of the sub-cellular localization of proteomes is laborious and expensive. A fast and low-cost alternative approach is in silico prediction, based on features of the protein primary sequences. However, biologists are confronted with a very large number of computational tools that use different methods and address various localization features with diverse specificities and sensitivities. As a result, exploiting these computer resources to predict protein localization accurately involves querying all tools and comparing every prediction output; this is a painstaking task. Therefore, we developed a comprehensive database, called CoBaltDB, that gathers all prediction outputs concerning complete prokaryotic proteomes. DESCRIPTION: The current version of CoBaltDB integrates the results of 43 localization predictors for 784 complete bacterial and archaeal proteomes (2,548,292 proteins in total). CoBaltDB supplies a simple user-friendly interface for retrieving and exploring relevant information about predicted features (such as signal peptide cleavage sites and transmembrane segments). Data are organized into three work-sets ("specialized tools", "meta-tools" and "additional tools"). The database can be queried using the organism name, a locus tag or a list of locus tags, and may be browsed using numerous graphical and text displays. CONCLUSIONS: With its new functionalities, CoBaltDB is a novel and powerful platform that provides easy access to the results of multiple localization tools and supports predicting prokaryotic protein localizations with higher confidence than previously possible. CoBaltDB is available at http://www.umr6026.univ-rennes1.fr/english/home/research/basic/software/cobalten

Springer - Publisher Connector

PubMed Central

HAL-Rennes 1


Deep Learning for Single-Molecule Science

Author: Tim Albrecht
Gregory Slabaugh
Eduardo Alonso
SM Masudur R Al-Arif
Publication venue: IOP Publishing
Publication date: 18/09/2017

Exploring and making predictions based on single-molecule data can be challenging, not only due to the sheer size of the datasets, but also because a priori knowledge about the signal characteristics is typically limited and the signal-to-noise ratio is poor. For example, hypothesis-driven data exploration, informed by an expectation of the signal characteristics, can lead to interpretation bias or loss of information. Equally, even when the different data categories are known, e.g., the four bases in DNA sequencing, it is often difficult to know how to make best use of the available information content. The latest developments in Machine Learning (ML), so-called Deep Learning (DL), offer interesting new avenues to address such challenges. In some applications, such as speech and image recognition, DL has been able to outperform conventional Machine Learning strategies and even human performance. However, to date DL has not been applied much in single-molecule science, presumably in part because relatively little is known about the 'internal workings' of such DL tools within single-molecule science as a field. In this Tutorial, we attempt to illustrate in a step-by-step guide how one of those tools, a Convolutional Neural Network, may be used for base calling in DNA sequencing applications. We compare it with a Support Vector Machine as a more conventional ML method, and discuss some of the strengths and weaknesses of the approach. In particular, a 'deep' neural network has many features of a 'black box', which has important implications for how we look at and interpret data.

City Research Online

Crossref

University of Birmingham Research Portal

Pushing the Boundaries of Biomolecule Characterization through Deep Learning

Author: Klein Moberg Henrik
Publication date: 01/01/2023

The importance of studying biological molecules in living organisms can hardly be overstated, as they regulate crucial processes in living matter of all kinds. Their ubiquitous nature makes them relevant for disease diagnosis, drug development, and for our fundamental understanding of the complex systems of biology. However, due to their small size, they scatter too little light on their own to be directly visible and available for study. Thus, it is necessary to develop characterization methods which enable their elucidation even in the regime of very faint signals. Optical systems, utilizing the relatively low intrusiveness of visible light, constitute one such approach to characterization. However, the optical systems currently capable of analyzing single molecules in the nano-sized regime require the species of interest either to be tagged with visible labels such as fluorescent markers or to be chemically restrained on a surface. Ergo, there exist effectively no methods of characterizing very small biomolecules under naturally relevant conditions through unobtrusive probing. Nanofluidic Scattering Microscopy is a method introduced in this thesis which bridges this gap by enabling the real-time label-free size-and-weight determination of freely diffusing molecules directly in small nano-sized channels. However, the molecule signals are so faint, and the background noise so complex with high spatial and temporal variation, that standard methods of data analysis are incapable of elucidating the molecules' properties of relevance in any but the least challenging conditions. To remedy the weak signal, and realize the method's full potential, this thesis focuses on the development of a versatile deep-learning-based computer-vision platform to overcome the bottleneck of data analysis. We find that said platform has considerably increased speed, accuracy, precision and limit of detection compared to standard methods, achieving an even lower detection limit than any other method of label-free optical characterization currently available. In this regime, hitherto elusive species of biomolecules become accessible for study, potentially opening up entirely new avenues of biological research. These results, along with many others in the context of deep learning for optical microscopy in biological applications, suggest that deep learning is likely to be pivotal in solving the complex image analysis problems of the present and enabling new regimes of study within microscopy-based research in the near future.

Chalmers Research


Scoring functions for protein docking and drug design

Author: Viswanath Shruthi
Publication date: 26/06/2014

Predicting the structure of complexes formed by two interacting proteins is an important problem in computational structural biology. Proteins perform many of their functions by binding to other proteins. The structure of protein-protein complexes provides atomic details about protein function and biochemical pathways, and can help in designing drugs that inhibit binding. Docking computationally models the structure of protein-protein complexes, given three-dimensional structures of the individual chains. Protein docking methods have two phases. In the first phase, a comprehensive, coarse search is performed for optimally docked models. In the second phase, refinement and reranking, the models from the first phase are refined and reranked, with the expectation of extracting a small set of accurate models from the pool of thousands of models obtained from the first phase. In this thesis, new algorithms are developed for the refinement and reranking phase of docking. New scoring functions, or potentials, that rank models are developed. These potentials are learnt using large-scale machine learning methods based on mathematical programming. The procedure for learning these potentials involves examining hundreds of thousands of correct and incorrect models. In this thesis, hierarchical constraints were introduced into the learning algorithm. First, an atomic potential was developed using this learning procedure. A refinement procedure involving side-chain remodeling and conjugate gradient-based minimization was introduced. The refinement procedure combined with the atomic potential was shown to improve docking accuracy significantly. Second, a hydrogen bond potential was developed. Molecular dynamics-based sampling combined with the hydrogen bond potential improved docking predictions. Third, mathematical programming compared favorably to SVMs and neural networks in terms of accuracy, training time and test time for the task of designing potentials to rank docking models. The methods described in this thesis are implemented in the docking package DOCK/PIERR. DOCK/PIERR was shown to be among the best automated docking methods in community-wide assessments. Finally, DOCK/PIERR was extended to predict membrane protein complexes. A membrane-based score was added to the reranking phase, and shown to improve the accuracy of docking. This docking algorithm for membrane proteins was used to study the dimers of amyloid precursor protein, implicated in Alzheimer's disease.

Texas ScholarWorks

Improving nonlinear search with Self-Organizing Maps - Application to Magnetic Resonance Relaxometry

Author: Laura Gaetano
Publication date: 01/01/2012

Quantification of myelin in vivo is crucial for the understanding of neurological diseases such as multiple sclerosis (MS). Multi-Component Driven Equilibrium Single Pulse Observation of T1 and T2 (mcDESPOT) is a rapid and precise method for determining the longitudinal and transverse relaxation times in a voxel-wise fashion. Briefly, mcDESPOT couples sets of SPGR (spoiled gradient-recalled echo) and bSSFP (fully balanced steady-state free precession) data acquired over a range of flip angles (α) with constant interpulse spacing (TR) to derive 6 parameters (free-water T1 and T2, myelin-associated water T1 and T2, relative myelin-associated water volume fraction, and the myelin-associated water proton residence time) based on water exchange models. However, this procedure is computationally expensive and extremely difficult, because the best fit to the 24 MRI signal volumes must be found by searching a nonlinear 6-dimensional space of model parameters. In this context, the aim of this work is to improve mcDESPOT efficiency and accuracy using the tissue information contained in the acquired sets of signals (SPGR and bSSFP). The basic hypothesis is that similar acquired signals come from tissue regions with similar features, which translate into similar parameters. This similarity could be used to drive the nonlinear mcDESPOT fitting, leading the optimization algorithm (which is based on a stochastic region contraction approach) to look for a solution (i.e. the 6-parameter vector) also in regions defined by previously computed solutions of other voxels with similar signals. For this reason, we clustered the sets of SPGR and bSSFP signals using the neural network called the Self-Organizing Map (SOM), which uses a competitive learning technique to train itself in an unsupervised manner. The similarity information obtained from the SOM was then used to suggest solutions to the optimization algorithm. A first validation phase with in silico data was performed to evaluate the performance of the SOM and of the modified method, SOM+mcDESPOT. The latter was further validated using real magnetic resonance images. The last step consisted of applying SOM+mcDESPOT to a group of healthy subjects ( ) and a group of MS patients ( ) to look for differences in myelin-associated water fraction values between the two groups. The validation phases with in silico data verified the initial hypothesis: in more than 74% of cases, the correct solution for a given voxel lies in the space dictated by the cluster to which that voxel is mapped. Adding the information of similar solutions extracted from that cluster helps to improve the signal fitting and the accuracy in the determination of the 7 parameters. This result holds even when the data are corrupted by a high level of noise (SNR=50). Using real images confirmed the power of SOM+mcDESPOT indicated by the in silico data. The application of SOM+mcDESPOT to the controls and to the MS patients first yielded more feasible results than the traditional mcDESPOT. Moreover, a statistically significant difference in the myelin-associated water fraction values in the normal-appearing white matter was found between the two groups: the MS patients show lower fraction values compared to the normal subjects, indicating an abnormal presence of myelin in the normal-appearing white matter of MS patients. In conclusion, we proposed the novel method SOM+mcDESPOT, which is able to extract and exploit the information contained in the MRI signals to drive the optimization algorithm implemented in mcDESPOT appropriately. In so doing, the overall accuracy of the method improves in both the signal fitting and the determination of the 7 parameters. Thus, the outstanding potential of SOM+mcDESPOT could assume a crucial role in improving the indirect quantification of myelin in both healthy subjects and patients.
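The clustering step described in this abstract can be illustrated with a minimal one-dimensional Self-Organizing Map. This is only a sketch under stated assumptions: the function names `train_som` and `best_matching_unit`, the map size, the learning-rate and neighbourhood schedules, and the two-dimensional synthetic "signals" are all hypothetical, and the coupling to the stochastic region contraction optimizer used in the thesis is not shown.

```python
import numpy as np


def train_som(data, n_units=4, epochs=50, lr0=0.5, sigma0=1.0, seed=0):
    """Train a 1D SOM with competitive learning (illustrative schedules)."""
    rng = np.random.default_rng(seed)
    # Deterministic initialisation: units spread evenly between the
    # per-feature minimum and maximum of the data.
    weights = np.linspace(data.min(axis=0), data.max(axis=0), n_units)
    grid = np.arange(n_units)
    for epoch in range(epochs):
        frac = epoch / epochs
        lr = lr0 * (1.0 - frac)                   # decaying learning rate
        sigma = max(sigma0 * (1.0 - frac), 0.1)   # shrinking neighbourhood
        for x in data[rng.permutation(len(data))]:
            # Competitive step: find the unit closest to the sample.
            bmu = np.argmin(np.linalg.norm(weights - x, axis=1))
            # Cooperative step: pull the winner and its grid neighbours
            # toward the sample, weighted by a Gaussian on grid distance.
            h = np.exp(-((grid - bmu) ** 2) / (2.0 * sigma ** 2))
            weights += lr * h[:, None] * (x - weights)
    return weights


def best_matching_unit(weights, x):
    """Index of the SOM unit (cluster) whose weight vector is nearest to x."""
    return int(np.argmin(np.linalg.norm(weights - x, axis=1)))


# Synthetic stand-in for voxel signals: two well-separated groups.
rng = np.random.default_rng(0)
cluster_a = rng.normal(0.0, 0.3, size=(20, 2))
cluster_b = rng.normal(10.0, 0.3, size=(20, 2))
signals = np.vstack([cluster_a, cluster_b])

weights = train_som(signals)
unit_a = best_matching_unit(weights, np.array([0.0, 0.0]))
unit_b = best_matching_unit(weights, np.array([10.0, 10.0]))
print(unit_a, unit_b)
```

Samples that map to the same unit are treated as similar, so previously fitted solutions for voxels in that unit could be offered to an optimizer as candidate starting regions, which is the idea the abstract describes.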

PORTO@iris (Publications Open Repository TOrino - Politecnico di Torino)

PORTO Publications Open Repository TOrino