1,526 research outputs found

    ProLanGO: Protein Function Prediction Using Neural~Machine Translation Based on a Recurrent Neural Network

    Full text link
    With the development of next generation sequencing techniques, it is fast and cheap to determine protein sequences but relatively slow and expensive to extract useful information from protein sequences because of limitations of traditional biological experimental techniques. Protein function prediction has been a long standing challenge to fill the gap between the huge amount of protein sequences and the known function. In this paper, we propose a novel method to convert the protein function problem into a language translation problem by the new proposed protein sequence language "ProLan" to the protein function language "GOLan", and build a neural machine translation model based on recurrent neural networks to translate "ProLan" language to "GOLan" language. We blindly tested our method by attending the latest third Critical Assessment of Function Annotation (CAFA 3) in 2016, and also evaluate the performance of our methods on selected proteins whose function was released after CAFA competition. The good performance on the training and testing datasets demonstrates that our new proposed method is a promising direction for protein function prediction. In summary, we first time propose a method which converts the protein function prediction problem to a language translation problem and applies a neural machine translation model for protein function prediction.Comment: 13 pages, 5 figure

    The Amphioxus SoxB Family: Implications for the Evolution of Vertebrate Placodes

    Get PDF
    Cranial placodes are regions of thickened ectoderm that give rise to sense organs and ganglia in the vertebrate head. Homologous structures are proposed to exist in urochordates, but have not been found in cephalochordates, suggesting the first chordates lacked placodes. SoxB genes are expressed in discrete subsets of vertebrate placodes. To investigate how placodes arose and diversified in the vertebrate lineage we isolated the complete set of SoxB genes from amphioxus and analyzed their expression in embryos and larvae. We find that while amphioxus possesses a single SoxB2 gene, it has three SoxB1 paralogs. Like vertebrate SoxB1 genes, one of these paralogs is expressed in non-neural ectoderm destined to give rise to sensory cells. When considered in the context of other amphioxus placode marker orthologs, amphioxus SoxB1 expression suggests a diversity of sensory cell types utilizing distinct placode-type gene programs was present in the first chordates. Our data supports a model for placode evolution and diversification whereby the full complement of vertebrate placodes evolved by serial recruitment of distinct sensory cell specification programs to anterior pre-placodal ectoderm

    Fast protein superfamily classification using principal component null space analysis.

    Get PDF
    The protein family classification problem, which consists of determining the family memberships of given unknown protein sequences, is very important for a biologist for many practical reasons, such as drug discovery, prediction of molecular functions and medical diagnosis. Neural networks and Bayesian methods have performed well on the protein classification problem, achieving accuracy ranging from 90% to 98% while running relatively slowly in the learning stage. In this thesis, we present a principal component null space analysis (PCNSA) linear classifier to the problem and report excellent results compared to those of neural networks and support vector machines. The two main parameters of PCNSA are linked to the high dimensionality of the dataset used, and were optimized in an exhaustive manner to maximize accuracy. Paper copy at Leddy Library: Theses & Major Papers - Basement, West Bldg. / Call Number: Thesis2005 .F74. Source: Masters Abstracts International, Volume: 44-03, page: 1400. Thesis (M.Sc.)--University of Windsor (Canada), 2005

    Kernel methods in genomics and computational biology

    Full text link
    Support vector machines and kernel methods are increasingly popular in genomics and computational biology, due to their good performance in real-world applications and strong modularity that makes them suitable to a wide range of problems, from the classification of tumors to the automatic annotation of proteins. Their ability to work in high dimension, to process non-vectorial data, and the natural framework they provide to integrate heterogeneous data are particularly relevant to various problems arising in computational biology. In this chapter we survey some of the most prominent applications published so far, highlighting the particular developments in kernel methods triggered by problems in biology, and mention a few promising research directions likely to expand in the future

    Jeeva: Enterprise Grid-enabled Web Portal for Protein Secondary Structure Prediction

    Get PDF
    This paper presents a Grid portal for protein secondary structure prediction developed by using services of Aneka, a .NET-based enterprise Grid technology. The portal is used by research scientists to discover new prediction structures in a parallel manner. An SVM (Support Vector Machine)-based prediction algorithm is used with 64 sample protein sequences as a case study to demonstrate the potential of enterprise Grids.Comment: 7 page

    Automated Protein Structure Classification: A Survey

    Full text link
    Classification of proteins based on their structure provides a valuable resource for studying protein structure, function and evolutionary relationships. With the rapidly increasing number of known protein structures, manual and semi-automatic classification is becoming ever more difficult and prohibitively slow. Therefore, there is a growing need for automated, accurate and efficient classification methods to generate classification databases or increase the speed and accuracy of semi-automatic techniques. Recognizing this need, several automated classification methods have been developed. In this survey, we overview recent developments in this area. We classify different methods based on their characteristics and compare their methodology, accuracy and efficiency. We then present a few open problems and explain future directions.Comment: 14 pages, Technical Report CSRG-589, University of Toront

    Bounded Coordinate-Descent for Biological Sequence Classification in High Dimensional Predictor Space

    Full text link
    We present a framework for discriminative sequence classification where the learner works directly in the high dimensional predictor space of all subsequences in the training set. This is possible by employing a new coordinate-descent algorithm coupled with bounding the magnitude of the gradient for selecting discriminative subsequences fast. We characterize the loss functions for which our generic learning algorithm can be applied and present concrete implementations for logistic regression (binomial log-likelihood loss) and support vector machines (squared hinge loss). Application of our algorithm to protein remote homology detection and remote fold recognition results in performance comparable to that of state-of-the-art methods (e.g., kernel support vector machines). Unlike state-of-the-art classifiers, the resulting classification models are simply lists of weighted discriminative subsequences and can thus be interpreted and related to the biological problem

    Protein structure search and local structure characterization

    Get PDF
    <p>Abstract</p> <p>Background</p> <p>Structural similarities among proteins can provide valuable insight into their functional mechanisms and relationships. As the number of available three-dimensional (3D) protein structures increases, a greater variety of studies can be conducted with increasing efficiency, among which is the design of protein structural alphabets. Structural alphabets allow us to characterize local structures of proteins and describe the global folding structure of a protein using a one-dimensional (1D) sequence. Thus, 1D sequences can be used to identify structural similarities among proteins using standard sequence alignment tools such as BLAST or FASTA.</p> <p>Results</p> <p>We used self-organizing maps in combination with a minimum spanning tree algorithm to determine the optimum size of a structural alphabet and applied the k-means algorithm to group protein fragnts into clusters. The centroids of these clusters defined the structural alphabet. We also developed a flexible matrix training system to build a substitution matrix (TRISUM-169) for our alphabet. Based on FASTA and using TRISUM-169 as the substitution matrix, we developed the SA-FAST alignment tool. We compared the performance of SA-FAST with that of various search tools in database-scale search tasks and found that SA-FAST was highly competitive in all tests conducted. Further, we evaluated the performance of our structural alphabet in recognizing specific structural domains of EGF and EGF-like proteins. Our method successfully recovered more EGF sub-domains using our structural alphabet than when using other structural alphabets. SA-FAST can be found at <url>http://140.113.166.178/safast/</url>.</p> <p>Conclusion</p> <p>The goal of this project was two-fold. First, we wanted to introduce a modular design pipeline to those who have been working with structural alphabets. Secondly, we wanted to open the door to researchers who have done substantial work in biological sequences but have yet to enter the field of protein structure research. Our experiments showed that by transforming the structural representations from 3D to 1D, several 1D-based tools can be applied to structural analysis, including similarity searches and structural motif finding.</p
    • …
    corecore