Modeling aspects of the language of life through transfer-learning protein sequences
Background
Predicting protein function and structure from sequence is one important challenge for computational biology. For 26 years, most state-of-the-art approaches combined machine learning and evolutionary information. However, for some applications retrieving related proteins is becoming too time-consuming. Additionally, evolutionary information is less powerful for small families, e.g. for proteins from the Dark Proteome. Both these problems are addressed by the new methodology introduced here.
Results
We introduced a novel way to represent protein sequences as continuous vectors (embeddings) by using the language model ELMo taken from natural language processing. By modeling protein sequences, ELMo effectively captured the biophysical properties of the language of life from unlabeled big data (UniRef50). We refer to these new embeddings as SeqVec (Sequence-to-Vector) and demonstrate their effectiveness by training simple neural networks for two different tasks. At the per-residue level, secondary structure (Q3 = 79% ± 1, Q8 = 68% ± 1) and regions with intrinsic disorder (MCC = 0.59 ± 0.03) were predicted significantly better than through one-hot encoding or through Word2vec-like approaches. At the per-protein level, subcellular localization was predicted in ten classes (Q10 = 68% ± 1) and membrane-bound proteins were distinguished from water-soluble proteins (Q2 = 87% ± 1). Although SeqVec embeddings generated the best predictions from single sequences, no solution improved over the best existing method using evolutionary information. Nevertheless, our approach improved over some popular methods using evolutionary information, and for some proteins even beat the best. Thus, SeqVec embeddings prove to condense the underlying principles of protein sequences. Overall, the important novelty is speed: where the lightning-fast HHblits needed on average about two minutes to generate the evolutionary information for a target protein, SeqVec created embeddings on average in 0.03 s. As this speed-up is independent of the size of growing sequence databases, SeqVec provides a highly scalable approach for the analysis of big data in proteomics, e.g. microbiome or metaproteome analysis.
Conclusion
Transfer-learning succeeded in extracting information from unlabeled sequence databases relevant for various protein prediction tasks. SeqVec modeled the language of life, namely the principles underlying protein sequences, better than any features suggested by textbooks and prediction methods. The exception is evolutionary information; however, that information is not available at the level of a single sequence.
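To make the two-stage setup the abstract describes concrete, here is a minimal sketch in PyTorch of the supervised stage: a small per-residue classifier trained on top of precomputed, ELMo-style embeddings. The 1024-dimensional embedding size, the layer sizes, and all names are illustrative assumptions, not the paper's exact architecture.

```python
# Minimal sketch (PyTorch), assuming per-residue embeddings have already
# been computed by an ELMo-style model such as SeqVec. Dimensions, layer
# sizes, and names are illustrative, not the paper's exact architecture.
import torch
import torch.nn as nn

EMB_DIM = 1024   # assumed size of one residue's embedding vector
N_CLASSES = 3    # Q3 secondary structure: helix, strand, other

class PerResidueClassifier(nn.Module):
    """Maps each residue's embedding to a secondary-structure class."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(EMB_DIM, 32),
            nn.ReLU(),
            nn.Linear(32, N_CLASSES),
        )

    def forward(self, emb):      # emb: (seq_len, EMB_DIM)
        return self.net(emb)     # logits: (seq_len, N_CLASSES)

# Toy training step with random tensors standing in for real embeddings.
model = PerResidueClassifier()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
emb = torch.randn(120, EMB_DIM)               # one 120-residue protein
labels = torch.randint(0, N_CLASSES, (120,))  # per-residue class labels
loss = nn.functional.cross_entropy(model(emb), labels)
loss.backward()
optimizer.step()
```

The design point is that the heavy lifting happens once, in the unsupervised language model; the task-specific network on top can stay small and fast, which is what makes the reported 0.03 s per protein plausible.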
Novel machine learning approaches revolutionize protein knowledge
Breakthrough methods in machine learning (ML), protein structure prediction, and novel ultrafast structural aligners are revolutionizing structural biology. Obtaining accurate models of proteins and annotating their functions on a large scale is no longer limited by time and resources. The most recent method to be top ranked by the Critical Assessment of Structure Prediction (CASP) assessment, AlphaFold 2 (AF2), is capable of building structural models with an accuracy comparable to that of experimental structures. Annotations of 3D models are keeping pace with the deposition of the structures due to advancements in protein language models (pLMs) and structural aligners that help validate these transferred annotations. In this review we describe how recent developments in ML for protein science are making large-scale structural bioinformatics available to the general scientific community.
The Biosemiotic Approach in Biology: Theoretical Bases and Applied Models
Biosemiotics is a growing field that investigates semiotic processes in the living realm in an attempt to combine the findings of the biological sciences and semiotics. Semiotic processes are more or less what biologists have typically referred to as "signals," "codes," and "information processing" in biosystems, but these processes are here understood under the more general notion of semiosis, that is, the production, action, and interpretation of signs. Thus, biosemiotics can be seen as biology interpreted as a study of living sign systems, which also means that semiosis or sign process can be seen as the very nature of life itself. In other words, biosemiotics is a field of research investigating semiotic processes (meaning, signification, communication, and habit formation in living systems) and the physicochemical preconditions for sign action and interpretation.
PEvoLM: Protein Sequence Evolutionary Information Language Model
With the exponential increase of protein sequence databases over time, multiple-sequence alignment (MSA) methods, like PSI-BLAST, perform exhaustive and time-consuming database searches to retrieve evolutionary information. The resulting position-specific scoring matrices (PSSMs) of such search engines represent a crucial input to many machine learning (ML) models in the field of bioinformatics and computational biology. A protein sequence is a collection of contiguous tokens or characters called amino acids (AAs). The analogy to natural language allowed us to exploit the recent advancements in the field of Natural Language Processing (NLP) and therefore transfer NLP state-of-the-art algorithms to bioinformatics. This research presents an Embeddings from Language Models (ELMo) approach that converts a protein sequence into a numerical vector representation. While the original ELMo trained a two-layer bidirectional Long Short-Term Memory (LSTM) network with a two-path architecture, one path for the forward and one for the backward pass, this work merges the idea of PSSMs with the concept of transfer-learning and introduces a novel bidirectional language model (bi-LM) with four times fewer free parameters that uses a single path for both passes. The model was trained in a multi-task setting, not only to predict the next AA but also the probability distribution of the next AA derived from similar yet different sequences, as summarized in a PSSM, hence also learning the evolutionary information of protein sequences. The network architecture and the pre-trained model are made available as open source under the permissive MIT license on GitHub at https://github.com/issararab/PEvoLM.
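The multi-task objective described above lends itself to a short sketch: one term scores the model's output against the observed next amino acid with cross-entropy, while a second term pushes the same output distribution toward the PSSM-derived distribution via KL divergence. This is a hedged illustration in PyTorch; the shared-output simplification, the blend weight `alpha`, and all names are assumptions, not PEvoLM's actual implementation.

```python
# Illustrative sketch (PyTorch) of a two-term multi-task loss, assuming
# one softmax over the 20 amino acids is scored against both targets.
# The blend weight and all names are assumptions, not PEvoLM's code.
import torch
import torch.nn.functional as F

N_AA = 20  # standard amino-acid alphabet

def multitask_loss(logits, next_aa, pssm_probs, alpha=0.5):
    """logits:     (batch, N_AA) model output for the next position
    next_aa:    (batch,)      index of the observed next amino acid
    pssm_probs: (batch, N_AA) next-position distribution from the PSSM"""
    ce = F.cross_entropy(logits, next_aa)            # next-AA task
    kl = F.kl_div(F.log_softmax(logits, dim=-1),     # PSSM-matching task
                  pssm_probs, reduction="batchmean")
    return alpha * ce + (1.0 - alpha) * kl

# Toy usage with random tensors in place of real model outputs and PSSMs.
logits = torch.randn(8, N_AA)
next_aa = torch.randint(0, N_AA, (8,))
pssm_probs = torch.softmax(torch.randn(8, N_AA), dim=-1)
print(multitask_loss(logits, next_aa, pssm_probs))
```

Training against the PSSM distribution rather than only the single observed residue is what lets the model absorb evolutionary information without running an MSA at inference time.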
Linguistically inspired roadmap for building biologically reliable protein language models
Deep neural-network-based language models (LMs) are increasingly applied to large-scale protein sequence data to predict protein function. However, being largely black-box models and thus challenging to interpret, current protein LM approaches do not contribute to a fundamental understanding of sequence-function mappings, hindering rule-based biotherapeutic drug development. We argue that guidance drawn from linguistics, a field specialized in analytical rule extraction from natural language data, can aid with building more interpretable protein LMs that are more likely to learn relevant domain-specific rules. Differences between protein sequence data and linguistic sequence data require the integration of more domain-specific knowledge in protein LMs compared to natural language LMs. Here, we provide a linguistics-based roadmap for protein LM pipeline choices with regard to training data, tokenization, token embedding, sequence embedding, and model interpretation. Incorporating linguistic ideas into protein LMs enables the development of next-generation interpretable machine-learning models with the potential of uncovering the biological mechanisms underlying sequence-function relationships.
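As a small illustration of one roadmap choice, tokenization, the snippet below contrasts single-residue tokens with overlapping k-mer tokens; both are common vocabulary choices for protein LMs. The code is a self-contained sketch, not something prescribed by the review.

```python
# Sketch of two tokenization schemes for protein sequences: one token
# per residue versus overlapping k-mers. Purely illustrative.
def residue_tokens(seq: str) -> list[str]:
    """Character-level tokens: one token per amino acid."""
    return list(seq)

def kmer_tokens(seq: str, k: int = 3) -> list[str]:
    """Overlapping k-mer tokens, an alternative, larger vocabulary."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]

seq = "MKTAYIAKQR"
print(residue_tokens(seq))  # ['M', 'K', 'T', 'A', ...]
print(kmer_tokens(seq))     # ['MKT', 'KTA', 'TAY', ...]
```

The trade-off the roadmap weighs is between a small, unambiguous alphabet (single residues) and larger tokens that can capture local context at the cost of a bigger, sparser vocabulary.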