Selection of sequence motifs and generative Hopfield-Potts models for protein families

Author: Shimagaki Kai
Weigt Martin
Publication venue: 'American Physical Society (APS)'
Publication date: 01/01/2019

Statistical models for families of evolutionarily related proteins have recently gained interest: in particular, pairwise Potts models, such as those inferred by Direct-Coupling Analysis, have been able to extract information about the three-dimensional structure of folded proteins and about the effect of amino-acid substitutions in proteins. These models are typically required to reproduce the one- and two-point statistics of the amino-acid usage in a protein family, i.e. to capture the so-called residue conservation and covariation statistics of proteins of common evolutionary origin. Pairwise Potts models are the maximum-entropy models achieving this. While successful, these models depend on huge numbers of ad hoc parameters, which have to be estimated from a finite amount of data and whose biophysical interpretation remains unclear. Here we propose an approach to parameter reduction, which is based on selecting collective sequence motifs. It naturally leads to the formulation of statistical sequence models in terms of Hopfield-Potts models. These models can be accurately inferred using a mapping to restricted Boltzmann machines and persistent contrastive divergence. We show that, when applied to protein data, even 20-40 patterns are sufficient to obtain statistically close-to-generative models. The Hopfield patterns form interpretable sequence motifs and may be used to cluster amino-acid sequences into functional sub-families. However, the distributed collective nature of these motifs intrinsically limits the ability of Hopfield-Potts models to predict contact maps, showing the necessity of developing models going beyond the Hopfield-Potts models discussed here.

Comment: 26 pages, 16 figures, to app. in PR
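To make the "one- and two-point statistics" mentioned in the abstract concrete, here is a minimal sketch of how single-site and pairwise amino-acid frequencies could be counted from a multiple sequence alignment. The function names and the four-sequence toy alignment are hypothetical illustrations, not taken from the paper, and the sketch omits the sequence reweighting and pseudocounts that practical pipelines usually apply.

```python
from collections import Counter
from itertools import combinations


def site_frequencies(alignment):
    """One-point statistics: frequency of each symbol in each alignment column."""
    n_seqs = len(alignment)
    length = len(alignment[0])
    freqs = []
    for i in range(length):
        counts = Counter(seq[i] for seq in alignment)
        freqs.append({a: c / n_seqs for a, c in counts.items()})
    return freqs


def pair_frequencies(alignment):
    """Two-point statistics: joint frequency of symbol pairs for columns i < j."""
    n_seqs = len(alignment)
    length = len(alignment[0])
    pairs = {}
    for i, j in combinations(range(length), 2):
        counts = Counter((seq[i], seq[j]) for seq in alignment)
        pairs[(i, j)] = {ab: c / n_seqs for ab, c in counts.items()}
    return pairs


def connected_correlation(f1, f2, i, j, a, b):
    """Covariation: joint frequency minus the product of single-site frequencies."""
    return f2[(i, j)].get((a, b), 0.0) - f1[i].get(a, 0.0) * f1[j].get(b, 0.0)


# Hypothetical toy alignment (4 sequences, 3 columns), for illustration only.
toy_alignment = ["ACD", "ACE", "GCD", "GKE"]
f1 = site_frequencies(toy_alignment)
f2 = pair_frequencies(toy_alignment)
c_01 = connected_correlation(f1, f2, 0, 1, "A", "C")
```

A maximum-entropy pairwise model is one whose own one- and two-point marginals match these empirical frequencies; the inference step itself is not shown here.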

arXiv.org e-Print Archive

HAL Descartes

HAL-INSU

Hal-Diderot

Segmenting DNA sequence into words based on statistical language model

Author: Wang Liang
Publication date: 26/02/2012

This paper presents a novel method to segment/decode DNA sequences based on an n-gram statistical language model. First, by analyzing the genomes of 12 model species, we find that most DNA “words” are 12 to 15 bp long. The bound on the language entropy of DNA sequences is about 1.5674 bits. After building an n-gram biological language model, we design an unsupervised ‘probability approach to word segmentation’ method to segment the DNA sequences. A benchmark for the segmentation method is also proposed. In a cross-segmentation test, we find that different genomes may use similar languages but belong to different branches, much like English and French/Latin. Finally, we present some possible applications of this method.
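As a rough illustration of probability-based word segmentation, the sketch below picks the split of a sequence that maximizes the product of word probabilities, using dynamic programming. It assumes the simplest case (a unigram model, i.e. n = 1), and the vocabulary, its probabilities, the `unknown_prob` penalty and the function name are all hypothetical; this is not the paper's actual model or data.

```python
import math


def segment(sequence, word_probs, max_len=15, unknown_prob=1e-12):
    """Return the most probable split of `sequence` under a unigram word model.

    Words missing from `word_probs` are penalized with `unknown_prob` per base
    (an assumption of this sketch, so that every sequence has some segmentation).
    """
    n = len(sequence)
    best = [-math.inf] * (n + 1)  # best[k]: best log-probability of sequence[:k]
    back = [0] * (n + 1)          # back[k]: start index of the last word in that split
    best[0] = 0.0
    for end in range(1, n + 1):
        for start in range(max(0, end - max_len), end):
            word = sequence[start:end]
            if word in word_probs:
                log_p = math.log(word_probs[word])
            else:
                log_p = len(word) * math.log(unknown_prob)
            score = best[start] + log_p
            if score > best[end]:
                best[end] = score
                back[end] = start
    # Walk the back-pointers from the end to recover the words.
    words = []
    end = n
    while end > 0:
        start = back[end]
        words.append(sequence[start:end])
        end = start
    return list(reversed(words))


# Hypothetical toy vocabulary; real "words" in the abstract are 12 to 15 bp long.
toy_probs = {"ATG": 0.4, "CGT": 0.3, "AT": 0.1, "GCGT": 0.1, "A": 0.05, "TGC": 0.05}
toy_words = segment("ATGCGT", toy_probs)
```

The `max_len=15` default mirrors the upper word length quoted in the abstract. A higher-order n-gram model would score each word conditioned on the preceding words instead of independently.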

Nature Precedings

Inverse Statistical Physics of Protein Sequences: A Key Issues Review

Author: Cocco Simona
Feinauer Christoph
Figliuzzi Matteo
Monasson Remi
Weigt Martin
Publication venue: 'IOP Publishing'
Publication date: 03/03/2017

In the course of evolution, proteins undergo important changes in their amino acid sequences, while their three-dimensional folded structure and their biological function remain remarkably conserved. Thanks to modern sequencing techniques, sequence data accumulate at an unprecedented pace. This provides large sets of so-called homologous, i.e. evolutionarily related, protein sequences, to which methods of inverse statistical physics can be applied. Using sequence data as the basis for the inference of Boltzmann distributions from samples of microscopic configurations or observables, it is possible to extract information about evolutionary constraints and thus protein function and structure. Here we give an overview of some biologically important questions, and of how statistical-mechanics-inspired modeling approaches can help to answer them. Finally, we discuss some open questions, which we expect to be addressed over the next years.

Comment: 18 pages, 7 figure

arXiv.org e-Print Archive

Archivio istituzionale della Ricerca - Bocconi

Crossref

HAL-Inserm

HAL-INSU

Maximum entropy models capture melodic styles

Author: Loreto Vittorio
Pachet François
Sakellariou Jason
Tria Francesca
Publication date: 11/10/2016

We introduce a Maximum Entropy model able to capture the statistics of melodies in music. The model can be used to generate new melodies that emulate the style of the musical corpus which was used to train it. Instead of using the n-body interactions of (n-1)-order Markov models, traditionally used in automatic music generation, we use a k-nearest neighbour model with pairwise interactions only. In that way, we keep the number of parameters low and avoid over-fitting problems typical of Markov models. We show that long-range musical phrases don't need to be explicitly enforced using high-order Markov interactions, but can instead emerge from multiple, competing, pairwise interactions. We validate our Maximum Entropy model by contrasting how much the generated sequences capture the style of the original corpus without plagiarizing it. To this end we use a data-compression approach to discriminate the levels of borrowing and innovation featured by the artificial sequences. The results show that our modelling scheme outperforms both fixed-order and variable-order Markov models. This shows that, despite being based only on pairwise interactions, this Maximum Entropy scheme opens the possibility to generate musically sensible alterations of the original phrases, providing a way to generate innovation.

arXiv.org e-Print Archive

Archivio della ricerca- Università di Roma La Sapienza

Simultaneous identification of specifically interacting paralogs and inter-protein contacts by Direct-Coupling Analysis

Author: Baldassi Carlo
Gueudré Thomas
Pagnani Andrea
Weigt Martin
Zamparo Marco
Publication venue: 'Proceedings of the National Academy of Sciences'
Publication date: 01/01/2016

Understanding protein-protein interactions is central to our understanding of almost all complex biological processes. Computational tools exploiting rapidly growing genomic databases to characterize protein-protein interactions are urgently needed. Such methods should connect multiple scales, from evolutionarily conserved interactions between families of homologous proteins, over the identification of specifically interacting proteins in the case of multiple paralogs inside a species, down to the prediction of residues being in physical contact across interaction interfaces. Statistical inference methods detecting residue-residue coevolution have recently triggered considerable progress in using sequence data for quaternary protein structure prediction; they require, however, large joint alignments of homologous protein pairs known to interact. The generation of such alignments is a complex computational task on its own; application of coevolutionary modeling has in turn been restricted to proteins without paralogs, or to bacterial systems with the corresponding coding genes being co-localized in operons. Here we show that the Direct-Coupling Analysis of residue coevolution can be extended to connect the different scales, and simultaneously to match interacting paralogs, to identify inter-protein residue-residue contacts and to discriminate interacting from noninteracting families in a multiprotein system. Our results extend the potential applications of coevolutionary analysis far beyond cases treatable so far.

Comment: Main Text 19 pages Supp. Inf. 16 page

arXiv.org e-Print Archive

Archivio istituzionale della Ricerca - Bocconi

Crossref

PubMed Central

Archivio istituzionale della ricerca - Università di Bari

PORTO@iris (Publications Open Repository TOrino - Politecnico di Torino)

Archivio Istituzionale della Ricerca- Università del Piemonte Orientale

PORTO Publications Open Repository TOrino

Exploring Cognitive States: Methods for Detecting Physiological Temporal Fingerprints

Author: Adams Stephen
Harrivel Angela R.
Kennedy Kellie D.
Napoli Nicholas J.
Paliwal Mudit
Scherer William T.
Stephens Chad L.

Cognitive state detection and its relationship to observable physiological telemetry have been utilized for many human-machine and human-cybernetic applications. This paper aims at understanding whether there are unique psychophysiological patterns over time, a physiological temporal fingerprint, that are associated with specific cognitive states. This preliminary work involves commercial airline pilots completing experimental benchmark task inductions of three cognitive states: 1) Channelized Attention (CA); 2) High Workload (HW); and 3) Low Workload (LW). We approach this objective by modeling these "fingerprints" through the use of Hidden Markov Models and Entropy analysis to evaluate whether the transitions over time are complex or rhythmic/predictable by nature. Our results indicate that cognitive states do have a unique complexity of physiological sequences that is statistically different from that of other cognitive states. More specifically, CA has a significantly higher temporal psychophysiological complexity than HW and LW in EEG and ECG telemetry signals. With regard to respiration telemetry, CA has a lower temporal psychophysiological complexity than HW and LW. Through our preliminary work, addressing this unique underpinning can inform whether these underlying dynamics can be utilized to understand how humans transition between cognitive states and to improve detection of cognitive states.
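One simple way to quantify whether transitions in a symbol sequence are "complex or rhythmic/predictable" is the conditional entropy of the next symbol given the current one. The sketch below is a generic illustration under that assumption, applied to an already discretized signal; the function name and toy sequences are hypothetical, and this is not the Hidden Markov Model and entropy pipeline used in the paper.

```python
import math
from collections import Counter


def transition_entropy(symbols):
    """Empirical conditional entropy H(next | current) in bits.

    A value near 0 means each symbol almost always leads to the same successor
    (rhythmic/predictable); larger values mean more varied transitions.
    """
    pairs = Counter(zip(symbols, symbols[1:]))  # counts of (current, next)
    current = Counter(symbols[:-1])             # counts of each "current" symbol
    total = sum(pairs.values())
    entropy = 0.0
    for (a, _b), count in pairs.items():
        p_pair = count / total                  # joint probability of the transition
        p_next_given_a = count / current[a]     # conditional probability of the successor
        entropy -= p_pair * math.log2(p_next_given_a)
    return entropy


# Hypothetical discretized signals: strictly alternating vs. irregular.
h_rhythmic = transition_entropy("ABABABAB")
h_irregular = transition_entropy("AABABBAB")
```

With these toy inputs the strictly alternating sequence has zero transition entropy, while the irregular one has a positive value.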

NASA Technical Reports Server