
    Peptide vocabulary analysis reveals ultra-conservation and homonymity in protein sequences

    A new algorithm is presented for vocabulary analysis (word detection) in texts of human origin. It performs at 60%–70% overall accuracy and greater than 80% accuracy for longer words, and approximately 85% sensitivity on Alice in Wonderland, a considerable improvement on previous methods. When applied to protein sequences, it detects short sequences analogous to words in human texts, i.e. intolerant to changes in spelling (mutation) and relatively context-independent in their meaning (function). Some of these are homonyms of up to 7 amino acids, which can assume different structures in different proteins. Others are ultra-conserved stretches of up to 18 amino acids within proteins of less than 40% overall identity, reflecting extreme constraint or convergent evolution. Different species are found to have qualitatively different major peptide vocabularies, e.g. some are dominated by large gene families, while others are rich in simple repeats or dominated by internally repetitive proteins. This suggests the possibility of a peptide vocabulary signature, analogous to genome signatures in DNA. Homonyms may be useful in detecting convergent evolution and positive selection in protein evolution. Ultra-conserved words may be useful in identifying structures intolerant to substitution over long periods of evolutionary time.

    Evolution of foot-and-mouth disease virus intra-sample sequence diversity during serial transmission in bovine hosts

    RNA virus populations within samples are highly heterogeneous, containing a large number of minority sequence variants which can potentially be transmitted to other susceptible hosts. Consequently, consensus genome sequences provide an incomplete picture of the within- and between-host viral evolutionary dynamics during transmission. Foot-and-mouth disease virus (FMDV) is an RNA virus that can spread from primary sites of replication, via the systemic circulation, to found distinct sites of local infection at epithelial surfaces. Viral evolution in these different tissues occurs independently, each of them potentially providing a source of virus to seed subsequent transmission events. This study employed the Illumina Genome Analyzer platform to sequence 18 FMDV samples collected from a chain of sequentially infected cattle. These data generated snapshots of the evolving viral population structures within different animals and tissues. Analyses of the mutation spectra revealed polymorphisms at frequencies >0.5% at between 21 and 146 sites across the genome for these samples, while 13 sites acquired mutations in excess of consensus frequency (50%). Analysis of polymorphism frequency revealed that a number of minority variants were transmitted during host-to-host infection events, while the size of the intra-host founder populations appeared to be smaller. These data indicate that viral population complexity is influenced by small intra-host bottlenecks and relatively large inter-host bottlenecks. The dynamics of minority variants are consistent with the actions of genetic drift rather than strong selection. These results provide novel insights into the evolution of FMDV that can be applied to reconstruct both intra- and inter-host transmission routes.
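
The ">0.5%" polymorphism criterion mentioned in the abstract above can be illustrated with a minimal sketch. It assumes per-site base counts are already available as dictionaries; the function name `polymorphic_sites` and the input layout are hypothetical and are not taken from the study, which would also have had to account for sequencing error.

```python
def polymorphic_sites(site_counts, min_freq=0.005):
    """Return 0-based positions whose strongest minority base exceeds min_freq.

    site_counts: list of dicts mapping base -> read count at that site
    (a hypothetical input layout chosen for illustration).
    """
    hits = []
    for pos, counts in enumerate(site_counts):
        depth = sum(counts.values())
        if depth == 0:
            continue  # no coverage at this site, nothing to call
        major = max(counts, key=counts.get)
        # Highest count among bases other than the majority base.
        minority = max((c for b, c in counts.items() if b != major), default=0)
        if minority / depth > min_freq:
            hits.append(pos)
    return hits


example = [
    {"A": 1000},           # monomorphic
    {"A": 990, "G": 10},   # 1.0% minority variant
    {"C": 998, "T": 2},    # 0.2% minority variant, below threshold
]
print(polymorphic_sites(example))
```

Counting the returned positions per sample gives a figure comparable in kind to the per-sample site counts quoted in the abstract, under the stated assumptions.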

    From in vitro evolution to protein structure

    At the nanoscale, the machinery of life is mainly composed of macromolecules and macromolecular complexes whose shapes create a network of interconnected mechanisms of biological processes. The relationship between the shape and function of a biological molecule is the foundation of structural biology, which aims to study the structure of a protein or a macromolecular complex to unveil the molecular mechanism through which it exerts its function. What about the reverse: is it possible to deduce a protein's structure by exploiting the function for which it was naturally selected? To this end we developed a method, called CAMELS (Coupling Analysis by Molecular Evolution Library Sequencing), able to obtain the structural features of a protein from an artificial selection based on that protein's function. With CAMELS we tried to reconstruct the TEM-1 beta lactamase fold exclusively by generating and sequencing large libraries of mutational variants. Theoretically, with this method it is possible to reconstruct the structure of a protein regardless of the species of origin or the phylogenetic time of emergence, provided a functional phenotypic selection of the protein is available. CAMELS allows us to obtain protein structures without needing to purify the protein beforehand.

    Codon Bias Patterns of E. coli's Interacting Proteins

    Synonymous codons, i.e., DNA nucleotide triplets coding for the same amino acid, are used differently across the variety of living organisms. The biological meaning of this phenomenon, known as codon usage bias, is still controversial. In order to shed light on this point, we propose a new codon bias index, CompAI, that is based on the competition between cognate and near-cognate tRNAs during translation, without being tuned to the usage bias of highly expressed genes. We perform a genome-wide evaluation of codon bias for E. coli, comparing CompAI with other widely used indices: tAI, CAI, and Nc. We show that CompAI and tAI capture similar information by being positively correlated with gene conservation, measured by ERI, and essentiality, whereas CAI and Nc appear to be less sensitive to evolutionary-functional parameters. Notably, the rate of variation of tAI and CompAI with ERI allows one to obtain sets of genes that consistently belong to specific clusters of orthologous genes (COGs). We also investigate the correlation of codon bias at the genomic level with the network features of protein-protein interactions in E. coli. We find that the most densely connected communities of the network share a similar level of codon bias (as measured by CompAI and tAI). Conversely, a small difference in codon bias between two genes is, statistically, a prerequisite for the corresponding proteins to interact. Importantly, among all codon bias indices, CompAI turns out to have the most coherent distribution over the communities of the interactome, pointing to the significance of competition among cognate and near-cognate tRNAs for explaining codon usage adaptation.

    Statistical Complexity Analysis of Turing Machine tapes with Fixed Algorithmic Complexity Using the Best-Order Markov Model

    Sources that generate symbolic sequences with algorithmic nature may differ in statistical complexity because they create structures that follow algorithmic schemes, rather than generating symbols from a probabilistic function assuming independence. In the case of Turing machines, this means that machines with the same algorithmic complexity can create tapes with different statistical complexity. In this paper, we use a compression-based approach to measure global and local statistical complexity of specific Turing machine tapes with the same number of states and alphabet. Both measures are estimated using the best-order Markov model. For the global measure, we use the Normalized Compression (NC), while, for the local measures, we define and use normal and dynamic complexity profiles to quantify and localize lower and higher regions of statistical complexity. We assessed the validity of our methodology on synthetic and real genomic data, showing that it is tolerant to increasing rates of editions and block permutations. Regarding the analysis of the tapes, we localize patterns of higher statistical complexity in two regions, for a different number of machine states. We show that these patterns are generated by a decrease of the tape's amplitude, given the setting of small rule cycles. Additionally, we performed a comparison with a measure that uses both algorithmic and statistical approaches (BDM) for analysis of the tapes. Naturally, BDM is efficient given the algorithmic nature of the tapes. However, for a higher number of states, BDM is progressively approximated by our methodology. Finally, we provide a simple algorithm to increase the statistical complexity of a Turing machine tape while retaining the same algorithmic complexity. We supply a publicly available implementation of the algorithm in C++ under the GPLv3 license. All results can be reproduced in full with scripts provided at the repository.
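
As a rough illustration of the global measure named in the abstract above, the following sketch estimates a normalized compression value with an adaptive fixed-order Markov model and then picks the order that compresses best. It assumes the common form NC = bits / (n · log2 |alphabet|) and Laplace-style smoothing; the function names, the smoothing parameter `alpha`, and the order search range are illustrative assumptions, not the authors' C++ implementation.

```python
import math
from collections import defaultdict


def normalized_compression(seq, order, alphabet, alpha=1.0):
    """Estimate NC of seq with an adaptive order-`order` Markov model.

    Each symbol is charged -log2 of its smoothed probability given the
    preceding `order` symbols, using counts seen so far; the total is
    divided by the cost of a uniform code, n * log2(|alphabet|).
    """
    counts = defaultdict(lambda: defaultdict(int))
    size = len(alphabet)
    bits = 0.0
    for i, sym in enumerate(seq):
        ctx = seq[max(0, i - order):i]          # shorter context near the start
        ctx_counts = counts[ctx]
        total = sum(ctx_counts.values())
        p = (ctx_counts[sym] + alpha) / (total + alpha * size)
        bits += -math.log2(p)
        ctx_counts[sym] += 1                    # update the model after coding
    return bits / (len(seq) * math.log2(size))


def best_order_nc(seq, alphabet, max_order=8):
    """Return (nc, order) for the order in 0..max_order with the lowest NC."""
    return min((normalized_compression(seq, k, alphabet), k)
               for k in range(max_order + 1))


tape = "01" * 200                               # a highly regular toy "tape"
nc, k = best_order_nc(tape, "01")
print(f"best order = {k}, NC = {nc:.3f}")
```

A value near 1 indicates a sequence the model cannot compress, while a value near 0 indicates strong statistical regularity; the toy tape here is periodic, so a low-order model already captures it.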