    Seven clusters in genomic triplet distributions

    Motivation: Several recent papers have proposed algorithms for detecting coding regions without requiring a learning dataset of already known genes. In this paper we study the cluster structure of several genomes in the space of codon usage. This allows us to interpret some of the results obtained in other studies and to propose a simpler method that is nevertheless fully functional. Results: Several complete genomic sequences were analyzed by visualizing tables of triplet counts in a sliding window. The distribution of 64-dimensional vectors of triplet frequencies displays a well-detectable cluster structure. The structure was found to consist of seven clusters: six corresponding to protein-coding information in each of the three possible phases on each of the two complementary strands, and one corresponding to the non-coding regions. Awareness of this structure allows the development of methods for segmenting sequences into regions with the same coding phase and non-coding regions. Such a method may be completely unsupervised or may use some external information. Since the method does not need extraction of ORFs, it can be applied even to unassembled genomes. Accuracy calculated at the base-pair level (both sensitivity and specificity) exceeds 90%. This is no worse than methods such as HMMs, while the method has the advantage of being much simpler and clearer.
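
    The sliding-window step described in this abstract can be sketched roughly as follows. This is an illustrative Python sketch, not the authors' implementation: the function names, the window width of 300, the step of 12, and the choice to skip triplets containing ambiguous bases are all assumptions made here for illustration.

```python
from itertools import product

ALPHABET = "ACGT"
# All 64 triplets in a fixed order, so every window maps to a comparable vector.
TRIPLETS = ["".join(t) for t in product(ALPHABET, repeat=3)]

def triplet_frequencies(window):
    """Return a 64-dimensional frequency vector of non-overlapping triplets,
    read from the first position of the window."""
    counts = dict.fromkeys(TRIPLETS, 0)
    for i in range(0, len(window) - 2, 3):
        codon = window[i:i + 3]
        if codon in counts:  # assumption: skip triplets with ambiguous bases such as N
            counts[codon] += 1
    total = sum(counts.values())
    return [counts[t] / total if total else 0.0 for t in TRIPLETS]

def sliding_window_vectors(sequence, width=300, step=12):
    """Yield (start, vector) for each window position along the sequence."""
    sequence = sequence.upper()
    for start in range(0, len(sequence) - width + 1, step):
        yield start, triplet_frequencies(sequence[start:start + width])

if __name__ == "__main__":
    toy = "ATGAAATTT" * 10  # toy sequence, not real genomic data
    for start, vec in sliding_window_vectors(toy, width=9, step=9):
        print(start, vec[TRIPLETS.index("ATG")])
```

    Each vector produced this way is one point in the 64-dimensional space in which the abstract reports the seven clusters.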

    Four basic symmetry types in the universal 7-cluster structure of 143 complete bacterial genomic sequences

    Coding information is the main source of heterogeneity (non-randomness) in the sequences of bacterial genomes. This information can be naturally modeled by analysing cluster structures in the "in-phase" triplet distributions of relatively short genomic fragments (200-400 bp). We found a universal 7-cluster structure in all 143 completely sequenced bacterial genomes available in Genbank in August 2004, and explained its properties. The 7-cluster structure is responsible for the main part of sequence heterogeneity in bacterial genomes. In this sense, the 7-cluster structure is the basic model of a bacterial genome sequence. We demonstrate that there are four basic "pure" types of this model observed in nature: "parallel triangles", "perpendicular triangles", the degenerate case and the flower-like type. We show that the codon usage of bacterial genomes is, with high accuracy, a multi-linear function of their genomic G+C content (more precisely, it is described by two similar functions, one for eubacterial genomes and the other for archaea). Animated 3D scatter plots of all 143 cluster structures are collected in a database and made available on our website (http://www.ihes.fr/~zinovyev/7clusters). The finding can be readily introduced into any software for gene prediction, sequence alignment or bacterial genome classification.
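
    One way to see where the count of seven comes from: a fragment can be read in three phases on each of two strands, which gives six readings, and non-coding fragments form the seventh group. The Python sketch below computes those six readings and the G+C content of a fragment. All function names are hypothetical, and the code illustrates the idea only; it is not the procedure used in the paper.

```python
from collections import Counter

COMPLEMENT = str.maketrans("ACGT", "TGCA")

def reverse_complement(fragment):
    """Return the complementary strand, read in its own 5'-to-3' direction."""
    return fragment.translate(COMPLEMENT)[::-1]

def triplet_counts(fragment, phase=0):
    """Count non-overlapping triplets read starting at offset `phase` (0, 1 or 2)."""
    return Counter(fragment[i:i + 3] for i in range(phase, len(fragment) - 2, 3))

def gc_content(fragment):
    """Fraction of bases in a non-empty fragment that are G or C."""
    return sum(base in "GC" for base in fragment) / len(fragment)

def six_readings(fragment):
    """Triplet counts for the three phases on each of the two strands."""
    readings = {}
    for strand, text in (("+", fragment), ("-", reverse_complement(fragment))):
        for phase in range(3):
            readings[(strand, phase)] = triplet_counts(text, phase)
    return readings

if __name__ == "__main__":
    toy = "ATGGCC" * 5  # toy fragment, not real genomic data
    print(gc_content(toy))
    for key, counts in six_readings(toy).items():
        print(key, dict(counts))
```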

    PCA and K-Means decipher genome

    In this paper, we aim to give a tutorial for undergraduate students studying statistical methods and/or bioinformatics. The students will learn how data visualization can help in genomic sequence analysis. Students start with a fragment of the genetic text of a bacterial genome and analyze its structure. By means of principal component analysis they "discover" that the information in the genome is encoded by non-overlapping triplets. Next, they learn how to find gene positions. This exercise on PCA and K-Means clustering enables active study of basic bioinformatics notions. Appendix 1 contains program listings that go along with this exercise. Appendix 2 includes 2D PCA plots of triplet usage in a moving frame for a series of bacterial genomes, from GC-poor to GC-rich ones. Animated 3D PCA plots are attached as separate gif files. The topology (cluster structure) and geometry (mutual positions of clusters) of these plots depend clearly on GC-content. Comment: 18 pages, with program listings for MatLab, PCA analysis of genomes and additional animated 3D PCA plot
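
    The two techniques named in this abstract can be sketched in plain Python as below. The tutorial's own listings are stated to be for MatLab, so this is an independent illustration rather than a port. It swaps in power iteration to estimate only the first principal component, and a basic Lloyd-style K-Means; the function names, seeds and iteration counts are assumptions chosen for the example.

```python
import math
import random

def center(rows):
    """Subtract the column means from every row."""
    n, d = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    return [[r[j] - means[j] for j in range(d)] for r in rows]

def first_principal_component(rows, iterations=200, seed=0):
    """Estimate the leading PCA direction (up to sign) by power iteration."""
    x = center(rows)
    d = len(x[0])
    rng = random.Random(seed)
    v = [rng.gauss(0.0, 1.0) for _ in range(d)]
    for _ in range(iterations):
        # Apply X^T X to v without building the covariance matrix explicitly.
        scores = [sum(a * b for a, b in zip(row, v)) for row in x]
        w = [sum(s * row[j] for s, row in zip(scores, x)) for j in range(d)]
        norm = math.sqrt(sum(c * c for c in w))
        if norm == 0.0:  # all rows identical: no direction of variance
            break
        v = [c / norm for c in w]
    return v

def sq_dist(p, q):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))

def kmeans(points, k, iterations=50, seed=0):
    """Basic Lloyd iteration; returns (labels, centroids)."""
    rng = random.Random(seed)
    centroids = [list(p) for p in rng.sample(points, k)]
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centroid.
        labels = [min(range(k), key=lambda c: sq_dist(p, centroids[c]))
                  for p in points]
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centroids

if __name__ == "__main__":
    line = [[0.0, 0.0], [1.0, 2.0], [2.0, 4.0], [3.0, 6.0]]
    print(first_principal_component(line))
    blobs = [[0, 0], [0, 1], [1, 0], [10, 10], [10, 11], [11, 10]]
    print(kmeans(blobs, k=2)[0])
```

    In the setting of the tutorial, the rows would be triplet-frequency vectors of genome fragments rather than the toy points used here.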

    Maximum entropy models for antibody diversity

    Recognition of pathogens relies on families of proteins showing great diversity. Here we construct maximum entropy models of the sequence repertoire, building on recent experiments that provide a nearly exhaustive sampling of the IgM sequences in zebrafish. These models are based solely on pairwise correlations between residue positions, but correctly capture the higher-order statistical properties of the repertoire. Exploiting the interpretation of these models as statistical physics problems, we make several predictions for the collective properties of the sequence ensemble: the distribution of sequences obeys Zipf's law, the repertoire decomposes into several clusters, and there is a massive restriction of diversity due to the correlations. These predictions are completely inconsistent with models in which amino acid substitutions are made independently at each site, and are in good agreement with the data. Our results suggest that antibody diversity is not limited by the sequences encoded in the genome, and may reflect rapid adaptation to antigenic challenges. This approach should be applicable to the study of the global properties of other protein families.
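
    Zipf's law, mentioned in this abstract, says that the frequency of an item is roughly inversely proportional to its rank, so a log-log plot of frequency against rank has a slope near -1. A simple way to check this on a sample is sketched below in Python. It is a generic least-squares fit on toy data, not the authors' analysis, and the function names are hypothetical.

```python
import math
from collections import Counter

def rank_frequency(items):
    """Return observed frequencies sorted from most to least common."""
    counts = Counter(items)
    total = sum(counts.values())
    return [count / total for _, count in counts.most_common()]

def loglog_slope(frequencies):
    """Least-squares slope of log(frequency) against log(rank).

    Needs at least two distinct ranks; a value near -1 is consistent
    with Zipf's law."""
    xs = [math.log(rank) for rank in range(1, len(frequencies) + 1)]
    ys = [math.log(f) for f in frequencies]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return sxy / sxx

if __name__ == "__main__":
    # Toy sample whose counts are exactly proportional to 1/rank.
    sample = ["a"] * 60 + ["b"] * 30 + ["c"] * 20 + ["d"] * 15 + ["e"] * 12
    print(loglog_slope(rank_frequency(sample)))
```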

    Metabolic and Chaperone Gene Loss Marks the Origin of Animals: Evidence for Hsp104 and Hsp78 Sharing Mitochondrial Clients

    The evolution of animals involved acquisition of an emergent gene repertoire for gastrulation. Whether loss of genes also co-evolved with this developmental reprogramming has not yet been addressed. Here, we identify twenty-four genetic functions that are retained in fungi and choanoflagellates but undetectable in animals. These lost genes encode: (i) sixteen distinct biosynthetic functions; (ii) the two ancestral eukaryotic ClpB disaggregases, Hsp78 and Hsp104, which function in the mitochondria and cytosol, respectively; and (iii) six other assorted functions. We present computational and experimental data that are consistent with a joint function for the differentially localized ClpB disaggregases, and with the possibility of a shared client/chaperone relationship between the mitochondrial Fe/S homoaconitase encoded by the lost LYS4 gene and the two ClpBs. Our analyses lead to the hypothesis that the evolution of gastrulation-based multicellularity in animals led to efficient extraction of nutrients from dietary sources, loss of natural selection for maintenance of energetically expensive biosynthetic pathways, and subsequent loss of their attendant ClpB chaperones. Comment: This is a reformatted version of the recent official publication in PLoS ONE (2015). This version differs substantially from the first three arXiv versions. This version uses a fixed-width font for DNA sequences, as was done in the earlier arXiv versions but which is missing in the official PLoS ONE publication. The title has also been shortened slightly from the official publication