Seven clusters in genomic triplet distributions
Motivation: Several recent papers proposed new algorithms for detecting coding regions without requiring a learning dataset of already known genes. In this paper we study the cluster structure of several genomes in the space of codon usage. This allowed us to interpret some of the results obtained in other studies and to propose a simpler method which is, nevertheless, fully functional.
Results: Several complete genomic sequences were analyzed, using visualization of tables of triplet counts in a sliding window. The distribution of 64-dimensional vectors of triplet frequencies displays a well-detectable cluster structure. The structure was found to consist of seven clusters, corresponding to protein-coding information in three possible phases in one of the two complementary strands and in the non-coding regions. Awareness of the existence of this structure allows development of methods for the segmentation of sequences into regions with the same coding phase and non-coding regions.
This method may be completely unsupervised or may use some external information. Since the method does not need extraction of ORFs, it can be applied even to unassembled genomes. Accuracy calculated at the base-pair level (both sensitivity and specificity) exceeds 90%. This is no worse than methods such as HMM, and the method has the advantage of being much simpler and clearer
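The sliding-window triplet counting described in this abstract can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the function names, the window length of 300 and the step of 3 are assumptions chosen for the example (the abstract does not state them).

```python
from itertools import product

# All 64 triplets over the DNA alphabet, in a fixed order.
TRIPLETS = ["".join(p) for p in product("ACGT", repeat=3)]
INDEX = {t: i for i, t in enumerate(TRIPLETS)}


def triplet_frequencies(fragment, phase=0):
    """Return the 64-dimensional frequency vector of non-overlapping
    triplets read from `fragment`, starting at offset `phase` (0, 1 or 2)."""
    counts = [0] * 64
    for i in range(phase, len(fragment) - 2, 3):
        idx = INDEX.get(fragment[i:i + 3].upper())
        if idx is not None:  # skip triplets with ambiguous symbols such as N
            counts[idx] += 1
    total = sum(counts)
    if total == 0:
        return [0.0] * 64
    return [c / total for c in counts]


def sliding_window_vectors(sequence, window=300, step=3):
    """Yield (start, vector) for each window position along `sequence`.

    `window` and `step` are hypothetical defaults, not values from the paper.
    """
    for start in range(0, len(sequence) - window + 1, step):
        yield start, triplet_frequencies(sequence[start:start + window])
```

Each window yields one point in 64-dimensional space. Under the picture given in the abstract, windows lying inside a gene would fall into one of six clusters depending on strand and reading phase, and windows in non-coding regions into a seventh.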
Four basic symmetry types in the universal 7-cluster structure of 143 complete bacterial genomic sequences
Coding information is the main source of heterogeneity (non-randomness) in the sequences of bacterial genomes. This information can be naturally modeled by analysing cluster structures in the "in-phase" triplet distributions of relatively short genomic fragments (200-400 bp). We found a universal 7-cluster structure in all 143 completely sequenced bacterial genomes available in GenBank in August 2004, and explained its properties.
The 7-cluster structure is responsible for the main part of sequence heterogeneity in bacterial genomes. In this sense, the 7-cluster structure is the basic model of a bacterial genome sequence. We demonstrate that there are four basic "pure" types of this model observed in nature: "parallel triangles", "perpendicular triangles", the degenerate case, and the flower-like type. We show that the codon usage of bacterial genomes is, with high accuracy, a multi-linear function of their genomic G+C content (more precisely, it is described by two similar functions, one for eubacterial genomes and the other for archaea).
All 143 animated 3D cluster scatter plots are collected in a database that is available on our web-site:
http://www.ihes.fr/~zinovyev/7clusters
The finding can be readily introduced into any software for gene prediction, sequence alignment or bacterial genome classification
PCA and K-Means decipher genome
In this paper, we aim to give a tutorial for undergraduate students studying
statistical methods and/or bioinformatics. The students will learn how data
visualization can help in genomic sequence analysis. Students start with a
fragment of genetic text of a bacterial genome and analyze its structure. By
means of principal component analysis they ``discover'' that the information in
the genome is encoded by non-overlapping triplets. Next, they learn how to find
gene positions. This exercise on PCA and K-Means clustering enables active
study of the basic bioinformatics notions. Appendix 1 contains program listings
that go along with this exercise. Appendix 2 includes 2D PCA plots of triplet
usage in a moving frame for a series of bacterial genomes, from GC-poor to GC-rich
ones. Animated 3D PCA plots are attached as separate gif files. The topology
(cluster structure) and geometry (mutual positions of clusters) of these plots
depend clearly on GC-content.
Comment: 18 pages, with program listings for MatLab, PCA analysis of genomes
and additional animated 3D PCA plot
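The two techniques named in this tutorial, PCA and K-Means, can be illustrated with a small dependency-free sketch. It is not the MatLab code from the paper's appendix. The function names are hypothetical, the leading principal axis is approximated by power iteration, and the deterministic farthest-first initialisation of K-Means is an assumption made so that the example is reproducible.

```python
import random


def mean_center(rows):
    """Subtract the column means so that PCA sees centered data."""
    n, d = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    return [[r[j] - means[j] for j in range(d)] for r in rows]


def first_principal_component(rows, iterations=200, seed=0):
    """Approximate the leading principal axis of centered `rows`
    by power iteration on X^T X (returned as a unit vector)."""
    rng = random.Random(seed)
    d = len(rows[0])
    v = [rng.uniform(-1, 1) for _ in range(d)]
    for _ in range(iterations):
        # w = X^T (X v), computed without forming the covariance matrix
        scores = [sum(r[j] * v[j] for j in range(d)) for r in rows]
        w = [sum(s * r[j] for s, r in zip(scores, rows)) for j in range(d)]
        norm = sum(x * x for x in w) ** 0.5
        if norm == 0:
            break
        v = [x / norm for x in w]
    return v


def sq_dist(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def k_means(points, k, iterations=50):
    """Lloyd's algorithm; returns one cluster label per point."""
    # Farthest-first initialisation: an assumption of this sketch.
    centers = [list(points[0])]
    while len(centers) < k:
        far = max(points, key=lambda p: min(sq_dist(p, c) for c in centers))
        centers.append(list(far))
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest center.
        for i, p in enumerate(points):
            labels[i] = min(range(k), key=lambda c: sq_dist(p, centers[c]))
        # Update step: move each center to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centers[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels
```

In the setting of the exercise, the rows would be triplet-frequency vectors of genome fragments. Projecting them onto the leading principal axes gives the 2D and 3D plots, and K-Means with a chosen number of clusters assigns each fragment a label that can be read as a candidate coding phase or a non-coding region.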
Maximum entropy models for antibody diversity
Recognition of pathogens relies on families of proteins showing great
diversity. Here we construct maximum entropy models of the sequence repertoire,
building on recent experiments that provide a nearly exhaustive sampling of the
IgM sequences in zebrafish. These models are based solely on pairwise
correlations between residue positions, but correctly capture the higher order
statistical properties of the repertoire. Exploiting the interpretation of
these models as statistical physics problems, we make several predictions for
the collective properties of the sequence ensemble: the distribution of
sequences obeys Zipf's law, the repertoire decomposes into several clusters,
and there is a massive restriction of diversity due to the correlations. These
predictions are completely inconsistent with models in which amino acid
substitutions are made independently at each site, and are in good agreement
with the data. Our results suggest that antibody diversity is not limited by
the sequences encoded in the genome, and may reflect rapid adaptation to
antigenic challenges. This approach should be applicable to the study of the
global properties of other protein families
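For orientation, a maximum entropy model constrained only by single-site frequencies and pairwise correlations is conventionally written in the form below. The notation is an assumption of this note and is not taken from the abstract: \sigma_i denotes the residue at position i of a sequence of length L, h_i and J_{ij} are the fitted parameters, and Z is the normalising constant. The independent-site model that the abstract contrasts against, and Zipf's law for the probability of the sequence of rank r, are given alongside.

```latex
% Pairwise maximum entropy distribution over sequences
P(\sigma_1,\dots,\sigma_L)
  = \frac{1}{Z}\,
    \exp\!\Big( \sum_{i=1}^{L} h_i(\sigma_i)
              + \sum_{i<j} J_{ij}(\sigma_i,\sigma_j) \Big)

% Independent-site model: a product of single-site frequencies f_i
P_{\mathrm{ind}}(\sigma_1,\dots,\sigma_L) = \prod_{i=1}^{L} f_i(\sigma_i)

% Zipf's law: probability inversely proportional to rank r
P(r) \propto \frac{1}{r}
```

Setting all J_{ij} to zero reduces the first expression to the independent-site form, which is the sense in which the pairwise terms carry the correlations discussed in the abstract.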
Metabolic and Chaperone Gene Loss Marks the Origin of Animals: Evidence for Hsp104 and Hsp78 Sharing Mitochondrial Clients
The evolution of animals involved acquisition of an emergent gene repertoire
for gastrulation. Whether loss of genes also co-evolved with this developmental
reprogramming has not yet been addressed. Here, we identify twenty-four genetic
functions that are retained in fungi and choanoflagellates but undetectable in
animals. These lost genes encode: (i) sixteen distinct biosynthetic functions;
(ii) the two ancestral eukaryotic ClpB disaggregases, Hsp78 and Hsp104, which
function in the mitochondria and cytosol, respectively; and (iii) six other
assorted functions. We present computational and experimental data that are
consistent with a joint function for the differentially localized ClpB
disaggregases, and with the possibility of a shared client/chaperone
relationship between the mitochondrial Fe/S homoaconitase encoded by the lost
LYS4 gene and the two ClpBs. Our analyses lead to the hypothesis that the
evolution of gastrulation-based multicellularity in animals led to efficient
extraction of nutrients from dietary sources, loss of natural selection for
maintenance of energetically expensive biosynthetic pathways, and subsequent
loss of their attendant ClpB chaperones.
Comment: This is a reformatted version of the recent official publication in
PLoS ONE (2015). This version differs substantially from the first three arXiv
versions. This version uses a fixed-width font for DNA sequences as was done
in the earlier arXiv versions but which is missing in the official PLoS ONE
publication. The title has also been shortened slightly from the official
publication