Mapping the Space of Genomic Signatures
We propose a computational method to measure and visualize interrelationships
among any number of DNA sequences allowing, for example, the examination of
hundreds or thousands of complete mitochondrial genomes. An "image distance" is
computed for each pair of graphical representations of DNA sequences, and the
distances are visualized as a Molecular Distance Map: Each point on the map
represents a DNA sequence, and the spatial proximity between any two points
reflects the degree of structural similarity between the corresponding
sequences. The graphical representation of DNA sequences utilized, Chaos Game
Representation (CGR), is genome- and species-specific and can thus act as a
genomic signature. Consequently, Molecular Distance Maps could inform species
identification, taxonomic classifications and, to a certain extent,
evolutionary history. The image distance employed, Structural Dissimilarity
Index (DSSIM), implicitly compares the occurrences of oligomers of length up to
k (herein k = 9) in DNA sequences. We computed DSSIM distances for more than
5 million pairs of complete mitochondrial genomes, and used Multi-Dimensional
Scaling (MDS) to obtain Molecular Distance Maps that visually display the
sequence relatedness in various subsets, at different taxonomic levels. This
general-purpose method does not require DNA sequence homology and can thus be
used to compare similar or vastly different DNA sequences, genomic or
computer-generated, of the same or different lengths. We illustrate potential
uses of this approach by applying it to several taxonomic subsets: phylum
Vertebrata, (super)kingdom Protista, classes Amphibia-Insecta-Mammalia, class
Amphibia, and order Primates. This analysis of an extensive dataset confirms
that the oligomer composition of full mtDNA sequences can be a source of
taxonomic information.
Comment: 14 pages, 7 figures. arXiv admin note: substantial text overlap with
arXiv:1307.375
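The Chaos Game Representation named in this abstract can be sketched in a few lines. The following is a minimal illustration, not the authors' implementation: the corner assignment (A, C, G, T on the corners of the unit square), the starting point at the centre, and the helper names `cgr_points` and `cgr_grid` are all assumptions made for the example. Each point is the midpoint between the previous point and the corner of the current base, so the cell of a 2^k x 2^k grid that a point falls in is determined by the last k bases read.

```python
# Assumed corner convention; other assignments are also used in the literature.
CORNERS = {"A": (0.0, 0.0), "C": (0.0, 1.0), "G": (1.0, 1.0), "T": (1.0, 0.0)}


def cgr_points(seq):
    """Return the chaos-game points of seq, starting from the square's centre."""
    x, y = 0.5, 0.5
    points = []
    for base in seq.upper():
        if base not in CORNERS:
            raise ValueError(f"unsupported symbol: {base!r}")
        cx, cy = CORNERS[base]
        # Move halfway from the current point towards the corner of this base.
        x, y = (x + cx) / 2.0, (y + cy) / 2.0
        points.append((x, y))
    return points


def cgr_grid(seq, k):
    """Count CGR points per cell of a 2^k x 2^k grid (one cell per k-mer)."""
    size = 2 ** k
    grid = [[0] * size for _ in range(size)]
    for idx, (x, y) in enumerate(cgr_points(seq)):
        if idx + 1 < k:
            continue  # the first k-1 points do not yet encode a full k-mer
        col = min(int(x * size), size - 1)
        row = min(int(y * size), size - 1)
        grid[row][col] += 1
    return grid


grid = cgr_grid("ACGTACGTTTGA", k=2)
```

Two such grids of equal size can then be compared with any image or vector distance; DSSIM is the one the abstract reports using.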
BOOL-AN: A method for comparative sequence analysis and phylogenetic reconstruction
A novel discrete mathematical approach is proposed as an additional tool for molecular systematics which does not require prior statistical assumptions concerning the evolutionary process. The method is based on algorithms generating mathematical representations directly from DNA/RNA or protein sequences, followed by the output of numerical (scalar or vector) and visual characteristics (graphs). The binary encoded sequence information is transformed into a compact analytical form, called the Iterative Canonical Form (or ICF) of Boolean functions, which can then be used as a generalized molecular descriptor. The method provides raw vector data for calculating different distance matrices, which in turn can be analyzed by neighbor-joining or UPGMA to derive a phylogenetic tree, or by principal coordinates analysis to get an ordination scattergram. The new method and the associated software for inferring phylogenetic trees are called Boolean analysis, or BOOL-AN.
Qubism: self-similar visualization of many-body wavefunctions
A visualization scheme for quantum many-body wavefunctions is described,
which we have termed qubism. Its main property is its recursivity: increasing
the number of qubits results in an increase in the image resolution. Thus, the
plots are typically fractal. As examples, we provide images for the ground
states of commonly used Hamiltonians in condensed matter and cold atom physics,
such as the Heisenberg model or the Ising model in a transverse field (ITF).
Many features of the wavefunction, such as
magnetization, correlations and criticality, can be visualized as properties of
the images. In particular, factorizability can be easily spotted, and a way to
estimate the entanglement entropy from the image is provided.
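The recursive property described in this abstract can be illustrated with a small sketch. It rests on assumptions not stated in the abstract: qubits are read in consecutive pairs, the first bit of each pair selects the upper or lower half and the second bit selects the left or right half of the current square, and each pixel stores the probability |amplitude|^2. The function names are hypothetical.

```python
import math


def qubism_pixel(bits):
    """Map a basis-state bitstring (even length) to a (row, col) pixel.

    Assumed convention: each pair of bits picks one quadrant of the current
    square, so every extra pair of qubits doubles the image resolution.
    """
    if len(bits) % 2:
        raise ValueError("expected an even number of qubits")
    row = col = 0
    for i in range(0, len(bits), 2):
        row = 2 * row + int(bits[i])      # first bit: upper (0) or lower (1)
        col = 2 * col + int(bits[i + 1])  # second bit: left (0) or right (1)
    return row, col


def qubism_image(amplitudes, n_qubits):
    """Build a 2^(n/2) x 2^(n/2) image from {bitstring: amplitude}."""
    size = 2 ** (n_qubits // 2)
    image = [[0.0] * size for _ in range(size)]
    for bits, amp in amplitudes.items():
        r, c = qubism_pixel(bits)
        image[r][c] = abs(amp) ** 2  # plotting probabilities is a choice made here
    return image


# Two-qubit singlet state: weight only on the anti-diagonal pixels.
singlet = {"01": 1 / math.sqrt(2), "10": -1 / math.sqrt(2)}
image = qubism_image(singlet, n_qubits=2)
```

Because the pixel index is built pair by pair, a state on more qubits refines the same picture instead of rearranging it, which is why the plots tend to look self-similar.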
An investigation into inter- and intragenomic variations of graphic genomic signatures
We provide, on an extensive dataset and using several different distances,
confirmation of the hypothesis that CGR patterns are preserved along a genomic
DNA sequence, and are different for DNA sequences originating from genomes of
different species. This finding lends support to the theory that CGRs of
genomic sequences can act as graphic genomic signatures. In particular, we
compare the CGR patterns of over five hundred different 150,000 bp genomic
sequences originating from the genomes of six organisms, each belonging to one
of the kingdoms of life: H. sapiens, S. cerevisiae, A. thaliana, P. falciparum,
E. coli, and P. furiosus. We also provide preliminary evidence of this method's
applicability to closely related species by comparing H. sapiens (chromosome
21) sequences and over one hundred and fifty genomic sequences, also 150,000 bp
long, from P. troglodytes (Animalia; chromosome Y), for a total length of more
than 101 million basepairs analyzed. We compute pairwise distances between CGRs
of these genomic sequences using six different distances, and construct
Molecular Distance Maps that visualize all sequences as points in a
two-dimensional or three-dimensional space, to simultaneously display their
interrelationships. Our analysis confirms that CGR patterns of DNA sequences
from the same genome are in general quantitatively similar, while being
different for DNA sequences from genomes of different species. Our analysis of
the performance of the assessed distances uses three different quality measures
and suggests that several distances outperform the Euclidean distance, which
has so far been almost exclusively used for such studies. In particular we show
that, for this dataset, DSSIM (Structural Dissimilarity Index) and the
descriptor distance (introduced here) are best able to classify genomic
sequences.
Comment: 14 pages, 6 figures, 5 tables
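The step from pairwise distances to a two-dimensional map can be sketched with classical multidimensional scaling. This is a minimal pure-Python illustration under stated assumptions, not the procedure used in the paper: it uses classical (Torgerson) MDS, finds the leading eigenvectors by power iteration with deflation, and assumes the leading eigenvalues of the centred matrix are positive, which holds for Euclidean distances. The function name and parameters are hypothetical.

```python
import math
import random


def _matvec(m, v):
    return [sum(mij * vj for mij, vj in zip(row, v)) for row in m]


def classical_mds(dist, dims=2, iters=500, seed=0):
    """Embed n items in `dims` dimensions from a symmetric n x n distance matrix."""
    n = len(dist)
    # Double-centre the squared distances to obtain a Gram matrix B.
    sq = [[d * d for d in row] for row in dist]
    row_mean = [sum(r) / n for r in sq]
    total_mean = sum(row_mean) / n
    b = [[-0.5 * (sq[i][j] - row_mean[i] - row_mean[j] + total_mean)
          for j in range(n)] for i in range(n)]

    rng = random.Random(seed)
    coords = [[0.0] * dims for _ in range(n)]
    for d in range(dims):
        # Power iteration for the current leading eigenvector of B.
        v = [rng.uniform(-1.0, 1.0) for _ in range(n)]
        for _ in range(iters):
            w = _matvec(b, v)
            norm = math.sqrt(sum(x * x for x in w))
            if norm < 1e-12:
                break
            v = [x / norm for x in w]
        lam = sum(vi * wi for vi, wi in zip(v, _matvec(b, v)))  # Rayleigh quotient
        if lam <= 1e-12:
            break  # no further positive eigenvalue; leave remaining axes at zero
        scale = math.sqrt(lam)
        for i in range(n):
            coords[i][d] = scale * v[i]
        # Deflate so the next pass finds the next eigenvector.
        for i in range(n):
            for j in range(n):
                b[i][j] -= lam * v[i] * v[j]
    return coords


# Four points on a 3 x 4 rectangle, described only by their distances.
pts = [(0, 0), (3, 0), (3, 4), (0, 4)]
dist = [[math.dist(p, q) for q in pts] for p in pts]
emb = classical_mds(dist)
```

The recovered coordinates are determined only up to rotation and reflection, so it is the distances between embedded points, not the coordinates themselves, that should be compared with the input.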
Graphical Representation of Biological Sequences
Sequence comparison is one of the most fundamental tasks in bioinformatics. For biological sequence comparison, alignment is the most effective method when the sequences are not too long. However, because the time complexity of alignment is quadratic in the sequence length, aligning long sequences requires a large amount of computational time. Therefore, so-called alignment-free sequence comparison methods are needed to compare sequences such as whole genomes in practical time. In this chapter, we reviewed the graphical representation of biological sequences, which is one of the major alignment-free sequence comparison methods. The notable effects of weighting during the course of the graphical representation, first introduced by the author and co-workers, were also mentioned.
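The quadratic cost of alignment mentioned in this abstract is visible in the standard dynamic-programming recurrence. The sketch below computes the Levenshtein edit distance with unit costs, used here as a simple stand-in for an alignment score; the cost scheme and the function name are assumptions made for illustration. The nested loops visit every pair of positions, so the running time grows as len(a) * len(b).

```python
def edit_distance(a, b):
    """Levenshtein distance between a and b with unit costs.

    Time is O(len(a) * len(b)); only two rows of the table are kept,
    so memory is O(len(b)).
    """
    prev = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # match or substitute
        prev = curr
    return prev[-1]


d = edit_distance("ACGTTGA", "ACTTGGA")
```

For two sequences of a few million bases each, this table has on the order of 10^13 cells, which is the practical motivation for alignment-free methods.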
Sparse approaches for the exact distribution of patterns in long state sequences generated by a Markov source
We present two novel approaches for the computation of the exact distribution
of a pattern in a long sequence. Both approaches take into account the sparse
structure of the problem and are two-part algorithms. The first approach relies
on a partial recursion after a fast computation of the second largest
eigenvalue of the transition matrix of a Markov chain embedding. The second
approach uses fast Taylor expansions of an exact bivariate rational
reconstruction of the distribution. We illustrate the interest of both
approaches on a simple toy-example and two biological applications: the
transcription factors of the Human Chromosome 5 and the PROSITE signatures of
functional motifs in proteins. On these examples, our methods demonstrate their
complementarity and their ability to extend the domain of feasibility for
exact computations in pattern problems to a new level.
On Map Representations of DNA
We have constructed graphical (qualitative and visual) representations of DNA sequences as 2D
maps and carried out their numerical (quantitative and computational) analysis. The maps are obtained by transforming
the four-letter sequences (where letters represent the four nucleic bases) via a spiral representation
over triangular and square cell grids into a four-color map. The maps so constructed are then represented
by distance matrices. We consider the use of several matrix invariants as DNA descriptors for determining
the degree of similarity of a selection of DNA sequences. (doi: 10.5562/cca2338)