The Mathematics of Phylogenomics
The grand challenges in biology today are being shaped by powerful
high-throughput technologies that have revealed the genomes of many organisms,
global expression patterns of genes and detailed information about variation
within populations. We are therefore able to ask, for the first time,
fundamental questions about the evolution of genomes, the structure of genes
and their regulation, and the connections between genotypes and phenotypes of
individuals. The answers to these questions are all predicated on progress in a
variety of computational, statistical, and mathematical fields.
The rapid growth in the characterization of genomes has led to the
advancement of a new discipline called Phylogenomics. This discipline results
from the combination of two major fields in the life sciences: Genomics, i.e.,
the study of the function and structure of genes and genomes; and Molecular
Phylogenetics, i.e., the study of the hierarchical evolutionary relationships
among organisms and their genomes. The objective of this article is to offer
mathematicians a first introduction to this emerging field, and to discuss
specific mathematical problems and developments arising from phylogenomics.
Comment: 41 pages, 4 figures
The EM Algorithm and the Rise of Computational Biology
In the past decade computational biology has grown from a cottage industry
with a handful of researchers to an attractive interdisciplinary field,
catching the attention and imagination of many quantitatively-minded
scientists. Of interest to us is the key role played by the EM algorithm during
this transformation. We survey the use of the EM algorithm in a few important
computational biology problems surrounding the "central dogma" of molecular
biology: from DNA to RNA and then to proteins. Topics of this article include
sequence motif discovery, protein sequence alignment, population genetics,
evolutionary models and mRNA expression microarray data analysis.
Comment: Published at http://dx.doi.org/10.1214/09-STS312 in Statistical
Science (http://www.imstat.org/sts/) by the Institute of Mathematical
Statistics (http://www.imstat.org
Integration of Biological Sources: Exploring the Case of Protein Homology
Data integration is a key issue in the domain of bioinformatics, which deals with huge amounts of heterogeneous biological data that grow and change rapidly. This paper serves as an introduction to the field of bioinformatics and the biological concepts it deals with, and as an exploration of the integration problems a bioinformatics scientist faces. We examine ProGMap, an integrated protein homology system used by bioinformatics scientists at Wageningen University, and several use cases related to protein homology. A key issue we identify is the huge manual effort required to unify source databases into a single resource. Uncertain databases are able to contain several possible worlds, and it has been proposed that they can be used to significantly reduce initial integration efforts. We propose several directions for future work where uncertain databases can be applied to bioinformatics, with the goal of furthering the cause of bioinformatics integration.
Mapping the Space of Genomic Signatures
We propose a computational method to measure and visualize interrelationships
among any number of DNA sequences allowing, for example, the examination of
hundreds or thousands of complete mitochondrial genomes. An "image distance" is
computed for each pair of graphical representations of DNA sequences, and the
distances are visualized as a Molecular Distance Map: Each point on the map
represents a DNA sequence, and the spatial proximity between any two points
reflects the degree of structural similarity between the corresponding
sequences. The graphical representation of DNA sequences utilized, Chaos Game
Representation (CGR), is genome- and species-specific and can thus act as a
genomic signature. Consequently, Molecular Distance Maps could inform species
identification, taxonomic classifications and, to a certain extent,
evolutionary history. The image distance employed, Structural Dissimilarity
Index (DSSIM), implicitly compares the occurrences of oligomers up to a fixed
length in DNA sequences. We computed DSSIM distances for more than
5 million pairs of complete mitochondrial genomes, and used Multi-Dimensional
Scaling (MDS) to obtain Molecular Distance Maps that visually display the
sequence relatedness in various subsets, at different taxonomic levels. This
general-purpose method does not require DNA sequence homology and can thus be
used to compare similar or vastly different DNA sequences, genomic or
computer-generated, of the same or different lengths. We illustrate potential
uses of this approach by applying it to several taxonomic subsets: phylum
Vertebrata, (super)kingdom Protista, classes Amphibia-Insecta-Mammalia, class
Amphibia, and order Primates. This analysis of an extensive dataset confirms
that the oligomer composition of full mtDNA sequences can be a source of
taxonomic information.
Comment: 14 pages, 7 figures. arXiv admin note: substantial text overlap with
arXiv:1307.375
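The Chaos Game Representation mentioned in the abstract above can be illustrated with a minimal sketch. The corner assignment (A, C, G, T on the unit square), the function names `cgr_points` and `cgr_grid`, and the parameter `k` are assumptions made for illustration, not details taken from the paper. The sketch covers only the CGR step, not the DSSIM image distance or the MDS projection.

```python
# Minimal Chaos Game Representation (CGR) sketch.
# The corner layout below is a common convention, assumed here for illustration.
CORNERS = {"A": (0.0, 0.0), "C": (0.0, 1.0), "G": (1.0, 1.0), "T": (1.0, 0.0)}


def cgr_points(seq):
    """Return the CGR trajectory: each point is the midpoint between the
    previous point and the corner of the current nucleotide."""
    x, y = 0.5, 0.5  # start at the centre of the unit square
    points = []
    for base in seq.upper():
        if base not in CORNERS:
            continue  # skip ambiguous symbols such as N (a simplification)
        cx, cy = CORNERS[base]
        x, y = (x + cx) / 2, (y + cy) / 2
        points.append((x, y))
    return points


def cgr_grid(seq, k):
    """Count CGR points on a 2^k x 2^k grid. For a sequence of only A, C, G, T,
    each cell corresponds to one oligomer of length k, so the grid is a
    k-mer frequency table laid out as an image."""
    size = 2 ** k
    grid = [[0] * size for _ in range(size)]
    for i, (x, y) in enumerate(cgr_points(seq)):
        if i < k - 1:
            continue  # the first k-1 points do not yet encode a full k-mer
        col = min(int(x * size), size - 1)
        row = min(int(y * size), size - 1)
        grid[row][col] += 1
    return grid


if __name__ == "__main__":
    for row in cgr_grid("ACGTACGTTTGACA", 2):
        print(row)
```

Two such grids, computed from different sequences, could then be compared with any image distance; the abstract uses DSSIM for that step.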
Soft topographic map for clustering and classification of bacteria
In this work a new method for clustering and building a topographic representation of a bacteria taxonomy is presented. The method is based on the analysis of stable parts of the genome, the so-called "housekeeping genes". The proposed method generates topographic maps of the bacteria taxonomy, where relations among different type strains can be visually inspected and verified. Two well-known DNA alignment algorithms are applied to the genomic sequences. Topographic maps are optimized to represent the similarity among the sequences according to their evolutionary distances. The experimental analysis is carried out on 147 type strains of the Gammaproteobacteria class by means of the 16S rRNA housekeeping gene. Complete sequences of the gene have been retrieved from the NCBI public database. In the experimental tests the maps show clusters of homologous type strains and present some singular cases potentially due to incorrect classification or erroneous annotations in the database.
A Two-Phase Dynamic Programming Algorithm Tool for DNA Sequences
Sequence alignment has to do with the arrangement of DNA, RNA, and protein sequences to identify areas of similarity. Technically, it
involves the arrangement of the primary sequences of DNA, RNA, or protein to identify regions of similarity that may be a consequence of
functional, structural, or evolutionary relationships between the sequences. If two sequences in an alignment share a common ancestor,
mismatches can be interpreted as mutations, and gaps as insertions or deletions. Such information is of great use in vital areas such as
the study of diseases, genomics and the biological sciences generally. Thus, sequence alignment presents not just an exciting field of
study, but a field of great importance to mankind. In this light, we extensively studied about seventy (70) existing sequence alignment
tools available to us. Most of these tools are not user-friendly and cannot be used by biologists. The few tools that attempted both
local and global algorithms are not readily and freely available. We therefore implemented a sequence alignment tool (CU-Aligner) in an
understandable, user-friendly and portable way, with click-of-a-button simplicity. The tool uses the Needleman-Wunsch and Smith-Waterman
algorithms for global and local alignments, respectively, and focuses primarily on DNA sequences. Our aligner is implemented in the Java
language in both application and applet mode and has been efficient on all Windows operating systems.
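The two dynamic-programming algorithms named in the abstract above can be sketched as follows. This is a minimal illustration and not the CU-Aligner implementation (which the abstract says is written in Java). The function names and the scoring values (match +1, mismatch -1, linear gap -1) are assumptions chosen for the example.

```python
# Hypothetical scoring scheme for illustration: match +1, mismatch -1, gap -1.

def needleman_wunsch(a, b, match=1, mismatch=-1, gap=-1):
    """Global alignment: return (score, aligned_a, aligned_b)."""
    n, m = len(a), len(b)
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap  # prefix of a aligned against gaps only
    for j in range(1, m + 1):
        score[0][j] = j * gap  # prefix of b aligned against gaps only
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(score[i - 1][j - 1] + sub,   # align a[i-1] with b[j-1]
                              score[i - 1][j] + gap,       # gap in b
                              score[i][j - 1] + gap)       # gap in a
    # Trace back from the bottom-right corner to recover one optimal alignment.
    top, bottom = [], []
    i, j = n, m
    while i > 0 or j > 0:
        sub = match if i > 0 and j > 0 and a[i - 1] == b[j - 1] else mismatch
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + sub:
            top.append(a[i - 1]); bottom.append(b[j - 1]); i -= 1; j -= 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            top.append(a[i - 1]); bottom.append("-"); i -= 1
        else:
            top.append("-"); bottom.append(b[j - 1]); j -= 1
    return score[n][m], "".join(reversed(top)), "".join(reversed(bottom))


def smith_waterman_score(a, b, match=1, mismatch=-1, gap=-1):
    """Local alignment: return the best local score. Cells are floored at 0,
    so an alignment may start and end anywhere in either sequence."""
    best = 0
    prev = [0] * (len(b) + 1)
    for i in range(1, len(a) + 1):
        cur = [0]
        for j in range(1, len(b) + 1):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            cur.append(max(0, prev[j - 1] + sub, prev[j] + gap, cur[j - 1] + gap))
            best = max(best, cur[j])
        prev = cur
    return best


if __name__ == "__main__":
    print(needleman_wunsch("ACGT", "AGT"))
    print(smith_waterman_score("TTTACGTTT", "GGACGGG"))
```

The only structural difference between the two recurrences is the floor at zero in the local version, which lets a poorly matching prefix be discarded instead of penalised.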
Parametric Alignment of Drosophila Genomes
The classic algorithms of Needleman--Wunsch and Smith--Waterman find a
maximum a posteriori probability alignment for a pair hidden Markov model
(PHMM). In order to process large genomes that have undergone complex genome
rearrangements, almost all existing whole genome alignment methods apply fast
heuristics to divide genomes into small pieces which are suitable for
Needleman--Wunsch alignment. In these alignment methods, it is standard
practice to fix the parameters and to produce a single alignment for subsequent
analysis by biologists.
Our main result is the construction of a whole genome parametric alignment of
Drosophila melanogaster and Drosophila pseudoobscura. Parametric alignment
resolves the issue of robustness to changes in parameters by finding all
optimal alignments for all possible parameters in a PHMM. Our alignment draws
on existing heuristics for dividing whole genomes into small pieces for
alignment, and it relies on advances we have made in computing convex polytopes
that allow us to parametrically align non-coding regions using biologically
realistic models. We demonstrate the utility of our parametric alignment for
biological inference by showing that cis-regulatory elements are more conserved
between Drosophila melanogaster and Drosophila pseudoobscura than previously
thought. We also show how whole genome parametric alignment can be used to
quantitatively assess the dependence of branch length estimates on alignment
parameters.
The alignment polytopes, software, and supplementary material can be
downloaded at http://bio.math.berkeley.edu/parametric/.
Comment: 19 pages, 3 figures