MAVID: Constrained ancestral alignment of multiple sequences
We describe a new global multiple alignment program capable of aligning a
large number of genomic regions. Our progressive alignment approach
incorporates the following ideas: maximum-likelihood inference of ancestral
sequences, automatic guide-tree construction, protein-based anchoring of
ab-initio gene predictions, and constraints derived from a global homology map
of the sequences. We have implemented these ideas in the MAVID program, which
is able to accurately align multiple genomic regions up to megabases long.
MAVID is able to effectively align divergent sequences, as well as incomplete
unfinished sequences. We demonstrate the capabilities of the program on the
benchmark CFTR region, which consists of 1.8 Mb of human sequence and 20
orthologous regions in marsupials, birds, fish, and mammals. Finally, we
describe two large MAVID alignments: an alignment of all the available HIV
genomes and a multiple alignment of the entire human, mouse and rat genomes.
Optimization of miRNA-seq data preprocessing.
The past two decades of microRNA (miRNA) research have solidified the role of these small non-coding RNAs as key regulators of many biological processes and promising biomarkers for disease. The concurrent development of high-throughput profiling technology has further advanced our understanding of the impact of their dysregulation on a global scale. Currently, next-generation sequencing is the platform of choice for the discovery and quantification of miRNAs. Despite this, there is no clear consensus on how the data should be preprocessed before conducting downstream analyses. Often overlooked, data preprocessing is an essential step in data analysis: the presence of unreliable features and noise can affect the conclusions drawn from downstream analyses. Using a spike-in dilution study, we evaluated the effects of several general-purpose aligners (BWA, Bowtie, Bowtie 2 and Novoalign) and normalization methods (counts-per-million, total count scaling, upper quartile scaling, Trimmed Mean of M, DESeq, linear regression, cyclic loess and quantile) with respect to the final miRNA count data distribution, variance, bias and accuracy of differential expression analysis. We make practical recommendations on the optimal preprocessing methods for the extraction and interpretation of miRNA count data from small RNA-sequencing experiments.
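As an illustration of the simplest normalization method named in the abstract above, counts-per-million rescales each raw count by the library size. The following is a minimal sketch, not the authors' pipeline; the function name and the miRNA counts are hypothetical.

```python
def counts_per_million(counts):
    """Rescale raw read counts so that one library sums to one million.

    `counts` maps a feature name (e.g. a miRNA) to its raw read count.
    The function name and the input layout are illustrative assumptions.
    """
    total = sum(counts.values())
    if total == 0:
        raise ValueError("library has no mapped reads")
    return {name: count * 1_000_000 / total for name, count in counts.items()}


# Hypothetical raw counts for one small RNA-seq library.
library = {"miR-21": 500, "miR-155": 300, "let-7a": 200}
print(counts_per_million(library))
```

Counts-per-million only corrects for sequencing depth; the other methods listed in the abstract (for example upper quartile scaling or Trimmed Mean of M) use different scaling factors and are not shown here.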
In search of lost introns
Many fundamental questions concerning the emergence and subsequent evolution
of eukaryotic exon-intron organization are still unsettled. Genome-scale
comparative studies, which can shed light on crucial aspects of eukaryotic
evolution, require adequate computational tools.
We describe novel computational methods for studying spliceosomal intron
evolution. Our goal is to give a reliable characterization of the dynamics of
intron evolution. Our algorithmic innovations address the identification of
orthologous introns, and the likelihood-based analysis of intron data. We
discuss a compression method for the evaluation of the likelihood function,
which is noteworthy for phylogenetic likelihood problems in general. We prove
that after $O(n\ell)$ preprocessing time, subsequent evaluations take $O(n\ell/\log \ell)$ time almost surely in the Yule-Harding random model of $n$-taxon
phylogenies, where $\ell$ is the input sequence length.
We illustrate the practicality of our methods by compiling and analyzing a
data set involving 18 eukaryotes, more than in any other study to date. The
study yields the surprising result that ancestral eukaryotes were fairly
intron-rich. For example, the bilaterian ancestor is estimated to have had more
than 90% as many introns as vertebrates do now.
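The abstract above mentions compressing the input before evaluating the likelihood. The sketch below shows only the standard, simpler idea of collapsing identical alignment columns into patterns with multiplicities, so that each distinct pattern is evaluated once. It is not the authors' compression method, and the function names, the 0/1 presence-absence encoding, and the toy data are assumptions made for illustration.

```python
import math
from collections import Counter


def compress_columns(rows):
    """Collapse identical columns of a presence/absence table into patterns.

    `rows` holds one string per taxon, e.g. "1011", where position i records
    whether an intron is present (1) or absent (0) at site i.
    Returns a Counter mapping each distinct column to its multiplicity.
    """
    if len({len(row) for row in rows}) > 1:
        raise ValueError("all rows must have the same length")
    return Counter(zip(*rows))


def log_likelihood(pattern_counts, site_likelihood):
    """Sum log-likelihoods over patterns, weighting each by its count.

    `site_likelihood` is a caller-supplied function (hypothetical here)
    that returns the likelihood of a single column pattern.
    """
    return sum(n * math.log(site_likelihood(p)) for p, n in pattern_counts.items())


# Toy table with three taxa and four sites; two columns are identical.
patterns = compress_columns(["1011", "1001", "0000"])
print(patterns)
```

With many repeated columns, the likelihood function is called once per distinct pattern rather than once per site, which is where the saving comes from.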
Evaluation of phylogenetic reconstruction methods using bacterial whole genomes: a simulation based study
Background: Phylogenetic reconstruction is a necessary first step in many analyses which use whole genome sequence data from bacterial populations. There are many available methods to infer phylogenies, and these have various advantages and disadvantages, but few unbiased comparisons of the range of approaches have been made. Methods: We simulated data from a defined "true tree" using a realistic evolutionary model. We built phylogenies from this data using a range of methods, and compared reconstructed trees to the true tree using two measures, noting the computational time needed for different phylogenetic reconstructions. We also used real data from Streptococcus pneumoniae alignments to compare individual core gene trees to a core genome tree. Results: We found that, as expected, maximum likelihood trees from good quality alignments were the most accurate, but also the most computationally intensive. Using less accurate phylogenetic reconstruction methods, we were able to obtain results of comparable accuracy; we found that approximate results can rapidly be obtained using genetic distance based methods. In real data we found that highly conserved core genes, such as those involved in translation, gave an inaccurate tree topology, whereas genes involved in recombination events gave inaccurate branch lengths. We also show a tree-of-trees, relating the results of different phylogenetic reconstructions to each other. Conclusions: We recommend three approaches, depending on requirements for accuracy and computational time. Quicker approaches that do not perform full maximum likelihood optimisation may be useful for many analyses requiring a phylogeny, as generating a high quality input alignment is likely to be the major limiting factor of accurate tree topology. We have publicly released our simulated data and code to enable further comparisons.
Parametric Alignment of Drosophila Genomes
The classic algorithms of Needleman--Wunsch and Smith--Waterman find a
maximum a posteriori probability alignment for a pair hidden Markov model
(PHMM). In order to process large genomes that have undergone complex genome
rearrangements, almost all existing whole genome alignment methods apply fast
heuristics to divide genomes into small pieces which are suitable for
Needleman--Wunsch alignment. In these alignment methods, it is standard
practice to fix the parameters and to produce a single alignment for subsequent
analysis by biologists.
Our main result is the construction of a whole genome parametric alignment of
Drosophila melanogaster and Drosophila pseudoobscura. Parametric alignment
resolves the issue of robustness to changes in parameters by finding all
optimal alignments for all possible parameters in a PHMM. Our alignment draws
on existing heuristics for dividing whole genomes into small pieces for
alignment, and it relies on advances we have made in computing convex polytopes
that allow us to parametrically align non-coding regions using biologically
realistic models. We demonstrate the utility of our parametric alignment for
biological inference by showing that cis-regulatory elements are more conserved
between Drosophila melanogaster and Drosophila pseudoobscura than previously
thought. We also show how whole genome parametric alignment can be used to
quantitatively assess the dependence of branch length estimates on alignment
parameters.
The alignment polytopes, software, and supplementary material can be
downloaded at http://bio.math.berkeley.edu/parametric/.
Comment: 19 pages, 3 figures
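To make the dependence on alignment parameters discussed in the abstract above concrete, the following is a minimal sketch of Needleman-Wunsch global alignment scoring with a linear gap penalty. It computes only the optimal score for one fixed parameter choice and does not construct alignment polytopes; the function name, the default parameter values, and the example sequences are assumptions for illustration.

```python
def needleman_wunsch_score(a, b, match=1, mismatch=-1, gap=-2):
    """Return the optimal global alignment score of sequences a and b.

    Uses the standard dynamic program with a linear gap penalty and keeps
    only two rows of the table. Parameter defaults are illustrative.
    """
    # Row 0: aligning the empty prefix of a against prefixes of b.
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        curr = [i * gap]  # prefix of a aligned against the empty prefix of b
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            up = prev[j] + gap        # gap in b
            left = curr[j - 1] + gap  # gap in a
            curr.append(max(diag, up, left))
        prev = curr
    return prev[-1]


# The optimal score (and, in general, the optimal alignment) changes with
# the parameters, which is the issue parametric alignment addresses.
print(needleman_wunsch_score("GATTACA", "GATCA"))
print(needleman_wunsch_score("GATTACA", "GATCA", gap=-1))
```

A single run like this fixes one point in parameter space; the parametric approach described in the abstract instead characterizes the optimal alignments over all parameter values.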
PATTERNA: transcriptome-wide search for functional RNA elements via structural data signatures.
Establishing a link between RNA structure and function remains a great challenge in RNA biology. The emergence of high-throughput structure profiling experiments is revolutionizing our ability to decipher structure, yet principled approaches for extracting information on structural elements directly from these data sets are lacking. We present PATTERNA, an unsupervised pattern recognition algorithm that rapidly mines RNA structure motifs from profiling data. We demonstrate that PATTERNA detects motifs with an accuracy comparable to commonly used thermodynamic models and highlight its utility in automating data-directed structure modeling from large data sets. PATTERNA is versatile and compatible with diverse profiling techniques and experimental conditions.
The Mathematics of Phylogenomics
The grand challenges in biology today are being shaped by powerful
high-throughput technologies that have revealed the genomes of many organisms,
global expression patterns of genes and detailed information about variation
within populations. We are therefore able to ask, for the first time,
fundamental questions about the evolution of genomes, the structure of genes
and their regulation, and the connections between genotypes and phenotypes of
individuals. The answers to these questions are all predicated on progress in a
variety of computational, statistical, and mathematical fields.
The rapid growth in the characterization of genomes has led to the
advancement of a new discipline called Phylogenomics. This discipline results
from the combination of two major fields in the life sciences: Genomics, i.e.,
the study of the function and structure of genes and genomes; and Molecular
Phylogenetics, i.e., the study of the hierarchical evolutionary relationships
among organisms and their genomes. The objective of this article is to offer
mathematicians a first introduction to this emerging field, and to discuss
specific mathematical problems and developments arising from phylogenomics.
Comment: 41 pages, 4 figures