13 research outputs found
Parametric Alignment of Drosophila Genomes
The classic algorithms of Needleman--Wunsch and Smith--Waterman find a
maximum a posteriori probability alignment for a pair hidden Markov model
(PHMM). In order to process large genomes that have undergone complex genome
rearrangements, almost all existing whole genome alignment methods apply fast
heuristics to divide genomes into small pieces which are suitable for
Needleman--Wunsch alignment. In these alignment methods, it is standard
practice to fix the parameters and to produce a single alignment for subsequent
analysis by biologists.
Our main result is the construction of a whole genome parametric alignment of
Drosophila melanogaster and Drosophila pseudoobscura. Parametric alignment
resolves the issue of robustness to changes in parameters by finding all
optimal alignments for all possible parameters in a PHMM. Our alignment draws
on existing heuristics for dividing whole genomes into small pieces for
alignment, and it relies on advances we have made in computing convex polytopes
that allow us to parametrically align non-coding regions using biologically
realistic models. We demonstrate the utility of our parametric alignment for
biological inference by showing that cis-regulatory elements are more conserved
between Drosophila melanogaster and Drosophila pseudoobscura than previously
thought. We also show how whole genome parametric alignment can be used to
quantitatively assess the dependence of branch length estimates on alignment
parameters.
The alignment polytopes, software, and supplementary material can be
downloaded at http://bio.math.berkeley.edu/parametric/.Comment: 19 pages, 3 figure
Towards the Human Genotope
The human genotope is the convex hull of all allele frequency vectors that
can be obtained from the genotypes present in the human population. In this
paper we take a few initial steps towards a description of this object, which
may be fundamental for future population based genetics studies. Here we use
data from the HapMap Project, restricted to two ENCODE regions, to study a
subpolytope of the human genotope. We study three different approaches for
obtaining informative low-dimensional projections of this subpolytope. The
projections are specified by projection onto few tag SNPs, principal component
analysis, and archetypal analysis. We describe the application of our geometric
approach to identifying structure in populations based on single nucleotide
polymorphisms
Pair HMM based gap statistics for re-evaluation of indels in alignments with affine gap penalties: Extended Version
Although computationally aligning sequence is a crucial step in the vast
majority of comparative genomics studies our understanding of alignment biases
still needs to be improved. To infer true structural or homologous regions
computational alignments need further evaluation. It has been shown that the
accuracy of aligned positions can drop substantially in particular around gaps.
Here we focus on re-evaluation of score-based alignments with affine gap
penalty costs. We exploit their relationships with pair hidden Markov models
and develop efficient algorithms by which to identify gaps which are
significant in terms of length and multiplicity. We evaluate our statistics
with respect to the well-established structural alignments from SABmark and
find that indel reliability substantially increases with their significance in
particular in worst-case twilight zone alignments. This points out that our
statistics can reliably complement other methods which mostly focus on the
reliability of match positions.Comment: 17 pages, 7 figure
REDfly 2.0: an integrated database of cis-regulatory modules and transcription factor binding sites in Drosophila
The identification and study of the cis-regulatory elements that control gene expression are important areas of biological research, but few resources exist to facilitate large-scale bioinformatics studies of cis-regulation in metazoan species. Drosophila melanogaster, with its well-annotated genome, exceptional resources for comparative genomics and long history of experimental studies of transcriptional regulation, represents the ideal system for regulatory bioinformatics. We have merged two existing Drosophila resources, the REDfly database of cis-regulatory modules and the FlyReg database of transcription factor binding sites (TFBSs), into a single integrated database containing extensive annotation of empirically validated cis-regulatory modules and their constituent binding sites. With the enhanced functionality made possible through this integration of TFBS data into REDfly, together with additional improvements to the REDfly infrastructure, we have constructed a one-stop portal for Drosophila cis-regulatory data that will serve as a powerful resource for both computational and experimental studies of transcriptional regulation. REDfly is freely accessible at http://redfly.ccr.buffalo.edu
REDfly 2.0: an integrated database of cis-regulatory modules and transcription factor binding sites in Drosophila
The identification and study of the cis-regulatory elements that control gene expression are important areas of biological research, but few resources exist to facilitate large-scale bioinformatics studies of cis-regulation in metazoan species. Drosophila melanogaster, with its well-annotated genome, exceptional resources for comparative genomics and long history of experimental studies of transcriptional regulation, represents the ideal system for regulatory bioinformatics. We have merged two existing Drosophila resources, the REDfly database of cis-regulatory modules and the FlyReg database of transcription factor binding sites (TFBSs), into a single integrated database containing extensive annotation of empirically validated cis-regulatory modules and their constituent binding sites. With the enhanced functionality made possible through this integration of TFBS data into REDfly, together with additional improvements to the REDfly infrastructure, we have constructed a one-stop portal for Drosophila cis-regulatory data that will serve as a powerful resource for both computational and experimental studies of transcriptional regulation. REDfly is freely accessible at http://redfly.ccr.buffalo.edu
Lower Bounds for Optimal Alignments of Binary Sequences
In parametric sequence alignment, optimal alignments of two sequences are
computed as a function of the penalties for mismatches and spaces, producing
many different optimal alignments. Here we give a 3/(2^{7/3}\pi^{2/3})n^{2/3}
+O(n^{1/3} \log n) lower bound on the maximum number of distinct optimal
alignment summaries of length-n binary sequences. This shows that the upper
bound given by Gusfield et. al. is tight over all alphabets, thereby disproving
the "square root of n conjecture". Thus the maximum number of distinct optimal
alignment summaries (i.e. vertices of the alignment polytope) over all pairs of
length-n sequences is Theta(n^{2/3}).Comment: 12 pages, 3 figures, submitted to Discrete Applied Mathematic
Parameters for accurate genome alignment
<p>Abstract</p> <p>Background</p> <p>Genome sequence alignments form the basis of much research. Genome alignment depends on various mundane but critical choices, such as how to mask repeats and which score parameters to use. Surprisingly, there has been no large-scale assessment of these choices using real genomic data. Moreover, rigorous procedures to control the rate of spurious alignment have not been employed.</p> <p>Results</p> <p>We have assessed 495 combinations of score parameters for alignment of animal, plant, and fungal genomes. As our gold-standard of accuracy, we used genome alignments implied by multiple alignments of proteins and of structural RNAs. We found the HOXD scoring schemes underlying alignments in the UCSC genome database to be far from optimal, and suggest better parameters. Higher values of the X-drop parameter are not always better. E-values accurately indicate the rate of spurious alignment, but only if tandem repeats are masked in a non-standard way. Finally, we show that γ-centroid (probabilistic) alignment can find highly reliable subsets of aligned bases.</p> <p>Conclusions</p> <p>These results enable more accurate genome alignment, with reliability measures for local alignments and for individual aligned bases. This study was made possible by our new software, LAST, which can align vertebrate genomes in a few hours <url>http://last.cbrc.jp/</url>.</p