Search CORE

13 research outputs found

Parametric Alignment of Drosophila Genomes

Author: Dewey Colin
Huggins Peter
Pachter Lior
Sturmfels Bernd
Woods Kevin
Publication venue: 'Public Library of Science (PLoS)'
Publication date: 01/01/2005
Field of study

The classic algorithms of Needleman--Wunsch and Smith--Waterman find a maximum a posteriori probability alignment for a pair hidden Markov model (PHMM). In order to process large genomes that have undergone complex genome rearrangements, almost all existing whole genome alignment methods apply fast heuristics to divide genomes into small pieces which are suitable for Needleman--Wunsch alignment. In these alignment methods, it is standard practice to fix the parameters and to produce a single alignment for subsequent analysis by biologists. Our main result is the construction of a whole genome parametric alignment of Drosophila melanogaster and Drosophila pseudoobscura. Parametric alignment resolves the issue of robustness to changes in parameters by finding all optimal alignments for all possible parameters in a PHMM. Our alignment draws on existing heuristics for dividing whole genomes into small pieces for alignment, and it relies on advances we have made in computing convex polytopes that allow us to parametrically align non-coding regions using biologically realistic models. We demonstrate the utility of our parametric alignment for biological inference by showing that cis-regulatory elements are more conserved between Drosophila melanogaster and Drosophila pseudoobscura than previously thought. We also show how whole genome parametric alignment can be used to quantitatively assess the dependence of branch length estimates on alignment parameters. The alignment polytopes, software, and supplementary material can be downloaded at http://bio.math.berkeley.edu/parametric/.Comment: 19 pages, 3 figure

arXiv.org e-Print Archive

CiteSeerX

Public Library of Science (PLOS)

Crossref

Directory of Open Access Journals

PubMed Central

Caltech Authors

Towards the Human Genotope

Author: Huggins Peter
Pachter Lior
Sturmfels Bernd
Publication venue
Publication date: 25/12/2006
Field of study

The human genotope is the convex hull of all allele frequency vectors that can be obtained from the genotypes present in the human population. In this paper we take a few initial steps towards a description of this object, which may be fundamental for future population based genetics studies. Here we use data from the HapMap Project, restricted to two ENCODE regions, to study a subpolytope of the human genotope. We study three different approaches for obtaining informative low-dimensional projections of this subpolytope. The projections are specified by projection onto few tag SNPs, principal component analysis, and archetypal analysis. We describe the application of our geometric approach to identifying structure in populations based on single nucleotide polymorphisms

arXiv.org e-Print Archive

CiteSeerX

Caltech Authors

Pair HMM based gap statistics for re-evaluation of indels in alignments with affine gap penalties: Extended Version

Author: Sahinalp S. Cenk
Salari Raheleh
Schönhuth Alexander
Publication venue
Publication date: 11/06/2010
Field of study

Although computationally aligning sequence is a crucial step in the vast majority of comparative genomics studies our understanding of alignment biases still needs to be improved. To infer true structural or homologous regions computational alignments need further evaluation. It has been shown that the accuracy of aligned positions can drop substantially in particular around gaps. Here we focus on re-evaluation of score-based alignments with affine gap penalty costs. We exploit their relationships with pair hidden Markov models and develop efficient algorithms by which to identify gaps which are significant in terms of length and multiplicity. We evaluate our statistics with respect to the well-established structural alignments from SABmark and find that indel reliability substantially increases with their significance in particular in worst-case twilight zone alignments. This points out that our statistics can reliably complement other methods which mostly focus on the reliability of match positions.Comment: 17 pages, 7 figure

arXiv.org e-Print Archive

CWI's Institutional Repository

REDfly 2.0: an integrated database of cis-regulatory modules and transcription factor binding sites in Drosophila

Author: Brody
C. M. Bergman
De Renzis
Down
Drysdale
Eilbeck
Li
Lyne
M. S. Halfon
Macdonald
Papatsenko
Pollard
S. M. Gallo
Sandmann
Sandmann
Stein
Tomancak
Wray
Publication venue: Oxford University Press
Publication date: 01/01/2008
Field of study

The identification and study of the cis-regulatory elements that control gene expression are important areas of biological research, but few resources exist to facilitate large-scale bioinformatics studies of cis-regulation in metazoan species. Drosophila melanogaster, with its well-annotated genome, exceptional resources for comparative genomics and long history of experimental studies of transcriptional regulation, represents the ideal system for regulatory bioinformatics. We have merged two existing Drosophila resources, the REDfly database of cis-regulatory modules and the FlyReg database of transcription factor binding sites (TFBSs), into a single integrated database containing extensive annotation of empirically validated cis-regulatory modules and their constituent binding sites. With the enhanced functionality made possible through this integration of TFBS data into REDfly, together with additional improvements to the REDfly infrastructure, we have constructed a one-stop portal for Drosophila cis-regulatory data that will serve as a powerful resource for both computational and experimental studies of transcriptional regulation. REDfly is freely accessible at http://redfly.ccr.buffalo.edu

Crossref

PubMed Central

The University of Manchester - Institutional Repository

REDfly 2.0: an integrated database of cis-regulatory modules and transcription factor binding sites in Drosophila

Author: M. S. Halfon
S. M. Gallo
C. M. Bergman
Wray
Li
Brody
De Renzis
Sandmann
Sandmann
Macdonald
Papatsenko
Down
Pollard
Lyne
Stein
Drysdale
Tomancak
Eilbeck
Publication venue: Oxford University Press
Publication date: 01/01/2008
Field of study

Crossref

DBIS EPub

PubMed Central

The University of Manchester - Institutional Repository

University of Twente Research Information

Lower Bounds for Optimal Alignments of Binary Sequences

Author: Cynthia Vinzant
Fernández-Baca
Fernández-Baca
Gusfield
Gusfield
Pachter
Waterman
Publication venue: 'Elsevier BV'
Publication date: 18/01/2011
Field of study

In parametric sequence alignment, optimal alignments of two sequences are computed as a function of the penalties for mismatches and spaces, producing many different optimal alignments. Here we give a 3/(2^{7/3}\pi^{2/3})n^{2/3} +O(n^{1/3} \log n) lower bound on the maximum number of distinct optimal alignment summaries of length-n binary sequences. This shows that the upper bound given by Gusfield et. al. is tight over all alphabets, thereby disproving the "square root of n conjecture". Thus the maximum number of distinct optimal alignment summaries (i.e. vertices of the alignment polytope) over all pairs of length-n sequences is Theta(n^{2/3}).Comment: 12 pages, 3 figures, submitted to Discrete Applied Mathematic

arXiv.org e-Print Archive

Crossref

Parameters for accurate genome alignment

Author: A Morgulis
A Morgulis
A Schwartz
A Stark
B Paten
CH Yuh
CN Dewey
D Gusfield
D Karolchik
D States
DA Pollard
E Kim
EH Margulies
F Chiaromonte
G Benson
G Lunter
G Lunter
I Holmes
J Ruan
J Wang
JC Wootton
JE Janecka
JO Kriegs
JT Reese
KD Pruitt
KM Wong
LA Newberg
LE Carvalho
M Brudno
M Hamada
Martin C Frith
MC Frith
Michiaki Hamada
MS Waterman
Paul Horton
PP Gardner
R Durbin
RC Friedman
RK Bradley
S Karlin
S Kumar
S Miyazawa
S Schwartz
S Sheetlin
SF Altschul
SF Altschul
TJ Treangen
W Huang
WJ Kent
WJ Kent
YK Yu
Z Zhang
Publication venue: BioMed Central
Publication date: 01/01/2010
Field of study

Abstract Background Genome sequence alignments form the basis of much research. Genome alignment depends on various mundane but critical choices, such as how to mask repeats and which score parameters to use. Surprisingly, there has been no large-scale assessment of these choices using real genomic data. Moreover, rigorous procedures to control the rate of spurious alignment have not been employed. Results We have assessed 495 combinations of score parameters for alignment of animal, plant, and fungal genomes. As our gold-standard of accuracy, we used genome alignments implied by multiple alignments of proteins and of structural RNAs. We found the HOXD scoring schemes underlying alignments in the UCSC genome database to be far from optimal, and suggest better parameters. Higher values of the X-drop parameter are not always better. E-values accurately indicate the rate of spurious alignment, but only if tandem repeats are masked in a non-standard way. Finally, we show that γ-centroid (probabilistic) alignment can find highly reliable subsets of aligned bases. Conclusions These results enable more accurate genome alignment, with reliability measures for local alignments and for individual aligned bases. This study was made possible by our new software, LAST, which can align vertebrate genomes in a few hours <url>http://last.cbrc.jp/</url>.</p

CiteSeerX

Crossref

Springer - Publisher Connector

Directory of Open Access Journals

PubMed Central

Parametric Analysis of RNA Branching Configurations

Author: Christine E. Heitsch
Valerie Hower
Publication venue: Springer Nature
Publication date: 01/01/2011
Field of study

Springer - Publisher Connector