207 research outputs found
Are There Rearrangement Hotspots in the Human Genome?
In a landmark paper, Nadeau and Taylor [18] formulated the random breakage model (RBM) of chromosome evolution that postulates that there are no rearrangement hotspots in the human genome. In the next two decades, numerous studies with progressively increasing levels of resolution made RBM the de facto theory of chromosome evolution. Despite the fact that RBM had prophetic prediction power, it was recently refuted by Pevzner and Tesler [4], who introduced the fragile breakage model (FBM), postulating that the human genome is a mosaic of solid regions (with low propensity for rearrangements) and fragile regions (rearrangement hotspots). However, the rebuttal of RBM caused a controversy and led to a split among researchers studying genome evolution. In particular, it remains unclear whether some complex rearrangements (e.g., transpositions) can create an appearance of rearrangement hotspots. We contribute to the ongoing debate by analyzing multi-break rearrangements that break a genome into multiple fragments and further glue them together in a new order. In particular, we demonstrate that (1) even if transpositions were a dominant force in mammalian evolution, the arguments in favor of FBM still stand, and (2) the “gene deletion” argument against FBM is flawed.
Colored de Bruijn Graphs and the Genome Halving Problem
Breakpoint graph analysis is a key algorithmic technique in studies of genome rearrangements. However, breakpoint graphs are defined only for genomes without duplicated genes, thus limiting their applications in rearrangement analysis. We discuss a connection between the breakpoint graphs and de Bruijn graphs that leads to a generalization of the notion of breakpoint graph for genomes with duplicated genes. We further use the generalized breakpoint graphs to study the Genome Halving Problem (first introduced and solved by Nadia El-Mabrouk and David Sankoff). The El-Mabrouk-Sankoff algorithm is rather complex, and, in this paper, we present an alternative approach that is based on generalized breakpoint graphs. The generalized breakpoint graphs make the El-Mabrouk-Sankoff result more transparent and promise to be useful in future studies of genome rearrangements.
Sum-of-squares lower bounds for planted clique
Finding cliques in random graphs and the closely related "planted" clique
variant, where a clique of size k is planted in a random G(n, 1/2) graph, have
been the focus of substantial study in algorithm design. Despite much effort,
the best known polynomial-time algorithms only solve the problem for k ~
sqrt(n).
In this paper we study the complexity of the planted clique problem under
algorithms from the Sum-of-squares hierarchy. We prove the first average case
lower bound for this model: for almost all graphs in G(n,1/2), r rounds of the
SOS hierarchy cannot find a planted k-clique unless k > n^{1/2r} (up to
logarithmic factors). Thus, for any constant number of rounds planted cliques
of size n^{o(1)} cannot be found by this powerful class of algorithms. This is
shown via an integrality gap for the natural formulation of the maximum clique
problem on random graphs for SOS and Lasserre hierarchies, which in turn follows
from degree lower bounds for the Positivstellensatz proof system.
We follow the usual recipe for such proofs. First, we introduce a natural
"dual certificate" (also known as a "vector-solution" or "pseudo-expectation")
for the given system of polynomial equations representing the problem for every
fixed input graph. Then we show that the matrix associated with this dual
certificate is PSD (positive semi-definite) with high probability over the
choice of the input graph. This requires the use of certain tools. One is the
theory of association schemes, and in particular the eigenspaces and
eigenvalues of the Johnson scheme. Another is a combinatorial method we develop
to compute (via traces) norm bounds for certain random matrices whose entries
are highly dependent; we hope this method will be useful elsewhere.
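The planted clique instance described in this abstract can be illustrated with a minimal sketch. It only generates a G(n, 1/2) graph, plants a k-clique, and checks that the planted set is a clique; it does not implement the Sum-of-squares hierarchy or any recovery algorithm. The function names `planted_clique_graph` and `is_clique` and the `seed` parameter are hypothetical and are not taken from the paper.

```python
import itertools
import random


def planted_clique_graph(n, k, seed=0):
    """Sample G(n, 1/2) and plant a clique on k random vertices.

    Hypothetical helper for illustration only; returns the edge set
    (as frozensets of vertex pairs) and the sorted planted vertices.
    """
    rng = random.Random(seed)
    # Each of the n*(n-1)/2 possible edges is present with probability 1/2.
    edges = {
        frozenset(pair)
        for pair in itertools.combinations(range(n), 2)
        if rng.random() < 0.5
    }
    # Choose k vertices and add every edge among them.
    clique = rng.sample(range(n), k)
    edges.update(frozenset(pair) for pair in itertools.combinations(clique, 2))
    return edges, sorted(clique)


def is_clique(edges, vertices):
    """Return True if every pair of the given vertices is joined by an edge."""
    return all(
        frozenset(pair) in edges for pair in itertools.combinations(vertices, 2)
    )


if __name__ == "__main__":
    edges, clique = planted_clique_graph(n=50, k=8)
    print(len(edges), clique, is_clique(edges, clique))
```

The hard part of the problem is recovering the planted vertices from `edges` alone when k is well below sqrt(n); this sketch only sets up the input.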
What is the difference between the breakpoint graph and the de Bruijn graph?
The breakpoint graph and the de Bruijn graph are two key data structures in the studies of genome rearrangements and genome assembly. However, the classical breakpoint graphs are defined on two genomes (represented as sequences of synteny blocks), while the classical de Bruijn graphs are defined on a single genome (represented as DNA strings). Thus, the connection between these two graph models is not explicit. We generalize the notions of both the breakpoint graph and the de Bruijn graph, and make it transparent that the breakpoint graph and the de Bruijn graph are mathematically equivalent. The explicit description of the connection between these important data structures provides a bridge between two previously separated bioinformatics communities studying genome rearrangements and genome assembly.
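For readers unfamiliar with the two classical structures contrasted in this abstract, here is a minimal sketch of each in its textbook form: a de Bruijn graph built from one DNA string, and the alternating cycles of a breakpoint graph built from two circular genomes given as signed synteny blocks. It does not implement the generalized graphs proposed in the paper, and all function names are hypothetical.

```python
from collections import defaultdict


def de_bruijn_graph(sequence, k):
    """Classical de Bruijn graph of one string: nodes are (k-1)-mers,
    and each k-mer contributes one edge from its prefix to its suffix."""
    graph = defaultdict(list)
    for i in range(len(sequence) - k + 1):
        kmer = sequence[i:i + k]
        graph[kmer[:-1]].append(kmer[1:])
    return dict(graph)


def adjacencies(genome):
    """Adjacency edges of a circular genome of signed synteny blocks.

    Block +b is read tail ("t") to head ("h"); block -b is read head to tail.
    """
    def ends(block):
        b = abs(block)
        return ((b, "t"), (b, "h")) if block > 0 else ((b, "h"), (b, "t"))

    edges = []
    for i, block in enumerate(genome):
        following = genome[(i + 1) % len(genome)]
        # Join the exit end of this block to the entry end of the next one.
        edges.append((ends(block)[1], ends(following)[0]))
    return edges


def breakpoint_graph_cycles(genome_a, genome_b):
    """Count alternating cycles in the breakpoint graph of two circular
    genomes over the same set of blocks (no duplicated blocks)."""
    match_a, match_b = {}, {}
    for matching, genome in ((match_a, genome_a), (match_b, genome_b)):
        for u, v in adjacencies(genome):
            matching[u] = v
            matching[v] = u
    seen, cycles = set(), 0
    for start in match_a:
        if start in seen:
            continue
        cycles += 1
        node = start
        while node not in seen:
            # Follow one edge of genome A, then one edge of genome B.
            seen.add(node)
            partner = match_a[node]
            seen.add(partner)
            node = match_b[partner]
    return cycles


if __name__ == "__main__":
    print(de_bruijn_graph("ACGTACG", 3))
    print(breakpoint_graph_cycles([1, 2, 3], [1, -2, 3]))
```

Note that the first function takes a single string while the second takes two block sequences, which is the mismatch in inputs that the abstract points out.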
SpectroGene: A Tool for Proteogenomic Annotations Using Top-Down Spectra
In the past decade, proteogenomics has emerged as a valuable technique that contributes to the state-of-the-art in genome annotation; however, previous proteogenomic studies were limited to bottom-up mass spectrometry and did not take advantage of top-down approaches. We show that top-down proteogenomics allows one to address the problems that remained beyond the reach of traditional bottom-up proteogenomics. In particular, we show that top-down proteogenomics leads to the discovery of previously unannotated genes even in extensively studied bacterial genomes and present SpectroGene, a software tool for genome annotation using top-down tandem mass spectra. We further show that top-down proteogenomics searches (against the six-frame translation of a genome) identify nearly all proteoforms found in traditional top-down proteomics searches (against the annotated proteome). SpectroGene is freely available at http://github.com/fenderglass/SpectroGene
The Fragile Breakage versus Random Breakage Models of Chromosome Evolution
For many years, studies of chromosome evolution were dominated by the random breakage theory, which implies that there are no rearrangement hot spots in the human genome. In 2003, Pevzner and Tesler argued against the random breakage model and proposed an alternative “fragile breakage” model of chromosome evolution. In 2004, Sankoff and Trinh argued against the fragile breakage model and raised doubts that Pevzner and Tesler provided any evidence of rearrangement hot spots. We investigate whether Sankoff and Trinh indeed revealed a flaw in the arguments of Pevzner and Tesler. We show that Sankoff and Trinh's synteny block identification algorithm makes erroneous identifications even in small toy examples and that their parameters do not reflect the realities of the comparative genomic architecture of human and mouse. We further argue that if Sankoff and Trinh had fixed these problems, their arguments in support of the random breakage model would disappear. Finally, we study the link between rearrangements and regulatory regions and argue that long regulatory regions and inhomogeneity of gene distribution in mammalian genomes may be responsible for the breakpoint reuse phenomenon.
De novo Inference of Diversity Genes and Analysis of Non-canonical V(DD)J Recombination in Immunoglobulins
The V(D)J recombination forms the immunoglobulin genes by joining the variable (V), diversity (D), and joining (J) germline genes. Since variations in germline genes have been linked to various diseases, personalized immunogenomics aims at finding alleles of germline genes across various patients. Although recent studies described algorithms for de novo inference of V and J genes from immunosequencing data, they stopped short of solving a more difficult problem of reconstructing D genes that form the highly divergent CDR3 regions and provide the most important contribution to the antigen binding. We present the IgScout algorithm for de novo D gene reconstruction and apply it to reveal new alleles of human D genes and previously unknown D genes in camel, an important model organism in immunology. We further analyze non-canonical V(DD)J recombination that results in unusually long CDR3s with tandem fused IGHD genes and thus expands the diversity of the antibody repertoires. We demonstrate that tandem CDR3s represent a consistent and functional feature of all analyzed immunosequencing datasets, reveal ultra-long CDR3s, and shed light on the mechanism responsible for their formation.
Whole Genome Duplications and Contracted Breakpoint Graphs
The genome halving problem, motivated by the whole genome duplication events in molecular evolution, was solved by El-Mabrouk and Sankoff in the pioneering paper [SIAM J. Comput., 32 (2003), pp. 754–792]. The El-Mabrouk–Sankoff algorithm is rather complex, inspiring a quest for a simpler solution. An alternative approach to the genome halving problem based on the notion of the contracted breakpoint graph was recently proposed in [M. A. Alekseyev and P. A. Pevzner, IEEE/ACM Trans. Comput. Biol. Bioinformatics, 4 (2007), pp. 98–107]. This new technique reveals that while the El-Mabrouk–Sankoff result is correct in most cases, it does not hold in the case of unichromosomal genomes. This raises a problem of correcting a flaw in the El-Mabrouk–Sankoff analysis and devising an algorithm that deals adequately with all genomes. In this paper we efficiently classify all genomes into two classes and show that while the El-Mabrouk–Sankoff theorem holds for the first class, it is incorrect for the second class. The crux of our analysis is a new combinatorial invariant defined on duplicated permutations. Using this invariant we were able to come up with a full proof of the genome halving theorem and a polynomial algorithm for the genome halving problem.
Identifying Repeat Domains in Large Genomes
We present a graph-based method for the analysis of repeat families in a repeat library. We build a repeat domain graph that decomposes a repeat library into repeat domains, short subsequences shared by multiple repeat families, and reveals the mosaic structure of repeat families. Our method recovers documented mosaic repeat structures and suggests additional putative ones. Our method is useful for elucidating the evolutionary history of repeats and annotating de novo generated repeat libraries.