Simple gene assembly as a rewriting of directed overlap-inclusion graphs
The simple intramolecular model for gene assembly in ciliates consists of three molecular operations: simple ld, simple hi, and simple dlad. Mathematical models in terms of signed permutations and signed strings proved limited in capturing some of the combinatorial details of the simple gene assembly process. Brijder and Hoogeboom introduced a new model in terms of overlap-inclusion graphs which could describe two of the three operations of the model and their combinatorial properties. To capture the third operation, we extended their framework to directed overlap-inclusion (DOI) graphs in Azimi et al. (2011) [1]. In this paper we introduce DOI graph-based rewriting rules that capture all three operations of the simple gene assembly model and prove that they are equivalent to the string-based formalization of the model. (C) 2012 Elsevier B.V. All rights reserved.
The Fibers and Range of Reduction Graphs in Ciliates
The biological process of gene assembly has been modeled based on three types
of string rewriting rules, called string pointer rules, defined on so-called
legal strings. It has been shown that reduction graphs, graphs that are based
on the notion of breakpoint graph in the theory of sorting by reversal, for
legal strings provide valuable insights into the gene assembly process. We
characterize which legal strings obtain the same reduction graph (up to
isomorphism), and moreover we characterize which graphs are (isomorphic to)
reduction graphs. Comment: 24 pages, 13 figures
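As a rough illustration of the legal strings and string pointer rules mentioned in the abstract above, the sketch below encodes a pointer p as a positive integer and its barred (inverted) copy as -p. This encoding and the function names is_legal, snr and spr are assumptions made for the example, not notation from the paper. It covers only the negative rule (remove an adjacent pair p p) and the positive rule (replace p u p-bar by the inversion of u); the double rule and reduction graphs are left out.

```python
from collections import Counter

def is_legal(s):
    """A string is legal if every pointer it contains occurs exactly twice,
    ignoring sign (assumed encoding: p is a positive int, barred p is -p)."""
    counts = Counter(abs(x) for x in s)
    return all(x != 0 for x in s) and all(c == 2 for c in counts.values())

def snr(s, p):
    """String negative rule: remove an adjacent pair 'p p' with equal signs.
    Returns the new string, or None if the rule does not apply."""
    for i in range(len(s) - 1):
        if abs(s[i]) == abs(p) and s[i] == s[i + 1]:
            return s[:i] + s[i + 2:]
    return None

def spr(s, p):
    """String positive rule: rewrite u1 p u2 p-bar u3 as u1 inv(u2) u3, where
    inv reverses u2 and flips every sign. Returns None if not applicable."""
    pos = [k for k, x in enumerate(s) if abs(x) == abs(p)]
    if len(pos) != 2:
        return None
    i, j = pos
    if s[i] != -s[j]:
        return None
    inverted = [-x for x in reversed(s[i + 1:j])]
    return s[:i] + inverted + s[j + 1:]

if __name__ == "__main__":
    s = [2, 3, -2, 3]
    print(is_legal(s))   # legality check
    t = spr(s, 2)        # apply the positive rule on pointer 2
    print(t, spr(t, 3))  # then on pointer 3
```

Lists of signed integers keep the inversion step to a single comprehension; a string-based encoding would need a separate marker for barred pointers.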
Safe and complete contig assembly via omnitigs
Contig assembly is the first stage that most assemblers solve when
reconstructing a genome from a set of reads. Its output consists of contigs --
a set of strings that are promised to appear in any genome that could have
generated the reads. From the introduction of contigs 20 years ago, assemblers
have tried to obtain longer and longer contigs, but the following question was
never solved: given a genome graph (e.g. a de Bruijn, or a string graph),
what are all the strings that can be safely reported from it as contigs? In
this paper we finally answer this question, and also give a polynomial time
algorithm to find them. Our experiments show that these strings, which we call
omnitigs, are 66% to 82% longer on average than the popular unitigs, and 29% of
dbSNP locations have more neighbors in omnitigs than in unitigs. Comment: Full version of the paper in the proceedings of RECOMB 201
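For context on the unitig baseline that the abstract above compares against, here is a minimal sketch that builds a node-centric de Bruijn graph from reads and reports its maximal non-branching paths. It is not the omnitig algorithm from the paper. The function names, the error-free reads, and the choice to skip isolated cycles are all assumptions made for the illustration.

```python
from collections import defaultdict

def de_bruijn(reads, k):
    """Build successor/predecessor maps: each distinct k-mer is an edge
    from its (k-1)-mer prefix to its (k-1)-mer suffix."""
    succ, pred = defaultdict(set), defaultdict(set)
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            succ[kmer[:-1]].add(kmer[1:])
            pred[kmer[1:]].add(kmer[:-1])
    return succ, pred

def unitigs(reads, k):
    """Spell the maximal non-branching paths of the graph.
    Isolated cycles are ignored in this simplified sketch."""
    succ, pred = de_bruijn(reads, k)
    nodes = sorted(set(succ) | set(pred))

    def is_internal(n):
        # Exactly one way in and one way out: the path passes straight through.
        return len(pred[n]) == 1 and len(succ[n]) == 1

    out = []
    for n in nodes:
        if is_internal(n):
            continue
        for m in sorted(succ[n]):
            path = n + m[-1]
            while is_internal(m):
                (m,) = succ[m]   # the unique successor
                path += m[-1]
            out.append(path)
    return out

if __name__ == "__main__":
    # Two reads that share a prefix and then diverge.
    print(unitigs(["ACGTT", "ACGTA"], k=3))
```

A unitig stops at every branching node, whereas the abstract describes omnitigs as the strings that remain safe to report beyond such points, which is why they come out longer on average.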
Canonical, Stable, General Mapping using Context Schemes
Motivation: Sequence mapping is the cornerstone of modern genomics. However,
most existing sequence mapping algorithms are insufficiently general.
Results: We introduce context schemes: a method that allows the unambiguous
recognition of a reference base in a query sequence by testing the query for
substrings from an algorithmically defined set. Context schemes only map when
there is a unique best mapping, and define this criterion uniformly for all
reference bases. Mappings under context schemes can also be made stable, so
that extension of the query string (e.g. by increasing read length) will not
alter the mapping of previously mapped positions. Context schemes are general
in several senses. They natively support the detection of arbitrary complex,
novel rearrangements relative to the reference. They can scale over orders of
magnitude in query sequence length. Finally, they are trivially extensible to
more complex reference structures, such as graphs, that incorporate additional
variation. We demonstrate empirically the existence of high performance context
schemes, and present efficient context scheme mapping algorithms.
Availability and Implementation: The software test framework created for this
work is available from
https://registry.hub.docker.com/u/adamnovak/sequence-graphs/.
Contact: [email protected]
Supplementary Information: Six supplementary figures and one supplementary
section are available with the online version of this article. Comment: Submission for Bioinformatics