Sensitive Long-Indel-Aware Alignment of Sequencing Reads
The tremendous advances in high-throughput sequencing technologies have made
population-scale sequencing as performed in the 1000 Genomes project and the
Genome of the Netherlands project possible. Next-generation sequencing has
allowed genome-wide discovery of variations beyond single-nucleotide
polymorphisms (SNPs), in particular of structural variations (SVs) like
deletions, insertions, duplications, translocations, inversions, and even more
complex rearrangements. Here, we design a read aligner with special emphasis on
the following properties: (1) high sensitivity, i.e. find all (reasonable)
alignments; (2) ability to find (long) indels; (3) statistically sound
alignment scores; and (4) runtime fast enough to be applied to whole genome
data. We compare performance to BWA, bowtie2, and stampy and find that our method
is especially advantageous on reads containing larger indels.
Bacterial microevolution and the Pangenome
The comparison of multiple genome sequences sampled from a bacterial population reveals considerable diversity in both the core and the accessory parts of the pangenome. This diversity can be analysed in terms of microevolutionary events that took place since the genomes shared a common ancestor, especially deletion, duplication, and recombination. We review the basic modelling ingredients used implicitly or explicitly when performing such a pangenome analysis. In particular, we describe a basic neutral phylogenetic framework of bacterial pangenome microevolution, which is not incompatible with evaluating the role of natural selection. We survey the different ways in which pangenome data is summarised in order to be included in microevolutionary models, as well as the main methodological approaches that have been proposed to reconstruct pangenome microevolutionary history.
Parameter estimation in pair hidden Markov models
This paper deals with parameter estimation in pair hidden Markov models
(pair-HMMs). We first provide a rigorous formalism for these models and discuss
possible definitions of likelihoods. The model being biologically motivated,
some restrictions with respect to the full parameter space naturally occur.
The existence of two different information divergence rates is established, and the
divergence property (namely positivity at values different from the true one)
is shown under additional assumptions. This yields consistency for the
parameter in parametrization schemes for which the divergence property holds.
Simulations illustrate different cases which are not covered by our results.
Comment: corrected typo
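As an illustration of what a pair-HMM likelihood can look like, the following is a minimal sketch of the forward algorithm for a textbook three-state pair-HMM (match, insert in x, insert in y). It is not the formalism of the paper: the parameter names `delta` (gap open), `epsilon` (gap extend), and `mu` (mismatch rate), the uniform DNA emissions, and the absence of an explicit end state are all assumptions made for the example.

```python
def pair_hmm_likelihood(x, y, delta=0.1, epsilon=0.3, mu=0.1):
    """Forward likelihood of observing (x, y), summed over all alignments.

    Assumed toy model: states M (emit a pair), X (emit from x only),
    Y (emit from y only); no direct X<->Y transitions; no end state.
    """
    n, m = len(x), len(y)
    q = 0.25  # assumed uniform single-letter emission over A, C, G, T

    def pair_prob(a, b):
        # Assumed pair emission: the 16 pair probabilities sum to 1.
        return (1 - mu) / 4 if a == b else mu / 12

    f_m = [[0.0] * (m + 1) for _ in range(n + 1)]
    f_x = [[0.0] * (m + 1) for _ in range(n + 1)]
    f_y = [[0.0] * (m + 1) for _ in range(n + 1)]
    f_m[0][0] = 1.0  # start in the match state before any emission

    for i in range(n + 1):
        for j in range(m + 1):
            if i == 0 and j == 0:
                continue
            if i > 0 and j > 0:
                f_m[i][j] = pair_prob(x[i - 1], y[j - 1]) * (
                    (1 - 2 * delta) * f_m[i - 1][j - 1]
                    + (1 - epsilon) * (f_x[i - 1][j - 1] + f_y[i - 1][j - 1])
                )
            if i > 0:
                f_x[i][j] = q * (delta * f_m[i - 1][j] + epsilon * f_x[i - 1][j])
            if j > 0:
                f_y[i][j] = q * (delta * f_m[i][j - 1] + epsilon * f_y[i][j - 1])

    return f_m[n][m] + f_x[n][m] + f_y[n][m]
```

Maximising such a likelihood over the parameters is one possible estimation criterion. The abstract notes that several definitions of the likelihood are possible, and this sketch fixes only one of them.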
CLEVER: Clique-Enumerating Variant Finder
Next-generation sequencing techniques have facilitated a large scale analysis
of human genetic variation. Despite the advances in sequencing speeds, the
computational discovery of structural variants is not yet standard. It is
likely that many variants have remained undiscovered in most sequenced
individuals. Here we present a novel internal segment size based approach,
which organizes all reads, including concordant ones, into a read alignment
graph where max-cliques represent maximal contradiction-free groups of
alignments. A specifically engineered algorithm then enumerates all max-cliques
and statistically evaluates them for their potential to reflect insertions or
deletions (indels). For the first time in the literature, we compare a large
range of state-of-the-art approaches using simulated Illumina reads from a
fully annotated genome and present various relevant performance statistics. We
achieve superior performance rates in particular on indels of sizes 20--100,
which have been exposed as a current major challenge in the SV discovery
literature and where prior insert size based approaches have limitations. In
that size range, we outperform even split read aligners. We also achieve good
results on real data, where we make a substantial number of correct
predictions as the only tool, which complement the predictions of split-read
aligners. CLEVER is open source (GPL) and available from
http://clever-sv.googlecode.com.
Comment: 30 pages, 8 figures
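To make the notion of max-cliques in a read alignment graph concrete, here is a minimal sketch that enumerates maximal cliques with the generic Bron-Kerbosch algorithm with pivoting. This is a stand-in, not CLEVER's specifically engineered enumeration algorithm. The input format (a dict mapping each alignment to the set of alignments it is compatible with) and the function name `maximal_cliques` are assumptions made for the example.

```python
def maximal_cliques(adj):
    """Enumerate all maximal cliques of an undirected graph.

    adj: dict mapping each node to the set of its neighbours
         (symmetric, with no self-loops). Hypothetical input format.
    Returns a list of frozensets, one per maximal clique.
    """
    cliques = []

    def expand(r, p, x):
        # r: current clique, p: candidates to extend it, x: already explored
        if not p and not x:
            cliques.append(frozenset(r))
            return
        # Pivot on the node with the most neighbours among the candidates.
        pivot = max(p | x, key=lambda v: len(adj[v] & p))
        for v in list(p - adj[pivot]):
            expand(r | {v}, p & adj[v], x & adj[v])
            p = p - {v}
            x = x | {v}

    expand(set(), set(adj), set())
    return cliques


# Toy graph: alignments a, b, c are mutually compatible; d is compatible only with c.
toy = {
    "a": {"b", "c"},
    "b": {"a", "c"},
    "c": {"a", "b", "d"},
    "d": {"c"},
}
groups = maximal_cliques(toy)
```

In this toy graph the maximal cliques are {a, b, c} and {c, d}. In the setting described by the abstract, each such group would then be evaluated statistically for evidence of an insertion or deletion.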
Fast Hierarchical Clustering and Other Applications of Dynamic Closest Pairs
We develop data structures for dynamic closest pair problems with arbitrary
distance functions, that do not necessarily come from any geometric structure
on the objects. Based on a technique previously used by the author for
Euclidean closest pairs, we show how to insert and delete objects from an
n-object set, maintaining the closest pair, in O(n log^2 n) time per update and
O(n) space. With quadratic space, we can instead use a quadtree-like structure
to achieve an optimal time bound, O(n) per update. We apply these data
structures to hierarchical clustering, greedy matching, and TSP heuristics, and
discuss other potential applications in machine learning, Groebner bases, and
local improvement algorithms for partition and placement problems. Experiments
show our new methods to be faster in practice than previously used heuristics.
Comment: 20 pages, 9 figures. A preliminary version of this paper appeared at
the 9th ACM-SIAM Symp. on Discrete Algorithms, San Francisco, 1998, pp.
619-628. For source code and experimental results, see
http://www.ics.uci.edu/~eppstein/projects/pairs
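For orientation, the interface of a dynamic closest-pair structure can be sketched with a brute-force baseline that recomputes the closest pair on every query. This is not one of the data structures from the paper and comes nowhere near its time bounds. The class name `NaiveClosestPair` and the `greedy_matching` helper are hypothetical, and the sketch assumes the objects are hashable and sortable.

```python
from itertools import combinations


class NaiveClosestPair:
    """Brute-force baseline: O(n^2) distance evaluations per query."""

    def __init__(self, dist):
        self.dist = dist      # arbitrary distance function on two objects
        self.items = set()

    def insert(self, obj):
        self.items.add(obj)

    def delete(self, obj):
        self.items.discard(obj)

    def closest_pair(self):
        """Return the closest pair, or None if fewer than two objects remain."""
        if len(self.items) < 2:
            return None
        # Sorting makes tie-breaking deterministic.
        pairs = combinations(sorted(self.items), 2)
        return min(pairs, key=lambda p: self.dist(*p))


def greedy_matching(points, dist):
    """Repeatedly match and remove the closest remaining pair."""
    cp = NaiveClosestPair(dist)
    for p in points:
        cp.insert(p)
    matching = []
    while (pair := cp.closest_pair()) is not None:
        matching.append(pair)
        cp.delete(pair[0])
        cp.delete(pair[1])
    return matching


result = greedy_matching([0, 1, 10, 12, 30], lambda a, b: abs(a - b))
```

Here `result` is `[(0, 1), (10, 12)]`, with 30 left unmatched. Swapping the baseline for a faster structure with the same three operations would leave `greedy_matching` unchanged, which is the kind of application the abstract describes.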
Implementation of a Human-Computer Interface for Computer Assisted Translation and Handwritten Text Recognition
A human-computer interface is developed to provide services of computer assisted machine translation (CAT) and computer assisted transcription of handwritten text images (CATTI). The back-end machine translation (MT) and handwritten text recognition (HTR) systems are provided by the Pattern Recognition and Human Language Technology (PRHLT) research group. The idea is to provide users with easy-to-use tools that make interactive translation and transcription feasible tasks. The assisted service is provided by remote servers with CAT or CATTI capabilities. The interface supplies the user with tools for efficient local editing: deletion, insertion and substitution.
Ocampo Sepúlveda, JC. (2009). Implementation of a Human-Computer Interface for Computer Assisted Translation and Handwritten Text Recognition. http://hdl.handle.net/10251/14318