146 research outputs found

CoCoNUT: an efficient system for the comparison and analysis of genomes

Author: A Darling
A Kasprzyk
B Haas
B Ma
B Mau
B Morgenstern
B Raphael
C Wawra
DR Bentley
E Mardis
E Ohlebusch
E Passarge
E Sonnhammer
Enno Ohlebusch
G Bourque
G Gremme
I Ovcharenko
J Krumsiek
J Peterson
J Thompson
L Florea
M Abouelhoda
M Blanchette
M Brudno
M Clamp
M Höhl
M Kellis
M Margulies
Mohamed I Abouelhoda
P Chain
R Staden
S Altschul
S Karlin
S Kurtz
S Ranganathan
S Schwartz
S Shibuya
Stefan Kurtz
T Treangen
T Vision
T Wu
The Arabidopsis Genome Initiative
W Kent
Publication venue: BioMed Central
Publication date: 01/01/2008

Crossref

Springer - Publisher Connector

PubMed Central

On the suitability of suffix arrays for Lempel-Ziv data compression

Author: D. Gusfield
D. Salomon
E. McCreight
E. Ukkonen
J. Kärkkäinen
J. Storer
J. Ziv
K. Sadakane
M. Abouelhoda
U. Manber
Publication venue: Springer-Verlag Berlin
Publication date: 01/01/2009

Lossless compression algorithms of the Lempel-Ziv (LZ) family are widely used. In terms of time and memory requirements, LZ encoding is much more demanding than decoding. To speed up the encoding process, efficient data structures, such as suffix trees, have been used. In this paper, we explore the use of suffix arrays to hold the dictionary of the LZ encoder, and propose an algorithm to search over it. We show that the resulting encoder attains roughly the same compression ratios as those based on suffix trees. However, the amount of memory required by the suffix array is fixed, and much lower than the variable amount of memory used by encoders based on suffix trees (which depends on the text to encode). We conclude that suffix arrays, when compared to suffix trees in terms of the trade-off among time, memory, and compression ratio, may be preferable in scenarios (e.g., embedded systems) where memory is at a premium and high speed is not critical.
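
As a rough illustration of what searching a suffix-array dictionary for the longest match can look like, here is a minimal Python sketch. It is not the search algorithm proposed in the paper, which the abstract does not spell out. The function names, the naive suffix-array construction, and the binary-search strategy are all assumptions made for illustration.

```python
def build_suffix_array(text):
    # Naive construction by sorting all suffixes; fine for small
    # illustrative inputs, far too slow for real encoders.
    return sorted(range(len(text)), key=lambda i: text[i:])


def common_prefix_len(text, start, pattern):
    # Length of the longest common prefix of text[start:] and pattern.
    n = 0
    while (start + n < len(text) and n < len(pattern)
           and text[start + n] == pattern[n]):
        n += 1
    return n


def longest_match(text, sa, pattern):
    # Binary search for the position where pattern would be inserted
    # among the sorted suffixes.
    lo, hi = 0, len(sa)
    while lo < hi:
        mid = (lo + hi) // 2
        if text[sa[mid]:] < pattern:
            lo = mid + 1
        else:
            hi = mid
    # The suffix sharing the longest prefix with pattern is one of the
    # two lexicographic neighbours of the insertion point.
    best = (0, 0)  # (position in text, match length)
    for idx in (lo - 1, lo):
        if 0 <= idx < len(sa):
            length = common_prefix_len(text, sa[idx], pattern)
            if length > best[1]:
                best = (sa[idx], length)
    return best
```

For the dictionary text "abracadabra", `longest_match(text, sa, "acad")` returns `(3, 4)`: the lookahead "acad" matches four characters starting at position 3. The array stores one integer per text position, which is consistent with the fixed memory footprint the abstract describes.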

Repositório Científico do Instituto Politécnico de Lisboa

Crossref

SeqAn: An efficient, generic C++ library for sequence analysis

Author: A Darling
A Fabri
A Halpern
Andreas Döring
C Notredame
D Butt
D Vandevoorde
David Weese
DS Hirschberg
EW Myers
G Myers
G Navarro
J Dutheil
J Kececioglu
J Stajich
JC Venter
K Czarnecki
K Mehlhorn
Knut Reinert
M Abouelhoda
M Brudno
M Höhl
M Li
M Pocock
M Wilson
MH Austern
MH Overmars
MI Abouelhoda
N Saitou
O Gotoh
P Bieganski
P Weiner
R Giegerich
RJ Mural
S Burkhardt
S Kurtz
SB Needleman
SF Altschul
TH Cormen
Tobias Rausch
U Manber
W Vahrson
WR Pitt
Publication venue: BioMed Central
Publication date: 01/01/2008

Background: The use of novel algorithmic techniques is pivotal to many important problems in life science. For example, the sequencing of the human genome [1] would not have been possible without advanced assembly algorithms. However, owing to the high speed of technological progress and the urgent need for bioinformatics tools, there is a widening gap between state-of-the-art algorithmic techniques and the actual algorithmic components of tools that are in widespread use. Results: To remedy this trend we propose the use of SeqAn, a library of efficient data types and algorithms for sequence analysis in computational biology. SeqAn comprises implementations of existing, practical state-of-the-art algorithmic components to provide a sound basis for algorithm testing and development. In this paper we describe the design and content of SeqAn and demonstrate its use by giving two examples. In the first example we show an application of SeqAn as an experimental platform by comparing different exact string matching algorithms. The second example is a simple version of the well-known MUMmer tool rewritten in SeqAn. Results indicate that our implementation is very efficient and versatile to use. Conclusion: We anticipate that SeqAn greatly simplifies the rapid development of new bioinformatics tools by providing a collection of readily usable, well-designed algorithmic components which are fundamental for the field of sequence analysis. This not only facilitates the implementation of new algorithms, but also enables a sound analysis and comparison of existing algorithms.

Crossref

Springer - Publisher Connector

Directory of Open Access Journals

Repository: Freie Universität Berlin (FU), Math Department (fu_mi_publications)

PubMed Central

Lightweight Lempel-Ziv Parsing

Author: D. Okanohara
E. Ohlebusch
G. Chen
G. Navarro
J. Barbay
J. Fischer
J. Kärkkäinen
J. Ziv
M. Crochemore
M.I. Abouelhoda
P. Ferragina
R. Cánovas
S. Kreft
S. Kuruppu
T. Gagie
T. Kasai
T. Starikovskaya
U. Manber
W.I. Chang
Publication date: 01/01/2013

We introduce a new approach to LZ77 factorization that uses O(n/d) words of working space and O(dn) time for any d >= 1 (for polylogarithmic alphabet sizes). We also describe carefully engineered implementations of alternative approaches to lightweight LZ77 factorization. Extensive experiments show that the new algorithm is superior in most cases, particularly at the lowest memory levels and for highly repetitive data. As part of the algorithm, we describe new methods for computing matching statistics, which may be of independent interest.
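
To make the term concrete, the sketch below computes an LZ77 factorization by brute force in Python. It shows only what a factorization is: it runs in quadratic time and is not the lightweight O(n/d)-space algorithm from the paper. The function name and the output encoding (a literal character, or a `(position, length)` pair) are assumptions made for illustration.

```python
def lz77_factorize(text):
    # Each factor is either a character that has not occurred before
    # (emitted as a literal), or the longest prefix of the remaining
    # text that also starts at an earlier position (emitted as a
    # (position, length) pair). Sources may overlap the factor itself.
    factors = []
    n = len(text)
    i = 0
    while i < n:
        best_len, best_pos = 0, -1
        for j in range(i):  # try every earlier starting position
            length = 0
            while i + length < n and text[j + length] == text[i + length]:
                length += 1
            if length > best_len:
                best_len, best_pos = length, j
        if best_len == 0:
            factors.append(text[i])
            i += 1
        else:
            factors.append((best_pos, best_len))
            i += best_len
    return factors
```

For example, `lz77_factorize("abababab")` returns `['a', 'b', (0, 6)]`. A highly repetitive input collapses into very few factors, which is the kind of data the abstract reports the new algorithm handling particularly well.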

arXiv.org e-Print Archive

Crossref

On finding minimal absent words

Author: Armando J Pinho
C Acquisti
D Gusfield
DK Kim
E Ukkonen
EM McCreight
F Shi
G Hampikian
J Herold
J Kärkkäinen
João MOS Rodrigues
M Burrows
MI Abouelhoda
P Weiner
Paulo JSG Ferreira
S Kurtz
Sara P Garcia
T Kasai
U Manber
Publication venue: BioMed Central
Publication date: 01/01/2009

Background: The problem of finding the shortest absent words in DNA data has been recently addressed, and algorithms for its solution have been described. It has been noted that longer absent words might also be of interest, but the existing algorithms only provide generic absent words by trivially extending the shortest ones. Results: We show how absent words relate to the repetitions and structure of the data, and define a new and larger class of absent words, called minimal absent words, that still captures the essential properties of the shortest absent words introduced in recent works. The words of this new class are minimal in the sense that if their leftmost or rightmost character is removed, then the resulting word is no longer an absent word. We describe an algorithm for generating minimal absent words that, in practice, runs in approximately linear time. An implementation of this algorithm is publicly available at ftp://www.ieeta.pt/~ap/maws. Conclusion: Because the set of minimal absent words that we propose is much larger than the set of the shortest absent words, it is potentially more useful for applications that require a richer variety of absent words. Nevertheless, the number of minimal absent words is still manageable since it grows at most linearly with the string size, unlike generic absent words that grow exponentially. Both the algorithm and the concepts upon which it depends shed additional light on the structure of absent words and complement the existing studies on the topic.
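
The definition of a minimal absent word given above can be checked directly on small inputs. The Python sketch below does so by brute force. It is not the near-linear-time algorithm the paper describes, and the function name and the `max_len` cutoff are assumptions made for illustration.

```python
def minimal_absent_words(text, max_len):
    # All substrings of text with length at most max_len.
    present = {
        text[i:j]
        for i in range(len(text))
        for j in range(i + 1, min(len(text), i + max_len) + 1)
    }
    alphabet = sorted(set(text))
    result = set()
    for w in present:
        if len(w) >= max_len:
            continue  # extending w would exceed the length cutoff
        for c in alphabet:
            cand = w + c
            # cand is a minimal absent word if it does not occur, but
            # dropping its last character (giving w) or its first
            # character (giving cand[1:]) yields a word that does occur.
            if cand not in present and cand[1:] in present:
                result.add(cand)
    return sorted(result)
```

For the string "AAB" with `max_len=3` this returns `['AAA', 'BA', 'BB']`. The word "ABA" is also absent, but it is not minimal because removing its first character gives "BA", which is itself absent.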

Crossref

Springer - Publisher Connector

Directory of Open Access Journals

PubMed Central

Simultaneous identification of long similar substrings in large sets of sequences

Author: A Lefebvre
Burghardt Wittig
E Check
Friedrich Möller
J Kleffe
Jürgen Kleffe
M Abouelhoda
M Hiller
M Höhl
PE Warburton
R Sorek
S Burkhardt
S Kurtz
S Mielordt
T Hamborg
W Kent
Publication venue: BioMed Central
Publication date: 01/01/2007

Background: Sequence comparison faces new challenges today, with many complete genomes and large libraries of transcripts known. Gene annotation pipelines match these sequences in order to identify genes and their alternative splice forms. However, the software currently available cannot simultaneously compare sets of sequences as large as necessary, especially if errors must be considered. Results: We therefore present a new algorithm for the identification of almost perfectly matching substrings in very large sets of sequences. Its implementation, called ClustDB, is considerably faster and can handle 16 times more data than VMATCH, the most memory-efficient exact program known today. ClustDB simultaneously generates large sets of exactly matching substrings of a given minimum length as seeds for a novel method of match extension with errors. It generates alignments of maximum length with a given maximum number of errors within each overlapping window of a given size. Such alignments are not optimal in the usual sense but faster to calculate and often more appropriate than traditional alignments for genomic sequence comparisons, EST and full-length cDNA matching, and genomic sequence assembly. The method is used to check the overlaps and to reveal possible assembly errors for 1377 Medicago truncatula BAC-size sequences published at http://www.medicago.org/genome/assembly_table.php?chr=1. Conclusion: The program ClustDB proves that window alignment is an efficient way to find long sequence sections of homogeneous alignment quality, as expected in case of random errors, and to detect systematic errors resulting from sequence contaminations. Such inserts are systematically overlooked in long alignments controlled by only tuning penalties for mismatches and gaps. ClustDB is freely available for academic use.

Crossref

Springer - Publisher Connector

Directory of Open Access Journals

PubMed Central

Clustering Documents with Maximal Substrings

Author: D. Blei
D. Zhang
K. Nigam
M. Abouelhoda
T. Chumwatana
T. Kasai
Y. Li
Publication venue: Springer Science and Business Media LLC
Publication date: 01/01/2012

This paper provides experimental results showing that maximal substrings can be used as elementary building blocks of documents in place of the words extracted by state-of-the-art supervised word extraction. Maximal substrings are defined as substrings whose number of occurrences decreases when even a single character is appended to the head or tail. The main feature of maximal substrings is that they can be extracted quite efficiently in an unsupervised manner. We extract maximal substrings from a document set and represent each document as a bag of maximal substrings. We also obtain a bag-of-words representation by applying state-of-the-art supervised word extraction to the same document set. We then apply the same document clustering method to both representations and obtain two clustering results for a comparison of their quality. We adopt Bayesian document clustering based on Dirichlet compound multinomials to avoid overfitting. Our experiment shows that the clustering quality achieved with maximal substrings is acceptable enough to use them in place of the words extracted by supervised word extraction.
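
The definition of a maximal substring can be tested directly on a short string, as in the brute-force Python sketch below. It does not reproduce the paper's efficient extraction method. The function names and the `min_count` frequency threshold are assumptions made for illustration, and a single string stands in for a document set.

```python
def count_occurrences(text, s):
    # Number of (possibly overlapping) occurrences of s in text.
    return sum(
        1 for i in range(len(text) - len(s) + 1) if text.startswith(s, i)
    )


def maximal_substrings(text, min_count=2):
    alphabet = set(text)
    substrings = {
        text[i:j]
        for i in range(len(text))
        for j in range(i + 1, len(text) + 1)
    }
    result = {}
    for s in substrings:
        count = count_occurrences(text, s)
        if count < min_count:
            continue
        # s is maximal if prepending or appending any single character
        # strictly reduces its number of occurrences.
        if all(
            count_occurrences(text, a + s) < count
            and count_occurrences(text, s + a) < count
            for a in alphabet
        ):
            result[s] = count
    return result
```

For the string "abcabc" with the default threshold this returns `{'abc': 2}`. Shorter repeats such as "ab" are excluded because extending them to "abc" leaves the occurrence count unchanged.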

Crossref

Nagasaki University's Academic Output SITE: NAOSITE

Institutional Repositories DataBase (IRDB)

Parasail: SIMD C library for global, semi-global, and local pairwise sequence alignments

Author: A Szalkowski
A Wozniak
C Camacho
DJ States
J Daily
J Fischer
Jeff Daily
L Wang
M Farrar
M Zhao
MI Abouelhoda
O Gotoh
S Henikoff
SF Altschul
T Rognes
The UniProt Consortium
Y Liu
Publication venue: Springer Science and Business Media LLC

Crossref

A fast algorithm for the multiple genome rearrangement problem with weighted reversals and transpositions

Author: A Bergeron
A Caprara
B Bourque
B Moret
D Bader
D Sankoff
E Tannier
Enno Ohlebusch
G Fritzsch
J Tang
M Bader
M Bernt
M Blanchette
M Cosner
Martin Bader
Mohamed I Abouelhoda
N Eriksen
P Pevzner
S Hannenhalli
S Wu
T Hartman
T Liu
V Bafna
Publication venue: BioMed Central
Publication date: 01/01/2008

Background: Due to recent progress in genome sequencing, more and more data for phylogenetic reconstruction based on rearrangement distances between genomes become available. However, this phylogenetic reconstruction is a very challenging task. For the simplest distance measures (the breakpoint distance and the reversal distance), the problem is NP-hard even if one considers only three genomes. Results: In this paper, we present a new heuristic algorithm that directly constructs a phylogenetic tree with respect to the weighted reversal and transposition distance. Experimental results on previously published datasets show that constructing phylogenetic trees in this way results in better trees than constructing the trees with respect to the reversal distance and then recalculating the weight of the trees with the weighted reversal and transposition distance. An implementation of the algorithm can be obtained from the authors. Conclusion: The possibility of creating phylogenetic trees directly with respect to the weighted reversal and transposition distance results in biologically more realistic scenarios. Our algorithm can solve today's most challenging biological datasets in a reasonable amount of time.

Crossref

Springer - Publisher Connector

Directory of Open Access Journals

PubMed Central

Minimal Absent Words in Four Human Genome Assemblies

Author: A Dembo
AJ Pinho
Armando J. Pinho
C Acquisti
D Gusfield
G Hampikian
J Herold
JR Lupski
K Ning
M Burrows
MI Abouelhoda
P Jaccard
R Li
S Gnerre
S Levy
Sara P. Garcia
SP Garcia
T Kasai
Z Khan
Zhanjiang Liu
Publication venue: Public Library of Science
Publication date: 01/01/2011

Minimal absent words have been computed in genomes of organisms from all domains of life. Here, we aim to contribute to the catalogue of human genomic variation by investigating the variation in number and content of minimal absent words within a species, using four human genome assemblies. We compare the reference human genome GRCh37 assembly, the HuRef assembly of the genome of Craig Venter, the NA12878 assembly from cell line GM12878, and the YH assembly of the genome of a Han Chinese individual. We find the variation in number and content of minimal absent words between assemblies more significant for large and very large minimal absent words, where the biases of sequencing and assembly methodologies become more pronounced. Moreover, we find generally greater similarity between the human genome assemblies sequenced with capillary-based technologies (GRCh37 and HuRef) than between the human genome assemblies sequenced with massively parallel technologies (NA12878 and YH). Finally, as expected, we find the overall variation in number and content of minimal absent words within a species to be generally smaller than the variation between species.

CiteSeerX

Public Library of Science (PLOS)

Crossref

Directory of Open Access Journals

PubMed Central