121 research outputs found
On the Use of Suffix Arrays for Memory-Efficient Lempel-Ziv Data Compression
Much research has been devoted to optimizing algorithms of the Lempel-Ziv
(LZ) 77 family, both in terms of speed and memory requirements. Binary search
trees and suffix trees (ST) are data structures that have been often used for
this purpose, as they allow fast searches at the expense of memory usage.
In recent years, there has been interest on suffix arrays (SA), due to their
simplicity and low memory requirements. One key issue is that an SA can solve
the sub-string problem almost as efficiently as an ST, using less memory. This
paper proposes two new SA-based algorithms for LZ encoding, which require no
modifications on the decoder side. Experimental results on standard benchmarks
show that our algorithms, though not faster, use 3 to 5 times less memory than
the ST counterparts. Another important feature of our SA-based algorithms is
that the amount of memory is independent of the text to search, thus the memory
that has to be allocated can be defined a priori. These features of low and
predictable memory requirements are of the utmost importance in several
scenarios, such as embedded systems, where memory is at a premium and speed is
not critical. Finally, we point out that the new algorithms are general, in the
sense that they are adequate for applications other than LZ compression, such
as text retrieval and forward/backward sub-string search.Comment: 10 pages, submited to IEEE - Data Compression Conference 200
Bug or Not? Bug Report Classification Using N-Gram IDF
Previous studies have found that a significant number of bug reports are
misclassified between bugs and non-bugs, and that manually classifying bug
reports is a time-consuming task. To address this problem, we propose a bug
reports classification model with N-gram IDF, a theoretical extension of
Inverse Document Frequency (IDF) for handling words and phrases of any length.
N-gram IDF enables us to extract key terms of any length from texts, these key
terms can be used as the features to classify bug reports. We build
classification models with logistic regression and random forest using features
from N-gram IDF and topic modeling, which is widely used in various software
engineering tasks. With a publicly available dataset, our results show that our
N-gram IDF-based models have a superior performance than the topic-based models
on all of the evaluated cases. Our models show promising results and have a
potential to be extended to other software engineering tasks.Comment: 5 pages, ICSME 201
Computing Lempel-Ziv Factorization Online
We present an algorithm which computes the Lempel-Ziv factorization of a word
of length on an alphabet of size online in the
following sense: it reads starting from the left, and, after reading each
characters of , updates the Lempel-Ziv
factorization. The algorithm requires bits of space and O(n
\log^2 n) time. The basis of the algorithm is a sparse suffix tree combined
with wavelet trees
Computing regularities in strings
Regularities in strings model many phenomena and thus form the subject of extensive mathematical studies . Perhaps the most conspicuous regularities in strings are those that manifest themselves in the form of repeated subpatterns. In this paper, we study several forms of regularities of strings, that is, repeats, multirepeats, repetitions and runs. We present their similarities and differences by discussing their forms and properties and we explore the existing computation algorithms. We also discuss several data structures useful for computing regularities
A comprehensive evaluation of alignment algorithms in the context of RNA-seq.
Transcriptome sequencing (RNA-Seq) overcomes limitations of previously used RNA quantification methods and provides one experimental framework for both high-throughput characterization and quantification of transcripts at the nucleotide level. The first step and a major challenge in the analysis of such experiments is the mapping of sequencing reads to a transcriptomic origin including the identification of splicing events. In recent years, a large number of such mapping algorithms have been developed, all of which have in common that they require algorithms for aligning a vast number of reads to genomic or transcriptomic sequences. Although the FM-index based aligner Bowtie has become a de facto standard within mapping pipelines, a much larger number of possible alignment algorithms have been developed also including other variants of FM-index based aligners. Accordingly, developers and users of RNA-seq mapping pipelines have the choice among a large number of available alignment algorithms. To provide guidance in the choice of alignment algorithms for these purposes, we evaluated the performance of 14 widely used alignment programs from three different algorithmic classes: algorithms using either hashing of the reference transcriptome, hashing of reads, or a compressed FM-index representation of the genome. Here, special emphasis was placed on both precision and recall and the performance for different read lengths and numbers of mismatches and indels in a read. Our results clearly showed the significant reduction in memory footprint and runtime provided by FM-index based aligners at a precision and recall comparable to the best hash table based aligners. Furthermore, the recently developed Bowtie 2 alignment algorithm shows a remarkable tolerance to both sequencing errors and indels, thus, essentially making hash-based aligners obsolete
- …