Practical methods for constructing suffix trees
Sequence datasets are ubiquitous in modern life-science applications, and querying sequences is a common and critical operation in many of these applications. The suffix tree is a versatile data structure that can be used to evaluate a wide variety of queries on sequence datasets, including evaluating exact and approximate string matches, and finding repeat patterns. However, methods for constructing suffix trees are often very time-consuming, especially for suffix trees that are large and do not fit in the available main memory. Even when the suffix tree fits in memory, it turns out that the processor cache behavior of theoretically optimal suffix tree construction methods is poor, resulting in poor performance. Currently, there are a large number of algorithms for constructing suffix trees, but the practical tradeoffs in using these algorithms for different scenarios are not well characterized.
Peer Reviewed
http://deepblue.lib.umich.edu/bitstream/2027.42/47869/1/778_2005_Article_154.pd
Declarative Querying For Biological Sequences.
Life science research labs today manage increasing volumes of sequence
data. Much of the data management and querying today is accomplished
procedurally using Perl, Python, or Java programs that integrate data
from different sources and query tools. The dangers of this procedural
approach are well known to the database community: (a) severe
limitations on the ability to rapidly express queries and (b)
inefficient query plans due to the lack of sophisticated optimization
tools. This situation is likely to get worse with advances in
high-throughput technologies that make it easier to quickly produce
vast amounts of sequence data. The need for a declarative and
efficient system to manage and query biological sequence data is
urgent. To address this need, we designed the Periscope/SQ system.
Periscope/SQ extends current relational systems to enable
sophisticated queries on sequence data and can optimize and execute
these queries efficiently.
This thesis describes the problems that need to be solved to make it
possible to build the Periscope/SQ system. First, we describe the
algebraic framework which forms the backbone of Periscope/SQ. Second,
we describe algorithms to construct large scale suffix tree indexes
for efficiently answering sequence queries. Third, we describe
techniques for selectivity estimation and optimization in the context
of queries over biological sequences. Next, we demonstrate how some of
the techniques developed for Periscope/SQ can be applied to produce a
powerful mining algorithm that we call FLAME. Finally, we
describe GeneFinder, a biological application built on top of
Periscope/SQ. GeneFinder is currently being used to predict the targets of
transcription factors.
Today, genomic and proteomic sequences are the most abundantly
available source of high-quality biological data. By making it possible to
declaratively and efficiently query vast amounts of sequence data,
Periscope/SQ opens the door to vast improvements in the pace of
bioinformatics research.
Ph.D., Computer Science & Engineering, University of Michigan, Horace H. Rackham School of Graduate Studies
http://deepblue.lib.umich.edu/bitstream/2027.42/55670/2/tatas_1.pd
A disk-resident suffix tree index and generic framework for managing tunable indexes
This thesis introduces two related technologies. The first is a disk-resident index for biological sequence data, and the second is a framework and toolkit for the management of operational parameters for applications of which this index is typical. The Top-Compressed Suffix Tree is a novel data structure that can be used to provide a scalable, disk-resident index for large sequences. This data structure is based on the suffix tree, but has been designed to overcome the problems associated with using such structures on secondary memory. Top-Compressed Suffix Trees can be constructed incrementally, allowing indexes to be created that are larger than the amount of available main memory. Correspondingly, querying such an index only requires part of the data structure to be resident in main memory, thus allowing support for on-demand faulting and eviction of index sections during search. Such an index may be of great benefit to scientists requiring efficient access to vast repositories of genomic data. The Generic Index Development and Operation Framework (GIDOF) is a framework and toolkit that supports various tasks relating to the management of operational parameters. The performance of an index's implementation is typically influenced by several operational parameters that must be tuned carefully if optimum performance is to be obtained. Indexes implemented using GIDOF can be structured in such a way that values of selected operational parameters can be adjusted, resulting in an index implementation that can be tuned to suit a given workload or system environment. This thesis presents a detailed description of the design of both the Top-Compressed Suffix Tree and the algorithms that operate over it. Extensive performance measurements are then presented and discussed, covering such aspects of index performance as construction time, average query performance and the size of the completed index.
An overview of the GIDOF parameter model and toolkit is then given together with examples of how this framework can be used to manage tunable indexes, such as the Top-Compressed Suffix Tree.