Life science research labs today manage increasing volumes of sequence
data. Much of this data management and querying is accomplished
procedurally using Perl, Python, or Java programs that integrate data
from different sources and query tools. The dangers of this procedural
approach are well known to the database community: (a) severe
limitations on the ability to rapidly express queries and (b)
inefficient query plans due to the lack of sophisticated optimization
tools. This situation is likely to get worse with advances in
high-throughput technologies that make it easier to quickly produce
vast amounts of sequence data. The need for a declarative and
efficient system to manage and query biological sequence data is
urgent. To address this need, we designed the Periscope/SQ system.
Periscope/SQ extends current relational systems to enable
sophisticated queries on sequence data and can optimize and execute
these queries efficiently.
This thesis describes the problems that must be solved to build
the Periscope/SQ system. First, we describe the
algebraic framework that forms the backbone of Periscope/SQ. Second,
we describe algorithms to construct large-scale suffix tree indexes
for efficiently answering sequence queries. Third, we describe
techniques for selectivity estimation and optimization in the context
of queries over biological sequences. Next, we demonstrate how some of
the techniques developed for Periscope/SQ can be applied to produce a
powerful mining algorithm that we call FLAME. Finally, we
describe GeneFinder, a biological application built on top of
Periscope/SQ. GeneFinder is currently being used to predict the targets of
transcription factors.
Today, genomic and proteomic sequences are the most abundantly
available source of high-quality biological data. By making it possible to
declaratively and efficiently query vast amounts of sequence data,
Periscope/SQ opens the door to major improvements in the pace of
bioinformatics research.
Ph.D.
Computer Science & Engineering
University of Michigan, Horace H. Rackham School of Graduate Studies
http://deepblue.lib.umich.edu/bitstream/2027.42/55670/2/tatas_1.pd