197 research outputs found
One-dimensional and multi-dimensional substring selectivity estimation
With the increasing importance of XML, LDAP directories, and text-based information sources on the Internet, there is an ever-greater need to evaluate queries involving (sub)string matching. In many cases, matches need to be on multiple attributes/dimensions, with correlations between the multiple dimensions. Effective query optimization in this context requires good selectivity estimates. In this paper, we use pruned count-suffix trees (PSTs) as the basic data structure for substring selectivity estimation. For the 1-D problem, we present a novel technique called MO (Maximal Overlap). We then develop and analyze two 1-D estimation algorithms, MOC and MOLC, based on MO and a constraint-based characterization of all possible completions of a given PST. For the k -D problem, we first generalize PSTs to multiple dimensions and develop a space- and time-efficient probabilistic algorithm to construct k -D PSTs directly. We then show how to extend MO to multiple dimensions. Finally, we demonstrate, both analytically and experimentally, that MO is both practical and substantially superior to competing algorithms.Peer Reviewedhttp://deepblue.lib.umich.edu/bitstream/2027.42/42330/1/778-9-3-214_00090214.pd
QuickSel: Quick Selectivity Learning with Mixture Models
Estimating the selectivity of a query is a key step in almost any cost-based
query optimizer. Most of today's databases rely on histograms or samples that
are periodically refreshed by re-scanning the data as the underlying data
changes. Since frequent scans are costly, these statistics are often stale and
lead to poor selectivity estimates. As an alternative to scans, query-driven
histograms have been proposed, which refine the histograms based on the actual
selectivities of the observed queries. Unfortunately, these approaches are
either too costly to use in practice---i.e., require an exponential number of
buckets---or quickly lose their advantage as they observe more queries.
In this paper, we propose a selectivity learning framework, called QuickSel,
which falls into the query-driven paradigm but does not use histograms.
Instead, it builds an internal model of the underlying data, which can be
refined significantly faster (e.g., only 1.9 milliseconds for 300 queries).
This fast refinement allows QuickSel to continuously learn from each query and
yield increasingly more accurate selectivity estimates over time. Unlike
query-driven histograms, QuickSel relies on a mixture model and a new
optimization algorithm for training its model. Our extensive experiments on two
real-world datasets confirm that, given the same target accuracy, QuickSel is
34.0x-179.4x faster than state-of-the-art query-driven histograms, including
ISOMER and STHoles. Further, given the same space budget, QuickSel is
26.8%-91.8% more accurate than periodically-updated histograms and samples,
respectively
Efficient similarity-based operations for data integration
Similarity-based operations, similarity join, similarity grouping, data integrationMagdeburg, Univ., Fak. für Informatik, Diss., 2004von Eike Schalleh
Estimating Answer Sizes for XML Queries
Abstract. Estimating the sizes of query results, and intermediate results, is crucial to many aspects of query processing. In particular, it is necessary for effective query optimization. Even at the user level, predictions of the total result size can be valuable in “next-step ” decisions, such as query refinement. This paper proposes a technique to obtain query result size estimates effectively in an XML database. Queries in XML frequently specify structural patterns, requiring specific relationships between selected elements. Whereas traditional techniques can estimate the number of nodes (XML elements) that will satisfy a node-specific predicate in the query pattern, such estimates cannot easily be combined to provide estimates for the entire query pattern, since element occurrences are expected to have high correlation. We propose a solution based on a novel histogram encoding of element occurrence position. With such position histograms, we are able to obtain estimates of sizes for complex pattern queries, as well as for simpler intermediate patterns that may be evaluated in alternative query plans, by means of a position histogram join (pH-join) algorithm that we introduce. We extend our technique to exploit schema information regarding allowable structure (the no-overlap property) through the use of a coverage histogram. We present an extensive experimental evaluation using several XML data sets, both real and synthetic, with a variety of queries. Our results demonstrate that accurate and robust estimates can be achieved, with limited space, and at a miniscule computational cost. These techniques have been implemented in the context of the TIMBER native XML database [22] at the University of Michigan.
Efficient Indexing for Structured and Unstructured Data
The collection of digital data is growing at an exponential rate. Data originates from wide range of data sources such as text feeds, biological sequencers, internet traffic over routers, through sensors and many other sources. To mine intelligent information from these sources, users have to query the data. Indexing techniques aim to reduce the query time by preprocessing the data. Diversity of data sources in real world makes it imperative to develop application specific indexing solutions based on the data to be queried. Data can be structured i.e., relational tables or unstructured i.e., free text. Moreover, increasingly many applications need to seamlessly analyze both kinds of data making data integration a central issue. Integrating text with structured data needs to account for missing values, errors in the data etc. Probabilistic models have been proposed recently for this purpose. These models are also useful for applications where uncertainty is inherent in data e.g. sensor networks. This dissertation aims to propose efficient indexing solutions for several problems that lie at the intersection of database and information retrieval such as joining ranked inputs, full-text documents searching etc. Other well-known problems of ranked retrieval and pattern matching are also studied under probabilistic settings. For each problem, the worst-case theoretical bounds of the proposed solutions are established and/or their practicality is demonstrated by thorough experimentation
- …