53 research outputs found
Column Imprints: A Secondary Index Structure
Large-scale data warehouses rely heavily on secondary indexes,
such as bitmaps and B-trees, to limit access to slow I/O devices.
However, with the advent of large main-memory systems,
cache-conscious secondary indexes are needed to also improve the
transfer bandwidth between memory and CPU. In this paper, we
introduce the column imprint, a simple but efficient
cache-conscious secondary index. A column imprint is a collection
of many small bit vectors, each indexing the data points of a
single cacheline. An imprint is used during query evaluation to
limit data access and thus minimize memory traffic. The
compression for imprints is CPU-friendly and exploits the
empirical observation that data often exhibits local clustering
or partial ordering as a side effect of the construction process.
Most importantly, column imprint compression remains effective
and robust even in the case of unclustered data, where other
state-of-the-art solutions fail. We conducted an extensive
experimental evaluation to assess the applicability and the
performance impact of column imprints. The storage overhead, when
experimenting with real-world datasets, is just a few percent
over the size of the columns being indexed. The evaluation time
for over 40,000 range queries of varying selectivity revealed the
efficiency of the proposed index compar…
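The core idea of the abstract above — one small bit vector per cacheline over a coarse histogram, consulted to skip cachelines during a range scan — can be sketched in a few lines. This is an illustrative toy, not the paper's implementation: the bin bounds, cacheline size, and function names are all assumptions.

```python
import bisect

def build_imprints(column, bin_bounds, cacheline=8):
    """Build one small bit vector per cacheline of values.
    Bit i is set if any value in the cacheline falls in histogram bin i."""
    imprints = []
    for start in range(0, len(column), cacheline):
        bits = 0
        for v in column[start:start + cacheline]:
            bits |= 1 << bisect.bisect_right(bin_bounds, v)
        imprints.append(bits)
    return imprints

def range_query(column, imprints, bin_bounds, lo, hi, cacheline=8):
    """Use the imprints to skip cachelines that cannot contain [lo, hi]."""
    lo_bin = bisect.bisect_right(bin_bounds, lo)
    hi_bin = bisect.bisect_right(bin_bounds, hi)
    mask = 0
    for b in range(lo_bin, hi_bin + 1):
        mask |= 1 << b
    result = []
    for i, bits in enumerate(imprints):
        if bits & mask:                        # candidate cacheline: scan it
            start = i * cacheline
            result.extend(v for v in column[start:start + cacheline]
                          if lo <= v <= hi)
    return result
```

A selective query touches only the cachelines whose imprint overlaps the query's bins, which is the memory-traffic saving the abstract describes.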
Heuristics-based query optimisation for SPARQL
Query optimization in RDF stores is a challenging problem, as SPARQL queries typically contain many more joins than equivalent relational plans and hence lead to a large join-order search space. In such cases, cost-based query optimization often is not possible. One practical reason for this is that statistics typically are missing in web-scale settings such as Linked Open Data (LOD). The more profound reason is that, due to the absence of schematic structure in RDF, join-hit ratio estimation requires complicated forms of correlated join statistics, and currently there are no methods to identify the relevant correlations beforehand. For this reason, the use of good heuristics is essential in SPARQL query optimization, even when they are only partially combined with cost-based statistics (i.e., hybrid query optimization). In this paper we describe a set of useful heuristics for SPARQL query optimizers. We present these in the context of a new Heuristic SPARQL Planner (HSP) that is capable of exploiting the syntactic and structural variations of the triple patterns in a SPARQL query in order to choose an execution plan without the need of any cost model. For this, we define the variable graph and show a reduction of the SPARQL query optimization problem to the maximum weight independent set problem.
We implemented our planner on top of the MonetDB open-source column-store and evaluated its effectiveness against the state-of-the-art RDF-3X engine, as well as comparing the plan quality with a relational (SQL) equivalent of the benchmarks.
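The flavor of such syntax-driven heuristics can be sketched without any cost model: prefer triple patterns whose positions are mostly bound to constants (likely more selective), and keep the join connected through shared variables. This is a simplified greedy illustration of the idea, not the HSP algorithm or its maximum weight independent set reduction.

```python
def heuristic_order(patterns):
    """Greedily order triple patterns (subject, predicate, object):
    prefer patterns sharing a variable with those already placed
    (connected joins), then patterns with more constant positions."""
    def bound_count(p):
        # positions not starting with '?' are constants
        return sum(1 for term in p if not term.startswith('?'))

    def variables(p):
        return {term for term in p if term.startswith('?')}

    remaining = list(patterns)
    ordered = []
    seen_vars = set()
    while remaining:
        def rank(p):
            connected = bool(variables(p) & seen_vars)
            return (connected, bound_count(p))
        best = max(remaining, key=rank)   # ties: first in input order
        remaining.remove(best)
        ordered.append(best)
        seen_vars |= variables(best)
    return ordered
```

For a query with patterns `(?x, rdf:type, foaf:Person)`, `(?x, foaf:name, ?n)`, `(?n, ex:inLang, en)` (hypothetical IRIs), the planner starts from a two-constant pattern and extends only along shared variables, avoiding Cartesian products without consulting statistics.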
A database system with amnesia
Big Data comes with huge challenges. Its volume and velocity make handling, curating, and analytical processing a costly affair. Even to simply “look at” the data within an a priori defined budget and with a guaranteed interactive response time might be impossible to achieve. Commonly applied scale-out approaches will hit the technology and monetary wall soon, if they have not done so already. Likewise, blindly rejecting data when the channels are full, or reducing the data resolution at the source, might lead to the loss of valuable observations. An army of well-educated database administrators or full software-stack architects might deal with these challenges, albeit at substantial cost. This calls for a mostly knobless DBMS with a fundamental change in database management. Data rotting has been proposed as a direction to find a solution [10, 11]. For the sake of storage management and responsiveness, it lets the DBMS semi-autonomously rot away data. Rotting is based on the system's own unwillingness to keep old data as easily accessible as fresh data. This paper sheds more light on the opportunities and potential impacts of this radical departure in data management. Specifically, we study the case where a DBMS selectively forgets tuples (by marking them inactive) under various amnesia scenarios and with different implementation strategies. Our ultimate goal is to use the findings of this study to morph an existing data management engine to serve demanding big data scientific applications with well-chosen built-in data a…
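The forgetting mechanism the abstract describes — marking tuples inactive rather than deleting them — can be sketched as one simple amnesia scenario. The schema, the age-plus-recency rule, and the function name are hypothetical; the paper studies several such scenarios and implementation strategies.

```python
def forget(tuples, now, horizon, recent_hits):
    """One amnesia scenario: mark a tuple inactive when it is older than
    the freshness horizon and the workload has not touched it recently.
    Inactive tuples are skipped by scans but not physically deleted."""
    for t in tuples:
        if now - t['created'] > horizon and t['id'] not in recent_hits:
            t['active'] = False
    return [t for t in tuples if t['active']]
```

Because the tuples are only flagged, the system keeps the option of resurrecting them later; what "old" and "recently touched" mean is exactly the kind of policy knob the paper's scenarios vary.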
Column-store support for RDF data management: not all swans are white
This paper reports on the results of an independent evaluation of the
techniques presented in the VLDB 2007 paper "Scalable Semantic Web Data
Management Using Vertical Partitioning", authored by D. Abadi, A. Marcus,
S. R. Madden, and K. Hollenbach. We revisit the proposed benchmark and
examine both the data and query space coverage. The benchmark is extended
to cover a larger portion of the query space in a canonical way.
Repeatability of the experiments is assessed using the code base obtained
from the authors. Inspired by the proposed vertically-partitioned storage
solution for RDF data and the performance figures using a column-store,
we conduct a complementary analysis of state-of-the-art RDF storage
solutions. To this end, we employ MonetDB/SQL, a fully functional open
source column-store, and a commercial row-store DBMS well known for its
performance. We implement two relational RDF storage solutions,
triple-store and vertically-partitioned, in both systems. This allows us
to expand the scope with a performance characterization along both
dimensions (triple-store vs. vertically-partitioned and row-store vs.
column-store) individually, before analyzing their combined effects. A
detailed report of the experimental test-bed, as well as an in-depth
analysis of the parameters involved, clarifies the scope of the solution
originally presented and positions the results in a broader context by
covering more systems.
Scientific discovery through weighted sampling
Scientific discovery has shifted from being an exercise of theory
and computation to becoming the exploration of an ocean of
observational data. Scientists explore data originating from
modern scientific instruments in order to discover interesting
aspects of it and formulate their hypotheses. Such workloads press
for new database functionality. We aim at sampling scientific
databases to create many different impressions of the data, on
which the scientists can quickly evaluate exploratory queries.
However, scientific databases introduce different challenges for
sample construction compared to classical business analytical
applications. We propose adaptive weighted sampling as an
alternative to uniform sampling. With weighted sampling, only the
most informative data is sampled, so data more relevant to the
scientific discovery is available to examine a hypothesis.
Relevant data is considered to be the focal points of the
scientific search, and can be defined either a priori with the use
of functions, or by monitoring the query workload. We study such
query workloads, and we detail different families of weight
functions. Finally, we give a quantitative and qualitative
evaluation of weighted sampling…
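One way to realize such weighted sampling is the exponential-keys method of Efraimidis and Spirakis for weighted sampling without replacement, shown here with a hypothetical a-priori weight function that concentrates the sample around a focal point. This is a sketch of the general technique, not the paper's adaptive scheme; the function names and the weight family are assumptions.

```python
import random

def weighted_sample(data, weight_fn, k, seed=0):
    """Weighted sampling without replacement (Efraimidis-Spirakis):
    assign each item the key u**(1/w) with u uniform in [0, 1) and
    w = weight_fn(item), then keep the k largest keys. Items with
    higher weight are more likely to enter the sample."""
    rng = random.Random(seed)
    keyed = [(rng.random() ** (1.0 / weight_fn(x)), x) for x in data]
    keyed.sort(reverse=True)
    return [x for _, x in keyed[:k]]

def focus_weight(focal, spread=5.0):
    """A priori weight function: weight decays with distance from a
    focal point of the scientific search (hypothetical example)."""
    return lambda x: 1.0 / (1.0 + abs(x - focal) / spread)
```

With a workload-driven variant, `weight_fn` would instead be derived from monitored query predicates, so the impression adapts to where the exploration is heading.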