72,739 research outputs found
Estimating the compression fraction of an index using sampling
Data compression techniques such as null suppression
and dictionary compression are commonly used in today’s
database systems. In order to effectively leverage compression, it
is necessary to have the ability to efficiently and accurately
estimate the size of an index if it were to be compressed. Such an
analysis is critical if automated physical design tools are to be
extended to handle compression. Several database systems today
provide estimators for this problem based on random sampling.
While this approach is efficient, there is no previous work that
analyses its accuracy. In this paper, we analyse the problem of
estimating the compressed size of an index from the point of view
of worst-case guarantees. We show that the simple estimator
implemented by several database systems has several “good”
cases even though the estimator itself is agnostic to the internals
of the specific compression algorithm.
efficiently. The naïve method of actually building and
compressing the index in order to estimate its size, while
highly accurate, is prohibitively inefficient.
Thus, we need to be able to accurately estimate the
compressed size of an index without incurring the cost of
actually compressing it. This problem is challenging because
the size of the compressed index can depend significantly on
the data distribution as well as the compression technique
used. This is in contrast with the estimation of the size of an
uncompressed index in physical database design tools which
can be derived in a straightforward manner from the schema
(which defines the size of the corresponding column) and the
number of rows in the table.
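The sampling approach described above can be sketched as follows. This is a minimal illustration, not the estimator of any particular database system: the function names, the fixed-width string column, and the two simplified size models (a one-byte length prefix for null suppression, fixed-width codes for dictionary compression) are all assumptions made for the example.

```python
import random

def null_suppressed_size(values, width):
    """Bytes used if trailing padding is dropped and a 1-byte length is kept per value.
    `width` is unused here; it keeps the signature uniform across size models."""
    return sum(1 + len(v.rstrip()) for v in values)

def dictionary_size(values, width):
    """Bytes used by a dictionary of distinct values plus one fixed-width code per row."""
    distinct = set(values)
    code_bytes = max(1, ((len(distinct) - 1).bit_length() + 7) // 8)
    return len(distinct) * width + len(values) * code_bytes

def estimate_compression_fraction(column, width, size_fn, sample_rows, seed=0):
    """Compress a uniform random sample and report compressed size / uncompressed size."""
    rng = random.Random(seed)
    sample = rng.sample(column, min(sample_rows, len(column)))
    return size_fn(sample, width) / (len(sample) * width)

if __name__ == "__main__":
    # Hypothetical CHAR(20) column holding ten distinct, space-padded values.
    column = [f"{i % 10:<20}" for i in range(10000)]
    for size_fn in (null_suppressed_size, dictionary_size):
        estimate = estimate_compression_fraction(column, 20, size_fn, sample_rows=500)
        actual = size_fn(column, 20) / (len(column) * 20)
        print(size_fn.__name__, round(estimate, 4), round(actual, 4))
```

In this toy setting the null-suppression estimate matches the true fraction for any sample, because every row compresses independently. The dictionary estimate can differ, because the dictionary cost is spread over fewer rows in the sample than in the full column.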
Intelligent Data Storage and Retrieval for Design Optimisation – an Overview
This paper documents the findings of a literature review conducted by the Sir Lawrence Wackett Centre for Aerospace Design Technology at RMIT University. The review investigates aspects of a proposed system for intelligent design optimisation. Such a system would be capable of efficiently storing (and, if required, compressing) a range of types of design data in an intelligent database. The system would access this database during subsequent design processes to search for relevant design data to re-use, so that the time needed for later designs falls as the database grows. Extensive research has been performed into both the theoretical aspects of the project and practical examples of similar current systems. This research covers the areas of database systems, database queries, representation and compression of design data, geometric representation, and heuristic methods for design applications.
Scalable RDF Data Compression using X10
The Semantic Web comprises enormous volumes of semi-structured data elements.
For interoperability, these elements are represented by long strings. Such
representations are not efficient for the purposes of Semantic Web applications
that perform computations over large volumes of information. A typical method
for alleviating the impact of this problem is through the use of compression
methods that produce more compact representations of the data. The use of
dictionary encoding for this purpose is particularly prevalent in Semantic Web
database systems. However, centralized implementations present performance
bottlenecks, giving rise to the need for scalable, efficient distributed
encoding schemes. In this paper, we describe an encoding implementation based
on the asynchronous partitioned global address space (APGAS) parallel
programming model. We evaluate performance on a cluster of up to 384 cores and
datasets of up to 11 billion triples (1.9 TB). Compared to the state-of-the-art
MapReduce algorithm, we demonstrate a speedup of 2.6-7.4x and excellent
scalability. These results illustrate the strong potential of the APGAS model
for efficient implementation of dictionary encoding and contribute to the
engineering of larger scale Semantic Web applications.
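For readers unfamiliar with dictionary encoding, the basic idea can be sketched on a single machine. This is only the centralized baseline the abstract contrasts against, not the distributed APGAS implementation; the function names and the example IRIs are assumptions made for illustration.

```python
def dictionary_encode(triples):
    """Replace each RDF term (a long string) with a small integer ID."""
    term_to_id = {}
    encoded = []
    for triple in triples:
        # setdefault assigns the next free ID the first time a term is seen.
        ids = tuple(term_to_id.setdefault(term, len(term_to_id)) for term in triple)
        encoded.append(ids)
    return term_to_id, encoded

def dictionary_decode(term_to_id, encoded):
    """Invert the dictionary to recover the original string triples."""
    id_to_term = {i: term for term, i in term_to_id.items()}
    return [tuple(id_to_term[i] for i in ids) for ids in encoded]

if __name__ == "__main__":
    triples = [
        ("http://example.org/alice", "http://xmlns.com/foaf/0.1/knows", "http://example.org/bob"),
        ("http://example.org/bob", "http://xmlns.com/foaf/0.1/knows", "http://example.org/alice"),
    ]
    dictionary, encoded = dictionary_encode(triples)
    print(encoded)
    assert dictionary_decode(dictionary, encoded) == triples
```

The single shared dictionary is what becomes a bottleneck at scale: every worker must agree on the ID assigned to each term, which is the coordination problem a distributed encoder has to solve.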
Investigations on path indexing for graph databases
Graph databases have become an increasingly popular choice for the management of the massive network data sets arising in many contemporary applications. We investigate the effectiveness of path indexing for accelerating query processing in graph database systems, using as an exemplar the widely used open-source Neo4j graph database. We present a novel path index design which supports efficient ordered access to paths in a graph dataset. Our index is fully persistent and designed for external memory storage and retrieval. We also describe a compression scheme that exploits the limited differences between consecutive keys in the index, as well as a workload-driven approach to indexing. We demonstrate empirically the speed-ups achieved by our implementation, showing that the path index yields query run-times from 2x up to 8000x faster than Neo4j. Empirical evaluation also shows that our scheme leads to smaller indexes than using general-purpose LZ4 compression. The complete stand-alone implementation of our index, as well as supporting tooling such as a bulk-loader, is provided as open source for further research and development.
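One generic way to exploit small differences between consecutive sorted keys is front coding: store how many leading components a key shares with its predecessor, plus the remaining suffix. The sketch below is an illustration of that general idea only. The abstract does not specify the authors' scheme, and the representation of a path key as a tuple of node IDs is an assumption.

```python
def front_encode(sorted_keys):
    """Encode each key as (shared_prefix_length, remaining_suffix) relative to the previous key."""
    encoded = []
    previous = ()
    for key in sorted_keys:
        shared = 0
        limit = min(len(previous), len(key))
        while shared < limit and previous[shared] == key[shared]:
            shared += 1
        encoded.append((shared, key[shared:]))
        previous = key
    return encoded

def front_decode(encoded):
    """Rebuild the original keys by reusing the prefix of the previously decoded key."""
    keys = []
    previous = ()
    for shared, suffix in encoded:
        key = previous[:shared] + suffix
        keys.append(key)
        previous = key
    return keys

if __name__ == "__main__":
    # Hypothetical path keys: tuples of node IDs, already in sorted order.
    keys = [(1, 2, 3), (1, 2, 4), (1, 5, 6), (2, 1, 1)]
    encoded = front_encode(keys)
    print(encoded)
    assert front_decode(encoded) == keys
```

Because an ordered index keeps similar keys adjacent, most entries share a long prefix with their neighbour and only the short suffix needs to be stored.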
Rhythmic Representations: Learning Periodic Patterns for Scalable Place Recognition at a Sub-Linear Storage Cost
Robotic and animal mapping systems share many challenges and characteristics:
they must function in a wide variety of environmental conditions, enable the
robot or animal to navigate effectively to find food or shelter, and be
computationally tractable from both a speed and storage perspective. With
regard to map storage, the mammalian brain appears to take a diametrically
opposed approach to all current robotic mapping systems. Where robotic mapping
systems attempt to solve the data association problem to minimise
representational aliasing, neurons in the brain intentionally break data
association by encoding large (potentially unlimited) numbers of places with a
single neuron. In this paper, we propose a novel method based on supervised
learning techniques that seeks out regularly repeating visual patterns in the
environment with mutually complementary co-prime frequencies, and an encoding
scheme that enables storage requirements to grow sub-linearly with the size of
the environment being mapped. To improve robustness in challenging real-world
environments while maintaining storage growth sub-linearity, we incorporate
both multi-exemplar learning and data augmentation techniques. Using large
benchmark robotic mapping datasets, we demonstrate the combined system
achieving high-performance place recognition with sub-linear storage
requirements, and characterize the performance-storage growth trade-off curve.
The work serves as the first robotic mapping system with sub-linear storage
scaling properties, as well as the first large-scale demonstration in
real-world environments of one of the proposed memory benefits of these
neurons. Comment: Pre-print of an article that will appear in IEEE Robotics and
Automation Letters.
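The arithmetic intuition behind co-prime frequencies and sub-linear storage can be shown with a toy example. This is not the paper's learning method, which operates on visual patterns; it is a plain residue encoding, with hypothetical function names and periods, that shows how a few short repeating codes can jointly identify many places.

```python
from math import gcd, prod

def encode_place(index, periods):
    """Represent a place index by its phase within each repeating pattern."""
    return tuple(index % p for p in periods)

def decode_place(residues, periods):
    """Recover the place index by brute-force search over one full joint cycle."""
    for index in range(prod(periods)):
        if encode_place(index, periods) == residues:
            return index
    raise ValueError("residues do not correspond to any place")

if __name__ == "__main__":
    periods = (3, 5, 7)  # assumed pairwise co-prime periods
    assert all(gcd(a, b) == 1 for i, a in enumerate(periods) for b in periods[i + 1:])
    code = encode_place(52, periods)
    print(code, decode_place(code, periods))
    # Distinct places covered versus phase slots stored.
    print(prod(periods), sum(periods))
```

With pairwise co-prime periods, the number of distinguishable places is the product of the periods, while the number of phase slots to store is only their sum, so storage grows more slowly than the environment.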