3,036 research outputs found
Reordering Columns for Smaller Indexes
Column-oriented indexes-such as projection or bitmap indexes-are compressed
by run-length encoding to reduce storage and increase speed. Sorting the tables
improves compression. On realistic data sets, permuting the columns in the
right order before sorting can reduce the number of runs by a factor of two or
more. Unfortunately, determining the best column order is NP-hard. For many
cases, we prove that the number of runs in table columns is minimized if we
sort columns by increasing cardinality. Experimentally, sorting based on
Hilbert space-filling curves is poor at minimizing the number of runs.Comment: to appear in Information Science
Histogram-Aware Sorting for Enhanced Word-Aligned Compression in Bitmap Indexes
Bitmap indexes must be compressed to reduce input/output costs and minimize
CPU usage. To accelerate logical operations (AND, OR, XOR) over bitmaps, we use
techniques based on run-length encoding (RLE), such as Word-Aligned Hybrid
(WAH) compression. These techniques are sensitive to the order of the rows: a
simple lexicographical sort can divide the index size by 9 and make indexes
several times faster. We investigate reordering heuristics based on computed
attribute-value histograms. Simply permuting the columns of the table based on
these histograms can increase the sorting efficiency by 40%.Comment: To appear in proceedings of DOLAP 200
Reordering Rows for Better Compression: Beyond the Lexicographic Order
Sorting database tables before compressing them improves the compression
rate. Can we do better than the lexicographical order? For minimizing the
number of runs in a run-length encoding compression scheme, the best approaches
to row-ordering are derived from traveling salesman heuristics, although there
is a significant trade-off between running time and compression. A new
heuristic, Multiple Lists, which is a variant on Nearest Neighbor that trades
off compression for a major running-time speedup, is a good option for very
large tables. However, for some compression schemes, it is more important to
generate long runs rather than few runs. For this case, another novel
heuristic, Vortex, is promising. We find that we can improve run-length
encoding up to a factor of 3 whereas we can improve prefix coding by up to 80%:
these gains are on top of the gains due to lexicographically sorting the table.
We prove that the new row reordering is optimal (within 10%) at minimizing the
runs of identical values within columns, in a few cases.Comment: to appear in ACM TOD
IMPROVING MOLECULAR FINGERPRINT SIMILARITY VIA ENHANCED FOLDING
Drug discovery depends on scientists finding similarity in molecular fingerprints to the drug target. A new way to improve the accuracy of molecular fingerprint folding is presented. The goal is to alleviate a growing challenge due to excessively long fingerprints. This improved method generates a new shorter fingerprint that is more accurate than the basic folded fingerprint. Information gathered during preprocessing is used to determine an optimal attribute order. The most commonly used blocks of bits can then be organized and used to generate a new improved fingerprint for more optimal folding. We thenapply the widely usedTanimoto similarity search algorithm to benchmark our results. We show an improvement in the final results using this method to generate an improved fingerprint when compared against other traditional folding methods
Towards a Holistic Integration of Spreadsheets with Databases: A Scalable Storage Engine for Presentational Data Management
Spreadsheet software is the tool of choice for interactive ad-hoc data
management, with adoption by billions of users. However, spreadsheets are not
scalable, unlike database systems. On the other hand, database systems, while
highly scalable, do not support interactivity as a first-class primitive. We
are developing DataSpread, to holistically integrate spreadsheets as a
front-end interface with databases as a back-end datastore, providing
scalability to spreadsheets, and interactivity to databases, an integration we
term presentational data management (PDM). In this paper, we make a first step
towards this vision: developing a storage engine for PDM, studying how to
flexibly represent spreadsheet data within a database and how to support and
maintain access by position. We first conduct an extensive survey of
spreadsheet use to motivate our functional requirements for a storage engine
for PDM. We develop a natural set of mechanisms for flexibly representing
spreadsheet data and demonstrate that identifying the optimal representation is
NP-Hard; however, we develop an efficient approach to identify the optimal
representation from an important and intuitive subclass of representations. We
extend our mechanisms with positional access mechanisms that don't suffer from
cascading update issues, leading to constant time access and modification
performance. We evaluate these representations on a workload of typical
spreadsheets and spreadsheet operations, providing up to 20% reduction in
storage, and up to 50% reduction in formula evaluation time
Sorting improves word-aligned bitmap indexes
Bitmap indexes must be compressed to reduce input/output costs and minimize
CPU usage. To accelerate logical operations (AND, OR, XOR) over bitmaps, we use
techniques based on run-length encoding (RLE), such as Word-Aligned Hybrid
(WAH) compression. These techniques are sensitive to the order of the rows: a
simple lexicographical sort can divide the index size by 9 and make indexes
several times faster. We investigate row-reordering heuristics. Simply
permuting the columns of the table can increase the sorting efficiency by 40%.
Secondary contributions include efficient algorithms to construct and aggregate
bitmaps. The effect of word length is also reviewed by constructing 16-bit,
32-bit and 64-bit indexes. Using 64-bit CPUs, we find that 64-bit indexes are
slightly faster than 32-bit indexes despite being nearly twice as large
Recommended from our members
The scheduling of sparse matrix-vector multiplication on a massively parallel dap computer
An efficient data structure is presented which supports general unstructured sparse matrix-vector multiplications on a Distributed Array of Processors (DAP). This approach seeks to reduce the inter-processor data movements and organises the operations in batches of massively parallel steps by a heuristic scheduling procedure performed on the host computer.
The resulting data structure is of particular relevance to iterative schemes for solving linear systems. Performance results for matrices taken from well known Linear Programming (LP) test problems are presented and analysed
- …