31,040 research outputs found
Attribute Value Reordering For Efficient Hybrid OLAP
The normalization of a data cube is the ordering of the attribute values. For
large multidimensional arrays where dense and sparse chunks are stored
differently, proper normalization can lead to improved storage efficiency. We
show that it is NP-hard to compute an optimal normalization even for 1x3
chunks, although we find an exact algorithm for 1x2 chunks. When dimensions are
nearly statistically independent, we show that dimension-wise attribute
frequency sorting is an optimal normalization and takes time O(d n log(n)) for
data cubes of size n^d. When dimensions are not independent, we propose and
evaluate several heuristics. The hybrid OLAP (HOLAP) storage mechanism is
already 19%-30% more efficient than ROLAP, but normalization can improve it
further by 9%-13% for a total gain of 29%-44% over ROLAP
Layered Label Propagation: A MultiResolution Coordinate-Free Ordering for Compressing Social Networks
We continue the line of research on graph compression started with WebGraph,
but we move our focus to the compression of social networks in a proper sense
(e.g., LiveJournal): the approaches that have been used for a long time to
compress web graphs rely on a specific ordering of the nodes (lexicographical
URL ordering) whose extension to general social networks is not trivial. In
this paper, we propose a solution that mixes clusterings and orders, and devise
a new algorithm, called Layered Label Propagation, that builds on previous work
on scalable clustering and can be used to reorder very large graphs (billions
of nodes). Our implementation uses overdecomposition to perform aggressively on
multi-core architecture, making it possible to reorder graphs of more than 600
millions nodes in a few hours. Experiments performed on a wide array of web
graphs and social networks show that combining the order produced by the
proposed algorithm with the WebGraph compression framework provides a major
increase in compression with respect to all currently known techniques, both on
web graphs and on social networks. These improvements make it possible to
analyse in main memory significantly larger graphs
Compressing DNA sequence databases with coil
Background: Publicly available DNA sequence databases such as GenBank are large, and are
growing at an exponential rate. The sheer volume of data being dealt with presents serious storage
and data communications problems. Currently, sequence data is usually kept in large "flat files,"
which are then compressed using standard Lempel-Ziv (gzip) compression â an approach which
rarely achieves good compression ratios. While much research has been done on compressing
individual DNA sequences, surprisingly little has focused on the compression of entire databases
of such sequences. In this study we introduce the sequence database compression software coil.
Results: We have designed and implemented a portable software package, coil, for compressing
and decompressing DNA sequence databases based on the idea of edit-tree coding. coil is geared
towards achieving high compression ratios at the expense of execution time and memory usage
during compression â the compression time represents a "one-off investment" whose cost is
quickly amortised if the resulting compressed file is transmitted many times. Decompression
requires little memory and is extremely fast. We demonstrate a 5% improvement in compression
ratio over state-of-the-art general-purpose compression tools for a large GenBank database file
containing Expressed Sequence Tag (EST) data. Finally, coil can efficiently encode incremental
additions to a sequence database.
Conclusion: coil presents a compelling alternative to conventional compression of flat files for the
storage and distribution of DNA sequence databases having a narrow distribution of sequence
lengths, such as EST data. Increasing compression levels for databases having a wide distribution of
sequence lengths is a direction for future work
Index ordering by query-independent measures
Conventional approaches to information retrieval search through all applicable entries in an inverted file for a particular collection in order to find those documents with the highest scores. For particularly large collections this may be extremely time consuming.
A solution to this problem is to only search a limited amount of the collection at query-time, in order to speed up the retrieval process. In doing this we can also limit the loss in retrieval efficacy (in terms of accuracy of results). The way we achieve this is to firstly identify the most âimportantâ documents within the collection, and sort documents within inverted file lists in order of this âimportanceâ. In this way we limit the amount of information to be searched at query time by eliminating documents of lesser importance, which not only makes the search more efficient, but also limits loss in retrieval accuracy. Our experiments, carried out on the TREC Terabyte collection, report significant savings, in terms of number of postings examined, without significant loss of effectiveness when based on several measures of importance used in isolation, and in combination. Our results point to several ways in which the computation cost of searching large collections of documents can be significantly reduced
Contextual Information Retrieval based on Algorithmic Information Theory and Statistical Outlier Detection
The main contribution of this paper is to design an Information Retrieval
(IR) technique based on Algorithmic Information Theory (using the Normalized
Compression Distance- NCD), statistical techniques (outliers), and novel
organization of data base structure. The paper shows how they can be integrated
to retrieve information from generic databases using long (text-based) queries.
Two important problems are analyzed in the paper. On the one hand, how to
detect "false positives" when the distance among the documents is very low and
there is actual similarity. On the other hand, we propose a way to structure a
document database which similarities distance estimation depends on the length
of the selected text. Finally, the experimental evaluations that have been
carried out to study previous problems are shown.Comment: Submitted to 2008 IEEE Information Theory Workshop (6 pages, 6
figures
Tight and simple Web graph compression
Analysing Web graphs has applications in determining page ranks, fighting Web
spam, detecting communities and mirror sites, and more. This study is however
hampered by the necessity of storing a major part of huge graphs in the
external memory, which prevents efficient random access to edge (hyperlink)
lists. A number of algorithms involving compression techniques have thus been
presented, to represent Web graphs succinctly but also providing random access.
Those techniques are usually based on differential encodings of the adjacency
lists, finding repeating nodes or node regions in the successive lists, more
general grammar-based transformations or 2-dimensional representations of the
binary matrix of the graph. In this paper we present two Web graph compression
algorithms. The first can be seen as engineering of the Boldi and Vigna (2004)
method. We extend the notion of similarity between link lists, and use a more
compact encoding of residuals. The algorithm works on blocks of varying size
(in the number of input lines) and sacrifices access time for better
compression ratio, achieving more succinct graph representation than other
algorithms reported in the literature. The second algorithm works on blocks of
the same size, in the number of input lines, and its key mechanism is merging
the block into a single ordered list. This method achieves much more attractive
space-time tradeoffs.Comment: 15 page
- âŠ