27,736 research outputs found
On Optimally Partitioning Variable-Byte Codes
The ubiquitous Variable-Byte encoding is one of the fastest compressed
representation for integer sequences. However, its compression ratio is usually
not competitive with other more sophisticated encoders, especially when the
integers to be compressed are small that is the typical case for inverted
indexes. This paper shows that the compression ratio of Variable-Byte can be
improved by 2x by adopting a partitioned representation of the inverted lists.
This makes Variable-Byte surprisingly competitive in space with the best
bit-aligned encoders, hence disproving the folklore belief that Variable-Byte
is space-inefficient for inverted index compression. Despite the significant
space savings, we show that our optimization almost comes for free, given that:
we introduce an optimal partitioning algorithm that does not affect indexing
time because of its linear-time complexity; we show that the query processing
speed of Variable-Byte is preserved, with an extensive experimental analysis
and comparison with several other state-of-the-art encoders.Comment: Published in IEEE Transactions on Knowledge and Data Engineering
(TKDE), 15 April 201
The NASA Astrophysics Data System: Architecture
The powerful discovery capabilities available in the ADS bibliographic
services are possible thanks to the design of a flexible search and retrieval
system based on a relational database model. Bibliographic records are stored
as a corpus of structured documents containing fielded data and metadata, while
discipline-specific knowledge is segregated in a set of files independent of
the bibliographic data itself.
The creation and management of links to both internal and external resources
associated with each bibliography in the database is made possible by
representing them as a set of document properties and their attributes.
To improve global access to the ADS data holdings, a number of mirror sites
have been created by cloning the database contents and software on a variety of
hardware and software platforms.
The procedures used to create and manage the database and its mirrors have
been written as a set of scripts that can be run in either an interactive or
unsupervised fashion.
The ADS can be accessed at http://adswww.harvard.eduComment: 25 pages, 8 figures, 3 table
Use of an engineering data management system in the analysis of space shuttle orbiter tiles
The use of an engineering data management system to facilitate the extensive stress analyses of the space shuttle orbiter thermal protection system is demonstrated. The methods used to gather, organize, and store the data; to query data interactively; to generate graphic displays of the data; and to access, transform, and prepare the data for input to a stress analysis program are described. Information related to many separate tiles can be accessed individually from the data base which has a natural organization from an engineering viewpoint. The flexible user features of the system facilitate changes in data content and organization which occur during the development and refinement of the tile analysis procedure. Additionally, the query language supports retrieval of data to satisfy a variety of user-specified conditions
How Land Use Shapes the Evolution of Road Networks
The present research develops an agent-based model to treat the organization, growth, and contraction of network elements. The components model travel demand, revenue, cost, and investment. Revenue earned by links in excess of maintenance costs is invested on the link to until all revenue is consumed. After upgrading (or downgrading) each link in the network, the time period is incremented and the whole process is repeated until an equilibrium is reached or it is clear that it cannot be achieved. The model is tested with three alternative land use patterns: uniform, random, and bell-shaped, to test the effects of land use on resulting network patterns. It is found that similar, but not identical, equilibrium hierarchical networks result in all cases, with the bell-shaped network, with a CBD, having higher level roads concentrated in a belt around the CBD, while the other networks are less concentratedSelf-organization, network growth, network evolution, transportation planning, land use planning
Performance comparison of point and spatial access methods
In the past few years a large number of multidimensional point access methods, also called
multiattribute index structures, has been suggested, all of them claiming good performance. Since no
performance comparison of these structures under arbitrary (strongly correlated nonuniform, short
"ugly") data distributions and under various types of queries has been performed, database
researchers and designers were hesitant to use any of these new point access methods. As shown in
a recent paper, such point access methods are not only important in traditional database applications.
In new applications such as CAD/CIM and geographic or environmental information systems, access
methods for spatial objects are needed. As recently shown such access methods are based on point
access methods in terms of functionality and performance. Our performance comparison naturally
consists of two parts. In part I we w i l l compare multidimensional point access methods, whereas in
part I I spatial access methods for rectangles will be compared. In part I we present a survey and
classification of existing point access methods. Then we carefully select the following four methods
for implementation and performance comparison under seven different data files (distributions) and
various types of queries: the 2-level grid file, the BANG file, the hB-tree and a new scheme, called
the BUDDY hash tree. We were surprised to see one method to be the clear winner which was the
BUDDY hash tree. It exhibits an at least 20 % better average performance than its competitors and is
robust under ugly data and queries. In part I I we compare spatial access methods for rectangles.
After presenting a survey and classification of existing spatial access methods we carefully selected
the following four methods for implementation and performance comparison under six different data
files (distributions) and various types of queries: the R-tree, the BANG file, PLOP hashing and the
BUDDY hash tree. The result presented two winners: the BANG file and the BUDDY hash tree.
This comparison is a first step towards a standardized testbed or benchmark. We offer our data and
query files to each designer of a new point or spatial access method such that he can run his
implementation in our testbed
Buckets inverted lists for a search engine with BSP
Most information in science, engineering and business has been recorded in form of text. This information can be found online in the World-Wide-Web. One of the major tools to support information access are the search engines which usually use information retrieval techniques to rank Web pages based on a simple query and an index structure like the inverted lists. The retrieval models are the basis for the algorithms that score and rank the Web pages. The focus of this presentation is to show some inverted lists alternatives, based on buckets, for an information retrieval system. The main interest is how query performance is effected by the index organization on a cluster of PCs. The server design is effected on top of the parallel computing model Bulk Synchronous Parallel-BSP.Facultad de InformĂĄtic
Near-Memory Address Translation
Memory and logic integration on the same chip is becoming increasingly cost
effective, creating the opportunity to offload data-intensive functionality to
processing units placed inside memory chips. The introduction of memory-side
processing units (MPUs) into conventional systems faces virtual memory as the
first big showstopper: without efficient hardware support for address
translation MPUs have highly limited applicability. Unfortunately, conventional
translation mechanisms fall short of providing fast translations as
contemporary memories exceed the reach of TLBs, making expensive page walks
common.
In this paper, we are the first to show that the historically important
flexibility to map any virtual page to any page frame is unnecessary in today's
servers. We find that while limiting the associativity of the
virtual-to-physical mapping incurs no penalty, it can break the
translate-then-fetch serialization if combined with careful data placement in
the MPU's memory, allowing for translation and data fetch to proceed
independently and in parallel. We propose the Distributed Inverted Page Table
(DIPTA), a near-memory structure in which the smallest memory partition keeps
the translation information for its data share, ensuring that the translation
completes together with the data fetch. DIPTA completely eliminates the
performance overhead of translation, achieving speedups of up to 3.81x and
2.13x over conventional translation using 4KB and 1GB pages respectively.Comment: 15 pages, 9 figure
- âŠ