On-Demand Big Data Integration: A Hybrid ETL Approach for Reproducible Scientific Research
Scientific research requires access, analysis, and sharing of data that is
distributed across various heterogeneous data sources at the scale of the
Internet. An eager ETL process constructs an integrated data repository as its
first step, integrating and loading data in its entirety from the data sources.
The bootstrapping of this process is not efficient for scientific research that
requires access to data from very large and typically numerous distributed data
sources. A lazy ETL process loads only the metadata, but still eagerly. Lazy
ETL is faster in bootstrapping. However, queries on the integrated data
repository of eager ETL perform faster, due to the availability of the entire
data beforehand.
In this paper, we propose a novel ETL approach for scientific data
integration, as a hybrid of eager and lazy ETL approaches, and applied both to
data as well as metadata. This way, Hybrid ETL supports incremental integration
and loading of metadata and data from the data sources. We incorporate a
human-in-the-loop approach, to enhance the hybrid ETL, with selective data
integration driven by the user queries and sharing of integrated data between
users. We implement our hybrid ETL approach in a prototype platform, Obidos,
and evaluate it in the context of data sharing for medical research. Obidos
outperforms both the eager ETL and lazy ETL approaches, for scientific research
data integration and sharing, through its selective loading of data and
metadata, while storing the integrated data in a scalable integrated data
repository. Comment: Pre-print Submitted to the DMAH Special Issue of the Springer DAPD
Journa
A Density-Based Approach to the Retrieval of Top-K Spatial Textual Clusters
Keyword-based web queries with local intent retrieve web content that is
relevant to supplied keywords and that represents points of interest that are
near the query location. Two broad categories of such queries exist. The first
encompasses queries that retrieve single spatial web objects that each satisfy
the query arguments. Most proposals belong to this category. The second
category, to which this paper's proposal belongs, encompasses queries that
support exploratory user behavior and retrieve sets of objects that represent
regions of space that may be of interest to the user. Specifically, the paper
proposes a new type of query, namely the top-k spatial textual clusters (k-STC)
query that returns the top-k clusters that (i) are located the closest to a
given query location, (ii) contain the most relevant objects with regard to
given query keywords, and (iii) have an object density that exceeds a given
threshold. To compute this query, we propose a basic algorithm that relies on
on-line density-based clustering and exploits an early stop condition. To
improve the response time, we design an advanced approach that includes three
techniques: (i) an object skipping rule, (ii) spatially gridded posting lists,
and (iii) a fast range query algorithm. An empirical study on real data
demonstrates that the paper's proposals offer scalability and are capable of
excellent performance.
The Case for Learned Index Structures
Indexes are models: a B-Tree-Index can be seen as a model to map a key to the
position of a record within a sorted array, a Hash-Index as a model to map a
key to a position of a record within an unsorted array, and a BitMap-Index as a
model to indicate if a data record exists or not. In this exploratory research
paper, we start from this premise and posit that all existing index structures
can be replaced with other types of models, including deep-learning models,
which we term learned indexes. The key idea is that a model can learn the sort
order or structure of lookup keys and use this signal to effectively predict
the position or existence of records. We theoretically analyze under which
conditions learned indexes outperform traditional index structures and describe
the main challenges in designing learned index structures. Our initial results
show that, by using neural nets, we are able to outperform cache-optimized
B-Trees by up to 70% in speed while saving an order of magnitude in memory over
several real-world data sets. More importantly though, we believe that the idea
of replacing core components of a data management system through learned models
has far-reaching implications for future systems designs and that this work
just provides a glimpse of what might be possible.
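The premise that an index is a model mapping a key to a position in a sorted array can be illustrated with a minimal Python sketch. This is not the paper's implementation: it assumes a single least-squares line rather than the neural nets the abstract mentions, and the class name `LinearLearnedIndex` and its methods are hypothetical. The model predicts a position, and the largest prediction error seen on the stored keys bounds a small binary search around that prediction.

```python
import bisect


class LinearLearnedIndex:
    """Toy learned index (illustrative assumption, not the paper's design):
    a least-squares line predicts a key's position in a sorted list."""

    def __init__(self, keys):
        self.keys = sorted(keys)
        n = len(self.keys)
        if n == 0:
            raise ValueError("at least one key is required")
        # Fit position ~ slope * key + intercept by ordinary least squares.
        mean_key = sum(self.keys) / n
        mean_pos = (n - 1) / 2
        var = sum((k - mean_key) ** 2 for k in self.keys)
        cov = sum((k - mean_key) * (i - mean_pos)
                  for i, k in enumerate(self.keys))
        self.slope = cov / var if var else 0.0
        self.intercept = mean_pos - self.slope * mean_key
        # The worst prediction error on stored keys bounds the search window.
        self.max_err = max(abs(self._predict(k) - i)
                           for i, k in enumerate(self.keys))

    def _predict(self, key):
        return int(round(self.slope * key + self.intercept))

    def lookup(self, key):
        """Return the position of key in the sorted list, or None if absent."""
        n = len(self.keys)
        guess = self._predict(key)
        # Clamp the window so that 0 <= lo <= hi <= n.
        lo = min(max(0, guess - self.max_err), n)
        hi = min(max(lo, guess + self.max_err + 1), n)
        i = bisect.bisect_left(self.keys, key, lo, hi)
        if i < n and self.keys[i] == key:
            return i
        return None


index = LinearLearnedIndex([3, 8, 15, 16, 23, 42, 77, 100])
print(index.lookup(23), index.lookup(5))
```

The closer the keys follow the fitted line, the smaller `max_err` becomes and the fewer comparisons each lookup needs; for badly fitting key distributions the window widens toward an ordinary binary search.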
Efficient Spatial Keyword Search in Trajectory Databases
An increasing amount of trajectory data is being annotated with text
descriptions to better capture the semantics associated with locations. The
fusion of spatial locations and text descriptions in trajectories engenders a
new type of top-k queries that take into account both aspects. Each
trajectory in consideration consists of a sequence of geo-spatial locations
associated with text descriptions. Given a user location and a
keyword set, a top-k query returns k trajectories whose text
descriptions cover the keywords and that have the shortest match
distance. To the best of our knowledge, previous research on querying
trajectory databases has focused on trajectory data without any text
description, and no existing work has studied this kind of top-k query on
trajectories. This paper proposes a novel method for efficiently computing
top-k trajectories. The method is developed based on a new hybrid index,
cell-keyword conscious B-tree, denoted by \cellbtree, which enables us to
exploit both text relevance and location proximity to facilitate efficient and
effective query processing. The results of our extensive empirical studies with
an implementation of the proposed algorithms on BerkeleyDB demonstrate that our
proposed methods are capable of achieving excellent performance and good
scalability. Comment: 12 page
Universal Indexes for Highly Repetitive Document Collections
Indexing highly repetitive collections has become a relevant problem with the
emergence of large repositories of versioned documents, among other
applications. These collections may reach huge sizes, but are formed mostly of
documents that are near-copies of others. Traditional techniques for indexing
these collections fail to properly exploit their regularities in order to
reduce space.
We introduce new techniques for compressing inverted indexes that exploit
this near-copy regularity. They are based on run-length, Lempel-Ziv, or grammar
compression of the differential inverted lists, instead of the usual practice
of gap-encoding them. We show that, in this highly repetitive setting, our
compression methods significantly reduce the space obtained with classical
techniques, at the price of moderate slowdowns. Moreover, our best methods are
universal, that is, they do not need to know the versioning structure of the
collection, nor that a clear versioning structure even exists.
We also introduce compressed self-indexes in the comparison. These are
designed for general strings (not only natural language texts) and represent
the text collection plus the index structure (not an inverted index) in
integrated form. We show that these techniques can compress much further, using
a small fraction of the space required by our new inverted indexes. Yet, they
are orders of magnitude slower. Comment: This research has received funding from the European Union's Horizon
2020 research and innovation programme under the Marie Skłodowska-Curie
Actions H2020-MSCA-RISE-2015 BIRDS GA No. 69094
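The contrast the abstract draws between gap-encoding an inverted list and run-length compressing the differential list can be sketched in Python. This illustrates the general idea only, not the authors' encoding: the function names are hypothetical, and document identifiers are assumed to be positive and strictly increasing. When a term occurs in many consecutive versions of a document, its gaps are long runs of 1, which collapse into a single (gap, count) pair.

```python
def gap_encode(postings):
    """Turn a sorted posting list into its differential (gap) list."""
    gaps = []
    previous = 0
    for doc_id in postings:
        gaps.append(doc_id - previous)
        previous = doc_id
    return gaps


def run_length_encode(gaps):
    """Collapse consecutive equal gaps into (gap, count) pairs."""
    runs = []
    for gap in gaps:
        if runs and runs[-1][0] == gap:
            runs[-1][1] += 1
        else:
            runs.append([gap, 1])
    return [(gap, count) for gap, count in runs]


def decode(runs):
    """Rebuild the original posting list from (gap, count) pairs."""
    postings = []
    current = 0
    for gap, count in runs:
        for _ in range(count):
            current += gap
            postings.append(current)
    return postings


# Hypothetical term occurring in versions 10..17 of one document and in 40.
postings = [10, 11, 12, 13, 14, 15, 16, 17, 40]
runs = run_length_encode(gap_encode(postings))
print(runs)  # [(10, 1), (1, 7), (23, 1)]
```

Nine postings shrink to three pairs here; on a collection without near-copies the gaps rarely repeat and the run-length form gives no benefit over plain gap-encoding.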
Location-based indexing for mobile context-aware access to a digital library
Mobile information systems need to collaborate with each other to provide seamless information access to the user. Information about the user and their context provides the points of contact between the systems. Location is the most basic user context.
TIP is a mobile tourist information system that provides location-based access to documents in the digital library Greenstone. This paper identifies the challenges for providing efficient access to location-based information using the various access modes a tourist requires on their travels. We discuss our extended 2DR-tree approach to meet these challenges.