Social Search with Missing Data: Which Ranking Algorithm?
Online social networking tools are extremely popular, but can miss potential discoveries latent in the social 'fabric'. Matchmaking services which can do naive profile matching with old database technology are too brittle in the absence of key data, and even modern ontological markup, though powerful, can be onerous at data-input time. In this paper, we present a system called BuddyFinder which can automatically identify buddies who can best match a user's search requirements specified in a term-based query, even in the absence of stored user-profiles. We deploy and compare five statistical measures, namely, our own CORDER, mutual information (MI), phi-squared, improved MI and Z score, and two TF/IDF based baseline methods to find online users who best match the search requirements based on 'inferred profiles' of these users in the form of scavenged web pages. These measures identify statistically significant relationships between online users and a term-based query. Our user evaluation on two groups of users shows that BuddyFinder can find users highly relevant to search queries, and that CORDER achieved the best average ranking correlations among all seven algorithms and improved the performance of both baseline methods.
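To make the idea of scoring a user against a query term concrete, here is a minimal sketch of two of the measures named above, in their standard textbook forms: pointwise mutual information and phi-squared computed from co-occurrence counts. It does not implement CORDER, and the exact formulations used by BuddyFinder may differ. The function names, the count layout, and the sample counts are assumptions made for illustration only.

```python
import math


def pointwise_mi(n_xy, n_x, n_y, n_total):
    """Pointwise mutual information (base 2) between a user x and a term y.

    n_xy: pages mentioning both, n_x: pages mentioning the user,
    n_y: pages mentioning the term, n_total: pages in the collection.
    """
    if n_xy == 0:
        return float("-inf")  # never co-occur
    return math.log2((n_xy * n_total) / (n_x * n_y))


def phi_squared(n_xy, n_x, n_y, n_total):
    """Phi-squared association from the 2x2 contingency table of x and y."""
    a = n_xy                          # both present
    b = n_x - n_xy                    # user only
    c = n_y - n_xy                    # term only
    d = n_total - n_x - n_y + n_xy    # neither
    denom = (a + b) * (a + c) * (b + d) * (c + d)
    if denom == 0:
        return 0.0
    return (a * d - b * c) ** 2 / denom


def rank_users(counts, n_total, measure):
    """Rank users by a measure; counts maps user -> (n_xy, n_x, n_y)."""
    scores = {
        user: measure(n_xy, n_x, n_y, n_total)
        for user, (n_xy, n_x, n_y) in counts.items()
    }
    return sorted(scores, key=scores.get, reverse=True)


if __name__ == "__main__":
    # Hypothetical counts over a collection of 100 scavenged pages.
    counts = {"alice": (10, 20, 20), "bob": (4, 20, 20)}
    print(rank_users(counts, 100, pointwise_mi))
    print(rank_users(counts, 100, phi_squared))
```

Both measures score a pair at zero when the user and the term co-occur exactly as often as chance would predict, which is why they can separate statistically significant associations from incidental ones.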
On-Demand Big Data Integration: A Hybrid ETL Approach for Reproducible Scientific Research
Scientific research requires access, analysis, and sharing of data that is
distributed across various heterogeneous data sources at the scale of the
Internet. An eager ETL process constructs an integrated data repository as its
first step, integrating and loading data in its entirety from the data sources.
The bootstrapping of this process is not efficient for scientific research that
requires access to data from very large and typically numerous distributed data
sources. A lazy ETL process loads only the metadata, but still eagerly. Lazy
ETL is faster in bootstrapping. However, queries on the integrated data
repository of eager ETL perform faster, due to the availability of the entire
data beforehand.
In this paper, we propose a novel ETL approach for scientific data
integration, as a hybrid of eager and lazy ETL approaches, applied to both
data and metadata. This way, Hybrid ETL supports incremental integration
and loading of metadata and data from the data sources. We incorporate a
human-in-the-loop approach, to enhance the hybrid ETL, with selective data
integration driven by the user queries and sharing of integrated data between
users. We implement our hybrid ETL approach in a prototype platform, Obidos,
and evaluate it in the context of data sharing for medical research. Obidos
outperforms both the eager ETL and lazy ETL approaches, for scientific research
data integration and sharing, through its selective loading of data and
metadata, while storing the integrated data in a scalable integrated data
repository.
Comment: Pre-print submitted to the DMAH Special Issue of the Springer DAPD
Journal
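The contrast between eager loading and query-driven loading can be sketched in a few lines. The following is a minimal illustration of on-demand loading with a shared cache, assuming each data source is represented by a zero-argument loader function. The class name, its methods, and the sample sources are hypothetical and are not taken from Obidos.

```python
class OnDemandRepository:
    """Loads a dataset only when a query first asks for it, then reuses it."""

    def __init__(self, sources):
        # sources: dataset id -> zero-argument callable returning the data
        # (a hypothetical stand-in for a remote data source).
        self._sources = sources
        self._cache = {}
        self.load_count = 0

    def query(self, dataset_id):
        """Return the dataset, loading and caching it on first access."""
        if dataset_id not in self._cache:
            self._cache[dataset_id] = self._sources[dataset_id]()
            self.load_count += 1
        return self._cache[dataset_id]

    def loaded(self):
        """Ids of the datasets integrated so far, in sorted order."""
        return sorted(self._cache)


if __name__ == "__main__":
    repo = OnDemandRepository({
        "scans": lambda: ["scan-1", "scan-2"],
        "labs": lambda: ["lab-1"],
    })
    repo.query("scans")   # first access triggers a load
    repo.query("scans")   # second access is served from the cache
    print(repo.loaded(), repo.load_count)
```

Nothing is loaded at start-up, so bootstrapping is cheap, while repeated queries on the same dataset run against already integrated data.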
On Optimally Partitioning Variable-Byte Codes
The ubiquitous Variable-Byte encoding is one of the fastest compressed
representations for integer sequences. However, its compression ratio is usually
not competitive with other more sophisticated encoders, especially when the
integers to be compressed are small, which is the typical case for inverted
indexes. This paper shows that the compression ratio of Variable-Byte can be
improved by 2x by adopting a partitioned representation of the inverted lists.
This makes Variable-Byte surprisingly competitive in space with the best
bit-aligned encoders, hence disproving the folklore belief that Variable-Byte
is space-inefficient for inverted index compression. Despite the significant
space savings, we show that our optimization comes almost for free: we
introduce an optimal partitioning algorithm that does not affect indexing
time because of its linear-time complexity, and we show that the query processing
speed of Variable-Byte is preserved, with an extensive experimental analysis
and comparison with several other state-of-the-art encoders.
Comment: Published in IEEE Transactions on Knowledge and Data Engineering
(TKDE), 15 April 201
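For readers unfamiliar with the baseline encoding, the sketch below shows plain Variable-Byte coding applied to the gaps between sorted document ids. It covers only the unpartitioned baseline and does not implement the partitioning algorithm proposed in the paper. The byte convention used here (7 payload bits per byte, low-order bits first, high bit set on the last byte of each integer) is one common variant and is an assumption, as are the function names.

```python
def vbyte_encode(numbers):
    """Encode non-negative integers with 7 payload bits per byte."""
    out = bytearray()
    for n in numbers:
        if n < 0:
            raise ValueError("Variable-Byte encodes non-negative integers only")
        while n >= 128:
            out.append(n & 0x7F)   # continuation byte, high bit clear
            n >>= 7
        out.append(n | 0x80)       # final byte of this integer, high bit set
    return bytes(out)


def vbyte_decode(data):
    """Decode a byte string produced by vbyte_encode."""
    numbers = []
    value, shift = 0, 0
    for byte in data:
        if byte & 0x80:            # final byte: finish the current integer
            numbers.append(value | ((byte & 0x7F) << shift))
            value, shift = 0, 0
        else:
            value |= byte << shift
            shift += 7
    return numbers


def to_gaps(postings):
    """Turn a strictly increasing list of document ids into gaps."""
    gaps, prev = [], 0
    for doc_id in postings:
        gaps.append(doc_id - prev)
        prev = doc_id
    return gaps


def from_gaps(gaps):
    """Inverse of to_gaps: prefix sums restore the document ids."""
    postings, total = [], 0
    for gap in gaps:
        total += gap
        postings.append(total)
    return postings


if __name__ == "__main__":
    postings = [3, 7, 130, 131, 20000]
    encoded = vbyte_encode(to_gaps(postings))
    print(len(encoded), from_gaps(vbyte_decode(encoded)) == postings)
```

Every gap below 128 still costs a full byte, which illustrates why the encoding wastes space when the integers are small.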
Xu: An Automated Query Expansion and Optimization Tool
The exponential growth of information on the Internet makes it challenging for
information retrieval systems to generate relevant results. Novel
approaches are required to reformat or expand user queries to generate a
satisfactory response and increase recall and precision. Query expansion (QE)
is a technique to broaden users' queries by introducing additional tokens or
phrases based on some semantic similarity metrics. The tradeoff is the added
computational complexity to find semantically similar words and a possible
increase in noise in information retrieval. Despite several research efforts on
this topic, QE has not yet been explored enough and more work is needed on
similarity matching and composition of query terms with an objective to
retrieve a small set of most appropriate responses. QE should be scalable,
fast, and robust in handling complex queries with a good response time and
noise ceiling. In this paper, we propose Xu, an automated QE technique, using
high-dimensional clustering of word vectors and the Datamuse API, an open source
query engine to find semantically similar words. We implemented Xu as a command
line tool and evaluated its performance using datasets containing news
articles and human-generated QEs. The evaluation results show that Xu
outperformed Datamuse, achieving about 88% accuracy with reference to the
human-generated QE.
Comment: Accepted to IEEE COMPSAC 201
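The core step of query expansion, adding terms that are semantically close to the original ones, can be sketched with cosine similarity over word vectors. This is a simplified nearest-neighbour illustration, not the clustering method used by Xu, and it makes no calls to the Datamuse API. The toy three-dimensional vectors, the function names, and the `k` and `min_sim` parameters are assumptions chosen for the example.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 for a zero vector)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def expand_query(terms, vectors, k=2, min_sim=0.5):
    """Append up to k similar words per query term.

    min_sim is a similarity floor that limits the noise added by expansion.
    """
    expanded = list(terms)
    for term in terms:
        if term not in vectors:
            continue  # unknown term: nothing to expand
        scored = [
            (cosine(vectors[term], vec), word)
            for word, vec in vectors.items()
            if word != term
        ]
        scored.sort(reverse=True)  # most similar first
        for sim, word in scored[:k]:
            if sim >= min_sim and word not in expanded:
                expanded.append(word)
    return expanded


if __name__ == "__main__":
    # Toy vectors, invented for illustration only.
    vectors = {
        "car": [1.0, 0.1, 0.0],
        "automobile": [0.9, 0.2, 0.0],
        "vehicle": [0.8, 0.3, 0.1],
        "banana": [0.0, 0.1, 1.0],
    }
    print(expand_query(["car"], vectors))
```

Raising `min_sim` or lowering `k` trades recall for precision, which mirrors the tradeoff between broader queries and added noise described in the abstract.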
On the implementation of E.R.I.K - Effective Retrieval of Information by Keyword: an information storage and retrieval research system
The accelerating production of information, often referred to as the information explosion, requires specialized methods for the storage and access of this wealth of knowledge. One solution to the problems of indexing, filing and retrieval of information has been Information Storage and Retrieval (ISAR) systems. The purpose of this thesis is to implement a functional experimental system (known as ERIK – Effective Retrieval of Information by Keyword) for research into various aspects of information storage and retrieval. An ISAR system has been created, incorporating numerous features found in both commercially available and recent research retrieval systems, as well as a few novel features. The design and implementation (originally done on a personal computer, now ported to a UNIX* environment) of the system is discussed as well as some of the background of the problems faced in the information storage and retrieval systems domain. An exciting implementation of an information retrieval system utilizing a newly developed massively parallel computer is also discussed, and its approach to the problem of information retrieval is contrasted with existing solutions. * UNIX is a trademark of A.T.&T. Bell Laboratories.
EAGLE—A Scalable Query Processing Engine for Linked Sensor Data
Recently, many approaches have been proposed to manage sensor data using semantic web technologies for effective heterogeneous data integration. However, our empirical observations revealed that these solutions primarily focused on semantic relationships and unfortunately paid less attention to spatio–temporal correlations. Most semantic approaches do not have spatio–temporal support. Some of them have attempted to provide full spatio–temporal support, but have poor performance for complex spatio–temporal aggregate queries. In addition, while the volume of sensor data is rapidly growing, the challenge of querying and managing the massive volumes of data generated by sensing devices still remains unsolved. In this article, we introduce EAGLE, a spatio–temporal query engine for querying sensor data based on the linked data model. The ultimate goal of EAGLE is to provide an elastic and scalable system which allows fast searching and analysis with respect to the relationships of space, time and semantics in sensor data. We also extend SPARQL with a set of new query operators in order to support spatio–temporal computing in the linked sensor data context.
EC/H2020/732679/EU/ACTivating InnoVative IoT smart living environments for AGEing well/ACTIVAGE
EC/H2020/661180/EU/A Scalable and Elastic Platform for Near-Realtime Analytics for The Graph of Everything/SMARTE