13,925 research outputs found
Peer-based query rewriting in SPARQL for semantic integration of linked data
In this proposal we address the problem of ontology-based SPARQL query answering over distributed Linked Data sources, where the ontology is given by conjunctive mappings between the source schemas in a peer-to-peer fashion and by equality constraints between constants. In our setting, the data is not materialised in a single datastore: it is accessed in a distributed environment through SPARQL endpoints. We aim to achieve query answering by generating the perfect rewriting of the original query and then processing the rewritten query over distributed SPARQL endpoints. We identify a subset of ontology constraints that enjoy the first-order rewritability property and we perform preliminary empirical evaluation taking into account such restricted constraints only. For future work, we aim to tackle the query answering problem in the general case
Data pre-processing:Case of sensor data consistency based on Bi-temporal concepts
The volume, velocity, variety, veracity and value of data currently produced and consumed by different types of information systems turned big Data into a phenomena of study. For data variety, temporal data commonly represents a source of potential inconsistency. This paper reports on a research endeavor for treating the problem of how to minimize inconsistencies in temporal databases due to unavailability of big data. This problem often occurs in situations where a same query is executed on the same data set at different points in time. To address this issue, we propose query optimization strategies based on query transformation and rewriting rules, to amend data consistency in temporal databases. We validate these strategies proposed via case scenario in sensor data analysis, and via manual data input, both for local and distributed query environments
Semantic Query Reformulation in Social PDMS
We consider social peer-to-peer data management systems (PDMS), where each
peer maintains both semantic mappings between its schema and some
acquaintances, and social links with peer friends. In this context,
reformulating a query from a peer's schema into other peer's schemas is a hard
problem, as it may generate as many rewritings as the set of mappings from that
peer to the outside and transitively on, by eventually traversing the entire
network. However, not all the obtained rewritings are relevant to a given
query. In this paper, we address this problem by inspecting semantic mappings
and social links to find only relevant rewritings. We propose a new notion of
'relevance' of a query with respect to a mapping, and, based on this notion, a
new semantic query reformulation approach for social PDMS, which achieves great
accuracy and flexibility. To find rapidly the most interesting mappings, we
combine several techniques: (i) social links are expressed as FOAF (Friend of a
Friend) links to characterize peer's friendship and compact mapping summaries
are used to obtain mapping descriptions; (ii) local semantic views are special
views that contain information about external mappings; and (iii) gossiping
techniques improve the search of relevant mappings. Our experimental
evaluation, based on a prototype on top of PeerSim and a simulated network
demonstrate that our solution yields greater recall, compared to traditional
query translation approaches proposed in the literature.Comment: 29 pages, 8 figures, query rewriting in PDM
Scalable Semantic Matching of Queries to Ads in Sponsored Search Advertising
Sponsored search represents a major source of revenue for web search engines.
This popular advertising model brings a unique possibility for advertisers to
target users' immediate intent communicated through a search query, usually by
displaying their ads alongside organic search results for queries deemed
relevant to their products or services. However, due to a large number of
unique queries it is challenging for advertisers to identify all such relevant
queries. For this reason search engines often provide a service of advanced
matching, which automatically finds additional relevant queries for advertisers
to bid on. We present a novel advanced matching approach based on the idea of
semantic embeddings of queries and ads. The embeddings were learned using a
large data set of user search sessions, consisting of search queries, clicked
ads and search links, while utilizing contextual information such as dwell time
and skipped ads. To address the large-scale nature of our problem, both in
terms of data and vocabulary size, we propose a novel distributed algorithm for
training of the embeddings. Finally, we present an approach for overcoming a
cold-start problem associated with new ads and queries. We report results of
editorial evaluation and online tests on actual search traffic. The results
show that our approach significantly outperforms baselines in terms of
relevance, coverage, and incremental revenue. Lastly, we open-source learned
query embeddings to be used by researchers in computational advertising and
related fields.Comment: 10 pages, 4 figures, 39th International ACM SIGIR Conference on
Research and Development in Information Retrieval, SIGIR 2016, Pisa, Ital
Distributed Reasoning in a Peer-to-Peer Setting: Application to the Semantic Web
In a peer-to-peer inference system, each peer can reason locally but can also
solicit some of its acquaintances, which are peers sharing part of its
vocabulary. In this paper, we consider peer-to-peer inference systems in which
the local theory of each peer is a set of propositional clauses defined upon a
local vocabulary. An important characteristic of peer-to-peer inference systems
is that the global theory (the union of all peer theories) is not known (as
opposed to partition-based reasoning systems). The main contribution of this
paper is to provide the first consequence finding algorithm in a peer-to-peer
setting: DeCA. It is anytime and computes consequences gradually from the
solicited peer to peers that are more and more distant. We exhibit a sufficient
condition on the acquaintance graph of the peer-to-peer inference system for
guaranteeing the completeness of this algorithm. Another important contribution
is to apply this general distributed reasoning setting to the setting of the
Semantic Web through the Somewhere semantic peer-to-peer data management
system. The last contribution of this paper is to provide an experimental
analysis of the scalability of the peer-to-peer infrastructure that we propose,
on large networks of 1000 peers
LiteMat: a scalable, cost-efficient inference encoding scheme for large RDF graphs
The number of linked data sources and the size of the linked open data graph
keep growing every day. As a consequence, semantic RDF services are more and
more confronted with various "big data" problems. Query processing in the
presence of inferences is one them. For instance, to complete the answer set of
SPARQL queries, RDF database systems evaluate semantic RDFS relationships
(subPropertyOf, subClassOf) through time-consuming query rewriting algorithms
or space-consuming data materialization solutions. To reduce the memory
footprint and ease the exchange of large datasets, these systems generally
apply a dictionary approach for compressing triple data sizes by replacing
resource identifiers (IRIs), blank nodes and literals with integer values. In
this article, we present a structured resource identification scheme using a
clever encoding of concepts and property hierarchies for efficiently evaluating
the main common RDFS entailment rules while minimizing triple materialization
and query rewriting. We will show how this encoding can be computed by a
scalable parallel algorithm and directly be implemented over the Apache Spark
framework. The efficiency of our encoding scheme is emphasized by an evaluation
conducted over both synthetic and real world datasets.Comment: 8 pages, 1 figur
Efficient & Effective Selective Query Rewriting with Efficiency Predictions
To enhance effectiveness, a user's query can be rewritten internally by the search engine in many ways, for example by applying proximity, or by expanding the query with related terms. However, approaches that benefit effectiveness often have a negative impact on efficiency, which has impacts upon the user satisfaction, if the query is excessively slow. In this paper, we propose a novel framework for using the predicted execution time of various query rewritings to select between alternatives on a per-query basis, in a manner that ensures both effectiveness and efficiency. In particular, we propose the prediction of the execution time of ephemeral (e.g., proximity) posting lists generated from uni-gram inverted index posting lists, which are used in establishing the permissible query rewriting alternatives that may execute in the allowed time. Experiments examining both the effectiveness and efficiency of the proposed approach demonstrate that a 49% decrease in mean response time (and 62% decrease in 95th-percentile response time) can be attained without significantly hindering the effectiveness of the search engine
ReStore: Reusing Results of MapReduce Jobs
Analyzing large scale data has emerged as an important activity for many
organizations in the past few years. This large scale data analysis is
facilitated by the MapReduce programming and execution model and its
implementations, most notably Hadoop. Users of MapReduce often have analysis
tasks that are too complex to express as individual MapReduce jobs. Instead,
they use high-level query languages such as Pig, Hive, or Jaql to express their
complex tasks. The compilers of these languages translate queries into
workflows of MapReduce jobs. Each job in these workflows reads its input from
the distributed file system used by the MapReduce system and produces output
that is stored in this distributed file system and read as input by the next
job in the workflow. The current practice is to delete these intermediate
results from the distributed file system at the end of executing the workflow.
One way to improve the performance of workflows of MapReduce jobs is to keep
these intermediate results and reuse them for future workflows submitted to the
system. In this paper, we present ReStore, a system that manages the storage
and reuse of such intermediate results. ReStore can reuse the output of whole
MapReduce jobs that are part of a workflow, and it can also create additional
reuse opportunities by materializing and storing the output of query execution
operators that are executed within a MapReduce job. We have implemented ReStore
as an extension to the Pig dataflow system on top of Hadoop, and we
experimentally demonstrate significant speedups on queries from the PigMix
benchmark.Comment: VLDB201
Extending a multi-set relational algebra to a parallel environment
Parallel database systems will very probably be the future for high-performance data-intensive applications. In the past decade, many parallel database systems have been developed, together with many languages and approaches to specify operations in these systems. A common background is still missing, however. This paper proposes an extended relational algebra for this purpose, based on the well-known standard relational algebra. The extended algebra provides both complete database manipulation language features, and data distribution and process allocation primitives to describe parallelism. It is defined in terms of multi-sets of tuples to allow handling of duplicates and to obtain a close connection to the world of high-performance data processing. Due to its algebraic nature, the language is well suited for optimization and parallelization through expression rewriting. The proposed language can be used as a database manipulation language on its own, as has been done in the PRISMA parallel database project, or as a formal basis for other languages, like SQL
- …