Reasoning & Querying – State of the Art
Various query languages for Web and Semantic Web data, both for practical use and as an area of research in the scientific community, have emerged in recent years. At the same time, the broad adoption of the internet, where keyword search is used in many applications such as search engines, has familiarized casual users with using keyword queries to retrieve information. Unlike this easy-to-use querying, traditional query languages require knowledge of the language itself as well as of the data to be queried. Keyword-based query languages for XML and RDF bridge the gap between the two, aiming to enable simple querying of semi-structured data, which is relevant, for example, in the context of the emerging Semantic Web. This article presents an overview of the field of keyword querying for XML and RDF.
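As a rough illustration of keyword querying over XML, the sketch below returns the deepest elements whose subtree text contains every query keyword. The function name `keyword_search`, the sample document and the matching rule are assumptions made for this example; the surveyed article is not the source of this code, and the languages it covers may rank or scope results quite differently.

```python
import xml.etree.ElementTree as ET


def keyword_search(root, keywords):
    """Return the deepest elements whose subtree text contains every keyword.

    Hypothetical helper: plain case-insensitive substring matching, no ranking.
    """
    wanted = [k.lower() for k in keywords]
    hits = []

    def visit(elem):
        # Text of the element and all of its descendants.
        text = " ".join(elem.itertext()).lower()
        if not all(k in text for k in wanted):
            return False
        # Prefer a matching child over its parent, so results stay small.
        child_matched = False
        for child in elem:
            if visit(child):
                child_matched = True
        if not child_matched:
            hits.append(elem)
        return True

    visit(root)
    return hits


# Made-up sample data for the example.
DOC = """<library>
  <book><title>Querying XML</title><author>Smith</author></book>
  <book><title>RDF Basics</title><author>Jones</author></book>
</library>"""

root = ET.fromstring(DOC)
matches = keyword_search(root, ["xml", "smith"])
titles = [m.findtext("title") for m in matches]
```

The user supplies only keywords, with no path expressions or schema knowledge, which is the contrast with traditional query languages that the abstract draws.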
A Formal Framework for Linguistic Annotation
'Linguistic annotation' covers any descriptive or analytic notations applied
to raw language data. The basic data may be in the form of time functions --
audio, video and/or physiological recordings -- or it may be textual. The added
notations may include transcriptions of all sorts (from phonetic features to
discourse structures), part-of-speech and sense tagging, syntactic analysis,
'named entity' identification, co-reference annotation, and so on. While there
are several ongoing efforts to provide formats and tools for such annotations
and to publish annotated linguistic databases, the lack of widely accepted
standards is becoming a critical problem. Proposed standards, to the extent
they exist, have focussed on file formats. This paper focuses instead on the
logical structure of linguistic annotations. We survey a wide variety of
existing annotation formats and demonstrate a common conceptual core, the
annotation graph. This provides a formal framework for constructing,
maintaining and searching linguistic annotations, while remaining consistent
with many alternative data structures and file formats.
Comment: 49 pages
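One way to picture a graph-shaped annotation store is sketched below, assuming a representation in which nodes optionally carry a time offset into the recording and labelled arcs between nodes hold the annotations. The class names (`Node`, `Arc`, `AnnotationGraph`), the `layer` field and the sample data are hypothetical and chosen for this example; the paper's formal definition may differ in its details.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass(frozen=True)
class Node:
    ident: str
    time: Optional[float] = None  # offset in seconds, if known (assumption)


@dataclass(frozen=True)
class Arc:
    start: Node
    end: Node
    layer: str  # kind of annotation, e.g. "word" or "pos" (assumption)
    label: str


@dataclass
class AnnotationGraph:
    arcs: list = field(default_factory=list)

    def annotate(self, start, end, layer, label):
        """Add one labelled arc between two nodes."""
        self.arcs.append(Arc(start, end, layer, label))

    def search(self, layer, begin, finish):
        """Labels on `layer` whose time span lies within [begin, finish]."""
        return [
            a.label
            for a in self.arcs
            if a.layer == layer
            and a.start.time is not None
            and a.end.time is not None
            and begin <= a.start.time
            and a.end.time <= finish
        ]


# Made-up example: two words with part-of-speech tags and one phrase.
n0, n1, n2 = Node("n0", 0.0), Node("n1", 0.4), Node("n2", 0.9)
g = AnnotationGraph()
g.annotate(n0, n1, "word", "the")
g.annotate(n1, n2, "word", "cat")
g.annotate(n0, n1, "pos", "DET")
g.annotate(n1, n2, "pos", "NOUN")
g.annotate(n0, n2, "phrase", "NP")
```

Because several layers share the same nodes, transcription, tagging and syntactic spans can coexist in one structure regardless of the file format they were read from.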
A Selectivity based approach to Continuous Pattern Detection in Streaming Graphs
Cyber security is one of the most significant technical challenges in current
times. Detecting adversarial activities and preventing the theft of intellectual
property and customer data are high priorities for corporations and government
agencies around the world. Cyber defenders need to analyze massive-scale,
high-resolution network flows to identify, categorize, and mitigate attacks
involving networks spanning institutional and national boundaries. Many of the
cyber attacks can be described as subgraph patterns, with prominent examples
being insider infiltrations (path queries), denial of service (parallel paths)
and malicious spreads (tree queries). This motivates us to explore subgraph
matching on streaming graphs in a continuous setting. The novelty of our work
lies in using the subgraph distributional statistics collected from the
streaming graph to determine the query processing strategy. We introduce a
"Lazy Search" algorithm where the search strategy is decided on a
vertex-to-vertex basis depending on the likelihood of a match in the vertex
neighborhood. We also propose a metric named "Relative Selectivity" that is
used to select between different query processing strategies. Our experiments
performed on real online news, network traffic stream and a synthetic social
network benchmark demonstrate 10-100x speedups over selectivity agnostic
approaches.
Comment: in 18th International Conference on Extending Database Technology
(EDBT) (2015)
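The general idea of letting stream statistics steer query processing can be sketched as follows. This is a deliberately simplified stand-in and not the paper's "Lazy Search" algorithm or its "Relative Selectivity" metric: it only counts edge types seen so far and starts matching from the rarest edge type in the query. The names `EdgeTypeStats`, `most_selective_edge` and the sample stream are hypothetical.

```python
from collections import Counter


class EdgeTypeStats:
    """Running frequency of edge types observed in the stream (assumed statistic)."""

    def __init__(self):
        self.counts = Counter()
        self.total = 0

    def observe(self, src, dst, edge_type):
        # Only the edge type is counted; src and dst are unused in this sketch.
        self.counts[edge_type] += 1
        self.total += 1

    def selectivity(self, edge_type):
        """Fraction of observed edges with this type (lower means rarer)."""
        return self.counts[edge_type] / self.total if self.total else 0.0


def most_selective_edge(query_edge_types, stats):
    """Pick the query edge type that has been rarest in the stream so far.

    An edge type never observed has selectivity 0.0 and is therefore chosen first.
    """
    return min(query_edge_types, key=stats.selectivity)


# Made-up stream of (source, destination, edge type) records.
stream = [
    ("a", "b", "login"),
    ("b", "c", "login"),
    ("c", "d", "dns"),
    ("a", "c", "login"),
    ("d", "e", "file_copy"),
    ("e", "f", "dns"),
]

stats = EdgeTypeStats()
for src, dst, edge_type in stream:
    stats.observe(src, dst, edge_type)

# A path query expressed as a sequence of edge types (assumption).
query = ["login", "dns", "file_copy"]
start = most_selective_edge(query, stats)
```

Anchoring the search at the rarest edge keeps the number of partial matches small, which is the intuition behind selectivity-aware strategies in general.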
A Distributed Path Query Engine for Temporal Property Graphs
Property graphs are a common form of linked data, with path queries used to
traverse and explore them for enterprise transactions and mining. Temporal
property graphs are a recent variant where time is a first-class entity to be
queried over, and their properties and structure vary over time. These are seen
in social, telecom, transit and epidemic networks. However, current graph
databases and query engines have limited support for temporal relations among
graph entities, no support for time-varying entities and/or do not scale on
distributed resources. We address this gap by extending a linear path query
model over property graphs to include intuitive temporal predicates and
aggregation operators over temporal graphs. We design a distributed execution
model for these temporal path queries using the interval-centric computing
model, and develop a novel cost model to select an efficient execution plan
from several. We perform detailed experiments of our Granite distributed query
engine using both static and dynamic temporal property graphs as large as 52M
vertices, 218M edges and 325M properties, and a 1600-query workload, derived
from the LDBC benchmark. We often offer sub-second query latencies on a
commodity cluster, which is 149x-1140x faster than the industry-leading
Neo4J shared-memory graph database and the JanusGraph / Spark distributed graph
query engine. Granite also completes 100% of the queries for all graphs,
compared to only 32-92% workload completion by the baseline systems. Further,
our cost model selects a query plan that is within 10% of the optimal execution
time in 90% of the cases. Despite the irregular nature of graph processing, we
exhibit a weak-scaling efficiency >= 60% on 8 nodes and >= 40% on 16 nodes, for
most query workloads.
Comment: An extended version of the paper that appears in IEEE/ACM
International Symposium on Cluster, Cloud and Internet Computing (CCGrid),
202
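To make the notion of a path query with temporal predicates concrete, here is a minimal single-machine sketch. It assumes edges that carry a validity interval and a linear path query that requires consecutive edges to overlap in time. The names `TemporalEdge`, `overlaps` and `temporal_paths` are hypothetical, and the code does not reproduce Granite's query model, its distributed execution or its cost model.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TemporalEdge:
    src: str
    dst: str
    label: str
    start: int  # first time point at which the edge exists (inclusive)
    end: int    # time point at which the edge stops existing (exclusive)


def overlaps(a, b):
    """True if the validity intervals of two edges share at least one time point."""
    return a.start < b.end and b.start < a.end


def temporal_paths(edges, labels):
    """Vertex sequences whose edges follow `labels` (non-empty) in order,
    with each pair of consecutive edges overlapping in time."""
    paths = [[e] for e in edges if e.label == labels[0]]
    for label in labels[1:]:
        paths = [
            p + [e]
            for p in paths
            for e in edges
            if e.label == label and e.src == p[-1].dst and overlaps(p[-1], e)
        ]
    return [[p[0].src] + [e.dst for e in p] for p in paths]


# Made-up social graph with time-varying "follows" edges.
edges = [
    TemporalEdge("alice", "bob", "follows", 0, 10),
    TemporalEdge("bob", "carol", "follows", 5, 15),
    TemporalEdge("bob", "dave", "follows", 12, 20),
]

result = temporal_paths(edges, ["follows", "follows"])
```

Here the path through `dave` is rejected because that edge only appears after the `alice` to `bob` edge has ended, which a query over a static snapshot would not detect.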
AsterixDB: A Scalable, Open Source BDMS
AsterixDB is a new, full-function BDMS (Big Data Management System) with a
feature set that distinguishes it from other platforms in today's open source
Big Data ecosystem. Its features make it well-suited to applications like web
data warehousing, social data storage and analysis, and other use cases related
to Big Data. AsterixDB has a flexible NoSQL style data model; a query language
that supports a wide range of queries; a scalable runtime; partitioned,
LSM-based data storage and indexing (including B+-tree, R-tree, and text
indexes); support for external as well as natively stored data; a rich set of
built-in types; support for fuzzy, spatial, and temporal types and queries; a
built-in notion of data feeds for ingestion of data; and transaction support
akin to that of a NoSQL store.
Development of AsterixDB began in 2009 and led to a mid-2013 initial open
source release. This paper is the first complete description of the resulting
open source AsterixDB system. Covered herein are the system's data model, its
query language, and its software architecture. Also included are a summary of
the current status of the project and a first glimpse into how AsterixDB
performs when compared to alternative technologies, including a parallel
relational DBMS, a popular NoSQL store, and a popular Hadoop-based SQL data
analytics platform, for things that both technologies can do. Also included is
a brief description of some initial trials that the system has undergone and
the lessons learned (and plans laid) based on those early "customer"
engagements.
A Framework to Support Spatial, Temporal and Thematic Analytics over Semantic Web Data
Spatial and temporal data are critical components in many applications. This is especially true in analytical applications ranging from scientific discovery to national security and criminal investigation. The analytical process often requires uncovering and analyzing complex thematic relationships between disparate people, places and events. Fundamentally new query operators based on the graph structure of Semantic Web data models, such as semantic associations, are proving useful for this purpose. However, these analysis mechanisms are primarily intended for thematic relationships. In this paper, we describe a framework built around the RDF data model for analysis of thematic, spatial and temporal relationships between named entities. We present a spatiotemporal modeling approach that uses an upper-level ontology in combination with temporal RDF graphs. A set of query operators that use graph patterns to specify a form of context are formally defined. We also describe an efficient implementation of the framework in Oracle DBMS and demonstrate the scalability of our approach with a performance study using both synthetic and real-world RDF datasets of over 25 million triples.
A three-year study on the freshness of Web search engine databases
This paper deals with one aspect of the index quality of search engines: index freshness. The purpose is to analyse the update strategies of the major Web search engines Google, Yahoo, and MSN/Live.com. We conducted a test of the updates of 40 daily updated pages and 30 irregularly updated pages, respectively. We used data from a time span of six weeks in the years 2005, 2006, and 2007. We found that the best search engine in terms of up-to-dateness changes over the years and that none of the engines has an ideal solution for index freshness. Frequency distributions for the pages’ ages are skewed, which means that search engines do differentiate between often- and seldom-updated pages. This is confirmed by the difference between the average ages of daily updated pages and our control group of pages. Indexing patterns are often irregular, and there seems to be no clear policy regarding when to revisit Web pages. A major problem identified in our research is the delay in making crawled pages available for searching, which differs from one engine to another.