RDF Querying
Reactive Web systems, Web services, and Web-based publish/
subscribe systems communicate events as XML messages, and in
many cases require composite event detection: it is not sufficient to react
to single event messages, but events have to be considered in relation to
other events that are received over time.
Emphasizing language design and formal semantics, we describe the
rule-based query language XChangeEQ for detecting composite events.
XChangeEQ is designed to completely cover and integrate the four complementary
querying dimensions: event data, event composition, temporal
relationships, and event accumulation. Semantics are provided as
model and fixpoint theories; while this is an established approach for rule
languages, it has not been applied to event queries before.
An XML Query Engine for Network-Bound Data
XML has become the lingua franca for data exchange and integration across administrative and enterprise boundaries. Nearly all data providers are adding XML import or export capabilities, and standard XML Schemas and DTDs are being promoted for all types of data sharing. The ubiquity of XML has removed one of the major obstacles to integrating data from widely disparate sources -- namely, the heterogeneity of data formats.
However, general-purpose integration of data across the wide area also requires a query processor that can query data sources on demand, receive streamed XML data from them, and combine and restructure the data into new XML output -- while providing good performance for both batch-oriented and ad-hoc, interactive queries. This is the goal of the Tukwila data integration system, the first system that focuses on network-bound, dynamic XML data sources. In contrast to previous approaches, which must read, parse, and often store entire XML objects before querying them, Tukwila can return query results even as the data is streaming into the system. Tukwila is built with a new system architecture that extends adaptive query processing and relational-engine techniques into the XML realm, as facilitated by a pair of operators that incrementally evaluate a query's input path expressions as data is read. In this paper, we describe the Tukwila architecture and its novel aspects, and we experimentally demonstrate that Tukwila provides better overall query performance and faster initial answers than existing systems, and has excellent scalability.
Continuously Providing Approximate Results under Limited Resources: Load Shedding and Spilling in XML Streams
Because of the high volume and unpredictable arrival rates, stream processing systems may not always be able to keep up with the input data streams, resulting in buffer overflow and uncontrolled loss of data. To continuously supply online results, two alternative solutions to the unpredictable failures of such overloaded systems can be identified. One technique, called load shedding, drops some fraction of the data from the input stream to reduce the memory and CPU requirements of the workload. However, dropping portions of the input data reduces the accuracy of the output, since some data is lost. To produce eventually complete results, the second technique, called data spilling, temporarily pushes some fraction of the data to persistent storage when the processing speed cannot keep up with the arrival rate. The processing of the disk-resident data is then postponed until a later time when system resources become available. This dissertation explores these load reduction technologies in the context of XML stream systems.
Load shedding in the specific context of XML streams poses several unique opportunities and challenges. Since XML data is hierarchical, subelements extracted from different positions of the XML tree structure may vary in their importance. Further, dropping different subelements may yield different savings in storage and computation. Hence, unlike prior work in the literature that drops data completely or not at all, in this dissertation we introduce the notion of structure-oriented load shedding, in which some XML subelements are selectively shed from the possibly complex XML objects in the XML stream. First, we develop a preference model that enables users to specify the relative importance of preserving different subelements within the XML result structure. This transforms shedding into the problem of rewriting the user query into shed queries that return approximate answers, with their utility measured by the user preference model. Our optimizer finds the appropriate shed queries to maximize the output utility, driven by our structure-based preference model, under the limitation of available computation resources. The experimental results demonstrate that our proposed XML-specific shedding solution consistently achieves higher utility results compared to the existing relational shedding techniques.
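As a rough illustration of the idea of choosing which subelements to preserve under a resource limit, the following minimal sketch ranks subelements by utility per unit cost and keeps as many as fit in a budget. It is not the dissertation's optimizer: the greedy rule, the `Subelement` record, the example paths, and all utility and cost numbers are assumptions made for illustration only.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Subelement:
    path: str       # hypothetical XML path inside each stream object
    utility: float  # user-assigned preference for keeping this path
    cost: float     # assumed processing cost per object for this path


def choose_paths_to_keep(subelements, budget):
    """Greedily keep the paths with the best utility per unit cost.

    This is a simple heuristic stand-in, not an optimal solver.
    """
    kept, spent = [], 0.0
    ranked = sorted(subelements, key=lambda s: s.utility / s.cost, reverse=True)
    for s in ranked:
        if spent + s.cost <= budget:
            kept.append(s.path)
            spent += s.cost
    return kept


# Hypothetical preference model for an order stream.
candidates = [
    Subelement("/order/id", utility=5.0, cost=1.0),
    Subelement("/order/items", utility=6.0, cost=4.0),
    Subelement("/order/notes", utility=1.0, cost=2.0),
]

kept_full = choose_paths_to_keep(candidates, budget=5.0)
kept_tight = choose_paths_to_keep(candidates, budget=3.0)
print(kept_full)
print(kept_tight)
```

With the larger budget the sketch keeps the identifier and the item list and sheds the notes; with the tighter budget the expensive item list is shed instead and the cheaper notes are kept.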
Second, we introduce structure-based spilling, a spilling technique customized for XML streams that considers the spilling of partial substructures of possibly complex XML elements. Several new challenges caused by structure-based spilling are addressed. When a path is spilled, multiple other paths may be affected. We categorize the types of side effects that spilling has on the query. We also study how to execute the reduced query to produce the correct runtime output. Three optimization strategies are developed to select the reduced query that maximizes the output quality. We also examine the clean-up stage to guarantee that an entire result set is eventually generated by producing supplementary results to complement the partial results output earlier. The experimental study demonstrates that our proposed solutions consistently achieve higher quality results compared to the state-of-the-art techniques.
Third, we design an integrated framework that combines both shedding and spilling policies into one comprehensive methodology. Decisions on whether to shed or spill data may be affected by the application needs and data arrival patterns. For some input data, it may be worth flushing it to disk if a delayed output of its result will be important, while other data would best be dropped directly from the system, given that a delayed delivery of these results would no longer be meaningful to the application. Therefore, we need sophisticated technologies capable of deploying both shedding and spilling techniques within one integrated strategy, with the ability to deliver the most appropriate decision for each specific circumstance. We propose a novel flexible framework for structure-based shed and spill approaches, applicable in any XML stream system. We propose a solution space that represents all the shed and spill candidates. An age-based quality model is proposed for evaluating the output quality for different reduced query and supplementary query pairs. We also propose a family of four optimization strategies: OptF, OptSmart, HiX and Fex. OptF and OptSmart are both guaranteed to identify an optimal solution of reduced and supplementary query pair, with OptSmart exhibiting significantly less overhead than OptF. HiX and Fex use heuristic-based approaches that are much more efficient than OptF and OptSmart.
AsterixDB: A Scalable, Open Source BDMS
AsterixDB is a new, full-function BDMS (Big Data Management System) with a
feature set that distinguishes it from other platforms in today's open source
Big Data ecosystem. Its features make it well-suited to applications like web
data warehousing, social data storage and analysis, and other use cases related
to Big Data. AsterixDB has a flexible NoSQL style data model; a query language
that supports a wide range of queries; a scalable runtime; partitioned,
LSM-based data storage and indexing (including B+-tree, R-tree, and text
indexes); support for external as well as natively stored data; a rich set of
built-in types; support for fuzzy, spatial, and temporal types and queries; a
built-in notion of data feeds for ingestion of data; and transaction support
akin to that of a NoSQL store.
Development of AsterixDB began in 2009 and led to a mid-2013 initial open
source release. This paper is the first complete description of the resulting
open source AsterixDB system. Covered herein are the system's data model, its
query language, and its software architecture. Also included are a summary of
the current status of the project and a first glimpse into how AsterixDB
performs when compared to alternative technologies, including a parallel
relational DBMS, a popular NoSQL store, and a popular Hadoop-based SQL data
analytics platform, for things that both technologies can do. Also included is
a brief description of some initial trials that the system has undergone and
the lessons learned (and plans laid) based on those early "customer"
engagements.
Towards a query language for annotation graphs
The multidimensional, heterogeneous, and temporal nature of speech databases
raises interesting challenges for representation and query. Recently,
annotation graphs have been proposed as a general-purpose representational
framework for speech databases. Typical queries on annotation graphs require
path expressions similar to those used in semistructured query languages.
However, the underlying model is rather different from the customary graph
models for semistructured data: the graph is acyclic and unrooted, and both
temporal and inclusion relationships are important. We develop a query language
and describe optimization techniques for an underlying relational
representation. Comment: 8 pages, 10 figures
EBAY QUERY LINGUISTIC SERVICE
In this project we designed, tested and implemented a query service for expanding and normalizing titles and searches. With regard to finding synonyms in titles, phrase recognition algorithms were developed as well. An additional service was created to find appropriate categories for a query to assist the search for synonyms. An application was then built on top of these two services to help users expand their searches and build advanced queries with no additional knowledge.
An Expressive Language and Efficient Execution System for Software Agents
Software agents can be used to automate many of the tedious, time-consuming
information processing tasks that humans currently have to complete manually.
However, to do so, agent plans must be capable of representing the myriad of
actions and control flows required to perform those tasks. In addition, since
these tasks can require integrating multiple sources of remote information --
typically, a slow, I/O-bound process -- it is desirable to make execution as
efficient as possible. To address both of these needs, we present a flexible
software agent plan language and a highly parallel execution system that enable
the efficient execution of expressive agent plans. The plan language allows
complex tasks to be more easily expressed by providing a variety of operators
for flexibly processing the data as well as supporting subplans (for
modularity) and recursion (for indeterminate looping). The executor is based on
a streaming dataflow model of execution to maximize the amount of operator and
data parallelism possible at runtime. We have implemented both the language and
executor in a system called THESEUS. Our results from testing THESEUS show that
streaming dataflow execution can yield significant speedups over both
traditional serial (von Neumann) as well as non-streaming dataflow-style
execution that existing software and robot agent execution systems currently
support. In addition, we show how plans written in the language we present can
represent certain types of subtasks that cannot be accomplished using the
languages supported by network query engines. Finally, we demonstrate that the
increased expressivity of our plan language does not hamper performance;
specifically, we show how data can be integrated from multiple remote sources
just as efficiently using our architecture as is possible with a
state-of-the-art streaming-dataflow network query engine.
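The streaming aspect of such an executor can be sketched with Python generators: each operator hands a record downstream as soon as it is ready, so a first answer appears before the source has been fully read. This is only an illustration under stated assumptions and not the THESEUS implementation; the operator names `source`, `select` and `project` and the sample records are hypothetical, and the sketch is single-threaded, so it shows pipelining but not the operator and data parallelism the abstract describes.

```python
def source(records, log):
    """Emit records one at a time, logging each read to expose the ordering."""
    for record in records:
        log.append(("read", record["id"]))
        yield record


def select(stream, predicate):
    """Pass through only the records that satisfy the predicate."""
    for record in stream:
        if predicate(record):
            yield record


def project(stream, fields):
    """Keep only the named fields of each record."""
    for record in stream:
        yield {field: record[field] for field in fields}


# Hypothetical input standing in for a slow remote source.
records = [
    {"id": 1, "price": 5},
    {"id": 2, "price": 20},
    {"id": 3, "price": 30},
]

log = []
pipeline = project(select(source(records, log), lambda r: r["price"] > 10), ["id"])

first = next(pipeline)        # first answer, produced before record 3 is read
reads_so_far = list(log)
remaining = list(pipeline)    # drain the rest of the stream
print(first, reads_so_far, remaining)
```

The log shows that only the first two records had been read when the first result was produced, which is the behavior a non-streaming executor, which materializes each operator's full output first, would not exhibit.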
Source File Set Search for Clone-and-Own Reuse Analysis
The clone-and-own approach is a natural way of source code reuse for software
developers. To assess how known bugs and security vulnerabilities of a cloned
component affect an application, developers and security analysts need to
identify an original version of the component and understand how the cloned
component is different from the original one. Although developers may record
the original version information in a version control system and/or directory
names, such information is often either unavailable or incomplete. In this
research, we propose a code search method that takes as input a set of source
files and extracts all the components including similar files from a software
ecosystem (i.e., a collection of existing versions of software packages). Our
method employs an efficient file similarity computation using b-bit minwise
hashing technique. We use an aggregated file similarity for ranking components.
To evaluate the effectiveness of this tool, we analyzed 75 cloned components in
Firefox and Android source code. The tool took about two hours to report the
original components from 10 million files in Debian GNU/Linux packages. Recall
of the top-five components in the extracted lists is 0.907, while recall of a
baseline using SHA-1 file hash is 0.773, according to the ground truth recorded
in the source code repositories. Comment: 14th International Conference on Mining Software Repositories
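To make the b-bit minwise hashing step more concrete, here is a minimal sketch of how two files might be compared. It is not the authors' tool: the tokenization into sets of lines, the use of salted BLAKE2b as the hash family, the parameters `num_hashes=256` and `b=2`, and the simplified chance-collision correction of 1/2^b (reasonable only when the sets are small relative to the hash range) are all assumptions.

```python
import hashlib


def _hash(token: str, seed: int) -> int:
    """One member of an assumed hash family: BLAKE2b salted with the seed."""
    salt = seed.to_bytes(8, "little")
    digest = hashlib.blake2b(token.encode("utf-8"), digest_size=8, salt=salt).digest()
    return int.from_bytes(digest, "big")


def bbit_signature(tokens, num_hashes=256, b=2):
    """Keep only the lowest b bits of each minimum hash value.

    `tokens` must be a non-empty set (it is iterated once per hash function).
    """
    mask = (1 << b) - 1
    return [min(_hash(t, seed) for t in tokens) & mask for seed in range(num_hashes)]


def estimate_jaccard(sig_a, sig_b, b=2):
    """Estimate Jaccard similarity, correcting for chance b-bit collisions."""
    matches = sum(x == y for x, y in zip(sig_a, sig_b))
    match_rate = matches / len(sig_a)
    chance = 1.0 / (1 << b)  # simplified collision probability for unrelated minima
    return max(0.0, (match_rate - chance) / (1.0 - chance))


# Hypothetical files represented as sets of lines.
original = {f"line {i}" for i in range(100)}
clone = {f"line {i}" for i in range(90)} | {f"patched {i}" for i in range(10)}
unrelated = {f"other {i}" for i in range(100)}

sig_original = bbit_signature(original)
sim_same = estimate_jaccard(sig_original, bbit_signature(original))
sim_clone = estimate_jaccard(sig_original, bbit_signature(clone))
sim_unrelated = estimate_jaccard(sig_original, bbit_signature(unrelated))
print(sim_same, sim_clone, sim_unrelated)
```

Storing only b bits per hash keeps each signature small, which is what makes comparing a query file against millions of candidate files affordable; the price is a noisier estimate, which more hash functions can offset.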