177 research outputs found
An introduction to Graph Data Management
A graph database is a database where the data structures for the schema
and/or instances are modeled as a (labeled)(directed) graph or generalizations
of it, and where querying is expressed by graph-oriented operations and type
constructors. In this article we present the basic notions of graph databases,
give an historical overview of its main development, and study the main current
systems that implement them
Reasoning & Querying – State of the Art
Various query languages for Web and Semantic Web data, both for practical use and as an area of research in the scientific community, have emerged in recent years. At the same time, the broad adoption of the internet where keyword search is used in many applications, e.g. search engines, has familiarized casual users with using keyword queries to retrieve information on the internet. Unlike this easy-to-use querying, traditional query languages require knowledge of the language itself as well as of the data to be queried. Keyword-based query languages for XML and RDF bridge the gap between the two, aiming at enabling simple querying of semi-structured data, which is relevant e.g. in the context of the emerging Semantic Web. This article presents an overview of the field of keyword querying for XML and RDF
Four Lessons in Versatility or How Query Languages Adapt to the Web
Exposing not only human-centered information, but machine-processable data on the Web is one of the commonalities of recent Web trends. It has enabled a new kind of applications and businesses where the data is used in ways not foreseen by the data providers. Yet this exposition has fractured the Web into islands of data, each in different Web formats: Some providers choose XML, others RDF, again others JSON or OWL, for their data, even in similar domains. This fracturing stifles innovation as application builders have to cope not only with one Web stack (e.g., XML technology) but with several ones, each of considerable complexity. With Xcerpt we have developed a rule- and pattern based query language that aims to give shield application builders from much of this complexity: In a single query language XML and RDF data can be accessed, processed, combined, and re-published. Though the need for combined access to XML and RDF data has been recognized in previous work (including the W3C’s GRDDL), our approach differs in four main aspects: (1) We provide a single language (rather than two separate or embedded languages), thus minimizing the conceptual overhead of dealing with disparate data formats. (2) Both the declarative (logic-based) and the operational semantics are unified in that they apply for querying XML and RDF in the same way. (3) We show that the resulting query language can be implemented reusing traditional database technology, if desirable. Nevertheless, we also give a unified evaluation approach based on interval labelings of graphs that is at least as fast as existing approaches for tree-shaped XML data, yet provides linear time and space querying also for many RDF graphs. We believe that Web query languages are the right tool for declarative data access in Web applications and that Xcerpt is a significant step towards a more convenient, yet highly efficient data access in a “Web of Data”
Mining Rooted Ordered Trees under Subtree Homeomorphism
Mining frequent tree patterns has many applications in different areas such
as XML data, bioinformatics and World Wide Web. The crucial step in frequent
pattern mining is frequency counting, which involves a matching operator to
find occurrences (instances) of a tree pattern in a given collection of trees.
A widely used matching operator for tree-structured data is subtree
homeomorphism, where an edge in the tree pattern is mapped onto an
ancestor-descendant relationship in the given tree. Tree patterns that are
frequent under subtree homeomorphism are usually called embedded patterns. In
this paper, we present an efficient algorithm for subtree homeomorphism with
application to frequent pattern mining. We propose a compact data-structure,
called occ, which stores only information about the rightmost paths of
occurrences and hence can encode and represent several occurrences of a tree
pattern. We then define efficient join operations on the occ data-structure,
which help us count occurrences of tree patterns according to occurrences of
their proper subtrees. Based on the proposed subtree homeomorphism method, we
develop an effective pattern mining algorithm, called TPMiner. We evaluate the
efficiency of TPMiner on several real-world and synthetic datasets. Our
extensive experiments confirm that TPMiner always outperforms well-known
existing algorithms, and in several cases the improvement with respect to
existing algorithms is significant.Comment: This paper is accepted in the Data Mining and Knowledge Discovery
journal
(http://www.springer.com/computer/database+management+%26+information+retrieval/journal/10618
Parameterized XPath Views
We present a new approach for accelerating the execution of XPath expressions using parameterized materialized XPath views (PXV). While the approach is generic we show how it can be utilized in an XML extension for relational database systems. Furthermore we discuss an algorithm for automatically determining the best PXV candidates to materialize based on a given workload. We evaluate our approach and show the superiority of our cost based algorithm for determining PXV candidates over frequent pattern based algorithms
Regular Rooted Graph Grammars
In dieser Arbeit wir ein pragmatischer Ansatz zur Typisierung, statischen Analyse und Optimierung von Web-Anfragespachen, speziell Xcerpt, untersucht. Pragmatisch ist der Ansatz in dem Sinne, dass dem Benutzer keinerlei Einschränkungen aus Entscheidbarkeits- oder Effizienzgründen auf modellierbare Typen gestellt werden. Effizienz und Entscheidbarkeit werden stattdessen, falls nötig, durch Vergröberungen bei der Typprüfung erkauft.
Eine Typsprache zur Typisierung von Graph-strukturierten Daten im Web wird eingeführt. Modellierbare Graphen sind so genannte gewurzelte Graphen, welche aus einem Spannbaum und Querreferenzen aufgebaut sind. Die Typsprache basiert auf
reguläre Baum Grammatiken, welche um typisierte Referenzen erweitert wurde. Neben wie im Web mit XML üblichen geordneten strukturierten Daten, sind auch ungeordnete Daten, wie etwa in Xcerpt oder RDF üblich, modellierbar. Der dazu verwendete Ansatz---ungeordnete Interpretation Regulärer Ausdrücke---ist neu. Eine operationale Semantik für geordnete wie ungeordnete Typen wird auf Basis spezialisierter Baumautomaten und sog. Counting Constraints (welche wiederum auf presburgerarithmetische Ausdrücke) basieren. Es wird ferner statische Typ-Prüfung und -Inferenz von Xcerpt Anfrage- und Konstrukttermen, wie auch Optimierung von Xcerpt Anfragen auf Basis von Typinformation eingeführt.This thesis investigates a pragmatic approach to typing, static analysis and static
optimization of Web query languages, in special the Web query language Xcerpt. The
approach is pragmatic in the sense, that no restriction on the types are made for
decidability or efficiency reasons, instead precision is given up if necessary.
Pragmatics on the dynamic side means to use types not only to ensure validity of objects
operating on, but also influencing query selection based on types.
A typing language for typing of graph structured data on the Web is introduced.
The Graphs in mind are based on spanning trees with references, the typing languages
is based on regular tree grammars with typed reference extensions. Beside ordered data
in the spirit of XML, unordered data (i.e. in the spirit of the Xcerpt data model or
RDF) can be modelled using regular expressions under unordered interpretation – this
approach is new. An operational semantics for ordered and unordered types is given
based on specialized regular tree automata and counting constraints (them again based
on Presburger arithmetic formulae). Static type checking of Xcerpt query and construct
terms is introduced, as well as optimization of Xcerpt query terms based on schema
information
Detecting Irrelevant subtrees to improve probabilistic learning from tree-structured data
International audienceIn front of the large increase of the available amount of structured data (such as XML documents), many algorithms have emerged for dealing with tree-structured data. In this article, we present a probabilistic approach which aims at a posteriori pruning noisy or irrelevant subtrees in a set of trees. The originality of this approach, in comparison with classic data reduction techniques, comes from the fact that only a part of a tree (i.e. a subtree) can be deleted, rather than the whole tree itself. Our method is based on the use of confidence intervals, on a partition of subtrees, computed according to a given probability distribution. We propose an original approach to assess these intervals on tree-structured data and we experimentally show its interest in the presence of noise
Querying Large Collections of Semistructured Data
An increasing amount of data is published as semistructured documents formatted with presentational markup. Examples include data objects such as mathematical expressions encoded with MathML or web pages encoded with XHTML. Our intention is to improve the state of the art in retrieving, manipulating, or mining such data.
We focus first on mathematics retrieval, which is appealing in various domains, such as education, digital libraries, engineering, patent documents, and medical sciences. Capturing the similarity of mathematical expressions also greatly enhances document classification in such domains. Unlike text retrieval, where keywords carry enough semantics to distinguish text documents and rank them, math symbols do not contain much semantic information on their own. Unfortunately, considering the structure of mathematical expressions to calculate relevance scores of documents results in ranking algorithms that are computationally more expensive than the typical ranking algorithms employed for text documents. As a result, current math retrieval systems either limit themselves to exact matches, or they ignore the structure completely; they sacrifice either recall or precision for efficiency.
We propose instead an efficient end-to-end math retrieval system based on a structural similarity ranking algorithm. We describe novel optimization techniques to reduce the index size and the query processing time. Thus, with the proposed optimizations, mathematical contents can be fully exploited to rank documents in response to mathematical queries. We demonstrate the effectiveness and the efficiency of our solution experimentally, using a special-purpose testbed that we developed for evaluating math retrieval systems. We finally extend our retrieval system to accommodate rich queries that consist of combinations of math expressions and textual keywords.
As a second focal point, we address the problem of recognizing structural repetitions in typical web documents. Most web pages use presentational markup standards, in which the tags control the formatting of documents rather than semantically describing their contents. Hence, their structures typically contain more irregularities than descriptive (data-oriented) markup languages. Even though applications would greatly benefit from a grammar inference algorithm that captures structure to make it explicit, the existing algorithms for XML schema inference, which target data-oriented markup, are ineffective in inferring grammars for web documents with presentational markup.
There is currently no general-purpose grammar inference framework that can handle irregularities commonly found in web documents and that can operate with only a few examples. Although inferring grammars for individual web pages has been partially addressed by data extraction tools, the existing solutions rely on simplifying assumptions that limit their application. Hence, we describe a principled approach to the problem by defining a class of grammars that can be inferred from very small sample sets and can capture the structure of most web documents. The effectiveness of this approach, together with a comparison against various classes of grammars including DTDs and XSDs, is demonstrated through extensive experiments on web documents. We finally use the proposed grammar inference framework to extend our math retrieval system and to optimize it further
Web and Semantic Web Query Languages
A number of techniques have been developed to facilitate
powerful data retrieval on the Web and Semantic Web. Three categories
of Web query languages can be distinguished, according to the format
of the data they can retrieve: XML, RDF and Topic Maps. This article
introduces the spectrum of languages falling into these categories
and summarises their salient aspects. The languages are introduced using
common sample data and query types. Key aspects of the query
languages considered are stressed in a conclusion
- …