Scaling Similarity Joins over Tree-Structured Data
Given a large collection of tree-structured objects (e.g., XML documents), the similarity join finds the pairs of objects that are similar to each other, based on a similarity threshold and a tree edit distance measure. The state-of-the-art similarity join methods compare simpler approximations of the objects (e.g., strings), in order to prune pairs that cannot be part of the similarity join result based on distance bounds derived from the approximations. In this paper, we propose a novel similarity join approach, which is based on the dynamic decomposition of the tree objects into subgraphs, according to the similarity threshold. Our technique avoids computing the exact distance between two tree objects if the objects do not share at least one common subgraph. In order to scale up the join, the computed subgraphs are managed in a two-layer index. Our experimental results on real and synthetic data collections show that our approach outperforms the state-of-the-art methods by up to an order of magnitude.
Content-Aware DataGuides for Indexing Large Collections of XML Documents
XML is well-suited for modelling structured data with
textual content. However, most indexing approaches perform
structure and content matching independently, combining
the retrieved path and keyword occurrences in a third
step. This paper shows that retrieval in XML documents can
be accelerated significantly by processing text and structure
simultaneously during all retrieval phases. To this end,
the Content-Aware DataGuide (CADG) enhances the well-known
DataGuide with (1) simultaneous keyword and path
matching and (2) a precomputed content/structure join. Extensive
experiments prove the CADG to be 50-90% faster
than the DataGuide for various sorts of query and document,
including difficult cases such as poorly structured
queries and recursive document paths. A new query classification
scheme identifies precise query characteristics with
a predominant influence on the performance of the individual
indices. The experiments show that the CADG is applicable
to many real-world applications, in particular large
collections of heterogeneously structured XML documents.
An Event based Prediction Suffix Tree
This article introduces the Event based Prediction Suffix Tree (EPST), a
biologically inspired, event-based prediction algorithm. The EPST learns a
model online based on the statistics of an event based input and can make
predictions over multiple overlapping patterns. The EPST uses a representation
specific to event based data, defined as a portion of the power set of event
subsequences within a short context window. It is explainable, and possesses
many promising properties such as fault tolerance, resistance to event noise,
as well as the capability for one-shot learning. The computational features of
the EPST are examined in a synthetic data prediction task with additive event
noise, event jitter, and dropout. The resulting algorithm outputs predicted
projections for the near term future of the signal, which may be applied to
tasks such as event based anomaly detection or pattern recognition.
Analysis and conception of tuple spaces in the eye of scalability
Applications in the emerging fields of eCommerce and
Ubiquitous Computing are composed of heterogeneous systems that
have been designed separately.
Hence, these systems are loosely coupled and require a coordination
mechanism that is able to bridge spatial and temporal remoteness.
The use of tuple spaces for data-driven coordination of these
systems has been proposed in the past. In addition, applications
of eCommerce and Ubiquitous Computing are not bound to a
predefined size, so that the underlying coordination
mechanism has to be highly scalable. However, it seems to be
difficult to conceive a scalable tuple space.
This report is an English version of the author's diploma
thesis. It comprises chapters two, three, four, and five.
Consequently, the design and the implementation of the proposed
tuple space are not part of this report.
- …