6,108 research outputs found
XML Schema Clustering with Semantic and Hierarchical Similarity Measures
With the growing popularity of XML as the data representation language, collections of the XML data are exploded in numbers. The methods are required to manage and discover the useful information from them for the improved document handling. We present a schema clustering process by organising the heterogeneous XML schemas into various groups. The methodology considers not only the linguistic and the context of the elements but also the hierarchical structural similarity. We support our findings with experiments and analysis
XML Matchers: approaches and challenges
Schema Matching, i.e. the process of discovering semantic correspondences
between concepts adopted in different data source schemas, has been a key topic
in Database and Artificial Intelligence research areas for many years. In the
past, it was largely investigated especially for classical database models
(e.g., E/R schemas, relational databases, etc.). However, in the latest years,
the widespread adoption of XML in the most disparate application fields pushed
a growing number of researchers to design XML-specific Schema Matching
approaches, called XML Matchers, aiming at finding semantic matchings between
concepts defined in DTDs and XSDs. XML Matchers do not just take well-known
techniques originally designed for other data models and apply them on
DTDs/XSDs, but they exploit specific XML features (e.g., the hierarchical
structure of a DTD/XSD) to improve the performance of the Schema Matching
process. The design of XML Matchers is currently a well-established research
area. The main goal of this paper is to provide a detailed description and
classification of XML Matchers. We first describe to what extent the
specificities of DTDs/XSDs impact on the Schema Matching task. Then we
introduce a template, called XML Matcher Template, that describes the main
components of an XML Matcher, their role and behavior. We illustrate how each
of these components has been implemented in some popular XML Matchers. We
consider our XML Matcher Template as the baseline for objectively comparing
approaches that, at first glance, might appear as unrelated. The introduction
of this template can be useful in the design of future XML Matchers. Finally,
we analyze commercial tools implementing XML Matchers and introduce two
challenging issues strictly related to this topic, namely XML source clustering
and uncertainty management in XML Matchers.Comment: 34 pages, 8 tables, 7 figure
A Progressive Clustering Algorithm to Group the XML Data by Structural and Semantic Similarity
Since the emergence in the popularity of XML for data representation and exchange over the Web, the distribution of XML documents has rapidly increased. It has become a challenge for researchers to turn these documents into a more useful information utility. In this paper, we introduce a novel clustering algorithm PCXSS that keeps the heterogeneous XML documents into various groups according to their similar structural and semantic representations. We develop a global criterion function CPSim that progressively measures the similarity between a XML document and existing clusters, ignoring the need to compute the similarity between two individual documents. The experimental analysis shows the method to be fast and accurate
Using Element Clustering to Increase the Efficiency of XML Schema Matching
Schema matching attempts to discover semantic mappings between elements of two schemas. Elements are cross compared using various heuristics (e.g., name, data-type, and structure similarity). Seen from a broader perspective, the schema matching problem is a combinatorial problem with an exponential complexity. This makes the naive matching algorithms for large schemas prohibitively inefficient. In this paper we propose a clustering based technique for improving the efficiency of large scale schema matching. The technique inserts clustering as an intermediate step into existing schema matching algorithms. Clustering partitions schemas and reduces the overall matching load, and creates a possibility to trade between the efficiency and effectiveness. The technique can be used in addition to other optimization techniques. In the paper we describe the technique, validate the performance of one implementation of the technique, and open directions for future research
Duplicate Detection in Probabilistic Data
Collected data often contains uncertainties. Probabilistic databases have been proposed to manage uncertain data. To combine data from multiple autonomous probabilistic databases, an integration of probabilistic data has to be performed. Until now, however, data integration approaches have focused on the integration of certain source data (relational or XML). There is no work on the integration of uncertain (esp. probabilistic) source data so far. In this paper, we present a first step towards a concise consolidation of probabilistic data. We focus on duplicate detection as a representative and essential step in an integration process. We present techniques for identifying multiple probabilistic representations of the same real-world entities. Furthermore, for increasing the efficiency of the duplicate detection process we introduce search space reduction methods adapted to probabilistic data
An introduction to Graph Data Management
A graph database is a database where the data structures for the schema
and/or instances are modeled as a (labeled)(directed) graph or generalizations
of it, and where querying is expressed by graph-oriented operations and type
constructors. In this article we present the basic notions of graph databases,
give an historical overview of its main development, and study the main current
systems that implement them
- …