405 research outputs found

XML Matchers: approaches and challenges

Author: Agreste Santa
De Meo Pasquale
Ferrara Emilio
Ursino Domenico
Publication venue: 'Elsevier BV'
Publication date: 10/07/2014

Schema Matching, i.e., the process of discovering semantic correspondences between concepts adopted in different data source schemas, has been a key topic in Database and Artificial Intelligence research for many years. In the past, it was investigated mainly for classical database models (e.g., E/R schemas, relational databases, etc.). In recent years, however, the widespread adoption of XML across disparate application fields has pushed a growing number of researchers to design XML-specific Schema Matching approaches, called XML Matchers, which aim to find semantic matchings between concepts defined in DTDs and XSDs. XML Matchers do not simply take well-known techniques originally designed for other data models and apply them to DTDs/XSDs; they exploit specific XML features (e.g., the hierarchical structure of a DTD/XSD) to improve the performance of the Schema Matching process. The design of XML Matchers is currently a well-established research area. The main goal of this paper is to provide a detailed description and classification of XML Matchers. We first describe to what extent the specificities of DTDs/XSDs affect the Schema Matching task. Then we introduce a template, called XML Matcher Template, that describes the main components of an XML Matcher, their role and behavior. We illustrate how each of these components has been implemented in some popular XML Matchers. We consider our XML Matcher Template as the baseline for objectively comparing approaches that, at first glance, might appear unrelated. The introduction of this template can be useful in the design of future XML Matchers. Finally, we analyze commercial tools implementing XML Matchers and introduce two challenging issues strictly related to this topic, namely XML source clustering and uncertainty management in XML Matchers.
Comment: 34 pages, 8 tables, 7 figures

arXiv.org e-Print Archive

IRIS Università Politecnica delle Marche

A graphical environment for change detection in structured documents

Author: Patel Girish A.
Publication venue: Digital Commons @ NJIT
Publication date: 31/05/1997

Change detection in structured documents (e.g., SGML) is important in data warehousing, digital libraries and Internet databases. This thesis presents a graphical environment for detecting changes in structured documents. We represent each document by an ordered labeled tree based on the underlying markup language. We then compare two documents by invoking previously developed algorithms for approximate pattern matching and pattern discovery in trees. Several operators are developed to support the comparison of the documents; graphical devices are provided to facilitate the use of the operators. We believe the proposed tool is useful not only for document management, but also for software maintenance, particularly configuration management and version control, where programs are represented as parse trees and detecting changes in the trees provides a way to find the syntactic differences between two program versions.
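As a rough illustration of comparing two documents as ordered labeled trees, the sketch below parses markup with Python's standard xml.etree.ElementTree and computes a simple top-down edit distance, in which children are aligned left to right and unmatched children are inserted or deleted as whole subtrees. This is a stand-in chosen for brevity, not the approximate pattern matching algorithms the thesis invokes; the names subtree_size and tree_distance, the unit costs, and the sample documents are assumptions made for the example.

```python
import xml.etree.ElementTree as ET


def subtree_size(node):
    """Number of nodes in the subtree rooted at node."""
    return 1 + sum(subtree_size(child) for child in node)


def tree_distance(a, b):
    """Top-down edit distance between two ordered labeled trees.

    Assumed unit costs: relabeling a node costs 1; inserting or
    deleting a subtree costs one unit per node it contains.
    """
    relabel = 0 if a.tag == b.tag else 1
    xs, ys = list(a), list(b)
    # dp[i][j]: cost of editing the first i children of a
    # into the first j children of b, preserving order.
    dp = [[0] * (len(ys) + 1) for _ in range(len(xs) + 1)]
    for i in range(1, len(xs) + 1):
        dp[i][0] = dp[i - 1][0] + subtree_size(xs[i - 1])
    for j in range(1, len(ys) + 1):
        dp[0][j] = dp[0][j - 1] + subtree_size(ys[j - 1])
    for i in range(1, len(xs) + 1):
        for j in range(1, len(ys) + 1):
            dp[i][j] = min(
                dp[i - 1][j] + subtree_size(xs[i - 1]),   # delete subtree
                dp[i][j - 1] + subtree_size(ys[j - 1]),   # insert subtree
                dp[i - 1][j - 1] + tree_distance(xs[i - 1], ys[j - 1]),
            )
    return relabel + dp[len(xs)][len(ys)]


# Hypothetical document versions: one paragraph moves into a new section.
old_doc = ET.fromstring("<doc><title/><sec><p/><p/></sec></doc>")
new_doc = ET.fromstring("<doc><title/><sec><p/></sec><sec><p/></sec></doc>")
same = tree_distance(old_doc, old_doc)
changed = tree_distance(old_doc, new_doc)
```

A distance of zero means the two trees have identical structure and labels; larger values indicate more structural change between versions.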

Digital Commons @ New Jersey Institute of Technology (NJIT)

A methodology for clustering XML documents by structure

Author: Abiteboul
Abiteboul
Carmel
Chawathe
Chawathe
Cobena
Cormen
Dalamagas
Direen
Flesca
Fuhr
Garcia-Molina
Garofalakis
Goldman
Gower
Halkidi
Hearst
Hubert
Jardine
Klaas-Jan Winkel
Lewie
Liu
Milligan
Myers
Nierman
Papakonstantinou
Polyzotis
Rasmussen
Sankoff
Selkow
Shanmugasundaram
Tai
Tang
Tao Cheng
Theodore Dalamagas
Timos Sellis
van Rijsbergen
Wagner
Wang
Wilson
Yang
Yoon
Zhang
Publication venue: 'Elsevier BV'

Crossref

An efficient and scalable algorithm for clustering XML documents by structure

Author: D.W.-l. Cheung
N. Mamoulis
Siu-Ming Yiu
Wang Lian
Publication venue: 'Institute of Electrical and Electronics Engineers (IEEE)'

Crossref

XML document-grammar comparison: related problems and applications

Author: A. Algergawy
A. Balmin
A. Budanitsky
A. Doan
A. Formica
A. Neumann
B. Bouchou
C. Chitic
C. Werner
C.J. Rijsbergen van
C.Y. Chan
D.C. Reis
E. Bertino
E.T. Ray
F. Giunchiglia
G. Lee
G. Salton
G.M. Landau
H. Do
J. Lee
J. Tekli
J. Tekli
K. Zhang
K. Zhang
M. Murata
P. Resnik
P. Shvaiko
R. Luz Da
R. Rada
R. Schenkel
S. Amer-Yahia
S. Axelsson
S. Nishimura
S.M. Selkow
T. Akatsu
T. Dalamagas
T. Schlieder
W. Lian
Publication venue: 'Walter de Gruyter GmbH'

Crossref

Structure and content semantic similarity detection of eXtensible markup language documents using keys

Author: Viyanon Waraporn
Publication venue: Scholars' Mine
Publication date: 01/01/2010

XML (eXtensible Mark-up Language) has become the fundamental standard for efficient data management and exchange. Due to the widespread use of XML for describing and exchanging data on the web, XML-based comparison is a central issue in database management and information retrieval. Although many heterogeneous XML sources have similar content, they may be described using different tag names and structures. This work proposes a series of algorithms for detecting structural and content changes among XML data. The first is an algorithm called XDoI (XML Data Integration Based on Content and Structure Similarity Using Keys) that clusters XML documents into subtrees using leaf-node parents as clustering points. This algorithm matches subtrees using the key concept and compares unmatched subtrees for similarities in both content and structure. The experimental results show that this approach finds much more accurate matches with or without the presence of keys in the subtrees. A second algorithm proposed here is called XDI-CSSK (a system for detecting XML similarity in content and structure using a relational database); it eliminates unnecessary clustering points using instance statistics and a taxonomic analyzer. As the number of subtrees to be compared is reduced, the overall execution time is reduced dramatically. Semantic similarity plays a crucial role in precise computational similarity measures. A third algorithm, called XML-SIM (structure and content semantic similarity detection using keys), builds on the previous work to detect XML semantic similarity based on structure and content. This algorithm is an improvement over XDI-CSSK and XDoI in that it determines content similarity based on semantic structural similarity. In an experimental evaluation, it outperformed previous approaches in terms of both execution time and false positive rates.
Information changes periodically; therefore, it is important to be able to detect changes among different versions of an XML document and use that information to identify semantic similarities. Finally, this work introduces an approach to detect XML similarity and thus to join XML document versions using a change detection mechanism. In this approach, subtree keys still play an important role in avoiding unnecessary subtree comparisons within multiple versions of the same document. Real data sets from bibliographic domains demonstrate the effectiveness of all these algorithms. --Abstract, pages iv-v
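The idea of using leaf-node parents as clustering points and matching subtrees by key can be sketched as follows. This is a minimal illustration under stated assumptions, not the XDoI algorithm itself: the function names leaf_parent_subtrees and match_by_key, the choice of a shared isbn element as the key, and the sample documents are all hypothetical, and the follow-up step of comparing unmatched subtrees by content and structure is omitted.

```python
import xml.etree.ElementTree as ET


def leaf_parent_subtrees(root):
    """Return every element that has at least one leaf child.

    These elements play the role of clustering points in this sketch.
    """
    return [
        elem for elem in root.iter()
        if any(len(child) == 0 for child in elem)
    ]


def match_by_key(doc_a, doc_b, key_tag):
    """Pair subtrees of doc_a and doc_b whose key_tag text is equal.

    Returns (matched, unmatched): matched is a list of
    (subtree_a, subtree_b) pairs, unmatched lists subtrees of doc_a
    with no partner; those would go on to a similarity comparison.
    """
    index = {}
    for sub in leaf_parent_subtrees(doc_b):
        key = sub.findtext(key_tag)
        if key is not None:
            index[key.strip()] = sub
    matched, unmatched = [], []
    for sub in leaf_parent_subtrees(doc_a):
        key = sub.findtext(key_tag)
        partner = index.get(key.strip()) if key is not None else None
        if partner is not None:
            matched.append((sub, partner))
        else:
            unmatched.append(sub)
    return matched, unmatched


# Hypothetical bibliographic sources with different tag names.
source_a = ET.fromstring(
    "<bib>"
    "<article><isbn>1</isbn><title>XML</title></article>"
    "<article><isbn>2</isbn><title>Trees</title></article>"
    "</bib>"
)
source_b = ET.fromstring(
    "<library><book><isbn>1</isbn><name>XML</name></book></library>"
)
matched, unmatched = match_by_key(source_a, source_b, "isbn")
```

Indexing one source by key first means each subtree of the other source needs only a dictionary lookup, so pairwise comparison is reserved for the subtrees left unmatched.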

Missouri University of Science and Technology (Missouri S&T): Scholars' Mine

A survey on tree matching and XML retrieval

Author: Aho
Al-Khalifa
Alilaouar
Amer-Yahia
Aouicha
Ayala
Bille
Bille
Botev
Bruno
Buneman
Burghardt
Cai
Campi
Ceri
Chamberlin
Chase
Chen
Chen
Chen
Chen
Chen
Chen
Cheng
Cole
Cole
Cyril Laitang
Dalamagas
Dalamagas
Damiani
Damiani
Dao
de Vries
Demaine
Denoyer
Dubiner
Dulucq
Dürr
Hamamache Kheddouci
Haw
Haw
Hoffmann
Hubert
Hummel
Izadi
Jansson
Jiang
Jiang
Jiang
Kamps
Karen Pinel-Sauvagnat
Kazai
Kazai
Kilpelainen
Klein
Knuth
Kosaraju
Kuboyama
Laitang
Lalmas
Lalmas
Le
Lei Ning
Levenshtein
Levy
Li
Li
Li
Lu
Lu
Mass
Mihajlovic
Mohammed Amin Tahraoui
Mohand Boughanem
Ogilvie
Pehcevski
Pehcevski
Pinel-Sauvagnat
Piwowarski
Popovici
Qin
Rao
Richter
Robie
Runapongsa
Schenkel
Schenkel
Schlieder
Shasha
Stahl
Tai
Tekli
Theobald
Trotman
Trotman
Trotman
Trotman
Trotman
van Zwol
Wagner
Wang
Wang
Wang
Wang
Wu
Yang
Yao
Zezula
Zezula
Zhang
Zhang
Zhou
Publication venue: 'Elsevier BV'
Publication date: 01/05/2013

With the increasing number of available XML documents, numerous approaches for retrieval have been proposed in the literature. They usually use the tree representation of documents and queries to process them, whether implicitly or explicitly. Although retrieving XML documents can be considered a tree matching problem between the query tree and the document trees, only a few approaches take advantage of the algorithms and methods proposed in graph theory. In this paper, we study the theoretical approaches proposed in the literature for tree matching and examine how these approaches have been adapted to XML querying and retrieval, from both an exact and an approximate matching perspective. This study allows us to highlight theoretical aspects of graph theory that have not yet been explored in XML retrieval.

Crossref

Scientific Publications of the University of Toulouse II Le Mirail

Hal - Université Grenoble Alpes

Open Archive Toulouse Archive Ouverte

Hal-Diderot

Development of a context-specific search engine, an executive information system, and a novel www ready external cost model

Author: Revankar Amit V.
Publication venue: Digital Commons @ NJIT
Publication date: 31/05/1999

NJPIES is associated with Information Ecology and Sustainability, a holistic approach to environmental data collection, compilation, integration and provision that puts people, not technology, at the center of the environmental information world. The first main goal of this project was to develop an algorithm and associated computer-based tool that could perform a lifecycle cost analysis for a model system. The application developed solved the primary problem associated with the lifecycle cost analysis of a product: it accounted for all costs of the activity, including environmental costs such as ecological costs and health costs associated with emissions. A lifecycle cost analysis attempts to identify, measure, and quantify the social costs of human activities, such as manufacturing, that are not considered by traditional accounting systems. The application quantifies, monetizes, and ranks the damage or external costs to the environment of certain types of emissions. We developed a preliminary algorithm and software and implemented it at two plants: the load assembly pack operation at Iowa Army Ammunition Plant (IAAAP) and Armtec, a manufacturer of combustible cartridge cases. The second main goal of this project is to act as a credible information clearinghouse for pollution prevention (P2) and related environmental matters, and to educate the public and keep them aware of developments in the environmental/manufacturing world. Intelligent search engines have been built to access these huge databases in human-readable format and correlate the data to various reports providing information on environmentally hazardous chemicals, releases, and facilities in different regions. The third main goal is the enhancement of EnviroDaemon with a hierarchical information search interface.
This project describes some approaches that locate information according to syntactic criteria, augmented by pragmatic aspects such as the utilization of information in a certain context. The main emphasis of this project lies in the treatment of structured knowledge, where essential aspects of the topic of interest are encoded not only by the individual items, but also by the relationships among them. Benefits of this approach are enhanced precision and approximate search in an already focused, context-specific search engine for the environment.

Digital Commons @ New Jersey Institute of Technology (NJIT)

Flexible and scalable digital library search

Author: Blok H.E.
Petkovic M.
Schmidt A.R.
Windhouwer M.A. (Menzo)
Zwol R. van
Publication venue: CWI
Publication date: 01/01/2001

This report describes the development of a specialised search engine for a digital library. The proposed system architecture consists of three levels: the conceptual, the logical and the physical level. By exposing a domain-specific schema, the conceptual level enables semantically rich conceptual search. The logical level provides a description language to achieve a high degree of flexibility for multimedia retrieval. The physical level takes care of scalable and efficient persistent data storage. The role played by each level changes during the various stages of a search engine's lifecycle: (1) modeling the index, (2) populating and maintaining the index and (3) querying the index. The integration of all this functionality allows conceptual and content-based querying to be combined in the query stage. A search engine for the Australian Open tennis tournament website is used as a running example, which shows the power of the complete architecture and its various components.

CWI's Institutional Repository