10,494 research outputs found

    Structure and content semantic similarity detection of eXtensible markup language documents using keys

    Get PDF
    XML (eXtensible Mark-up Language) has become the fundamental standard for efficient data management and exchange. Due to the widespread use of XML for describing and exchanging data on the web, XML-based comparison is central issues in database management and information retrieval. In fact, although many heterogeneous XML sources have similar content, they may be described using different tag names and structures. This work proposes a series of algorithms for detection of structural and content changes among XML data. The first is an algorithm called XDoI (XML Data Integration Based on Content and Structure Similarity Using Keys) that clusters XML documents into subtrees using leaf-node parents as clustering points. This algorithm matches subtrees using the key concept and compares unmatched subtrees for similarities in both content and structure. The experimental results show that this approach finds much more accurate matches with or without the presence of keys in the subtrees. A second algorithm proposed here is called XDI-CSSK (a system for detecting xml similarity in content and structure using relational database); it eliminates unnecessary clustering points using instance statistics and a taxonomic analyzer. As the number of subtrees to be compared is reduced, the overall execution time is reduced dramatically. Semantic similarity plays a crucial role in precise computational similarity measures. A third algorithm, called XML-SIM (structure and content semantic similarity detection using keys) is based on previous work to detect XML semantic similarity based on structure and content. This algorithm is an improvement over XDI-CSSK and XDoI in that it determines content similarity based on semantic structural similarity. In an experimental evaluation, it outperformed previous approaches in terms of both execution time and false positive rates. Information changes periodically; therefore, it is important to be able to detect changes among different versions of an XML document and use that information to identify semantic similarities. Finally, this work introduces an approach to detect XML similarity and thus to join XML document versions using a change detection mechanism. In this approach, subtree keys still play an important role in order to avoid unnecessary subtree comparisons within multiple versions of the same document. Real data sets from bibliographic domains demonstrate the effectiveness of all these algorithms --Abstract, page iv-v

    XML Matchers: approaches and challenges

    Full text link
    Schema Matching, i.e. the process of discovering semantic correspondences between concepts adopted in different data source schemas, has been a key topic in Database and Artificial Intelligence research areas for many years. In the past, it was largely investigated especially for classical database models (e.g., E/R schemas, relational databases, etc.). However, in the latest years, the widespread adoption of XML in the most disparate application fields pushed a growing number of researchers to design XML-specific Schema Matching approaches, called XML Matchers, aiming at finding semantic matchings between concepts defined in DTDs and XSDs. XML Matchers do not just take well-known techniques originally designed for other data models and apply them on DTDs/XSDs, but they exploit specific XML features (e.g., the hierarchical structure of a DTD/XSD) to improve the performance of the Schema Matching process. The design of XML Matchers is currently a well-established research area. The main goal of this paper is to provide a detailed description and classification of XML Matchers. We first describe to what extent the specificities of DTDs/XSDs impact on the Schema Matching task. Then we introduce a template, called XML Matcher Template, that describes the main components of an XML Matcher, their role and behavior. We illustrate how each of these components has been implemented in some popular XML Matchers. We consider our XML Matcher Template as the baseline for objectively comparing approaches that, at first glance, might appear as unrelated. The introduction of this template can be useful in the design of future XML Matchers. Finally, we analyze commercial tools implementing XML Matchers and introduce two challenging issues strictly related to this topic, namely XML source clustering and uncertainty management in XML Matchers.Comment: 34 pages, 8 tables, 7 figure

    XML Schema Clustering with Semantic and Hierarchical Similarity Measures

    Get PDF
    With the growing popularity of XML as the data representation language, collections of the XML data are exploded in numbers. The methods are required to manage and discover the useful information from them for the improved document handling. We present a schema clustering process by organising the heterogeneous XML schemas into various groups. The methodology considers not only the linguistic and the context of the elements but also the hierarchical structural similarity. We support our findings with experiments and analysis

    A Progressive Clustering Algorithm to Group the XML Data by Structural and Semantic Similarity

    Get PDF
    Since the emergence in the popularity of XML for data representation and exchange over the Web, the distribution of XML documents has rapidly increased. It has become a challenge for researchers to turn these documents into a more useful information utility. In this paper, we introduce a novel clustering algorithm PCXSS that keeps the heterogeneous XML documents into various groups according to their similar structural and semantic representations. We develop a global criterion function CPSim that progressively measures the similarity between a XML document and existing clusters, ignoring the need to compute the similarity between two individual documents. The experimental analysis shows the method to be fast and accurate

    Automated schema matching techniques: an exploratory study

    Get PDF
    Manual schema matching is a problem for many database applications that use multiple data sources including data warehousing and e-commerce applications. Current research attempts to address this problem by developing algorithms to automate aspects of the schema-matching task. In this paper, an approach using an external dictionary facilitates automated discovery of the semantic meaning of database schema terms. An experimental study was conducted to evaluate the performance and accuracy of five schema-matching techniques with the proposed approach, called SemMA. The proposed approach and results are compared with two existing semi-automated schema-matching approaches and suggestions for future research are made

    Improving the Representation and Conversion of Mathematical Formulae by Considering their Textual Context

    Full text link
    Mathematical formulae represent complex semantic information in a concise form. Especially in Science, Technology, Engineering, and Mathematics, mathematical formulae are crucial to communicate information, e.g., in scientific papers, and to perform computations using computer algebra systems. Enabling computers to access the information encoded in mathematical formulae requires machine-readable formats that can represent both the presentation and content, i.e., the semantics, of formulae. Exchanging such information between systems additionally requires conversion methods for mathematical representation formats. We analyze how the semantic enrichment of formulae improves the format conversion process and show that considering the textual context of formulae reduces the error rate of such conversions. Our main contributions are: (1) providing an openly available benchmark dataset for the mathematical format conversion task consisting of a newly created test collection, an extensive, manually curated gold standard and task-specific evaluation metrics; (2) performing a quantitative evaluation of state-of-the-art tools for mathematical format conversions; (3) presenting a new approach that considers the textual context of formulae to reduce the error rate for mathematical format conversions. Our benchmark dataset facilitates future research on mathematical format conversions as well as research on many problems in mathematical information retrieval. Because we annotated and linked all components of formulae, e.g., identifiers, operators and other entities, to Wikidata entries, the gold standard can, for instance, be used to train methods for formula concept discovery and recognition. Such methods can then be applied to improve mathematical information retrieval systems, e.g., for semantic formula search, recommendation of mathematical content, or detection of mathematical plagiarism.Comment: 10 pages, 4 figure

    A comparative analysis of web-based GIS applications using usability metrics

    Get PDF
    With the rapid expansion of the internet, Web-based Geographic Information System (WGIS) applications have gained popularity, despite the interface of the WGIS application being difficult to learn and understand because special functions are needed to manipulate the maps. Hence, it is essential to evaluate the usability of WGIS applications. Usability is an important factor in ensuring the development of quality, usable software products. On the other hand, there are a number of standards and models in the literature, each of which describes usability in terms of various set of attributes. These models are vague and difficult to understand. Therefore, the primary purpose of this study is to compare five common usability models (Shackel, Nielsen, ISO 9241 P-11, ISO 9126-1 and QUIM) to identify usability metrics that have most frequently used in the previous models. The questionnaire method and the automated usability evaluation method by using Loop11 tool were used, in order to evaluate the usability metrics for three case studies of commonly used WGIS applications as Google maps, Yahoo maps, and MapQuest. Finally, those case studies were compared and analysed based on usability metrics that have been identified. Based on a comparative study, four usability metrics (Effectiveness, Efficiency, Satisfaction and Learnability) were identified. Those usability metrics were characterized by consistent, comprehensive, not vaguely and proper to evaluate the usability of WGIS applications. In addition, there was a positive correlation between these usability metrics. The comparative analysis indicates that Effectiveness, Satisfaction and Learnability were higher, and the Efficiency was lesser by using the Loop11 tool compared to questionnaire method for the three case studies. In addition, Yahoo Maps and MapQuest have usability metrics rate lesser than Google Maps by applying two methods. Therefore, Google Maps is more usable compared to Yahoo Maps and MapQuest

    Semantically Resolving Type Mismatches in Scientific Workflows

    No full text
    Scientists are increasingly utilizing Grids to manage large data sets and execute scientific experiments on distributed resources. Scientific workflows are used as means for modeling and enacting scientific experiments. Windows Workflow Foundation (WF) is a major component of Microsoft’s .NET technology which offers lightweight support for long-running workflows. It provides a comfortable graphical and programmatic environment for the development of extended BPEL-style workflows. WF’s visual features ease the syntactic composition of Web services into scientific workflows but do nothing to assure that information passed between services has consistent semantic types or representations or that deviant flows, errors and compensations are handled meaningfully. In this paper we introduce SAWSDL-compliant annotations for WF and use them with a semantic reasoner to guarantee semantic type correctness in scientific workflows. Examples from bioinformatics are presented
    • …
    corecore