
    Bayesian Network and Network Pruning Strategy for XML Duplicate Detection

    Data duplication causes excess use of redundant storage, excess processing time, and inconsistency. Duplicate detection helps ensure accurate data by identifying and preventing identical or similar records. There is a long line of work on identifying duplicates in relational data, but only a few solutions have focused on duplicate detection in more complex hierarchical structures such as XML data. Hierarchical data are defined as a set of data items related to each other by hierarchical relationships, as in XML. In the world of XML there are not necessarily uniform and clearly defined structures like tables, and methods devised for duplicate detection in a single relation do not directly apply to XML data. Therefore, a method is needed to detect duplicate objects in nested XML data. In the proposed system, duplicates are detected by a duplicate detection algorithm called XMLDup. The proposed XMLDup method uses a Bayesian network: it determines the probability of two XML elements being duplicates by considering both the information within the elements and the structure of that information. To improve the Bayesian network evaluation time, a pruning strategy is used. Finally, the work is evaluated by measuring precision and recall values.
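    A minimal Python sketch of the idea (not the authors' XMLDup network): each shared field contributes a similarity-based factor to an overall duplicate probability, and evaluation stops early once the decision threshold can no longer be reached, mirroring the pruning strategy. The per-field factor (0.5 + 0.5 * sim) and the flattened element representation are illustrative assumptions.

        from difflib import SequenceMatcher

        def value_similarity(a: str, b: str) -> float:
            """Similarity of two leaf values, in [0, 1]."""
            return SequenceMatcher(None, a, b).ratio()

        def duplicate_probability(elem_a: dict, elem_b: dict, threshold: float = 0.5) -> float:
            """Combine per-field similarities into a duplicate probability,
            pruning the evaluation once the threshold is unreachable."""
            shared = sorted(set(elem_a) & set(elem_b))
            if not shared:
                return 0.0
            prob = 1.0
            for field in shared:
                sim = value_similarity(str(elem_a[field]), str(elem_b[field]))
                prob *= 0.5 + 0.5 * sim      # naive per-field factor
                if prob < threshold:         # remaining factors are at most 1.0,
                    return prob              # so the threshold is unreachable: prune
            return prob

        # two <movie> elements flattened to field/value pairs
        a = {"title": "The Matrix", "year": "1999", "director": "Wachowski"}
        b = {"title": "Matrix, The", "year": "1999", "director": "Wachowski"}
        print(duplicate_probability(a, b))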

    Duplicate Detection in Probabilistic Data

    Collected data often contain uncertainties. Probabilistic databases have been proposed to manage uncertain data. To combine data from multiple autonomous probabilistic databases, an integration of probabilistic data has to be performed. Until now, however, data integration approaches have focused on the integration of certain source data (relational or XML); there is no work on the integration of uncertain (especially probabilistic) source data so far. In this paper, we present a first step towards a concise consolidation of probabilistic data. We focus on duplicate detection as a representative and essential step in an integration process. We present techniques for identifying multiple probabilistic representations of the same real-world entities. Furthermore, to increase the efficiency of the duplicate detection process, we introduce search space reduction methods adapted to probabilistic data.
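    As a rough illustration of matching uncertain records (a sketch of the general intuition, not the techniques of this paper): each entity is a set of weighted alternatives, and the match probability is the probability mass of alternative pairs that are similar enough, assuming independent alternatives. All names and thresholds below are hypothetical.

        from itertools import product
        from difflib import SequenceMatcher

        def sim(a: str, b: str) -> float:
            return SequenceMatcher(None, a, b).ratio()

        def match_probability(rep_a, rep_b, threshold: float = 0.8) -> float:
            """rep_a, rep_b: lists of (value, probability) alternatives describing
            one uncertain entity each. Returns the probability that a pair of
            alternatives drawn from them is similar enough to match."""
            p = 0.0
            for (va, pa), (vb, pb) in product(rep_a, rep_b):
                if sim(va, vb) >= threshold:
                    p += pa * pb             # alternatives assumed independent
            return p

        customer_a = [("Jon Smith", 0.7), ("John Smith", 0.3)]
        customer_b = [("John Smith", 0.9), ("J. Smith", 0.1)]
        print(match_probability(customer_a, customer_b))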

    Coreference detection in XML metadata

    Preserving data quality is an important issue in data collection management. One of the crucial issues here is the detection of duplicate objects (called coreferent objects), which describe the same entity but in different ways. In this paper we present a method for detecting coreferent objects in metadata, in particular in XML schemas. Our approach consists of comparing the paths from a root element to a given element in the schema. Each path precisely defines the context and location of a specific element in the schema. Path matching is based on the comparison of the different steps of which the paths are composed. The uncertainty about the matching of steps is expressed with possibilistic truth values and aggregated using the Sugeno integral. The discovered coreference of paths can help in determining the coreference of different XML schemas.
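    A small sketch of path-based matching with Sugeno-integral aggregation. The step similarity scores and the cardinality-based fuzzy measure below are assumptions made for illustration, not the paper's actual definitions.

        def step_similarity(step_a: str, step_b: str) -> float:
            """Possibilistic truth degree that two path steps (element names) match."""
            a, b = step_a.lower(), step_b.lower()
            if a == b:
                return 1.0
            if a in b or b in a:
                return 0.7
            return 0.0

        def sugeno_integral(values, measure):
            """Sugeno integral of `values` w.r.t. a fuzzy measure given as
            measure(k) = weight of the k largest criteria, k = 1..n."""
            vals = sorted(values, reverse=True)
            return max(min(v, measure(k)) for k, v in enumerate(vals, start=1))

        def path_match(path_a, path_b):
            """Aggregate step-wise matching degrees along two root-to-element paths."""
            n = min(len(path_a), len(path_b))
            degrees = [step_similarity(a, b) for a, b in zip(path_a, path_b)]
            return sugeno_integral(degrees, lambda k: k / n)   # cardinality-based measure

        print(path_match(["schema", "customer", "address", "city"],
                         ["schema", "client", "address", "city"]))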

    Automatic test cases generation from software specifications modules

    A new technique is proposed in this paper to extend the Integrated Classification Tree Methodology (ICTM) developed by Chen et al. [13]. This software assists testers in constructing test cases from functional specifications. A Unified Modelling Language (UML) class diagram and the Object Constraint Language (OCL) are used in this paper to represent the software specifications. Each classification and associated class in the software specification is represented by classes and attributes in the class diagram. Software specification relationships are represented by association and hierarchical relationships in the class diagram. To ensure that relationships are consistent, an automatic methodology is proposed to capture and control the class relationships in a systematic way. This helps to reduce duplicate and illegitimate test cases, which improves testing efficiency and minimises the time and cost of testing. The methodology introduced in this paper extracts only the legitimate test cases by removing duplicate test cases and those incompatible with the software specifications. Executing all of the generated test cases would require a large amount of time; therefore, a methodology is proposed that selects a best testing path. This path guarantees the highest coverage of system units, avoids using all generated test cases, and reduces the time and cost of testing.
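    A minimal sketch of the filtering step described above, assuming hypothetical classifications and an OCL-like constraint (neither is taken from the paper): candidate test cases are generated as combinations of classes, and duplicates or combinations that violate the constraint are dropped.

        from itertools import product

        # hypothetical classifications extracted from a UML class diagram
        classifications = {
            "payment":  ["card", "cash"],
            "customer": ["new", "registered"],
            "amount":   ["small", "large"],
        }

        def is_legitimate(case: dict) -> bool:
            """Hypothetical OCL-like constraint: new customers cannot pay large amounts in cash."""
            return not (case["customer"] == "new"
                        and case["payment"] == "cash"
                        and case["amount"] == "large")

        keys = list(classifications)
        seen, test_cases = set(), []
        for combo in product(*(classifications[k] for k in keys)):
            case = dict(zip(keys, combo))
            signature = tuple(sorted(case.items()))
            if signature in seen or not is_legitimate(case):
                continue                     # drop duplicate and illegitimate cases
            seen.add(signature)
            test_cases.append(case)

        print(len(test_cases), "legitimate test cases")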

    Measuring the similarity of PML documents with RFID-based sensors

    The Electronic Product Code (EPC) Network is an important part of the Internet of Things. The Physical Mark-Up Language (PML) is used to represent and describe data related to objects in the EPC Network. The PML documents that components exchange in an EPC Network system are XML documents based on the PML Core schema. To manage the huge number of PML documents for tags captured by Radio Frequency Identification (RFID) readers, it is inevitable to develop high-performance technology for filtering and integrating these tag data. In this paper we therefore propose an approach for measuring the similarity of PML documents from several sensors based on a Bayesian network. With respect to the features of PML, before measuring the similarity we first reduce the redundant data, retaining only the EPC information. On this basis, the Bayesian network model derived from the structure of the PML documents being compared is constructed.
    Comment: International Journal of Ad Hoc and Ubiquitous Computing
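    The redundancy-reduction step can be pictured as keeping only the EPC identifiers before comparing documents. The PML snippet and namespace below are invented stand-ins (not the actual PML Core schema), and the Jaccard overlap merely stands in for the Bayesian network comparison of the reduced documents.

        import xml.etree.ElementTree as ET

        PML = """<pml:Sensor xmlns:pml="urn:example:pml">
          <pml:ID>urn:epc:id:sgtin:0614141.107346.2017</pml:ID>
          <pml:Observation>
            <pml:DateTime>2024-01-01T12:00:00</pml:DateTime>
            <pml:Tag><pml:ID>urn:epc:id:sgtin:0614141.107346.2018</pml:ID></pml:Tag>
          </pml:Observation>
        </pml:Sensor>"""

        def epc_ids(doc: str) -> set:
            """Keep only EPC identifiers; everything else is treated as redundant."""
            root = ET.fromstring(doc)
            return {el.text for el in root.iter() if el.tag.endswith("ID") and el.text}

        def reduced_similarity(doc_a: str, doc_b: str) -> float:
            """Jaccard overlap of the retained EPC identifiers."""
            a, b = epc_ids(doc_a), epc_ids(doc_b)
            return len(a & b) / len(a | b) if a | b else 0.0

        print(sorted(epc_ids(PML)))
        print(reduced_similarity(PML, PML))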

    Pairwise similarity of TopSig document signatures

    This paper analyses the pairwise distances of signatures produced by the TopSig retrieval model on two document collections. The distribution of the distances is compared to that of purely random signatures. This explains why TopSig is only competitive with state-of-the-art retrieval models at early precision: only the local neighbourhood of the signatures is interpretable. We suggest this is a common property of vector space models.
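    To see the baseline such distances are compared against, one can compute the pairwise Hamming distances of purely random binary signatures, which concentrate tightly around half the signature length. This is only an illustration of the random baseline, not the TopSig model itself; the signature length and collection size are arbitrary.

        import numpy as np

        rng = np.random.default_rng(0)
        bits, n_docs = 1024, 100
        random_sigs = rng.integers(0, 2, size=(n_docs, bits), dtype=np.uint8)

        def pairwise_hamming(sig: np.ndarray) -> np.ndarray:
            """Hamming distances between all pairs of rows of a 0/1 signature matrix."""
            n = len(sig)
            return np.array([np.count_nonzero(sig[i] ^ sig[j])
                             for i in range(n) for j in range(i + 1, n)])

        d = pairwise_hamming(random_sigs)
        # random signatures cluster around bits / 2 (~512 here); a retrieval-quality
        # signature must place true neighbours well below this band
        print(d.mean(), d.std())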

    Cardinality heterogeneities in Web service composition: Issues and solutions

    Data exchanges between Web services engaged in a composition raise several heterogeneities. In this paper, we address the problem of data cardinality heterogeneity in a composition. Firstly, we build a theoretical framework to describe the different aspects of Web services that relate to data cardinality; secondly, we solve this problem by developing a solution for cardinality mediation based on constraint logic programming.

    Biochemical network matching and composition

    This paper looks at biochemical network matching and composition.