    Bayesian Network and Network Pruning Strategy for XML Duplicate Detection

    Data duplication causes redundant storage, wasted time, and inconsistency. Duplicate detection helps ensure accurate data by identifying and preventing identical or similar records. There is a long line of work on identifying duplicates in relational data, but few solutions have focused on duplicate detection in more complex hierarchical structures such as XML. Hierarchical data are defined as a set of data items related to each other by hierarchical relationships, as in XML, where there are not necessarily uniform and clearly defined structures like tables. Consequently, methods devised for duplicate detection in a single relation do not directly apply to XML data, and a method for detecting duplicate objects in nested XML data is needed. In the proposed system, duplicates are detected by a duplicate detection algorithm called XMLDup. The proposed XMLDup method uses a Bayesian network: it determines the probability of two XML elements being duplicates by considering both the information within the elements and the structure of that information. To improve the Bayesian network evaluation time, a pruning strategy is used. Finally, the work is analyzed by measuring precision and recall values.
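
    The abstract does not specify XMLDup's network structure or pruning bound, so the following is a minimal sketch of the general idea under assumed details: per-attribute similarity scores are combined into a duplicate probability, and evaluation stops early once the running score, which can only decrease, falls below the decision threshold. The weighting scheme and product-form combination are illustrative assumptions, not the published algorithm.

    # Minimal sketch, NOT the published XMLDup model: combine per-attribute
    # similarities into a duplicate probability and prune the evaluation
    # early once the score can no longer reach the threshold.
    def duplicate_probability(similarities, weights, threshold=0.5):
        """similarities: per-attribute scores in [0, 1]; weights: importance.

        Each factor is at most 1, so the running product is an upper bound
        on the final score; once it drops below `threshold`, no remaining
        attribute can raise it again and the pair can be pruned.
        """
        prob, total = 1.0, sum(weights)
        for sim, w in zip(similarities, weights):
            prob *= sim ** (w / total)  # weighted contribution of one attribute
            if prob < threshold:
                return prob, True       # pruned: cannot be a duplicate
        return prob, False

    # The weak second attribute (0.2) triggers pruning before the third
    # attribute is ever evaluated.
    print(duplicate_probability([0.9, 0.2, 0.8], [1.0, 2.0, 1.0]))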

    Measuring the similarity of PML documents with RFID-based sensors

    The Electronic Product Code (EPC) Network is an important part of the Internet of Things. The Physical Mark-Up Language (PML) is used to represent and describe data related to objects in the EPC Network. The PML documents that components exchange in an EPC Network system are XML documents based on the PML Core schema. Managing the huge number of PML documents produced for tags captured by Radio Frequency Identification (RFID) readers requires high-performance technology for filtering and integrating these tag data. In this paper we therefore propose an approach for measuring the similarity of PML documents based on a Bayesian network over several sensors. With respect to the features of PML, before measuring similarity we first remove the redundant data, keeping only the EPC information. On this basis, a Bayesian network model derived from the structure of the PML documents being compared is constructed. Comment: International Journal of Ad Hoc and Ubiquitous Computing
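
    The redundancy-reduction step lends itself to a short sketch. The code below keeps only EPC identifiers from two PML-like documents and compares the resulting sets; the tag name "ID" and the Jaccard comparison are stand-ins for illustration, since the abstract gives neither the PML Core element names nor the Bayesian network model itself.

    # Sketch of the reduction step: strip a PML-style document down to its
    # EPC identifiers before any comparison. Tag names are hypothetical.
    import xml.etree.ElementTree as ET

    def extract_epc_ids(pml_text):
        """Collect the text of every <ID> element; all other data is
        treated as redundant and discarded."""
        root = ET.fromstring(pml_text)
        return {el.text.strip() for el in root.iter("ID") if el.text}

    def epc_similarity(doc_a, doc_b):
        """Jaccard overlap of EPC sets -- a crude placeholder for the
        paper's Bayesian network model."""
        a, b = extract_epc_ids(doc_a), extract_epc_ids(doc_b)
        return len(a & b) / len(a | b) if (a or b) else 1.0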

    Semantical mapping of attribute values for data integration

    Nowadays the amount of data is growing very quickly, and useful information is scattered over multiple sources, so automatic data integration that guarantees high data quality is extremely important. One of the crucial operations in integrating information from independent databases is the detection of different representations of the same piece of information (called coreferent data) and the translation of data from the representation used by one source into that used by the other. This translation is also known as object mapping. In this paper, we investigate automatic mapping methods for attributes whose values may need semantic comparison and can be ordered by a relation that reflects a notion of generality. These mapping methods are examined closely in terms of their effectiveness. An experimental evaluation of our method shows that using different mapping methods can enlarge the set of true positive mappings.
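
    To make the generality-based setting concrete, here is a small sketch under assumed data: attribute values are ordered by an is-a hierarchy, and a source value unknown to the target vocabulary is mapped to its nearest generalization that the target does know. The hierarchy and vocabularies are invented for the example; the paper's actual mapping methods are not described in the abstract.

    # Sketch: map a value along a generality (is-a) order into a target
    # vocabulary. The hierarchy below is a made-up example.
    GENERALITY = {          # child -> parent ("is more specific than")
        "espresso": "coffee",
        "latte": "coffee",
        "coffee": "hot drink",
        "hot drink": "beverage",
    }

    def map_value(value, target_vocabulary):
        """Walk up the generality order until a value known to the target
        source is found; return None if no generalization matches."""
        current = value
        while current is not None:
            if current in target_vocabulary:
                return current
            current = GENERALITY.get(current)  # generalize one step
        return None

    # "espresso" is unknown to the target, but its generalization is not.
    print(map_value("espresso", {"coffee", "tea"}))  # -> coffee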

    Structure and content semantic similarity detection of eXtensible markup language documents using keys

    XML (eXtensible Mark-up Language) has become the fundamental standard for efficient data management and exchange. Due to the widespread use of XML for describing and exchanging data on the web, XML-based comparison is a central issue in database management and information retrieval. In fact, although many heterogeneous XML sources have similar content, they may be described using different tag names and structures. This work proposes a series of algorithms for detecting structural and content changes among XML data. The first is an algorithm called XDoI (XML Data Integration Based on Content and Structure Similarity Using Keys), which clusters XML documents into subtrees using leaf-node parents as clustering points. This algorithm matches subtrees using the key concept and compares unmatched subtrees for similarities in both content and structure. The experimental results show that this approach finds much more accurate matches with or without the presence of keys in the subtrees. A second algorithm, called XDI-CSSK (a system for detecting XML similarity in content and structure using relational databases), eliminates unnecessary clustering points using instance statistics and a taxonomic analyzer. As the number of subtrees to be compared is reduced, the overall execution time drops dramatically. Semantic similarity plays a crucial role in precise computational similarity measures. A third algorithm, called XML-SIM (structure and content semantic similarity detection using keys), builds on the previous work to detect XML semantic similarity based on structure and content; it improves over XDI-CSSK and XDoI by determining content similarity based on semantic structural similarity. In an experimental evaluation, it outperformed the previous approaches in terms of both execution time and false positive rates. Information changes periodically, so it is important to be able to detect changes among different versions of an XML document and use that information to identify semantic similarities. Finally, this work introduces an approach to detect XML similarity, and thus to join XML document versions, using a change detection mechanism. In this approach, subtree keys again play an important role in avoiding unnecessary subtree comparisons across multiple versions of the same document. Real data sets from bibliographic domains demonstrate the effectiveness of all these algorithms. --Abstract, pages iv-v
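
    The clustering step of XDoI can be illustrated with a short sketch: carve a document into subtrees rooted at leaf-node parents, then match subtrees across documents by a shared key. The key element name and document shape below are assumptions for the example, not the thesis implementation.

    # Sketch of XDoI-style clustering: subtrees rooted at leaf-node parents,
    # matched across documents by a key child element (here assumed "<id>").
    import xml.etree.ElementTree as ET

    def leaf_node_parents(root):
        """Yield elements whose children are all leaves -- the clustering
        points used to carve the document into subtrees."""
        for el in root.iter():
            children = list(el)
            if children and all(len(c) == 0 for c in children):
                yield el

    def cluster_by_key(root, key_tag="id"):
        """Map key value -> subtree; keyed subtrees match directly, so only
        unkeyed ones need full content/structure comparison."""
        return {
            sub.find(key_tag).text: sub
            for sub in leaf_node_parents(root)
            if sub.find(key_tag) is not None and sub.find(key_tag).text
        }

    a = ET.fromstring("<bib><book><id>k1</id><title>XML</title></book></bib>")
    b = ET.fromstring("<bib><book><id>k1</id><title>XML Data</title></book></bib>")
    print(cluster_by_key(a).keys() & cluster_by_key(b).keys())  # -> {'k1'}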

    Correlation-based methods for data cleaning, with application to biological databases

    Ph.D. (Doctor of Philosophy)

    Training Selection for Tuning Entity Matching

    Entity matching is a crucial and difficult task for data integration. An effective solution strategy typically has to combine several techniques and find suitable settings for critical configuration parameters such as similarity thresholds. Supervised (training-based) approaches promise to reduce the manual work of determining (learning) effective strategies for entity matching. However, they critically depend on training data selection, a difficult problem that has so far mostly been addressed manually by human experts. In this paper we propose a training-based framework for entity matching, called STEM, and present several generic methods for automatically selecting training data to combine and configure multiple matching techniques. We evaluate the proposed methods on different match tasks and on small- and medium-sized training sets.
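
    The abstract does not detail STEM's selection methods, so the sketch below shows one generic heuristic of the kind such a framework might include: label the candidate pairs whose similarity falls closest to the current decision threshold, since those ambiguous pairs are the most informative for tuning it. The rule, parameter names, and data are assumptions, not necessarily what STEM implements.

    # Sketch of a generic training-selection heuristic (not necessarily
    # STEM's): pick the candidate pairs nearest the similarity threshold.
    def select_training_pairs(candidates, threshold=0.75, budget=100):
        """candidates: (record_a, record_b, similarity) triples.
        Returns the `budget` most ambiguous pairs for manual labeling."""
        ranked = sorted(candidates, key=lambda c: abs(c[2] - threshold))
        return ranked[:budget]

    # The near-threshold pair (0.74) is chosen; the confident match (0.98)
    # and clear non-match (0.10) would add little to threshold tuning.
    pairs = [("a1", "b1", 0.98), ("a2", "b2", 0.74), ("a3", "b3", 0.10)]
    print(select_training_pairs(pairs, budget=1))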