

    Record linkage is the process of matching records from several databases that refer to the same entities. When applied to a single database, this process is known as deduplication. Matched data are becoming increasingly important in many application areas, because they can contain information that is not available otherwise, or that is too costly to acquire. Removing duplicate records in a single database is a crucial step in the data cleaning process, because duplicates can severely influence the outcomes of any subsequent data processing or data mining. With the increasing size of today’s databases, the complexity of the matching process becomes one of the major challenges for record linkage and deduplication. In recent years, various indexing techniques have been developed for record linkage and deduplication. They aim to reduce the number of record pairs to be compared in the matching process by removing obvious nonmatching pairs, while at the same time maintaining high matching quality. This paper presents a survey of variations of six indexing techniques. Their complexity is analyzed, and their performance and scalability are evaluated within an experimental framework using both synthetic and real data sets. These experiments highlight that one of the most important factors for efficient and accurate indexing for record linkage and deduplication is the proper definition of blocking keys.
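
    The effect of a blocking key on the number of compared pairs can be illustrated with a minimal sketch of standard blocking. The record fields (`surname`, `postcode`) and the key definition are assumptions made for illustration only; they are not taken from the surveyed paper.

```python
from collections import defaultdict
from itertools import combinations


def blocking_key(record):
    """Hypothetical blocking key: first 3 letters of the surname plus the postcode."""
    return record["surname"][:3].lower() + "|" + record["postcode"]


def candidate_pairs(records, key_func=blocking_key):
    """Return the set of record-id pairs that share a blocking key value."""
    blocks = defaultdict(list)
    for rec in records:
        blocks[key_func(rec)].append(rec["id"])
    pairs = set()
    for ids in blocks.values():
        # Only records that fall into the same block are compared.
        pairs.update(combinations(sorted(ids), 2))
    return pairs


# Illustrative records (made up for this sketch).
records = [
    {"id": 1, "surname": "Smith", "postcode": "2600"},
    {"id": 2, "surname": "Smithe", "postcode": "2600"},
    {"id": 3, "surname": "Smyth", "postcode": "2600"},
    {"id": 4, "surname": "Jones", "postcode": "2913"},
]

all_pairs = len(records) * (len(records) - 1) // 2  # full pairwise comparison
pairs = candidate_pairs(records)
print(f"{len(pairs)} candidate pair(s) instead of {all_pairs}: {sorted(pairs)}")
```

    With this key, only records 1 and 2 are compared. Record 3 ("Smyth") lands in a different block and would be missed if it is a true match, which is the kind of trade-off that makes the definition of the blocking key so important.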

    Microdata Deduplication with Spark

    The web is transforming from a traditional web into a web of data, where information is presented so that it is readable by machines as well as by humans. As part of this transformation, more and more websites embed structured data (e.g. products, persons, organizations, and places) in their HTML pages every day, helped by the standards for presenting structured content that search engine developers have defined. Different encoding formats, such as RDFa, microdata, and microformats, are used to embed the structured data. Microdata is the most recent addition to these embedding standards, but it has gained more popularity than the other formats in less time. Similarly, progress has been made in extracting structured data from web pages, which has resulted in open source tools such as Apache Any23 and in the non-profit Common Crawl project. Any23 extracts microdata from web pages with little effort and makes it available as linked data, whereas Common Crawl extracts data from websites and provides it publicly for download. However, extracting structured data from the web is no longer the biggest technical challenge: the extraction tools only take care of the parsing and data transformation steps of data cleansing. Although microdata can be extracted easily with these state-of-the-art tools, duplicates must be removed, inconsistencies resolved, and the data unified before the extracted data is used in applications. Since microdata originates from arbitrary web resources, its quality is arbitrary as well, and it should be treated accordingly.

    The main purpose of this thesis is to develop an effective mechanism for deduplicating microdata at web scale. Although deduplication algorithms have reached relative maturity, they still need to be fine-tuned for specific datasets. In particular, the most suitable length of the sorting key must be identified for the sort-based deduplication approach. The present work identifies the optimum length of the sorting key in the context of deduplicating extracted product microdata. Because large volumes of data have to be processed continuously, Apache Spark is used to implement the necessary procedures.
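
    The role of the sorting key length can be sketched in a few lines. This is not the thesis implementation: it is plain Python rather than Spark, it uses one simple sort-based variant (sort by a key prefix and compare only records whose prefixes are equal), and the product names and the `sorting_key` normalization are assumptions made for illustration.

```python
from itertools import combinations, groupby


def sorting_key(record, key_length):
    """Hypothetical sorting key: first `key_length` characters of the
    lowercased product name with all non-alphanumeric characters removed."""
    normalized = "".join(ch for ch in record["name"].lower() if ch.isalnum())
    return normalized[:key_length]


def candidate_pairs(records, key_length):
    """Sort records by key and pair up the records whose keys are equal."""
    keyed = sorted((sorting_key(r, key_length), r["id"]) for r in records)
    pairs = set()
    for _, group in groupby(keyed, key=lambda item: item[0]):
        ids = [rec_id for _, rec_id in group]
        pairs.update(combinations(ids, 2))
    return pairs


# Illustrative product records (made up for this sketch).
products = [
    {"id": 1, "name": "Apple iPhone 6 16GB"},
    {"id": 2, "name": "apple iphone6 16GB Black"},
    {"id": 3, "name": "Apple iPad Air 2"},
    {"id": 4, "name": "Samsung Galaxy S6"},
]

for key_length in (5, 8, 20):
    print(key_length, sorted(candidate_pairs(products, key_length)))
```

    In this toy data set, a key of length 5 also pairs the iPad with both iPhone records, a key of length 8 keeps only the pair of iPhone records, and a key of length 20 yields no candidate pairs at all. A key that is too short produces unnecessary comparisons, and one that is too long can miss likely duplicates, which is why the length has to be tuned per dataset.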


    Sharing data between organizations is of growing importance in many data mining projects. Data from various heterogeneous sources often has to be linked and aggregated in order to improve data quality. The importance of data accuracy and quality has increased with the explosion of data size. The first step in ensuring data accuracy is to make sure that each real-world object is represented once and only once in a given dataset, a task called Duplicate Record Detection (DRD). Data inaccuracies arise from several factors, including spelling, typographical, and pronunciation variations, dialects, special vowel and consonant distinctions, and other linguistic characteristics, especially in non-Latin languages such as Arabic. In this paper, an English/Arabic-enabled web-based framework is designed and implemented. It incorporates user interaction to add new rules, enrich the dictionary, and evaluate results, which is an important step in improving the system’s behavior. The proposed framework can process both single-language and bilingual datasets. It is implemented and verified empirically in several case studies. The comparison results show that the proposed system offers substantial improvements over known tools.

    Cloud-Scale Entity Resolution: Current State and Open Challenges

    Entity resolution (ER) is the process of identifying records in information systems that refer to the same real-world entity. Because data volumes have grown so large over the past two decades, parallel techniques are called upon to satisfy the ER requirements of high performance and scalability. The development of parallel ER has reached a relatively prosperous stage, and has found its way into several applications. In this work, we first comprehensively survey the state of the art of parallel ER approaches. From this overview, we then extract classification criteria for parallel ER, and classify and compare the approaches based on these criteria. Finally, we identify open research questions and challenges, and discuss potential solutions and directions for further research in this field.

    A Combined Approach For Private Indexing Mechanism

    Private indexing is a set of approaches for analyzing research data that are similar or that resemble one another. It is used in a database to keep track of keys and their values. The main subject of this research is private indexing in record linkage to secure the data. Because unique personal identification numbers or social security numbers are not accessible in most countries or databases, data linkage is limited to attributes such as date of birth and names to determine which records correspond to which real-life entities. For security reasons, these identifiers must be encrypted. Privacy-preserving record linkage, frequently used to link private data across several databases held by different companies, prevents sensitive information from being exposed to the other companies. This research uses a combined method to evaluate the data, drawing on both classic and new indexing methods. The combined approach is more secure than typical standard indexing in terms of privacy. Multibit tree indexing, which groups comparable data in many ways, creates a scalable tree-like structure that is flexible in both space and time, as it avoids the need for redundant block structures. Because the set of record pairs to compare is the Cartesian product of the two files, the work required grows with the product of the numbers of records in the files. The evaluation findings show that the combined method is scalable in terms of the number of databases to be linked, the database size, and the time required.