16 research outputs found

    Web-based duplicate records detection with Arabic language enhancement

    Sharing data between organizations has growing importance in many data mining projects. Data from various heterogeneous sources often has to be linked and aggregated in order to improve data quality. The importance of data accuracy and quality has increased with the explosion of data size. The first step towards ensuring data accuracy is to make sure that each real-world object is represented once and only once in a given dataset, a task called Duplicate Record Detection (DRD). These data inaccuracy problems arise from several factors, including spelling, typographical and pronunciation variation, dialects, special vowel and consonant distinctions, and other linguistic characteristics, especially in non-Latin languages such as Arabic. In this paper, an English/Arabic-enabled web-based framework is designed and implemented that treats user interaction, namely adding new rules, enriching the dictionary and evaluating results, as an important step in improving the system's behavior. The proposed framework supports processing of both single-language and bilingual datasets. The framework is implemented and verified empirically in several case studies. The comparison results show that the proposed system offers substantial improvements over known tools.
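    As a rough illustration of the kind of normalisation such a framework performs before matching, the Python sketch below applies a few common Arabic normalisation rules (removing diacritics, unifying alef and teh-marbuta variants) and then compares two records with a standard-library similarity ratio. The specific rules, field names and threshold are illustrative assumptions, not the paper's actual rule set.

```python
# Minimal sketch of Arabic-aware duplicate record detection (illustrative only).
import difflib
import re

ARABIC_DIACRITICS = re.compile(r'[\u064B-\u0652]')  # fathatan .. sukun

def normalise(text: str) -> str:
    """Apply a few illustrative Arabic normalisation rules."""
    text = ARABIC_DIACRITICS.sub('', text)                   # strip diacritics
    text = re.sub('[\u0622\u0623\u0625]', '\u0627', text)    # alef variants -> bare alef
    text = text.replace('\u0629', '\u0647')                  # teh marbuta -> heh
    text = text.replace('\u0649', '\u064A')                  # alef maksura -> yeh
    return text.lower().strip()

def is_duplicate(rec_a: dict, rec_b: dict, threshold: float = 0.85) -> bool:
    """Flag two records as duplicates when their normalised name fields are similar."""
    a = normalise(rec_a['name'])
    b = normalise(rec_b['name'])
    return difflib.SequenceMatcher(None, a, b).ratio() >= threshold

if __name__ == '__main__':
    r1 = {'name': 'محمّد عبد الله'}
    r2 = {'name': 'محمد عبدالله'}
    print(is_duplicate(r1, r2))
```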

    Hybrid ACO and SVM algorithm for pattern classification

    Ant Colony Optimization (ACO) is a metaheuristic algorithm that can be used to solve a variety of combinatorial optimization problems. A new direction for ACO is to optimize continuous and mixed (discrete and continuous) variables. Support Vector Machine (SVM) is a pattern classification approach that originated from statistical approaches. However, SVM suffers from two main problems: feature subset selection and parameter tuning. Most approaches to tuning SVM parameters discretize the continuous parameter values, which negatively affects classification performance. This study presents four algorithms that tune the SVM parameters and select a feature subset, improving SVM classification accuracy with a smaller feature subset. This is achieved by performing the SVM parameter tuning and feature subset selection processes simultaneously. Hybrid algorithms combining ACO and SVM techniques are proposed. The first two algorithms, ACOR-SVM and IACOR-SVM, tune the SVM parameters, while the other two, ACOMV-R-SVM and IACOMV-R-SVM, tune the SVM parameters and select the feature subset simultaneously. Ten benchmark datasets from the University of California, Irvine, were used in the experiments to validate the performance of the proposed algorithms. Experimental results obtained from the proposed algorithms are better than other approaches in terms of classification accuracy and feature subset size. The average classification accuracies for the ACOR-SVM, IACOR-SVM, ACOMV-R-SVM and IACOMV-R-SVM algorithms are 94.73%, 95.86%, 97.37% and 98.1% respectively. The average feature subset size is eight for the ACOR-SVM and IACOR-SVM algorithms and four for the ACOMV-R-SVM and IACOMV-R-SVM algorithms. This study contributes a new direction for ACO that deals with continuous and mixed (discrete and continuous) variables.
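    The following Python sketch conveys the general idea of continuous ACO-style tuning of the SVM parameters C and gamma by cross-validation. It is a simplified stand-in for the ACOR-SVM family, not the thesis algorithms themselves: feature selection is omitted, and the archive size, iteration counts and parameter ranges are arbitrary assumptions. It assumes scikit-learn and NumPy are available.

```python
# Simplified continuous-ACO-style search over (log10 C, log10 gamma) for an SVM.
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X, y = load_iris(return_X_y=True)

def fitness(log_c, log_gamma):
    """5-fold cross-validated accuracy for the given (log10 C, log10 gamma)."""
    clf = SVC(C=10.0 ** log_c, gamma=10.0 ** log_gamma)
    return cross_val_score(clf, X, y, cv=5).mean()

LOW, HIGH = np.array([-2.0, -4.0]), np.array([3.0, 1.0])  # search ranges (log10 scale)

# Solution archive: each row is (log10 C, log10 gamma, fitness).
initial = rng.uniform(LOW, HIGH, size=(10, 2))
archive = np.array([[c, g, fitness(c, g)] for c, g in initial])

for _ in range(20):                                   # iterations
    archive = archive[archive[:, 2].argsort()[::-1]]  # sort best-first
    for _ in range(5):                                # ants per iteration
        k = rng.integers(0, 3)                        # bias sampling towards top solutions
        sigma = archive[:, :2].std(axis=0) + 1e-6     # spread of the archive
        cand = np.clip(rng.normal(archive[k, :2], sigma), LOW, HIGH)
        f = fitness(*cand)
        if f > archive[-1, 2]:                        # keep it if it beats the worst member
            archive[-1] = [cand[0], cand[1], f]

best = archive[archive[:, 2].argmax()]
print(f"best C=10^{best[0]:.2f}, gamma=10^{best[1]:.2f}, CV accuracy={best[2]:.3f}")
```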

    Detection of duplicate academic records obtained from digital repositories

    This undergraduate thesis details the analysis and implementation of a rule-based tool for detecting duplicate academic records. Record deduplication is a key task in the bulk ingestion of documents into a repository, since it filters out duplicate content. It also makes it possible to enrich the metadata of records that already exist in the different sources. In addition, the thesis presents the development of a metadata mapping module that supports the record deduplication process and establishes interoperability between the schemas used by the different sources. Professional advisor: Lic. Ariel Jorge Lira. Facultad de Informática.
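    A minimal sketch of the rule-based matching idea is shown below; the rules (identical DOI, or near-identical title plus matching year) and the field names are illustrative assumptions, not the tool's actual rule system.

```python
# Illustrative rule-based deduplication of academic metadata records.
from difflib import SequenceMatcher

def similar(a: str, b: str) -> float:
    """Normalised string similarity between two field values."""
    return SequenceMatcher(None, a.lower().strip(), b.lower().strip()).ratio()

def rule_based_match(rec_a: dict, rec_b: dict) -> bool:
    """Apply simple rules: same DOI, or near-identical title plus matching year."""
    if rec_a.get('doi') and rec_a.get('doi') == rec_b.get('doi'):
        return True                                   # rule 1: identical DOI
    title_ok = similar(rec_a['title'], rec_b['title']) >= 0.9
    year_ok = rec_a.get('year') == rec_b.get('year')
    return title_ok and year_ok                       # rule 2: title + year

r1 = {'title': 'Deteccion de registros duplicados', 'year': 2018, 'doi': None}
r2 = {'title': 'Detección de registros duplicados', 'year': 2018, 'doi': None}
print(rule_based_match(r1, r2))
```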

    Data quality evaluation through data quality rules and data provenance

    The application and exploitation of large amounts of data play an ever-increasing role in today's research, government, and economy. Data understanding and decision making heavily rely on high-quality data; therefore, in many different contexts, it is important to assess the quality of a dataset in order to determine whether it is suitable for a specific purpose. Moreover, as access to and exchange of datasets have become easier and more frequent, and as scientists increasingly use the World Wide Web to share scientific data, there is a growing need to know the provenance of a dataset (i.e., information about the processes and data sources that led to its creation) in order to evaluate its trustworthiness. In this work, data quality rules and data provenance are used to evaluate the quality of datasets. Concerning the first topic, the applied solution consists of identifying types of data constraints that can be useful as data quality rules and developing a software tool that evaluates a dataset against a set of rules expressed in the XML markup language. We selected some of the data constraints and dependencies already considered in the data quality field, but we also used order dependencies and existence constraints as quality rules. In addition, we developed algorithms to discover the types of dependencies used in the tool. To deal with the provenance of data, the Open Provenance Model (OPM) was adopted, an experimental query language for querying OPM graphs stored in a relational database was implemented, and an approach to designing OPM graphs was proposed.
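    The sketch below illustrates how a small set of quality rules expressed in XML might be applied to tabular records; the <existence> and <order> rule vocabulary and the field names are assumptions made for illustration, not the tool's actual XML schema.

```python
# Illustrative check of tabular records against XML-encoded quality rules.
import xml.etree.ElementTree as ET

RULES_XML = """
<rules>
  <existence column="email"/>
  <order left="start_date" right="end_date"/>
</rules>
"""

def check(rows):
    """Return (row_index, message) pairs for every rule violation."""
    violations = []
    for rule in ET.fromstring(RULES_XML):
        if rule.tag == 'existence':                   # existence constraint
            col = rule.get('column')
            violations += [(i, f'missing {col}')
                           for i, r in enumerate(rows) if not r.get(col)]
        elif rule.tag == 'order':                     # order dependency (ISO dates)
            left, right = rule.get('left'), rule.get('right')
            violations += [(i, f'{left} > {right}')
                           for i, r in enumerate(rows) if r[left] > r[right]]
    return violations

rows = [{'email': 'a@b.org', 'start_date': '2020-01-01', 'end_date': '2020-02-01'},
        {'email': '', 'start_date': '2021-05-01', 'end_date': '2021-01-01'}]
print(check(rows))   # -> [(1, 'missing email'), (1, 'start_date > end_date')]
```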

    Linking historical census data across time

    Historical census data provide a snapshot of the era when our ancestors lived. Such data contain valuable information for the reconstruction of households and the tracking of family changes across time, which can be used for a variety of social science research projects. As valuable as they are, these data provide only snapshots of the main characteristics of the stock of a population. Capturing household changes requires linking person by person and household by household from one census to the next over a series of censuses. Once linked together, the census data are greatly enhanced in value. Development of an automatic or semi-automatic linking procedure will significantly relieve social scientists of the tedious task of manually linking individuals, families, and households, and can improve their productivity. In this thesis, a systematic solution is proposed for linking historical census data that integrates data cleaning and standardisation, as well as record and household linkage over consecutive censuses. This solution consists of several data pre-processing, machine learning, and data mining methods that address different aspects of the historical census data linkage problem. A common property of these methods is that they all adopt a strategy of treating a household as an entity, and use the whole of the household's information to improve the effectiveness of data cleaning and the accuracy of record and household linkage. We first propose an approach for automatic cleaning and linking using domain knowledge. The core idea is to use household information in both the cleaning and linking steps, so that records containing errors and variations can be cleaned and standardised and the number of wrongly linked records can be reduced. Second, we introduce a group linking method into household linkage, which enables tracking of the majority of members in a household over a period of time. The proposed method is based on the outcome of the record linkage step using either a similarity-based method or a machine learning approach. A group linking method is then applied, aiming to reduce the ambiguity of multiple household linkages. Third, we introduce a graph-based method to link households, which takes the structural relationships between household members into consideration. Based on the results of linking individual records, our method builds a graph for each household, so that matches of households in different censuses are determined by both attribute relationships and record similarities. This allows household similarities to be calculated more accurately. Finally, we describe an instance classification method based on multiple instance learning. This allows an integrated solution that links both households and individual records at the same time. Our method treats group links as bags and individual record links as instances. We extend multiple instance learning from bag to instance classification in order to allow the reconstruction of bags from candidate instances. The classified bag and instance samples lead to a significant reduction in multiple group links, thereby improving the overall quality of the linked data.
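    The following sketch conveys the group-linking intuition only: individual record pairs are scored by a simple similarity measure and a household match score aggregates the best pairing for each member. The weights, age tolerance and fields are illustrative assumptions, not the thesis's actual methods.

```python
# Illustrative group linking of two households across consecutive censuses.
from difflib import SequenceMatcher

def record_sim(a: dict, b: dict) -> float:
    """Combine name similarity with a crude age-consistency check."""
    name = SequenceMatcher(None, a['name'].lower(), b['name'].lower()).ratio()
    age_ok = 1.0 if abs(a['age'] - b['age']) <= 12 else 0.0   # ~10-year census gap
    return 0.7 * name + 0.3 * age_ok

def household_score(h1: list, h2: list) -> float:
    """Greedy group linking: average of the best match for each member of h1."""
    best = [max(record_sim(p, q) for q in h2) for p in h1]
    return sum(best) / len(best)

census_1871 = [{'name': 'John Smith', 'age': 34}, {'name': 'Mary Smith', 'age': 31}]
census_1881 = [{'name': 'Jon Smith', 'age': 44}, {'name': 'Mary Smith', 'age': 41}]
print(f"household match score: {household_score(census_1871, census_1881):.2f}")
```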

    Systematically corrupting data to evaluate record linkage techniques

    Record linkage is widely used to integrate data from different sources to extract knowledge for various research purposes. The tasks of record linkage are usually achieved using automated record linkage systems and algorithms. Such systems and algorithms automate the task of record linkage in order to decide whether pairs of records refer to the same entity or not. The accuracy of the outcomes of these automation technologies needs to be evaluated. One common approach to evaluating record linkage accuracy is to use synthetically generated data to obtain the correct status of record relations, i.e. the ground truth. Outputs of the record linkage systems are evaluated by comparing the results to the ground truth. However, synthetic data generators are generally designed to generate data without consideration of data quality issues, i.e. errors and variations. This results in clean synthetic data that does not match real-world data, which usually contains data quality issues. This limitation makes evaluation using such data unrealistic. In this thesis, we present a framework to simulate real-world data errors and variations in testing data. We achieve this through three main objectives. First, we develop a classification of data errors and variations. Then, given our classification, we develop an application that simulates and injects realistic data quality issues based on a corruption profile; we call this application crptr. Finally, we utilise the data corruption application in a record linkage evaluation framework. The framework utilises different tools, such as synthetic data generators as a source of data, record linkage systems and algorithms, and crptr to simulate real-world data quality characteristics. Using crptr and the evaluation framework, we conduct two evaluation experiments, successfully estimating the accuracy of the linkage outcomes of the technologies used. Experiment outcomes show that the accuracy of the automated linkage technologies evaluated decreases as the level of data corruption increases. The evaluation of commonly used string similarity measures, i.e. linkage algorithms, shows that the Jaro-Winkler algorithm delivers the highest accuracy in our experimental scenario. This method of evaluation enables researchers to assess their record linkage strategy based on the characteristics and nature of the real data.
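    The sketch below shows the flavour of profile-driven corruption of clean synthetic records; the profile format and the single keyboard-typo operator are illustrative assumptions and do not reproduce crptr's actual interface.

```python
# Illustrative corruption of clean records according to a corruption profile.
import random

PROFILE = {'typo': 0.3, 'missing': 0.1}      # probability of each corruption per field

KEYBOARD_NEIGHBOURS = {'a': 's', 'e': 'r', 'i': 'o', 'n': 'm', 's': 'a', 't': 'r'}

def corrupt_value(value: str, rng: random.Random) -> str:
    """Randomly blank the value or introduce a keyboard-neighbour typo."""
    if rng.random() < PROFILE['missing']:
        return ''                                        # simulate a missing value
    if rng.random() < PROFILE['typo'] and value:
        pos = rng.randrange(len(value))
        repl = KEYBOARD_NEIGHBOURS.get(value[pos].lower(), value[pos])
        return value[:pos] + repl + value[pos + 1:]      # keyboard-neighbour typo
    return value

def corrupt_record(record: dict, rng: random.Random) -> dict:
    return {k: corrupt_value(v, rng) for k, v in record.items()}

rng = random.Random(42)
clean = {'given': 'Elizabeth', 'surname': 'Johnston', 'city': 'Dundee'}
print(corrupt_record(clean, rng))
```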