7,519 research outputs found

    A Holistic Solution for Duplicate Entity Identification in Deep Web Data Integration

    Get PDF
    Abstract: The proliferation of the deep Web offers users a great opportunity to search for high-quality information on the Web. As a necessary step in deep Web data integration, the goal of duplicate entity identification is to discover duplicate records across the integrated Web databases for further applications (e.g., price-comparison services). However, most existing works address this issue only between two data sources, which is not practical for deep Web data integration systems. That is, a duplicate entity matcher trained over two specific Web databases cannot be applied to other Web databases. In addition, the cost of preparing training sets for n Web databases is on the order of n(n-1)/2 times higher than that for two Web databases, since every pair of databases needs its own training data. In this paper, we propose a holistic solution to address the new challenges posed by the deep Web, whose goal is to build one duplicate entity matcher over multiple Web databases. Extensive experiments on two domains show that the proposed solution is highly effective for deep Web data integration.
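    The quadratic growth in pairwise training cost can be made concrete with a short calculation. The sketch below is only an illustration of the counts involved, not the paper's method: it compares the number of source-pair-specific matchers (and hence training sets) needed for n Web databases with the single matcher a holistic approach would train.

        from math import comb

        def pairwise_matchers(n_sources: int) -> int:
            # Number of source-pair-specific matchers (and training sets)
            # needed when every pair of Web databases gets its own model.
            return comb(n_sources, 2)  # n * (n - 1) / 2

        for n in (2, 5, 10, 20):
            print(f"{n:>2} sources -> {pairwise_matchers(n):>3} pairwise matchers vs. 1 holistic matcher")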

    A Qualitative Literature Review on Linkage Techniques for Data Integration

    Get PDF
    The data linkage techniques "entity linking" and "record linkage" are receiving growing attention as they enable the integration of multiple data sources for data, web, and text mining approaches. This has resulted in the development of numerous algorithms and systems for these techniques in recent years. The goal of this publication is to provide an overview of these data linkage techniques. Most papers deal with record linkage and structured data; processing unstructured data through entity linking is gaining attention with the Big Data trend. Currently, deep learning algorithms are being explored for both linkage techniques. Most publications focus their research on a single process step or on the entire process of entity linking or record linkage. However, the papers share the limitation that the approaches and techniques used have been optimized for only a few data sources.

    Automating data preparation with statistical analysis

    Get PDF
    Data preparation is the process of transforming raw data into a clean and consumable format. It is widely known as the bottleneck to extracting value and insights from data, due to the number of possible tasks in the pipeline and the factors that can largely affect the results, such as human expertise, application scenarios, and solution methodology. Researchers and practitioners have devised a great variety of techniques and tools over the decades, yet many of them still place a significant burden on the human side to configure suitable input rules and parameters. In this thesis, with the goal of reducing human manual effort, we explore using the power of statistical analysis techniques to automate three subtasks in the data preparation pipeline: data enrichment, error detection, and entity matching. Statistical analysis is the process of discovering underlying patterns and trends from data and deducing properties of an underlying probability distribution from a sample, for example, testing hypotheses and deriving estimates. We first discuss CrawlEnrich, which automatically figures out the queries for data enrichment via web API data by estimating the potential benefit of issuing a certain query. Then we study how to derive reusable error detection configuration rules from a web table corpus, so that end-users get results with no effort. Finally, we introduce AutoML-EM, which aims to automate the entity matching model development process. Entity matching is the task of finding records that refer to the same real-world entity. Our work provides powerful angles for automating various data preparation steps, and we conclude this thesis by discussing future directions.
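    To make the entity matching subtask concrete, the following minimal sketch (an illustration only, not the thesis's AutoML-EM system) scores a pair of records with simple string-similarity features and flags the pair as a match when the averaged score exceeds a threshold; the field names and the 0.75 threshold are assumptions chosen for the example.

        from difflib import SequenceMatcher

        def field_sim(a: str, b: str) -> float:
            # Normalized edit-based similarity between two field values.
            return SequenceMatcher(None, a.lower(), b.lower()).ratio()

        def match_score(rec1: dict, rec2: dict, fields=("name", "city")) -> float:
            # Average per-field similarity; a stand-in for a learned matching model.
            return sum(field_sim(rec1[f], rec2[f]) for f in fields) / len(fields)

        r1 = {"name": "Jon A. Smith", "city": "New York"}
        r2 = {"name": "John Smith",   "city": "New York City"}

        score = match_score(r1, r2)
        print(f"score={score:.2f} -> {'match' if score > 0.75 else 'non-match'}")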

    Deep Learning for Learning Representation and Its Application to Natural Language Processing

    Get PDF
    As the web evolves even faster than expected, the exponential growth of data becomes overwhelming. Textual data is being generated at an ever-increasing pace via emails, documents on the web, tweets, online user reviews, blogs, and so on. As the amount of unstructured text data grows, so does the need for intelligently processing and understanding it. The focus of this dissertation is on developing learning models that automatically induce representations of human language to solve higher-level language tasks. In contrast to most conventional learning techniques, which employ shallow-structured learning architectures, deep learning is a newly developed machine learning technique that uses supervised and/or unsupervised strategies to automatically learn hierarchical representations in deep architectures, and it has been employed in varied tasks such as classification or regression. Deep learning was inspired by biological observations on human brain mechanisms for processing natural signals and has attracted tremendous attention from both academia and industry in recent years due to its state-of-the-art performance in many research domains such as computer vision, speech recognition, and natural language processing. This dissertation focuses on how to represent unstructured text data and how to model it with deep learning models in different natural language processing applications such as sequence tagging, sentiment analysis, and semantic similarity. Specifically, my dissertation addresses the following research topics: In Chapter 3, we examine one of the fundamental problems in NLP, text classification, by leveraging contextual information [MLX18a]; in Chapter 4, we propose a unified framework for generating an informative map from a review corpus [MLX18b]; Chapter 5 discusses tagging address queries in map search [Mok18] (this research was performed in collaboration with Microsoft); and in Chapter 6, we discuss ongoing research on the neural language sentence matching problem, which we are working to extend to a recommendation system.

    Clustering Approaches for Multi-source Entity Resolution

    Get PDF
    Entity Resolution (ER), or deduplication, aims at identifying entities, such as specific customer or product descriptions, in one or several data sources that refer to the same real-world entity. ER is of key importance for improving data quality and has a crucial role in data integration and querying. The previous generation of ER approaches focused on integrating records from two relational databases or performing deduplication within a single database. Nevertheless, in the era of Big Data the number of available data sources is increasing rapidly, so large-scale data mining or querying systems need to integrate data obtained from numerous sources. For example, in online digital libraries or e-shops, publications or products are incorporated from a large number of archives or suppliers across the world, or within a specified region or country, to provide a unified view for the user. This process requires data consolidation from numerous heterogeneous data sources, which are mostly evolving. As the number of sources grows, data heterogeneity and velocity, as well as the variance in data quality, increase. Therefore, multi-source ER, i.e. finding matching entities in an arbitrary number of sources, is a challenging task. Previous efforts for matching and clustering entities between multiple sources (> 2) mostly treated all sources as a single source. This approach precludes utilizing metadata or provenance information for enhancing integration quality and leads to poor results because it ignores the discrepancies in quality between sources. The conventional ER pipeline consists of blocking, pair-wise matching of entities, and classification. In order to meet the new needs and requirements, holistic clustering approaches that are capable of scaling to many data sources are needed. Holistic clustering-based ER should further overcome the restriction of pairwise linking of entities by making the process capable of grouping entities from multiple sources into clusters. The clustering step aims at removing false links while adding missing true links across sources. Additionally, incremental clustering and repairing approaches need to be developed to cope with the ever-increasing number of sources and new incoming entities. To this end, we developed novel clustering and repairing schemes for multi-source entity resolution. The approaches are capable of grouping entities from multiple clean (duplicate-free) sources, as well as handling data from an arbitrary combination of clean and dirty sources. The multi-source clustering schemes developed specifically for multi-source ER obtain superior results compared to general-purpose clustering algorithms. Additionally, we developed incremental clustering and repairing methods in order to handle evolving sources. The proposed incremental approaches are capable of incorporating new sources as well as new entities from existing sources. The more sophisticated approach is able to repair previously determined clusters and consequently yields improved quality and a reduced dependency on the insert order of the new entities. To ensure scalability, parallel variants of all approaches are implemented on top of the Apache Flink framework, a distributed processing engine. The proposed methods have been integrated in a new end-to-end ER tool named FAMER (FAst Multi-source Entity Resolution system). The FAMER framework comprises Linking and Clustering components encompassing both batch and incremental ER functionalities. The output of the Linking component is recorded as a similarity graph, where each vertex represents an entity and each edge maintains the similarity relationship between two entities. Such a similarity graph is the input of the Clustering component. The comprehensive comparative evaluations overall show that the proposed clustering and repairing approaches for both batch and incremental ER achieve high quality while maintaining scalability.
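    The similarity-graph representation described above can be illustrated with a small sketch. The example below is a simplified illustration using plain connected-component grouping as a clustering baseline, not FAMER's clustering or repairing algorithms; the entity identifiers, similarity scores, and the 0.8 edge threshold are assumptions made for the example. Entities without any strong link would remain singleton clusters in a full pipeline.

        from collections import defaultdict

        # Pairwise similarity scores produced by a linking step (illustrative values).
        similarities = {
            ("src1/e1", "src2/e7"): 0.92,
            ("src2/e7", "src3/e4"): 0.88,
            ("src1/e2", "src3/e9"): 0.35,   # below threshold -> no edge kept
            ("src2/e5", "src3/e9"): 0.81,
        }
        THRESHOLD = 0.8  # assumed cut-off for keeping an edge

        # Build the similarity graph: vertices are entities, edges are strong links.
        graph = defaultdict(set)
        for (a, b), score in similarities.items():
            if score >= THRESHOLD:
                graph[a].add(b)
                graph[b].add(a)

        # Cluster via connected components (a simple baseline, not FAMER's schemes).
        def connected_components(g):
            seen, clusters = set(), []
            for start in g:
                if start in seen:
                    continue
                stack, comp = [start], set()
                while stack:
                    v = stack.pop()
                    if v in comp:
                        continue
                    comp.add(v)
                    stack.extend(g[v] - comp)
                seen |= comp
                clusters.append(comp)
            return clusters

        print(connected_components(dict(graph)))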

    SILE: A Method for the Efficient Management of Smart Genomic Information

    Full text link
    In the last two decades, the data generated by next generation sequencing technologies have revolutionized our understanding of human biology. Furthermore, they have allowed us to develop and improve our knowledge about how changes (variants) in the DNA can be related to the risk of developing certain diseases. Currently, a large amount of genomic data is publicly available and frequently used by the research community in order to extract meaningful and reliable associations among risk genes and the mechanisms of disease. However, the management of this exponentially growing volume of data has become a challenge, and researchers are forced to delve into a lake of complex data spread over a thousand heterogeneous repositories, represented in multiple formats and with different levels of quality. Nevertheless, when these data are used to solve a concrete problem, only a small part of them is really significant. This is what we call "smart" data. The main goal of this thesis is to provide a systematic approach to efficiently manage smart genomic data by using conceptual modeling techniques and the principles of data quality assessment. The aim of this approach is to populate an Information System with data that are accessible, informative, and actionable enough to extract valuable knowledge. This thesis was supported by the Research and Development Aid Program (PAID-01-16) under FPI grant 2137. León Palacio, A. (2019). SILE: A Method for the Efficient Management of Smart Genomic Information [Tesis doctoral]. Universitat Politècnica de València. https://doi.org/10.4995/Thesis/10251/131698

    Aplicación de la Inteligencia Competitiva y el Benchmarking de nuevas teorías para el desarrollo de un Plan Estratégico y Sostenible para la Industria Naval

    Get PDF
    Since their beginnings, companies have established procedures to observe their competitors. Methods for obtaining this kind of information have evolved with the internet era; a plethora of tools is nowadays available for this job. As a consequence, a new problem has emerged: documentary noise, which keeps companies from being able to process and benefit from the huge amount of information gathered. Strategic planning relies mainly on obtaining environmental knowledge, so companies need help in dealing with this documentary noise; technological surveillance and benchmarking are the preferred methodologies for achieving this objective, coping with data produced by automatic internet tools such as search engines. Results of better quality are produced by bringing new theories on information gathering and processing into both tools. This article presents empirical results on the application of a demonstrative technological surveillance system based on different R&D management structures, relying on benchmarking indicators for the naval and aeronautics industries.