16 research outputs found

    End-to-End Entity Resolution for Big Data: A Survey

    One of the most important tasks for improving data quality and the reliability of data analytics results is Entity Resolution (ER). ER aims to identify the different descriptions that refer to the same real-world entity, and it remains a challenging problem. While previous works have studied specific aspects of ER (and mostly in traditional settings), in this survey we provide, for the first time, an end-to-end view of modern ER workflows and of the novel aspects of entity indexing and matching methods that cope with more than one of the Big Data characteristics simultaneously. We present the basic concepts, processing steps and execution strategies that have been proposed by different communities (i.e., database, Semantic Web and machine learning) to cope with the loose structuredness, extreme diversity, high speed and large scale of the entity descriptions used by real-world applications. Finally, we provide a synthesizing discussion of the existing approaches and conclude with a detailed presentation of open research directions.
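
    To make the workflow structure above concrete, the following is a minimal, illustrative sketch of an end-to-end ER pipeline (indexing/blocking, matching, clustering) in Python. It is not taken from the survey; the blocking key, the similarity function and the 0.85 threshold are arbitrary assumptions.

        # Minimal end-to-end ER sketch: blocking -> matching -> clustering.
        # Illustrative only; key, similarity measure and threshold are assumptions.
        from collections import defaultdict
        from difflib import SequenceMatcher
        from itertools import combinations

        def blocking_key(record):
            # Hypothetical cheap key: first three characters of the lower-cased name.
            return record["name"].lower()[:3]

        def similarity(a, b):
            return SequenceMatcher(None, a["name"].lower(), b["name"].lower()).ratio()

        def resolve(records, threshold=0.85):
            # 1. Indexing/blocking: group descriptions to prune the comparison space.
            blocks = defaultdict(list)
            for r in records:
                blocks[blocking_key(r)].append(r)
            # 2. Matching: compare only descriptions that share a block.
            matches = []
            for block in blocks.values():
                for a, b in combinations(block, 2):
                    if similarity(a, b) >= threshold:
                        matches.append((a["id"], b["id"]))
            # 3. Clustering: connected components over the match graph (union-find).
            parent = {r["id"]: r["id"] for r in records}
            def find(x):
                while parent[x] != x:
                    parent[x] = parent[parent[x]]
                    x = parent[x]
                return x
            for a, b in matches:
                parent[find(a)] = find(b)
            clusters = defaultdict(set)
            for r in records:
                clusters[find(r["id"])].add(r["id"])
            return list(clusters.values())

        records = [{"id": 1, "name": "Apple Inc."},
                   {"id": 2, "name": "Apple Inc"},
                   {"id": 3, "name": "Microsoft Corp."}]
        print(resolve(records))  # -> [{1, 2}, {3}]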

    Self-configured Entity Resolution with pyJedAI

    Entity Resolution has been an active research topic for the last three decades, with numerous algorithms proposed in the literature. However, putting them into practice is often a complex task that requires implementing, combining and configuring complementary individual algorithms into comprehensive end-to-end workflows. To facilitate this process, we are developing pyJedAI, a novel system that provides a unifying framework for the main types of works in the field (i.e., both unsupervised and learning-based ones). Our vision is to enable both novice and expert users to use and combine these algorithms through a series of principled approaches for automatically configuring and benchmarking end-to-end pipelines.
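
    As an illustration of what automatic configuration of an end-to-end workflow can look like, the sketch below grid-searches a few blocking/matching settings against labelled matches and keeps the best-scoring configuration. It deliberately does not use pyJedAI's actual API; all function names, parameter grids and the evaluation setup are assumptions.

        # Illustrative self-configuration sketch (not pyJedAI's API): benchmark a few
        # candidate blocking/matching settings and keep the best-scoring workflow.
        from difflib import SequenceMatcher
        from itertools import combinations, product

        def run_pipeline(records, key_len, threshold):
            # Toy end-to-end workflow: prefix blocking plus string-similarity matching.
            blocks = {}
            for r in records:
                blocks.setdefault(r["name"].lower()[:key_len], []).append(r)
            matches = set()
            for block in blocks.values():
                for a, b in combinations(block, 2):
                    sim = SequenceMatcher(None, a["name"].lower(), b["name"].lower()).ratio()
                    if sim >= threshold:
                        matches.add(frozenset((a["id"], b["id"])))
            return matches

        def f1(predicted, gold):
            tp = len(predicted & gold)
            if tp == 0:
                return 0.0
            precision, recall = tp / len(predicted), tp / len(gold)
            return 2 * precision * recall / (precision + recall)

        def auto_configure(records, gold_matches):
            # Benchmark candidate configurations on labelled data; return (f1, key_len, threshold).
            best = (0.0, None, None)
            for key_len, threshold in product((2, 3, 4), (0.7, 0.8, 0.9)):
                score = f1(run_pipeline(records, key_len, threshold), gold_matches)
                if score > best[0]:
                    best = (score, key_len, threshold)
            return best

        # Usage: auto_configure(list_of_records, set_of_gold_frozenset_id_pairs)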

    JedAI-spatial: a system for 3-dimensional Geospatial Interlinking

    Geospatial data constitutes a considerable part of Semantic Web data, but so far, its sources lack enough links in the Linked Open Data cloud. Geospatial Interlinking aims to cover this gap by associating geometries through established topological relations, such as those of the Dimensionally Extended 9-Intersection Model (DE-9IM). Various algorithms have already been proposed in the literature for this task. In the context of this master thesis, we develop JedAI-spatial, a novel, open-source system that organizes the main existing algorithms according to three dimensions: i. Space Tiling distinguishes interlinking algorithms into grid-, tree- and partition-based ones, according to their approach for reducing the search space and, thus, the computational cost of this inherently quadratic task. The first category includes Semantic Web techniques that define a static or dynamic EquiGrid and verify pairs of geometries whose minimum bounding rectangles share at least one cell; the second encompasses established main-memory spatial join techniques from the database community; and the third includes variations of the cornerstone algorithm of computational geometry, i.e., plane sweep. ii. Budget awareness distinguishes interlinking algorithms into budget-agnostic and budget-aware ones. The former constitute batch techniques that produce results only after processing the entire input data, while the latter operate in a pay-as-you-go manner that produces results progressively; their goal is to verify topologically related geometries before non-related ones. iii. Execution mode distinguishes interlinking algorithms into serial ones, which run on a single CPU core, and parallel ones, which leverage massive parallelization on top of Apache Spark. Extensive experiments were performed along these three dimensions, with the outcomes providing interesting insights into the relative performance of the considered algorithms.
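
    A minimal sketch of the grid-based (space-tiling) idea described above: candidate pairs are filtered through an EquiGrid built over minimum bounding rectangles and then verified with the DE-9IM matrix via the Shapely library. The geometries and the cell size are arbitrary assumptions, and this is not JedAI-spatial's implementation.

        # Grid-based filtering + DE-9IM verification sketch (illustrative only).
        from collections import defaultdict
        from shapely import wkt

        source = [wkt.loads("POLYGON((0 0, 2 0, 2 2, 0 2, 0 0))")]
        target = [wkt.loads("POLYGON((1 1, 3 1, 3 3, 1 3, 1 1))"),
                  wkt.loads("POLYGON((10 10, 11 10, 11 11, 10 11, 10 10))")]

        CELL = 1.0  # equi-grid cell size (assumption)

        def cells(geom):
            # Grid cells covered by the geometry's minimum bounding rectangle.
            minx, miny, maxx, maxy = geom.bounds
            for cx in range(int(minx // CELL), int(maxx // CELL) + 1):
                for cy in range(int(miny // CELL), int(maxy // CELL) + 1):
                    yield (cx, cy)

        # Filtering: index source MBRs on the grid, collect candidate pairs per common cell.
        index = defaultdict(list)
        for i, g in enumerate(source):
            for c in cells(g):
                index[c].append(i)
        candidates = {(i, j) for j, g in enumerate(target) for c in cells(g) for i in index[c]}

        # Verification: compute the DE-9IM matrix only for the candidate pairs.
        for i, j in sorted(candidates):
            matrix = source[i].relate(target[j])  # DE-9IM intersection matrix
            if source[i].intersects(target[j]):
                print(f"source[{i}] ~ target[{j}]: DE-9IM = {matrix}")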

    Entity Resolution On-Demand

    Entity Resolution (ER) aims to identify and merge records that refer to the same real-world entity. ER is typically employed as an expensive cleaning step on the entire data before consuming it. Yet, which entities are actually useful once cleaned depends solely on the user's application, which may need only a fraction of them. For instance, when dealing with Web data, we would like to be able to filter the entities of interest gathered from multiple sources without cleaning the entire, continuously growing data. Similarly, when querying data lakes, we want to transform data on-demand and return the results in a timely manner, a fundamental requirement of ELT (Extract-Load-Transform) pipelines. We propose BrewER, a framework to evaluate SQL SP (select-project) queries on dirty data while progressively returning results as if the query had been issued on cleaned data. BrewER tries to focus the cleaning effort on one entity at a time, following an ORDER BY predicate, and thus inherently supports top-k and stop-and-resume execution. For a wide range of applications, a significant amount of resources can be saved. We exhaustively evaluate BrewER and show its efficacy on four real-world datasets.
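
    A highly simplified sketch of the on-demand, progressive idea (not BrewER's actual algorithm): candidate entities are seeded into a priority queue by the best possible ORDER BY value, then cleaned and emitted one at a time. The schema, blocking key and merge function below are assumptions; the emission order is only correct here because the toy merge function also takes the minimum price.

        # Progressive query answering over dirty data (toy sketch).
        # Assumed query: SELECT * WHERE price < 1000 ORDER BY price ASC.
        import heapq
        from statistics import mode

        dirty = [
            {"key": "iphone13",   "name": "iPhone 13",       "price": 799},
            {"key": "iphone13",   "name": "Apple iPhone 13", "price": 779},
            {"key": "galaxys22",  "name": "Galaxy S22",      "price": 749},
            {"key": "macbookpro", "name": "MacBook Pro",     "price": 1999},
        ]

        def merge(records):
            # Toy conflict resolution: most frequent name, minimum price.
            return {"name": mode(r["name"] for r in records),
                    "price": min(r["price"] for r in records)}

        def query_on_demand(records, max_price=1000):
            groups = {}
            for r in records:
                groups.setdefault(r["key"], []).append(r)
            # Seed each candidate entity with its best possible ORDER BY value
            # (min price); since merge() also takes the minimum, this order is safe.
            heap = [(min(r["price"] for r in rs), key) for key, rs in groups.items()]
            heapq.heapify(heap)
            while heap:
                _, key = heapq.heappop(heap)
                entity = merge(groups[key])       # clean only this entity, on demand
                if entity["price"] < max_price:   # WHERE predicate on the cleaned entity
                    yield entity                  # emitted progressively, already ordered

        for e in query_on_demand(dirty):
            print(e)   # stop early for top-k without cleaning the remaining entities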

    On-Demand Data Integration

    Companies and organizations depend heavily on their data to make informed business decisions. Therefore, guaranteeing high data quality is critical to ensure the reliability of data analysis. Data integration, which aims to combine data acquired from several heterogeneous sources to provide users with a unified, consistent view, plays a fundamental role in enhancing the value of the data at hand. In the past, when data integration involved a limited number of sources, ETL (extract, transform, load) was the dominant paradigm: once collected, raw data is cleaned and then stored in a data warehouse to be analyzed. Nowadays, big data integration needs to deal with millions of sources; thus, the paradigm is increasingly moving towards ELT (extract, load, transform): a huge amount of raw data is collected and stored as-is (e.g., in a data lake), and different users can then transform portions of it according to the task at hand. Hence, novel approaches to data integration need to be explored to address the challenges raised by this paradigm. One of the fundamental building blocks for data integration is entity resolution (ER), which aims at detecting profiles that describe the same real-world entity in order to consolidate them into a single consistent representation. ER is typically employed as an expensive offline cleaning step on the entire data before consuming it. Yet, which entities are actually useful once cleaned depends solely on the user's application, which may need only a fraction of them. For instance, when dealing with Web data, we would like to be able to filter the entities of interest gathered from multiple sources without cleaning the entire, continuously growing data. Similarly, when querying data lakes, we want to transform data on-demand and return results in a timely manner. Hence, we propose BrewER, a framework to evaluate SQL SP queries on dirty data while progressively returning results as if the query had been issued on cleaned data. BrewER tries to focus the cleaning effort on one entity at a time, according to the priority defined by the user through the ORDER BY clause. For a wide range of applications (e.g., data exploration), a significant amount of resources can therefore be saved. Furthermore, duplicates exist not only at the profile level, as in the case of ER, but also at the dataset level. In the ELT scenario, it is common for data scientists to retrieve datasets from the enterprise's data lake, perform transformations for their analysis, and then store the new datasets back into the data lake. Similarly, in Web contexts such as Wikipedia, a table can be duplicated at a given time, with the different copies evolving independently, possibly leading to inconsistencies. Automatically detecting duplicate tables would make it possible to keep them consistent through data cleaning or change propagation, and also to eliminate redundancy to free up storage space or to save additional work for the editors. While dataset discovery research has developed efficient tools to retrieve unionable or joinable tables, the problem of detecting duplicate tables has been mostly overlooked in the existing literature. To fill this gap, we therefore present Sloth, a framework to efficiently determine the largest overlap (i.e., the largest common subtable) between two tables. The detection of the largest overlap makes it possible to quantify the similarity between the two tables and to spot their inconsistencies. BrewER and Sloth represent novel solutions for big data integration in the ELT scenario, fostering on-demand use of available resources and shifting this fundamental task towards a task-driven paradigm.
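
    To make the notion of largest overlap concrete, here is a brute-force sketch that enumerates column mappings between two tiny tables and keeps the common subtable with the largest area. It is exponential and purely illustrative; it is not Sloth's algorithm, and the example tables are invented.

        # Brute-force "largest overlap" (largest common subtable) sketch for tiny tables.
        from collections import Counter
        from itertools import combinations, permutations

        def largest_overlap(table_a, table_b):
            # Tables are lists of equal-length rows; returns (area, column mapping, shared rows).
            n_a, n_b = len(table_a[0]), len(table_b[0])
            best = (0, (), ())
            for width in range(1, min(n_a, n_b) + 1):
                for cols_a in combinations(range(n_a), width):
                    for cols_b in permutations(range(n_b), width):
                        proj_a = Counter(tuple(row[c] for c in cols_a) for row in table_a)
                        proj_b = Counter(tuple(row[c] for c in cols_b) for row in table_b)
                        shared = proj_a & proj_b   # multiset intersection of projected rows
                        area = width * sum(shared.values())
                        if area > best[0]:
                            best = (area, tuple(zip(cols_a, cols_b)), tuple(shared.elements()))
            return best

        t1 = [["Rome", "Italy", 2.8], ["Paris", "France", 2.1], ["Berlin", "Germany", 3.6]]
        t2 = [["Italy", "Rome"], ["France", "Paris"], ["Spain", "Madrid"]]
        area, mapping, rows = largest_overlap(t1, t2)
        print(area, mapping, rows)
        # -> 4 ((0, 1), (1, 0)) (('Rome', 'Italy'), ('Paris', 'France'))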

    zbMATH Open: API Solutions and Research Challenges

    We present zbMATH Open, the most comprehensive collection of reviews and bibliographic metadata of scholarly literature in mathematics. Besides our website https://zbMATH.org, which has been openly accessible since the beginning of this year, we provide API endpoints to offer our data. The API improves interoperability with other services, such as digital libraries, and allows our data to be used for research purposes. In this article, we (1) give an overview of the current and planned services offered by zbMATH; (2) present the initial version of the zbMATH links API; (3) analyze the potential and limitations of the links API based on the example of the NIST Digital Library of Mathematical Functions; and (4) finally, present the zbMATH Open dataset as a research resource and discuss connected open research problems.
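
    A hedged sketch of how a client might consume a REST endpoint such as the zbMATH links API, using the Python requests library. The URL, query parameter and document identifier below are placeholders (assumptions), not the documented interface; consult https://zbMATH.org for the actual endpoints.

        # Generic REST client sketch; endpoint details are placeholders, not zbMATH's spec.
        import requests

        API_URL = "https://example.org/zbmath-links-api"   # placeholder endpoint (assumption)

        def fetch_links(document_id: str):
            # Retrieve link metadata for one document and return the parsed JSON payload.
            response = requests.get(API_URL, params={"document": document_id}, timeout=30)
            response.raise_for_status()
            return response.json()

        if __name__ == "__main__":
            print(fetch_links("1337.11001"))   # hypothetical document identifier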

    Learning, deducing and linking entities

    Improving the quality of data is a critical issue in data management and machine learning, and finding the most representative and concise way to achieve this is a key challenge. Learning how to represent entities accurately is essential for various tasks in data science, such as generating better recommendations and more accurate question answering. Thus, the amount and quality of information available on an entity can greatly impact the quality of the results of downstream tasks. This thesis focuses on two specific areas to improve data quality: (i) learning and deducing entities for data currency (i.e., how up-to-date information is), and (ii) linking entities across different data sources. The first technical contribution is GATE (Get the lATEst), a framework that combines deep learning and rule-based methods to find the up-to-date information of an entity. GATE learns and deduces temporal orders on the attribute values in a set of tuples that pertain to the same entity. It is based on a creator-critic framework: the creator trains a neural ranking model to learn temporal orders and rank attribute values based on correlations among the attributes; the critic then validates the learned temporal orders and deduces more ranked pairs by chasing the data with currency constraints, and provides augmented training data as feedback for the creator to improve the ranking in the next round. The process proceeds until the obtained temporal order becomes stable. The second technical contribution is HER (Heterogeneous Entity Resolution), a framework that consists of a set of methods to link entities across relations and graphs. We propose a new notion, parametric simulation, to link entities across a relational database D and a graph G. Taking functions and thresholds for measuring vertex closeness, path associations and important properties as parameters, parametric simulation identifies tuples t in D and vertices v in G that refer to the same real-world entity, based on topological and semantic matching. We develop machine learning methods to learn the parameter functions and thresholds. Rather than concentrating on rule-based methods and machine learning algorithms separately to enhance data quality, this thesis combines both approaches to address the challenges of data currency and entity linking: rule-based methods are combined with state-of-the-art machine learning methods to represent entities, and these representations are then used for further tasks. These enhanced models, which combine machine learning with logic rules, help us represent entities more effectively, (i) to find the most up-to-date attribute values and (ii) to link them across relations and graphs.
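
    The following is a toy skeleton of the creator-critic loop described for GATE, purely to illustrate the control flow: a trivial scorer stands in for the neural ranking model, and the single currency constraint is an invented example. None of this is the thesis' actual implementation.

        # Toy creator-critic loop (illustrative only; not GATE's implementation).
        def creator_rank(values, scores):
            # Creator: order candidate attribute values by their learned "recency" score.
            return sorted(values, key=lambda v: scores.get(v, 0.0), reverse=True)

        def critic_deduce(ranked, known_orders):
            # Critic: chase with a currency constraint, e.g. "married is more current
            # than single", and deduce extra ordered pairs as augmented training data.
            deduced = set()
            if "married" in ranked and "single" in ranked:
                deduced.add(("married", "single"))
            return deduced - known_orders

        def gate_loop(values, scores, known_orders, rounds=5):
            for _ in range(rounds):
                ranked = creator_rank(values, scores)
                feedback = critic_deduce(ranked, known_orders)
                if not feedback:                      # temporal order became stable
                    return ranked
                known_orders |= feedback
                for newer, older in feedback:         # feed validated pairs back to the creator
                    scores[newer] = scores.get(newer, 0.0) + 1.0
                    scores[older] = scores.get(older, 0.0) - 1.0
            return creator_rank(values, scores)

        print(gate_loop(["single", "married"], {"single": 0.5, "married": 0.2}, set()))
        # -> ['married', 'single']  (the deduced constraint overrides the initial scores)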