2,083 research outputs found

MinoanER: Schema-Agnostic, Non-Iterative, Massively Parallel Resolution of Web Entities

Author: Christophides Vassilis
Efthymiou Vasilis
Papadakis George
Stefanidis Kostas
Publication date: 15/05/2019

Entity Resolution (ER) aims to identify different descriptions in various Knowledge Bases (KBs) that refer to the same entity. ER is challenged by the Variety, Volume and Veracity of entity descriptions published in the Web of Data. To address them, we propose the MinoanER framework that simultaneously fulfills full automation, support of highly heterogeneous entities, and massive parallelization of the ER process. MinoanER leverages a token-based similarity of entities to define a new metric that derives the similarity of neighboring entities from the most important relations, as they are indicated only by statistics. A composite blocking method is employed to capture different sources of matching evidence from the content, neighbors, or names of entities. The search space of candidate pairs for comparison is compactly abstracted by a novel disjunctive blocking graph and processed by a non-iterative, massively parallel matching algorithm that consists of four generic, schema-agnostic matching rules that are quite robust with respect to their internal configuration. We demonstrate that the effectiveness of MinoanER is comparable to existing ER tools over real KBs exhibiting low Variety, but it outperforms them significantly when matching KBs with high Variety.

Comment: Presented at EDBT 2019

arXiv.org e-Print Archive

TamPub Julkaisuarkisto - TamPub Institutional Repository

Trepo - Institutional Repository of Tampere University

End-to-End Entity Resolution for Big Data: A Survey

Author: Christophides Vassilis
Efthymiou Vasilis
Palpanas Themis
Papadakis George
Stefanidis Kostas
Publication date: 01/02/1988

One of the most important tasks for improving data quality and the reliability of data analytics results is Entity Resolution (ER). ER aims to identify different descriptions that refer to the same real-world entity, and remains a challenging problem. While previous works have studied specific aspects of ER (and mostly in traditional settings), in this survey, we provide for the first time an end-to-end view of modern ER workflows, and of the novel aspects of entity indexing and matching methods in order to cope with more than one of the Big Data characteristics simultaneously. We present the basic concepts, processing steps and execution strategies that have been proposed by different communities, i.e., database, semantic Web and machine learning, in order to cope with the loose structuredness, extreme diversity, high speed and large scale of entity descriptions used by real-world applications. Finally, we provide a synthetic discussion of the existing approaches, and conclude with a detailed presentation of open research directions.

arXiv.org e-Print Archive

University of Richmond

BigDedup: a Big Data Integration toolkit for Duplicate Detection in Industrial Scenarios

Author: Bergamaschi Sonia
Gagliardelli Luca
Simonini Giovanni
Zhu Song
Publication venue: 'IOS Press'
Publication date: 01/01/2018

Duplicate detection aims to identify different records in data sources that refer to the same real-world entity. It is a fundamental task for item catalog fusion, customer database integration, fraud detection, and more. In this work we present BigDedup, a toolkit able to detect duplicate records on Big Data sources in an efficient manner. BigDedup makes available the state-of-the-art duplicate detection techniques on Apache Spark, a modern framework for distributed computing in Big Data scenarios. It can be used in two different ways: (i) through a simple graphical interface that permits the user to process structured and unstructured data in a fast and effective way; (ii) as a library that provides different components that can be easily extended and customized. In the paper we show how to use BigDedup and its usefulness through some industrial examples.

Archivio istituzionale della ricerca - Università di Modena e Reggio Emilia

Generalized Supervised Meta-blocking

Author: George Papadakis
Giovanni Simonini
Luca Gagliardelli
Sonia Bergamaschi
Themis Palpanas
Publication venue: 'VLDB Endowment'
Publication date: 01/01/2022

Entity Resolution is a core data integration task that relies on Blocking to scale to large datasets. Schema-agnostic blocking achieves very high recall, requires no domain knowledge and applies to data of any structuredness and schema heterogeneity. This comes at the cost of many irrelevant candidate pairs (i.e., comparisons), which can be significantly reduced by Meta-blocking techniques that leverage the entity co-occurrence patterns inside blocks: first, pairs of candidate entities are weighted in proportion to their matching likelihood, and then, pruning discards the pairs with the lowest scores. Supervised Meta-blocking goes beyond this approach by combining multiple scores per comparison into a feature vector that is fed to a binary classifier. By using probabilistic classifiers, Generalized Supervised Meta-blocking associates every pair of candidates with a score that can be used by any pruning algorithm. For higher effectiveness, new weighting schemes are examined as features. Through extensive experiments, we identify the best pruning algorithms, their optimal sets of features, as well as the minimum possible size of the training set.

Archivio istituzionale della ricerca - Università di Modena e Reggio Emilia
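
The weight-then-prune idea described in the Generalized Supervised Meta-blocking abstract above can be illustrated with a minimal sketch. It is not the authors' method: it assumes a plain token-blocking step, a single unsupervised weight (the number of shared blocks) instead of a classifier score, and mean-weight pruning. All function names and the sample data are hypothetical.

```python
from collections import defaultdict
from itertools import combinations


def token_blocking(entities):
    """Schema-agnostic blocking: one block per token, attribute names ignored."""
    blocks = defaultdict(set)
    for entity_id, description in entities.items():
        for value in description.values():
            for token in str(value).lower().split():
                blocks[token].add(entity_id)
    # A block holding a single entity produces no comparisons, so drop it.
    return {token: ids for token, ids in blocks.items() if len(ids) > 1}


def weight_pairs(blocks):
    """Weight each candidate pair by the number of blocks it co-occurs in.

    Assumption: this simple count stands in for the learned score
    that the abstract attributes to a probabilistic classifier.
    """
    weights = defaultdict(int)
    for ids in blocks.values():
        for a, b in combinations(sorted(ids), 2):
            weights[(a, b)] += 1
    return dict(weights)


def prune(weights):
    """Discard pairs whose weight falls below the mean weight (assumed rule)."""
    if not weights:
        return {}
    mean = sum(weights.values()) / len(weights)
    return {pair: w for pair, w in weights.items() if w >= mean}


# Hypothetical descriptions with different schemas for the same entity.
entities = {
    "e1": {"name": "Heraklion Museum", "city": "Heraklion"},
    "e2": {"label": "museum of heraklion", "located": "Crete"},
    "e3": {"title": "Crete Airport"},
}

candidates = prune(weight_pairs(token_blocking(entities)))
print(candidates)
```

In this toy input, `e1` and `e2` share two blocks while `e2` and `e3` share one, so only the first pair survives pruning. A supervised variant would replace the count with a classifier probability computed from several such weights.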

OpenBIM-Tango integrated virtual showroom for offsite manufactured production of self-build housing

Author: Farhad Chamo
Farzad Pour Rahimian
Lilia Potseluyko Amobi
Stephen Oliver
Veselina Chavdarova
Publication venue: 'Elsevier BV'
Publication date: 01/06/2019

As a result of progressive use of BIM in the AEC sector, the amount of diverse project information is increasing rapidly, thus necessitating interoperability of tools, compatibility of data, effective collaboration and sophisticated data management. Media-rich VR and AR environments have been proven to help users better understand design solutions; however, they have not advanced far in supporting interoperability and collaboration. Relying on the capabilities of openBIM and the IFC schema, this study posits that this shortcoming of VR and AR environments could be addressed by using the BIM server concept, which allows for concurrent multiuser and low-latency communication between applications. Successful implementation of this concept can ultimately mitigate the need for advanced technical skills for participation in design processes and facilitate the generation of more useful design solutions by early involvement of stakeholders and end-users in decision making. This paper exemplifies a method for integration of BIM data into immersive VR and AR environments, in order to streamline the design process and provide a pared-down agnostic openBIM system with low latency and synchronised concurrent user accessibility that gives the “right information to the right people at the right time”. These concepts have been further demonstrated through development of a prototype for an openBIM-Tango integrated virtual showroom for offsite manufactured production of self-build housing. The prototype directly includes BIM models and data from the IFC format and interactively presents them to users in both immersive VR and AR environments, including Google Tango enabled devices. This paper contributes by offering innovative and practical solutions for integration of openBIM and VR/AR interfaces, which can address interoperability issues of the AEC industry.

Northumbria Research Link

Crossref

University of Strathclyde Institutional Repository

Teesside University's Research Repository

Format-independent media resource adaptation and delivery

Author: Van Deursen Davy
Publication venue: Ghent University. Faculty of Engineering
Publication date: 01/01/2009

Ghent University Academic Bibliography

Incremental Entity Blocking over Heterogeneous Streaming Data

Author: Araújo Tiago Brasileiro
da Nóbrega Thiago Pereira
Nummenmaa Jyrki
Pires Carlos Eduardo Santos
Stefanidis Kostas
Publication venue: 'MDPI AG'
Publication date: 01/12/2022

Web systems have become a valuable source of semi-structured and streaming data. In this sense, Entity Resolution (ER) has become a key solution for integrating multiple data sources or identifying similarities between data items, namely entities. To avoid the quadratic costs of the ER task and improve efficiency, blocking techniques are usually applied. Beyond the traditional challenges faced by ER and, consequently, by the blocking techniques, there are also challenges related to streaming data, incremental processing, and noisy data. To address them, we propose a schema-agnostic blocking technique capable of handling noisy and streaming data incrementally through a distributed computational infrastructure. To the best of our knowledge, there is a lack of blocking techniques that address these challenges simultaneously. This work proposes two strategies (attribute selection and top-n neighborhood entities) to minimize resource consumption and improve blocking efficiency. Moreover, this work presents a noise-tolerant algorithm, which minimizes the impact of noisy data (e.g., typos and misspellings) on blocking effectiveness. In our experimental evaluation, we use real-world pairs of data sources, including a case study that involves data from Twitter and Google News. The proposed technique achieves better results regarding effectiveness and efficiency compared to the state-of-the-art technique (metablocking). More precisely, the application of the two strategies over the proposed technique alone improves efficiency by 56%, on average.

Directory of Open Access Journals

Trepo - Institutional Repository of Tampere University