On the performance impact of using JSON, beyond impedance mismatch

Author: Abelló Gamazo Alberto
Hewasinghage Moditha Lakshan Dharmasir
Nadal Francesch Sergi
Publication venue: 'Springer Science and Business Media LLC'
Publication date: 01/01/2020

NoSQL database management systems adopt semi-structured data models, such as JSON, to easily accommodate schema evolution and to avoid the overhead of transforming internal structures into tabular data (i.e., impedance mismatch). There are multiple equivalent ways to physically represent semi-structured data, but there is a lack of evidence about their potential impact on space and query performance. In this paper, we embark on the task of quantifying that impact, specifically for document stores. We empirically compare multiple ways of representing semi-structured data, which allows us to derive a set of guidelines for efficient physical database design that considers both JSON and relational options in the same palette.

Partly funded by the European Commission through the programme “EM IT4BI-DC”.

Peer Reviewed. Postprint (author's final draft).

UPCommons. Portal del coneixement obert de la UPC

Data Matching and Deduplication Over Big Data Using Hadoop Framework

Author: Albanese Pablo Adrián
Ale Juan M.
Publication date: 16/11/2016

Entity resolution is the process of matching records from more than one database that refer to the same entity. When only a single database is involved, the process is called deduplication. This article proposes a method for solving the entity resolution and deduplication problems using MapReduce over the Hadoop framework. The proposed method includes data preprocessing, comparison, and classification tasks, with indexing by the standard blocking method. Our method can operate on one, two, or more datasets and works with semi-structured or structured data.

XIII Workshop Bases de datos y Minería de Datos (WBDMD). Red de Universidades con Carreras en Informática (RedUNCI).
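The pipeline named in this abstract (preprocessing, standard blocking, pairwise comparison, classification) can be illustrated with a minimal sketch. This is not the authors' implementation and does not use the Hadoop API: it simulates the map, shuffle, and reduce steps in plain Python. The blocking key (first three letters of a `surname` field), the string-similarity measure, and the 0.85 threshold are all assumptions chosen for illustration.

```python
from collections import defaultdict
from difflib import SequenceMatcher
from itertools import combinations


def preprocess(record):
    """Normalize a record: lowercase and collapse whitespace in each field."""
    return {k: " ".join(str(v).lower().split()) for k, v in record.items()}


def blocking_key(record):
    """Hypothetical blocking key: first three letters of the surname."""
    return record["surname"][:3]


def map_phase(records):
    """Map step: emit (blocking key, (record id, cleaned record)) pairs."""
    for rid, rec in records:
        clean = preprocess(rec)
        yield blocking_key(clean), (rid, clean)


def shuffle(pairs):
    """Group mapped values by key, as a MapReduce framework does between phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def similarity(a, b):
    """Average string similarity over the fields both records share."""
    fields = sorted(set(a) & set(b))
    if not fields:
        return 0.0
    scores = [SequenceMatcher(None, a[f], b[f]).ratio() for f in fields]
    return sum(scores) / len(scores)


def reduce_phase(block, threshold=0.85):
    """Reduce step: compare record pairs within one block and classify them."""
    matches = []
    for (id_a, a), (id_b, b) in combinations(block, 2):
        if similarity(a, b) >= threshold:  # assumed threshold classifier
            matches.append((id_a, id_b))
    return matches


def find_matches(records, threshold=0.85):
    """Run the simulated map, shuffle, and reduce steps over (id, record) pairs."""
    result = []
    for block in shuffle(map_phase(records)).values():
        result.extend(reduce_phase(block, threshold))
    return sorted(result)
```

Because only records sharing a blocking key are compared, the number of pairwise comparisons falls well below the quadratic cost of comparing every record with every other. The trade-off is that true matches whose keys differ (for example, because of a typo in the first letters of the surname) are never compared.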