5 research outputs found
Data context informed data wrangling
The process of preparing potentially large and complex data sets for further
analysis or manual examination is often called data wrangling. In classical
warehousing environments, the steps in such a process have been carried out
using Extract-Transform-Load platforms, with significant manual involvement in
specifying, configuring or tuning many of them. Cost-effective data wrangling
processes need to ensure that data wrangling steps benefit from automation
wherever possible. In this paper, we define a methodology to fully automate an
end-to-end data wrangling process incorporating data context, which associates
portions of a target schema with potentially spurious extensional data of types
that are commonly available. Instance-based evidence together with data
profiling paves the way to inform automation in several steps within the
wrangling process, specifically, matching, mapping validation, value format
transformation, and data repair. The approach is evaluated with real estate
data showing substantial improvements in the results of automated wrangling.
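The matching step described above can be illustrated as instance-based value overlap between source columns and data-context columns. This is a minimal sketch, not the paper's actual algorithm; all column names and values are hypothetical.

```python
def jaccard(a, b):
    """Jaccard overlap between two sets of instance values."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b) if a | b else 0.0

def match_columns(source_cols, context_cols, threshold=0.3):
    """Match each source column to the context column whose instance
    values overlap most, if the overlap clears the threshold."""
    matches = {}
    for s_name, s_vals in source_cols.items():
        best_name, best_vals = max(context_cols.items(),
                                   key=lambda kv: jaccard(s_vals, kv[1]))
        if jaccard(s_vals, best_vals) >= threshold:
            matches[s_name] = best_name
    return matches

# Hypothetical source extracted from a listings site, and data context
# (e.g. open address data) aligned with the target schema.
source = {"col1": ["Manchester", "Leeds", "York"],
          "col2": ["M1 4BT", "LS1 2AB", "YO1 7HH"]}
context = {"city": ["Manchester", "Leeds", "Sheffield"],
           "postcode": ["M1 4BT", "LS1 2AB", "S1 1AA"]}
print(match_columns(source, context))  # {'col1': 'city', 'col2': 'postcode'}
```

The same overlap evidence could, in principle, also support the later mapping-validation and repair steps the abstract mentions, though those would need richer profiling signals.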
Internship Report on data merging at the Bank of Portugal
Internship Experience at the Bank of Portugal: A Comprehensive Dive into Full Stack Development - Leveraging Modern Technology to Innovate Financial Infrastructure and Enhance User Experience
Internship Report presented as the partial requirement for obtaining a Master's degree in Data Science and Advanced Analytics, specialization in Data Science.
This report details my full-stack development internship experience at the Bank of Portugal, with a
particular emphasis on the creation of a website intended to increase operational effectiveness in the
DAS Department. My main contributions addressed a clear need: the absence of a reliable
platform that could manage and combine data from many sources. I was actively involved in creating
functionality for the Integrator and BAII applications using Django, a high-level Python web
framework. The features I designed and programmed addressed several problems,
including daily data extraction from several SQL databases, entity error detection, data merging, and
user-friendly interfaces for data manipulation. A feature that enables the attribution of litigation to
certain entities was also developed. The outcomes of the developed features have proven to be useful,
giving the Institutional Intervention Area, the Sanctioning Action Area, the Illicit Financial Activity
Investigation Area, and the Money Laundering Preventive Supervision Area for Capital and Financing
of Terrorism tools to carry out their duties more effectively. This internship experience has
contributed to the advancement and adoption of full-stack development approaches in the banking
industry, notably in data management and web application development.
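The entity error detection and data merging described above can be illustrated with a minimal sketch. The record shapes, field names, and conflict-flagging logic here are assumptions for illustration, not the internship's actual implementation.

```python
def merge_entities(src_a, src_b, key="entity_id"):
    """Merge records from two sources keyed on an entity identifier,
    keeping the first value seen and flagging conflicting attributes."""
    merged, conflicts = {}, []
    for rec in src_a + src_b:
        row = merged.setdefault(rec[key], {})
        for field, value in rec.items():
            if field in row and row[field] != value:
                conflicts.append((rec[key], field, row[field], value))
            else:
                row[field] = value
    return merged, conflicts

# Hypothetical extracts from two SQL databases describing the same entity.
a = [{"entity_id": 1, "name": "Acme Lda", "city": "Lisboa"}]
b = [{"entity_id": 1, "name": "ACME Lda", "sector": "retail"}]
merged, conflicts = merge_entities(a, b)
print(conflicts)  # the name fields disagree between the two sources
```

In a Django setting, such a routine would typically run inside the daily extraction job, with flagged conflicts surfaced in the web interface for manual resolution.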
RefDataCleaner: A usable data cleaning tool
In this paper, we carry out an experiment to compare user performance when cleaning data using two contrasting tools: RefDataCleaner, a bespoke web-based tool that we created specifically for detecting and fixing errors in structured and semi-structured data files, and Microsoft Excel, a spreadsheet application in widespread use in organizations throughout the world for diverse types of tasks, including data cleaning. With RefDataCleaner, a user specifies rules to detect and fix data errors, using hard-coded values or by retrieving values from a reference data file. In contrast, with Microsoft Excel, a non-expert user may clean data by specifying formulae and applying find/replace functions.
Magíster en Ingeniería y Analítica de Datos (Master's in Data Engineering and Analytics)
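The rule mechanism described above, hard-coded replacements alongside reference-data lookups, might be sketched as follows. The rule format and field names are hypothetical, not RefDataCleaner's actual syntax.

```python
import csv
import io

# Hypothetical reference data file mapping codes to canonical values.
reference_csv = "code,country\nPT,Portugal\nES,Spain\n"
reference = {row["code"]: row["country"]
             for row in csv.DictReader(io.StringIO(reference_csv))}

# Two assumed rule shapes: a reference-data lookup, and a hard-coded
# find/replace on a single field.
rules = [
    {"field": "country", "lookup": reference},
    {"field": "status", "find": "N/A", "replace": ""},
]

def clean(records, rules):
    """Apply each rule to each record, rewriting matching field values."""
    for rec in records:
        for rule in rules:
            value = rec.get(rule["field"])
            if "lookup" in rule and value in rule["lookup"]:
                rec[rule["field"]] = rule["lookup"][value]
            elif rule.get("find") == value:
                rec[rule["field"]] = rule["replace"]
    return records

rows = [{"country": "PT", "status": "N/A"}]
print(clean(rows, rules))  # [{'country': 'Portugal', 'status': ''}]
```

The appeal of such a rule list over spreadsheet formulae is that it is declarative and reusable across files, which is the usability contrast the experiment investigates.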
First, do no harm - Missing data treatment to support lake ecological condition assessment
Indicators of the ecological condition status of water bodies derived from field measurements are often subject to data gaps. Such gaps can lead to abandonment of the assessment, or to the use of treatment methods chosen merely for their availability. In response to these challenges, a systematic approach for expert-analyst interaction in missing data treatment is proposed. A combination of algorithms with hierarchical clustering of their results is used. Particular emphasis is put on the preparation and interpretation of input data and on the role of an expert in the workflow. The proposed approach enhances the decision-making process by improving communication and transparency throughout interactions between experts, analysts and decision makers. Future research should focus on assessing the scale of the ecological data drift phenomenon, which, given observed climate change, anthropogenic pressure and biodiversity loss, may impact the broad concept of indicator construction for lake water ecological assessment.
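The idea of combining several gap-filling algorithms and then clustering their results for expert review could be sketched as below. The imputation methods, tolerance, and single-linkage grouping are illustrative assumptions, not the paper's actual workflow.

```python
def impute(series, method):
    """Fill None gaps in a numeric series with one candidate method."""
    vals = [v for v in series if v is not None]
    if method == "mean":
        fill = sum(vals) / len(vals)
        return [fill if v is None else v for v in series]
    if method == "median":
        s, n = sorted(vals), len(vals)
        fill = s[n // 2] if n % 2 else (s[n // 2 - 1] + s[n // 2]) / 2
        return [fill if v is None else v for v in series]
    if method == "locf":  # last observation carried forward
        out, last = [], vals[0]
        for v in series:
            last = v if v is not None else last
            out.append(last)
        return out

def distance(a, b):
    """Maximum pointwise difference between two completed series."""
    return max(abs(x - y) for x, y in zip(a, b))

# Hypothetical indicator series with measurement gaps.
series = [2.0, None, 4.0, None, 6.0]
results = {m: impute(series, m) for m in ("mean", "median", "locf")}

# Single-linkage grouping: methods whose completed series differ by < 0.5
# land in the same cluster, so an expert reviews groups, not raw outputs.
clusters = []
for m, r in results.items():
    for c in clusters:
        if any(distance(r, results[o]) < 0.5 for o in c):
            c.append(m)
            break
    else:
        clusters.append([m])
print(clusters)  # [['mean', 'median'], ['locf']]
```

Presenting clusters of agreeing methods, rather than a single imputed series, keeps the expert in the loop, which matches the transparency goal the abstract emphasizes.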