5 research outputs found
Data context informed data wrangling
The process of preparing potentially large and complex data sets for further
analysis or manual examination is often called data wrangling. In classical
warehousing environments, the steps in such a process have been carried out
using Extract-Transform-Load platforms, with significant manual involvement in
specifying, configuring or tuning many of them. Cost-effective data wrangling
processes need to ensure that data wrangling steps benefit from automation
wherever possible. In this paper, we define a methodology to fully automate an
end-to-end data wrangling process incorporating data context, which associates
portions of a target schema with potentially spurious extensional data of types
that are commonly available. Instance-based evidence together with data
profiling paves the way to inform automation in several steps within the
wrangling process, specifically, matching, mapping validation, value format
transformation, and data repair. The approach is evaluated with real estate
data showing substantial improvements in the results of automated wrangling.
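The matching step described above can be illustrated as instance-based value overlap between source columns and data-context columns. This is a minimal sketch, not the paper's actual algorithm; all column names and values are hypothetical.

```python
def jaccard(a, b):
    """Jaccard overlap between two sets of instance values."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b) if a | b else 0.0

def match_columns(source_cols, context_cols, threshold=0.3):
    """Match each source column to the context column whose instance
    values overlap most, if the overlap clears the threshold."""
    matches = {}
    for s_name, s_vals in source_cols.items():
        best_name, best_vals = max(context_cols.items(),
                                   key=lambda kv: jaccard(s_vals, kv[1]))
        if jaccard(s_vals, best_vals) >= threshold:
            matches[s_name] = best_name
    return matches

# Hypothetical source extracted from a listings site, and data context
# (e.g. open address data) aligned with the target schema.
source = {"col1": ["Manchester", "Leeds", "York"],
          "col2": ["M1 4BT", "LS1 2AB", "YO1 7HH"]}
context = {"city": ["Manchester", "Leeds", "Sheffield"],
           "postcode": ["M1 4BT", "LS1 2AB", "S1 1AA"]}
print(match_columns(source, context))  # {'col1': 'city', 'col2': 'postcode'}
```

The same overlap evidence could, in principle, also support the later mapping-validation and repair steps the abstract mentions, though those would need richer profiling signals.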
Internship Report on data merging at the Bank of Portugal
Internship Experience at the Bank of Portugal: A Comprehensive Dive into Full Stack Development - Leveraging Modern Technology to Innovate Financial Infrastructure and Enhance User Experience
Internship Report presented as the partial requirement for obtaining a Master's degree in Data Science and Advanced Analytics, specialization in Data Science.
This report details my full-stack development internship experience at the Bank of Portugal, with a
particular emphasis on the creation of a website intended to increase operational effectiveness in the
DAS Department. My main contributions addressed a clear need: the absence of a reliable
platform that could manage and combine data from many sources. I was actively involved in creating
functionality for the Integrator and BAII applications using Django, a high-level Python web
framework. The features I designed and programmed addressed several problems,
including daily data extraction from several SQL databases, entity error detection, data merging, and
user-friendly interfaces for data manipulation. A feature that enables the attribution of litigation to
certain entities was also developed. The outcomes of the developed features have proven to be useful,
giving the Institutional Intervention Area, the Sanctioning Action Area, the Illicit Financial Activity
Investigation Area, and the Money Laundering Preventive Supervision Area for Capital and Financing
of Terrorism tools to carry out their duties more effectively. This internship experience has
contributed to the advancement and adoption of full-stack development approaches in the banking
industry, notably in data management and web application development.
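The entity error detection and data merging described above can be illustrated with a minimal sketch. The record shapes, field names, and conflict-flagging logic here are assumptions for illustration, not the internship's actual implementation.

```python
def merge_entities(src_a, src_b, key="entity_id"):
    """Merge records from two sources keyed on an entity identifier,
    keeping the first value seen and flagging conflicting attributes."""
    merged, conflicts = {}, []
    for rec in src_a + src_b:
        row = merged.setdefault(rec[key], {})
        for field, value in rec.items():
            if field in row and row[field] != value:
                conflicts.append((rec[key], field, row[field], value))
            else:
                row[field] = value
    return merged, conflicts

# Hypothetical extracts from two SQL databases describing the same entity.
a = [{"entity_id": 1, "name": "Acme Lda", "city": "Lisboa"}]
b = [{"entity_id": 1, "name": "ACME Lda", "sector": "retail"}]
merged, conflicts = merge_entities(a, b)
print(conflicts)  # the name fields disagree between the two sources
```

In a Django setting, such a routine would typically run inside the daily extraction job, with flagged conflicts surfaced in the web interface for manual resolution.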
RefDataCleaner: A usable data cleaning tool
In this paper, we carry out an experiment to compare user performance when cleaning data using two contrasting tools: RefDataCleaner, a bespoke web-based tool that we created specifically for detecting and fixing errors in structured and semi-structured data files, and Microsoft Excel, a spreadsheet application in widespread use in organizations throughout the world for diverse types of tasks, including data cleaning. With RefDataCleaner, a user specifies rules to detect and fix data errors, using hard-coded values or by retrieving values from a reference data file. In contrast, with Microsoft Excel, a non-expert user may clean data by specifying formulae and applying find/replace functions.
Magíster en Ingeniería y Analítica de Datos (Master's in Data Engineering and Analytics)
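The rule mechanism described above, hard-coded replacements alongside reference-data lookups, might be sketched as follows. The rule format and field names are hypothetical, not RefDataCleaner's actual syntax.

```python
import csv
import io

# Hypothetical reference data file mapping codes to canonical values.
reference_csv = "code,country\nPT,Portugal\nES,Spain\n"
reference = {row["code"]: row["country"]
             for row in csv.DictReader(io.StringIO(reference_csv))}

# Two assumed rule shapes: a reference-data lookup, and a hard-coded
# find/replace on a single field.
rules = [
    {"field": "country", "lookup": reference},
    {"field": "status", "find": "N/A", "replace": ""},
]

def clean(records, rules):
    """Apply each rule to each record, rewriting matching field values."""
    for rec in records:
        for rule in rules:
            value = rec.get(rule["field"])
            if "lookup" in rule and value in rule["lookup"]:
                rec[rule["field"]] = rule["lookup"][value]
            elif rule.get("find") == value:
                rec[rule["field"]] = rule["replace"]
    return records

rows = [{"country": "PT", "status": "N/A"}]
print(clean(rows, rules))  # [{'country': 'Portugal', 'status': ''}]
```

The appeal of such a rule list over spreadsheet formulae is that it is declarative and reusable across files, which is the usability contrast the experiment investigates.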
First, do no harm - Missing data treatment to support lake ecological condition assessment
Indicators of the ecological condition status of water bodies derived from field measurements are often subject to data gaps. Such gaps can lead to abandonment of the assessment, or to the use of treatment methods chosen merely for their availability. In response to these challenges, a systematic approach for expert-analyst interaction in missing data treatment is proposed. A combination of algorithms with hierarchical clustering of their results is used. Particular emphasis is put on the preparation and interpretation of input data and on the role of an expert in the workflow. The proposed approach enhances the decision-making process by improving communication and transparency throughout interactions between experts, analysts and decision makers. Future research should focus on assessing the scale of the ecological data drift phenomenon, which, given observed climate change, anthropogenic pressure and biodiversity loss, may impact the broad concept of indicator construction for lake water ecological assessment.
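The idea of combining several gap-filling algorithms and then clustering their results for expert review could be sketched as below. The imputation methods, tolerance, and single-linkage grouping are illustrative assumptions, not the paper's actual workflow.

```python
def impute(series, method):
    """Fill None gaps in a numeric series with one candidate method."""
    vals = [v for v in series if v is not None]
    if method == "mean":
        fill = sum(vals) / len(vals)
        return [fill if v is None else v for v in series]
    if method == "median":
        s, n = sorted(vals), len(vals)
        fill = s[n // 2] if n % 2 else (s[n // 2 - 1] + s[n // 2]) / 2
        return [fill if v is None else v for v in series]
    if method == "locf":  # last observation carried forward
        out, last = [], vals[0]
        for v in series:
            last = v if v is not None else last
            out.append(last)
        return out

def distance(a, b):
    """Maximum pointwise difference between two completed series."""
    return max(abs(x - y) for x, y in zip(a, b))

# Hypothetical indicator series with measurement gaps.
series = [2.0, None, 4.0, None, 6.0]
results = {m: impute(series, m) for m in ("mean", "median", "locf")}

# Single-linkage grouping: methods whose completed series differ by < 0.5
# land in the same cluster, so an expert reviews groups, not raw outputs.
clusters = []
for m, r in results.items():
    for c in clusters:
        if any(distance(r, results[o]) < 0.5 for o in c):
            c.append(m)
            break
    else:
        clusters.append([m])
print(clusters)  # [['mean', 'median'], ['locf']]
```

Presenting clusters of agreeing methods, rather than a single imputed series, keeps the expert in the loop, which matches the transparency goal the abstract emphasizes.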