5 research outputs found

    Data context informed data wrangling

    The process of preparing potentially large and complex data sets for further analysis or manual examination is often called data wrangling. In classical warehousing environments, the steps in such a process have been carried out on Extract-Transform-Load platforms, with significant manual involvement in specifying, configuring or tuning many of them. Cost-effective data wrangling processes need to ensure that wrangling steps benefit from automation wherever possible. In this paper, we define a methodology to fully automate an end-to-end data wrangling process incorporating data context, which associates portions of a target schema with potentially spurious extensional data of types that are commonly available. Instance-based evidence, together with data profiling, informs automation in several steps within the wrangling process, specifically matching, mapping validation, value format transformation, and data repair. The approach is evaluated on real estate data, showing substantial improvements in the results of automated wrangling.
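
    As a hedged illustration of the instance-based matching step described above (not the paper's actual implementation), the sketch below scores candidate matches between source columns and target-schema attributes by value overlap with data-context values; the names jaccard, match_columns and context_values are assumptions introduced for this example.

    # Illustrative sketch only: instance-based schema matching via value overlap
    # with data-context values. Names and thresholds are assumptions.
    import pandas as pd

    def jaccard(a, b):
        """Jaccard similarity between two collections of values."""
        a, b = set(a), set(b)
        return len(a & b) / len(a | b) if a | b else 0.0

    def match_columns(source: pd.DataFrame, context_values: dict, threshold: float = 0.3):
        """Propose source-column -> target-attribute matches from instance evidence."""
        matches = {}
        for col in source.columns:
            values = source[col].dropna().astype(str).str.lower()
            scores = {attr: jaccard(values, ctx) for attr, ctx in context_values.items()}
            attr, score = max(scores.items(), key=lambda kv: kv[1])
            if score >= threshold:
                matches[col] = (attr, round(score, 2))
        return matches

    # Example: reference postcodes and street names act as data context
    # for a (hypothetical) real-estate target schema.
    context = {
        "postcode": {"m1 1ae", "sw1a 1aa"},
        "street": {"baker street", "high street"},
    }
    src = pd.DataFrame({"col_a": ["M1 1AE", "SW1A 1AA"], "col_b": ["Baker Street", "High Street"]})
    print(match_columns(src, context))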

    Internship Report on Data Merging at the Bank of Portugal (Internship Experience at the Bank of Portugal: A Comprehensive Dive into Full Stack Development - Leveraging Modern Technology to Innovate Financial Infrastructure and Enhance User Experience)

    Internship Report presented as the partial requirement for obtaining a Master's degree in Data Science and Advanced Analytics, specialization in Data Science. This report details my full-stack development internship experience at the Bank of Portugal, with a particular emphasis on the creation of a website intended to increase operational effectiveness in the DAS Department. My main contributions met a clear need: the absence of a reliable platform that could manage and combine data from many sources. I was actively involved in creating functionality for the Integrator and BAII applications using Django, a high-level Python web framework. The features I planned and programmed addressed several problems, including daily data extraction from several SQL databases, entity error detection, data merging, and user-friendly interfaces for data manipulation. A feature that enables the attribution of litigation to specific entities was also developed. The developed features have proven useful, giving the Institutional Intervention Area, the Sanctioning Action Area, the Illicit Financial Activity Investigation Area, and the Preventive Supervision Area for Money Laundering and Terrorist Financing tools to carry out their duties more effectively. This internship experience has contributed to the advancement and use of full-stack development approaches in the banking industry, notably in data management and web application development.
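
    A minimal sketch of the kind of daily extract-and-merge step described above, under assumed table and column names rather than the bank's actual schema; it is written with pandas and SQLAlchemy (not the Django stack used in the internship) and flags conflicting entity names as potential entity errors.

    # Hedged sketch: pull an entities table from several SQL sources, merge on
    # entity_id, and flag name conflicts. Schema and data are hypothetical.
    import pandas as pd
    from sqlalchemy import create_engine

    # In-memory SQLite databases stand in for the several source systems.
    engines = {"db_a": create_engine("sqlite://"), "db_b": create_engine("sqlite://")}
    pd.DataFrame({"entity_id": [1, 2], "entity_name": ["Acme SA", "Beta Lda"]}).to_sql(
        "entities", engines["db_a"], index=False)
    pd.DataFrame({"entity_id": [2, 3], "entity_name": ["Beta, Lda", "Gamma SA"]}).to_sql(
        "entities", engines["db_b"], index=False)

    def extract_entities() -> pd.DataFrame:
        """Daily pull of the entity table from each source, tagged with its origin."""
        frames = []
        for name, engine in engines.items():
            df = pd.read_sql("SELECT entity_id, entity_name FROM entities", engine)
            df["source"] = name
            frames.append(df)
        return pd.concat(frames, ignore_index=True)

    def merge_and_flag(df: pd.DataFrame) -> pd.DataFrame:
        """Merge records on entity_id and flag conflicting names as potential errors."""
        merged = df.groupby("entity_id").agg(
            names=("entity_name", lambda s: sorted(set(s))),
            sources=("source", list),
        ).reset_index()
        merged["name_conflict"] = merged["names"].str.len() > 1
        return merged

    print(merge_and_flag(extract_entities()))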

    RefDataCleaner: A usable data cleaning tool

    In this paper, we carry out an experiment to compare user performance when cleaning data using two contrasting tools: RefDataCleaner, a bespoke web-based tool that we created specifically for detecting and fixing errors in structured and semi-structured data files, and Microsoft Excel, a spreadsheet application in widespread use in organizations throughout the world for diverse types of tasks, including data cleaning. With RefDataCleaner, a user specifies rules to detect and fix data errors, using hard-coded values or by retrieving values from a reference data file. In contrast, with Microsoft Excel, a non-expert user may clean data by specifying formulae and applying find/replace functions. Magíster en Ingeniería y Analítica de Datos (Master's in Engineering and Data Analytics).
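
    The sketch below illustrates the rule style that the description of RefDataCleaner suggests, assuming a simple dictionary-based rule format: a rule either substitutes a hard-coded value or retrieves the correction from a reference data file. The helper names and file layout are hypothetical, not the tool's actual interface.

    # Hedged sketch of rule-based cleaning with hard-coded or reference-data fixes.
    import csv

    def load_reference(path: str, key: str, value: str) -> dict:
        """Build a lookup table (e.g. postcode -> canonical city) from a reference CSV."""
        with open(path, newline="", encoding="utf-8") as f:
            return {row[key]: row[value] for row in csv.DictReader(f)}

    def apply_rules(records: list[dict], rules: list[dict]) -> list[dict]:
        """Apply detect-and-fix rules to each record in place."""
        for rec in records:
            for rule in rules:
                field = rule["field"]
                if rec.get(field) == rule["bad_value"]:
                    if "fixed_value" in rule:      # hard-coded correction
                        rec[field] = rule["fixed_value"]
                    else:                          # correction looked up in reference data
                        rec[field] = rule["reference"].get(rec[rule["lookup_field"]], rec[field])
        return records

    # Example: fix a misspelled city directly, and fill missing cities from postcodes.
    reference = {"M1 1AE": "Manchester"}   # would normally come from load_reference()
    rules = [
        {"field": "city", "bad_value": "Manchestr", "fixed_value": "Manchester"},
        {"field": "city", "bad_value": "", "reference": reference, "lookup_field": "postcode"},
    ]
    rows = [{"postcode": "M1 1AE", "city": "Manchestr"}, {"postcode": "M1 1AE", "city": ""}]
    print(apply_rules(rows, rules))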

    First, do no harm - Missing data treatment to support lake ecological condition assessment

    Indicators of the ecological condition status of water bodies associated with field measurements are often subject to data gaps. This obstacle can lead to the abandonment of an assessment, or to the use of treatment methods chosen merely for their availability. In response to these challenges, a systematic approach to expert-analyst interaction for missing data treatment is proposed. A combination of algorithms with hierarchical clustering of the results is used, with particular emphasis on the preparation and interpretation of input data and on the role of the expert in the workflow. The proposed approach enhances the decision-making process by improving communication and transparency throughout interactions between experts, analysts and decision makers. Future research should focus on assessing the scale of the ecological data drift phenomenon, which, given observed climate change, anthropogenic pressure and biodiversity loss, may impact the broad concept of indicator construction for lake water ecological assessment.
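
    A hedged sketch of the workflow described above: impute the gaps with several alternative algorithms, then hierarchically cluster the imputed results so an expert can see which methods agree before an assessment is made. The toy indicator matrix and the library choices (scikit-learn, SciPy) are illustrative assumptions, not the paper's implementation.

    # Hedged sketch: multiple imputation algorithms plus hierarchical clustering
    # of their results, to expose how much the choice of method matters.
    import numpy as np
    from sklearn.impute import SimpleImputer, KNNImputer
    from scipy.cluster.hierarchy import linkage, fcluster
    from scipy.spatial.distance import pdist

    # Toy matrix of lake condition indicators (rows = lakes) with missing values.
    X = np.array([
        [7.1, 0.8, np.nan],
        [6.9, np.nan, 12.0],
        [np.nan, 0.5, 15.0],
        [7.4, 0.9, 11.0],
    ])

    imputers = {
        "mean": SimpleImputer(strategy="mean"),
        "median": SimpleImputer(strategy="median"),
        "knn": KNNImputer(n_neighbors=2),
    }

    # One imputed copy of the data per algorithm, flattened for comparison.
    imputed = {name: imp.fit_transform(X).ravel() for name, imp in imputers.items()}

    # Hierarchically cluster the imputed versions; methods whose results group
    # together can be discussed with the expert before an assessment is made.
    names = list(imputed)
    Z = linkage(pdist(np.vstack([imputed[n] for n in names])), method="average")
    print(dict(zip(names, fcluster(Z, t=2, criterion="maxclust"))))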