
    Fairness in Data Wrangling


    Clinical data wrangling using Ontological Realism and Referent Tracking

    Ontological realism aims to develop high-quality ontologies that faithfully represent what is general in reality and to use these ontologies to render heterogeneous data collections comparable. Achieving this second goal for clinical research datasets presupposes not merely (1) that the requisite ontologies already exist, but also (2) that the datasets in question are faithful to reality in the dual sense that (a) they denote only particulars and relationships between particulars that do in fact exist and (b) they do this in terms of the types and type-level relationships described in these ontologies. While much attention has been devoted to (1), work on (2), which is the topic of this paper, is comparatively rare. Using Referent Tracking as a basis, we describe a technical data wrangling strategy that consists in creating for each dataset a template that, when applied to each particular record in the dataset, leads to the generation of a collection of Referent Tracking Tuples (RTTs) built out of unique identifiers for the entities described by means of the data items in the record. The proposed strategy is based on (i) the distinction between data and what data are about, and (ii) the explicit descriptions of portions of reality which RTTs provide and which range not only over the particulars described by data items in a dataset, but also over these data items themselves. This last feature allows us to describe particulars that are only implicitly referred to by the dataset; to provide information about correspondences between data items in a dataset; and to assert which data items are unjustifiably or redundantly present in, or absent from, the dataset. The approach has been tested on a dataset collected from patients seeking treatment for orofacial pain at two German universities and made available for the NIDCR-funded OPMQoL project.
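    To make the template idea concrete, here is a minimal Python sketch, not the authors' actual implementation: it assumes an invented questionnaire record and a heavily simplified tuple structure, and only illustrates how applying a template to one record can mint unique identifiers for implicitly referenced particulars and emit tuples about both the particulars and the data items themselves.

```python
import uuid
from dataclasses import dataclass

@dataclass
class SimpleTuple:
    """Simplified stand-in for a Referent Tracking Tuple: links a unique
    identifier for a particular to an ontology type or to another particular."""
    subject: str    # unique identifier of the particular (or data item) described
    relation: str   # e.g. "instance_of" or a type-level relation
    target: str     # an ontology type or another unique identifier

def new_id() -> str:
    """Mint a unique identifier for a particular (UUID used as a stand-in)."""
    return f"IUI-{uuid.uuid4()}"

def apply_template(record: dict) -> list[SimpleTuple]:
    """Hypothetical template for one questionnaire record: mint identifiers
    for the particulars the record only implicitly refers to (the patient and
    the pain episode) and describe them, including the data item itself."""
    patient, episode = new_id(), new_id()
    return [
        SimpleTuple(patient, "instance_of", "ontology:HumanBeing"),   # placeholder type
        SimpleTuple(episode, "instance_of", "ontology:PainEpisode"),  # placeholder type
        SimpleTuple(episode, "inheres_in", patient),
        SimpleTuple(f"data_item:pain_score={record['pain_score']}", "is_about", episode),
    ]

# Field names in this record are invented for illustration.
for t in apply_template({"patient_id": "P-017", "pain_score": 7}):
    print(t)
```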

    Cost Effective Analysis of Big Data

    Executive Summary

    Big data is everywhere, and businesses that can access and analyze it have a huge advantage over those that cannot. One option for leveraging big data to make more informed decisions is to hire a big data consulting company to take over the entire project. This method requires the least effort, but it is also the least cost effective. The problem is that the know-how for starting a big data project is not commonly known, and the consulting alternative is expensive. This creates the need for a cost-effective approach that businesses can use to start and manage big data projects.

    This report details the development of an advisory tool that cuts down on the consulting costs of big data projects by having the business take an active role in the project itself. The tool is not a set of standard operating procedures, but simply a guide for someone to follow when embarking on a big data project. The advisory tool has three steps: data wrangling, statistical analysis, and data engineering. Data wrangling is the process of cleaning and organizing data into a format that is ready for statistical analysis; the guide recommends the open-source software and programming language R. The next step, statistical analysis, takes the form of exploratory data analysis and the use of existing models and algorithms. Existing methods should always be pushed to their best attainable performance before the costs of paid big data analytics and the development of new algorithms can be justified. Data engineering consists of creating and applying statistical algorithms, utilizing cloud infrastructure to distribute processing, and developing a complete platform solution.

    The experimentation for the design of our advisory tool was carried out through analysis of many large data sets. The data sets were analyzed to determine the best explanatory variables for predicting a selected response. The iterative process of data wrangling, statistical analysis, and model building was carried out for all the data sets. The experience gained through the iterations of data wrangling and exploratory analysis was extremely valuable in evaluating the usefulness of the design; the statistical analysis improved every time the iterative loop of wrangling and analysis was navigated.

    In-house data wrangling, performed before the data is handed to a data scientist, is the primary cost justification for using the advisory tool. Data wrangling typically occupies 80% of a data scientist's time in big data projects, so if the data is wrangled before a data scientist receives it, the data scientist spends less time wrangling. Since data scientists are paid very high hourly wages, every hour saved on wrangling equates to direct cost savings, provided the data wrangling performed beforehand is of adequate quality.

    The results of applying the advisory tool may vary from case to case, depending on the critical skills the user possesses and develops. The critical skills begin with coding in R and Python, as well as knowledge of the statistical methods of choice. Basic knowledge of statistics and of at least one programming language is a must to begin using this guide; statistical proficiency is the limiting factor of the advisory tool. The best way to start a big data project on one's own is to first learn R and become familiar with the statistical libraries it contains, which allows data wrangling and exploratory analysis to be performed at a high level. This project pushed the boundaries of what can be done with big data on a traditional computing setup without the cloud. The storage and processing limits of traditional computers were tested and in some cases reached, which confirmed the eventual need to operate in a cloud environment.
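    As an illustration of the wrangle-then-explore loop described above (the report itself recommends R; Python with pandas is used here only because both languages are listed among the critical skills), here is a minimal sketch with placeholder file and column names:

```python
import pandas as pd
from sklearn.linear_model import LinearRegression

# --- Data wrangling: clean and reshape raw data into an analysis-ready table ---
df = pd.read_csv("raw_data.csv")                      # placeholder file name
df = df.drop_duplicates()
df = df.dropna(subset=["response"])                   # placeholder response column (assumed numeric)
df["date"] = pd.to_datetime(df["date"], errors="coerce")

# --- Exploratory analysis: inspect candidate explanatory variables ---
print(df.describe(include="all"))
print(df.corr(numeric_only=True)["response"].sort_values())

# --- Existing methods first: a simple baseline before paying for custom analytics ---
features = df.select_dtypes("number").drop(columns=["response"], errors="ignore").fillna(0)
model = LinearRegression().fit(features, df["response"])
print(dict(zip(features.columns, model.coef_)))
```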

    Can language models automate data wrangling?

    [EN] The automation of data science and other data manipulation processes depends on the integration and formatting of 'messy' data. Data wrangling is an umbrella term for these tedious and time-consuming tasks. Tasks such as transforming dates, units or names expressed in different formats have been challenging for machine learning because (1) users expect to solve them with short cues or few examples, and (2) the problems depend heavily on domain knowledge. Interestingly, large language models today (1) can infer from very few examples or even a short clue in natural language, and (2) can integrate vast amounts of domain knowledge. It is therefore an important research question to analyse whether language models are a promising approach for data wrangling, especially as their capabilities continue to grow. In this paper we apply different variants of the language model Generative Pre-trained Transformer (GPT) to five batteries covering a wide range of data wrangling problems. We compare the effect of prompts and few-shot regimes on their results and how they compare with specialised data wrangling systems and other tools. Our major finding is that they appear to be a powerful tool for a wide range of data wrangling tasks. We provide some guidelines about how they can be integrated into data processing pipelines, provided the users can take advantage of their flexibility and the diversity of tasks to be addressed. However, reliability is still an important issue to overcome.

    Open Access funding provided thanks to the CRUE-CSIC agreement with Springer Nature. This work was funded by the Future of Life Institute, FLI, under grant RFP2-152, the MIT-Spain - INDITEX Sustainability Seed Fund under project COST-OMIZE, the EU (FEDER) and Spanish MINECO under RTI2018-094403-B-C32 and PID2021-122830OB-C42, Generalitat Valenciana under PROMETEO/2019/098 and INNEST/2021/317, the EU's Horizon 2020 research and innovation programme under grant agreement No. 952215 (TAILOR), and US DARPA HR00112120007 ReCOG-AI.

    Acknowledgements: We thank Lidia Contreras for her help with the Data Wrangling Dataset Repository. We thank the anonymous reviewers from the ECMLPKDD Workshop on Automating Data Science (ADS2021) and the anonymous reviewers of this special issue for their comments.

    Jaimovitch-López, G.; Ferri Ramírez, C.; Hernández-Orallo, J.; Martínez-Plumed, F.; Ramírez Quintana, MJ. (2023). Can language models automate data wrangling? Machine Learning, 112(6), 2053-2082. https://doi.org/10.1007/s10994-022-06259-9
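    A minimal sketch of the few-shot setup evaluated in the paper, using a date-reformatting task as a stand-in for one of the batteries; the examples are invented and `complete` is a placeholder for whichever GPT-style completion API is used:

```python
# Few-shot prompt for a date-reformatting task, in the style the paper evaluates.
examples = [
    ("25 March 1983", "1983-03-25"),
    ("July 4, 1999", "1999-07-04"),
    ("02/11/2004", "2004-11-02"),   # assumed day/month order, for illustration only
]

def build_prompt(query: str) -> str:
    """Serialize an instruction, the few-shot examples, and the query into one prompt."""
    lines = ["Convert each date to ISO 8601 (YYYY-MM-DD)."]
    for raw, iso in examples:
        lines.append(f"Input: {raw}\nOutput: {iso}")
    lines.append(f"Input: {query}\nOutput:")
    return "\n".join(lines)

def complete(prompt: str) -> str:
    """Placeholder for a call to a GPT-style completion API."""
    raise NotImplementedError("Call your language model API of choice here.")

if __name__ == "__main__":
    print(build_prompt("31 Dec 2020"))
    # answer = complete(build_prompt("31 Dec 2020")).strip()
```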

    Can language models automate data wrangling?

    [ES, translated] The automation of data science and other data manipulation processes depends on the integration and formatting of 'messy' data. Data wrangling is a term that encompasses these tedious and time-consuming tasks. Tasks such as transforming dates, units or names expressed in different formats have been a challenge for machine learning because users expect to solve them with short cues or few examples, and the problems depend heavily on domain knowledge. Interestingly, today's large language models infer from very few examples or even a short hint in natural language, and integrate large amounts of domain knowledge. It is therefore an important research question to analyse whether language models are a promising approach for data wrangling, especially as their capabilities continue to grow. In this paper we apply different GPT language model variants to data wrangling problems, comparing their results with those of specialised data wrangling tools, and also analysing the trends, variations, and further possibilities and risks of language models in this task. Our main finding is that they appear to be a powerful tool for a wide range of data wrangling tasks, but reliability may be an important issue to overcome.

    [EN] The automation of data science and other data manipulation processes depends on the integration and formatting of 'messy' data. Data wrangling is an umbrella term for these tedious and time-consuming tasks. Tasks such as transforming dates, units or names expressed in different formats have been challenging for machine learning because users expect to solve them with short cues or few examples, and the problems depend heavily on domain knowledge. Interestingly, large language models today infer from very few examples or even a short clue in natural language, and integrate vast amounts of domain knowledge. It is therefore an important research question to analyse whether language models are a promising approach for data wrangling, especially as their capabilities continue to grow. In this paper we apply different language model variants of GPT to data wrangling problems, comparing their results to specialised data wrangling tools, and also analysing the trends, variations and further possibilities and risks of language models in this task. Our major finding is that they appear to be a powerful tool for a wide range of data wrangling tasks, but reliability may be an important issue to overcome.

    Jaimovitch-López, G.; Ferri, C.; Hernández-Orallo, J.; Martínez-Plumed, F.; Ramírez-Quintana, MJ. (2021). Can language models automate data wrangling? http://hdl.handle.net/10251/18502

    CHAMALEON: Framework to improve Data Wrangling with Complex Data

    Data transformation and schema conciliation are relevant topics in industry due to the incorporation of data-intensive business processes in organizations. As the number of data sources increases, the complexity of the data increases as well, leading to complex and nested data schemata. Nowadays, novel approaches are being employed in academia and industry to assist non-expert users in transforming, integrating, and improving the quality of datasets (i.e., data wrangling). However, there is a lack of support for transforming semi-structured complex data. This article surveys the state of the art, identifying and analyzing the most relevant solutions found in academia and industry for transforming this type of data. In addition, we propose a Domain-Specific Language (DSL) to support the transformation of complex data as a first approach to enhancing data wrangling processes. We also develop a framework implementing the DSL and evaluate it in a real-world case study.
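    The sketch below is not the proposed DSL; it only illustrates, with pandas and an invented nested record layout, the kind of semi-structured complex data transformation (flattening nested objects and lists into a table) that such a framework targets.

```python
import pandas as pd

# Invented nested records of the kind produced by data-intensive processes:
# each order carries a nested customer object and a list of line items.
orders = [
    {
        "order_id": 1,
        "customer": {"name": "Acme GmbH", "country": "DE"},
        "items": [
            {"sku": "A-100", "qty": 2},
            {"sku": "B-200", "qty": 1},
        ],
    },
    {
        "order_id": 2,
        "customer": {"name": "Foo SL", "country": "ES"},
        "items": [{"sku": "A-100", "qty": 5}],
    },
]

# json_normalize flattens the nested list of items into one row per item,
# carrying the order- and customer-level attributes along as metadata columns.
flat = pd.json_normalize(
    orders,
    record_path="items",
    meta=["order_id", ["customer", "name"], ["customer", "country"]],
)
print(flat)
```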