
    Greater data science at baccalaureate institutions

    Donoho's JCGS (in press) paper is a spirited call to action for statisticians, who, he points out, are losing ground in the field of data science by refusing to accept that data science is its own domain (or, at least, a domain that is becoming distinctly defined). He calls on writings by John Tukey, Bill Cleveland, and Leo Breiman, among others, to remind us that statisticians have been dealing with data science for years, and encourages acceptance of the direction of the field while also ensuring that statistics is tightly integrated. As faculty at baccalaureate institutions (where the growth of undergraduate statistics programs has been dramatic), we are keen to ensure statistics has a place in data science and data science education. In his paper, Donoho is primarily focused on graduate education. At our undergraduate institutions, we are considering many of the same questions. Comment: in-press response to the Donoho paper in the Journal of Computational and Graphical Statistics.

    Teaching Stats for Data Science

    “Data science” is a useful catchword for methods and concepts original to the field of statistics but typically applied to large, multivariate, observational records. Such datasets call for techniques not often part of an introduction to statistics: modeling, consideration of covariates, sophisticated visualization, and causal reasoning. This article re-imagines introductory statistics as an introduction to data science and proposes a sequence of 10 blocks that together compose a suitable course for extracting information from contemporary data. Recent extensions to the mosaic packages for R, together with tools from the “tidyverse”, provide a concise and readable notation for wrangling, visualization, model-building, and model interpretation: the fundamental computational tasks of data science.
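
    A minimal sketch, assuming the kind of tidyverse-style course setup the article describes, of concise notation for wrangling, visualization, and modeling; it uses the built-in mtcars data as a stand-in for real course data and base lm() rather than the mosaic extensions themselves.

        library(dplyr)
        library(ggplot2)

        # Wrangle: derive a readable grouping variable and summarise by group
        mtcars %>%
          mutate(transmission = if_else(am == 1, "manual", "automatic")) %>%
          group_by(transmission) %>%
          summarise(mean_mpg = mean(mpg), n = n())

        # Visualize: scatterplot with a fitted linear trend
        ggplot(mtcars, aes(x = wt, y = mpg)) +
          geom_point() +
          geom_smooth(method = "lm")

        # Model: fit and interpret a linear model with a covariate
        fit <- lm(mpg ~ wt + hp, data = mtcars)
        summary(fit)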

    Can language models automate data wrangling?

    The automation of data science and other data manipulation processes depends on the integration and formatting of 'messy' data. Data wrangling is an umbrella term for these tedious and time-consuming tasks. Tasks such as transforming dates, units or names expressed in different formats have been challenging for machine learning because (1) users expect to solve them with short cues or few examples, and (2) the problems depend heavily on domain knowledge. Interestingly, large language models today (1) can infer from very few examples or even a short clue in natural language, and (2) can integrate vast amounts of domain knowledge. It is therefore an important research question to analyse whether language models are a promising approach for data wrangling, especially as their capabilities continue growing. In this paper we apply different variants of the language model Generative Pre-trained Transformer (GPT) to five batteries covering a wide range of data wrangling problems. We compare the effect of prompts and few-shot regimes on their results and how they compare with specialised data wrangling systems and other tools. Our major finding is that they appear to be a powerful tool for a wide range of data wrangling tasks. We provide some guidelines about how they can be integrated into data processing pipelines, provided the users can take advantage of their flexibility and the diversity of tasks to be addressed. However, reliability is still an important issue to overcome.
    Jaimovitch-López, G.; Ferri Ramírez, C.; Hernández-Orallo, J.; Martínez-Plumed, F.; Ramírez Quintana, M. J. (2023). Can language models automate data wrangling? Machine Learning, 112(6), 2053-2082. https://doi.org/10.1007/s10994-022-06259-9
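
    An illustrative sketch, not taken from the paper, of how a few-shot prompt for one of the data wrangling tasks the authors evaluate (date normalisation) can be assembled; query_llm() is a hypothetical placeholder for whichever GPT client is actually used.

        # Few-shot prompt: two worked examples plus the instance to be completed
        examples <- c(
          "Input: 25/03/1983 -> Output: 1983-03-25",
          "Input: 03-29-86 -> Output: 1986-03-29",
          "Input: 5 April 2021 -> Output:"
        )
        prompt <- paste(examples, collapse = "\n")
        cat(prompt)

        # completion <- query_llm(prompt)  # hypothetical call; a correct model
        #                                  # would return "2021-04-05"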

    The fivethirtyeight R package: ‘Tame Data’ Principles for Introductory Statistics and Data Science Courses

    As statistics and data science instructors, we often seek to use data in our courses that are rich, real, realistic, and relevant. To this end we created the fivethirtyeight R package of data and code behind the stories and interactives at the data journalism website FiveThirtyEight.com. After a discussion on the conflicting pedagogical goals of minimizing prerequisites to research (Cobb 2015) while at the same time presenting students with a realistic view of data as it exists in the wild, we articulate how a desired balance between these two goals informed the design of the package. The details behind this balance are articulated as our proposed Tame data principles for introductory statistics and data science courses. Details of the package's construction and example uses are included as well.
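
    A short sketch of how the package might be used in an introductory course, assuming the bechdel dataset and its binary and budget_2013 columns as shipped with the fivethirtyeight package.

        library(fivethirtyeight)
        library(dplyr)

        # Tame data: a documented, analysis-ready table behind a FiveThirtyEight story
        bechdel %>%
          filter(!is.na(budget_2013)) %>%
          group_by(binary) %>%                 # PASS / FAIL on the Bechdel test
          summarise(median_budget = median(budget_2013),
                    n_films = n())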

    PRESISTANT: Learning based assistant for data pre-processing

    Data pre-processing is one of the most time-consuming and relevant steps in a data analysis process (e.g., a classification task). A given data pre-processing operator (e.g., a transformation) can have a positive, negative or zero impact on the final result of the analysis. Expert users have the required knowledge to find the right pre-processing operators. Non-experts, however, are overwhelmed by the number of pre-processing operators, and it is challenging for them to find operators that would positively impact their analysis (e.g., increase the predictive accuracy of a classifier). Existing solutions either assume that users have expert knowledge, or they recommend pre-processing operators that are only "syntactically" applicable to a dataset, without taking into account their impact on the final analysis. In this work, we aim at providing assistance to non-expert users by recommending data pre-processing operators that are ranked according to their impact on the final analysis. We developed a tool, PRESISTANT, that uses Random Forests to learn the impact of pre-processing operators on the performance (e.g., predictive accuracy) of five classification algorithms: J48, Naive Bayes, PART, Logistic Regression, and Nearest Neighbor. Extensive evaluations of the recommendations provided by our tool show that PRESISTANT can effectively help non-experts achieve improved results in their analytical tasks.
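
    The following is not the PRESISTANT tool itself, only a small sketch of the idea it builds on: measuring how a single pre-processing operator changes a classifier's predictive accuracy, which is the signal its meta-learner is trained to rank. The simulated data and the logistic-regression classifier are assumptions made for illustration.

        set.seed(1)
        n <- 500
        x <- rexp(n)                          # skewed predictor
        y <- rbinom(n, 1, plogis(log(x)))     # outcome driven by log(x)
        d <- data.frame(x = x, y = y)
        train <- sample(n, 0.7 * n)

        # Hold-out accuracy of a classifier under a given pre-processing choice
        accuracy <- function(formula) {
          fit  <- glm(formula, data = d[train, ], family = binomial)
          pred <- as.numeric(predict(fit, d[-train, ], type = "response") > 0.5)
          mean(pred == d$y[-train])
        }

        accuracy(y ~ x)        # without the operator
        accuracy(y ~ log(x))   # with a log-transform operator applied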

    Cost Effective Analysis of Big Data

    Executive Summary: Big data is everywhere, and businesses that can access and analyze it have a huge advantage over those that can’t. One option for leveraging big data to make more informed decisions is to hire a big data consulting company to take over the entire project. This method requires the least effort, but it is also the least cost-effective. The problem is that the know-how for starting a big data project is not commonly known and the consulting alternative is not very cost-effective. This creates the need for a cost-effective approach that businesses can use to start and manage big data projects themselves.

    This report details the development of an advisory tool to cut down on the consulting costs of big data projects by taking an active role in the project yourself. The tool is not a set of standard operating procedures, but simply a guide for someone to follow when embarking on a big data project. The advisory tool has three steps: data wrangling, statistical analysis, and data engineering. Data wrangling is the process of cleaning and organizing data into a format that is ready for statistical analysis; the guide recommends using the open-source software and programming language R. The statistical analysis step takes the form of exploratory data analysis and the use of existing models and algorithms. Existing methods should always be pushed to their best performance before the costs of paid big data analytics and the development of new algorithms can be justified. Data engineering consists of creating and applying statistical algorithms, utilizing cloud infrastructure to distribute processing, and developing a complete platform solution.

    The experimentation for the design of our advisory tool was carried out through analysis of many large data sets. The data sets were analyzed to determine the best explanatory variables to predict a selected response. The iterative process of data wrangling, statistical analysis, and model building was carried out for all the data sets. The experience gained through these iterations of data wrangling and exploratory analysis was extremely valuable in evaluating the usefulness of the design, and the statistical analysis improved every time the iterative loop of wrangling and analysis was navigated.

    In-house data wrangling, before submission to a data scientist, is the primary cost justification for using the advisory tool. Data wrangling typically occupies 80% of a data scientist’s time in big data projects, so if data wrangling is performed in house before a data scientist receives the data, the data scientist spends less time wrangling. Since data scientists are paid high hourly wages, time saved wrangling equates to direct cost savings, assuming that the data wrangling performed before the data scientist takes over is of adequate quality.

    The results of applying the advisory tool may vary from case to case, depending on the critical skills the user possesses and develops. The critical skills begin with coding in R and Python as well as knowledge of the chosen statistical methods. Basic knowledge of statistics and of at least one programming language is a must to begin utilizing this guide; statistical proficiency is the limiting factor in the advisory tool. The best start for doing a big data project on one’s own is to first learn R and become familiar with the statistical libraries it contains. This allows data wrangling and exploratory analysis to be performed at a high level. The project pushed the boundaries of what can be done with big data using a traditional computing framework without cloud usage: the storage and processing limits of traditional computers were tested and in some cases reached, which confirmed the eventual need to operate in a cloud environment.
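
    A minimal sketch, under the report's premise of staying on a single machine, of reading a large file in chunks so that wrangling and simple exploratory summaries fit in memory; "big_file.csv" and the use of its first column are hypothetical.

        chunk_size <- 1e5
        con <- file("big_file.csv", open = "r")
        col_names <- strsplit(readLines(con, n = 1), ",")[[1]]   # header row

        total <- 0; rows <- 0
        repeat {
          chunk <- tryCatch(
            read.csv(con, nrows = chunk_size, header = FALSE, col.names = col_names),
            error = function(e) NULL                             # no lines left
          )
          if (is.null(chunk) || nrow(chunk) == 0) break
          total <- total + sum(chunk[[1]], na.rm = TRUE)         # running summary
          rows  <- rows + nrow(chunk)
        }
        close(con)
        total / rows   # mean of the first column without loading the whole file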

    An Educator’s Perspective of the Tidyverse

    Computing makes up a large and growing component of data science and statistics courses. Many of those courses, especially when taught by faculty who are statisticians by training, teach R as the programming language. A number of instructors have opted to build much of their teaching around use of the tidyverse. The tidyverse, in the words of its developers, “is a collection of R packages that share a high-level design philosophy and low-level grammar and data structures, so that learning one package makes it easier to learn the next” (Wickham et al. 2019). These shared principles have led to the widespread adoption of the tidyverse ecosystem. Much of this adoption is due to the tidyverse tools having been intentionally designed to ease the learning process and to make it easier for users to learn new functions as they engage with additional pieces of the larger ecosystem. Moreover, the functionality offered by the packages within the tidyverse spans the entire data science cycle, which includes data import, visualisation, wrangling, modeling, and communication. We believe the tidyverse provides an effective and efficient pathway for undergraduate students at all levels and majors to gain computational skills and thinking needed throughout the data science cycle. In this paper, we introduce the tidyverse from an educator’s perspective. We provide a brief introduction to the tidyverse, demonstrate how foundational statistics and data science tasks are accomplished with the tidyverse, and discuss the strengths of the tidyverse, particularly in the context of teaching and learning.
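
    A brief sketch of the shared grammar the paper highlights, assuming the billboard dataset that ships with tidyr: the same pipe-friendly, data-frame-in / data-frame-out style carries from reshaping (tidyr) through parsing (readr) and summarising (dplyr) to plotting (ggplot2).

        library(tidyverse)

        billboard %>%
          pivot_longer(starts_with("wk"),               # tidyr: reshape to long form
                       names_to = "week", values_to = "rank",
                       values_drop_na = TRUE) %>%
          mutate(week = parse_number(week)) %>%         # readr: parse "wk3" -> 3
          group_by(week) %>%
          summarise(median_rank = median(rank)) %>%     # dplyr: summarise
          ggplot(aes(week, median_rank)) +              # ggplot2: visualise
          geom_line()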

    How can humans leverage machine learning? From Medical Data Wrangling to Learning to Defer to Multiple Experts

    International Doctoral Mention. The irruption of the smartphone into everyone’s life, and the ease with which we digitise or record any data, has led to an explosion in the quantity of data. Smartphones, equipped with advanced cameras and sensors, have empowered individuals to capture moments and contribute to the growing pool of data. This data-rich landscape holds great promise for research, decision-making, and personalized applications: by carefully analyzing and interpreting this wealth of information, valuable insights, patterns, and trends can be uncovered. However, big data is worthless in a vacuum; its potential value is unlocked only when leveraged to drive decision-making. In recent times we have witnessed the outburst of artificial intelligence: the development of computer systems and algorithms capable of perceiving, reasoning, learning, and problem-solving, emulating certain aspects of human cognitive abilities. Nevertheless, our focus tends to be limited, merely skimming the surface of the problem, while in reality the application of machine learning models to data is usually fraught with difficulties. More specifically, two crucial pitfalls are frequently neglected in the field of machine learning: the quality of the data and the erroneous assumption that machine learning models operate autonomously. These two issues form the motivation driving this thesis, which strives to offer solutions to two major associated challenges: 1) dealing with irregular observations and 2) learning when and whom to trust.

    The first challenge originates from our observation that the majority of machine learning research primarily concentrates on handling regular observations, neglecting a crucial technological obstacle encountered in practical big-data scenarios: the aggregation and curation of heterogeneous streams of information. Before applying machine learning algorithms, it is crucial to establish robust techniques for handling big data, as this aspect presents a notable bottleneck in the creation of robust algorithms. Data wrangling, which encompasses the extraction, integration, and cleaning processes necessary for data analysis, plays a crucial role in this regard. Therefore, the first objective of this thesis is to tackle the frequently disregarded challenge of addressing irregularities in the context of medical data. We will focus on three specific aspects. Firstly, we will tackle the issue of missing data by developing a framework that facilitates the imputation of missing data points using relevant information derived from alternative data sources or past observations. Secondly, we will move beyond the assumption of homogeneous observations, where only one statistical data type (such as Gaussian) is considered, and instead work with heterogeneous observations, meaning that different data sources can be represented by various statistical likelihoods, such as Gaussian, Bernoulli, categorical, etc. Lastly, considering the temporal enrichment of today’s collected data and our focus on medical data, we will develop a novel algorithm capable of capturing and propagating correlations among different data streams over time. All three problems are addressed in our first contribution, which involves the development of a novel method based on Deep Generative Models (DGM) using Variational Autoencoders (VAE).
    The proposed model, the Sequential Heterogeneous Incomplete VAE (Shi-VAE), enables the aggregation of multiple heterogeneous data streams in a modular manner, taking into consideration the presence of potential missing data. To demonstrate the feasibility of our approach, we present proof-of-concept results obtained from a real database generated through continuous passive monitoring of psychiatric patients.

    Our second challenge relates to the mistaken belief that machine learning algorithms can perform independently. The notion that AI systems can solely account for automated decision-making, especially in critical domains such as healthcare, is far from reality. Our focus now shifts towards a specific scenario where the algorithm has the ability to make predictions independently or, alternatively, defer the responsibility to a human expert. The purpose of including the human is not just to obtain better performance, but also predictions that are more reliable and trustworthy. In reality, however, important decisions are not made by one person but are usually made by an ensemble of human experts. With this in mind, two important questions arise: 1) when should the human or the machine bear responsibility, and 2) among the experts, whom should we trust? To answer the first question, we employ a recent theory known as Learning to Defer (L2D). In L2D we are not only interested in abstaining from prediction but also in understanding the human’s confidence in making such a prediction, thus deferring only when the human is more likely to be correct. The second question, whom to defer to among a pool of experts, has not yet been answered in the L2D literature, and this is what our contributions aim to provide. First, we extend the two consistent surrogate losses proposed so far in the L2D literature to the multiple-expert setting. Second, we study the frameworks’ ability to estimate the probability that a given expert predicts correctly and assess whether the two surrogate losses are confidence calibrated. Finally, we propose a conformal inference technique that chooses a subset of experts to query when the system defers. Ensembling experts based on confidence levels is vital to optimize human-machine collaboration.

    In conclusion, this doctoral thesis has investigated two cases where humans can leverage the power of machine learning: first, as a tool to assist in data wrangling and data understanding problems, and second, as a collaborative tool where decision-making can be automated by the machine or delegated to human experts, fostering more transparent and trustworthy solutions.
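
    An illustrative sketch, not the thesis implementation, of the deferral decision described above: given the classifier's confidence and per-expert estimates of the probability of being correct (the quantities the surrogate losses are meant to provide), the system either predicts itself or defers to the most reliable expert. All numbers are invented.

        model_conf  <- 0.72                  # classifier's confidence in its own prediction
        expert_conf <- c(expert_A = 0.65,    # estimated P(expert is correct | x)
                         expert_B = 0.81,
                         expert_C = 0.58)

        if (max(expert_conf) > model_conf) {
          cat("Defer to", names(which.max(expert_conf)), "\n")
        } else {
          cat("Machine predicts\n")
        }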
    Doctoral Programme in Multimedia and Communications, Universidad Carlos III de Madrid and Universidad Rey Juan Carlos. Thesis committee: Joaquín Míguez Arenas (chair), Juan José Murillo Fuentes (secretary), Mélanie Natividad Fernández Pradie (member).