38 research outputs found

    Data Science for All: Apache Spark & Jupyter Notebooks

    Get PDF
    The Nation's research enterprise faces a shortage of data scientists. Expanding the pipeline of data science students, particularly from underrepresented populations, requires educational institutions to increase awareness of data science and inspire a passion for data in students as they begin their academic careers. In this tutorial we discuss the development and delivery of a free seminar designed to provide hands-on lessons in the use of both Apache Spark and Jupyter notebooks to students from any academic background in an approachable, no-risk environment. An explanation of the seminar resources, exercises, and implementation guidelines is included, as are lessons learned from several successful seminars held both in person and virtually at two institutions of higher education.

    Data Science for all : a stroll in the foothills.

    Get PDF
    Data science presents both opportunities and threats to conventional statistics courses. Opportunities include being at the bleeding edge of data analysis and learning new ways to model phenomena; threats include the challenge of learning new skills and reviewing fundamental assumptions about explanation, prediction and modeling. Powerful data visualisations make it easier to introduce students to fundamental statistical ideas associated with multivariate data. Data science provides methods to tackle problems that are intractable using analytic methods. Students need to learn how to model complex problems, and to understand the problematic nature of modeling – and they need to consider the practical and ethical implications of their (and others’) work. Here, we offer a stroll into the foothills, along with aphorisms and heuristics for data analysts.

    1st Data Science Symposium, GEOMAR

    Get PDF

    Exploring Approaches to Data Literacy Through a Critical Race Theory Perspective

    Get PDF
    In this paper, we describe and analyze a workshop developed for a work training program called DataWorks. In this workshop, data workers chose a topic of their interest, sourced and processed data on that topic, and used that data to create presentations. Drawing from discourses of data literacy; epistemic agency and lived experience; and critical race theory, we analyze the workshop’s activities and outcomes. Through this analysis, three themes emerge: the tensions between epistemic agency and the context of work, encountering the ordinariness of racism through data work, and understanding the personal as communal and intersectional. Finally, critical race theory also prompts us to consider the very notions of data literacy that undergird our workshop activities. From this analysis, we offer a series of suggestions for designing data literacy activities that take critical race theory into account.

    Locating Ethics in Data Science: Responsibility and Accountability in Global and Distributed Knowledge Production

    Get PDF
    This is the author accepted manuscript. The final version is available from the Royal Society via the DOI in this record. The distributed and global nature of data science creates challenges for evaluating the quality, import and potential impact of the data and knowledge claims being produced. This has significant consequences for the management and oversight of responsibilities and accountabilities in data science. In particular, it makes it difficult to determine who is responsible for what output, and how such responsibilities relate to each other; what ‘participation’ means and which accountabilities it involves, with regards to data ownership, donation and sharing as well as data analysis, re-use and authorship; and whether the trust placed on automated tools for data mining and interpretation is warranted (especially since data processing strategies and tools are often developed separately from the situations of data use where ethical concerns typically emerge). To address these challenges, this paper advocates a participative, reflexive management of data practices. Regulatory structures should encourage data scientists to examine the historical lineages and ethical implications of their work at regular intervals. They should also foster awareness of the multitude of skills and perspectives involved in data science, highlighting how each perspective is partial and in need of confrontation with others. This approach has the potential to improve not only the ethical oversight for data science initiatives, but also the quality and reliability of research outputs. This research was funded by the European Research Council grant award 335925 (“The Epistemology of Data-Intensive Science”), the Leverhulme Trust Grant number RPG-2013-153 and the Australian Research Council, Discovery Project DP160102989.

    Understanding Effective Use of Big Data: Challenges and Capabilities (A Management Perspective)

    Get PDF
    While prior research has provided insights into challenges and capabilities related to effective Big Data use, much of that work has been conceptual in nature. The aim of this study is to explore such challenges and capabilities through an empirical approach. Accordingly, this paper reports on a multiple case study involving eight organizations from the private and public sectors. The study provides empirical support for capabilities and challenges identified in prior research and contributes additional insights, viz. a problem-driven approach, time to value, data readiness, data literacy, data misuse, operational agility, and organizational maturity assessment.

    Humanized data cleaning

    Get PDF
    Master's dissertation in Informatics Engineering (Mestrado Integrado em Engenharia Informática). Data science has become one of the most important skills a person can have in the modern world, as data takes an increasingly meaningful role in our lives. The accessibility of data science is, however, limited, requiring complicated software or programming knowledge. Both can be challenging and hard to master, even for simpler tasks. Currently, in order to clean data you need a data scientist. The process of data cleaning, which consists of removing or correcting entries of a data set, usually requires programming knowledge, as it is mostly performed using programming languages such as Python and R. However, if this barrier were removed, data cleaning could be performed by people who possess better knowledge of the data domain but lack a programming background. We studied current solutions available on the market and the type of interface each one uses to interact with end users, such as control-flow interfaces, tabular interfaces, and block-based languages. With this in mind, we approached this issue by providing a new data science tool, termed Data Cleaning for All (DCA), that attempts to reduce the knowledge necessary to perform data science tasks, in particular data cleaning and curation. By combining Human-Computer Interaction (HCI) concepts, this tool is simple to use, through direct manipulation and transformation previews; saves users time by eliminating repetitive tasks and automatically computing many of the common analyses data scientists must perform; and suggests data transformations based on the contents of the data, allowing for a smarter environment.
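    The content-based transformation suggestions described in the DCA abstract can be illustrated with a small rule-based sketch. This is not DCA's actual code (the dissertation does not publish its implementation here); the function name and the three rules are illustrative assumptions about what such a suggester might check.

```python
# Illustrative sketch (not DCA's implementation): rule-based cleaning
# suggestions driven purely by the contents of a column of raw values.

def suggest_transformations(column):
    """Inspect a list of raw cell values and return suggested cleaning steps."""
    suggestions = []
    non_empty = [v for v in column if v not in ("", None)]

    # Rule 1 - missing values: suggest filling or dropping incomplete rows.
    if len(non_empty) < len(column):
        suggestions.append("fill or drop missing values")

    # Rule 2 - numbers stored as text (with , or . decimals): suggest conversion.
    def is_number(v):
        try:
            float(str(v).replace(",", "."))
            return True
        except ValueError:
            return False
    if non_empty and all(is_number(v) for v in non_empty):
        suggestions.append("convert column to numeric")

    # Rule 3 - repeated labels that differ only in letter case: suggest normalising.
    lowered = {str(v).lower() for v in non_empty}
    if len(lowered) < len(set(map(str, non_empty))):
        suggestions.append("normalise letter case")

    return suggestions

print(suggest_transformations(["1.5", "2,3", None, "4"]))
# → ['fill or drop missing values', 'convert column to numeric']
```

    Each suggestion would be paired with a transformation preview in a direct-manipulation interface, so the user confirms the change without writing code.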

    Dynamic accessibility by car to tertiary care emergency services in Cali, Colombia, in 2020 : cross-sectional equity analyses using travel time big data from a Google API

    Get PDF
    Objectives: To test a new approach to characterise accessibility to tertiary care emergency health services in urban Cali and assess the links between accessibility and sociodemographic factors relevant to health equity. Design: The impact of traffic congestion on accessibility to tertiary care emergency departments was studied with an equity perspective, using a web-based digital platform that integrated publicly available digital data, including sociodemographic characteristics of the population and places of residence with travel times. Setting and participants: Cali, Colombia (population 2.258 million in 2020), using geographic and sociodemographic data. The study used predicted travel times downloaded for a week in July 2020 and a week in November 2020. Primary and secondary outcomes: The share of the population within a 15 min journey by car from the place of residence to the tertiary care emergency department with the shortest journey (ie, the 15 min accessibility rate (15mAR)) at peak-traffic congestion hours. Sociodemographic characteristics were disaggregated for equity analyses. A time-series bivariate analysis explored accessibility rates versus housing stratification. Results: Traffic congestion sharply reduces accessibility to tertiary emergency care (eg, 15mAR was 36.8% during peak-traffic hours vs 84.4% during free-flow hours for the week of 6-12 July 2020). The greatest impact fell on specific ethnic groups, people with less educational attainment and those living in low-income households or on the periphery of Cali (15mAR: 8.1% peak traffic vs 51% free-flow traffic). These populations face longer average travel times to health services than the average population. Conclusions: These findings suggest that health services and land use planning should prioritise travel times over travel distance and integrate them into urban planning. Existing technology and data can reveal inequities by integrating sociodemographic data with accurate estimates of travel times to health services, providing the basis for valuable indicators.
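    The 15 min accessibility rate (15mAR) defined in the abstract reduces to a simple population-weighted share once zone-level travel times are in hand. The sketch below assumes hypothetical zones, populations, and travel times (the study's actual data came from a Google API and are not reproduced here).

```python
# Hedged sketch of the 15 min accessibility rate (15mAR): the percentage of
# the population whose car journey to the nearest tertiary care emergency
# department takes at most 15 minutes. All numbers below are hypothetical.

def accessibility_rate(zones, threshold_min=15):
    """zones: list of (population, travel_time_in_minutes_to_nearest_ED)."""
    total = sum(pop for pop, _ in zones)
    within = sum(pop for pop, t in zones if t <= threshold_min)
    return 100.0 * within / total

# Peak-traffic vs free-flow travel times for the same (hypothetical) zones.
peak = [(10_000, 12), (20_000, 25), (5_000, 40)]
free_flow = [(10_000, 6), (20_000, 14), (5_000, 22)]

print(accessibility_rate(peak))       # peak-hour 15mAR
print(accessibility_rate(free_flow))  # free-flow 15mAR
```

    Comparing the two rates for the same zones is what exposes the congestion effect the study reports; disaggregating `zones` by sociodemographic group would yield the equity breakdown.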