Data Science for All: Apache Spark & Jupyter Notebooks
The Nation's research enterprise faces a shortage of data scientists. Expanding the pipeline of data science students, particularly from underrepresented populations, requires educational institutions to increase awareness of data science and inspire a passion for data in students as they begin their academic careers. In this tutorial we discuss the development and delivery of a free seminar designed to provide hands-on lessons in the use of both Apache Spark and Jupyter notebooks to students from any academic background in an approachable, no-risk environment. An explanation of the seminar resources, exercises, and implementation guidelines is included, as are lessons learned from several successful seminars held both in-person and virtually at two institutions of higher education.
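A canonical first exercise in a seminar like this is a word count, which introduces the map/reduce style of thinking that Spark's `flatMap` and `reduceByKey` operations build on. The sketch below is a hypothetical plain-Python warm-up of the kind such a seminar might start with (the sample lines are invented), not the seminar's actual material:

```python
from collections import Counter

def word_count(lines):
    """Count word frequencies across lines of text --
    the classic warm-up before the Spark version."""
    counts = Counter()
    for line in lines:
        counts.update(line.lower().split())
    return counts

# Hypothetical notebook cell: count words in a tiny sample.
sample = ["to be or not to be", "data is the new oil"]
print(word_count(sample))
```

In a Spark notebook the same logic is typically expressed over an RDD as `rdd.flatMap(str.split).map(lambda w: (w, 1)).reduceByKey(operator.add)`, which is where the distributed version of the exercise would pick up.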
Data Science for All: a stroll in the foothills.
Data science presents both opportunities and threats to conventional statistics courses.
Opportunities include being at the bleeding edge of data analysis, and learning new ways to model
phenomena; threats include the challenge of learning new skills and reviewing fundamental
assumptions about explanation, prediction and modeling. Powerful data visualisations make it
easier to introduce students to fundamental statistical ideas associated with multivariate data.
Data science provides methods to tackle problems that are intractable using analytic methods.
Students need to learn how to model complex problems, and to understand the problematic nature
of modeling – and they need to consider the practical and ethical implications of their (and others’)
work. Here, we offer a stroll into the foothills, along with aphorisms and heuristics for data
analysts.
Exploring Approaches to Data Literacy Through a Critical Race Theory Perspective
In this paper, we describe and analyze a workshop developed for a work training program called DataWorks. In this workshop, data workers chose a topic of their interest, sourced and processed data on that topic, and used that data to create presentations. Drawing from discourses of data literacy; epistemic agency and lived experience; and critical race theory, we analyze the workshop's activities and outcomes. Through this analysis, three themes emerge: the tensions between epistemic agency and the context of work, encountering the ordinariness of racism through data work, and understanding the personal as communal and intersectional. Finally, critical race theory also prompts us to consider the very notions of data literacy that undergird our workshop activities. From this analysis, we offer a series of suggestions for approaching the design of data literacy activities, taking into account critical race theory.
Locating Ethics in Data Science: Responsibility and Accountability in Global and Distributed Knowledge Production
This is the author accepted manuscript. The final version is available from the Royal Society via the DOI in this record. The distributed and global nature of data science creates challenges for evaluating the
quality, import and potential impact of the data and knowledge claims being produced.
This has significant consequences for the management and oversight of responsibilities and
accountabilities in data science. In particular, it makes it difficult to determine who is
responsible for what output, and how such responsibilities relate to each other; what
‘participation’ means and which accountabilities it involves, with regards to data
ownership, donation and sharing as well as data analysis, re-use and authorship; and
whether the trust placed on automated tools for data mining and interpretation is
warranted (especially since data processing strategies and tools are often developed
separately from the situations of data use where ethical concerns typically emerge). To
address these challenges, this paper advocates a participative, reflexive management of data
practices. Regulatory structures should encourage data scientists to examine the historical
lineages and ethical implications of their work at regular intervals. They should also foster
awareness of the multitude of skills and perspectives involved in data science, highlighting
how each perspective is partial and in need of confrontation with others. This approach has
the potential to improve not only the ethical oversight for data science initiatives, but also
the quality and reliability of research outputs. This research was funded by the European Research Council grant award 335925 ("The Epistemology of Data-Intensive Science"), the Leverhulme Trust grant number RPG-2013-153, and the Australian Research Council, Discovery Project DP160102989.
Understanding Effective Use of Big Data: Challenges and Capabilities (A Management Perspective)
While prior research has provided insights into challenges and capabilities related to effective Big Data use, much of this contribution has been conceptual in nature. The aim of this study is to explore such challenges and capabilities through an empirical approach. Accordingly, this paper reports on a multiple case study approach, involving eight organizations from the private and public sectors. The study provides empirical support for capabilities and challenges identified through prior research and identifies additional insights, namely: a problem-driven approach, time to value, data readiness, data literacy, data misuse, operational agility, and organizational maturity assessment.
Humanized data cleaning
Master's dissertation (Mestrado Integrado) in Informatics Engineering. Data science has become one of the most important skills a person can have in the modern world, as data takes an increasingly meaningful role in our lives. The accessibility of data science is, however, limited, requiring complicated software or programming knowledge. Both can be challenging and hard to master, even for the simpler tasks.
Currently, in order to clean data you need a data scientist. The process of data cleaning, which consists of removing or correcting entries of a data set, usually requires programming knowledge, as it is mostly performed using programming languages such as Python and R (kag). However, if this barrier were removed, data cleaning could be performed by people who possess better knowledge of the data domain but lack a programming background.
We studied current solutions available on the market and the type of interface each uses to interact with end users, such as control-flow interfaces, tabular interfaces, or block-based languages. With this in mind, we approached the issue by providing a new data science tool, termed Data Cleaning for All (DCA), that attempts to reduce the knowledge necessary to perform data science tasks, in particular data cleaning and curation. By combining Human-Computer Interaction (HCI) concepts, the tool is simple to use, through direct manipulation and transformation previews; saves users time, by eliminating repetitive tasks and automatically computing many of the common analyses data scientists must perform; and suggests data transformations based on
the contents of the data, allowing for a smarter environment.
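The cleaning operations the abstract describes, removing or correcting entries of a data set, can be sketched in a few lines of plain Python. The records, column names, and rules below are hypothetical, standing in for the kinds of transformations a tool like DCA would preview and suggest:

```python
# Minimal data-cleaning sketch: drop incomplete rows and
# normalise text fields -- the kind of transformation a tool
# like DCA would show as a preview before applying.
rows = [
    {"name": "  Alice ", "age": "34"},
    {"name": "Bob", "age": None},   # incomplete entry: removed
    {"name": "CAROL", "age": "29"},
]

def clean(rows):
    cleaned = []
    for row in rows:
        if any(v is None for v in row.values()):
            continue                              # remove entries with missing values
        cleaned.append({
            "name": row["name"].strip().title(),  # correct formatting
            "age": int(row["age"]),               # coerce to the right type
        })
    return cleaned

print(clean(rows))
# → [{'name': 'Alice', 'age': 34}, {'name': 'Carol', 'age': 29}]
```

A tool in the spirit of DCA would surface each of these steps (drop, trim, retype) as a suggested, previewable transformation rather than as code.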
Dynamic accessibility by car to tertiary care emergency services in Cali, Colombia, in 2020: cross-sectional equity analyses using travel time big data from a Google API
Objectives: To test a new approach to characterise accessibility to tertiary care emergency health services in urban Cali and assess the links between accessibility and sociodemographic factors relevant to health equity. Design: The impact of traffic congestion on accessibility to tertiary care emergency departments was studied with an equity perspective, using a web-based digital platform that integrated publicly available digital data, including sociodemographic characteristics of the population and places of residence with travel times. Setting and participants: Cali, Colombia (population 2.258 million in 2020), using geographic and sociodemographic data. The study used predicted travel times downloaded for a week in July 2020 and a week in November 2020. Primary and secondary outcomes: The share of the population within a 15 min journey by car from the place of residence to the tertiary care emergency department with the shortest journey (i.e., the 15 min accessibility rate (15mAR)) at peak-traffic congestion hours. Sociodemographic characteristics were disaggregated for equity analyses. A time-series bivariate analysis explored accessibility rates versus housing stratification. Results: Traffic congestion sharply reduces accessibility to tertiary emergency care (e.g., 15mAR was 36.8% during peak-traffic hours vs 84.4% during free-flow hours for the week of 6-12 July 2020). The greatest impact fell on specific ethnic groups, people with less educational attainment, and those living in low-income households or on the periphery of Cali (15mAR: 8.1% peak traffic vs 51% free-flow traffic). These populations face longer average travel times to health services than the average population. Conclusions: These findings suggest that health services and land use planning should prioritise travel times over travel distance and integrate them into urban planning.
Existing technology and data can reveal inequities by integrating sociodemographic data with accurate estimates of travel times to health services, providing the basis for valuable indicators.
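The study's primary outcome, the 15 min accessibility rate (15mAR), is the population-weighted share of residents whose car journey to the nearest tertiary emergency department is 15 minutes or less. Assuming zone-level inputs (the zone populations and travel times below are invented for illustration, not the paper's data), the computation reduces to a weighted share:

```python
# 15mAR sketch: population-weighted share of residents within
# 15 minutes' drive of the nearest tertiary emergency department.
# Zone populations and travel times here are illustrative only.
zones = [
    {"population": 50_000, "minutes_to_nearest_ed": 9},
    {"population": 30_000, "minutes_to_nearest_ed": 22},
    {"population": 20_000, "minutes_to_nearest_ed": 14},
]

def accessibility_rate(zones, threshold_min=15):
    """Share of total population whose travel time to the
    nearest emergency department is within the threshold."""
    total = sum(z["population"] for z in zones)
    within = sum(z["population"] for z in zones
                 if z["minutes_to_nearest_ed"] <= threshold_min)
    return within / total

print(f"{accessibility_rate(zones):.1%}")  # → 70.0%
```

Re-running the same computation with peak-hour versus free-flow travel times is what produces contrasts like the paper's 36.8% vs 84.4%.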