1,958 research outputs found

    Guidelines For Pursuing and Revealing Data Abstractions

    Many data abstraction types, such as networks or set relationships, remain unfamiliar to data workers beyond the visualization research community. We conduct a survey and a series of interviews about how people describe their data, either directly or indirectly. We refer to the latter as latent data abstractions. We conduct a Grounded Theory analysis that (1) interprets the extent to which latent data abstractions exist, (2) reveals the far-reaching effects that the interventionist pursuit of such abstractions can have on data workers, (3) describes why and when data workers may resist such explorations, and (4) suggests how to take advantage of opportunities and mitigate risks through transparency about visualization research perspectives and agendas. We then use the themes and codes discovered in the Grounded Theory analysis to develop guidelines for data abstraction in visualization projects. To continue the discussion, we make our dataset openly available along with a visual interface for further exploration.

    FlashProfile: A Framework for Synthesizing Data Profiles

    We address the problem of learning a syntactic profile for a collection of strings, i.e. a set of regex-like patterns that succinctly describe the syntactic variations in the strings. Real-world datasets, typically curated from multiple sources, often contain data in various syntactic formats. Thus, any data processing task is preceded by the critical step of data format identification. However, manual inspection of data to identify the different formats is infeasible in standard big-data scenarios. Prior techniques are restricted to a small set of pre-defined patterns (e.g. digits, letters, words, etc.) and provide no control over the granularity of profiles. We define syntactic profiling as a problem of clustering strings based on syntactic similarity, followed by identifying patterns that succinctly describe each cluster. We present a technique for synthesizing such profiles over a given language of patterns that also allows for interactive refinement by requesting a desired number of clusters. Using a state-of-the-art inductive synthesis framework, PROSE, we have implemented our technique as FlashProfile. Across 153 tasks over 75 large real datasets, we observe a median profiling time of only ∼0.7 s. Furthermore, we show that access to syntactic profiles may allow for more accurate synthesis of programs, i.e. using fewer examples, in programming-by-example (PBE) workflows such as FlashFill. Comment: 28 pages, SPLASH (OOPSLA) 201
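    To make the clustering-plus-patterns formulation concrete, here is a deliberately simplified sketch in R. It is not FlashProfile's PROSE-based synthesis; it only maps each string to a coarse syntactic signature (digits, letters, punctuation), treats identical signatures as a cluster, and reports each cluster as a regex-like pattern with its frequency.

    # Minimal sketch of syntactic profiling (illustrative only, not FlashProfile's algorithm).
    syntactic_signature <- function(x) {
      # Classify every character: digits -> "d", letters -> "a", punctuation kept as-is.
      classes <- gsub("[0-9]", "d", x)
      classes <- gsub("[A-Za-z]", "a", classes)
      # Run-length encode the classes, e.g. "aa-dddd" -> "[A-Za-z]{2}-[0-9]{4}".
      runs <- rle(strsplit(classes, "")[[1]])
      chunk <- ifelse(runs$values == "d", paste0("[0-9]{", runs$lengths, "}"),
               ifelse(runs$values == "a", paste0("[A-Za-z]{", runs$lengths, "}"),
                      runs$values))
      paste0(chunk, collapse = "")
    }

    profile_strings <- function(strings) {
      sigs <- vapply(strings, syntactic_signature, character(1))
      sort(table(sigs), decreasing = TRUE)  # one regex-like pattern per syntactic cluster
    }

    profile_strings(c("PB-1234", "PB-5678", "2019-01-02", "N/A"))
    # Three clusters: "[A-Za-z]{2}-[0-9]{4}" (x2), "[0-9]{4}-[0-9]{2}-[0-9]{2}" (x1), "[A-Za-z]{1}/[A-Za-z]{1}" (x1)

    A real profiler such as FlashProfile additionally searches a richer pattern language for the most succinct description of each cluster and lets the user interactively request a coarser or finer number of clusters.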

    A Grammar for Reproducible and Painless Extract-Transform-Load Operations on Medium Data

    Many interesting data sets available on the Internet are of a medium size---too big to fit into a personal computer's memory, but not so large that they won't fit comfortably on its hard disk. In the coming years, data sets of this magnitude will inform vital research in a wide array of application domains. However, due to a variety of constraints they are cumbersome to ingest, wrangle, analyze, and share in a reproducible fashion. These obstructions hamper thorough peer-review and thus disrupt the forward progress of science. We propose a predictable and pipeable framework for R (the state-of-the-art statistical computing environment) that leverages SQL (the venerable database architecture and query language) to make reproducible research on medium data a painless reality. Comment: 30 pages, plus supplementary material
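    The workflow the abstract describes, R orchestrating SQL so the data never has to fit in memory, can be sketched with off-the-shelf DBI and dplyr tooling. The snippet below is an illustration of that idea under assumed file and column names, not the paper's own grammar or package API.

    # Illustrative medium-data ETL in R: stage a CSV in SQLite, then wrangle it lazily via SQL.
    library(DBI)
    library(RSQLite)
    library(dplyr)    # tbl() on a DBI connection also requires the dbplyr backend

    con <- dbConnect(RSQLite::SQLite(), "medium_data.sqlite3")

    # Extract + Load: RSQLite can import a delimited file directly, so the table is built
    # without reading the whole file into R's memory.
    # "flights.csv" and its columns (origin, dep_delay) are placeholder names.
    dbWriteTable(con, "flights", "flights.csv", overwrite = TRUE)

    # Transform: this pipeline is translated to SQL and executed in the database;
    # only the small summary table is pulled into R by collect().
    delays <- tbl(con, "flights") |>
      group_by(origin) |>
      summarise(mean_delay = mean(dep_delay, na.rm = TRUE)) |>
      collect()

    dbDisconnect(con)

    Because both the staging step and the transformations are plain code, the entire pipeline can be re-run from the raw file, which is what makes an analysis on medium data reproducible and shareable.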

    Cost Effective Analysis of Big Data

    Executive Summary: Big data is everywhere, and businesses that can access and analyze it have a huge advantage over those that can’t. One option for leveraging big data to make more informed decisions is to hire a big data consulting company to take over the entire project. This method requires the least effort, but it is also the least cost effective. The problem is that the know-how needed to start a big data project is not widespread, and the consulting alternative is not very cost effective. This creates the need for a cost-effective approach that businesses can use to start and manage big data projects.

    This report details the development of an advisory tool that cuts down on the consulting costs of big data projects by enabling a business to take an active role in the project itself. The tool is not a set of standard operating procedures, but simply a guide for someone to follow when embarking on a big data project. The advisory tool has three steps: data wrangling, statistical analysis, and data engineering. Data wrangling is the process of cleaning and organizing data into a format that is ready for statistical analysis; the guide recommends the open-source software and programming language R. The statistical analysis step takes the form of exploratory data analysis and the use of existing models and algorithms. Existing methods should always be pushed to their best performance before the costs of paid big data analytics and the development of new algorithms can be justified. Data engineering consists of creating and applying statistical algorithms, utilizing cloud infrastructure to distribute processing, and developing a complete platform solution.

    The experimentation for the design of the advisory tool was carried out through the analysis of many large data sets. The data sets were analyzed to determine the best explanatory variables for predicting a selected response. The iterative process of data wrangling, statistical analysis, and model building was carried out for all the data sets. The experience gained through these iterations of wrangling and exploratory analysis was extremely valuable in evaluating the usefulness of the design, and the statistical analysis improved every time the loop was navigated.

    In-house data wrangling, performed before submission to a data scientist, is the primary cost justification for using the advisory tool. Data wrangling typically occupies 80% of a data scientist’s time in big data projects, so if the data are wrangled before a data scientist receives them, less of the data scientist’s time is spent wrangling. Since data scientists are paid very high hourly wages, every hour saved on wrangling translates into direct cost savings, provided the wrangling performed beforehand is of adequate quality.

    The results of applying the advisory tool will vary from case to case, depending on the critical skills the user possesses and develops. The critical skills begin with coding in R and Python and knowledge of the chosen statistical methods; basic knowledge of statistics and of a programming language is a must before using this guide, and statistical proficiency is the limiting factor. The best way to start a big data project on one’s own is to first learn R and become familiar with the statistical libraries it contains, which allows data wrangling and exploratory analysis to be performed at a high level. This project pushed the boundaries of what can be done with big data on a traditional computing framework without cloud usage; the storage and processing limits of traditional computers were tested and in some cases reached, which confirmed the eventual need to operate in a cloud environment.
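    As a minimal illustration of the wrangle-then-explore loop recommended above, a first pass in base R might look like the sketch below; the file and column names are placeholders, and each trip through the loop would refine the cleaning step and the shortlist of explanatory variables.

    # Sketch of one iteration of the data wrangling / exploratory analysis loop.
    # "raw_data.csv", "response", "x1", and "x2" are placeholder names.
    raw <- read.csv("raw_data.csv", stringsAsFactors = FALSE)

    # Wrangling: keep complete rows and make sure predictors have the right types.
    clean <- raw[complete.cases(raw), ]
    clean$x1 <- as.numeric(clean$x1)

    # Exploratory analysis: summaries and pairwise plots to shortlist explanatory variables.
    summary(clean)
    pairs(clean[, c("response", "x1", "x2")])

    # Existing methods first: a simple baseline model before any custom algorithm
    # or cloud infrastructure is considered.
    baseline <- lm(response ~ x1 + x2, data = clean)
    summary(baseline)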

    An Educator’s Perspective of the Tidyverse

    Computing makes up a large and growing component of data science and statistics courses. Many of those courses, especially when taught by faculty who are statisticians by training, teach R as the programming language. A number of instructors have opted to build much of their teaching around use of the tidyverse. The tidyverse, in the words of its developers, “is a collection of R packages that share a high-level design philosophy and low-level grammar and data structures, so that learning one package makes it easier to learn the next” (Wickham et al. 2019). These shared principles have led to the widespread adoption of the tidyverse ecosystem. Much of this adoption stems from the fact that tidyverse tools have been intentionally designed to ease the learning process and to make it easier for users to learn new functions as they engage with additional pieces of the larger ecosystem. Moreover, the functionality offered by the packages within the tidyverse spans the entire data science cycle, which includes data import, visualisation, wrangling, modeling, and communication. We believe the tidyverse provides an effective and efficient pathway for undergraduate students at all levels and majors to gain the computational skills and thinking needed throughout the data science cycle. In this paper, we introduce the tidyverse from an educator’s perspective. We provide a brief introduction to the tidyverse, demonstrate how foundational statistics and data science tasks are accomplished with the tidyverse, and discuss the strengths of the tidyverse, particularly in the context of teaching and learning.
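    As a small, self-contained example of the kind of pipeline the paper has in mind, the snippet below runs one wrangle-summarise-visualise chain with tidyverse verbs; the palmerpenguins teaching data set is an assumption here, not something named in the abstract.

    # One tidyverse pipeline spanning wrangling, summarising, and visualisation.
    library(tidyverse)
    library(palmerpenguins)   # a common teaching data set, assumed here for illustration

    penguins |>
      drop_na(body_mass_g, sex) |>
      group_by(species, sex) |>
      summarise(mean_mass = mean(body_mass_g), .groups = "drop") |>
      ggplot(aes(x = species, y = mean_mass, fill = sex)) +
      geom_col(position = "dodge") +
      labs(x = "Species", y = "Mean body mass (g)")

    Each verb in the chain does one small, named task, which illustrates the shared design philosophy that the developers credit with making each additional package easier to learn.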
