4 research outputs found

Automating the Capture of Data Transformation Metadata from Statistical Analysis Software

Author: Alter George
Donakowski Darrell
Gager Jack
Heus Pascal
Hunter Carson
Ionescu Sanda
Iverson Jeremy
Jagadish H V
Lagoze Carl
Lyle Jared
Mueller Alexander
Revheim Sigbjørn
Richardson Matthew
Risnes Ørnulf
Seelam Karunakara
Smith Dan
Smith Tom
Song Jie
Vaidya Yashas Jaydeep
Voldsater Ole
Publication date: 06/07/2020

The C2Metadata (“Continuous Capture of Metadata for Statistical Data”) Project automates one of the most burdensome aspects of documenting the provenance of research data: describing data transformations performed by statistical software. Researchers in many fields use statistical software (SPSS, Stata, SAS, R, Python) for data transformation and data management as well as analysis. The C2Metadata Project creates a metadata workflow paralleling the data management process by deriving provenance information from scripts used to manage and transform data. C2Metadata differs from most previous data provenance initiatives by documenting transformations at the variable level rather than describing a sequence of opaque programs. Scripts used with statistical software are translated into an independent Structured Data Transformation Language (SDTL), which serves as an intermediate language for describing data transformations. SDTL can be used to add variable-level provenance to data catalogs and codebooks and to create “variable lineages” for auditing software operations. Better data documentation makes research more transparent and expands the discovery and re-use of research data.
National Science Foundation grant ACI-1640575
https://deepblue.lib.umich.edu/bitstream/2027.42/156014/3/Automating_metadata_capture_v15.pd

Deep Blue Documents at the University of Michigan

Provenance Metadata for Statistical Data: An Introduction to Structured Data Transformation Language (SDTL)

Author: Alter George
Donakowski Darrell
Gager Jack
Heus Pascal
Hunter Carson
Ionescu Sanda
Iverson Jeremy
Jagadish H V
Lagoze Carl
Lyle Jared
Mueller Alexander
Revheim Sigbjørn
Richardson Matthew
Risnes Ørnulf
Seelam Karunakara
Smith Dan
Smith Tom
Song Jie
Vaidya Yashas Jaydeep
Voldsater Ole
Publication date: 06/07/2020

Structured Data Transformation Language (SDTL) provides structured, machine-actionable representations of data transformation commands found in statistical analysis software. The Continuous Capture of Metadata for Statistical Data Project (C2Metadata) created SDTL as part of an automated system that captures provenance metadata from data transformation scripts and adds variable derivations to standard metadata files. SDTL also has potential for auditing scripts and for translating scripts between languages. SDTL is expressed in a set of JSON schemas, which are machine-actionable and easily serialized to other formats. Statistical software languages have a number of special features that have been carried into SDTL. We explain how SDTL handles differences among statistical languages and complex operations, such as merging files and reshaping data tables from “wide” to “long”.
National Science Foundation grant ACI-1640575
https://deepblue.lib.umich.edu/bitstream/2027.42/156015/1/SDTL_Intro_v14.pdf
Description of SDTL_Intro_v14.pdf : Main articl
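
To make the idea of variable-level provenance concrete, the following is a minimal sketch of how a "variable lineage" might be derived from machine-actionable transformation records. The JSON field names (`command`, `target`, `sources`, `expression`), the variable names, and the `variable_lineage` helper are all hypothetical illustrations; they are not taken from the actual SDTL schemas.

```python
import json

# Hypothetical, simplified transformation records. The field names below are
# illustrative assumptions and do NOT reproduce the real SDTL JSON schemas.
SCRIPT_JSON = """
[
  {"command": "compute", "target": "bmi",
   "sources": ["weight_kg", "height_m"],
   "expression": "weight_kg / (height_m ** 2)"},
  {"command": "recode", "target": "bmi_cat",
   "sources": ["bmi"],
   "expression": "bmi < 25 -> 0; else -> 1"},
  {"command": "rename", "target": "obese_flag",
   "sources": ["bmi_cat"],
   "expression": "bmi_cat"}
]
"""


def variable_lineage(commands, variable):
    """Return the commands that contributed to `variable`, in script order."""
    needed = {variable}
    lineage = []
    # Walk the script backwards: a command matters only if it defines a
    # variable that something later in the chain depends on.
    for cmd in reversed(commands):
        if cmd["target"] in needed:
            lineage.append(cmd)
            # This definition is resolved; its inputs are now what we need.
            needed.discard(cmd["target"])
            needed.update(cmd["sources"])
    lineage.reverse()
    return lineage


commands = json.loads(SCRIPT_JSON)
for step in variable_lineage(commands, "obese_flag"):
    print(f'{step["target"]} <- {step["command"]}({", ".join(step["sources"])})')
```

Because each record names its own inputs and output, the lineage can be computed without parsing SPSS, Stata, SAS, R, or Python syntax, which is the kind of benefit a software-independent intermediate representation is meant to provide.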

Deep Blue Documents at the University of Michigan

Automated Capture and Description of Data Transformations

Author: Alter George
Heus Pascal
Iverson Jeremy
Lyle Jared
Risnes Ørnulf
Smith Dan
Publication date: 07/04/2017

This presentation, given at the North American Data Documentation Initiative Conference (NADDI) 2017 on April 7, 2017 in Ithaca, New York, describes the C2Metadata project (http://c2metadata.org/), which is developing new tools that will work with common statistical packages (SPSS®, SAS®, Stata®, R) to automate the capture of metadata at the granularity of individual data transformations. Software-independent data transformation descriptions will be added to metadata in two internationally accepted standards, the Data Documentation Initiative (DDI) and Ecological Metadata Language (EML). These tools will create efficiencies and reduce the costs of data collection, preparation, and re-use. Our project targets research communities with strong metadata standards and heavy reliance on statistical analysis software (social and behavioral sciences and earth observation sciences), but it is generalizable to other domains, such as biomedical research. (A similar version of this presentation was also given at the International Association for Social Science Information Services and Technology (IASSIST) 2017 conference in Lawrence, Kansas on May 25, 2017.)
NSF Data Infrastructure Building Blocks (DIBBs) (ACI-1640575)
https://deepblue.lib.umich.edu/bitstream/2027.42/136235/1/NADDI 2017 - Metadata capture project 20160804b.pptx.pptx
Description of NADDI 2017 - Metadata capture project 20160804b.pptx.pptx : Presentatio

Deep Blue Documents at the University of Michigan

C2Metadata: Continuous Capture of Metadata

Author: Alter George
Donakowski Darrell
Gager Jack
Heus Pascal
Ionescu Sanda
Iverson Jeremy
Jagadish H.V.
Lagoze Carl
Lyle Jared
Murphy Tom
Risnes Ørnulf
Smith Dan
Smith Tom
Song Jie
Publication date: 25/07/2017

This poster presentation, given at the 2017 Society of American Archivists (SAA) Research Forum on July 25, 2017 in Portland, Oregon, describes the C2Metadata project to develop new tools that will work with common statistical packages to automate the capture of metadata at the granularity of individual data transformations.
Supported by the Data Infrastructure Building Blocks (DIBBs) program of the National Science Foundation through grant NSF ACI-1640575.
https://deepblue.lib.umich.edu/bitstream/2027.42/145473/1/Lyle_C2Metadata-updated3.pdf
https://deepblue.lib.umich.edu/bitstream/2027.42/145473/3/Lyle_ResearchForumAbstractBio2017_1.pdf
Description of Lyle_C2Metadata-updated3.pdf : Poster presentation
Description of Lyle_ResearchForumAbstractBio2017_1.pdf : Poster presentation abstrac

Deep Blue Documents at the University of Michigan