2 research outputs found

    Provenance Metadata for Statistical Data: An Introduction to Structured Data Transformation Language (SDTL)

    Structured Data Transformation Language (SDTL) provides structured, machine actionable representations of data transformation commands found in statistical analysis software. The Continuous Capture of Metadata for Statistical Data Project (C2Metadata) created SDTL as part of an automated system that captures provenance metadata from data transformation scripts and adds variable derivations to standard metadata files. SDTL also has potential for auditing scripts and for translating scripts between languages. SDTL is expressed in a set of JSON schemas, which are machine actionable and easily serialized to other formats. Statistical software languages have a number of special features that have been carried into SDTL. We explain how SDTL handles differences among statistical languages and complex operations, such as merging files and reshaping data tables from “wide” to “long”.
    National Science Foundation grant ACI-1640575
    https://deepblue.lib.umich.edu/bitstream/2027.42/156015/1/SDTL_Intro_v14.pdf
    Description of SDTL_Intro_v14.pdf: Main article

    Automating the Capture of Data Transformation Metadata from Statistical Analysis Software

    The C2Metadata (“Continuous Capture of Metadata for Statistical Data”) Project automates one of the most burdensome aspects of documenting the provenance of research data: describing data transformations performed by statistical software. Researchers in many fields use statistical software (SPSS, Stata, SAS, R, Python) for data transformation and data management as well as analysis. The C2Metadata Project creates a metadata workflow paralleling the data management process by deriving provenance information from scripts used to manage and transform data. C2Metadata differs from most previous data provenance initiatives by documenting transformations at the variable level rather than describing a sequence of opaque programs. Scripts used with statistical software are translated into an independent Structured Data Transformation Language (SDTL), which serves as an intermediate language for describing data transformations. SDTL can be used to add variable-level provenance to data catalogs and codebooks and to create “variable lineages” for auditing software operations. Better data documentation makes research more transparent and expands the discovery and re-use of research data.
    National Science Foundation grant ACI-1640575
    https://deepblue.lib.umich.edu/bitstream/2027.42/156014/3/Automating_metadata_capture_v15.pd
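
    The “variable lineage” idea in this abstract can be illustrated with a minimal sketch. The record layout below (the `command`, `target`, and `sources` keys) and the `variable_lineage` helper are hypothetical simplifications invented for illustration; they are not the actual SDTL JSON schemas, which the abstract does not spell out. The sketch assumes each transformation command names one derived variable and the variables it was computed from, then walks those links backwards.

```python
import json

# Hypothetical, simplified command records. This is NOT the real SDTL schema;
# the keys "command", "target", and "sources" are assumptions for illustration.
commands_json = """
[
  {"command": "compute", "target": "income_k",   "sources": ["income"]},
  {"command": "compute", "target": "log_income", "sources": ["income_k"]},
  {"command": "recode",  "target": "age_group",  "sources": ["age"]},
  {"command": "compute", "target": "score",      "sources": ["log_income", "age_group"]}
]
"""


def variable_lineage(commands, variable):
    """Return every variable that `variable` was derived from, directly or indirectly."""
    # Map each derived variable to its direct sources
    # (if a variable is assigned twice, the later command wins).
    direct = {c["target"]: c["sources"] for c in commands}

    lineage = set()
    stack = list(direct.get(variable, []))
    while stack:
        current = stack.pop()
        if current not in lineage:  # skip variables already visited
            lineage.add(current)
            stack.extend(direct.get(current, []))
    return lineage


commands = json.loads(commands_json)
print(sorted(variable_lineage(commands, "score")))
```

    Under these assumptions, a variable that appears in the data but is never the target of a command has an empty lineage, which is one simple way an audit could tell original variables from derived ones.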