
    DDI-Lifecycle and Colectica at the UCLA Social Science Data Archive

    Presentation at the North American Data Documentation Conference (NADDI) 2013. The UCLA Social Science Data Archive's (SSDA) mission is to provide a foundation for social science research involving original data collection or the reuse of publicly available studies. Archive staff and researchers work as partners throughout all stages of the research process: when a hypothesis or area of study is being developed, during grant and funding activities, while data collection and/or analysis is ongoing, and finally in the long-term preservation of research results. Three years ago SSDA began to search for a better repository solution to manage its data, make the data more visible, and support the organization's disaster plan. SSDA wanted to make it easier for researchers to find data, document their data, and use data online. Since the goal is to document the entire lifecycle of a data product, the DDI-Lifecycle standard plays a key role in the solution. This paper explores how DDI-Lifecycle and Colectica can help a data archive with limited staff and resources deliver a rich data documentation system that integrates with other tools to allow researchers to discover and understand the data relevant to their work. The paper discusses how SSDA and Colectica staff worked together to implement the solution.
    Institute for Policy & Social Research, University of Kansas; University of Kansas Libraries; Alfred P. Sloan Foundation; Data Documentation Initiative Alliance

    ICPSR Working Paper 4

    This paper provides an overview of a methodology used to identify and organize health questions and measures related to Alzheimer's and other cognitive impairments using data maintained or supported by NACDA. This project specifically used the National Social Life, Health and Aging Project (NSHAP) and the National Health and Aging Trends Study (NHATS) as our comparison proof of concept. The methodology identifies variables that measure Alzheimer's disease (A.D.) and other cognitive impairments within NSHAP and NHATS, as well as sociodemographic and comorbidity data commonly associated with increased risk of A.D. and other cognitive impairments. As both NSHAP and NHATS represent multiple waves of longitudinal follow-up information, we created longitudinal metadata files that allow for the comparison of A.D. and other cognitive impairment risk across time using these two studies. The project generated enhanced metadata using DDI Lifecycle software to make the discovery of A.D. and other cognitive impairment variables more straightforward and to increase the user-friendly elements of these studies. Finally, the proposed supplement included the creation of a customized bibliography (see Appendix) of the use of NSHAP and NHATS data in the analysis of A.D. and other cognitive impairments research, allowing researchers to more easily review the existing body of literature using these data resources. This report describes NACDA's effort to increase the availability, usability, and discoverability of A.D. and other cognitive impairments information in these studies, encouraging use of NSHAP and NHATS for Alzheimer's related research and adding to our understanding of how cognitive issues change across time.
    National Institute on Aging (NIA)
    http://deepblue.lib.umich.edu/bitstream/2027.42/156403/4/NACDA_cross-series_nshap-nhats_ICPSR_working_paper4_aug2020v2.pdf
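    To illustrate the kind of cross-study longitudinal metadata file described above, the sketch below groups hypothetical cognitive-impairment variables from different studies and waves under a shared harmonised concept. The variable names, wave labels, and question wording are placeholders, not actual NSHAP or NHATS codebook entries.

    from dataclasses import dataclass

    @dataclass
    class VariableEntry:
        study: str          # e.g. "NSHAP" or "NHATS"
        wave: str           # data collection wave or round
        name: str           # variable name in that wave's file (hypothetical)
        concept: str        # harmonised concept the variable measures
        question_text: str  # question wording as fielded (hypothetical)

    crosswalk = [
        VariableEntry("NSHAP", "Wave 1", "mem_recall_w1", "episodic_memory",
                      "I am going to read a list of words..."),
        VariableEntry("NHATS", "Round 1", "word_recall_r1", "episodic_memory",
                      "Now I am going to say a list of words..."),
    ]

    # Group by harmonised concept so comparable items across studies and
    # waves can be discovered together.
    by_concept = {}
    for entry in crosswalk:
        by_concept.setdefault(entry.concept, []).append(entry)

    for concept, entries in by_concept.items():
        print(concept, [(e.study, e.wave, e.name) for e in entries])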

    Engineering a machine learning pipeline for automating metadata extraction from longitudinal survey questionnaires

    Data Documentation Initiative-Lifecycle (DDI-L) introduced a robust metadata model to support the capture of questionnaire content and flow and, through support for versioning, provenance, and objects such as BasedOn, encouraged the reuse of existing question items. However, the dearth of questionnaire banks that include both question text and response domains has limited the ecosystem supporting the development of DDI-ready Computer Assisted Interviewing (CAI) tools. Archives hold this information in the PDFs associated with surveys, but extracting it efficiently into DDI-Lifecycle is a significant challenge.
 While CLOSER Discovery has been championing the provision of high-quality questionnaire metadata in DDI-Lifecycle, this has primarily been done manually. More automated methods need to be explored to ensure scalable metadata annotation and uplift.
 This paper presents initial results in engineering a machine learning (ML) pipeline to automate the extraction of questions from survey questionnaires held as PDFs. Using CLOSER Discovery as a ‘training and test dataset’, a number of machine learning approaches were explored to classify parsed text from questionnaires so that it can be output as valid DDI items for inclusion in a DDI-L compliant repository.
 The developed ML pipeline adopts a continuous build and integrate approach, with processes in place to track the various combinations of structured DDI-L input metadata, ML models, and model parameters against the defined evaluation metrics, thus enabling reproducibility and comparative analysis of the experiments. Tangible outputs include a map of the various metadata and model parameters with the corresponding evaluation metric values, which enables model tuning as well as transparent management of data and experiments.
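    As an illustration of the classification step described above (not the pipeline's actual implementation), the sketch below labels text segments parsed from a questionnaire PDF as question text, response domain, or other content. The label set, training examples, and model choice are assumptions.

    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.linear_model import LogisticRegression
    from sklearn.pipeline import Pipeline

    # Hypothetical training data: text segments parsed from questionnaire
    # PDFs, with labels curated from an existing DDI-L repository.
    segments = [
        "How would you rate your health in general?",
        "1. Excellent  2. Good  3. Fair  4. Poor",
        "INTERVIEWER: CODE ONE ONLY",
    ]
    labels = ["QuestionText", "ResponseDomain", "Other"]

    clf = Pipeline([
        ("tfidf", TfidfVectorizer(ngram_range=(1, 2), lowercase=True)),
        ("model", LogisticRegression(max_iter=1000)),
    ])
    clf.fit(segments, labels)

    # Segments predicted as QuestionText or ResponseDomain would then be
    # mapped to DDI-L question items and response domains for repository
    # ingest.
    print(clf.predict(["Do you currently smoke cigarettes?",
                       "1. Yes  2. No"]))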

    Provenance of "after the fact" harmonised community-based demographic and HIV surveillance data from ALPHA cohorts

    Background: Data about data, or metadata, for describing Health and Demographic Surveillance System (HDSS) data have often received insufficient attention. This thesis studied how to develop provenance metadata within the context of HDSS data harmonisation in the network for Analysing Longitudinal Population-based HIV/AIDS data on Africa (ALPHA). Technologies from the data documentation community were customised, among them: a process model, the Generic Longitudinal Business Process Model (GLBPM); two metadata standards, the Data Documentation Initiative (DDI) and the Statistical Data and Metadata eXchange (SDMX); and a data transformation description language, the Structured Data Transform Language (SDTL).
    Methods: A framework with three complementary facets was used: creating a recipe for annotating primary HDSS data using the GLBPM and DDI; documenting data transformations, with prospective and retrospective documentation at the business level using GLBPM and DDI and retrospective recovery of more granular details using SDMX and SDTL; and a requirements analysis for a user-friendly provenance metadata browser.
    Results: A recipe for the annotation of HDSS data was created, outlining considerations to guide HDSS sites on metadata entry, staff training, and software costs. Regarding data transformations, at the business level a specialised process model for the HDSS domain was created, with algorithm steps for each data transformation sub-process and their data inputs and outputs. At a lower level, SDMX and SDTL captured about 80% (17/21) of the variable-level transformations. The requirements elicitation study yielded requirements for a provenance metadata browser to guide developers.
    Conclusions: This is a first attempt at creating detailed metadata for this resource or for similar resources in this field. HDSS sites can implement these recipes to document their data. This will increase transparency and facilitate reuse, thus potentially bringing down the costs of data management, and it will arguably promote the longevity and the wide and accurate use of these data.
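    To make variable-level transformation provenance concrete, the sketch below records a single derivation step as a structured metadata record. The field names are illustrative only and do not follow the actual SDTL schema.

    import json

    # Hypothetical provenance record for one harmonisation step.
    transformation = {
        "command": "Recode",                        # kind of operation performed
        "source_variables": ["hiv_test_result"],    # inputs read by the step
        "target_variable": "hiv_status_harmonised", # harmonised output variable
        "rule": {"1": "positive", "2": "negative", "9": None},
        "script_reference": "harmonise_hiv.do, lines 120-134",  # hypothetical
        "performed_by": "ALPHA data manager",
        "study_round": "Round 5",
    }

    # Serialised records like this, attached to the harmonised dataset's DDI
    # documentation, let a reuser trace how each variable was derived.
    print(json.dumps(transformation, indent=2))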

    YARD: A Tool for Curating Research Outputs

    Repositories increasingly accept research outputs and associated artifacts that underlie reported findings, leading to potential changes in the demand for data curation and repository services. This paper describes a curation tool that responds to this challenge by economizing and optimizing curation efforts. The curation tool is implemented at Yale University’s Institution for Social and Policy Studies (ISPS) as YARD. By standardizing the curation workflow, YARD helps create high-quality data packages that are findable, accessible, interoperable, and reusable (FAIR), and promotes research transparency by connecting the activities of researchers, curators, and publishers through a single pipeline.

    Preparing to Share Social Science Data: An Open Source, DDI-based Curation System

    Objective: This poster will describe the development of a curatorial system to support a repository for research data from randomized controlled trials in the social sciences.
    Description: The Institution for Social and Policy Studies (ISPS) at Yale University and Innovations for Poverty Action (IPA) are partnering with Colectica to develop a software platform that structures the curation workflow, including checking data for confidentiality and completeness, creating preservation formats, and reviewing and verifying code. The software leverages DDI Lifecycle – the standard for data documentation – and will enable a seamless framework for collecting, processing, archiving, and publishing data. This data curation software system combines several off-the-shelf components with a new, open source, Web application that integrates the existing components to create a flexible data pipeline. The software will help automate parts of the data pipeline and will unify the workflow for staff, and potentially for researchers. Default components include Fedora Commons, Colectica Repository, and Drupal, but the software is developed so each of these can be swapped for alternatives.
    Results: The software is designed to integrate into any repository workflow, and can also be incorporated earlier in the research workflow, ensuring eventual data and code deposits are of the highest quality.
    Conclusions: This poster will describe the requirements for the new curatorial workflow tool, the components of the system, how tasks are launched and tracked, and the benefits of building an integrated curatorial system for data, documentation, and code.
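    A minimal sketch of the swappable-component design described above: curation steps are written against an abstract repository interface so that the default backend can be exchanged for an alternative. All class and method names here are hypothetical, not the platform's actual API.

    from abc import ABC, abstractmethod

    class RepositoryBackend(ABC):
        @abstractmethod
        def deposit(self, package_id: str, files: list[str]) -> str:
            """Store a curated data package and return its persistent identifier."""

    class FedoraBackend(RepositoryBackend):
        def deposit(self, package_id: str, files: list[str]) -> str:
            # Placeholder: a real implementation would call the Fedora Commons API.
            return f"fedora:{package_id}"

    class ColecticaBackend(RepositoryBackend):
        def deposit(self, package_id: str, files: list[str]) -> str:
            # Placeholder: a real implementation would call the Colectica Repository API.
            return f"colectica:{package_id}"

    def curate_and_publish(package_id: str, files: list[str],
                           backend: RepositoryBackend) -> str:
        # Curation steps (confidentiality check, preservation formats, code
        # review) would run here before deposit; omitted in this sketch.
        return backend.deposit(package_id, files)

    # The backend is chosen by configuration rather than hard-coded, so the
    # pipeline itself does not change when a component is swapped.
    pid = curate_and_publish("study-2024-001", ["data.csv", "analysis.do"],
                             backend=FedoraBackend())
    print(pid)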

    Open-access for existing LMIC demographic surveillance data using DDI

    Committing to Data Quality Review

    Amid the pressure and enthusiasm for researchers to share data, a rapidly growing number of tools and services have emerged. What do we know about the quality of these data? Why does quality matter? And who should be responsible for data quality? We believe an essential measure of data quality is the ability to engage in informed reuse, which requires that data are independently understandable. In practice, this means that data must undergo quality review, a process whereby data and associated files are assessed and required actions are taken to ensure files are independently understandable for informed reuse. This paper explains what we mean by data quality review, what measures can be applied to it, and how it is practiced in three domain-specific archives. We explore a selection of other data repositories in the research data ecosystem, as well as the roles of researchers, academic libraries, and scholarly journals with regard to their application of data quality measures in practice. We end with thoughts about the need to commit to data quality and who might be able to take on those tasks.