16 research outputs found

    Streamlining Study Design and Statistical Analysis for Quality Improvement and Research Reproducibility

    No full text
Research Overview: This work summarizes current and future efforts to streamline the processes and methods involved in study design and statistical analysis, in order to ensure the quality of statistical methods and the reproducibility of research.

Objectives/Goals: Key factors causing irreproducibility of research include inappropriate study design methodologies and statistical analyses. In modern statistical practice, irreproducibility can arise from statistical issues (false discoveries, p-hacking, overuse/misuse of p-values, low power, poor experimental design) and computational issues (data, code, and software management). Addressing these requires understanding the processes and workflows practiced by an organization, and developing and using metrics to quantify reproducibility.

Methods/Study Population: Within the Foundation of Discovery - Population Health Research, Center for Clinical and Translational Science, University of Utah, we are undertaking a project to streamline study design and statistical analysis workflows and processes. As a first step, we met with key stakeholders to understand current practices by eliciting example statistical projects, and then developed process information models for different types of statistical needs using Lucidchart. We then reviewed these with the Foundation's leadership and the Standards Committee to arrive at ideal workflows and models, and defined key measurement points (such as those around study design, analysis plan, final report, requirements for quality checks, and double coding) for assessing reproducibility (a toy sketch of tracking such checkpoints follows below). As next steps, we are using our findings to embed analytical and infrastructural approaches within the statisticians' workflows. These will include data and code dissemination platforms such as Box, Bitbucket, and GitHub; documentation platforms such as Confluence; and workflow tracking platforms such as Jira. These tools will simplify and automate the capture of communications as a statistician works through a project. Data-intensive processes will use process-workflow management platforms such as Activiti, Pegasus, and Taverna.

Results/Anticipated Results: We anticipate strategies for sharing and publishing study protocols, data, code, and results across the research spectrum; active collaboration with the research team; and automation of key steps, along with decision support.

Discussion/Significance of Impact: This analysis of statistical methods and processes, together with computational methods to automate them, aims to ensure the quality of statistical methods and the reproducibility of research.
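As a hedged illustration of how such measurement points might be tracked programmatically, the Python sketch below models the abstract's checkpoints (study design review, analysis plan, version control, double coding, quality-checked report) with a simple completion metric. The class, checkpoint names, and scoring rule are assumptions for exposition, not the project's actual tooling.

```python
# Hypothetical sketch: the abstract's measurement points modeled as project
# checkpoints with a simple completion metric. Checkpoint names, the class,
# and the scoring rule are illustrative assumptions, not the actual tooling.
from dataclasses import dataclass, field

CHECKPOINTS = [
    "study_design_reviewed",
    "analysis_plan_documented",
    "code_under_version_control",    # e.g., via Bitbucket or GitHub
    "double_coded",
    "final_report_quality_checked",
]

@dataclass
class StatisticalProject:
    name: str
    completed: set = field(default_factory=set)

    def mark(self, checkpoint: str) -> None:
        # Record a completed measurement point, rejecting unknown names.
        if checkpoint not in CHECKPOINTS:
            raise ValueError(f"unknown checkpoint: {checkpoint}")
        self.completed.add(checkpoint)

    def reproducibility_score(self) -> float:
        # Fraction of measurement points satisfied (an illustrative metric).
        return len(self.completed) / len(CHECKPOINTS)

project = StatisticalProject("example cohort analysis")
project.mark("study_design_reviewed")
project.mark("code_under_version_control")
print(f"{project.name}: {project.reproducibility_score():.0%} of checkpoints met")
```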

    A Conceptual Architecture for Reproducible On-demand Data Integration for Complex Diseases

    No full text
Eosinophilic Esophagitis is a complex and emerging condition characterized by poorly defined phenotypes and associated with both genetic and environmental factors. Understanding such diseases requires researchers to seamlessly navigate across multiple scales (e.g., metabolome, proteome, genome, phenome, exposome) and models (sources using different stores, formats, and semantics), interrogate existing knowledge bases, and obtain results in formats of choice to answer different types of research questions. All of this must be done in a way that supports the reproducibility and sharability of the methods used for selecting data sources, designing research queries, executing queries, and understanding results and their quality. We present a higher-level formalization for building multi-source data platforms on demand, based on the principles of meta-process modeling, which provides reproducible and sharable data query and interrogation workflows and artifacts. A framework based on these formalizations consists of a layered abstraction of processes to support administrative and research end users (a toy sketch of these layers follows the references below):

Top layer (meta-process): An extendable library of computable generic process concepts (PC), stored in a metadata repository (MDR) [1], that describe steps/phases in the translational research life cycle.

Middle layer (process): Methods to generate on-demand queries by assembling instantiated PC into query processes and rules. Researchers design query processes using PC, and evaluate their feasibility and validity by leveraging metadata content in the MDR.

Bottom layer (execution): Interaction with a hyper-generalized federation platform (e.g., OpenFurther [1]) that performs complex interrogation and integration queries requiring consideration of interdependencies and precedence across the selected sources.

This framework can be implemented using process exchange formats (e.g., DAX, BPMN) and scientific workflow systems (e.g., Pegasus [2], Apache Taverna [3]). All content (PC, rules, and workflows) and the assembly and execution mechanisms are sharable. The content, design, and development of the framework are informed by user-centered design methodology, and the framework consists of researcher- and integration-centric components to provide robust and reproducible workflows.

References
1. Gouripeddi R, Facelli JC, et al. FURTHeR: An Infrastructure for Clinical, Translational and Comparative Effectiveness Research. AMIA Annual Fall Symposium. 2013; Washington, DC.
2. Pegasus. The Pegasus Project. 2016; https://pegasus.isi.edu/.
3. Apache Software Foundation. Apache Taverna. 2016; https://taverna.incubator.apache.org/
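To make the three-layer abstraction concrete, here is a minimal Python sketch in which generic process concepts are registered in a stand-in MDR (top layer), assembled into a validated query process (middle layer), and handed to a mock execution layer (bottom layer). All class, function, and parameter names are hypothetical; a real deployment would target OpenFurther and workflow systems such as Pegasus or Taverna.

```python
# Illustrative sketch of the three-layer framework. Names such as
# ProcessConcept, MetadataRepository, assemble, and execute are assumptions
# for exposition, not the system's actual API.
from dataclasses import dataclass
from typing import Dict, List

@dataclass(frozen=True)
class ProcessConcept:              # top layer: a generic, computable step
    name: str
    params: tuple                  # parameter names the step expects

class MetadataRepository:          # stores the extendable PC library
    def __init__(self) -> None:
        self._library: Dict[str, ProcessConcept] = {}

    def register(self, pc: ProcessConcept) -> None:
        self._library[pc.name] = pc

    def get(self, name: str) -> ProcessConcept:
        return self._library[name]

def assemble(mdr: MetadataRepository, steps: List[tuple]) -> List[dict]:
    # Middle layer: instantiate PC into a concrete query process, checking
    # each step's parameters against the MDR-held definition.
    process = []
    for name, args in steps:
        pc = mdr.get(name)
        missing = [p for p in pc.params if p not in args]
        if missing:
            raise ValueError(f"{name} missing parameters: {missing}")
        process.append({"step": name, "args": args})
    return process

def execute(process: List[dict]) -> None:
    # Bottom layer: hand each step to a federation platform in order;
    # printing stands in for OpenFurther-style query execution.
    for step in process:
        print(f"executing {step['step']} with {step['args']}")

mdr = MetadataRepository()
mdr.register(ProcessConcept("select_sources", ("scales",)))
mdr.register(ProcessConcept("integrate", ("join_key",)))
plan = assemble(mdr, [
    ("select_sources", {"scales": ["genome", "exposome"]}),
    ("integrate", {"join_key": "patient_id"}),
])
execute(plan)
```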

    Reproducibility of Electronic Health Record Research Data Requests

    No full text
Objectives/Goals: Translational research, inclusive of observational, comparative effectiveness, clinical trial, and population health studies, is increasingly dependent on the secondary use of existing data, including the electronic health record (EHR), as a source for knowledge discovery. Researchers often provide natural language descriptions of their data needs. Research data teams use these descriptions to develop queries to run against EHR systems and provide results back to the researcher. Within this process, the data team and the researcher usually engage in complex written and verbal communication to mediate the details within the natural language description. The data team then maps the understood request to the complexities present within the EHR system. This is followed by the development of appropriate query scripts; the extraction, transformation, and provisioning of query results; and analysis. The data team and researcher usually iterate over these steps multiple times to develop the final data deliverable. In this study, we analyze the reproducibility of the current process of using natural language descriptions for acquiring research data from the EHR.

Methods/Study Population: We provided a natural language description of an Upper Respiratory Tract Infection (URTI) study to the data teams of three CTSA sites. The teams were blinded to the true nature of the study, which was to understand the reproducibility of the natural language description. The results and processes followed at each site were analyzed. The following is a summary of the URTI data request:

Patients eligible for enrollment between July 1, 2012 and September 30, 2015. Data will be reviewed up to six months pre-index to identify baseline characteristics and exclusion criteria. Subjects identified with an ICD-9-CM diagnosis code for a URTI during an outpatient encounter will be included. The index date is defined as the first ICD-9-CM documentation of a URTI. A 6-month pre-index period will be used to identify exclusion criteria and to observe baseline characteristics. Patients will be examined for outcomes of interest within 24 hours of the index clinic date and time. In addition, we exclude patients with a positive rapid antigen detection test (RADT) for group A streptococcal pathogens at the initial visit (results available within 24 hours), as this is an instance in which antibiotic prescribing is appropriate.

Inclusion criteria:
1. Age >18 years old
2. Diagnosis (ICD-9 code list provided) of a URTI in the outpatient setting

Exclusion criteria:
1. AIDS/HIV (ICD-9 code list provided)
2. COPD/Asthma (ICD-9 code list provided)
3. Cancer (ICD-9 code list provided)
4. Conditions for which antibiotic prescribing may be appropriate:
   - A URTI diagnosis within the 180-day pre-index period
   - Additional infectious diseases (ICD-9 code list provided)
   - A positive rapid antigen detection test for group A streptococcal pathogens (LOINC code list provided)

Results/Anticipated Results: Our results yielded 684,478, 460,159, and 412,942 individuals having outpatient visits between July 1, 2012 and September 30, 2015 at the three sites, respectively. Of these, 18.7%, 1.7%, and 3.1% had a URTI at each site. Of these, 17.5%, 34.2%, and 39.9% were over 18 years of age at the three sites, respectively. After applying all exclusion criteria, 6,797, 623, and 3,092 patients, respectively, were obtained at the three sites. These patients would NOT be expected to receive antibiotics. Of those, 9.4%, 0.3%, and 36.5% were prescribed antibiotics within the first 24 hours, and 11.2%, 0.3%, and 40.3% within the first 8 days, at each of the three sites respectively.

Discussion/Significance of Impact: Analysis of the results at each stage of query building showed differences ranging from two-fold to ten-fold across the sites. The site and magnitude of these differences varied at each step of the inclusion/exclusion criteria. To ascertain possible reasons for these discrepancies, we asked the data teams at each site to describe their data query process and analyzed the query results against the processes undertaken. In our analysis we found that contextual, organizational, and data-analyst-specific issues played a significant role in how each data team constructed its queries. In addition, differences in how data were transformed and stored from the EHR source systems, as well as how they were presented to the data team, played an important role.

Data obtained from the EHR have great potential for use in translational research. However, the inability to successfully reproduce data requests across different sites should be an important consideration. Reproducing research data requires effective communication between the research and data teams. In addition, there is a need for (1) structured or semi-structured data requisition methods using templates (a hypothetical sketch follows below), and (2) context-sensitive and metadata-driven workflows that support the entire life cycle of research data requisition, including the development of the natural language description of the research data, query mediation, data abstraction, data extraction, and the provisioning of results and analysis.
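As a hedged sketch of the structured requisition templates proposed above, the URTI request could be captured in a machine-readable form such as the following Python dictionary. Field names and layout are illustrative assumptions, not a standard the abstract defines; the code lists elided in the original request stay elided here.

```python
# Hypothetical structured version of the URTI data request. All field names
# are assumptions for exposition; the referenced code lists are elided, as in
# the original natural language description.
urti_request = {
    "enrollment_window": {"start": "2012-07-01", "end": "2015-09-30"},
    "index_event": {
        "definition": "first outpatient ICD-9-CM URTI diagnosis",
        "codes": "ICD-9 URTI code list (provided separately)",
    },
    "pre_index_days": 180,   # lookback for baseline and exclusions
    "inclusion": [
        {"criterion": "age_gt", "value": 18, "unit": "years"},
        {"criterion": "diagnosis", "setting": "outpatient",
         "codes": "ICD-9 URTI code list (provided separately)"},
    ],
    "exclusion": [
        {"criterion": "diagnosis", "concept": "AIDS/HIV"},
        {"criterion": "diagnosis", "concept": "COPD/Asthma"},
        {"criterion": "diagnosis", "concept": "Cancer"},
        {"criterion": "prior_urti", "window_days": 180},
        {"criterion": "diagnosis", "concept": "additional infectious diseases"},
        {"criterion": "lab_result", "test": "RADT group A streptococcus",
         "codes": "LOINC code list (provided separately)",
         "result": "positive", "window_hours": 24},
    ],
    "outcomes": [
        # 24 hours and 8 days (192 hours) post-index, per the abstract
        {"name": "antibiotic_prescription", "windows_hours": [24, 192]},
    ],
}
```

A template like this would let each site's data team resolve the same explicit fields rather than re-interpret free text, directly targeting the cross-site discrepancies the study observed.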

    A Framework for Metadata Management and Automated Discovery for Heterogeneous Data Integration

    No full text
Current approaches to metadata discovery depend on time-consuming manual curation. To realize the full potential of Big Data technologies in biomedicine, enhance research reproducibility, and increase efficiency in the translational sciences, it is critical to develop automatic and/or semi-automatic metadata discovery methods, along with the corresponding infrastructure to deploy and maintain these tools and their outputs.

Towards such a discovery infrastructure:

We conceptually designed a process workflow for a Metadata Discovery and Mapping Service for automated metadata discovery. Based on the steps taken by human experts in discovering and mapping metadata from various biomedical data, we designed a framework for automation. It consists of a three-step process: (1) identification of the data file's source and format; (2) detailed metadata characterization based on the identified format; and (3) characterization of the file in relation to other files, to support harmonization of content as needed for data integration (a toy sketch of these steps follows below). The framework discovers and leverages administrative, structural, descriptive, and semantic metadata, and consists of metadata and semantic mappers, along with uncertainty characterization and provision for expert review. As next steps, we will develop and evaluate this framework using workflow platforms (e.g., Swift, Pegasus).

To store discovered metadata about digital objects, we enhanced OpenFurther's Metadata Repository (MDR). We configured the bioCADDIE metadata specifications (the DatA Tag Suite (DATS) model) as assets in the MDR for harmonizing the metadata of individual datasets (e.g., different protein files) for data integration. This method of metadata management provides a flexible storage system for data-resource metadata that supports versioning of metadata (e.g., DATS 1.0 to 2.1) with data files mapped to different versions, enhancement of resource descriptors (DATS) with descriptions of the content within resources, and translation to other metadata specifications (e.g., schema.org). The MDR-stored metadata is also available to various data services, including data integration.
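A minimal Python sketch of the three-step discovery workflow is given below. The format-sniffing rule, CSV extractor, and shared-column harmonization heuristic are deliberately simplified assumptions; the actual service would run on workflow platforms such as Swift or Pegasus and persist its outputs to the MDR.

```python
# Minimal sketch of the three-step discovery workflow. The sniffing rule,
# extractor, and harmonization heuristic are simplified assumptions; real
# output would be stored in the MDR rather than a local list.
import json
import pathlib

def identify_format(path: pathlib.Path) -> str:
    # Step 1: identify the file's source format (toy rule: by extension).
    return {".csv": "csv", ".json": "json", ".vcf": "vcf"}.get(
        path.suffix.lower(), "unknown")

def characterize(path: pathlib.Path, fmt: str) -> dict:
    # Step 2: format-specific metadata characterization.
    meta = {"file": path.name, "format": fmt}
    if fmt == "csv":
        lines = path.read_text().splitlines()
        if lines:
            meta["columns"] = lines[0].split(",")
    return meta

def relate(meta: dict, corpus: list) -> dict:
    # Step 3: relate the file to previously characterized files; shared
    # columns suggest harmonization candidates for data integration.
    shared = [m["file"] for m in corpus
              if set(m.get("columns", [])) & set(meta.get("columns", []))]
    meta["harmonization_candidates"] = shared
    return meta

# Usage: scan an assumed local "data" directory, accumulating metadata in a
# list that stands in for the MDR.
data_dir = pathlib.Path("data")
mdr_store = []
if data_dir.is_dir():
    for p in sorted(data_dir.iterdir()):
        mdr_store.append(relate(characterize(p, identify_format(p)), mdr_store))
print(json.dumps(mdr_store, indent=2))
```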