218 research outputs found

    PhenDisco: phenotype discovery system for the database of genotypes and phenotypes.

    The database of genotypes and phenotypes (dbGaP) developed by the National Center for Biotechnology Information (NCBI) is a resource that contains information on various genome-wide association studies (GWAS) and is currently available via NCBI's dbGaP Entrez interface. The database is an important resource, providing GWAS data that can be used for new exploratory research or cross-study validation by authorized users. However, finding studies relevant to a particular phenotype of interest is challenging, as phenotype information is presented in a non-standardized way. To address this issue, we developed PhenDisco (phenotype discoverer), a new information retrieval system for dbGaP. PhenDisco consists of two main components: (1) text processing tools that standardize phenotype variables and study metadata, and (2) information retrieval tools that support queries from users and return ranked results. In a preliminary comparison involving 18 search scenarios, PhenDisco showed promising performance for both unranked and ranked search comparisons with dbGaP's search engine Entrez. The system can be accessed at http://pfindr.net

    The eMERGE Network: A consortium of biorepositories linked to electronic medical records data for conducting genomic studies

    Introduction: The eMERGE (electronic MEdical Records and GEnomics) Network is an NHGRI-supported consortium of five institutions to explore the utility of DNA repositories coupled to Electronic Medical Record (EMR) systems for advancing discovery in genome science. eMERGE also includes a special emphasis on the ethical, legal and social issues related to these endeavors. Organization: The five sites are supported by an Administrative Coordinating Center. Setting of network goals is initiated by working groups: (1) Genomics, (2) Informatics, and (3) Consent & Community Consultation, which also includes active participation by investigators outside the eMERGE funded sites, and (4) Return of Results Oversight Committee. The Steering Committee, comprised of site PIs and representatives and NHGRI staff, meets three times per year, once per year with the External Scientific Panel. Current progress: The primary site-specific phenotypes for which samples have undergone genome-wide association study (GWAS) genotyping are cataract and HDL, dementia, electrocardiographic QRS duration, peripheral arterial disease, and type 2 diabetes. A GWAS is also being undertaken for resistant hypertension in ≈2,000 additional samples identified across the network sites, to be added to data available for samples already genotyped. Funded by ARRA supplements, secondary phenotypes have been added at all sites to leverage the genotyping data, and hypothyroidism is being analyzed as a cross-network phenotype. Results are being posted in dbGaP.
Other key eMERGE activities include evaluation of the issues associated with cross-site deployment of common algorithms to identify cases and controls in EMRs, data privacy of genomic and clinically-derived data, developing approaches for large-scale meta-analysis of GWAS data across five sites, and a community consultation and consent initiative at each site. Future activities: Plans are underway to expand the network in diversity of populations and incorporation of GWAS findings into clinical care. Summary: By combining advanced clinical informatics, genome science, and community consultation, eMERGE represents a first step in the development of data-driven approaches to incorporate genomic information into routine healthcare delivery.

    Repeatable and reusable research - Exploring the needs of users for a Data Portal for Disease Phenotyping

    Background: Big data research in the field of health sciences is hindered by a lack of agreement on how to identify and define different conditions and their medications. This means that researchers and health professionals often have different phenotype definitions for the same condition. This lack of agreement makes it hard to compare different study findings and hinders the ability to conduct repeatable and reusable research. Objective: This thesis aims to examine the requirements of various users, such as researchers, clinicians, machine learning experts, and managers, for both new and existing data portals for phenotypes (concept libraries). Methods: Exploratory sequential mixed methods were used in this thesis to look at which concept libraries are available, how they are used, what their characteristics are, where there are gaps, and what needs to be done in the future from the point of view of the people who use them. This thesis consists of three phases: 1) two qualitative studies, including one-to-one interviews with researchers, clinicians, machine learning experts, and senior research managers in health data science, as well as focus group discussions with researchers working with the Secured Anonymized Information Linkage databank, 2) the creation of an email survey (i.e., the Concept Library Usability Scale), and 3) a quantitative study with researchers, health professionals, and clinicians. Results: Most of the participants thought that the prototype concept library would be a very helpful resource for conducting repeatable research, but they specified that many requirements are needed before its development. Although all the participants stated that they were aware of some existing concept libraries, most of them expressed negative perceptions about them. 
The participants mentioned several facilitators that would encourage them to: 1) share their work, such as receiving citations from other researchers; and 2) reuse the work of others, such as saving a lot of time and effort, which they frequently spend on creating new code lists from scratch. They also pointed out several barriers that could inhibit them from: 1) sharing their work, such as concerns about intellectual property (e.g., if they shared their methods before publication, other researchers would use them as their own); and 2) reusing others' work, such as a lack of confidence in the quality and validity of their code lists. Participants suggested some developments that they would like to see in order to make research done with routine data more reproducible, such as a drive for more transparency in research methods documentation, including publishing complete phenotype definitions and clear code lists. Conclusions: The findings of this thesis indicated that most participants valued a concept library for phenotypes. However, only half of the participants felt that they would contribute by providing definitions for the concept library, and they reported many barriers regarding sharing their work on a publicly accessible platform such as the CALIBER research platform. Analysis of interviews, focus group discussions, and qualitative studies revealed that different users have different requirements, facilitators, barriers, and concerns about concept libraries. This work set out to investigate whether concept libraries should be developed in Kuwait to facilitate improved data sharing. However, the recommendation at the end of this thesis is that this would be unlikely to be cost-effective or highly valued by users, and that investment in open access research publications may be of more value to the Kuwait research/academic community

    Improving average ranking precision in user searches for biomedical research datasets

    Availability of research datasets is a keystone of health and life science study reproducibility and scientific progress. Due to the heterogeneity and complexity of these data, a main challenge to be overcome by research data management systems is to provide users with the best answers for their search queries. In the context of the 2016 bioCADDIE Dataset Retrieval Challenge, we investigate a novel ranking pipeline to improve the search of datasets used in biomedical experiments. Our system comprises a query expansion model based on word embeddings, a similarity measure algorithm that takes into consideration the relevance of the query terms, and a dataset categorisation method that boosts the rank of datasets matching query constraints. The system was evaluated using a corpus with 800k datasets and 21 annotated user queries. Our system provides competitive results when compared to the other challenge participants. In the official run, it achieved the highest infAP among the participants, being +22.3% higher than the median infAP of the participants' best submissions. Overall, it ranks in the top 2 if an aggregated metric using the best official measures per participant is considered. The query expansion method showed a positive impact on the system's performance, improving on our baseline by up to +5.0% and +3.4% for the infAP and infNDCG metrics, respectively. Our similarity measure algorithm seems to be robust, in particular compared to the Divergence From Randomness framework, having smaller performance variations under different training conditions. Finally, the result categorization did not have a significant impact on the system's performance. We believe that our solution could be used to enhance biomedical dataset management systems. In particular, the use of data-driven query expansion methods could be an alternative to the complexity of biomedical terminologies

    Genome-Wide Association Analysis of Major Depressive Disorder and Its Related Phenotypes.

    Major Depressive Disorder (MDD) is a complex and chronic disease that ranks fourth as a cause of disability worldwide. Thirteen to 14 million adults in the U.S. are believed to have MDD and an estimated 75% attempt suicide, making MDD a major public health problem. Recently several genome-wide association (GWA) studies of MDD have been reported; however, few GWA studies focus on the analysis of MDD-related phenotypes such as neuroticism and age at onset of MDD. The purpose of this study is to determine risk factors for MDD, identify genome-wide genetic variants affecting neuroticism and age at onset as quantitative traits, and detect gender differences influencing neuroticism. Bivariate and multiple logistic regression analyses were performed on 1,738 MDD cases and 1,618 non-MDD controls to determine phenotypic risk factors for MDD. Multiple linear regression analyses in PLINK software were used for GWA analyses for neuroticism and age at onset of MDD with 437,547 Single Nucleotide Polymorphisms (SNPs). Gender (OR: 1.43; 95% CI: 1.24 - 1.64) and a family history (OR: 2.88; 95% CI: 2.48 - 3.35) were significantly associated with an increased risk of MDD, which supports the findings of prior studies. Through GWA analysis 34 SNPs were identified to be associated with neuroticism (p < 10^-4). The best SNP was rs4806846 within the TMPRSS9 gene (p = 7.79 x 10^-6). Furthermore, 46 SNPs were found showing significant gene x gender interactions for neuroticism with p < 10^-4. The best SNP showing gene x gender interaction was rs2430132 (p = 5.37 x 10^-6) in the HMCN1 gene. In addition, GWA analysis showed that several SNPs within 4 genes (GPR143, ASS1P4, MXRA5 and MAGEC1/2) were significantly associated with age at onset of MDD (p < 5 x 10^-7). This study confirmed previous findings that MDD is associated with an increased prevalence in women (about 43% more compared to men) and is highly heritable among first degree relatives.
Several novel genetic loci were identified to be associated with neuroticism and age at onset. Gender differences were found in genetic influence of neuroticism. These findings offer the potential for new insights into the pathogenesis of MDD

    Behavior change interventions: the potential of ontologies for advancing science and practice

    A central goal of behavioral medicine is the creation of evidence-based interventions for promoting behavior change. Scientific knowledge about behavior change could be more effectively accumulated using "ontologies." In information science, an ontology is a systematic method for articulating a "controlled vocabulary" of agreed-upon terms and their inter-relationships. It involves three core elements: (1) a controlled vocabulary specifying and defining existing classes; (2) specification of the inter-relationships between classes; and (3) codification in a computer-readable format to enable knowledge generation, organization, reuse, integration, and analysis. This paper introduces ontologies, provides a review of current efforts to create ontologies related to behavior change interventions and suggests future work. This paper was written by behavioral medicine and information science experts and was developed in partnership between the Society of Behavioral Medicine's Technology Special Interest Group (SIG) and the Theories and Techniques of Behavior Change Interventions SIG. In recent years significant progress has been made in the foundational work needed to develop ontologies of behavior change. Ontologies of behavior change could facilitate a transformation of behavioral science from a field in which data from different experiments are siloed into one in which data across experiments could be compared and/or integrated. This could facilitate new approaches to hypothesis generation and knowledge discovery in behavioral science

    Ontology-based knowledge representation of experiment metadata in biological data mining

    According to the PubMed resource from the U.S. National Library of Medicine, over 750,000 scientific articles have been published in the ~5000 biomedical journals worldwide in the year 2007 alone. The vast majority of these publications include results from hypothesis-driven experimentation in overlapping biomedical research domains. Unfortunately, the sheer volume of information being generated by the biomedical research enterprise has made it virtually impossible for investigators to stay aware of the latest findings in their domain of interest, let alone to be able to assimilate and mine data from related investigations for purposes of meta-analysis. While computers have the potential for assisting investigators in the extraction, management and analysis of these data, information contained in the traditional journal publication is still largely unstructured, free-text descriptions of study design, experimental application and results interpretation, making it difficult for computers to gain access to the content of what is being conveyed without significant manual intervention. In order to circumvent these roadblocks and make the most of the output from the biomedical research enterprise, a variety of related standards in knowledge representation are being developed, proposed and adopted in the biomedical community. In this chapter, we will explore the current status of efforts to develop minimum information standards for the representation of a biomedical experiment, ontologies composed of shared vocabularies assembled into subsumption hierarchical structures, and extensible relational data models that link the information components together in a machine-readable and human-useable framework for data mining purposes

    Big Data in Oncology Nursing Research: State of the Science.

    To review the state of oncology nursing science as it pertains to big data. The authors aim to define and characterize big data, describe key considerations for accessing and analyzing big data, provide examples of analyses of big data in oncology nursing science, and highlight ethical considerations related to the collection and analysis of big data. Peer-reviewed articles published by investigators specializing in oncology, nursing, and related disciplines. Big data is defined as data that are high in volume, velocity, and variety. To date, oncology nurse scientists have used big data to predict patient outcomes from clinician notes, identify distinct symptom phenotypes, and identify predictors of chemotherapy toxicity, among other applications. Although the emergence of big data and advances in computational methods provide new and exciting opportunities to advance oncology nursing science, several challenges are associated with accessing and using big data. Data security, research participant privacy, and the underrepresentation of minoritized individuals in big data are important concerns. With their unique focus on the interplay between the whole person, the environment, and health, nurses bring an indispensable perspective to the interpretation and application of big data research findings. Given the increasing ubiquity of passive data collection, all nurses should be taught the definition, characteristics, applications, and limitations of big data. Nurses who are trained in big data and advanced computational methods will be poised to contribute to guidelines and policies that preserve the rights of human research participants

    WHO, WHAT, WHEN, WHERE, AND WHY? QUANTIFYING AND UNDERSTANDING BIOMEDICAL DATA REUSE

    Since the mid-2000s, new data sharing mandates have led to an increase in the amount of research data available for reuse. Reuse of data benefits the scientific community and the public by potentially speeding scientific discovery and increasing the return on investment of publicly funded research. However, despite the potential benefits of reuse and the increasing availability of data, research on the impact of data reuse is so far sparse. This dissertation provides a deeper understanding of the impacts of shared biomedical research data by exploring who is reusing data and for what purpose. Specifically, this dissertation examines use requests and dataset descriptions from three biomedical repositories that require potential requestors to submit descriptions of their planned reuse. Content analysis of use requests yields insight into who is requesting data and the methods and topics of their planned reuse. Comparing use requests to the descriptions of the original datasets provides insight into the breadth of impact of data reuse and text mining of the original dataset descriptions helps determine the topics of datasets that are highly reused. This study demonstrates that patterns of reuse differ between dataset types, with genomic datasets used more frequently together in meta-analyses for topics that diverge from the original purpose of collection, while clinical datasets are used more often on their own within a context that is similar to the reason for which they were collected. While requestors do come from a range of career stages from around the world, they are not evenly distributed; most requests come from English-speaking countries, especially the United States. This study also finds that datasets that receive the most requests soon after release continue to go on to be more requested, and that datasets covering common diseases are requested more than datasets on rare diseases. 
These findings have implications for several stakeholders, including funders and institutions developing policies to reward and incentivize data sharing, researchers who share data and those who reuse it, and repositories and data curators who must make choices about which datasets to curate and preserve

    Optimizing the Privacy Risk - Utility Framework in Data Publication
