
    No major flaws in "Identification of individuals by trait prediction using whole-genome sequencing data"

    In a recently published PNAS article, we studied the identifiability of genomic samples using machine learning methods [Lippert et al., 2017]. In a response, Erlich [2017] argued that our work contained major flaws. The main technical critique of Erlich [2017] builds on a simulation experiment that shows that our proposed algorithm, which uses only a genomic sample for identification, performed no better than a strategy that uses demographic variables. Below, we show why this comparison is misleading and provide a detailed discussion of the key critical points in our analyses that have been brought up in Erlich [2017] and in the media. Further, not only faces but also a wide range of phenotypes and demographic variables may be derived from DNA. In this light, the main contribution of Lippert et al. [2017] is an algorithm that identifies genomes of individuals by combining multiple DNA-based predictive models for a myriad of traits.

    Identification of individuals by trait prediction using whole-genome sequencing data

    Prediction of human physical traits and demographic information from genomic data challenges privacy and data deidentification in personalized medicine. To explore the current capabilities of phenotype-based genomic identification, we applied whole-genome sequencing, detailed phenotyping, and statistical modeling to predict biometric traits in a cohort of 1,061 participants of diverse ancestry. Individually, for a large fraction of the traits, their predictive accuracy beyond ancestry and demographic information is limited. However, we have developed a maximum entropy algorithm that integrates multiple predictions to determine which genomic samples and phenotype measurements originate from the same person. Using this algorithm, we have reidentified an average of >8 of 10 held-out individuals in an ethnically mixed cohort and an average of 5 of either 10 African Americans or 10 Europeans. This work challenges current conceptions of personal privacy and may have far-reaching ethical and legal implications.

    CYP4F2 genetic variant alters required warfarin dose

    Warfarin is an effective, commonly prescribed anticoagulant used to treat and prevent thrombotic events. Because of historically high rates of drug-associated adverse events, warfarin remains underprescribed. Further, interindividual variability in therapeutic dose mandates frequent monitoring until target anticoagulation is achieved. Genetic polymorphisms involved in warfarin metabolism and sensitivity have been implicated in variability of dose. Here, we describe a novel variant that influences warfarin requirements. To identify additional genetic variants that contribute to warfarin requirements, screening of DNA variants in additional genes that code for drug-metabolizing enzymes and drug transport proteins was undertaken using the Affymetrix drug-metabolizing enzymes and transporters panel. A DNA variant (rs2108622; V433M) in cytochrome P450 4F2 (CYP4F2) was associated with warfarin dose in 3 independent white cohorts of patients stabilized on warfarin representing diverse geographic regions in the United States and accounted for a difference in warfarin dose of approximately 1 mg/day between CC and TT subjects. Genetic variation of CYP4F2 was associated with a clinically relevant effect on warfarin requirement.

    The MAQC-II Project: A comprehensive study of common practices for the development and validation of microarray-based predictive models

    Gene expression data from microarrays are being applied to predict preclinical and clinical endpoints, but the reliability of these predictions has not been established. In the MAQC-II project, 36 independent teams analyzed six microarray data sets to generate predictive models for classifying a sample with respect to one of 13 endpoints indicative of lung or liver toxicity in rodents, or of breast cancer, multiple myeloma or neuroblastoma in humans. In total, >30,000 models were built using many combinations of analytical methods. The teams generated predictive models without knowing the biological meaning of some of the endpoints and, to mimic clinical reality, tested the models on data that had not been used for training. We found that model performance depended largely on the endpoint and team proficiency and that different approaches generated models of similar performance. The conclusions and recommendations from MAQC-II should be useful for regulatory agencies, study committees and independent investigators that evaluate methods for global gene expression analysis.