1,358 research outputs found

    STATISTICAL METHODS FOR JOINT ANALYSIS OF MULTIPLE PHENOTYPES AND THEIR APPLICATIONS FOR PHEWAS

    Genome-wide association studies (GWAS) have successfully detected tens of thousands of robust SNP-trait associations. Earlier research focused primarily on associations between genetic variants and individual well-defined functions or phenotypic traits. Emerging evidence suggests that pleiotropy, the phenomenon in which one genetic variant affects multiple phenotypes, is widespread, especially in complex human diseases. Single-phenotype analyses may therefore lose statistical power to identify the underlying genetic mechanism. In contrast, joint analysis of multiple phenotypes exploits the correlations between phenotypes and aggregates multiple weak marginal effects, and is therefore likely to provide new insights into the functional consequences of genetic variation. This dissertation comprises two papers, corresponding to the two primary research projects of my Ph.D. study, each presented in one chapter. Chapter 1 proposes a method, referred to as HC-CLC, for joint analysis of multiple phenotypes using a Hierarchical Clustering (HC) approach followed by a Clustering Linear Combination (CLC) method. The HC step partitions the phenotypes into clusters. The CLC method then tests the association between the genetic variant and all phenotypes by combining individual test statistics while taking full advantage of the clustering information from the HC step. Extensive simulations, together with an analysis of the COPDGene data, were used to assess the Type I error rates and the power of the proposed method. The simulation results demonstrate that the Type I error rates of HC-CLC are effectively controlled in different realistic settings. HC-CLC either outperforms all other methods or has statistical power very close to that of the most powerful alternative method with which it was compared. In addition, the real data analysis shows that HC-CLC is an appropriate method for GWAS. Chapter 2 redesigns PheCLC (a phenome-wide association study that uses the CLC method), which was previously developed by our research group. The refined method is applied to data from the UK Biobank, a large cohort study across the United Kingdom, to test its validity and understand its limitations. We have named the new method UKB-PheCLC. UKB-PheCLC is an EHR-based PheWAS. In the first step, it classifies the whole phenome into phenotypic categories according to the UK Biobank ICD codes. In the second step, the CLC method is applied to each phenotypic category to derive a CLC-based p-value for testing the association between the genetic variant of interest and all phenotypes in that category. In the third step, the CLC-based p-values of all categories are combined using a strategy that resembles the Adaptive Fisher's Combination (AFC) method. Overall, UKB-PheCLC harnesses the powerful resource of the UK Biobank and accounts for the possibility that phenotypes can be grouped into different phenotypic categories, which is very common in EHR-based PheWAS. Moreover, UKB-PheCLC can handle both qualitative and quantitative phenotypes, and it does not require raw phenotype information. The real data analysis results confirm that UKB-PheCLC is more powerful than the existing methods with which it was compared. Thus, UKB-PheCLC can serve as a compelling method for phenome-wide association studies.
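
The abstract's third step combines per-category p-values into one overall p-value. As a rough illustration of that general idea only, the sketch below implements classical Fisher's combination in Python. It is not the Adaptive Fisher's Combination (AFC) variant the abstract names, and it is not the authors' implementation. It assumes the category p-values are independent, and the function name `fisher_combine` is hypothetical.

```python
import math


def fisher_combine(p_values):
    """Combine independent p-values with classical Fisher's method.

    The statistic X = -2 * sum(ln p_i) follows a chi-square distribution
    with 2k degrees of freedom under the null (k = number of p-values).
    Independence of the p-values is an assumption of this sketch.
    """
    if not p_values:
        raise ValueError("need at least one p-value")
    if any(not (0.0 < p <= 1.0) for p in p_values):
        raise ValueError("p-values must lie in (0, 1]")
    k = len(p_values)
    half_stat = -sum(math.log(p) for p in p_values)  # this is X / 2
    # For even degrees of freedom 2k the chi-square survival function has
    # the closed form exp(-x/2) * sum_{j=0}^{k-1} (x/2)^j / j!.
    term = 1.0
    total = 1.0
    for j in range(1, k):
        term *= half_stat / j
        total += term
    return min(1.0, math.exp(-half_stat) * total)


if __name__ == "__main__":
    # Hypothetical per-category p-values, for illustration only.
    print(fisher_combine([0.04, 0.20, 0.65]))
```

An adaptive variant would typically search over subsets or orderings of the p-values and calibrate the result by permutation; that part is deliberately left out here.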

    Ontology-Based Clinical Information Extraction Using SNOMED CT

    Extracting clinical information from unstructured clinical documents and encoding it with standard medical terminologies is vital to enable secondary use of clinical data from practice. SNOMED CT is the most comprehensive medical ontology, with broad types of concepts and detailed relationships, and it has been widely used in many clinical applications. However, few studies have investigated the use of SNOMED CT in clinical information extraction. In this dissertation research, we developed a fine-grained information model based on SNOMED CT and built novel information extraction systems to recognize clinical entities, identify their relations, and encode them as SNOMED CT concepts. Our evaluation shows that such ontology-based information extraction systems using SNOMED CT can achieve state-of-the-art performance, indicating their potential in clinical natural language processing.

    A Genome-First Approach To Investigating The Biological And Clinical Relevance Of Exome-Wide Rare Coding Variation Using Electronic Health Record Phenotypes

    Genome-wide association studies (GWAS) have successfully described the roles of common genetic variation in human diseases by analyzing large populations recruited based on a shared phenotype, but the biological and clinical relevance of numerous genes remains incompletely described through these ‘phenotype-first’ methodologies. Much of the unexplained genetic contribution to disease risk and variability in complex traits may belong to the very rare and private spectrum of alleles, a range traditionally ignored by GWAS. Furthermore, the phenotype-first approach is likely to miss unexpected phenotypic consequences of genetic variants, such as those that are not feasible to study phenotype-first because the condition is rare. The Penn Medicine BioBank, a healthcare system-based database of genotype, whole-exome sequencing, and electronic health record (EHR) data, allows for an unbiased, ‘genome-first’ approach to describing the relationships between genetic variants and human disease traits captured in the clinical setting. Through ‘gene burden’ tests that interrogate the cumulative effects of multiple rare and private variants in a gene that are predicted to affect gene function, this dissertation aims to characterize the clinical manifestations of diseases and traits caused by rare, predicted loss-of-function and predicted deleterious missense variants on an exome-wide and/or phenome-wide scale. These analyses uncover previously unsuspected medical and biological consequences of loss-of-function variants in multiple genes. In summary, this dissertation investigates the biological and clinical relevance of disease-associated genes by testing the association of rare coding variation found in whole-exome sequencing with phenotypes derived from the EHR.
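
The ‘gene burden’ idea described above, pooling several rare variants in a gene into one per-person signal, can be sketched in a few lines. The Python below is a minimal sketch under strong assumptions, not the dissertation's method: it collapses each person's qualifying variants into a carrier flag and tests carrier status against a binary phenotype with a two-sided Fisher's exact test. Burden analyses in practice usually use regression with covariates. All function names (`carrier_status`, `fisher_exact_two_sided`, `burden_test`) and the toy data are hypothetical.

```python
from math import comb


def carrier_status(genotypes):
    """Collapse per-variant rare-allele counts into one carrier flag per person.

    genotypes: one list per person with the rare-allele count at each
    qualifying variant in a single gene (hypothetical input layout).
    """
    return [int(any(count > 0 for count in person)) for person in genotypes]


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact p-value for the 2x2 table [[a, b], [c, d]]."""
    row1, col1, n = a + b, a + c, a + b + c + d

    def prob(x):
        # Hypergeometric probability of x in the top-left cell.
        return comb(col1, x) * comb(n - col1, row1 - x) / comb(n, row1)

    lo, hi = max(0, row1 - (n - col1)), min(row1, col1)
    p_obs = prob(a)
    # Sum every table at least as extreme (no more probable) than the observed one.
    tail = sum(prob(x) for x in range(lo, hi + 1) if prob(x) <= p_obs * (1 + 1e-9))
    return min(1.0, tail)


def burden_test(genotypes, is_case):
    """Test carrier status against a binary phenotype (collapsing burden test)."""
    carriers = carrier_status(genotypes)
    pairs = list(zip(carriers, is_case))
    a = sum(1 for g, y in pairs if g and y)          # carrier, case
    b = sum(1 for g, y in pairs if g and not y)      # carrier, control
    c = sum(1 for g, y in pairs if not g and y)      # non-carrier, case
    d = sum(1 for g, y in pairs if not g and not y)  # non-carrier, control
    return fisher_exact_two_sided(a, b, c, d)


if __name__ == "__main__":
    toy_genotypes = [[1, 0], [0, 2], [1, 1], [0, 0], [0, 0], [0, 0]]
    toy_cases = [1, 1, 1, 0, 0, 0]
    print(burden_test(toy_genotypes, toy_cases))
```

Collapsing to a carrier flag is the simplest burden design. It discards how many variants a person carries and cannot adjust for age, sex, or ancestry, which is why regression-based formulations are common.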

    Machine learning of structured and unstructured healthcare data

    The widespread adoption of Electronic Health Record (EHR) systems in healthcare institutions in the United States makes machine learning on large-scale, real-world clinical data feasible and affordable. Machine learning of healthcare data, or healthcare data analytics, has achieved numerous successes in various applications. However, many challenges remain for machine learning of both structured and unstructured healthcare data. Longitudinal structured clinical data (e.g., lab test results, diagnoses, and medications) span an enormous variety of categories, are collected at irregularly spaced visits, and are sparsely distributed. Studies that analyze longitudinal structured EHR data for tasks such as disease prediction and visualization are still limited. For unstructured clinical notes, existing studies mostly focus on disease prediction or cohort selection. Studies that mine clinical notes with the direct purpose of reducing costs for healthcare providers or institutions are limited. To fill these gaps, this dissertation addresses three research topics. The first topic is the development of state-of-the-art predictive models to detect diabetic retinopathy using longitudinal structured EHR data. Major deep-learning-based temporal models for disease prediction are studied, implemented, and evaluated. Experimental results on a large-scale dataset show that temporal deep learning models outperform non-temporal random forest models in terms of AUPRC and recall. The second topic is the clustering of temporal disease networks to visualize comorbidity progression. We propose a clustering technique to outline comorbidity progression phases as well as a new disease clustering method to simplify the visualization. Two case studies, on Clostridioides difficile and stroke, show that the methods are effective. The third topic is clinical information extraction for medical billing. We propose a framework consisting of two methods, one rule-based and one deep-learning-based, to extract patient history information directly from clinical notes to facilitate Evaluation and Management Services (E/M) billing. Initial results of the two prototype systems on an annotated dataset are promising and point to potential improvements.

    THE CONCORD WATER SURVEY

    An Integrated Immunopeptidomics and Proteogenomics Framework to Discover Non-Canonical Targets for Cancer Immunotherapy

    A central factor in the development of cancer immunotherapy is the identification of clinically relevant human leukocyte antigen (HLA)-bound peptides that elicit potent anti-tumor T cell responses. Mass spectrometry (MS) is the only unbiased technique that captures the in vivo presented HLA repertoire. However, significant improvements in MS-based HLA peptide discovery methodologies are necessary to enable a smooth transition to the clinic. Recently, a consortium of multidisciplinary researchers presented current issues in clinical MS-based immunopeptidomics, highlighting method development and standardization challenges in HLA immunoaffinity purification. The first part of this thesis addresses improvements to the experimental method for HLA peptide extraction. The approach was optimized with several new developments, facilitating high-throughput, reproducible, scalable, and sensitive sequential immunoaffinity purification of HLA class I and class II peptides from cell lines and tissue samples. The method showed increased speed and reduced sample handling compared to previous methods. Unprecedented depth and high reproducibility were achieved for the obtained HLA peptides (Pearson correlations up to 0.98 and 0.97 for HLA class I and HLA class II, respectively). Additionally, the feasibility of performing robust comparative studies was demonstrated on an ovarian cancer cell line treated with interferon gamma. Both quantitative and qualitative changes were detected in the cancer HLA repertoire upon treatment. Specifically, a previously unreported phenomenon was the upregulated presentation of longer and chymotryptic-like HLA class I ligands, likely related to the modulation of the antigen processing and presentation machinery. Taken together, a robust and streamlined framework was built that facilitates peptide purification and its application in basic and translational research. Furthermore, recent studies have shown that, along with the highly attractive mutated neoantigens, other non-mutated, yet tumor-specific, non-canonical antigens may also play an important role in anti-tumor immunity. Non-canonical antigens are of presumed non-coding origin and are not commonly included in protein reference databases, and are therefore typically disregarded in database-dependent MS searches. The second part of this thesis develops an analytical workflow enabling the confident identification and validation of non-canonical tumor antigens. For this purpose, whole exome sequencing, bulk and single-cell RNA sequencing, and ribosome profiling were integrated with MS-based immunopeptidomics for personalized non-canonical HLA peptide discovery. A computational module called NewAnce was designed, which combines the results of two tandem MS search tools and implements group-specific false discovery rate calculations to control the error specifically for the non-canonical peptide group. When applied to patient-derived melanoma cell lines and paired lung cancer and normal tissues, NewAnce resulted in the accurate identification of hundreds of shared and tumor-specific non-canonical HLA peptides. Next, the level of non-canonical peptide confirmation was tested in a targeted MS-based approach, and selected non-canonical peptides were extensively validated for one melanoma sample. Furthermore, ribosome profiling showed that the novel open reading frames that generate a selection of these non-canonical peptides are actively translated. Importantly, these peptides were assessed with autologous immune cells, and a non-canonical immunogenic epitope was discovered from an alternative open reading frame of the melanoma stem cell marker gene ABCB5. This thesis concludes by highlighting the possibility of incorporating the proteogenomics pipeline into existing neoantigen discovery engines in order to prioritize tumor-specific non-canonical peptides for cancer immunotherapy. -- A highly heterogeneous and multifactorial disease, cancer is currently the second leading cause of death worldwide. Although the immune system is able to recognize and then eliminate cancer cells, these cells can in turn adapt and accumulate mutations that allow them to escape this recognition. Anti-tumor immunotherapy demonstrates the key role of immunity in the eradication of tumors. However, these promising therapies are effective in only a small proportion of treated patients. A major step in establishing an anti-tumor immune response is the recognition of tumor-associated antigens. Recent studies have shown that tumor antigens derived from non-coding regions of the genome (non-canonical antigens) can play a key role in inducing immune responses. Identifying these particular tumor antigens would therefore help guide the development of personalized anti-cancer immunotherapies such as vaccination or the adoptive transfer of T cells that recognize these targets. Mass spectrometry (MS) is an unbiased technique that enables the identification and analysis of the repertoire of antigens presented in vivo. However, this technique needs to be optimized and standardized before it can be used in the clinic. The first part of this thesis work was therefore dedicated to the experimental optimization of this method using tissue samples and cell lines. Compared with standard protocols, this technique provides more complete, faster, and more reproducible coverage of the repertoire of HLA-bound peptides. The second part of this thesis was devoted to developing a method for identifying non-canonical tumor antigens through cellular RNA sequencing, ribosome profiling, and our optimized immunopeptidomics method. To control the identification of false positives, we designed a new computational module. This module enabled the identification of several hundred non-canonical HLA peptides, both shared and specific to melanoma and lung cancer. Ribosome profiling revealed the translation of novel open reading frames from which new non-canonical peptides are translated. This technique allowed us to identify an immunogenic epitope derived from the ABCB5 gene, a cancer stem cell marker previously identified in melanoma. Overall, this thesis work, combining immunopeptidomics and proteogenomics, led to an experimental method that allows better identification of tumor antigens. We hope that these results will improve the identification and prioritization of relevant targets for anti-cancer immunotherapy in the clinic.
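
The abstract says NewAnce combines two tandem MS search tools and applies group-specific false discovery rate (FDR) control. The Python below is a minimal sketch of that general pattern, assuming a standard target-decoy estimate (decoys divided by targets above a score cutoff) computed separately per peptide group, followed by an intersection of the two tools' accepted peptides. It is not NewAnce's code. The tuple layout, the function names `group_fdr_filter` and `consensus`, and the default `alpha` are all hypothetical.

```python
from collections import defaultdict


def group_fdr_filter(psms, alpha=0.05):
    """Accept target peptides per group at an estimated FDR of at most alpha.

    psms: iterable of (peptide, score, is_decoy, group) tuples, where a
    higher score is better. The FDR is estimated within each group as
    decoys / targets among the top-ranked matches (target-decoy approach).
    """
    by_group = defaultdict(list)
    for psm in psms:
        by_group[psm[3]].append(psm)

    accepted = set()
    for group_psms in by_group.values():
        ranked = sorted(group_psms, key=lambda p: p[1], reverse=True)
        targets = decoys = 0
        cutoff = 0  # size of the largest top-ranked prefix that passes
        for rank, (_, _, is_decoy, _) in enumerate(ranked, start=1):
            if is_decoy:
                decoys += 1
            else:
                targets += 1
            if targets and decoys / targets <= alpha:
                cutoff = rank
        accepted.update(p[0] for p in ranked[:cutoff] if not p[2])
    return accepted


def consensus(psms_tool_a, psms_tool_b, alpha=0.05):
    """Keep only peptides that pass the group-wise filter in both tools."""
    return group_fdr_filter(psms_tool_a, alpha) & group_fdr_filter(psms_tool_b, alpha)


if __name__ == "__main__":
    # Toy matches for illustration only; peptide names are made up.
    tool_a = [
        ("PEPA", 10, False, "canonical"), ("PEPB", 9, False, "canonical"),
        ("DEC1", 8, True, "canonical"), ("NC1", 6, False, "noncanonical"),
        ("DEC2", 5, True, "noncanonical"),
    ]
    tool_b = [
        ("PEPA", 50, False, "canonical"), ("NC1", 20, False, "noncanonical"),
    ]
    print(sorted(consensus(tool_a, tool_b)))
```

Estimating the error rate inside each group matters because non-canonical candidates come from a much larger search space than canonical proteins, so a single pooled threshold would tend to let through proportionally more false non-canonical hits.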

    The Gene Ontology Handbook

    bioinformatics; biotechnolog