8 research outputs found

    Commonalities across computational workflows for uncovering explanatory variants in undiagnosed cases

    PURPOSE: Genomic sequencing has become an increasingly powerful and relevant tool to be leveraged for the discovery of genetic aberrations underlying rare, Mendelian conditions. Although the computational tools incorporated into diagnostic workflows for this task are continually evolving and improving, we nevertheless sought to investigate commonalities across sequencing processing workflows to reveal consensus and standard practice tools and highlight exploratory analyses where technical and theoretical method improvements would be most impactful. METHODS: We collected details regarding the computational approaches used by a genetic testing laboratory and 11 clinical research sites in the United States participating in the Undiagnosed Diseases Network via meetings with bioinformaticians, online survey forms, and analyses of internal protocols. RESULTS: We found that tools for processing genomic sequencing data can be grouped into four distinct categories. Whereas well-established practices exist for initial variant calling and quality control steps, there is substantial divergence across sites in later stages for variant prioritization and multimodal data integration, demonstrating a diversity of approaches for solving the most mysterious undiagnosed cases. CONCLUSION: The largest differences across diagnostic workflows suggest that advances in structural variant detection, noncoding variant interpretation, and integration of additional biomedical data may be especially promising for solving chronically undiagnosed cases.

    Detecting and Analyzing Variation in Protein Interactions

    Proteins carry out a dazzling multitude of functions by interacting with DNA, RNA, other proteins and various other molecules within our cells. Together these interactions comprise complex networks that differ naturally across cells within an organism, across individuals in a population, and across species. Although such variation is critical for normal organismal functioning, mutations affecting protein interactions are also known to underlie a wide range of human diseases. In this dissertation, I introduce novel computational approaches that explore the extent to which specific protein interactions vary across species, across healthy individuals, and across individuals with cancer. To start, I focus on interaction variation across species. It is well established that changes in protein-DNA interactions underlie a wide range of observable differences across species. These differences are primarily thought to stem from changes in the DNA sites that transcription factor (TF) proteins bind to, although changes in the binding properties of TFs themselves have also been observed. Determining the prevalence of such TF changes, however, remains infeasible using current experimental approaches. Here, I develop and apply a comparative genomics framework to systematically quantify changes in the DNA-binding properties of orthologous TFs across species spanning ~45 million years of evolutionary divergence. I demonstrate that, contrary to expectation, cross-species regulatory network divergence resulting from changes in non-duplicated DNA-binding proteins is pervasive. These findings reveal a widespread yet largely unstudied source of divergence across transcriptional regulatory programs in animals. Next, I turn my attention to interaction variation across individuals. 
In order to comprehensively quantify this, I first combine large-scale sequence, domain and structure information to pinpoint sites within protein domains---the fundamental structural units in proteins---that are involved in binding DNA, RNA, peptides, ions, metabolites, or other small molecules. This domain-based approach enables us to identify putative interaction sites in over 60% of human genes, representing a 2.4-fold improvement over comparable state-of-the-art approaches for this task. I next demonstrate that whereas domain-inferred interaction sites are significantly depleted of natural variants across ~60,000 healthy individuals, these same sites are significantly enriched for cancer mutations across ~11,000 tumor samples. My analysis demonstrates that the cellular network variation that occurs across healthy individuals is unlikely to be due to changes within proteins; in contrast, mutations acquired in cancers appear to preferentially alter cellular networks by perturbing the proteins themselves. Finally, I show how we can leverage an interaction-based viewpoint to uncover mutated genes that play causal roles in human cancers. In particular, I aim to uncover genes whose interaction interfaces are significantly altered in tumors. Towards this end, I develop a robust computational framework that integrates my per-domain-position binding propensities with additional sources of biological data regarding protein functionality. I demonstrate that by analytically computing the significance of patterns of mutations, my approach is able to achieve a dramatic improvement in runtime over a typical empirical permutation test for this task. Moreover, my interaction-based method not only recapitulates known cancer driver genes faster and with greater precision than previous methods, but it also uncovers relatively rarely-mutated genes with likely roles in cancer. 
Through focusing on the somatic alteration of protein interaction interfaces in tumors, my method can inform the perturbed molecular mechanisms across known and putative cancer genes, thereby enabling valuable insights that may help guide personalized cancer treatments.

    Commonalities across computational workflows for uncovering explanatory variants in undiagnosed cases.

    Purpose: Genomic sequencing has become an increasingly powerful and relevant tool to be leveraged for the discovery of genetic aberrations underlying rare, Mendelian conditions. Although the computational tools incorporated into diagnostic workflows for this task are continually evolving and improving, we nevertheless sought to investigate commonalities across sequencing processing workflows to reveal consensus and standard practice tools and highlight exploratory analyses where technical and theoretical method improvements would be most impactful. Methods: We collected details regarding the computational approaches used by a genetic testing laboratory and 11 clinical research sites in the United States participating in the Undiagnosed Diseases Network via meetings with bioinformaticians, online survey forms, and analyses of internal protocols. Results: We found that tools for processing genomic sequencing data can be grouped into four distinct categories. Whereas well-established practices exist for initial variant calling and quality control steps, there is substantial divergence across sites in later stages for variant prioritization and multimodal data integration, demonstrating a diversity of approaches for solving the most mysterious undiagnosed cases. Conclusion: The largest differences across diagnostic workflows suggest that advances in structural variant detection, noncoding variant interpretation, and integration of additional biomedical data may be especially promising for solving chronically undiagnosed cases.