Commonalities across computational workflows for uncovering explanatory variants in undiagnosed cases
PURPOSE: Genomic sequencing has become an increasingly powerful and relevant tool to be leveraged for the discovery of genetic aberrations underlying rare, Mendelian conditions. Although the computational tools incorporated into diagnostic workflows for this task are continually evolving and improving, we nevertheless sought to investigate commonalities across sequencing processing workflows to reveal consensus and standard practice tools and highlight exploratory analyses where technical and theoretical method improvements would be most impactful.
METHODS: We collected details regarding the computational approaches used by a genetic testing laboratory and 11 clinical research sites in the United States participating in the Undiagnosed Diseases Network via meetings with bioinformaticians, online survey forms, and analyses of internal protocols.
RESULTS: We found that tools for processing genomic sequencing data can be grouped into four distinct categories. Whereas well-established practices exist for initial variant calling and quality control steps, there is substantial divergence across sites in later stages for variant prioritization and multimodal data integration, demonstrating a diversity of approaches for solving the most mysterious undiagnosed cases.
CONCLUSION: The largest differences across diagnostic workflows suggest that advances in structural variant detection, noncoding variant interpretation, and integration of additional biomedical data may be especially promising for solving chronically undiagnosed cases.
Detecting and Analyzing Variation in Protein Interactions
Proteins carry out a dazzling multitude of functions by interacting with DNA, RNA, other proteins and various other molecules within our cells. Together these interactions comprise complex networks that differ naturally across cells within an organism, across individuals in a population, and across species. Although such variation is critical for normal organismal functioning, mutations affecting protein interactions are also known to underlie a wide range of human diseases. In this dissertation, I introduce novel computational approaches that explore the extent to which specific protein interactions vary across species, across healthy individuals, and across individuals with cancer.
To start, I focus on interaction variation across species. It is well established that changes in protein-DNA interactions underlie a wide range of observable differences across species. These differences are primarily thought to stem from changes in the DNA sites that transcription factor (TF) proteins bind to, although changes in the binding properties of TFs themselves have also been observed. Determining the prevalence of such TF changes, however, remains infeasible using current experimental approaches. Here, I develop and apply a comparative genomics framework to systematically quantify changes in the DNA-binding properties of orthologous TFs across species spanning ~45 million years of evolutionary divergence. I demonstrate that, contrary to expectation, cross-species regulatory network divergence resulting from changes in non-duplicated DNA-binding proteins is pervasive. These findings reveal a widespread yet largely unstudied source of divergence across transcriptional regulatory programs in animals.
Next, I turn my attention to interaction variation across individuals. In order to comprehensively quantify this, I first combine large-scale sequence, domain and structure information to pinpoint sites within protein domains---the fundamental structural units in proteins---that are involved in binding DNA, RNA, peptides, ions, metabolites, or other small molecules. This domain-based approach enables us to identify putative interaction sites in over 60% of human genes, representing a 2.4-fold improvement over comparable state-of-the-art approaches for this task. I next demonstrate that whereas domain-inferred interaction sites are significantly depleted of natural variants across ~60,000 healthy individuals, these same sites are significantly enriched for cancer mutations across ~11,000 tumor samples. My analysis demonstrates that the cellular network variation that occurs across healthy individuals is unlikely to be due to changes within proteins; in contrast, mutations acquired in cancers appear to preferentially alter cellular networks by perturbing the proteins themselves.
Finally, I show how we can leverage an interaction-based viewpoint to uncover mutated genes that play causal roles in human cancers. In particular, I aim to uncover genes whose interaction interfaces are significantly altered in tumors. Towards this end, I develop a robust computational framework that integrates my per-domain-position binding propensities with additional sources of biological data regarding protein functionality. I demonstrate that by analytically computing the significance of patterns of mutations, my approach is able to achieve a dramatic improvement in runtime over a typical empirical permutation test for this task. Moreover, my interaction-based method not only recapitulates known cancer driver genes faster and with greater precision than previous methods, but it also uncovers relatively rarely mutated genes with likely roles in cancer. By focusing on the somatic alteration of protein interaction interfaces in tumors, my method can inform the perturbed molecular mechanisms across known and putative cancer genes, thereby enabling valuable insights that may help guide personalized cancer treatments.
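The runtime contrast between an analytical significance calculation and an empirical resampling test can be illustrated with a deliberately simplified sketch. It is not the dissertation's statistic. It assumes (hypothetically) that each residue is labeled simply as interface or non-interface, that mutations fall uniformly and independently along the protein, and that the quantity of interest is the number of mutations landing in the interface. Under those assumptions the null distribution is binomial, so the p-value has a closed form, whereas the resampling estimate needs many simulated mutation placements. The function names and toy numbers are illustrative only.

```python
import math
import random


def binomial_tail(n_mutations, n_in_interface, interface_fraction):
    """Exact P(X >= n_in_interface) for X ~ Binomial(n_mutations, interface_fraction).

    This is the analytical route: one short sum, no simulation.
    """
    f = interface_fraction
    return sum(
        math.comb(n_mutations, i) * f**i * (1 - f) ** (n_mutations - i)
        for i in range(n_in_interface, n_mutations + 1)
    )


def resampled_tail(n_mutations, n_in_interface, interface_fraction,
                   trials=20000, seed=0):
    """Monte Carlo estimate of the same tail probability.

    Stands in for an empirical permutation-style test: mutations are
    re-placed at random many times and the extreme outcomes are counted.
    """
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        landed = sum(rng.random() < interface_fraction
                     for _ in range(n_mutations))
        if landed >= n_in_interface:
            hits += 1
    return hits / trials


# Toy case (hypothetical numbers): 10% of residues are interface sites,
# and 3 of 5 observed mutations fall on them.
exact_p = binomial_tail(5, 3, 0.1)
approx_p = resampled_tail(5, 3, 0.1)
```

The resampled estimate converges on the exact value only as the number of trials grows, and small p-values need proportionally more trials to resolve, which is where a closed-form calculation saves the most time.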
Systematic domain-based aggregation of protein structures highlights DNA-, RNA- and other ligand-binding positions.
Domains are fundamental subunits of proteins, and while they play major roles in facilitating protein-DNA, protein-RNA and other protein-ligand interactions, a systematic assessment of their various interaction modes is still lacking. A comprehensive resource identifying positions within domains that tend to interact with nucleic acids, small molecules and other ligands would expand our knowledge of domain functionality as well as aid in detecting ligand-binding sites within structurally uncharacterized proteins. Here, we introduce an approach to identify per-domain-position interaction 'frequencies' by aggregating protein co-complex structures by domain and ascertaining how often residues mapping to each domain position interact with ligands. We perform this domain-based analysis on ∼91000 co-complex structures, and infer positions involved in binding DNA, RNA, peptides, ions or small molecules across 4128 domains, which we refer to collectively as the InteracDome. Cross-validation testing reveals that ligand-binding positions for 2152 domains are highly consistent and can be used to identify residues facilitating interactions in ∼63-69% of human genes. Our resource of domain-inferred ligand-binding sites should be a great aid in understanding disease etiology: whereas these sites are enriched in Mendelian-associated and cancer somatic mutations, they are depleted in polymorphisms observed across healthy populations. The InteracDome is available at http://interacdome.princeton.edu
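The aggregation step described above can be sketched as follows. This is a minimal illustration, not the InteracDome pipeline itself. It assumes each structural instance of a domain has already been aligned to domain positions and reduced to a hypothetical mapping from position to a boolean saying whether the residue at that position contacts a ligand. The domain identifier and the toy data are made up for the example.

```python
from collections import defaultdict


def position_binding_frequencies(instances):
    """Aggregate ligand contacts by (domain, position).

    instances: iterable of (domain_id, {position: contacts_ligand}) pairs,
    one per structural instance of a domain. Positions missing from an
    instance (e.g. alignment gaps) do not count toward that position.
    Returns {(domain_id, position): fraction of instances in contact}.
    """
    contacts = defaultdict(int)
    observed = defaultdict(int)
    for domain_id, per_position in instances:
        for position, in_contact in per_position.items():
            key = (domain_id, position)
            observed[key] += 1
            if in_contact:
                contacts[key] += 1
    return {key: contacts[key] / observed[key] for key in observed}


# Toy input: four instances of one hypothetical domain.
toy_instances = [
    ("toy_domain", {1: True, 2: False, 3: True}),
    ("toy_domain", {1: True, 2: False, 3: False}),
    ("toy_domain", {1: False, 2: False, 3: True}),
    ("toy_domain", {1: True, 2: False}),  # position 3 unaligned here
]
frequencies = position_binding_frequencies(toy_instances)
```

In this toy input, position 1 is in contact in three of four instances and position 2 in none, so a frequency threshold would flag position 1 as a likely ligand-binding position. A fuller treatment would keep separate tallies per ligand type (DNA, RNA, peptide, ion, small molecule) and would account for redundancy among similar structures.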
RExPRT: a machine learning tool to predict pathogenicity of tandem repeat loci
Expansions of tandem repeats (TRs) cause approximately 60 monogenic diseases. We expect that the discovery of additional pathogenic repeat expansions will narrow the diagnostic gap in many diseases. A growing number of TR expansions are being identified, and interpreting them is a challenge. We present RExPRT (Repeat EXpansion Pathogenicity pRediction Tool), a machine learning tool for distinguishing pathogenic from benign TR expansions. Our results demonstrate that an ensemble approach classifies TRs with an average precision of 93% and recall of 83%. RExPRT's high precision will be valuable in large-scale discovery studies, which require prioritization of candidate loci for follow-up studies.