10 research outputs found

    Bayesian variable selection using Knockoffs with applications to genomics

    No full text
    Given the costliness of HIV drug therapy research, it is important not only to maximize the true positive rate (TPR) by identifying which genetic markers are related to drug resistance, but also to minimize the false discovery rate (FDR) by reducing the number of markers incorrectly reported as related to drug resistance. In this study, we propose a multiple testing procedure that unifies key concepts in computational statistics, namely Model-free Knockoffs, Bayesian variable selection, and the local false discovery rate. We develop an algorithm that uses the augmented data-Knockoff matrix and implements the Bayesian Lasso. We then identify signals using test statistics based on Markov Chain Monte Carlo outputs and the local false discovery rate. We test our proposed methods against non-Bayesian methods such as Benjamini–Hochberg (BHq) and Lasso regression in terms of TPR and FDR. Using numerical studies, we show that the proposed method yields lower FDR than BHq and Lasso in certain cases, such as the low-dimensional and equi-dimensional cases. We also discuss an application to an HIV-1 data set, with the aim of applying the method in future work to the analysis of genetic markers linked to drug-resistant HIV in the Philippines.
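
    The BHq baseline named in this abstract is the standard Benjamini–Hochberg step-up procedure. The following is a minimal sketch of that baseline only, not of the authors' Knockoff-based method; the function name `benjamini_hochberg`, the default level `q`, and the example p-values are illustrative assumptions.

```python
def benjamini_hochberg(p_values, q=0.1):
    """Return the indices of hypotheses rejected by the BH step-up rule.

    p_values: list of p-values, one per hypothesis (illustrative input).
    q: target false discovery rate (assumed default, not from the abstract).
    """
    m = len(p_values)
    # Indices of the hypotheses, ordered from smallest to largest p-value.
    order = sorted(range(m), key=lambda i: p_values[i])

    # Find the largest rank k such that p_(k) <= q * k / m.
    k_max = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= q * rank / m:
            k_max = rank

    # Reject every hypothesis ranked at or below k_max (step-up behaviour).
    return sorted(order[:k_max])


if __name__ == "__main__":
    # Made-up p-values, for illustration only.
    example = [0.01, 0.03, 0.035, 0.9]
    print(benjamini_hochberg(example, q=0.05))
```

    Because the rule is step-up, a hypothesis whose p-value misses its own threshold can still be rejected when a larger p-value further down the ordering meets its threshold.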

    BAYESIAN LOCAL FALSE DISCOVERY RATE FOR SPARSE COUNT DATA WITH APPLICATION TO THE DISCOVERY OF HOTSPOTS IN PROTEIN DOMAINS

    No full text
    In cancer research at the molecular level, it is critical to understand which somatic mutations play an important role in the initiation or progression of cancer. Recently, studying cancer somatic variants at the protein domain level has become an important area for uncovering functionally related somatic mutations. The main issue is to find the protein domain hotspots, which have a significantly high frequency of mutations. Multiple testing procedures are commonly used to identify hotspots; however, when the data set is not large enough, existing methods produce unreliable results and fail to control a given type I error rate. We propose multiple testing procedures, based on the Bayesian local false discovery rate, for sparse count data and apply them to the identification of clusters of somatic mutations across entire gene families using protein domain models. In multiple testing for count data, it is not clear what kind of null distribution should be assumed. In our proposed algorithms, we implement the zero assumption in the context of Bayesian methods to identify the null distribution for count data rather than using any theoretical null distribution. Furthermore, we also address different types of modeling of alternative distributions. The proposed fully Bayesian models are efficient when the number of observations is small (50 <= N < 200), while the local false discovery rate procedures based on empirical Bayes are preferable for a large number of observations (N > 800). We provide numerical studies to show that the proposed fully Bayesian methods can control a given level of false discovery rate for a small number of positions, while existing approaches based on nonparametric empirical Bayes fail to control the false discovery rate. In addition, we present real data examples in which hotspots are selected from protein domain data.
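
    For orientation, the local false discovery rate at an observed count x is commonly defined as fdr(x) = pi0 * f0(x) / f(x), where f0 is the null density, f the marginal density, and pi0 the null proportion. The sketch below illustrates that definition only. It is not the authors' procedure: it plugs in a theoretical Poisson null with a known rate and a fixed pi0, which is exactly the simplification the abstract says the proposed methods avoid by estimating the null under the zero assumption. The names `poisson_pmf`, `local_fdr`, `null_rate`, and `pi0`, and the example counts, are hypothetical.

```python
from collections import Counter
from math import exp, factorial


def poisson_pmf(k, lam):
    """Poisson probability mass at count k with rate lam."""
    return exp(-lam) * lam ** k / factorial(k)


def local_fdr(counts, null_rate, pi0=1.0):
    """Estimate fdr(k) = pi0 * f0(k) / f(k) for each observed count k.

    counts: observed mutation counts per position (illustrative input).
    null_rate: rate of the ASSUMED Poisson null (a stand-in, see lead-in).
    pi0: ASSUMED proportion of null positions.
    """
    n = len(counts)
    freq = Counter(counts)
    fdr = {}
    for k, c in freq.items():
        f_hat = c / n  # empirical estimate of the marginal mass f(k)
        # Cap at 1 because the plug-in ratio can exceed 1 in small samples.
        fdr[k] = min(1.0, pi0 * poisson_pmf(k, null_rate) / f_hat)
    return fdr


if __name__ == "__main__":
    # Made-up counts: mostly low values plus a few extreme positions.
    counts = [0] * 50 + [1] * 30 + [2] * 15 + [12] * 5
    for k, value in sorted(local_fdr(counts, null_rate=1.0, pi0=0.9).items()):
        print(k, round(value, 4))
```

    Positions whose estimated fdr falls below a chosen cutoff would be reported as candidate hotspots. With few positions the empirical estimate of f is noisy, which is consistent with the abstract's motivation for a fully Bayesian treatment in the small-N setting.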

    Oncodomains: A protein domain-centric framework for analyzing rare variants in tumor samples

    No full text
    The fight against cancer is hindered by its highly heterogeneous nature. Genome-wide sequencing studies have shown that individual malignancies contain many mutations that range from those commonly found in tumor genomes to rare somatic variants present only in a small fraction of lesions. Such rare somatic variants dominate the landscape of genomic mutations in cancer, yet efforts to correlate somatic mutations found in one or few individuals with functional roles have been largely unsuccessful. Traditional methods for identifying somatic variants that drive cancer are ‘gene-centric’ in that they consider only somatic variants within a particular gene and make no comparison to other similar genes in the same family that may play a similar role in cancer. In this work, we present oncodomain hotspots, a new ‘domain-centric’ method for identifying clusters of somatic mutations across entire gene families using protein domain models. Our analysis confirms that our approach creates a framework for leveraging structural and functional information encapsulated by protein domains into the analysis of somatic variants in cancer, enabling the assessment of even rare somatic variants by comparison to similar genes. Our results reveal a vast landscape of somatic variants that act at the level of domain families altering pathways known to be involved with cancer such as protein phosphorylation, signaling, gene regulation, and cell metabolism. Due to oncodomain hotspots’ unique ability to assess rare variants, we expect our method to become an important tool for the analysis of sequenced tumor genomes, complementing existing methods.

    Depiction of the process of mapping variants to domain positions to find oncodomain hotspots.

    No full text
    Depiction of the process of mapping variants to domain positions to find oncodomain hotspots.

    Heatmap of Patients with a Variant in an Oncodomain Hotspot for the PKc domain.

    No full text
    Visual representation and hierarchical clustering of oncodomain hotspots on genes that were significant in MutSigCV. For each gene in each cancer type, the number of patients in oncodomain hotspots is quantified, and the cell is color-coded if the gene had any patients in oncodomain hotspots (blue), if it was significant in MutSigCV (green), or both (purple). Only the top ten genes based on the gene name’s co-occurrence with the “cancer” MeSH term are shown. Here, cancer types are grouped via hierarchical clustering to show similar mutational patterns. Each cell gives the number of patients with a somatic variant in an oncodomain hotspot (numerator) relative to the number of patients that had a somatic variant anywhere in the protein domain region (denominator).

    Hotspot frequency of the Ras-like GTPase oncodomain and the calcium binding Epidermal Growth Factor domain.

    No full text
    Structural representations of the Ras-like GTPase (cd00882) oncodomain family (A) and the calcium binding domain of the epidermal growth factor-like (cd00054) oncodomain family (B).

    Overlap of oncodomain hotspots with the active site of the catalytic domain of protein kinases.

    No full text
    Structural representation of the frequency of oncodomain hotspots across 20 cancer types (A) compared to the active site residues (B) for the PKc / cd00180 oncodomain.