
    Genetic Studies of Complex Human Diseases: Characterizing SNP-Disease Associations Using Bayesian Networks

    BACKGROUND: Detecting epistatic interactions plays a significant role in understanding the pathogenesis of complex human diseases and in improving their prevention, diagnosis, and treatment. Applying machine learning or statistical methods to epistatic interaction detection runs into several common problems, e.g., a very limited number of samples, an extremely large search space, a large number of false positives, and the difficulty of measuring the association between disease markers and the phenotype. RESULTS: To address these problems, we propose a score-based Bayesian network structure learning method, EpiBN, to detect epistatic interactions. We apply the proposed method to both simulated datasets and three real disease datasets. Experimental results on simulated data show that our method outperforms some other commonly used methods in terms of power and sample efficiency, and is especially suitable for detecting epistatic interactions with weak or no marginal effects. Furthermore, our method is scalable to real disease data. CONCLUSIONS: We propose a Bayesian network-based method, EpiBN, to detect epistatic interactions. In EpiBN, we develop a new scoring function, which can reflect higher-order epistatic interactions by estimating the model complexity from data, and apply a fast Branch-and-Bound algorithm to learn the structure of a two-layer Bayesian network containing only one target node. To make our method scalable to real data, we propose the use of a Markov chain Monte Carlo (MCMC) method to perform the screening process. Applications of the proposed method to some real GWAS (genome-wide association studies) datasets may provide helpful insights into the genetic basis of age-related macular degeneration, late-onset Alzheimer's disease, and autism.
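
    The phrase "epistatic interactions with weak or no marginal effects" can be made concrete with a toy example. The sketch below is an illustration only and is not the EpiBN scoring function: it assumes two hypothetical binary SNPs and a phenotype given by their exclusive-or, and uses plain mutual information as a stand-in association measure. Each SNP alone carries no information about the phenotype, while the pair determines it completely.

```python
from collections import Counter
from math import log2


def mutual_information(xs, ys):
    """Empirical mutual information (in bits) between two discrete sequences."""
    n = len(xs)
    joint = Counter(zip(xs, ys))
    px = Counter(xs)
    py = Counter(ys)
    mi = 0.0
    for (x, y), count in joint.items():
        # p(x, y) * log2( p(x, y) / (p(x) * p(y)) ), written with raw counts
        mi += (count / n) * log2(count * n / (px[x] * py[y]))
    return mi


# Hypothetical balanced toy data: every SNP combination appears 25 times.
snp_a = [0, 0, 1, 1] * 25
snp_b = [0, 1, 0, 1] * 25
# Assumed pure-epistasis model: disease status is the XOR of the two SNPs.
phenotype = [a ^ b for a, b in zip(snp_a, snp_b)]

mi_a = mutual_information(snp_a, phenotype)                       # single SNP
mi_b = mutual_information(snp_b, phenotype)                       # single SNP
mi_pair = mutual_information(list(zip(snp_a, snp_b)), phenotype)  # SNP pair

print(f"MI(A; Y) = {mi_a:.3f}, MI(B; Y) = {mi_b:.3f}, MI(A,B; Y) = {mi_pair:.3f}")
```

    A single-marker scan would rank both SNPs as uninformative here, which is why methods that score sets of markers jointly are needed for this kind of signal.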

    bNEAT: a Bayesian network method for detecting epistatic interactions in genome-wide association studies.

    BACKGROUND: Detecting epistatic interactions plays a significant role in understanding the pathogenesis of complex human diseases and in improving their prevention, diagnosis and treatment. A recent study on the automatic detection of epistatic interactions shows that Markov Blanket-based methods are capable of finding genetic variants strongly associated with common diseases and of reducing false positives when the number of instances is large. Unfortunately, a typical dataset from genome-wide association studies contains only a very limited number of examples, and current methods, including Markov Blanket-based methods, may perform poorly on it. RESULTS: To address small sample problems, we propose a Bayesian network-based approach (bNEAT) to detect epistatic interactions. The proposed method also employs a Branch-and-Bound technique for learning. We apply the proposed method to simulated datasets based on four disease models and to a real dataset. Experimental results show that our method outperforms Markov Blanket-based methods and other commonly used methods, especially when the number of samples is small. CONCLUSIONS: Our results show that bNEAT can obtain strong power regardless of the number of samples and is especially suitable for detecting epistatic interactions with slight or no marginal effects. The merits of the proposed approach lie in two aspects: a suitable score for Bayesian network structure learning that can reflect higher-order epistatic interactions, and a heuristic Bayesian network structure learning method.


    Statistical Methods for Aggregation of Sequence Data and Multiple Testing Correction in Common and Rare Variant Analysis

    Over the last fifteen years, there have been substantial improvements in how we study the association between traits and genetic variation in the human genome. Genome-wide association studies (GWAS) now routinely test millions of variants in hundreds of thousands of individuals, and advances in genome sequencing technology allow us to examine the role of genetic variants across the full allele-frequency spectrum. However, with these changes come new challenges in analyzing and interpreting genetic results. In this dissertation, we present methods to aggregate sequence data and identify significant associations in common and rare variant analysis. In chapter two, we compare two strategies to aggregate sequence data from multiple studies: joint variant calling of all samples together versus calling each study individually and then combining the results using meta-analysis. Although joint calling is the gold standard, single-study calling can be more appealing due to fewer privacy restrictions and a smaller computational burden. We use deep- and low-coverage sequence data on 2,250 samples from the GoT2D study to compare the two strategies in terms of variant detection sensitivity, genotype accuracy, and association power. We show single-study calling to be a viable alternative to joint calling for deep-coverage sequence data, but find noticeable discrepancies between the two strategies in rare variant calling and association results for low-coverage sequence data. In chapter three, we revisit the common variant P-value significance threshold of 5e-8 and explore the rates of true and false discoveries that can be expected using less restrictive P-value thresholds and three other multiple testing procedures: Benjamini-Hochberg (BH) and Benjamini-Yekutieli (BY) for controlling false discovery rate (FDR), and Bayesian false discovery probability for controlling Bayesian FDR. Using data from the Global Lipids and GIANT consortia, we show for large-sample common variant GWAS that using a less stringent P-value threshold of 5e-7, or using the BH procedure at a target FDR threshold of 5%, substantially increases the number of true positive discoveries while only modestly increasing false positive discoveries compared with the 5e-8 threshold. The latter threshold remains appropriate for modest-sized studies or for resource-intensive follow-ups, such as constructing animal models, where a stringently curated list of significant loci is desired from GWAS. In chapter four, we propose a Bayesian method for multiple testing correction in rare variant studies that calculates posterior probabilities using an approximation of the Bayes factor and estimates prior parameters from summary statistics using an Expectation-Maximization algorithm. Using simulation analyses of ~400,000 individuals and ~107 million variants from the TOPMed-imputed UK Biobank study, we show that our Bayesian method discovers more true positive loci than P-value-based methods such as the P-value threshold, BH, and BY procedures at equivalent false positive rates. In addition, we show that the Bayesian method controls empirical FDR among discovered loci. Finally, we estimate the genome-wide significant P-value threshold for testing ~107 million variants from the TOPMed imputation reference panel to be 1e-9.
    PhD, Biostatistics, University of Michigan, Horace H. Rackham School of Graduate Studies. http://deepblue.lib.umich.edu/bitstream/2027.42/162936/1/zhongshc_1.pd
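
    The contrast drawn above between a fixed P-value threshold and the Benjamini-Hochberg procedure can be sketched in a few lines. This is a minimal textbook version of the BH step-up rule, not the dissertation's code; the function names and the example P-values are hypothetical, and no genomic data are involved.

```python
from typing import List


def benjamini_hochberg(p_values: List[float], fdr: float = 0.05) -> List[bool]:
    """Flag which hypotheses the BH step-up procedure rejects at the target FDR."""
    m = len(p_values)
    if m == 0:
        return []
    # Indices of the tests, ordered from smallest to largest P-value.
    order = sorted(range(m), key=lambda i: p_values[i])
    # Largest 1-based rank k whose ordered P-value satisfies p_(k) <= (k / m) * fdr.
    max_k = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank / m * fdr:
            max_k = rank
    # Reject every hypothesis ranked at or below max_k (the "step-up" part).
    rejected = [False] * m
    for idx in order[:max_k]:
        rejected[idx] = True
    return rejected


def fixed_threshold(p_values: List[float], threshold: float = 5e-8) -> List[bool]:
    """Flag which hypotheses pass a fixed P-value cut-off such as 5e-8."""
    return [p <= threshold for p in p_values]


# Hypothetical P-values for six variants, for illustration only.
example = [1e-9, 3e-8, 2e-7, 0.01, 0.04, 0.5]
print("fixed 5e-8:", fixed_threshold(example))
print("BH at 5%:  ", benjamini_hochberg(example, fdr=0.05))
```

    With so few tests the BH cut-off is far more permissive than 5e-8; in a real GWAS the number of tests m is in the millions, so the per-rank thresholds are much smaller than in this toy example.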

    UASIS: Universal Automatic SNP Identification System

    BACKGROUND: SNPs (single nucleotide polymorphisms), the most common genetic variations among humans, are believed to be a promising route towards personalized medicine. As more and more research on SNPs is conducted, non-standard nomenclatures may cause problems. The most serious issue is that researchers cannot cross-reference SNPs among different SNP databases, so more resources and time are required to track SNPs. This could be detrimental to the entire academic community. RESULTS: UASIS (Universal Automated SNP Identification System) is a web-based server for SNP nomenclature standardization and translation at the DNA level. Three utilities are available: UASIS Aligner, Universal SNP Name Generator and SNP Name Mapper. UASIS efficiently maps SNPs from different databases, including dbSNP, GWAS, HapMap and JSNP, into a uniform view using a proposed universal nomenclature and state-of-the-art alignment algorithms. UASIS is freely available at http://www.uasis.tk with no log-in required. CONCLUSIONS: UASIS is a helpful platform for SNP cross-referencing and tracking. By providing an informative, unique and unambiguous nomenclature, which uses the unique position of a SNP, we aim to resolve the ambiguity of the SNP nomenclatures currently in use. Our universal nomenclature is a good complement to mainstream SNP notations such as rs# and the HGVS guidelines. UASIS acts as a bridge connecting heterogeneous representations of SNPs.

    DETECTING CANCER-RELATED GENES AND GENE-GENE INTERACTIONS BY MACHINE LEARNING METHODS

    To understand the underlying molecular mechanisms of cancer, and thereby to better understand its pathogenesis and improve its prevention, diagnosis and treatment, it is necessary to explore the activities of cancer-related genes and the interactions among these genes. In this dissertation, I use machine learning and computational methods to identify differential gene relations and detect gene-gene interactions. To identify gene pairs that have different relationships in normal versus cancer tissues, I develop an integrative method based on the bootstrapping K-S test to evaluate a large number of microarray datasets. The experimental results demonstrate that my method can find meaningful alterations in gene relations. For gene-gene interaction detection, I propose two Bayesian network-based methods, DASSO-MB (Detection of ASSOciations using Markov Blanket) and EpiBN (Epistatic interaction detection using Bayesian Network model), to address the two critical challenges of searching and scoring. DASSO-MB is based on the concept of the Markov Blanket in Bayesian networks. In EpiBN, I develop a new scoring function, which can reflect higher-order gene-gene interactions and detect the true number of disease markers, and apply a fast Branch-and-Bound (B&B) algorithm to learn the structure of the Bayesian network. Both DASSO-MB and EpiBN outperform some other commonly used methods and are scalable to genome-wide data.

    Database resources of the National Center for Biotechnology Information

    In addition to maintaining the GenBank® nucleic acid sequence database, the National Center for Biotechnology Information (NCBI) provides analysis and retrieval resources for the data in GenBank and other biological data made available through the NCBI web site. NCBI resources include Entrez, the Entrez Programming Utilities, MyNCBI, PubMed, PubMed Central, Entrez Gene, the NCBI Taxonomy Browser, BLAST, BLAST Link (BLink), Electronic PCR, OrfFinder, Spidey, Splign, RefSeq, UniGene, HomoloGene, ProtEST, dbMHC, dbSNP, Cancer Chromosomes, Entrez Genomes and related tools, the Map Viewer, Model Maker, Evidence Viewer, Clusters of Orthologous Groups (COGs), Retroviral Genotyping Tools, HIV-1/Human Protein Interaction Database, Gene Expression Omnibus (GEO), Entrez Probe, GENSAT, Online Mendelian Inheritance in Man (OMIM), Online Mendelian Inheritance in Animals (OMIA), the Molecular Modeling Database (MMDB), the Conserved Domain Database (CDD), the Conserved Domain Architecture Retrieval Tool (CDART) and the PubChem suite of small molecule databases. Augmenting many of the web applications is a custom implementation of the BLAST program optimized to search specialized data sets. All of the resources can be accessed through the NCBI home page at www.ncbi.nlm.nih.gov.