35,060 research outputs found

    Linkage Disequilibrium Mapping via Cladistic Analysis of Single-Nucleotide Polymorphism Haplotypes

    Get PDF
    We present a novel approach to disease-gene mapping via cladistic analysis of single-nucleotide polymorphism (SNP) haplotypes obtained from large-scale, population-based association studies, applicable to whole-genome screens, candidate-gene studies, or fine-scale mapping. Clades of haplotypes are tested for association with disease, exploiting the expected similarity of chromosomes with recent shared ancestry in the region flanking the disease gene. The method is developed in a logistic-regression framework and can easily incorporate covariates such as environmental risk factors or additional unlinked loci to allow for population structure. To evaluate the power of this approach to detect disease-marker association, we have developed a simulation algorithm to generate high-density SNP data with short-range linkage disequilibrium based on empirical patterns of haplotype diversity. The results of the simulation study highlight substantial gains in power over single-locus tests for a wide range of disease models, despite overcorrection for multiple testing

    Routes for breaching and protecting genetic privacy

    Full text link
    We are entering the era of ubiquitous genetic information for research, clinical care, and personal curiosity. Sharing these datasets is vital for rapid progress in understanding the genetic basis of human diseases. However, one growing concern is the ability to protect the genetic privacy of the data originators. Here, we technically map threats to genetic privacy and discuss potential mitigation strategies for privacy-preserving dissemination of genetic data.Comment: Draft for comment

    Computing Petaflops over Terabytes of Data: The Case of Genome-Wide Association Studies

    Full text link
    In many scientific and engineering applications, one has to solve not one but a sequence of instances of the same problem. Often times, the problems in the sequence are linked in a way that allows intermediate results to be reused. A characteristic example for this class of applications is given by the Genome-Wide Association Studies (GWAS), a widely spread tool in computational biology. GWAS entails the solution of up to trillions (101210^{12}) of correlated generalized least-squares problems, posing a daunting challenge: the performance of petaflops (101510^{15} floating-point operations) over terabytes of data. In this paper, we design an algorithm for performing GWAS on multi-core architectures. This is accomplished in three steps. First, we show how to exploit the relation among successive problems, thus reducing the overall computational complexity. Then, through an analysis of the required data transfers, we identify how to eliminate any overhead due to input/output operations. Finally, we study how to decompose computation into tasks to be distributed among the available cores, to attain high performance and scalability. With our algorithm, a GWAS that currently requires the use of a supercomputer may now be performed in matter of hours on a single multi-core node. The discussion centers around the methodology to develop the algorithm rather than the specific application. We believe the paper contributes valuable guidelines of general applicability for computational scientists on how to develop and optimize numerical algorithms

    A preliminary study of genetic factors that influence susceptibility to bovine tuberculosis in the British cattle herd

    Get PDF
    Associations between specific host genes and susceptibility to Mycobacterial infections such as tuberculosis have been reported in several species. Bovine tuberculosis (bTB) impacts greatly the UK cattle industry, yet genetic predispositions have yet to be identified. We therefore used a candidate gene approach to study 384 cattle of which 160 had reacted positively to an antigenic skin test (‘reactors’). Our approach was unusual in that it used microsatellite markers, embraced high breed diversity and focused particularly on detecting genes showing heterozygote advantage, a mode of action often overlooked in SNP-based studies. A panel of neutral markers was used to control for population substructure and using a general linear model-based approach we were also able to control for age. We found that substructure was surprisingly weak and identified two genomic regions that were strongly associated with reactor status, identified by markers INRA111 and BMS2753. In general the strength of association detected tended to vary depending on whether age was included in the model. At INRA111 a single genotype appears strongly protective with an overall odds ratio of 2.2, the effect being consistent across nine diverse breeds. Our results suggest that breeding strategies could be devised that would appreciably increase genetic resistance of cattle to bTB (strictly, reduce the frequency of incidence of reactors) with implications for the current debate concerning badger-culling

    Methodological Issues in Multistage Genome-Wide Association Studies

    Full text link
    Because of the high cost of commercial genotyping chip technologies, many investigations have used a two-stage design for genome-wide association studies, using part of the sample for an initial discovery of ``promising'' SNPs at a less stringent significance level and the remainder in a joint analysis of just these SNPs using custom genotyping. Typical cost savings of about 50% are possible with this design to obtain comparable levels of overall type I error and power by using about half the sample for stage I and carrying about 0.1% of SNPs forward to the second stage, the optimal design depending primarily upon the ratio of costs per genotype for stages I and II. However, with the rapidly declining costs of the commercial panels, the generally low observed ORs of current studies, and many studies aiming to test multiple hypotheses and multiple endpoints, many investigators are abandoning the two-stage design in favor of simply genotyping all available subjects using a standard high-density panel. Concern is sometimes raised about the absence of a ``replication'' panel in this approach, as required by some high-profile journals, but it must be appreciated that the two-stage design is not a discovery/replication design but simply a more efficient design for discovery using a joint analysis of the data from both stages. Once a subset of highly-significant associations has been discovered, a truly independent ``exact replication'' study is needed in a similar population of the same promising SNPs using similar methods.Comment: Published in at http://dx.doi.org/10.1214/09-STS288 the Statistical Science (http://www.imstat.org/sts/) by the Institute of Mathematical Statistics (http://www.imstat.org

    Solving Sequences of Generalized Least-Squares Problems on Multi-threaded Architectures

    Full text link
    Generalized linear mixed-effects models in the context of genome-wide association studies (GWAS) represent a formidable computational challenge: the solution of millions of correlated generalized least-squares problems, and the processing of terabytes of data. We present high performance in-core and out-of-core shared-memory algorithms for GWAS: By taking advantage of domain-specific knowledge, exploiting multi-core parallelism, and handling data efficiently, our algorithms attain unequalled performance. When compared to GenABEL, one of the most widely used libraries for GWAS, on a 12-core processor we obtain 50-fold speedups. As a consequence, our routines enable genome studies of unprecedented size

    Network-based approaches to explore complex biological systems towards network medicine

    Get PDF
    Network medicine relies on different types of networks: from the molecular level of protein–protein interactions to gene regulatory network and correlation studies of gene expression. Among network approaches based on the analysis of the topological properties of protein–protein interaction (PPI) networks, we discuss the widespread DIAMOnD (disease module detection) algorithm. Starting from the assumption that PPI networks can be viewed as maps where diseases can be identified with localized perturbation within a specific neighborhood (i.e., disease modules), DIAMOnD performs a systematic analysis of the human PPI network to uncover new disease-associated genes by exploiting the connectivity significance instead of connection density. The past few years have witnessed the increasing interest in understanding the molecular mechanism of post-transcriptional regulation with a special emphasis on non-coding RNAs since they are emerging as key regulators of many cellular processes in both physiological and pathological states. Recent findings show that coding genes are not the only targets that microRNAs interact with. In fact, there is a pool of different RNAs—including long non-coding RNAs (lncRNAs) —competing with each other to attract microRNAs for interactions, thus acting as competing endogenous RNAs (ceRNAs). The framework of regulatory networks provides a powerful tool to gather new insights into ceRNA regulatory mechanisms. Here, we describe a data-driven model recently developed to explore the lncRNA-associated ceRNA activity in breast invasive carcinoma. On the other hand, a very promising example of the co-expression network is the one implemented by the software SWIM (switch miner), which combines topological properties of correlation networks with gene expression data in order to identify a small pool of genes—called switch genes—critically associated with drastic changes in cell phenotype. Here, we describe SWIM tool along with its applications to cancer research and compare its predictions with DIAMOnD disease genes
    • …
    corecore