156 research outputs found

    Statistical advances and challenges for analyzing correlated high dimensional SNP data in genomic study for complex diseases

    Recent advances in information technology in biomedical sciences and other applied areas have created numerous large, diverse data sets with high-dimensional feature spaces, which provide a tremendous amount of information and new opportunities for improving the quality of human life. At the same time, the continuous arrival of new data creates great challenges, since researchers must convert these raw data into scientific knowledge in order to benefit from them. Association studies of complex diseases using SNP data have become increasingly popular in biomedical research in recent years. In this paper, we review recent statistical advances and challenges in analyzing correlated high-dimensional SNP data in genomic association studies of complex diseases. The review covers both general feature reduction approaches for high-dimensional correlated data and approaches specific to SNP data, which include unsupervised haplotype mapping, tag SNP selection, and supervised SNP selection using statistical testing/scoring, statistical modeling and machine learning methods, with an emphasis on how to identify interacting loci. Comment: Published at http://dx.doi.org/10.1214/07-SS026 in the Statistics Surveys (http://www.i-journals.org/ss/) by the Institute of Mathematical Statistics (http://www.imstat.org

    Algorithms for Computational Genetics Epidemiology

    The most intriguing problems in genetic epidemiology are to predict genetic disease susceptibility and to associate single nucleotide polymorphisms (SNPs) with diseases. In such studies, it is necessary to resolve the ambiguities in genetic data. The primary obstacle to ambiguity resolution is that the physical methods for separating two haplotypes from an individual genotype (phasing) are too expensive. Although computational haplotype inference is a well-explored problem, high error rates continue to degrade association accuracy. Secondly, it is essential to use a small subset of informative SNPs (tag SNPs) that accurately represents the rest of the SNPs (tagging). Tagging can achieve budget savings by genotyping only a limited number of SNPs and computationally inferring all other SNPs. Recent successes in high-throughput genotyping technologies have drastically increased the length of available SNP sequences. This raises the importance of informative SNP selection for compacting huge genetic data sets so that fine genotype analysis becomes feasible. Finally, even if complete and accurate data are available, it is unclear whether common statistical methods can determine the susceptibility to complex diseases. The dissertation explores the above computational problems with a variety of methods, including linear algebra, graph theory, linear programming, and greedy methods. The contributions include (1) a significant speed-up of popular phasing tools without compromising their quality, (2) state-of-the-art tagging tools applied to disease association, and (3) a graph-based method for disease tagging and predicting disease susceptibility.
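    The tagging idea in the abstract above can be illustrated with a minimal sketch. It is not the dissertation's method: it assumes genotypes coded as minor-allele counts (0/1/2), uses squared Pearson correlation as a stand-in similarity measure, and picks tags with a simple greedy cover. The function names and the 0.8 threshold are hypothetical choices for illustration.

```python
from itertools import combinations


def r_squared(x, y):
    """Squared Pearson correlation between two genotype vectors (0/1/2 coding)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0  # a monomorphic SNP carries no information about others
    return cov * cov / (var_x * var_y)


def greedy_tag_snps(genotypes, threshold=0.8):
    """Pick tag SNPs so that every SNP is a tag or has r^2 >= threshold with a tag.

    genotypes: dict mapping SNP name -> list of minor-allele counts per individual.
    The threshold is an illustrative assumption, not a value from the source.
    """
    names = list(genotypes)
    # covers[s] = set of SNPs that s could represent (always includes s itself)
    covers = {s: {s} for s in names}
    for a, b in combinations(names, 2):
        if r_squared(genotypes[a], genotypes[b]) >= threshold:
            covers[a].add(b)
            covers[b].add(a)

    uncovered = set(names)
    tags = []
    while uncovered:
        # Greedy step: take the SNP that represents the most uncovered SNPs.
        best = max(names, key=lambda s: len(covers[s] & uncovered))
        tags.append(best)
        uncovered -= covers[best]
    return tags


example = {
    "snp1": [0, 1, 2, 0, 1, 2],
    "snp2": [0, 1, 2, 0, 1, 2],  # identical to snp1
    "snp3": [2, 1, 0, 2, 1, 0],  # perfectly anti-correlated with snp1
    "snp4": [0, 0, 1, 2, 2, 1],  # uncorrelated with snp1
}
tags = greedy_tag_snps(example)
```

    In this toy input, one tag stands in for the three mutually correlated SNPs and the uncorrelated SNP must be genotyped separately, which is the budget saving the abstract refers to.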

    Interleukin-6 gene (IL-6): a possible role in brain morphology in the healthy adult brain

    Background: Cytokines such as interleukin 6 (IL-6) have been implicated in dual functions in neuropsychiatric disorders. Little is known about the genetic predisposition to neurodegenerative and neuroproliferative properties of cytokine genes. In this study the potential dual role of several IL-6 polymorphisms in brain morphology is investigated. Methodology: In a large sample of healthy individuals (N = 303), associations between genetic variants of IL-6 (rs1800795, rs1800796, rs2069833, rs2069840) and brain volume (gray matter volume) were analyzed using voxel-based morphometry (VBM). Selection of single nucleotide polymorphisms (SNPs) followed a tagging SNP approach (e.g., Stampa algorithm), capturing 97.08% of the variation in the IL-6 gene with four tagging SNPs. Principal findings/results: In a whole-brain analysis, the polymorphism rs1800795 (−174 C/G) showed a strong main effect of genotype (43 CC vs. 150 CG vs. 100 GG; x = 24, y = −10, z = −15; F(2,286) = 8.54, p(uncorrected) = 0.0002; p(AlphaSim-corrected) = 0.002; cluster size k = 577) within the right hippocampus head. Homozygous carriers of the G-allele had significantly larger hippocampus gray matter volumes compared to heterozygous subjects. None of the other investigated SNPs showed a significant association with gray matter volume in whole-brain analyses. Conclusions/significance: These findings suggest a possible neuroprotective role of the G-allele of the SNP rs1800795 on hippocampal volumes. Studies on the role of this SNP in psychiatric populations and especially in those with an affected hippocampus (e.g., by maltreatment, stress) are warranted. Bernhard T Baune, Carsten Konrad, Dominik Grotegerd, Thomas Suslow, Eva Birosova, Patricia Ohrmann, Jochen Bauer, Volker Arolt, Walter Heindel, Katharina Domschke, Sonja Schöning, Astrid V Rauch, Christina Uhlmann, Harald Kugel and Udo Dannlowsk

    Leveraging large scale beef cattle genomic data to identify the architecture of polygenic selection and local adaptation

    Includes vita. Since the invention of the first array-based genotyping assay for cattle in 2008, millions of animals have been genotyped worldwide. Leveraging these genotypes offers exciting opportunities to explore both basic and applied research questions. Commercial genotyping assays have adequate variant density to perform well in prediction contexts but are not sufficient for mapping studies. Using reference panels made up of individuals genotyped at higher densities, we can statistically infer the missing variation of low-density assays through the process of imputation. Here, we explore the best practices for performing routine imputation in large commercially generated genomic datasets of U.S. beef cattle. We find that using a large multi-breed imputation reference maximizes accuracy, particularly for rare variants. Using three of these large, imputed datasets, we explore two major population genetics questions. First, we map polygenic selection in the bovine genome using Generation Proxy Selection Mapping (GPSM). This identifies hundreds of regions of the genome actively under selection in cattle populations. Using a similar approach, we identify dozens of genomic variants associated with environments across the U.S., likely involved in local adaptation. Understanding the genomic basis of local adaptation in cattle will enable us to select and breed cattle better suited to a changing climate. Includes bibliographical references (pages 203-228

    Harness the power of genomic selection and the potential of germplasm in crop breeding for global food security in the era with rapid climate change

    Crop genetic improvements catalysed population growth, which in turn has increased the pressure for food security. We need to produce 70% more food to meet the demands of 9.5 billion people by 2050. Climate change has posed challenges for global food supply, while the narrow genetic base of elite crop cultivars has further limited our capacity to increase genetic gain through conventional breeding. The effective utilization of genetic resources in germplasm collections for crop improvement is crucial to increasing genetic gain to address challenges in the global food supply. Genomic selection (GS) uses genome-wide markers and phenotype information from observed populations to establish associations, and then uses genome-wide markers to predict phenotypic values in test populations. Characterizing an extensive germplasm collection can serve a dual purpose in GS: as a reference population for the prediction model, and as a source for mining desirable genetic variants for incorporation into elite cultivars. New technologies, such as high-throughput genotyping and phenotyping, machine learning, and gene editing, have great potential to contribute to genome-assisted breeding. Breeding programmes integrating germplasm characterization, GS and emerging technologies offer promise for accelerating the development of cultivars with improved yield and enhanced resistance and tolerance to biotic and abiotic stresses. Finally, scientifically informed regulations on new breeding technologies, and increased sharing of genetic resources, genomic data, and bioinformatics expertise between developed and developing economies will be the key to meeting the challenges of the rapidly changing climate and increased demand for food.

    Expression quantitative trait loci as possible biomarkers on depression


    Sparse reduced-rank regression for imaging genetics studies: models and applications

    We present a novel statistical technique, the sparse reduced-rank regression (sRRR) model, which is a strategy for multivariate modelling of high-dimensional imaging responses and genetic predictors. By adopting penalisation techniques, the model is able to enforce sparsity in the regression coefficients, identifying subsets of genetic markers that best explain the variability observed in subsets of the phenotypes. To properly exploit the rich structure present in each of the imaging and genetics domains, we additionally propose the use of several structured penalties within the sRRR model. Using simulation procedures that accurately reflect realistic imaging genetics data, we present detailed evaluations of the sRRR method in comparison with the more traditional univariate linear modelling approach. In all settings considered, we show that sRRR possesses better power to detect the deleterious genetic variants. Moreover, using a simple genetic model, we demonstrate the potential benefits, in terms of statistical power, of carrying out voxel-wise searches as opposed to extracting averages over regions of interest in the brain. Since this entails the use of phenotypic vectors of enormous dimensionality, we suggest the use of a sparse classification model as a de-noising step prior to the imaging genetics study. We also present the application of a data re-sampling technique within the sRRR model for model selection. Using this approach we are able to rank the genetic markers in order of importance of association to the phenotypes, and similarly rank the phenotypes in order of importance to the genetic markers. Finally, we illustrate the application of the proposed statistical models to three real imaging genetics datasets and highlight some potential associations.

    Identification of disease related significant SNPs

    Single nucleotide polymorphisms (SNPs) are DNA sequence variations that occur when a single nucleotide in the genome sequence is altered. Since variations in DNA sequence can have a major impact on complex human diseases such as obesity, epilepsy, type 2 diabetes, and rheumatoid arthritis, SNPs have become increasingly significant in the identification of such complex diseases. Recent biological studies point out that a single altered gene may have a small effect on a complex disease, whereas interactions between multiple genes may have a significant role. Therefore, identifying multiple genes associated with complex disorders is essential. In this spirit, combinations of multiple SNPs rather than individual SNPs should be analyzed. However, assessing a very large number of SNP combinations is computationally challenging, and because of this challenge only a limited number of studies in the literature address extracting statistically significant SNP combinations. In this thesis work, we focus on this challenging problem and develop a five-step "disease-associated multi-SNP combinations search procedure" to identify statistically significant SNP combinations and the significant rules defining the associations between SNPs and a specified disease. The proposed five-step multi-SNP combinations procedure is applied to the simulated rheumatoid arthritis data set provided by Genetic Analysis Workshop 15. In each step, statistically significant SNPs are extracted from the available set of SNPs that are not yet classified as significant or insignificant. In the first step, the genome-wide association (GWA) analysis is performed on the original complete multi-family data set. Then, in the second step, we use the tag SNP selection algorithm to find a smaller subset of informative SNP markers. In the literature, most tag SNP selection methods are based on pairwise (two-marker) linkage disequilibrium (LD) measures.
But in this thesis, both the pairwise and multiple-marker LD measures have been incorporated to improve the genetic coverage. Up to the third step the procedure aims to identify individual significant SNPs. In the third step a genetic algorithm (GA) based feature selection method is performed. It provides a significant combination of SNPs, and the GA constructs this combination by maximizing the explanatory power of the selected SNPs while trying to decrease the number of selected SNPs dynamically. Since the GA is a probabilistic search approach, each execution may provide a different SNP combination. We apply the GA several times to obtain multiple significant SNP combinations, and for each combination we calculate the associated pseudo r-square values and apply some statistical tests to check its significance. We also consider the union and intersection of the SNP combinations identified by the GA as potentially significant SNP combinations. After identifying multiple statistically significant SNP combinations, in the fourth and fifth steps we focus on extracting rules to explain the association between the SNPs and the disease. In the fourth step we apply a classification method, called Decision Tree Forest, to calculate the importance values of individual SNPs that belong to at least one of the SNP combinations found by the GA. Since each marker in a SNP combination is in bi-allelic form, genotypes of a SNP can affect the disease status. Different genotypes of SNPs are considered to define candidate rules. Then, utilizing the calculated importance values and the occurrence percentage of the candidate rule in the data set, in the fifth step we perform our proposed rule extraction method to select the rules among the candidate ones. In the literature there are many classification approaches such as the decision tree, decision forest and random forest. Each of these methods considers SNP interactions which are explanatory for a large subset of patients.
However, in real life some SNP interactions that are observed only in a small subset of patients might cause the disease. The existing classification methods do not identify such interactions as significant. However, the proposed five-step multi-SNP combinations procedure extracts these interactions as well as the others. This is a significant contribution to the research on identifying significant interactions that may cause a human to have the disease.
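    The pairwise LD measures mentioned in the tag SNP selection step of the abstract above can be made concrete with a small sketch. It uses the standard definitions D = p_AB − p_A·p_B and r² = D² / (p_A(1 − p_A)·p_B(1 − p_B)) and assumes phased two-locus haplotypes with alleles coded 0/1. The function name and input layout are hypothetical and are not taken from the thesis.

```python
def pairwise_ld(haplotypes):
    """Return (D, r2) for two bi-allelic loci.

    haplotypes: list of (a, b) tuples, one per phased haplotype, where a and b
    are the alleles (coded 0 or 1) at the first and second locus.
    """
    n = len(haplotypes)
    p_a = sum(a for a, _ in haplotypes) / n          # frequency of allele 1 at locus A
    p_b = sum(b for _, b in haplotypes) / n          # frequency of allele 1 at locus B
    p_ab = sum(1 for a, b in haplotypes if a == 1 and b == 1) / n  # joint frequency

    d = p_ab - p_a * p_b                             # LD coefficient D
    denom = p_a * (1 - p_a) * p_b * (1 - p_b)
    r2 = d * d / denom if denom > 0 else 0.0         # undefined for monomorphic loci; report 0
    return d, r2


# Two loci that always travel together: complete LD.
linked = [(1, 1), (1, 1), (0, 0), (0, 0)]
# Every allele combination equally frequent: no LD.
independent = [(1, 1), (1, 0), (0, 1), (0, 0)]

d_linked, r2_linked = pairwise_ld(linked)
d_indep, r2_indep = pairwise_ld(independent)
```

    A pairwise tagging method would keep one SNP from each group of markers whose r² exceeds a chosen cutoff; the multiple-marker LD measures the abstract refers to extend this idea beyond two loci and are not sketched here.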