205 research outputs found
Statistical advances and challenges for analyzing correlated high dimensional SNP data in genomic study for complex diseases
Recent advances of information technology in biomedical sciences and other
applied areas have created numerous large diverse data sets with a high
dimensional feature space, which provide us a tremendous amount of information
and new opportunities for improving the quality of human life. Meanwhile, the
continuous arrival of new data also creates great challenges, because
researchers must convert these raw data into scientific knowledge in order to
benefit from them. Association studies of complex diseases using SNP
data have become more and more popular in biomedical research in recent years.
In this paper, we present a review of recent statistical advances and
challenges for analyzing correlated high dimensional SNP data in genomic
association studies for complex diseases. The review includes both general
feature reduction approaches for high dimensional correlated data and more
specific approaches for SNP data, which include unsupervised haplotype
mapping, tag SNP selection, and supervised SNP selection using statistical
testing/scoring, statistical modeling and machine learning methods with an
emphasis on how to identify interacting loci. Comment: Published at
http://dx.doi.org/10.1214/07-SS026 in the Statistics Surveys
(http://www.i-journals.org/ss/) by the Institute of Mathematical
Statistics (http://www.imstat.org)
Population diversity as quantified by inter-population variation in patterns of linkage disequilibrium
Ph.D. (Doctor of Philosophy)
Searching Genome-wide Disease Association Through SNP Data
Taking advantage of high-throughput Single Nucleotide Polymorphism (SNP) genotyping technology, Genome-Wide Association Studies (GWASs) are regarded as holding promise for unravelling complex relationships between genotype and phenotype. GWASs aim to identify genetic variants associated with disease by assaying and analyzing hundreds of thousands of SNPs. Traditional single-locus-based and two-locus-based methods have been standardized and have led to many interesting findings. Recently, a substantial number of GWASs have indicated that, for most disorders, joint genetic effects (epistatic interactions) across the whole genome exist broadly in complex traits. At present, identifying high-order epistatic interactions from GWASs is computationally and methodologically challenging.
My dissertation research focuses on the problem of searching for genome-wide associations under three frequently encountered scenarios, i.e., one case one control, multi-cases multi-controls, and Linkage Disequilibrium (LD) block structure. For the first scenario, we present a simple and fast method, named DCHE, using dynamic clustering. Also, we design two methods, a Bayesian inference based method and a heuristic method, to detect genome-wide multi-locus epistatic interactions on multiple diseases. For the last scenario, we propose a block-based Bayesian approach to model the LD and conditional disease association simultaneously. Experimental results on both synthetic and real GWAS datasets show that the proposed methods improve the detection accuracy of disease-specific associations and lessen the computational cost compared with current popular methods.
Algorithms for Computational Genetics Epidemiology
The most intriguing problems in genetics epidemiology are to predict genetic disease susceptibility and to associate single nucleotide polymorphisms (SNPs) with diseases. In such studies, it is necessary to resolve the ambiguities in genetic data. The primary obstacle for ambiguity resolution is that the physical methods for separating two haplotypes from an individual genotype (phasing) are too expensive. Although computational haplotype inference is a well-explored problem, high error rates continue to degrade association accuracy. Secondly, it is essential to use a small subset of informative SNPs (tag SNPs) that accurately represents the rest of the SNPs (tagging). Tagging can achieve budget savings by genotyping only a limited number of SNPs and computationally inferring all other SNPs. Recent successes in high-throughput genotyping technologies have drastically increased the length of available SNP sequences. This raises the importance of informative SNP selection for compacting huge genetic data sets so that fine genotype analysis becomes feasible. Finally, even if complete and accurate data are available, it is unclear whether common statistical methods can determine the susceptibility to complex diseases. The dissertation explores the above computational problems with a variety of methods, including linear algebra, graph theory, linear programming, and greedy methods. The contributions include (1) significant speed-up of popular phasing tools without compromising their quality, (2) state-of-the-art tagging tools applied to disease association, and (3) a graph-based method for disease tagging and predicting disease susceptibility.
Topics in Signal Processing: applications in genomics and genetics
The information in genomic or genetic data is influenced by various complex processes, and appropriate mathematical modeling is required for studying the underlying processes and the data. This dissertation focuses on the formulation of mathematical models for certain problems in genomics and genetics studies and on the development of algorithms that provide efficient solutions. A Bayesian approach to transcription factor (TF) motif discovery is examined, and extensions are proposed to deal with the many interdependent parameters of TF-DNA binding. The problem is described in statistical terms, and a sequential Monte Carlo sampling method is employed for the estimation of unknown parameters. In particular, a class-based resampling approach is applied for the accurate estimation of a set of intrinsic properties of the DNA binding sites. Through statistical analysis of gene expression, a motif-based computational approach is developed for the inference of novel regulatory networks in a given bacterial genome. To deal with high false-discovery rates in genome-wide TF binding predictions, discriminative learning approaches are examined in the context of sequence classification, and a novel mathematical model is introduced to the family of kernel-based Support Vector Machine classifiers. Furthermore, the problem of haplotype phasing is examined based on the genetic data obtained from cost-effective genotyping technologies. Based on the identification and augmentation of a small and relatively more informative genotype set, a sparse dictionary selection algorithm is developed to infer the haplotype pairs for the sampled population. In a related context, to detect redundant information in the single nucleotide polymorphism (SNP) sites, the problem of representative (tag) SNP selection is introduced. An information-theoretic heuristic is designed for the accurate selection of tag SNPs that capture the genetic diversity in a large sample set from multiple populations. The method is based on a multi-locus mutual information measure, reflecting linkage disequilibrium, a biological principle in population genetics.
Mining whole genome sequence data to efficiently attribute individuals to source populations
Acknowledgements: The Campylobacter work in this project was supported by Food Standards Scotland project FSS00017 and the Scottish Government (Rural and Environment Science and Analytical Services Division) project A13559368. Peer reviewed. Publisher PDF.
Discrete Algorithms for Analysis of Genotype Data
The accessibility of high-throughput genotyping technology makes genome-wide association studies of common complex diseases possible. When dealing with common diseases, it is necessary to search for and analyze multiple independent causes resulting from interactions of multiple genes scattered over the entire genome. Optimization formulations for searching for disease-associated risk/resistance factors and for predicting disease susceptibility in a given case-control study have been introduced. Several discrete methods for disease association search that exploit a greedy strategy and topological properties of case-control studies have been developed. New disease susceptibility prediction methods based on the developed search methods have been validated on datasets from case-control studies for several common diseases. In our experiments, the proposed algorithms compare favorably with the existing association search and susceptibility prediction methods.
Association of Interacting Genes in the Toll-Like Receptor Signaling Pathway and the Antibody Response to Pertussis Vaccination
BACKGROUND: Activation of the Toll-like receptor (TLR) signaling pathway through TLR4 may be important in the induction of protective immunity against Bordetella pertussis with TLR4-mediated activation of dendritic and B cells, induction of cytokine expression, and reversal of tolerance as crucial steps. We examined whether single nucleotide polymorphisms (SNPs) in genes of the TLR4 pathway and their interaction are associated with the response to whole-cell vaccine (WCV) pertussis vaccination in 490 one-year-old children. METHODOLOGY/PRINCIPAL FINDINGS: We analyzed associations of 75 haplotype-tagging SNPs in genes in the TLR4 signaling pathway with pertussis toxin (PT)-IgG titers. We found significant associations between the PT-IgG titer and SNPs in CD14, TLR4, TOLLIP, TIRAP, IRAK3, IRAK4, TICAM1, and TNFRSF4 in one or more of the analyses. The strongest evidence for association was found for two SNPs (rs5744034 and rs5743894) in TOLLIP that were almost completely in linkage disequilibrium, provided statistically significant associations in all tests with the lowest p-values, and displayed a dominant mode of inheritance. However, none of these single-gene associations would withstand correction for multiple testing. In addition, Multifactor Dimensionality Reduction Analysis, an approach that does not need correction for multiple testing, showed significant and strong two- and three-locus interactions between SNPs in TOLLIP (rs4963060), TLR4 (rs6478317) and IRAK1 (rs1059703). CONCLUSIONS/SIGNIFICANCE: We have identified significant interactions between genes in the TLR pathway in the induction of vaccine-induced immunity. These interactions underline that these genes are functionally related and together form a true biological relationship in a protein-protein interaction network. Practically all our findings may be explained by genetic variation in directly or indirectly interacting proteins at the extra- and intracytoplasmic sites of the cell membrane of antigen-presenting cells, B cells, or both. Fine tuning of interacting proteins in the TLR pathway appears important for the induction of an optimal vaccine response.