Statistical Methodology for Sequence Analysis
Rare disease variants have received increasing attention in the past few years as a potential cause of many complex diseases, after common disease variants failed to explain a large part of the missing heritability. With advances in sequencing techniques and computational capabilities, statistical methodology for analyzing rare variants is now an active research topic, especially in case-control association studies. In this thesis, we first present two related statistical methodologies designed for case-control studies to predict the number of common and rare variants in a particular genomic region underlying a complex disease. Genome-wide association studies (GWAS) are now routinely performed to identify a few putative marker loci or a candidate region for further analysis. These methods are designed to work with SNP data on such a genomic region highlighted by GWAS for potential disease variants. The fundamental idea is to use Bayesian methodology to obtain bivariate posterior distributions on counts of common and rare variants. While the first method uses randomly generated (minimal) ancestral recombination graphs, the second method uses an ensemble clustering method to explore the space of genealogical trees that represent the inherent structure in the test subjects. In contrast to these methods, which work with SNP data, the third chapter deals with next-generation sequencing data to detect the presence of rare variants in a genomic region. We present a non-parametric statistical methodology for rare variant association testing, using the well-known Kolmogorov-Smirnov framework adapted for genetic data. It is a fast, model-free, robust statistic, designed for situations where both deleterious and protective variants are present. It is also unique in utilizing the variant locations in the test statistic.
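As a rough illustration of how variant locations can enter a Kolmogorov-Smirnov-type statistic, the sketch below computes the classical two-sample KS distance between the empirical distributions of rare-allele positions in cases and in controls. This is the textbook statistic, not the adapted statistic developed in the thesis; the function name and the example positions are hypothetical.

```python
from bisect import bisect_right

def ks_statistic(case_positions, control_positions):
    """Classical two-sample KS statistic: the largest gap between two empirical CDFs."""
    cases = sorted(case_positions)
    controls = sorted(control_positions)
    if not cases or not controls:
        raise ValueError("both samples must be non-empty")
    d = 0.0
    # The empirical CDFs only change at observed positions, so it is
    # enough to compare them there.
    for x in set(cases) | set(controls):
        f_case = bisect_right(cases, x) / len(cases)
        f_ctrl = bisect_right(controls, x) / len(controls)
        d = max(d, abs(f_case - f_ctrl))
    return d

# Hypothetical base-pair positions of rare alleles seen in each group.
case_pos = [120, 135, 140, 150, 900]
control_pos = [100, 500, 650, 800, 950]
print(ks_statistic(case_pos, control_pos))
```

In practice, significance for such a statistic would typically be assessed by permuting case/control labels; that step is omitted here.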
Bayesian Model Comparison in Genetic Association Analysis: Linear Mixed Modeling and SNP Set Testing
We consider the problems of hypothesis testing and model comparison under a
flexible Bayesian linear regression model whose formulation is closely
connected with the linear mixed effect model and the parametric models for SNP
set analysis in genetic association studies. We derive a class of analytic
approximate Bayes factors and illustrate their connections with a variety of
frequentist test statistics, including the Wald statistic and the variance
component score statistic. Taking advantage of Bayesian model averaging and
hierarchical modeling, we demonstrate some distinct advantages and
flexibilities in the approaches utilizing the derived Bayes factors in the
context of genetic association studies. We demonstrate our proposed methods
using real or simulated numerical examples in applications of single SNP
association testing, multi-locus fine-mapping and SNP set association testing.
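To make the link between a Bayes factor and the Wald statistic concrete, here is a minimal sketch of the standard normal-approximation Bayes factor for a single SNP. It assumes the effect estimate is approximately N(beta, se^2) and places a N(0, prior_sd^2) prior on beta under the alternative. This is a well-known textbook form and is not claimed to be the class of Bayes factors derived in the paper; the function and parameter names are hypothetical.

```python
import math

def approx_bayes_factor(beta_hat, se, prior_sd):
    """Bayes factor for H1 (beta ~ N(0, prior_sd^2)) against H0 (beta = 0),
    assuming beta_hat ~ N(beta, se^2)."""
    v = se ** 2                      # sampling variance of the estimate
    w = prior_sd ** 2                # prior variance of the effect under H1
    z2 = (beta_hat / se) ** 2        # squared Wald statistic
    shrink = w / (v + w)
    # Ratio of the marginal densities N(beta_hat; 0, v + w) / N(beta_hat; 0, v).
    return math.sqrt(1.0 - shrink) * math.exp(0.5 * z2 * shrink)

# Hypothetical single-SNP summary statistics.
print(approx_bayes_factor(beta_hat=0.30, se=0.08, prior_sd=0.2))
```

For fixed `se` and `prior_sd`, the Bayes factor is a monotone function of the squared Wald statistic, which is the kind of connection the abstract refers to.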
A Strategy analysis for genetic association studies with known inbreeding
Background: Association studies aim to identify the genetic variants related to a specific disease through statistical multiple hypothesis testing or segregation analysis in pedigrees. This type of study has been very successful for Mendelian monogenic disorders, but less successful in identifying genetic variants related to complex diseases, whose onset depends on interactions between different genes and the environment. Current technology allows more than a million markers to be genotyped, and this number has been increasing rapidly in recent years with imputation based on template sets and whole-genome sequencing. This type of data introduces a great amount of noise into the statistical analysis and usually requires a large number of samples. Current methods seldom take into account gene-gene and gene-environment interactions, which are fundamental, especially in complex diseases. In this paper we propose to use a non-parametric additive model to detect the genetic variants related to diseases, which accounts for interactions of unknown order. Although this is not new to the current literature, we show that in an isolated population, where the most closely related subjects also share most of their genetic code, the use of additive models may be improved if the available genealogical tree is taken into account. Specifically, we form a sample of cases and controls with the highest inbreeding by means of the Hungarian method, and estimate the set of genes/environmental variables associated with the disease by means of Random Forest.
Results: We have evidence, from statistical theory, simulations and two applications, that we have built a suitable procedure to eliminate stratification between cases and controls, and that it also has enough precision in identifying genetic variants responsible for a disease. This procedure has been successfully applied to beta-thalassemia, which is a well-known Mendelian disease, and also to common asthma, where we have identified candidate genes that underlie susceptibility to asthma. Some of these candidate genes have also been found to be related to common asthma in the current literature.
Conclusions: The data analysis approach, based on selecting the most closely related cases and controls along with the Random Forest model, is a powerful tool for detecting genetic variants associated with a disease in isolated populations. Moreover, this method also provides a prediction model that is accurate in estimating the unknown disease status and that can be used generally to build test kits for a wide class of Mendelian diseases.
Statistical Methods For Detecting Genetic Risk Factors of a Disease with Applications to Genome-Wide Association Studies
This thesis aims to develop various statistical methods for analysing the data derived from genome wide association studies (GWAS).
GWAS often involve genotyping human genetic variation in thousands of individuals, using high-throughput genome-wide single nucleotide polymorphism (SNP) arrays, and testing for association between those variants and a given disease under the common disease/common variant assumption.
Although GWAS have identified many potential genetic factors in the genome that affect the risk of complex diseases, much of the genetic heritability remains unexplained. The power of detecting new genetic risk variants can be improved by considering multiple genetic variants simultaneously with novel statistical methods.
Improving the analysis of the GWAS data has received much attention from statisticians and other scientific researchers over the past decade.
Several challenges arise in analysing GWAS data. First, determining the risk SNPs might be difficult due to non-random correlation between SNPs, which can inflate type I and II errors in statistical inference. When a group of SNPs is considered together in the context of haplotypes/genotypes, the distribution of the haplotypes/genotypes is sparse, which makes it difficult to detect risk haplotypes/genotypes in terms of disease penetrance.
In this work, we proposed four new methods to identify risk haplotypes/genotypes based on their frequency differences between cases and controls. To evaluate the performance of our methods, we simulated datasets under a wide range of scenarios according to both retrospective and prospective designs.
In the first method, we first reconstruct haplotypes from unphased genotypes, then cluster and threshold the inferred haplotypes into risk and non-risk groups with a two-component binomial-mixture model. In this method, the parameters were estimated using a modified Expectation-Maximization algorithm, in which the maximisation step was replaced by posterior sampling of the component parameters. We also elucidated the relationships between risk and non-risk haplotypes under different modes of inheritance and genotypic relative risk.
In the second method, we fitted a three-component mixture model to genotype data directly, followed by an odds-ratio thresholding.
In the third method, we combined the existing haplotype reconstruction software PHASE and permutation method to infer risk haplotypes.
In the fourth method, we proposed a new way to score the genotypes by clustering and combined it with a logistic regression approach to infer risk haplotypes.
The simulation studies showed that the first three methods outperformed the multiple testing method of Zhu (2010) in terms of average specificity and sensitivity (AVSS) in all scenarios considered. The logistic regression methods also outperformed the standard logistic regression method.
We applied our methods to two GWAS datasets on coronary artery disease (CAD) and hypertension (HT), detecting several new risk haplotypes and recovering a number of disease-associated genetic variants previously reported in the literature.
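The two-component binomial-mixture idea behind the first method can be sketched as follows. This is a plain EM fit, not the thesis's modified algorithm (which replaces the maximisation step with posterior sampling). It assumes, purely for illustration, that each haplotype is summarised by a pair (k, n): k copies observed in cases out of n copies in total. All names, starting values and data are hypothetical.

```python
import math

def binom_pmf(k, n, p):
    """Binomial probability of k successes in n trials."""
    return math.comb(n, k) * p ** k * (1 - p) ** (n - k)

def clamp(p, eps=1e-6):
    """Keep a probability strictly inside (0, 1) to avoid degenerate components."""
    return min(max(p, eps), 1 - eps)

def fit_two_binomial_mixture(counts, iters=200):
    """Standard EM for a two-component binomial mixture.

    counts: list of (k, n) pairs.
    Returns (weight_high, p_low, p_high, responsibilities for the high component).
    """
    weight, p_low, p_high = 0.5, 0.4, 0.6   # arbitrary starting values
    resp = []
    for _ in range(iters):
        # E-step: posterior probability that each item comes from the high component.
        resp = []
        for k, n in counts:
            hi = weight * binom_pmf(k, n, p_high)
            lo = (1 - weight) * binom_pmf(k, n, p_low)
            resp.append(hi / (hi + lo))
        # M-step: responsibility-weighted maximum-likelihood updates.
        weight = clamp(sum(resp) / len(counts))
        hi_k = sum(r * k for r, (k, n) in zip(resp, counts))
        hi_n = sum(r * n for r, (k, n) in zip(resp, counts))
        lo_k = sum((1 - r) * k for r, (k, n) in zip(resp, counts))
        lo_n = sum((1 - r) * n for r, (k, n) in zip(resp, counts))
        p_high = clamp(hi_k / max(hi_n, 1e-12))
        p_low = clamp(lo_k / max(lo_n, 1e-12))
    return weight, p_low, p_high, resp

# Hypothetical data: five haplotypes evenly split between cases and controls,
# one haplotype strongly enriched in cases.
data = [(50, 100), (48, 100), (52, 100), (51, 100), (80, 100), (49, 100)]
weight, p_low, p_high, resp = fit_two_binomial_mixture(data)
risk_flags = [r > 0.5 for r in resp]   # threshold into risk / non-risk groups
print(p_low, p_high, risk_flags)
```

The 0.5 cut-off on the responsibilities is only one possible thresholding rule; the abstract does not specify the rule used in the thesis.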
Methodological Issues in Multistage Genome-Wide Association Studies
Because of the high cost of commercial genotyping chip technologies, many
investigations have used a two-stage design for genome-wide association
studies, using part of the sample for an initial discovery of "promising"
SNPs at a less stringent significance level and the remainder in a joint
analysis of just these SNPs using custom genotyping. Typical cost savings of
about 50% are possible with this design to obtain comparable levels of overall
type I error and power by using about half the sample for stage I and carrying
about 0.1% of SNPs forward to the second stage, the optimal design depending
primarily upon the ratio of costs per genotype for stages I and II. However,
with the rapidly declining costs of the commercial panels, the generally low
observed ORs of current studies, and many studies aiming to test multiple
hypotheses and multiple endpoints, many investigators are abandoning the
two-stage design in favor of simply genotyping all available subjects using a
standard high-density panel. Concern is sometimes raised about the absence of a
"replication" panel in this approach, as required by some high-profile
journals, but it must be appreciated that the two-stage design is not a
discovery/replication design but simply a more efficient design for discovery
using a joint analysis of the data from both stages. Once a subset of
highly-significant associations has been discovered, a truly independent
"exact replication" study is needed in a similar population of the same
promising SNPs using similar methods.
Comment: Published at http://dx.doi.org/10.1214/09-STS288 in Statistical
Science (http://www.imstat.org/sts/) by the Institute of Mathematical
Statistics (http://www.imstat.org)
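The cost arithmetic behind the "about 50%" figure can be sketched as below. The sketch counts genotyping cost only, relative to typing every subject on the full panel, and treats the stage II to stage I per-genotype cost ratio as a free input. The function name and the cost ratios tried are hypothetical, and power and type I error are not modelled.

```python
def two_stage_relative_cost(stage1_fraction, carry_fraction, cost_ratio):
    """Genotyping cost of a two-stage design relative to a one-stage design.

    stage1_fraction: share of subjects typed on the full panel in stage I
    carry_fraction:  share of SNPs carried forward to stage II
    cost_ratio:      per-genotype cost in stage II divided by that in stage I
    """
    stage1 = stage1_fraction
    stage2 = (1.0 - stage1_fraction) * carry_fraction * cost_ratio
    return stage1 + stage2

# Half the sample in stage I and 0.1% of SNPs carried forward, as in the abstract.
for ratio in (1, 20, 100):   # assumed stage II / stage I per-genotype cost ratios
    print(ratio, two_stage_relative_cost(0.5, 0.001, ratio))
```

Because so few SNPs reach stage II, the total stays close to half the one-stage cost unless custom genotyping is far more expensive per genotype, which is why the optimal design depends mainly on that ratio.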
Particle algorithms for optimization on binary spaces
We discuss a unified approach to stochastic optimization of pseudo-Boolean
objective functions based on particle methods, including the cross-entropy
method and simulated annealing as special cases. We point out the need for
auxiliary sampling distributions, that is parametric families on binary spaces,
which are able to reproduce complex dependency structures, and illustrate their
usefulness in our numerical experiments. We provide numerical evidence that
particle-driven optimization algorithms based on parametric families yield
superior results on strongly multi-modal optimization problems while local
search heuristics outperform them on easier problems.
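For orientation, the cross-entropy method on a binary space can be sketched with the simplest possible sampling family, a product of independent Bernoulli distributions. The paper argues that richer parametric families able to capture dependencies are needed on hard problems; this sketch deliberately does not implement those. The toy objective, function names and tuning values are all hypothetical.

```python
import random

def cross_entropy_maximize(objective, n_bits, n_samples=200, elite_frac=0.1,
                           n_iters=30, smoothing=0.7, seed=0):
    """Cross-entropy method with an independent-Bernoulli sampling family."""
    rng = random.Random(seed)
    probs = [0.5] * n_bits                      # start from the uniform distribution
    n_elite = max(1, int(elite_frac * n_samples))
    best_x, best_val = None, float("-inf")
    for _ in range(n_iters):
        # Sample candidate bit vectors from the current product distribution.
        samples = [[1 if rng.random() < p else 0 for p in probs]
                   for _ in range(n_samples)]
        samples.sort(key=objective, reverse=True)
        elite = samples[:n_elite]
        top_val = objective(elite[0])
        if top_val > best_val:
            best_x, best_val = elite[0], top_val
        # Refit each Bernoulli parameter to the elite set, with smoothing
        # to slow down premature convergence.
        for j in range(n_bits):
            freq = sum(x[j] for x in elite) / n_elite
            probs[j] = smoothing * freq + (1 - smoothing) * probs[j]
    return best_x, best_val

def toy_objective(x):
    """Toy pseudo-Boolean function: reward ones, penalise adjacent pairs of ones."""
    return sum(x) - 2 * sum(a * b for a, b in zip(x, x[1:]))

best_x, best_val = cross_entropy_maximize(toy_objective, n_bits=12)
print(best_x, best_val)
```

For 12 bits the maximum of this toy objective is 6, attained by any pattern with six ones and no two of them adjacent.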