Identification of SNP interactions using logic regression
Interactions of single nucleotide polymorphisms (SNPs) are assumed to be responsible for complex diseases such as sporadic breast cancer. Important goals of studies concerned with such genetic data are thus to identify combinations of SNPs that lead to a higher risk of developing a disease and to measure the importance of these interactions. Many approaches based on classification methods such as CART and Random Forests allow the importance of single variables to be measured, but none of them can directly quantify the importance of combinations of variables. In this paper, we show how logic regression can be employed to identify SNP interactions explanatory for the disease status in a case-control study and propose two measures for quantifying the importance of these interactions for classification. These approaches are then applied both to simulated data sets and to the SNP data of the GENICA study, a study dedicated to the identification of genetic and gene-environment interactions associated with sporadic breast cancer. --Single Nucleotide Polymorphism,Feature Selection,Variable Importance Measure,GENICA
Imputing missing genotypes with weighted k nearest neighbors
Motivation: Missing values are a common problem in genetic association studies concerned with single nucleotide polymorphisms (SNPs). Since most statistical methods cannot handle missing values, observations containing them have to be removed prior to the actual analysis. Considering only complete observations, however, often leads to an immense loss of information. Therefore, procedures are needed that can be used to replace such missing values. In this article, we propose a method based on weighted k nearest neighbors that can be employed for imputing such missing genotypes. Results: In a comparison to other imputation approaches, our procedure called KNNcatImpute shows the lowest rates of falsely imputed genotypes when applied to the SNP data from the GENICA study, a study dedicated to the identification of genetic and gene-environment interactions associated with sporadic breast cancer. Moreover, in contrast to other imputation methods that take all variables into account when replacing missing values of a particular variable, KNNcatImpute is not restricted to association studies comprising several tens to a few hundred SNPs, but can also be applied to data from whole-genome studies, as an application to a subset of the HapMap data shows.
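The idea of imputing a genotype from weighted nearest neighbors can be illustrated with a minimal sketch. This is not the KNNcatImpute implementation. The function names, the genotype coding 1/2/3, the mismatch-fraction distance, and the inverse-distance weights are all assumptions made for illustration.

```python
from collections import defaultdict


def mismatch_distance(a, b, skip):
    """Fraction of differing genotypes over SNPs observed in both rows.

    The SNP at index `skip` (the one being imputed) is ignored.
    Returns None when the two rows share no observed SNP.
    """
    shared = [j for j in range(len(a))
              if j != skip and a[j] is not None and b[j] is not None]
    if not shared:
        return None
    return sum(a[j] != b[j] for j in shared) / len(shared)


def impute_weighted_knn(data, k=3, eps=1e-6):
    """Replace each None by the weighted majority genotype of the k nearest rows.

    Assumption: weights are 1 / (distance + eps); the original paper may
    use a different distance and weighting scheme.
    """
    imputed = [row[:] for row in data]
    for i, row in enumerate(data):
        for j, value in enumerate(row):
            if value is not None:
                continue
            # Only rows with an observed genotype at SNP j can vote.
            candidates = []
            for m, other in enumerate(data):
                if m == i or other[j] is None:
                    continue
                d = mismatch_distance(row, other, skip=j)
                if d is not None:
                    candidates.append((d, m))
            candidates.sort()
            votes = defaultdict(float)
            for d, m in candidates[:k]:
                votes[data[m][j]] += 1.0 / (d + eps)
            if votes:
                # sorted() makes tie-breaking deterministic.
                imputed[i][j] = max(sorted(votes), key=votes.get)
    return imputed


# Rows are individuals, columns are SNPs; None marks a missing genotype.
example = [
    [1, 1, 2, None],
    [1, 1, 2, 3],
    [1, 1, 2, 3],
    [2, 2, 1, 1],
    [2, 2, 1, 1],
]
completed = impute_weighted_knn(example, k=3)
```

Here the first individual matches the second and third on every observed SNP, so their genotype at the last SNP dominates the weighted vote.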
Minimization of Boolean expressions using matrix algebra
The more variables a logic expression contains, the more complicated is
the interpretation of this expression. Since in a statistical sense prime
implicants can be interpreted as interactions of binary variables, it is
thus advantageous to convert such a logic expression into a disjunctive
normal form consisting of prime implicants.
In this paper, we present two algorithms based on matrix algebra
for the identification of all prime implicants comprised in a logic
expression and for the minimization of this set of prime implicants.
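To make the notion of prime implicants concrete, the following sketch enumerates them for a small Boolean function. It uses the classical Quine-McCluskey merging step, not the matrix-algebra algorithms of the paper, and it covers only the identification step, not the subsequent minimization. The function names and the string encoding of terms are assumptions.

```python
from itertools import combinations


def merge(a, b):
    """Merge two terms that differ in exactly one fixed (0/1) position.

    Terms are strings over '0', '1', '-' where '-' means "variable absent".
    Returns the merged term, or None if the terms cannot be combined.
    """
    diff = [i for i, (x, y) in enumerate(zip(a, b)) if x != y]
    if len(diff) == 1 and "-" not in (a[diff[0]], b[diff[0]]):
        i = diff[0]
        return a[:i] + "-" + a[i + 1:]
    return None


def prime_implicants(minterms, n_vars):
    """Return the set of prime implicants of the function given by its minterms."""
    terms = {format(m, f"0{n_vars}b") for m in minterms}
    primes = set()
    while terms:
        merged, used = set(), set()
        for a, b in combinations(sorted(terms), 2):
            c = merge(a, b)
            if c is not None:
                merged.add(c)
                used.update((a, b))
        # Terms that could not be merged any further are prime.
        primes |= terms - used
        terms = merged
    return primes


# f(x1, x2, x3) = x1 OR (x2 AND x3), with x1 as the leftmost bit.
primes = prime_implicants([3, 4, 5, 6, 7], n_vars=3)
```

For this function the result consists of "1--" (x1 alone) and "-11" (x2 AND x3), which read directly as the two interactions making up the disjunctive normal form.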
A note on the simultaneous computation of thousands of Pearson’s Chi^2-statistics
In genetic association studies, important and common goals are the
identification of single nucleotide polymorphisms (SNPs) showing a
distribution that differs between several groups and the detection of
SNPs with a coherent pattern. In the former situation, tens of thousands
of SNPs should be tested, whereas in the latter case typically
several tens of SNPs are considered, leading to thousands of statistics that
need to be computed.
A test statistic appropriate for both goals is Pearson’s Chi^2-statistic.
However, computing this (or another) statistic for each SNP or pair
of SNPs separately is very time-consuming.
In this article, we show how simple matrix computation can be
employed to calculate the Chi^2-statistic for all SNPs simultaneously.
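One way to see how matrix computation yields all statistics at once is sketched below. It is an illustration, not the authors' formulation. It assumes complete data, genotypes coded 1/2/3, and uses a hand-written product of indicator matrices so that it runs with the standard library alone; with an array library the same products would be single matrix multiplications.

```python
def matmul_t(a, b):
    """Return a^T b for row-major lists of lists a (n x g) and b (n x p)."""
    g, p = len(a[0]), len(b[0])
    out = [[0] * p for _ in range(g)]
    for row_a, row_b in zip(a, b):
        for s, x in enumerate(row_a):
            if x:
                for j, y in enumerate(row_b):
                    out[s][j] += x * y
    return out


def chi2_all_snps(genotypes, groups, levels=(1, 2, 3)):
    """Pearson's Chi^2 statistic of group vs. genotype for every SNP.

    genotypes: n x p list of lists without missing values (assumption).
    groups:    length-n list of group labels.
    """
    n, p = len(genotypes), len(genotypes[0])
    labels = sorted(set(groups))
    # Group indicator matrix (n x g).
    G = [[int(g == lab) for lab in labels] for g in groups]
    group_sizes = [sum(col) for col in zip(*G)]
    stats = [0.0] * p
    for level in levels:
        # Genotype indicator matrix (n x p) for this level.
        X = [[int(v == level) for v in row] for row in genotypes]
        observed = matmul_t(G, X)  # g x p table of counts for all SNPs
        level_totals = [sum(col) for col in zip(*X)]
        for s, n_s in enumerate(group_sizes):
            for j in range(p):
                expected = n_s * level_totals[j] / n
                if expected > 0:  # skip genotype levels absent from a SNP
                    stats[j] += (observed[s][j] - expected) ** 2 / expected
    return stats


# Four individuals, two SNPs: the first SNP separates the groups, the second does not.
genotypes = [[1, 1], [1, 2], [2, 1], [2, 2]]
groups = [0, 0, 1, 1]
stats = chi2_all_snps(genotypes, groups)
```

The key point is that one product of the group indicators with the genotype indicators gives the observed counts for every SNP in a single step, so no per-SNP contingency table has to be built.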
Comparison of the empirical bayes and the significance analysis of microarrays
Microarrays make it possible to measure the expression levels of tens of thousands of genes simultaneously. One important statistical question in such experiments is which of the several thousand genes are differentially expressed. Answering this question requires methods that can deal with multiple testing problems. One such approach is the control of the False Discovery Rate (FDR). Two recently developed methods for the identification of differentially expressed genes and the estimation of the FDR are the SAM (Significance Analysis of Microarrays) procedure and an empirical Bayes approach. In the two group case, both methods are based on a modified version of the standard t-statistic. However, it is also possible to use the Wilcoxon rank sum statistic. While there already exists a version of the empirical Bayes approach based on this rank statistic, we introduce in this paper a new version of SAM based on Wilcoxon rank sums. We furthermore compare these four procedures by applying them to simulated and real gene expression data. --Identification of differentially expressed genes,Gene expression,Multiple Testing,False Discovery Rate
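The rank statistic mentioned above can be computed gene by gene as in the following sketch. It shows only the Wilcoxon rank sum itself, with midranks for ties; the SAM and empirical Bayes procedures built on top of it, including their FDR estimates, are not reproduced here. The function names and the 0/1 group coding are assumptions.

```python
def midranks(values):
    """Ranks starting at 1, with tied values receiving their average rank."""
    order = sorted(range(len(values)), key=values.__getitem__)
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to the end of the current run of tied values.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for t in range(i, j + 1):
            ranks[order[t]] = avg
        i = j + 1
    return ranks


def rank_sums(expr, labels):
    """Wilcoxon rank sum of the group-1 samples for each gene (row of expr)."""
    return [
        sum(r for r, lab in zip(midranks(row), labels) if lab == 1)
        for row in expr
    ]


# Two genes measured on four samples; the first two samples form group 1.
expr = [
    [5.0, 6.0, 1.0, 2.0],
    [1.0, 1.0, 2.0, 3.0],
]
labels = [1, 1, 0, 0]
sums = rank_sums(expr, labels)
```

For the first gene the group-1 samples hold the two largest values and the rank sum reaches its maximum of 7; for the second gene they are tied at the bottom and the rank sum takes its minimum of 3.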
Detecting high-order interactions of single nucleotide polymorphisms using genetic programming
Motivation: Not individual single nucleotide polymorphisms (SNPs), but high-order interactions of SNPs are assumed to be responsible for complex diseases such as cancer. Therefore, one of the major goals of genetic association studies concerned with such genotype data is the identification of these high-order interactions. This search is additionally impeded by the fact that these interactions often are only explanatory for a relatively small subgroup of patients. Most of the feature selection methods proposed in the literature, unfortunately, fail at this task, since they can either only identify individual variables or interactions of a low order, or try to find rules that are explanatory for a high percentage of the observations. In this paper, we present a procedure based on genetic programming and multi-valued logic that enables the identification of high-order interactions of categorical variables such as SNPs. This method called GPAS (Genetic Programming for Association Studies) can be used not only for feature selection, but also for discrimination. Results: In an application to the genotype data from the GENICA study, an association study concerned with sporadic breast cancer, GPAS is able to identify high-order interactions of SNPs leading to a considerably increased breast cancer risk for different subsets of patients that are not found by other feature selection methods. As an application to a subset of the HapMap data shows, GPAS is not restricted to association studies comprising several tens of SNPs, but can also be employed to analyze whole-genome data.