11,133 research outputs found
Finding Statistically Significant Interactions between Continuous Features
The search for higher-order feature interactions that are statistically
significantly associated with a class variable is of high relevance in fields
such as Genetics or Healthcare, but the combinatorial explosion of the
candidate space makes this problem extremely challenging in terms of
computational efficiency and proper correction for multiple testing. While
recent progress has been made regarding this challenge for binary features, we
here present the first solution for continuous features. We propose an
algorithm which overcomes the combinatorial explosion of the search space of
higher-order interactions by deriving a lower bound on the p-value for each
interaction, which enables us to massively prune interactions that can never
reach significance and to thereby gain more statistical power. In our
experiments, our approach efficiently detects all significant interactions in a
variety of synthetic and real-world datasets.
Comment: 13 pages, 5 figures, 2 tables, accepted to the 28th International
Joint Conference on Artificial Intelligence (IJCAI 2019)
Controlling False Positives in Association Rule Mining
Association rule mining is an important problem in the data mining area. It
enumerates and tests a large number of rules on a dataset and outputs rules
that satisfy user-specified constraints. Due to the large number of rules being
tested, rules that do not represent a real systematic effect in the data can
satisfy the given constraints purely by chance. Hence association rule
mining often suffers from a high risk of false positive errors. Comprehensive
studies on controlling false positives in association rule mining are
lacking. In this paper, we adopt three multiple testing correction
approaches---the direct adjustment approach, the permutation-based approach and
the holdout approach---to control false positives in association rule mining,
and conduct extensive experiments to study their performance. Our results show
that (1) numerous spurious rules are generated if no correction is made, and (2)
the three approaches can control false positives effectively. Among the three
approaches, the permutation-based approach has the highest power of detecting
real association rules, but it is very computationally expensive. We employ
several techniques to reduce its cost effectively.
Comment: VLDB201
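To make two of the three corrections named in this abstract concrete, here is a minimal sketch of direct adjustment and permutation-based adjustment. It is not the authors' implementation. The rule form ("item j implies the positive class"), the one-sided Fisher exact test, the use of Bonferroni as the direct adjustment, the min-p permutation scheme, and all function names are assumptions chosen for illustration.

```python
import random
from math import comb


def fisher_upper_p(a, row1, col1, n):
    """One-sided Fisher exact p-value P(X >= a) for a hypergeometric X.

    n: transactions, col1: positive-class transactions,
    row1: transactions containing the item, a: observed overlap.
    """
    total = comb(n, row1)
    tail = sum(comb(col1, k) * comb(n - col1, row1 - k)
               for k in range(a, min(row1, col1) + 1))
    return tail / total


def rule_p_values(features, labels):
    """Raw p-value of every rule 'item j => positive class' (assumed rule form)."""
    n, col1 = len(labels), sum(labels)
    pvals = []
    for j in range(len(features[0])):
        row1 = sum(row[j] for row in features)
        a = sum(row[j] * y for row, y in zip(features, labels))
        pvals.append(fisher_upper_p(a, row1, col1, n))
    return pvals


def bonferroni(pvals):
    """Direct adjustment: multiply each p-value by the number of tests."""
    m = len(pvals)
    return [min(1.0, p * m) for p in pvals]


def permutation_adjusted(features, labels, n_perm=200, seed=0):
    """Permutation adjustment: compare each observed p-value with the
    distribution of the smallest p-value obtained under shuffled labels."""
    rng = random.Random(seed)
    observed = rule_p_values(features, labels)
    shuffled = list(labels)
    min_ps = []
    for _ in range(n_perm):
        rng.shuffle(shuffled)
        min_ps.append(min(rule_p_values(features, shuffled)))
    return [(1 + sum(mp <= p for mp in min_ps)) / (n_perm + 1)
            for p in observed]


# Toy data: item 0 copies the class label, items 1 and 2 are unrelated to it.
labels = [1] * 20 + [0] * 20
features = [[labels[i], i % 2, (i // 2) % 2] for i in range(40)]
raw = rule_p_values(features, labels)
print(bonferroni(raw))
print(permutation_adjusted(features, labels))
```

The permutation loop recomputes every rule's p-value once per shuffle, which illustrates why the abstract calls this approach computationally expensive.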
Principal component gene set enrichment (PCGSE)
Motivation: Although principal component analysis (PCA) is widely used for
the dimensional reduction of biomedical data, interpretation of PCA results
remains daunting. Most existing methods attempt to explain each principal
component (PC) in terms of a small number of variables by generating
approximate PCs with few non-zero loadings. Although useful when just a few
variables dominate the population PCs, these methods are often inadequate for
characterizing the PCs of high-dimensional genomic data. For genomic data,
reproducible and biologically meaningful PC interpretation requires methods
based on the combined signal of functionally related sets of genes. While gene
set testing methods have been widely used in supervised settings to quantify
the association of groups of genes with clinical outcomes, these methods have
seen only limited application for testing the enrichment of gene sets relative
to sample PCs. Results: We describe a novel approach, principal component gene
set enrichment (PCGSE), for computing the statistical association between gene
sets and the PCs of genomic data. The PCGSE method performs a two-stage
competitive gene set test using the correlation between each gene and each PC
as the gene-level test statistic with flexible choice of both the gene set test
statistic and the method used to compute the null distribution of the gene set
statistic. Using simulated data with simulated gene sets and real gene
expression data with curated gene sets, we demonstrate that biologically
meaningful and computationally efficient results can be obtained from a simple
parametric version of the PCGSE method that performs a correlation-adjusted
two-sample t-test between the gene-level test statistics for gene set members
and genes not in the set. Availability:
http://cran.r-project.org/web/packages/PCGSE/index.html Contact:
[email protected] or [email protected]
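The gene-level and set-level steps described in this abstract can be sketched as follows. This is a simplified illustration, not the PCGSE package: it assumes the sample scores of one PC are already computed, applies a Fisher z-transformation to the gene-PC correlations (an assumption), and uses a plain pooled-variance two-sample t-statistic without the correlation adjustment the abstract mentions. The names `pearson` and `gene_set_t` are hypothetical.

```python
from math import atanh, sqrt
from statistics import mean, variance


def pearson(u, v):
    """Pearson correlation between two equal-length sequences."""
    mu, mv = mean(u), mean(v)
    num = sum((a - mu) * (b - mv) for a, b in zip(u, v))
    den = sqrt(sum((a - mu) ** 2 for a in u) * sum((b - mv) ** 2 for b in v))
    return num / den


def gene_set_t(expr, pc_scores, gene_set):
    """Unadjusted two-sample t-statistic comparing gene-level statistics of
    set members against all other genes.

    expr: dict mapping gene name -> expression values across samples
    pc_scores: sample scores of one principal component (assumed precomputed)
    gene_set: collection of gene names; needs at least 2 members and
              at least 2 non-members so both variances are defined
    """
    # Gene-level statistic: Fisher z-transformed correlation with the PC.
    z = {g: atanh(pearson(vals, pc_scores)) for g, vals in expr.items()}
    inside = [z[g] for g in expr if g in gene_set]
    outside = [z[g] for g in expr if g not in gene_set]
    n1, n2 = len(inside), len(outside)
    pooled = ((n1 - 1) * variance(inside)
              + (n2 - 1) * variance(outside)) / (n1 + n2 - 2)
    return (mean(inside) - mean(outside)) / sqrt(pooled * (1 / n1 + 1 / n2))


# Toy data: g1 and g2 track the PC closely, the other genes do not.
pc = [1, 2, 3, 4, 5]
expr = {
    "g1": [1.1, 1.9, 3.2, 3.9, 5.1],
    "g2": [2.0, 4.1, 5.9, 8.2, 9.9],
    "g3": [3, 1, 2, 3, 1],
    "g4": [1, 3, 2, 1, 3],
    "g5": [2, 2, 3, 1, 2],
}
print(gene_set_t(expr, pc, {"g1", "g2"}))
```

A large positive statistic indicates that set members correlate more strongly with the PC than the remaining genes do. Because genes are themselves correlated, the unadjusted statistic in this sketch would tend to overstate significance.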
Confidence Statements for Ordering Quantiles
This work proposes Quor, a simple yet effective nonparametric method to
compare independent samples with respect to corresponding quantiles of their
populations. The method is solely based on the order statistics of the samples,
and independence is its only requirement. All computations are performed using
exact distributions with no need for any asymptotic considerations, and yet can
be run using a fast quadratic-time dynamic programming idea. Computational
performance is essential in high-dimensional domains, such as gene expression
data. We describe the approach and discuss the most important assumptions,
drawing a parallel with assumptions and properties of widely used techniques
for the same problem. Experiments using real data from biomedical studies are
performed to empirically compare Quor and other methods in a classification
task over a selection of high-dimensional data sets.
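The abstract does not spell out the Quor algorithm, so the sketch below shows only the standard distribution-free fact that order-statistic methods of this kind build on, not Quor itself. For a continuous population with p-quantile q, the number of samples below q is binomial, which gives exact confidence levels for bounding q by an order statistic. Fixing the indices i and j in advance, assuming continuous populations, and the function names are all choices made for this illustration.

```python
from math import comb


def binom_cdf(k, n, p):
    """P(Bin(n, p) <= k), computed exactly from the binomial mass function."""
    if k < 0:
        return 0.0
    return sum(comb(n, t) * p ** t * (1 - p) ** (n - t)
               for t in range(min(k, n) + 1))


def upper_bound_confidence(i, n, p):
    """P(q <= X_(i)): fewer than i of the n draws fall below the p-quantile q."""
    return binom_cdf(i - 1, n, p)


def lower_bound_confidence(j, m, p):
    """P(Y_(j) <= q): at least j of the m draws fall at or below q."""
    return 1.0 - binom_cdf(j - 1, m, p)


def ordering_confidence(xs, ys, i, j, p=0.5):
    """Confidence that the p-quantile of X lies below that of Y, using the
    pre-chosen order statistics X_(i) and Y_(j) (1-based indices).

    If x_(i) < y_(j), the events q_X <= X_(i) and Y_(j) <= q_Y together imply
    q_X < q_Y; independence of the samples lets their probabilities multiply.
    Returns 0.0 when the observed order statistics support no such claim.
    """
    xs, ys = sorted(xs), sorted(ys)
    if xs[i - 1] >= ys[j - 1]:
        return 0.0
    return (upper_bound_confidence(i, len(xs), p)
            * lower_bound_confidence(j, len(ys), p))


# Toy samples: every x lies below every y.
xs = list(range(1, 11))
ys = list(range(11, 21))
print(ordering_confidence(xs, ys, i=8, j=3))
```

Searching over all index pairs (i, j) for the best statement is where an efficient algorithm such as the dynamic programming mentioned in the abstract would come in. This sketch makes no attempt at that step.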