Sequential Design for Ranking Response Surfaces
We propose and analyze sequential design methods for the problem of ranking
several response surfaces. Namely, given multiple response surfaces over a
continuous input space, the aim is to efficiently find the index of
the minimal response across the entire input space. The response surfaces are not
known and have to be noisily sampled one-at-a-time. This setting is motivated
by stochastic control applications and requires joint experimental design both
in space and response-index dimensions. To generate sequential design
heuristics we investigate stepwise uncertainty reduction approaches, as well as
sampling based on posterior classification complexity. We also make connections
between our continuous-input formulation and the discrete framework of pure
regret in multi-armed bandits. To model the response surfaces we utilize
kriging surrogates. Several numerical examples using both synthetic data and an
epidemics control problem are provided to illustrate our approach and the
efficacy of respective adaptive designs.
Comment: 26 pages, 7 figures (updated several sections and figures
Fast Identification of Biological Pathways Associated with a Quantitative Trait Using Group Lasso with Overlaps
Where causal SNPs (single nucleotide polymorphisms) tend to accumulate within
biological pathways, the incorporation of prior pathways information into a
statistical model is expected to increase the power to detect true associations
in a genetic association study. Most existing pathways-based methods rely on
marginal SNP statistics and do not fully exploit the dependence patterns among
SNPs within pathways. We use a sparse regression model, with SNPs grouped into
pathways, to identify causal pathways associated with a quantitative trait.
Notable features of our "pathways group lasso with adaptive weights" (P-GLAW)
algorithm include the incorporation of all pathways in a single regression
model, an adaptive pathway weighting procedure that accounts for factors
biasing pathway selection, and the use of a bootstrap sampling procedure for
the ranking of important pathways. P-GLAW takes account of the presence of
overlapping pathways and uses a novel combination of techniques to optimise
model estimation, making it fast to run, even on whole genome datasets. In a
comparison study with an alternative pathways method based on univariate SNP
statistics, our method demonstrates high sensitivity and specificity for the
detection of important pathways, showing the greatest relative gains in
performance where marginal SNP effect sizes are small.
Comment: 29 page
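The abstract above does not spell out how P-GLAW is optimised. As a generic illustration only, the sketch below shows block soft-thresholding, a standard building block of group-lasso solvers that shrinks a whole group of coefficients towards zero at once. The function name `group_soft_threshold` and the `threshold` parameter (which could stand for a penalty multiplied by a per-pathway weight) are hypothetical and not taken from the paper.

```python
import math


def group_soft_threshold(values, threshold):
    """Shrink one coefficient group by `threshold` in Euclidean norm.

    Hypothetical helper for illustration: returns all zeros when the
    group norm is at most `threshold`, otherwise rescales the group so
    that its norm is reduced by exactly `threshold`.
    """
    norm = math.sqrt(sum(v * v for v in values))
    if norm <= threshold:
        # The whole group (e.g. one pathway) is dropped from the model.
        return [0.0] * len(values)
    scale = 1.0 - threshold / norm
    return [scale * v for v in values]


if __name__ == "__main__":
    print(group_soft_threshold([3.0, 4.0], 1.0))  # group kept, shrunk
    print(group_soft_threshold([0.3, 0.4], 1.0))  # group set to zero
```

Because the shrinkage acts on the group norm, either every coefficient in a group survives or none does, which is what makes selection operate at the pathway level rather than at the level of individual SNPs.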
How to Host a Data Competition: Statistical Advice for Design and Analysis of a Data Competition
Data competitions rely on real-time leaderboards to rank competitor entries
and stimulate algorithm improvement. While such competitions have become quite
popular and prevalent, particularly in supervised learning formats, their
implementations by the host are highly variable. Without careful planning, a
supervised learning competition is vulnerable to overfitting, where the winning
solutions are so closely tuned to the particular set of provided data that they
cannot generalize to the underlying problem of interest to the host. This paper
outlines some important considerations for strategically designing relevant and
informative data sets to maximize the learning outcome from hosting a
competition based on our experience. It also describes a post-competition
analysis that enables robust and efficient assessment of the strengths and
weaknesses of solutions from different competitors, as well as greater
understanding of the regions of the input space that are well-solved. The
post-competition analysis, which complements the leaderboard, uses exploratory
data analysis and generalized linear models (GLMs). The GLMs not only expand
the range of results we can explore, they also provide more detailed analysis
of individual sub-questions including similarities and differences between
algorithms across different types of scenarios, universally easy or hard
regions of the input space, and different learning objectives. When coupled
with a strategically planned data generation approach, the methods provide
richer and more informative summaries to enhance the interpretation of results
beyond just the rankings on the leaderboard. The methods are illustrated with a
recently completed competition to evaluate algorithms capable of detecting,
identifying, and locating radioactive materials in an urban environment.
Comment: 36 page
Bayesian analysis of ranking data with the constrained Extended Plackett-Luce model
Multistage ranking models, including the popular Plackett-Luce distribution
(PL), rely on the assumption that the ranking process is performed
sequentially, by assigning the positions from the top to the bottom one
(forward order). A recent contribution to the ranking literature relaxed this
assumption with the addition of the discrete-valued reference order parameter,
yielding the novel Extended Plackett-Luce model (EPL). Inference on the EPL and
its generalization into a finite mixture framework was originally addressed
from the frequentist perspective. In this work, we propose the Bayesian
estimation of the EPL with order constraints on the reference order parameter.
The proposed restrictions reflect a meaningful rank assignment process. By
combining the restrictions with the data augmentation strategy and the
conjugacy of the Gamma prior distribution with the EPL, we facilitate the
construction of a tuned joint Metropolis-Hastings algorithm within Gibbs
sampling to simulate from the posterior distribution. The Bayesian approach
allows us to address inference on the additional discrete-valued parameter more
efficiently, and to assess its estimation uncertainty. The
usefulness of the proposal is illustrated with applications to simulated and
real datasets.
Comment: 20 pages, 4 figures, 4 tables. arXiv admin note: substantial text
overlap with arXiv:1803.0288
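To make the role of the reference order concrete, here is a minimal sketch of how the probability of one ranking could be computed under the stagewise scheme described in the abstract above. It assumes positive support parameters and a reference order given as the sequence of ranks filled at each stage; the function name `epl_probability` and the 0-based indexing are illustrative choices, not the authors' code. The forward order recovers the ordinary Plackett-Luce probability.

```python
from itertools import permutations


def epl_probability(ordering, support, reference_order=None):
    """Probability of a ranking under a stagewise (Plackett-Luce type) model.

    ordering[r]        : item placed at rank r (0-based, rank 0 = top)
    support[i]         : positive support parameter of item i
    reference_order[t] : rank filled at stage t; defaults to the forward
                         order 0, 1, ..., K-1 (ordinary Plackett-Luce)
    All names here are hypothetical and chosen for illustration.
    """
    k = len(ordering)
    if reference_order is None:
        reference_order = list(range(k))
    # Items listed in the order in which they are selected across stages.
    stage_items = [ordering[rank] for rank in reference_order]
    prob = 1.0
    for t, item in enumerate(stage_items):
        # At each stage, choose among the items not yet assigned a rank.
        remaining = sum(support[i] for i in stage_items[t:])
        prob *= support[item] / remaining
    return prob


if __name__ == "__main__":
    support = [0.5, 0.3, 0.2]
    print(epl_probability((0, 1, 2), support))             # forward order
    print(epl_probability((0, 1, 2), support, [2, 1, 0]))  # backward order
```

Under any fixed reference order the probabilities of all K! rankings sum to one, so changing the reference order changes how probability mass is distributed over rankings, not the total.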