Statistical Tests for Detecting Differential RNA-Transcript Expression from Read Counts
As a fruit of the current revolution in sequencing technology, transcriptomes can now be analyzed at an unprecedented level of detail. These advances have been exploited for detecting differentially expressed genes across biological samples and for quantifying the abundances of the various RNA transcripts within one gene. However, explicit strategies for detecting hidden differential abundances of RNA transcripts in biological samples have not been defined. In this work, we present two novel statistical tests to address this issue: a 'gene structure sensitive' Poisson test for detecting differential expression when the transcript structure of the gene is known, and a kernel-based test, Maximum Mean Discrepancy (MMD), when it is unknown. We analyzed the proposed approaches on simulated read data for two artificial samples as well as on real reads generated by the Illumina Genome Analyzer for two _C. elegans_ samples. Our analysis shows that the Poisson test identifies genes with differential transcript expression considerably better than previously proposed RNA transcript quantification approaches for this task. The MMD test is able to detect a large fraction (75%) of such differential cases without knowledge of the annotated transcripts. It is therefore well suited to analyzing RNA-Seq experiments when genome annotations are incomplete or unavailable, where other approaches fail.
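The MMD test compares the two samples' read distributions directly, without an annotation. A minimal sketch of the underlying statistic (not the authors' implementation; the Gaussian kernel and bandwidth are illustrative assumptions) is the biased estimate of squared MMD between two sets of read positions:

```python
import numpy as np

def mmd2(x, y, sigma=1.0):
    """Biased estimate of squared Maximum Mean Discrepancy
    between samples x and y, using a Gaussian kernel of width sigma."""
    def k(a, b):
        d = a[:, None] - b[None, :]
        return np.exp(-d**2 / (2 * sigma**2))
    return k(x, x).mean() + k(y, y).mean() - 2 * k(x, y).mean()

rng = np.random.default_rng(0)
# Two samples from the same distribution vs. two clearly different ones.
same = mmd2(rng.normal(0, 1, 200), rng.normal(0, 1, 200))
diff = mmd2(rng.normal(0, 1, 200), rng.normal(3, 1, 200))
```

In practice the statistic would be compared against a permutation-based null to decide whether the transcript mix differs between the samples.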
Functional Sequential Treatment Allocation
Consider a setting in which a policy maker assigns subjects to treatments,
observing each outcome before the next subject arrives. Initially, it is
unknown which treatment is best, but the sequential nature of the problem
permits learning about the effectiveness of the treatments. While the
multi-armed-bandit literature has shed much light on the situation when the
policy maker compares the effectiveness of the treatments through their mean,
much less is known about other targets. This is restrictive, because a cautious
decision maker may prefer to target a robust location measure such as a
quantile or a trimmed mean. Furthermore, socio-economic decision making often
requires targeting purpose-specific characteristics of the outcome
distribution, such as its inherent degree of inequality, welfare or poverty. In
the present paper we introduce and study sequential learning algorithms when
the distributional characteristic of interest is a general functional of the
outcome distribution. Minimax expected regret optimality results are obtained
within the subclass of explore-then-commit policies, and for the unrestricted
class of all policies.
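An explore-then-commit policy of the kind studied here is easy to sketch: sample every treatment a fixed number of times, evaluate the target functional on each pilot sample, and commit the remaining budget to the empirical winner. The sketch below is an illustration only (the function and the two Gaussian arms are invented for the example), with the median plugged in as a robust location measure:

```python
import numpy as np

def explore_then_commit(arms, n_explore, n_total, functional, rng):
    """Pilot each arm n_explore times, then commit the remaining
    n_total - n_explore * len(arms) draws to the arm whose pilot
    sample maximizes the target functional."""
    pilot = {i: [arm(rng) for _ in range(n_explore)]
             for i, arm in enumerate(arms)}
    best = max(pilot, key=lambda i: functional(pilot[i]))
    rewards = [v for sample in pilot.values() for v in sample]
    rewards += [arms[best](rng)
                for _ in range(n_total - n_explore * len(arms))]
    return best, rewards

rng = np.random.default_rng(1)
arms = [lambda r: r.normal(0.0, 1.0),   # arm 0
        lambda r: r.normal(0.5, 1.0)]   # arm 1 (better median)
best, rewards = explore_then_commit(
    arms, n_explore=200, n_total=1000,
    functional=lambda s: float(np.median(s)), rng=rng)
```

Any functional of the outcome distribution (a quantile, a trimmed mean, an inequality or poverty measure) can be substituted for the median without changing the structure of the policy.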
Peak Detection as Multiple Testing
This paper considers the problem of detecting equal-shaped non-overlapping
unimodal peaks in the presence of Gaussian ergodic stationary noise, where the
number, location and heights of the peaks are unknown. A multiple testing
approach is proposed in which, after kernel smoothing, the presence of a peak
is tested at each observed local maximum. The procedure provides strong control
of the family wise error rate and the false discovery rate asymptotically as
both the signal-to-noise ratio (SNR) and the search space get large, where the
search space may grow exponentially as a function of SNR. Simulations assuming
a Gaussian peak shape and a Gaussian autocorrelation function show that desired
error levels are achieved for relatively low SNR and are robust to partial peak
overlap. Simulations also show that detection power is maximized when the
smoothing bandwidth is close to the bandwidth of the signal peaks, akin to the
well-known matched filter theorem in signal processing. The procedure is
illustrated in an analysis of electrical recordings of neuronal cell activity.
Comment: 37 pages, 8 figures
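The smoothing-then-testing pipeline can be sketched as follows. This is illustrative only: the paper derives height thresholds with asymptotic FWER/FDR control, whereas the sketch uses a fixed z cutoff and assumes unit-variance noise on a zero baseline:

```python
import numpy as np

def detect_peaks(y, bandwidth, z=3.0):
    """Smooth y with a Gaussian kernel, locate local maxima of the
    smoothed series, and keep those exceeding z times the standard
    deviation of smoothed unit-variance noise."""
    t = np.arange(-3 * bandwidth, 3 * bandwidth + 1, dtype=float)
    k = np.exp(-t**2 / (2 * bandwidth**2))
    k /= k.sum()
    s = np.convolve(y, k, mode="same")
    noise_sd = np.sqrt(np.sum(k**2))  # sd of smoothed N(0,1) noise
    ix = np.where((s[1:-1] > s[:-2]) & (s[1:-1] > s[2:]))[0] + 1
    return ix[s[ix] > z * noise_sd]

rng = np.random.default_rng(2)
x = np.arange(1000, dtype=float)
# Two Gaussian-shaped peaks at 300 and 700, buried in N(0,1) noise.
signal = 4 * np.exp(-(x - 300)**2 / 50) + 4 * np.exp(-(x - 700)**2 / 50)
peaks = detect_peaks(signal + rng.normal(size=1000), bandwidth=5)
```

Consistent with the matched-filter observation in the abstract, the smoothing bandwidth here is chosen close to the bandwidth of the signal peaks.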
Discovering Valuable Items from Massive Data
Suppose there is a large collection of items, each with an associated cost
and an inherent utility that is revealed only once we commit to selecting it.
Given a budget on the cumulative cost of the selected items, how can we pick a
subset of maximal value? This task generalizes several important problems such
as multi-armed bandits, active search and the knapsack problem. We present an
algorithm, GP-Select, which utilizes prior knowledge about similarity between
items, expressed as a kernel function. GP-Select uses Gaussian process
prediction to balance exploration (estimating the unknown value of items) and
exploitation (selecting items of high value). We extend GP-Select to be able to
discover sets that simultaneously have high utility and are diverse. Our
preference for diversity can be specified as an arbitrary monotone submodular
function that quantifies the diminishing returns obtained when selecting
similar items. Furthermore, we exploit the structure of the model updates to
achieve an order of magnitude (up to 40X) speedup in our experiments without
resorting to approximations. We provide strong guarantees on the performance of
GP-Select and apply it to three real-world case studies of industrial
relevance: (1) Refreshing a repository of prices in a Global Distribution
System for the travel industry, (2) Identifying diverse, binding-affine
peptides in a vaccine design task and (3) Maximizing clicks in a web-scale
recommender system by recommending items to users.
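A minimal sketch of this exploration/exploitation loop follows. It is not the GP-Select implementation: the RBF kernel, the length scale, the UCB weight beta, and the jitter term are all illustrative assumptions, and the efficient model updates mentioned above are omitted:

```python
import numpy as np

def gp_select(X, costs, budget, oracle, beta=2.0, ls=0.2, jitter=1e-2):
    """Budgeted greedy selection guided by a GP posterior: at each step,
    pick the affordable unselected item with the highest upper confidence
    bound mu + beta * sd, then observe its true value via oracle."""
    def kern(A, B):
        d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
        return np.exp(-d2 / (2 * ls**2))
    K = kern(X, X)
    n = len(X)
    chosen, vals, spent = [], [], 0.0
    while True:
        if chosen:
            Kss = K[np.ix_(chosen, chosen)] + jitter * np.eye(len(chosen))
            Ks = K[:, chosen]
            mu = Ks @ np.linalg.solve(Kss, np.array(vals))
            var = np.clip(np.diag(K) - np.einsum(
                "ij,ji->i", Ks, np.linalg.solve(Kss, Ks.T)), 0.0, None)
        else:  # no observations yet: prior mean 0, prior variance diag(K)
            mu, var = np.zeros(n), np.diag(K).copy()
        ucb = mu + beta * np.sqrt(var)
        cand = [i for i in range(n)
                if i not in chosen and spent + costs[i] <= budget]
        if not cand:
            return chosen, float(sum(vals))
        i = max(cand, key=lambda j: ucb[j])
        chosen.append(i)
        vals.append(oracle(i))
        spent += costs[i]

X = np.linspace(0.0, 1.0, 20)[:, None]          # item features
utility = np.exp(-(X[:, 0] - 0.6) ** 2 / 0.05)  # hidden item values
chosen, total = gp_select(X, costs=np.ones(20), budget=5.0,
                          oracle=lambda i: utility[i])
```

The diversity extension described in the abstract would replace the plain UCB score with a submodular gain, but the selection loop keeps the same shape.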
High-dimensional estimation with geometric constraints
Consider measuring an n-dimensional vector x through the inner product with
several measurement vectors, a_1, a_2, ..., a_m. It is common in both signal
processing and statistics to assume the linear response model y_i = <a_i, x> +
e_i, where e_i is a noise term. However, in practice the precise relationship
between the signal x and the observations y_i may not follow the linear model,
and in some cases it may not even be known. To address this challenge, in this
paper we propose a general model where it is only assumed that each observation
y_i may depend on a_i only through <a_i, x>. We do not assume that the
dependence is known. This is a form of the semiparametric single index model,
and it includes the linear model as well as many forms of the generalized
linear model as special cases. We further assume that the signal x has some
structure, and we formulate this as a general assumption that x belongs to some
known (but arbitrary) feasible set K. We carefully detail the benefit of using
the signal structure to improve estimation. The theory is based on the mean
width of K, a geometric parameter which can be used to understand its effective
dimension in estimation problems. We determine a simple, efficient two-step
procedure for estimating the signal based on this model -- a linear estimation
followed by metric projection onto K. We give general conditions under which
the estimator is minimax optimal up to a constant. This leads to the intriguing
conclusion that in the high noise regime, an unknown non-linearity in the
observations does not significantly reduce one's ability to determine the
signal, even when the non-linearity may be non-invertible. Our results may be
specialized to understand the effect of non-linearities in compressed sensing.
Comment: This version incorporates minor revisions suggested by referees
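The two-step procedure can be sketched for the special case K = {s-sparse unit vectors}, where metric projection amounts to hard thresholding. The sign link below stands in for the unknown (here even non-invertible) non-linearity and is an illustrative choice, not part of the paper's setup:

```python
import numpy as np

rng = np.random.default_rng(3)
n, m, s = 500, 2000, 5
x = np.zeros(n)
x[:s] = 1.0 / np.sqrt(s)        # unit-norm s-sparse signal, support {0..4}
A = rng.normal(size=(m, n))     # Gaussian measurement vectors a_i
y = np.sign(A @ x)              # observed through an unknown link (here: sign)

# Step 1: linear estimation, x_lin = (1/m) * sum_i y_i a_i
x_lin = A.T @ y / m
# Step 2: metric projection onto K = hard thresholding to the s largest entries
top = np.argsort(np.abs(x_lin))[-s:]
x_hat = np.zeros(n)
x_hat[top] = x_lin[top]
x_hat /= np.linalg.norm(x_hat)  # rescale: the non-linearity shrinks the norm

err = float(np.linalg.norm(x_hat - x))
```

Even though each y_i retains only one bit of <a_i, x>, the linear estimate aligns with x in expectation, and the projection onto K recovers the support; this is the sense in which the non-linearity does not prevent estimation of the signal.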