2,615 research outputs found
Optimal Clustering under Uncertainty
Classical clustering algorithms typically either lack an underlying
probability framework to make them predictive or focus on parameter estimation
rather than defining and minimizing a notion of error. Recent work addresses
these issues by developing a probabilistic framework based on the theory of
random labeled point processes and characterizing a Bayes clusterer that
minimizes the number of misclustered points. The Bayes clusterer is analogous
to the Bayes classifier. Whereas determining a Bayes classifier requires full
knowledge of the feature-label distribution, deriving a Bayes clusterer
requires full knowledge of the point process. When uncertain of the point
process, one would like to find a robust clusterer that is optimal over the
uncertainty, just as one may find optimal robust classifiers with uncertain
feature-label distributions. Herein, we derive an optimal robust clusterer by
first finding an effective random point process that incorporates all
randomness within its own probabilistic structure and from which a Bayes
clusterer can be derived that provides an optimal robust clusterer relative to
the uncertainty. This is analogous to the use of effective class-conditional
distributions in robust classification. After evaluating the performance of
robust clusterers in synthetic mixtures of Gaussians models, we apply the
framework to granular imaging, where we make use of the asymptotic
granulometric moment theory for granular images to relate robust clustering
theory to the application.
Comment: 19 pages, 5 eps figures, 1 table
Generalized Species Sampling Priors with Latent Beta reinforcements
Many popular Bayesian nonparametric priors can be characterized in terms of
exchangeable species sampling sequences. However, in some applications,
exchangeability may not be appropriate. We introduce a novel and
probabilistically coherent family of non-exchangeable species sampling
sequences characterized by a tractable predictive probability function with
weights driven by a sequence of independent Beta random variables. We compare
their theoretical clustering properties with those of the Dirichlet Process and
the two-parameter Poisson-Dirichlet process. The proposed construction
provides a complete characterization of the joint process, unlike
existing work. We then propose using such a process as a prior distribution in
a hierarchical Bayes modeling framework, and we describe a Markov Chain Monte
Carlo sampler for posterior inference. We evaluate the performance of the prior
and the robustness of the resulting inference in a simulation study, providing
a comparison with popular Dirichlet Process mixtures and Hidden Markov
Models. Finally, we develop an application to the detection of chromosomal
aberrations in breast cancer by leveraging array CGH data.
Comment: Corresponding authors: Edoardo M. Airoldi, Federico Bassetti,
Michele Guindani, and Fabrizio Leisen. To appear in the Journal of the American
Statistical Association
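The abstract above describes a predictive probability function whose weights are driven by independent Beta random variables, but it does not state the formula. The sketch below is therefore only an illustration under an assumed rule: each observation j carries a latent value W_j ~ Beta(a, b), the next draw copies the species of observation j with weight (1 - W_j) times the product of all later W_i, and it starts a new species with weight equal to the product of all W_i. The function names and the parameters a, b, and seed are hypothetical.

```python
import random

def predictive_weights(w):
    """Weights for the next draw given latent values w[0..m-1] in (0, 1).

    Assumed rule (not taken from the abstract):
    reuse[j] = (1 - w[j]) * prod(w[i] for i > j), new = prod(w).
    The weights telescope, so they always sum to 1.
    """
    reuse = [0.0] * len(w)
    tail = 1.0  # running product of w[i] over i > j
    for j in range(len(w) - 1, -1, -1):
        reuse[j] = (1.0 - w[j]) * tail
        tail *= w[j]
    return reuse, tail

def sample_sequence(n, a=1.0, b=1.0, seed=0):
    """Draw n species labels; the weights depend on position, not cluster size."""
    rng = random.Random(seed)
    labels = [0]  # the first observation always starts species 0
    w = []
    for m in range(1, n):
        w.append(rng.betavariate(a, b))  # latent Beta value for observation m
        reuse, new = predictive_weights(w)
        pick = rng.choices(range(m + 1), weights=reuse + [new])[0]
        # pick < m copies an earlier observation; pick == m opens a new species
        labels.append(labels[pick] if pick < m else max(labels) + 1)
    return labels

reuse, new = predictive_weights([0.5, 0.8, 0.9])
labels = sample_sequence(50, a=2.0, b=1.0)
```

Because each weight is attached to a position in the sequence rather than to a cluster size, permuting the observations changes the predictive probabilities. This is the sense in which a sequence of this kind is non-exchangeable, in contrast to the Dirichlet Process urn scheme.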
Studies in Astronomical Time Series Analysis. VI. Bayesian Block Representations
This paper addresses the problem of detecting and characterizing local
variability in time series and other forms of sequential data. The goal is to
identify and characterize statistically significant variations, at the same
time suppressing the inevitable corrupting observational errors. We present a
simple nonparametric modeling technique and an algorithm implementing it - an
improved and generalized version of Bayesian Blocks (Scargle 1998) - that finds
the optimal segmentation of the data in the observation interval. The structure
of the algorithm allows it to be used in either a real-time trigger mode, or a
retrospective mode. Maximum likelihood or marginal posterior functions to
measure model fitness are presented for events, binned counts, and measurements
at arbitrary times with known error distributions. Problems addressed include
those connected with data gaps, variable exposure, extension to piecewise
linear and piecewise exponential representations, multi-variate time series
data, analysis of variance, data on the circle, other data modes, and dispersed
data. Simulations provide evidence that the detection efficiency for weak
signals is close to a theoretical asymptotic limit derived by Arias-Castro,
Donoho and Huo (2003). In the spirit of Reproducible Research (Donoho et al.
2008) all of the code and data necessary to reproduce all of the figures in
this paper are included as auxiliary material.
Comment: Added some missing script files and updated other ancillary data
(code and data files). To be submitted to the Astrophysical Journal
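The optimal segmentation described in the abstract above can be illustrated with a small dynamic program for event data. This is a minimal sketch, not the authors' released code. It assumes distinct event times, a block fitness of N(log N - log T) for a block with N events and width T, and a constant per-block penalty `ncp_prior` whose default value here is an arbitrary choice.

```python
import math

def bayesian_blocks_events(times, ncp_prior=4.0):
    """Return block edges for event times (needs >= 2 distinct times)."""
    t = sorted(times)
    n = len(t)
    # Cell edges: first event, midpoints between neighbours, last event.
    edges = [t[0]] + [0.5 * (a + b) for a, b in zip(t, t[1:])] + [t[-1]]
    best = [0.0] * n  # best[k]: optimal total fitness for cells 0..k
    last = [0] * n    # last[k]: first cell of the final block in that optimum
    for k in range(n):
        best_fit, best_start = -math.inf, 0
        for r in range(k + 1):
            # Candidate final block covering cells r..k.
            count = k + 1 - r
            width = edges[k + 1] - edges[r]
            fit = count * (math.log(count) - math.log(width)) - ncp_prior
            if r > 0:
                fit += best[r - 1]  # add the optimum for the cells before r
            if fit > best_fit:
                best_fit, best_start = fit, r
        best[k], last[k] = best_fit, best_start
    # Backtrack through the stored block starts to recover the change points.
    change_points = []
    ind = n
    while True:
        change_points.append(ind)
        if ind == 0:
            break
        ind = last[ind - 1]
    return [edges[i] for i in reversed(change_points)]

# Synthetic example: the event rate rises tenfold at t = 1.
t = [i / 20 for i in range(20)] + [1 + i / 200 for i in range(200)]
edges = bayesian_blocks_events(t)
```

The double loop makes this O(N^2) in the number of events. Because the optimum for the first k cells is reused when cell k + 1 arrives, the same recursion can run incrementally, which matches the real-time trigger mode mentioned in the abstract.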
Multiscale change-point segmentation: beyond step functions.
Modern multiscale-type segmentation methods are known to detect multiple change-points with high statistical accuracy, while allowing for fast computation. The underpinning (minimax) estimation theory has been developed mainly for models that assume the signal is a piecewise constant function. In this paper, for a large collection of multiscale segmentation methods (including various existing procedures), such theory will be extended to certain function classes beyond step functions in a nonparametric regression setting. On the one hand, this extends the interpretation of such methods; on the other hand, it reveals these methods to be robust to deviations from piecewise constant functions. Our main finding is the adaptation over nonlinear approximation classes for a universal thresholding, which includes bounded variation functions and (piecewise) Hölder functions of smoothness order 0 < α ≤ 1 as special cases. From this we derive statistical guarantees on feature detection in terms of jumps and modes. Another key finding is that these multiscale segmentation methods perform nearly (up to a log-factor) as well as the oracle piecewise constant segmentation estimator (with known jump locations), and the best piecewise constant approximants of the (unknown) true signal. Theoretical findings are examined by various numerical simulations.
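The oracle benchmark mentioned in the abstract above, a piecewise constant estimator with known jump locations, can be sketched in a few lines. This is an illustrative reading that assumes the oracle fits the sample mean on each segment. The function name and the convention that `jumps` holds the start indices of new segments are hypothetical.

```python
def oracle_piecewise_constant(y, jumps):
    """Fit segment means to y, splitting at the known jump indices."""
    bounds = [0] + sorted(jumps) + [len(y)]
    fit = []
    for lo, hi in zip(bounds, bounds[1:]):
        segment = y[lo:hi]
        if segment:  # skip empty segments from duplicate or boundary jumps
            mean = sum(segment) / len(segment)
            fit.extend([mean] * len(segment))
    return fit

# Noisy step signal with a single jump at index 3.
y = [1.0, 1.2, 0.8, 3.0, 3.2, 2.8]
fit = oracle_piecewise_constant(y, jumps=[3])
```

A data-driven segmentation method has to estimate the jump locations as well, so its risk can only be compared with this oracle up to a factor, such as the logarithmic one the abstract reports.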