3,925 research outputs found
Optimal Clustering under Uncertainty
Classical clustering algorithms typically either lack an underlying
probability framework to make them predictive or focus on parameter estimation
rather than defining and minimizing a notion of error. Recent work addresses
these issues by developing a probabilistic framework based on the theory of
random labeled point processes and characterizing a Bayes clusterer that
minimizes the number of misclustered points. The Bayes clusterer is analogous
to the Bayes classifier. Whereas determining a Bayes classifier requires full
knowledge of the feature-label distribution, deriving a Bayes clusterer
requires full knowledge of the point process. When uncertain of the point
process, one would like to find a robust clusterer that is optimal over the
uncertainty, just as one may find optimal robust classifiers with uncertain
feature-label distributions. Herein, we derive an optimal robust clusterer by
first finding an effective random point process that incorporates all
randomness within its own probabilistic structure; the Bayes clusterer derived
from this effective process is optimal relative to the uncertainty. This is
analogous to the use of effective class-conditional
distributions in robust classification. After evaluating the performance of
robust clusterers on synthetic Gaussian mixture models, we apply the
framework to granular imaging, where we make use of the asymptotic
granulometric moment theory for granular images to relate robust clustering
theory to the application.
Comment: 19 pages, 5 eps figures, 1 table
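Below is a minimal 1-D sketch of the idea, under toy assumptions that are not the paper's derivation: uncertainty about the point process is reduced to a discrete prior over cluster-mean pairs, the effective process is obtained by marginalizing that prior, and the Bayes clusterer is found by brute-force search over all labelings, which is feasible only for tiny samples.

```python
import itertools

import numpy as np
from scipy.stats import norm

# Toy data: six 1-D points drawn from two Gaussian clusters.
rng = np.random.default_rng(0)
x = np.concatenate([rng.normal(-2, 1, 3), rng.normal(2, 1, 3)])
n = len(x)

# Uncertainty about the point process: a discrete prior over mean pairs
# (an illustrative stand-in for a general prior over model parameters).
mean_pairs = [(-2.0, 2.0), (-1.0, 1.0), (-3.0, 3.0)]
prior_theta = np.array([0.5, 0.25, 0.25])

def loglik(labels, means):
    return sum(norm.logpdf(x[i], means[labels[i]], 1.0) for i in range(n))

# Posterior over labelings under the effective (marginalized) process,
# assuming a uniform prior over labelings.
labelings = list(itertools.product([0, 1], repeat=n))
post = np.array([
    sum(p * np.exp(loglik(lab, m)) for m, p in zip(mean_pairs, prior_theta))
    for lab in labelings
])
post /= post.sum()

def miscluster(a, b):
    # Misclustered-point count, invariant to swapping the two labels.
    d = sum(ai != bi for ai, bi in zip(a, b))
    return min(d, n - d)

# Bayes clusterer: the labeling minimizing the expected number of
# misclustered points under the effective process.
errors = [sum(p * miscluster(c, lab) for lab, p in zip(labelings, post))
          for c in labelings]
print("robust Bayes clustering:", labelings[int(np.argmin(errors))])
```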
Improving randomness characterization through Bayesian model selection
Nowadays, random number generation plays an essential role in technology, with
important applications in areas ranging from cryptography, which lies at the
core of current communication protocols, to Monte Carlo methods and other
probabilistic algorithms. In this context, a crucial scientific endeavour is to
develop effective methods that allow the characterization of random number
generators. However, commonly employed methods either lack formality (e.g. the
NIST test suite) or are inapplicable in principle (e.g. the characterization
derived from the Algorithmic Theory of Information (ATI)). In this letter we
present a novel method based on Bayesian model selection, which is both
rigorous and effective, for characterizing randomness in a bit sequence. We
derive analytic expressions for a model's likelihood, which is then used to
compute its posterior probability distribution. Our method proves to be more
rigorous than NIST's suite and the Borel-Normality criterion, and its
implementation is straightforward. We have applied our method to an
experimental device based on the process of spontaneous parametric
downconversion, implemented in our laboratory, to confirm that it behaves as a
genuine quantum random number generator (QRNG). As our approach relies on
Bayesian inference, which entails model generalizability, our scheme transcends
individual sequence analysis, leading to a characterization of the source of
the random sequences itself.
Comment: 25 pages
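As a concrete, hedged illustration of the approach (not the letter's exact model set), the sketch below compares two analytic models of a bit sequence: M0, a fair i.i.d. source, and M1, a Bernoulli source with unknown bias under a Beta(1, 1) prior. Both marginal likelihoods have closed forms, so the Bayes factor and the posterior probability of fairness follow directly.

```python
import numpy as np
from scipy.special import betaln

def log_evidence_fair(bits):
    # p(x | M0) = 2^(-n) for a fair i.i.d. source.
    return len(bits) * np.log(0.5)

def log_evidence_biased(bits):
    # Marginal likelihood under p ~ Beta(1, 1):
    # integral of p^k (1 - p)^(n - k) dp = B(k + 1, n - k + 1).
    n, k = len(bits), int(np.sum(bits))
    return betaln(k + 1, n - k + 1)

bits = np.random.default_rng(1).integers(0, 2, 1000)  # candidate RNG output
log_bf = log_evidence_fair(bits) - log_evidence_biased(bits)
p_fair = 1.0 / (1.0 + np.exp(-log_bf))  # posterior under equal prior odds
print(f"log Bayes factor (fair vs. biased): {log_bf:.2f}, P(fair) = {p_fair:.3f}")
```

A full characterization would compare a richer family of models (e.g. Markov sources of increasing order) in the same way, selecting among them by posterior probability.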
Event detection in location-based social networks
With the advent of social networks and the rise of mobile technologies, users have become ubiquitous sensors capable of monitoring various real-world events in a crowd-sourced manner. Location-based social networks have proven to be faster than traditional media channels in reporting and geo-locating breaking news; Osama Bin Laden's death, for instance, was first confirmed on Twitter even before the announcement from the communication department at the White House. However, the deluge of user-generated data on these networks requires intelligent systems capable of identifying and characterizing such events in a comprehensive manner. The data mining community coined the term event detection to refer to the task of uncovering emerging patterns in data streams. Nonetheless, most data mining techniques do not reproduce the underlying data generation process, hampering their ability to self-adapt in fast-changing scenarios. Because of this, we propose a probabilistic machine learning approach to event detection which explicitly models the data generation process and enables reasoning about the discovered events. With the aim of setting forth the differences between both approaches, we present two techniques for the problem of event detection in Twitter: a data mining technique called Tweet-SCAN and a machine learning technique called Warble. We assess and compare both techniques on a dataset of tweets geo-located in the city of Barcelona during its annual festivities. Last but not least, we present the algorithmic changes and data processing frameworks needed to scale the proposed techniques up to big data workloads.
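The sketch below conveys the flavor of density-based event detection in the spirit of Tweet-SCAN; it is a simplification, since the actual technique also incorporates textual similarity, and every threshold here is an illustrative assumption.

```python
import numpy as np
from sklearn.cluster import DBSCAN

def detect_events(lats, lons, timestamps, eps=0.5, min_samples=10,
                  time_scale=3600.0):
    """Cluster geo-tagged posts jointly in space and time.

    Rescaling time so that `time_scale` seconds count as one spatial
    unit lets a single radius `eps` act as a spatio-temporal
    neighborhood; dense neighborhoods become candidate events.
    """
    feats = np.column_stack([lats, lons,
                             np.asarray(timestamps) / time_scale])
    labels = DBSCAN(eps=eps, min_samples=min_samples).fit_predict(feats)
    return labels  # -1 marks noise; other labels index candidate events
```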
Merging history of three bimodal clusters
We present a combined X-ray and optical analysis of three bimodal galaxy
clusters selected as merging candidates at z ~ 0.1. These targets are part of
MUSIC (MUlti-Wavelength Sample of Interacting Clusters), which is a general
project designed to study the physics of merging clusters by means of
multi-wavelength observations. Observations include spectro-imaging with
XMM-Newton EPIC camera, multi-object spectroscopy (260 new redshifts), and
wide-field imaging at the ESO 3.6m and 2.2m telescopes. We build a global
picture of these clusters using X-ray luminosity and temperature maps together
with galaxy density and velocity distributions. Idealized numerical simulations
were used to constrain the merging scenario for each system. We show that A2933
is very likely an equal-mass advanced pre-merger ~ 200 Myr before the core
collapse, while A2440 and A2384 are post-merger systems ~ 450 Myr and ~1.5 Gyr
after core collapse, respectively. In the case of A2384, we detect a
spectacular filament of galaxies and gas spreading over more than 1 h^{-1} Mpc,
which we infer to have been stripped during the previous collision. The
analysis of the MUSIC sample allows us to outline some general properties of
merging clusters: a strong luminosity segregation of galaxies in recent
post-mergers; the existence of preferential axes (corresponding to the merging
directions) along which the BCGs and structures on various scales are aligned;
the concomitance, in most major merger cases, of secondary merging or accretion
events, with groups infalling onto the main cluster, and in some cases the
evidence of previous merging episodes in one of the main components. These
results are in good agreement with the hierarchical scenario of structure
formation, in which clusters are expected to form by successive merging events,
and matter is accreted along large-scale filaments.
Speeding Up MCMC by Delayed Acceptance and Data Subsampling
The complexity of the Metropolis-Hastings (MH) algorithm arises from the
requirement of a likelihood evaluation for the full data set in each iteration.
Payne and Mallick (2015) propose to speed up the algorithm by a delayed
acceptance approach where the acceptance decision proceeds in two stages. In
the first stage, an estimate of the likelihood based on a random subsample
determines if it is likely that the draw will be accepted and, if so, the
second stage uses the full data likelihood to decide upon final acceptance.
Evaluating the full data likelihood is thus avoided for draws that are unlikely
to be accepted. We propose a more precise likelihood estimator which
incorporates auxiliary information about the full data likelihood while only
operating on a sparse set of the data. We prove that the resulting delayed
acceptance MH is more efficient compared to that of Payne and Mallick (2015).
The caveat of this approach is that the full data set needs to be evaluated in
the second stage. We therefore propose to substitute this evaluation by an
estimate and construct a state-dependent approximation thereof to use in the
first stage. This results in an algorithm that (i) can use a smaller subsample
m by leveraging recent advances in Pseudo-Marginal MH (PMMH) and (ii) is
provably within a bounded distance of the true posterior.
Comment: Accepted for publication in Journal of Computational and Graphical
Statistics
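A generic two-stage sketch of delayed-acceptance MH, in the style of Payne and Mallick (2015), is given below. It is illustrative rather than the paper's algorithm: it omits the refined estimator and the pseudo-marginal machinery, assumes a symmetric proposal, and assumes the stage-one estimator is a deterministic function of the parameter (e.g. built on a fixed subsample), which the two-stage scheme needs in order to preserve the target.

```python
import numpy as np

def da_mh(theta0, propose, loglik_full, loglik_sub, log_prior, n_iter,
          rng=None):
    rng = rng or np.random.default_rng()
    theta = theta0
    lsub = loglik_sub(theta) + log_prior(theta)    # cheap estimate
    lfull = loglik_full(theta) + log_prior(theta)  # expensive, full data
    chain = []
    for _ in range(n_iter):
        prop = propose(theta, rng)
        lsub_p = loglik_sub(prop) + log_prior(prop)
        # Stage 1: screen the proposal with the subsample-based estimate.
        log_a1 = lsub_p - lsub
        if np.log(rng.uniform()) < log_a1:
            # Stage 2: full-data correction, dividing out the stage-1
            # ratio so the chain still targets the full posterior.
            lfull_p = loglik_full(prop) + log_prior(prop)
            if np.log(rng.uniform()) < (lfull_p - lfull) - log_a1:
                theta, lsub, lfull = prop, lsub_p, lfull_p
        chain.append(theta)
    return chain
```

Draws rejected in stage one never touch the full data set, which is where the speed-up comes from; the paper's contribution is to sharpen the stage-one estimate and to replace the stage-two full-data evaluation with an estimate as well.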