SEMIPARAMETRIC MIXTURES FOR BACKGROUND DENSITY ESTIMATION IN PARTICLE PHYSICS
This thesis develops a method to estimate the density of events in particle physics experiments through a semiparametric mixture of a known parametric “signal” density and an unknown nonparametric “background” density. The method relies on an assumption of local smoothness of the background around the signal. The nonparametric component is estimated with a local orthogonal polynomial expansion (LOrPE), whose overall level of smoothness is selected through a local version of least squares cross-validation. The estimate of the background is constructed iteratively through weighting of the original signal and background sample. The mixing proportion is chosen via maximum penalized local likelihood, where the penalization term represents the local complexity. This term is obtained with a novel estimator of the effective degrees of freedom that relies on rejection sampling to localize the variability of the data around the region of interest. Simulation studies show how the procedure operates, both in its local version and in the global one, which is also presented.
Power-law distributions in binned empirical data
Many man-made and natural phenomena, including the intensity of earthquakes,
population of cities and size of international wars, are believed to follow
power-law distributions. The accurate identification of power-law patterns has
significant consequences for correctly understanding and modeling complex
systems. However, statistical evidence for or against the power-law hypothesis
is complicated by large fluctuations in the empirical distribution's tail, and
these are worsened when information is lost from binning the data. We adapt the
statistically principled framework for testing the power-law hypothesis,
developed by Clauset, Shalizi and Newman, to the case of binned data. This
approach includes maximum-likelihood fitting, a hypothesis test based on the
Kolmogorov--Smirnov goodness-of-fit statistic and likelihood ratio tests for
comparing against alternative explanations. We evaluate the effectiveness of
these methods on synthetic binned data with known structure, quantify the loss
of statistical power due to binning, and apply the methods to twelve real-world
binned data sets with heavy-tailed patterns.
Comment: Published at http://dx.doi.org/10.1214/13-AOAS710 in the Annals of
Applied Statistics (http://www.imstat.org/aoas/) by the Institute of
Mathematical Statistics (http://www.imstat.org
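As an illustration of the maximum-likelihood fitting step mentioned in this abstract, here is a minimal sketch of fitting a power-law exponent to binned counts. It assumes a continuous power law p(x) ∝ x^(-alpha) above the first bin edge, so that the probability of a bin [lo, hi) is (lo^(1-alpha) - hi^(1-alpha)) / b_min^(1-alpha). The function names, the search bracket, and the use of a golden-section search are choices made for this example, not details taken from the paper, and the search assumes a single maximum inside the bracket.

```python
import math


def binned_powerlaw_loglik(alpha, edges, counts):
    """Log-likelihood of bin counts under p(x) ∝ x^(-alpha) for x >= edges[0].

    `edges` holds len(counts) + 1 increasing bin boundaries; the last one
    may be math.inf for an open-ended final bin.
    """
    if alpha <= 1.0:
        raise ValueError("alpha must exceed 1 for a normalizable power law")
    b_min = edges[0]
    total = 0.0
    for lo, hi, h in zip(edges[:-1], edges[1:], counts):
        # Bin probability: (lo^(1-alpha) - hi^(1-alpha)) / b_min^(1-alpha)
        upper = 0.0 if math.isinf(hi) else (hi / b_min) ** (1.0 - alpha)
        p = (lo / b_min) ** (1.0 - alpha) - upper
        if h > 0:
            total += h * math.log(p)
    return total


def fit_alpha(edges, counts, lo=1.01, hi=6.0, tol=1e-8):
    """Golden-section search for the alpha maximizing the binned likelihood.

    The bracket [lo, hi] is an assumption of this sketch.
    """
    inv_phi = (math.sqrt(5.0) - 1.0) / 2.0
    a, b = lo, hi
    c = b - inv_phi * (b - a)
    d = a + inv_phi * (b - a)
    fc = binned_powerlaw_loglik(c, edges, counts)
    fd = binned_powerlaw_loglik(d, edges, counts)
    while b - a > tol:
        if fc > fd:
            # The maximum lies in [a, d]; reuse c as the new upper probe.
            b, d, fd = d, c, fc
            c = b - inv_phi * (b - a)
            fc = binned_powerlaw_loglik(c, edges, counts)
        else:
            # The maximum lies in [c, b]; reuse d as the new lower probe.
            a, c, fc = c, d, fd
            d = a + inv_phi * (b - a)
            fd = binned_powerlaw_loglik(d, edges, counts)
    return 0.5 * (a + b)
```

For example, `fit_alpha([1, 2, 4, 8, 16, math.inf], counts)` returns the exponent estimate for counts recorded in logarithmic bins. The goodness-of-fit and likelihood ratio tests described in the abstract are not covered by this sketch.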
Line transect abundance estimation with uncertain detection on the trackline
Bibliography: leaves 225-233.
Following a critical review of developments in line transect estimation theory to date, general likelihood functions are derived for the case in which detection probabilities are modelled as functions of any number of explanatory variables and detection of animals on the trackline (i.e. directly in the observer's path) is not certain. Existing models are shown to correspond to special cases of the general models. Maximum likelihood estimators are derived for some special cases of the general model, and some existing line transect estimators are shown to correspond to maximum likelihood estimators for other special cases. The likelihoods are shown to be extensions of existing mark-recapture likelihoods as well as generalizations of existing line transect likelihoods. Two new abundance estimators are developed. The first is a Horvitz-Thompson-like estimator, which utilizes the fact that, for point estimation of abundance, the density of perpendicular distances in the population can be treated as known in appropriately designed line transect surveys. The second is based on modelling the probability density function of detection probabilities in the population. Existing line transect estimators are shown to correspond to special cases of the new Horvitz-Thompson-like estimator, so that this estimator, together with the general likelihoods, provides a unifying framework for estimating abundance from line transect surveys.
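The abstract does not give the form of the thesis's estimator, so the following is only a sketch of the generic Horvitz-Thompson idea it builds on: each detected animal is weighted by the inverse of its estimated detection probability, and the weights are summed to estimate abundance in the surveyed region. The function name and the input format are assumptions made for this example.

```python
def horvitz_thompson_abundance(detection_probs):
    """Generic Horvitz-Thompson abundance estimate for a surveyed region.

    `detection_probs` holds one estimated detection probability per
    detected animal (hypothetical input; in practice these would come
    from a fitted detection function).
    """
    probs = list(detection_probs)
    if any(not (0.0 < p <= 1.0) for p in probs):
        raise ValueError("detection probabilities must lie in (0, 1]")
    # Each detection stands in for 1/p animals in the population.
    return sum(1.0 / p for p in probs)
```

An animal detected with probability 0.25 therefore contributes four animals to the estimate, while one detected with certainty contributes exactly one.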
Estimation of univariate Gaussian mixtures for huge raw datasets by using binned datasets
The conference was cancelled, but the proceedings were published. National audience.
The popularity of unsupervised learning is amplified by the steady growth of sample sizes, which offers the opportunity to reveal information previously out of reach. However, the volume of data raises issues of prohibitive computation times, high energy consumption, and the need for substantial computational resources. Resorting to binned data on an adaptive grid is expected to address these green computing issues without harming the quality of estimation. A first attempt is conducted in the context of univariate Gaussian mixtures, including a numerical illustration and some theoretical advances.
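To make the idea of working from binned data concrete, here is a minimal sketch of the log-likelihood of bin counts under a univariate Gaussian mixture, where each bin's probability is the mixture mass falling between its edges. It uses a fixed grid supplied by the caller; the adaptive grid and the estimation procedure of the paper are not described in the abstract and are not reproduced here. All function and parameter names are assumptions made for this example, and the multinomial constant that does not depend on the parameters is omitted.

```python
import math


def normal_cdf(x, mu, sigma):
    """Cumulative distribution function of N(mu, sigma^2)."""
    return 0.5 * (1.0 + math.erf((x - mu) / (sigma * math.sqrt(2.0))))


def binned_mixture_loglik(edges, counts, weights, means, sigmas):
    """Log-likelihood of bin counts under a Gaussian mixture.

    `edges` holds len(counts) + 1 increasing boundaries and may start
    with -math.inf and end with math.inf.
    """
    total = 0.0
    for lo, hi, n in zip(edges[:-1], edges[1:], counts):
        # Mixture probability mass falling inside the bin [lo, hi).
        p = sum(
            w * (normal_cdf(hi, m, s) - normal_cdf(lo, m, s))
            for w, m, s in zip(weights, means, sigmas)
        )
        if n > 0:
            total += n * math.log(p)
    return total
```

The cost of one evaluation scales with the number of bins rather than the number of raw observations, which is the source of the computational saving the abstract refers to.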
Efficient Computation of Log-likelihood Function in Clustering Overdispersed Count Data
In this work, we present an overdispersed count data clustering algorithm, which uses the mesh method for computing the log-likelihood function of the multinomial Dirichlet, multinomial generalized Dirichlet, and multinomial Beta-Liouville distributions. Count data arise in many areas, such as information retrieval, data mining, and computer vision. The multinomial Dirichlet distribution (MDD) is one of the most widely used models for multi-categorical count data with overdispersion. Recent works have proposed the mesh algorithm, which approximates the MDD log-likelihood function using Bernoulli polynomials, as a replacement for the traditional numerical computation of the log-likelihood function, which either results in instability or leads to run times so long that its use is infeasible for large-scale data. We therefore extend the mesh algorithm approach to compute the log-likelihood function of more flexible distributions, namely the multinomial generalized Dirichlet (MGD) and the multinomial Beta-Liouville (MBL). A finite mixture model based on these distributions is optimized by expectation maximization and aims to achieve high accuracy for count data clustering. Through a set of experiments, the proposed approach shows its merits in two real-world clustering problems that concern natural scene categorization and facial expression recognition.
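For reference, the quantity being computed in the MDD case can be written down directly. The sketch below evaluates the multinomial Dirichlet log-likelihood of a single count vector with log-gamma functions. This is a plain direct evaluation, not the mesh algorithm or the Bernoulli polynomial approximation discussed in the abstract; the function name is an assumption made for this example.

```python
import math


def dirichlet_multinomial_loglik(counts, alphas):
    """Log-probability of one count vector under a multinomial Dirichlet.

    log P(x | alpha) = lgamma(n + 1) - sum_k lgamma(x_k + 1)
                       + lgamma(A) - lgamma(n + A)
                       + sum_k [lgamma(x_k + alpha_k) - lgamma(alpha_k)]
    with n = sum_k x_k and A = sum_k alpha_k.
    """
    if len(counts) != len(alphas):
        raise ValueError("counts and alphas must have the same length")
    if any(a <= 0.0 for a in alphas):
        raise ValueError("all alpha parameters must be positive")
    n = sum(counts)
    a_sum = sum(alphas)
    # Multinomial coefficient term.
    loglik = math.lgamma(n + 1) - sum(math.lgamma(x + 1) for x in counts)
    # Ratio of Dirichlet normalizing constants.
    loglik += math.lgamma(a_sum) - math.lgamma(n + a_sum)
    loglik += sum(
        math.lgamma(x + a) - math.lgamma(a) for x, a in zip(counts, alphas)
    )
    return loglik
```

With two categories and both parameters equal to 1, the distribution is uniform over the n + 1 possible splits, so `dirichlet_multinomial_loglik([1, 1], [1.0, 1.0])` equals log(1/3).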
Computational statistics using the Bayesian Inference Engine
This paper introduces the Bayesian Inference Engine (BIE), a general
parallel, optimised software package for parameter inference and model
selection. This package is motivated by the analysis needs of modern
astronomical surveys and the need to organise and reuse expensive derived data.
The BIE is the first platform for computational statistics designed explicitly
to enable Bayesian update and model comparison for astronomical problems.
Bayesian update is based on the representation of high-dimensional posterior
distributions using metric-ball-tree based kernel density estimation. Among its
algorithmic offerings, the BIE emphasises hybrid tempered MCMC schemes that
robustly sample multimodal posterior distributions in high-dimensional
parameter spaces. Moreover, the BIE implements a full persistence or
serialisation system that stores the full byte-level image of the running
inference and previously characterised posterior distributions for later use.
Two new algorithms to compute the marginal likelihood from the posterior
distribution, developed for and implemented in the BIE, enable model comparison
for complex models and data sets. Finally, the BIE was designed to be a
collaborative platform for applying Bayesian methodology to astronomy. It
includes an extensible, object-oriented framework that
implements every aspect of Bayesian inference. By providing a variety of
statistical algorithms for all phases of the inference problem, a scientist may
explore a variety of approaches with a single model and data implementation.
Additional technical details and download details are available from
http://www.astro.umass.edu/bie. The BIE is distributed under the GNU GPL.
Comment: Resubmitted version.