Missing $g$-mass: Investigating the Missing Parts of Distributions
Estimating the underlying distribution from i.i.d. samples is a classical and important problem in statistics. When the alphabet size is large compared to the number of samples, a portion of the distribution is highly likely to be unobserved or sparsely observed. The missing mass, defined as the sum of probabilities $p_x$ over the missing letters $x$, and the Good-Turing estimator for missing mass have been important tools in large-alphabet distribution estimation. In this article, given a positive function $g$ from $[0,1]$ to the reals, the missing $g$-mass, defined as the sum of $g(p_x)$ over the missing letters $x$, is introduced and studied. The missing $g$-mass can be used to investigate the structure of the missing part of the distribution. Specific applications for special cases such as the order-$\alpha$ missing mass ($g(p)=p^{\alpha}$) and the missing Shannon entropy ($g(p)=-p\log p$) include estimating the distance from uniformity of the missing distribution and its partial estimation. Minimax estimation is studied for the order-$\alpha$ missing mass for integer values of $\alpha$, and exact minimax convergence rates are obtained. Concentration is studied for a class of functions $g$, and specific results are derived for the order-$\alpha$ missing mass and the missing Shannon entropy. Sub-Gaussian tail bounds with near-optimal worst-case variance factors are derived. Two new notions of concentration, named strongly sub-Gamma and filtered sub-Gaussian concentration, are introduced and shown to result in right tail bounds that are better than those obtained from sub-Gaussian concentration.
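For a concrete point of reference, the sketch below (our own illustration, not code from the article) simulates a large Zipf-type alphabet, evaluates the true missing mass together with two missing $g$-masses (the order-2 missing mass and the missing Shannon entropy), and compares the missing mass with its classical Good-Turing estimate.

    import numpy as np

    rng = np.random.default_rng(0)

    # A large Zipf-like alphabet: p_x proportional to 1/x, far more letters than samples.
    k = 10_000
    p = 1.0 / np.arange(1, k + 1)
    p /= p.sum()

    n = 2_000
    counts = rng.multinomial(n, p)            # observed frequencies
    missing = counts == 0                     # letters never seen in the sample

    # True missing mass, and true missing g-mass for g(p) = p^2 and g(p) = -p log p.
    missing_mass = p[missing].sum()
    missing_2_mass = (p[missing] ** 2).sum()
    missing_entropy = -(p[missing] * np.log(p[missing])).sum()

    # Good-Turing estimator of the missing mass: (# letters seen exactly once) / n.
    good_turing = (counts == 1).sum() / n

    print(f"missing mass {missing_mass:.4f}  Good-Turing {good_turing:.4f}")
    print(f"missing 2-mass {missing_2_mass:.3e}  missing entropy {missing_entropy:.4f}")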
Optimal estimation of high-order missing masses, and the rare-type match problem
Consider a random sample $(X_1,\ldots,X_n)$ from an unknown discrete distribution $P$ on a countable alphabet $\mathbb{S}=\{s_1,s_2,\ldots\}$, with $p_j$ the probability of symbol $s_j$, and let $(X_{n,j})_{j\geq1}$ be the empirical frequencies of the distinct symbols $s_j$ in the sample. We consider the problem of estimating the $r$-order missing mass, a discrete functional of $P$ defined as
$$M_{n,r}(P)=\sum_{j\geq1}p_j^{\,r}\,I(X_{n,j}=0).$$
This is a generalization of the missing mass, whose estimation is a classical problem in statistics and the subject of numerous studies in both theory and methods. First, we introduce a nonparametric estimator of $M_{n,r}(P)$ and a corresponding non-asymptotic confidence interval obtained through concentration properties of $M_{n,r}(P)$. Then, we investigate minimax estimation of $M_{n,r}(P)$, which is the main contribution of our work. We show that minimax estimation is not feasible over the class of all discrete distributions on $\mathbb{S}$, and not even for distributions with regularly varying tails, which only guarantee that our estimator is consistent for $M_{n,r}(P)$. This leads us to introduce the stronger assumption of second-order regular variation for the tail behaviour of $P$, which is proved to be sufficient for minimax estimation of $M_{n,r}(P)$, making the proposed estimator an optimal minimax estimator of $M_{n,r}(P)$. Our interest in the $r$-order missing mass arises from forensic statistics, where the estimation of the $2$-order missing mass appears in connection with the estimation of the likelihood ratio in the rare-type match problem, known as the "fundamental problem of forensic mathematics". We present theoretical guarantees for nonparametric estimation of this likelihood ratio.
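As a quick numerical illustration of the setting (a sketch under our own assumptions; the Good-Turing-style benchmark below is our choice and need not coincide with the paper's estimator), one can simulate the $r$-order missing mass under a regularly varying, Zipf-type tail and check how the benchmark tracks it over repeated samples:

    import numpy as np

    rng = np.random.default_rng(1)

    def r_order_missing_mass(p, counts, r):
        """True M_{n,r} = sum_j p_j^r * 1{symbol j unseen}."""
        return (p[counts == 0] ** r).sum()

    # Regularly varying tail: p_j proportional to j^(-1/alpha), here alpha = 0.6.
    k, alpha = 50_000, 0.6
    p = np.arange(1, k + 1, dtype=float) ** (-1.0 / alpha)
    p /= p.sum()

    n, r, trials = 5_000, 2, 200
    errs = []
    for _ in range(trials):
        counts = rng.multinomial(n, p)
        truth = r_order_missing_mass(p, counts, r)
        # Good-Turing-style benchmark (our own choice, not the paper's estimator):
        # number of symbols seen exactly r times, rescaled by 1 / C(n, r), here r = 2.
        est = (counts == r).sum() / (n * (n - 1) / 2.0)
        errs.append(est - truth)

    errs = np.array(errs)
    print(f"mean error {errs.mean():.2e}, RMSE {np.sqrt((errs ** 2).mean()):.2e}")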
Learning Structured Distributions: Power-Law and Low-Rank
Utilizing the structure of a probabilistic model can significantly increase its compression efficiency and learning speed. We consider these potential improvements under two naturally omnipresent structures.

Power-Law: English words and many other natural phenomena are well known to follow a power-law distribution. Yet this ubiquitous structure has never been shown to help compress or predict these phenomena. It is known that the class of unrestricted distributions over an alphabet of size k and blocks of length n can never be compressed with diminishing per-symbol redundancy when k > n. We show that under power-law structure, in expectation we can compress with diminishing per-symbol redundancy for k growing as large as sub-exponential in n. For learning a power-law distribution, we rigorously explain the efficacy of the absolute-discount estimator using less pessimistic notions. We show that (1) it is adaptive to an effective dimension and (2) it is strongly related to the Good-Turing estimator and inherits its competitive properties.

Low-Rank: We study learning low-rank conditional probability matrices under expected KL-risk. This choice accentuates smoothing, the careful handling of low-probability elements. We define a loss function, determine a sample-complexity bound for its global minimizer, and show that this bound is optimal up to logarithmic terms. We propose an iterative algorithm that extends classical non-negative matrix factorization to naturally incorporate additive smoothing and prove that it converges to the stationary points of our loss function.

Power-Law and Low-Rank: We consider learning distributions in the presence of both low-rank and power-law structures. We study Kneser-Ney smoothing, a successful estimator for N-gram language models, through the lens of competitive distribution estimation. We first establish some competitive properties for the contextual probability estimation problem. This leads to Partial Low Rank, a powerful generalization of Kneser-Ney that we conjecture to have even stronger competitive properties. Empirically, it significantly improves the performance on language modeling, even matching feed-forward neural models, and gives similar gains on the task of predicting attack types for the Global Terrorism Database.
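A rough illustration of the absolute-discount idea discussed above (a minimal sketch under our own assumptions: a single fixed discount and uniform redistribution of the freed mass over unseen symbols, which is not the thesis's analysis):

    import numpy as np

    def absolute_discount(counts, k, d=0.75):
        """Absolute-discount estimate over an alphabet of size k.

        Each observed symbol gives up a discount d; the freed mass is spread
        uniformly over the unseen symbols (assumes at least one unseen symbol).
        """
        counts = np.asarray(counts, dtype=float)
        n = counts.sum()
        seen = counts > 0
        probs = np.zeros(k)
        probs[seen] = (counts[seen] - d) / n
        unseen = (~seen).sum()
        if unseen > 0:
            probs[~seen] = d * seen.sum() / (n * unseen)
        return probs

    # Toy check: Zipf-ish counts over a 20-letter alphabet.
    counts = [9, 5, 3, 2, 1, 1, 1] + [0] * 13
    p_hat = absolute_discount(counts, k=20)
    print(p_hat.round(4), p_hat.sum())   # the estimate sums to 1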
Tilastollisia ja informaatioteoreettisia data-analyysimenetelmiä (Statistical and Information-Theoretic Data Analysis Methods)
In this Thesis, we develop theory and methods for computational data analysis. The problems in data analysis are approached from three perspectives: statistical learning theory, the Bayesian framework, and the information-theoretic minimum description length (MDL) principle. Contributions in statistical learning theory address the possibility of generalization to unseen cases, and regression analysis with partially observed data with an application to mobile device positioning. In the second part of the Thesis, we discuss so-called Bayesian network classifiers, and show that they are closely related to logistic regression models. In the final part, we apply the MDL principle to tracing the history of old manuscripts, and to noise reduction in digital signals. "Data is a representation that has no meaning in itself. When data is processed and given a meaning, it can become information and, ultimately, knowledge." [Wikipedia]. Turning data into information is data analysis. This includes learning from data and drawing conclusions based on it. One of the central disciplines in modern data analysis is computer science, whose role is to develop efficient rules and algorithms that can be executed on a computer. Data analysis also requires expertise from other fields, for example mathematics, statistics, philosophy of science, and many applied disciplines such as engineering and bioinformatics. The data under analysis may be, for instance, measurement results, written text, or images; all of these forms of data appear in this dissertation, whose Finnish title is "Tilastollisia ja informaatioteoreettisia data-analyysimenetelmiä".
The dissertation approaches data analysis problems from three perspectives: statistical learning theory, Bayesian methods, and the information-theoretic minimum description length (MDL) principle. Within statistical learning theory, it addresses the possibility of making inductive (generalizing) inferences about cases that have so far remained entirely unobserved, as well as learning a linear model from only partially observed data. The latter work enables efficient modelling of radio-wave propagation, which in turn facilitates, among other things, the positioning of mobile devices.
The second part of the dissertation establishes a close connection between so-called Bayesian network classifiers and logistic regression. By combining the best aspects of the two, a new family of efficient classification algorithms is derived, through which a balance can be struck between classifier complexity and learning speed. The final part of the dissertation applies the MDL principle to two problems of different kinds. The first is to reconstruct the birth history of a text that survives in several different copies. The material used consists of about 50 different versions of the Latin legend of St. Henry. The resulting "family tree" of the text versions offers interesting information about the medieval history of Finland and the Nordic countries. The second problem concerns improving the quality of digital signals, such as digital photographs, by reducing noise. The ability to make use of an originally low-quality signal is beneficial, for example, in medical imaging applications.
Population size estimation via alternative parametrizations for Poisson mixture models
We exploit a suitable moment-based reparametrization of Poisson mixture distributions to develop classical and Bayesian inference for the unknown size of a finite population in the presence of count data. We put particular emphasis on suitable mappings between ordinary moments and recurrence coefficients, which allow us to implement standard maximization routines and MCMC routines in a more convenient parameter space. We assess the comparative performance of our approach in real data applications and in a simulation study.
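For orientation only, a classical baseline for this population-size problem (a homogeneous zero-truncated Poisson fit with a Horvitz-Thompson-style blow-up, not the paper's moment-based reparametrization of Poisson mixtures) can be sketched as follows:

    import numpy as np
    from scipy.optimize import brentq

    def fit_truncated_poisson(y):
        """MLE of lambda from zero-truncated Poisson counts y (all y >= 1)."""
        ybar = np.mean(y)
        # Zero-truncated Poisson mean: lam / (1 - exp(-lam)) = ybar.
        return brentq(lambda lam: lam / (1.0 - np.exp(-lam)) - ybar, 1e-8, 1e3)

    rng = np.random.default_rng(2)
    N_true, lam_true = 1_000, 0.8
    counts = rng.poisson(lam_true, size=N_true)
    observed = counts[counts > 0]                      # zero-count units are never seen

    lam_hat = fit_truncated_poisson(observed)
    N_hat = len(observed) / (1.0 - np.exp(-lam_hat))   # blow up by the detection probability
    print(f"observed {len(observed)} units; estimated N = {N_hat:.0f} (true {N_true})")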
Polynomial methods in statistical inference: Theory and practice
Recent advances in genetics, computer vision, and text mining are accompanied by the analysis of data coming from a large domain, where the domain size is comparable to or larger than the number of samples. In this dissertation, we apply polynomial methods to several statistical questions with a rich history and wide applications. The goal is to understand the fundamental limits of these problems in the large-domain regime, and to design sample-optimal and time-efficient algorithms with provable guarantees.
The first part investigates the problem of property estimation. Consider the problem of estimating the Shannon entropy of a distribution over $k$ elements from $n$ independent samples. We obtain the minimax mean-square error within universal multiplicative constant factors if $n$ exceeds a constant factor of $k/\log k$; otherwise there exists no consistent estimator. This refines the recent result on the minimal sample size for consistent entropy estimation. The apparatus of best polynomial approximation plays a key role in both the construction of optimal estimators and, via a duality argument, the minimax lower bound.
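For context, the sketch below contrasts the naive plug-in entropy estimate with the classical Miller-Madow bias correction in a regime where $n$ is comparable to $k$; it is a simple baseline of our own choosing, not the polynomial-approximation estimator analyzed in the dissertation.

    import numpy as np

    rng = np.random.default_rng(3)

    k, n = 5_000, 5_000                       # sample size comparable to alphabet size
    p = rng.dirichlet(np.ones(k))             # a random distribution on k elements
    true_H = -(p * np.log(p)).sum()

    counts = rng.multinomial(n, p)
    q = counts[counts > 0] / n
    plugin = -(q * np.log(q)).sum()                                   # naive plug-in
    miller_madow = plugin + (np.count_nonzero(counts) - 1) / (2 * n)  # bias correction

    print(f"true H {true_H:.3f}  plug-in {plugin:.3f}  Miller-Madow {miller_madow:.3f}")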
We also consider the problem of estimating the support size of a discrete distribution whose minimum non-zero mass is at least $1/k$. Under the independent sampling model, we show that the sample complexity, i.e., the minimal sample size to achieve an additive error of $\epsilon k$ with probability at least 0.1, is within universal constant factors of $\frac{k}{\log k}\log^2\frac{1}{\epsilon}$, which improves the state-of-the-art result of $\frac{k}{\log k}\cdot\frac{1}{\epsilon^2}$. A similar characterization of the minimax risk is also obtained. Our procedure is a linear estimator based on the Chebyshev polynomial and its approximation-theoretic properties, which can be evaluated in linear time and attains the sample complexity within constant factors. The superiority of the proposed estimator in terms of accuracy, computational efficiency, and scalability is demonstrated on a variety of synthetic and real datasets.
When the distribution is supported on a discrete set, estimating the support size is also known as the distinct elements problem, where the goal is to estimate the number of distinct colors in an urn containing $k$ balls, based on samples drawn with replacement.
Based on discrete polynomial approximation and interpolation, we propose an estimator with an additive error guarantee that achieves the optimal sample complexity up to poly-logarithmic factors, and in fact within constant factors in most cases. The estimator can be computed in time linear in the sample size. The result also applies to sampling without replacement, provided the sample size is a vanishing fraction of the urn size. One of the key auxiliary results is a sharp bound on the minimum singular value of a real rectangular Vandermonde matrix, which might be of independent interest.
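To make the regime concrete, the snippet below draws samples with replacement from an urn with $k$ equally likely colors and compares the naive count of observed distinct colors with the classical Chao1 lower-bound correction; both are baselines of our own choosing, and the dissertation's Chebyshev-polynomial estimator is not reproduced here.

    import numpy as np

    rng = np.random.default_rng(4)

    k, n = 100_000, 20_000                    # many more colors than samples
    sample = rng.integers(0, k, size=n)       # uniform urn, sampling with replacement
    counts = np.bincount(sample, minlength=k)

    observed = np.count_nonzero(counts)       # naive estimate: distinct colors seen
    f1 = (counts == 1).sum()                  # singletons
    f2 = (counts == 2).sum()                  # doubletons
    # Chao1 correction: S_obs + f1^2 / (2 f2), a classical lower-bound estimator.
    chao1 = observed + f1 * f1 / (2 * f2) if f2 > 0 else observed + f1 * (f1 - 1) / 2

    print(f"true k {k}  observed {observed}  Chao1 {chao1:.0f}")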
The second part studies the problem of learning Gaussian mixtures. The method of moments is one of the most widely used methods in statistics for parameter estimation; it works by solving the system of equations that match the population and estimated moments. However, in practice, and especially for the important case of mixture models, one frequently needs to contend with the difficulties of non-existence or non-uniqueness of statistically meaningful solutions, as well as the high computational cost of solving large polynomial systems. Moreover, theoretical analysis of the method of moments is mainly confined to asymptotic-normality-style results established under strong assumptions.
We consider estimating a $k$-component Gaussian location mixture with a common (possibly unknown) variance parameter. To overcome the aforementioned theoretical and algorithmic hurdles, a crucial step is to denoise the moment estimates by projecting them onto the truncated moment space (via semidefinite programming) before solving the method-of-moments equations. Not only does this regularization ensure existence and uniqueness of solutions, it also yields fast solvers by means of Gauss quadrature. Furthermore, by proving new moment comparison theorems in the Wasserstein distance via polynomial interpolation and majorization techniques, we establish the statistical guarantees and adaptive optimality of the proposed procedure, as well as an oracle inequality in misspecified models. These results can also be viewed as provable algorithms for the generalized method of moments, which involves non-convex optimization and otherwise lacks theoretical guarantees.
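A minimal worked instance of the method of moments for a Gaussian location mixture (two components, equal weights, known unit variance; this toy case is our own simplification and omits the moment-space projection and Gauss quadrature described above):

    import numpy as np

    rng = np.random.default_rng(5)

    # Two-component Gaussian location mixture, equal weights, unit variance.
    mu1, mu2, n = -1.0, 2.0, 100_000
    z = rng.integers(0, 2, size=n)
    x = np.where(z == 1, mu1, mu2) + rng.standard_normal(n)

    # Moment equations: E[X] = (mu1 + mu2)/2 and E[X^2] = (mu1^2 + mu2^2)/2 + 1,
    # so the two centers are m1 -/+ sqrt(E[X^2] - 1 - m1^2).
    m1 = x.mean()
    m2 = (x ** 2).mean()
    gap = np.sqrt(max(m2 - 1.0 - m1 ** 2, 0.0))   # clipped in case of sampling noise
    print(f"estimated centers {m1 - gap:.3f}, {m1 + gap:.3f} (true {mu1}, {mu2})")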
Rare-Event Estimation and Calibration for Large-Scale Stochastic Simulation Models
Stochastic simulation has been widely applied in many domains. More recently, however, the rapid surge of sophisticated problems such as safety evaluation of intelligent systems has posed various challenges to conventional statistical methods. Motivated by these challenges, in this thesis, we develop novel methodologies with theoretical guarantees and numerical applications to tackle them from different perspectives.
In particular, our works can be categorized into two areas: (1) rare-event estimation (Chapters 2 to 5) where we develop approaches to estimating the probabilities of rare events via simulation; (2) model calibration (Chapters 6 and 7) where we aim at calibrating the simulation model so that it is close to reality.
In Chapter 2, we study rare-event simulation for a class of problems where the target hitting sets of interest are defined via modern machine learning tools such as neural networks and random forests. We investigate an importance sampling scheme that integrates the dominating point machinery in large deviations and sequential mixed integer programming to locate the underlying dominating points. We provide efficiency guarantees and numerical demonstration of our approach.
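To illustrate the dominating-point idea in its simplest form (a standard Gaussian toy example of our own, not the chapter's machine-learning-defined hitting sets or mixed-integer search): for the rare event {X > a} with X ~ N(0, 1), the dominating point is a, and shifting the sampling mean there yields an importance sampling estimator with well-controlled relative error.

    import numpy as np
    from scipy.stats import norm

    rng = np.random.default_rng(6)

    a, n = 5.0, 100_000                          # rare event {X > a}, X ~ N(0, 1)
    y = rng.standard_normal(n) + a               # sample from N(a, 1): mean shifted to
                                                 # the dominating point of the event
    weights = norm.pdf(y) / norm.pdf(y, loc=a)   # likelihood ratio dN(0,1)/dN(a,1)
    terms = weights * (y > a)
    est = terms.mean()

    print(f"IS estimate {est:.3e}  exact {norm.sf(a):.3e}")
    print(f"relative error ~ {terms.std() / np.sqrt(n) / est:.2%}")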
In Chapter 3, we propose a new efficiency criterion for importance sampling, which we call probabilistic efficiency. Conventionally, an estimator is regarded as efficient if its relative error is sufficiently controlled. It is widely known that when a rare-event set contains multiple "important regions" encoded by the dominating points, importance sampling needs to account for all of them via mixing to achieve efficiency. We argue that this traditional analysis recipe can suffer from intrinsic looseness when relative error is used as the efficiency criterion, and we propose the new efficiency notion to tighten this gap. In particular, we show that under the standard Gärtner-Ellis large deviations regime, an importance sampling scheme that uses only the most significant dominating points is sufficient to attain this efficiency notion.
In Chapter 4, we consider the estimation of rare-event probabilities using sample proportions output by crude Monte Carlo. Due to the recent surge of sophisticated rare-event problems, efficiency-guaranteed variance reduction may face implementation challenges, which motivates a closer look at naive estimators. In this chapter we construct confidence intervals for the target probability from this naive estimator using various techniques, and then analyze their validity as well as their tightness, quantified respectively by the coverage probability and the relative half-width.
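A small sketch of the naive-estimator setting described above (the toy numbers and the specific interval constructions are our own choices; the chapter considers several techniques that are not reproduced here): with the sample proportion from crude Monte Carlo, a normal-approximation (Wald) interval and the Wilson score interval can be compared by their coverage.

    import numpy as np
    from scipy.stats import norm

    def wald_interval(p_hat, n, level=0.95):
        z = norm.ppf(0.5 + level / 2)
        half = z * np.sqrt(p_hat * (1 - p_hat) / n)
        return p_hat - half, p_hat + half

    def wilson_interval(p_hat, n, level=0.95):
        z = norm.ppf(0.5 + level / 2)
        center = (p_hat + z ** 2 / (2 * n)) / (1 + z ** 2 / n)
        half = z * np.sqrt(p_hat * (1 - p_hat) / n + z ** 2 / (4 * n ** 2)) / (1 + z ** 2 / n)
        return center - half, center + half

    rng = np.random.default_rng(7)
    p_true, n, reps = 1e-3, 10_000, 2_000
    cover_wald = cover_wilson = 0
    for _ in range(reps):
        p_hat = rng.binomial(n, p_true) / n          # crude Monte Carlo proportion
        lo, hi = wald_interval(p_hat, n)
        cover_wald += lo <= p_true <= hi
        lo, hi = wilson_interval(p_hat, n)
        cover_wilson += lo <= p_true <= hi
    print(f"coverage: Wald {cover_wald / reps:.3f}, Wilson {cover_wilson / reps:.3f}")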
In Chapter 5, we propose the use of extreme value analysis, in particular the peak-over-threshold method which is popularly employed for extremal estimation of real datasets, in the simulation setting. More specifically, we view crude Monte Carlo samples as data to fit on a generalized Pareto distribution. We test this idea on several numerical examples. The results show that in the absence of efficient variance reduction schemes, it appears to offer potential benefits to enhance crude Monte Carlo estimates.
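A minimal version of the peak-over-threshold idea (the toy target and the threshold choice are ours): fit a generalized Pareto distribution to the exceedances of crude Monte Carlo samples over a threshold u, then extrapolate the tail probability beyond the range of the data.

    import numpy as np
    from scipy.stats import genpareto, norm

    rng = np.random.default_rng(8)

    x = rng.standard_normal(200_000)          # crude Monte Carlo samples
    u = np.quantile(x, 0.99)                  # threshold: 99th percentile
    exceed = x[x > u] - u

    # Fit a generalized Pareto distribution to the exceedances (location fixed at 0).
    xi, _, beta = genpareto.fit(exceed, floc=0)

    target = 4.5                              # extrapolate P(X > target) beyond the data
    p_u = np.mean(x > u)
    p_tail = p_u * genpareto.sf(target - u, xi, loc=0, scale=beta)
    print(f"POT estimate {p_tail:.2e}  exact {norm.sf(target):.2e}")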
In Chapter 6, we investigate a framework for developing calibration schemes in parametric settings that satisfy rigorous frequentist statistical guarantees via a basic notion that we call the eligibility set, designed to bypass non-identifiability through set-based estimation. We investigate a feature extraction-then-aggregation approach to construct these sets for multivariate outputs. We demonstrate our methodology on several numerical examples, including an application to the calibration of a limit order book market simulator.
In Chapter 7, we study a methodology to tackle the NASA Langley Uncertainty Quantification Challenge, a model calibration problem under both aleatory and epistemic uncertainties. Our methodology is based on an integration of distributionally robust optimization and importance sampling. The main computational machinery in this integrated methodology amounts to solving sampled linear programs. We present theoretical statistical guarantees of our approach via connections to nonparametric hypothesis testing, as well as numerical performance on parameter calibration and downstream decision and risk evaluation tasks.