Maximum Fidelity
The most fundamental problem in statistics is the inference of an unknown
probability distribution from a finite number of samples. For a specific
observed data set, answers to the following questions would be desirable: (1)
Estimation: Which candidate distribution provides the best fit to the observed
data?, (2) Goodness-of-fit: How concordant is this distribution with the
observed data?, and (3) Uncertainty: How concordant are other candidate
distributions with the observed data? A simple unified approach for univariate
data that addresses these traditionally distinct statistical notions is
presented called "maximum fidelity". Maximum fidelity is a strict frequentist
approach that is fundamentally based on model concordance with the observed
data. The fidelity statistic is a general information measure based on the
coordinate-independent cumulative distribution and critical yet previously
neglected symmetry considerations. An approximation for the null distribution
of the fidelity allows its direct conversion to absolute model concordance (p
value). Fidelity maximization allows identification of the most concordant
model distribution, generating a method for parameter estimation, with
neighboring, less concordant distributions providing the "uncertainty" in this
estimate. Maximum fidelity provides an optimal approach for parameter
estimation (superior to maximum likelihood) and a generally optimal approach
for goodness-of-fit assessment of arbitrary models applied to univariate data.
Extensions to binary data, binned data, multidimensional data, and classical
parametric and nonparametric statistical tests are described. Maximum fidelity
provides a philosophically consistent, robust, and seemingly optimal foundation
for statistical inference. All findings are presented in an elementary way to
be immediately accessible to all researchers utilizing statistical analysis.
Comment: 66 pages, 32 figures, 7 tables, submitted
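The core idea can be illustrated in miniature. The sketch below is not the paper's fidelity statistic; it only demonstrates the general recipe the abstract describes: measure model concordance through the cumulative distribution (here via the probability integral transform) and convert the statistic to a p value using a simulated null distribution. The function name and the choice of distance are illustrative assumptions.

```python
import numpy as np
from scipy.stats import norm

def concordance_pvalue(x, cdf, n_null=2000, seed=0):
    """Monte Carlo p value for model concordance: under the candidate
    model, the PIT values cdf(x) are uniform, so compare their sorted
    values with uniform plotting positions and calibrate against the
    simulated null distribution of the same statistic."""
    rng = np.random.default_rng(seed)
    n = len(x)

    def stat(u):
        # mean squared distance of sorted PIT values from (i - 0.5)/n
        return np.mean((np.sort(u) - (np.arange(1, n + 1) - 0.5) / n) ** 2)

    observed = stat(cdf(np.asarray(x)))
    null = np.array([stat(rng.uniform(size=n)) for _ in range(n_null)])
    return float(np.mean(null >= observed))

# Example: assess a standard-normal model against standard-normal data
rng = np.random.default_rng(1)
p = concordance_pvalue(rng.standard_normal(100), norm.cdf)
```

Because the statistic is a function of the cumulative distribution alone, it is invariant under monotone reparameterizations of the data axis, which is the coordinate-independence property the abstract emphasizes.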
Data-Driven Robust Optimization
The last decade witnessed an explosion in the availability of data for
operations research applications. Motivated by this growing availability, we
propose a novel schema for utilizing data to design uncertainty sets for robust
optimization using statistical hypothesis tests. The approach is flexible and
widely applicable, and robust optimization problems built from our new sets are
computationally tractable, both theoretically and practically. Furthermore,
optimal solutions to these problems enjoy a strong, finite-sample probabilistic
guarantee. We describe concrete procedures for choosing an appropriate
set for a given application and for applying our approach to multiple uncertain
constraints. Computational evidence in portfolio management and queuing confirms
that our data-driven sets significantly outperform traditional robust
optimization techniques whenever data is available.
Comment: 38 pages, 15 page appendix, 7 figures. This version updated as of Oct. 201
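The flavor of a data-driven uncertainty set can be sketched with a deliberately crude stand-in: a per-coordinate empirical-quantile box. The paper's sets are built from proper statistical hypothesis tests and carry finite-sample guarantees, which this toy construction does not; all names below are illustrative.

```python
import numpy as np

def data_driven_box_set(samples, epsilon=0.05):
    """Per-coordinate box uncertainty set from data: for each uncertain
    coefficient, take the [eps/2, 1 - eps/2] empirical quantiles."""
    lo = np.quantile(samples, epsilon / 2, axis=0)
    hi = np.quantile(samples, 1 - epsilon / 2, axis=0)
    return lo, hi

def robust_value(x, lo, hi):
    """Worst case of u @ x over the box [lo, hi]: each coordinate of u
    takes lo or hi depending on the sign of x (here, minimizing u @ x,
    i.e. the worst-case portfolio return)."""
    x = np.asarray(x)
    return float(np.sum(np.where(x >= 0, lo * x, hi * x)))

# Hypothetical return data for a two-asset portfolio
rng = np.random.default_rng(0)
returns = rng.normal([0.05, 0.08], [0.1, 0.2], size=(1000, 2))
lo, hi = data_driven_box_set(returns)
worst = robust_value([0.5, 0.5], lo, hi)
```

For a box set the robust counterpart of a linear constraint stays linear, which is the kind of computational tractability the abstract refers to; the paper's test-based sets achieve this with sharper probabilistic guarantees.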
Bernoulli Regression Models: Revisiting the Specification of Statistical Models with Binary Dependent Variables
The latent variable and generalized linear modelling approaches do not provide a systematic approach for modelling discrete choice observational data. Another alternative, the probabilistic reduction (PR) approach, provides a systematic way to specify such models that can yield reliable statistical and substantive inferences. The purpose of this paper is to re-examine the underlying probabilistic foundations of conditional statistical models with binary dependent variables using the PR approach. This leads to the development of the Bernoulli Regression Model, a family of statistical models which includes the binary logistic regression model. The paper provides an explicit presentation of probabilistic model assumptions, guidance on model specification and estimation, and an empirical application.
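Whatever specification route is taken, the binary logistic regression member of this family is fit by maximum likelihood. A minimal Newton-Raphson sketch (illustrative only, not the paper's PR machinery; the simulated data and function name are assumptions):

```python
import numpy as np

def fit_logistic(X, y, iters=25):
    """Maximum-likelihood fit of a Bernoulli (logistic) regression model,
    P(y = 1 | x) = 1 / (1 + exp(-x @ beta)), via Newton-Raphson."""
    X = np.column_stack([np.ones(len(X)), X])   # prepend intercept column
    beta = np.zeros(X.shape[1])
    for _ in range(iters):
        p = 1.0 / (1.0 + np.exp(-X @ beta))     # Bernoulli mean
        W = p * (1.0 - p)                        # Bernoulli variance weights
        grad = X.T @ (y - p)                     # score vector
        hess = X.T @ (X * W[:, None])            # observed information
        beta = beta + np.linalg.solve(hess, grad)
    return beta

# Simulated binary-choice data with intercept 0.5 and slope 2.0
rng = np.random.default_rng(0)
x = rng.normal(size=500)
prob = 1.0 / (1.0 + np.exp(-(0.5 + 2.0 * x)))
y = (rng.uniform(size=500) < prob).astype(float)
beta = fit_logistic(x, y)
```

The estimated slope should land near the true value of 2.0 for a sample of this size.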
Exact Approaches for Bias Detection and Avoidance with Small, Sparse, or Correlated Categorical Data
Every day, traditional statistical methods are used worldwide to study a variety of topics and provide insight into countless subjects. Each technique is based on a distinct set of assumptions that must hold to ensure valid results. Additionally, many statistical approaches rely on large-sample behavior and may collapse or degenerate in the presence of small, sparse, or correlated data. This dissertation details several advancements to detect these conditions, avoid their consequences, and analyze data in a different way to yield trustworthy results.
One of the most commonly used modeling techniques for outcomes with only two possible categorical values (e.g., live/die, pass/fail, better/worse, etc.) is logistic regression. While some potential complications with this approach are widely known, many investigators are unaware that their particular data do not meet the foundational assumptions, since these are not easy to verify. We have developed a routine for determining whether a researcher should be concerned about potential bias in logistic regression results, so they can take steps to mitigate the bias or use a different procedure altogether to model the data.
Correlated data may arise from common situations such as multi-site medical studies, research on family units, or investigations of student achievement within classrooms. In these circumstances, the associations between cluster members must be included in any statistical analysis testing the hypothesis of a connection between two variables in order for results to be valid.
Previously, investigators had to choose between a method intended for small or sparse data that assumes independence between observations and a method that allows for correlation between observations but requires large samples to be reliable. We present a new method that allows small, clustered samples to be assessed for a relationship between a two-level predictor (e.g., treatment/control) and a categorical outcome (e.g., low/medium/high).
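A simple diagnostic in this spirit is to scan the predictor-by-outcome contingency table for small or empty cells, since a zero cell signals quasi-complete separation and the bias or non-existence of large-sample estimates. This is a generic sketch, not the dissertation's routine; the threshold and data are assumptions.

```python
import numpy as np

def has_sparse_cells(predictor, outcome, min_count=5):
    """Flag small or zero cells in the predictor-by-outcome contingency
    table, a warning sign that large-sample logistic/categorical methods
    may be biased or degenerate (e.g. separation when a cell is 0)."""
    levels_p = np.unique(predictor)
    levels_o = np.unique(outcome)
    table = np.array([[np.sum((predictor == a) & (outcome == b))
                       for b in levels_o] for a in levels_p])
    return bool((table < min_count).any()), table

# Hypothetical small study: treatment/control vs. low/medium/high outcome
pred = np.array(["treat"] * 8 + ["control"] * 8)
out = np.array(["low"] * 2 + ["high"] * 6 + ["low"] * 7 + ["medium"] * 1)
flag, table = has_sparse_cells(pred, out)
```

Here the table contains empty cells (e.g. no "high" outcomes in the control arm), so asymptotic methods would be suspect and an exact or penalized approach is warranted.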
Line transect abundance estimation with uncertain detection on the trackline
Bibliography: leaves 225-233.
After critically reviewing developments in line transect estimation theory to date, general likelihood functions are derived for the case in which detection probabilities are modelled as functions of any number of explanatory variables and detection of animals on the trackline (i.e. directly in the observer's path) is not certain. Existing models are shown to correspond to special cases of the general models. Maximum likelihood estimators are derived for some special cases of the general model and some existing line transect estimators are shown to correspond to maximum likelihood estimators for other special cases. The likelihoods are shown to be extensions of existing mark-recapture likelihoods as well as being generalizations of existing line transect likelihoods. Two new abundance estimators are developed. The first is a Horvitz-Thompson-like estimator which utilizes the fact that for point estimation of abundance the density of perpendicular distances in the population can be treated as known in appropriately designed line transect surveys. The second is based on modelling the probability density function of detection probabilities in the population. Existing line transect estimators are shown to correspond to special cases of the new Horvitz-Thompson-like estimator, so that this estimator, together with the general likelihoods, provides a unifying framework for estimating abundance from line transect surveys.
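The Horvitz-Thompson logic is easy to sketch: each detected animal with estimated detection probability p_i contributes 1/p_i to the abundance estimate in the covered strips, which is then scaled to the whole study area. This toy version assumes the detection probabilities are already fitted and ignores the thesis's likelihood machinery; the numbers are hypothetical.

```python
import numpy as np

def horvitz_thompson_abundance(detection_probs, covered_fraction):
    """Horvitz-Thompson-like abundance estimate: each detected animal
    contributes 1/p_i within the covered strips, then scale up by the
    fraction of the study area covered by the transects."""
    n_hat_covered = np.sum(1.0 / np.asarray(detection_probs))
    return float(n_hat_covered / covered_fraction)

# Hypothetical survey: 4 detections with fitted detection probabilities,
# transect strips covering 10% of the study area
p = [0.8, 0.5, 0.25, 1.0]
n_hat = horvitz_thompson_abundance(p, covered_fraction=0.10)
```

An animal with p_i = 0.25 stands in for four animals, which is exactly how uncertain trackline detection (p < 1) inflates the estimate relative to conventional line transect methods that assume certain detection at zero distance.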
Spot Volatility Estimation of Ito Semimartingales Using Delta Sequences
This thesis studies a unifying class of nonparametric spot volatility estimators proposed by Mancini et al. (2013). This method is based on delta sequences and is conceived to include many of the existing estimators in the field as special cases. The thesis first surveys the asymptotic theory of the proposed estimators under an infill asymptotic scheme and a fixed time horizon, when the state variable follows a Brownian semimartingale. Extensions that include jumps and financial microstructure noise in the observed price process are also presented. The main goal of the thesis is to assess the suitability of the proposed methods with both high-frequency simulated data and real transaction data from the stock market. In conclusion, the double exponential kernel shows the best estimation properties. Moreover, the asymptotic results are robust to the presence of jumps and microstructure noise, and the familiar U-shaped curves of intraday spot volatility are recovered.
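A delta-sequence spot volatility estimator weights squared log-price increments by a kernel concentrated at the target time. The sketch below uses a Gaussian kernel rather than the double exponential kernel favored in the thesis, and is only a minimal instance of the general class; the simulation parameters are assumptions.

```python
import numpy as np

def spot_vol_kernel(times, log_prices, t, h):
    """Kernel (delta-sequence) spot variance estimator at time t:
    sigma2(t) ~ sum_i K_h(t_{i-1} - t) * (dX_i)^2, with Gaussian kernel
    K_h(u) = exp(-u^2 / (2 h^2)) / (h * sqrt(2 pi))."""
    times = np.asarray(times)
    dx = np.diff(np.asarray(log_prices))        # log-price increments
    u = (times[:-1] - t) / h
    k = np.exp(-0.5 * u ** 2) / (h * np.sqrt(2.0 * np.pi))
    return float(np.sum(k * dx ** 2))

# Simulated Brownian path with constant volatility sigma = 0.2 on [0, 1],
# so the spot variance should be close to 0.04 at any interior time
rng = np.random.default_rng(0)
n = 20000
t_grid = np.linspace(0.0, 1.0, n + 1)
increments = 0.2 * np.sqrt(1.0 / n) * rng.standard_normal(n)
x = np.cumsum(np.concatenate([[0.0], increments]))
sigma2_hat = spot_vol_kernel(t_grid, x, t=0.5, h=0.05)
```

Swapping the kernel function recovers other members of the class (uniform kernels give local realized variance, for instance), which is the sense in which the delta-sequence framework unifies existing estimators.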
Mixtures of tails in clustered automobile collision claims
Knowledge of the tail shape of claim distributions provides important actuarial information. This paper discusses how two techniques commonly used in assessing the most appropriate underlying distribution can be usefully combined. The maximum likelihood approach is theoretically appealing since it is preferable to many other estimators in the sense of best asymptotic normality. Likelihood-based tests are, however, not always capable of discriminating among non-nested classes of distributions. Extreme value theory offers an attractive tool to overcome this problem. It shows that a much larger set of distributions is nested in their tails by the so-called tail parameter. This paper shows that both estimation strategies can be usefully combined when the data generating process is characterized by strong clustering in time and size. We find that extreme value theory is a useful starting point in detecting the appropriate distribution class. Once that has been achieved, the likelihood-based EM algorithm is proposed to capture the clustering phenomena. Clustering is particularly pervasive in actuarial data. An empirical application to a four-year data set of Dutch automobile collision claims is therefore used to illustrate the approach.
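The standard extreme-value starting point for estimating the tail parameter is the Hill estimator, built from the k largest observations. The sketch below is a generic illustration (simulated Pareto "claims", not the paper's Dutch collision data, and the choice of k is an assumption):

```python
import numpy as np

def hill_estimator(data, k):
    """Hill estimator of the tail index alpha from the k largest order
    statistics: alpha_hat = 1 / mean(log(X_(n-i+1) / X_(n-k)))."""
    x = np.sort(np.asarray(data))
    top = x[-k:]                      # k largest observations
    threshold = x[-k - 1]             # (k+1)-th largest as threshold
    gamma = np.mean(np.log(top / threshold))   # estimate of 1/alpha
    return 1.0 / gamma

# Simulated Pareto(alpha = 2) claim sizes; the estimator should recover
# a tail index near 2 (np.random.pareto is Lomax, so shift by 1)
rng = np.random.default_rng(0)
claims = rng.pareto(2.0, size=50000) + 1.0
alpha_hat = hill_estimator(claims, k=500)
```

Once the tail class is pinned down this way, a likelihood-based EM fit of the mixture, as the paper proposes, can absorb the clustering in time and size that the pure tail analysis ignores.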