
    Maximum Fidelity

    The most fundamental problem in statistics is the inference of an unknown probability distribution from a finite number of samples. For a specific observed data set, answers to the following questions would be desirable: (1) Estimation: Which candidate distribution provides the best fit to the observed data? (2) Goodness-of-fit: How concordant is this distribution with the observed data? (3) Uncertainty: How concordant are other candidate distributions with the observed data? A simple unified approach for univariate data that addresses these traditionally distinct statistical notions, called "maximum fidelity", is presented. Maximum fidelity is a strict frequentist approach that is fundamentally based on model concordance with the observed data. The fidelity statistic is a general information measure based on the coordinate-independent cumulative distribution and on critical yet previously neglected symmetry considerations. An approximation for the null distribution of the fidelity allows its direct conversion to absolute model concordance (a p value). Fidelity maximization identifies the most concordant model distribution, yielding a method for parameter estimation, with neighboring, less concordant distributions providing the "uncertainty" in this estimate. Maximum fidelity provides an optimal approach for parameter estimation (superior to maximum likelihood) and a generally optimal approach for goodness-of-fit assessment of arbitrary models applied to univariate data. Extensions to binary data, binned data, multidimensional data, and classical parametric and nonparametric statistical tests are described. Maximum fidelity provides a philosophically consistent, robust, and seemingly optimal foundation for statistical inference. All findings are presented in an elementary way so as to be immediately accessible to all researchers utilizing statistical analysis.
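
    The abstract does not spell out the fidelity statistic itself, so the following is only a rough Python sketch of the workflow it describes: score candidate parameters by the concordance of the model CDF with the observed data, pick the most concordant one, and convert its score to a p value. The normal model, the Cramér-von Mises-type discrepancy, and the Monte Carlo null calibration are all illustrative stand-ins, not the paper's definitions.

```python
# Illustrative sketch only: a CDF-concordance workflow loosely mirroring the
# estimation / goodness-of-fit questions in the abstract. The discrepancy used
# here (Cramer-von Mises type) is a stand-in, NOT the paper's fidelity statistic.
import numpy as np
from scipy import stats, optimize

rng = np.random.default_rng(0)
x = rng.normal(loc=2.0, scale=1.5, size=50)   # observed univariate sample

def discrepancy(params, data):
    """CDF-based discrepancy between a normal model and the data (stand-in)."""
    mu, log_sigma = params
    u = stats.norm.cdf(np.sort(data), loc=mu, scale=np.exp(log_sigma))
    n = len(data)
    # Cramer-von Mises statistic on the probability-integral-transformed data
    return 1.0 / (12 * n) + np.sum((u - (2 * np.arange(1, n + 1) - 1) / (2 * n)) ** 2)

# (1) Estimation: the most concordant normal distribution for these data
res = optimize.minimize(discrepancy, x0=[np.mean(x), np.log(np.std(x))], args=(x,))
mu_hat, sigma_hat = res.x[0], np.exp(res.x[1])

# (2) Goodness-of-fit: Monte Carlo p value of the observed discrepancy
# (a rough null calibration; refitting each simulated sample would be more careful)
sims = [discrepancy(res.x, stats.norm.rvs(mu_hat, sigma_hat, size=len(x), random_state=rng))
        for _ in range(999)]
p_value = np.mean(np.array(sims) >= res.fun)
print(f"mu={mu_hat:.3f}, sigma={sigma_hat:.3f}, p={p_value:.3f}")
```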

    Data-Driven Robust Optimization

    The last decade witnessed an explosion in the availability of data for operations research applications. Motivated by this growing availability, we propose a novel schema for utilizing data to design uncertainty sets for robust optimization using statistical hypothesis tests. The approach is flexible and widely applicable, and robust optimization problems built from our new sets are computationally tractable, both theoretically and practically. Furthermore, optimal solutions to these problems enjoy a strong, finite-sample probabilistic guarantee. We describe concrete procedures for choosing an appropriate set for a given application and for applying our approach to multiple uncertain constraints. Computational evidence in portfolio management and queuing confirms that our data-driven sets significantly outperform traditional robust optimization techniques whenever data is available.
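
    The abstract does not detail the specific hypothesis tests used to build the sets, so the sketch below only illustrates the general idea of turning a finite sample into an uncertainty set with a probabilistic guarantee: an ellipsoidal set centred at the sample mean, shaped by the sample covariance, and calibrated with a chi-square quantile, applied to a single uncertain linear constraint a·x ≤ b. The ellipsoidal shape and the chi-square calibration are assumptions for illustration, not the paper's construction.

```python
# Illustrative sketch: a data-driven ellipsoidal uncertainty set for an
# uncertain coefficient vector `a` in a constraint a @ x <= b.
# The chi-square calibration below is an assumption for illustration;
# the paper derives its sets from statistical hypothesis tests.
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
A = rng.normal(loc=[0.05, 0.08, 0.03], scale=0.02, size=(200, 3))  # observed data on `a`

alpha = 0.05                      # target violation probability
mu = A.mean(axis=0)               # center of the set
Sigma = np.cov(A, rowvar=False)   # shape of the set
radius = np.sqrt(stats.chi2.ppf(1 - alpha, df=A.shape[1]))

def robust_lhs(x):
    """Worst-case of a @ x over the ellipsoid {mu + radius * Sigma^(1/2) u : ||u|| <= 1}."""
    L = np.linalg.cholesky(Sigma)
    return mu @ x + radius * np.linalg.norm(L.T @ x)

x = np.array([0.4, 0.3, 0.3])     # a candidate decision (e.g., portfolio weights)
b = 0.10
print("worst-case a@x =", robust_lhs(x), "feasible:", robust_lhs(x) <= b)
```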

    Bernoulli Regression Models: Revisiting the Specification of Statistical Models with Binary Dependent Variables

    The latent variable and generalized linear modelling approaches do not provide a systematic way to model discrete choice observational data. Another alternative, the probabilistic reduction (PR) approach, provides a systematic way to specify such models that can yield reliable statistical and substantive inferences. The purpose of this paper is to re-examine the underlying probabilistic foundations of conditional statistical models with binary dependent variables using the PR approach. This leads to the development of the Bernoulli Regression Model, a family of statistical models which includes the binary logistic regression model. The paper provides an explicit presentation of probabilistic model assumptions, guidance on model specification and estimation, and an empirical application.
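
    Since the Bernoulli Regression Model family contains the binary logistic regression model as a special case, a minimal sketch of fitting that special case by maximum likelihood (Newton-Raphson / IRLS) may help fix ideas. The simulated data and the plain Newton solver are illustrative choices, not the paper's probabilistic-reduction specification procedure.

```python
# Minimal logistic-regression fit by Newton-Raphson (IRLS), as an example of
# the binary-logistic special case of the Bernoulli Regression Model family.
# Data and solver details are illustrative only.
import numpy as np

rng = np.random.default_rng(2)
n = 500
X = np.column_stack([np.ones(n), rng.normal(size=n)])      # intercept + one covariate
beta_true = np.array([-0.5, 1.2])
p = 1.0 / (1.0 + np.exp(-X @ beta_true))
y = rng.binomial(1, p)                                      # Bernoulli responses

beta = np.zeros(X.shape[1])
for _ in range(25):                                         # Newton-Raphson iterations
    mu = 1.0 / (1.0 + np.exp(-(X @ beta)))                  # P(y = 1 | x)
    W = mu * (1.0 - mu)                                     # Bernoulli variance weights
    grad = X.T @ (y - mu)                                   # score vector
    hess = X.T @ (X * W[:, None])                           # information matrix
    step = np.linalg.solve(hess, grad)
    beta = beta + step
    if np.max(np.abs(step)) < 1e-10:
        break

print("estimated coefficients:", beta)
```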

    Exact Approaches for Bias Detection and Avoidance with Small, Sparse, or Correlated Categorical Data

    Every day, traditional statistical methods are used worldwide to study a variety of topics and provide insight regarding countless subjects. Each technique is based on a distinct set of assumptions to ensure valid results. Additionally, many statistical approaches rely on large-sample behavior and may collapse or degenerate in the presence of small, sparse, or correlated data. This dissertation details several advancements to detect these conditions, avoid their consequences, and analyze data in a different way to yield trustworthy results. One of the most commonly used modeling techniques for outcomes with only two possible categorical values (e.g., live/die, pass/fail, better/worse, etc.) is logistic regression. While some potential complications with this approach are widely known, many investigators are unaware that their particular data do not meet the foundational assumptions, since these are not easy to verify. We have developed a routine for determining whether a researcher should be concerned about potential bias in logistic regression results, so they can take steps to mitigate the bias or use a different procedure altogether to model the data. Correlated data may arise from common situations such as multi-site medical studies, research on family units, or investigations of student achievement within classrooms. In these circumstances, the associations between cluster members must be included in any statistical analysis testing the hypothesis of a connection between two variables in order for the results to be valid. Previously, investigators had to choose between using a method intended for small or sparse data while assuming independence between observations, or a method that allowed for correlation between observations while requiring large samples to be reliable. We present a new method that allows small, clustered samples to be assessed for a relationship between a two-level predictor (e.g., treatment/control) and a categorical outcome (e.g., low/medium/high).
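
    The dissertation's specific diagnostic routine is not described in the abstract; the small simulation below merely illustrates why such a routine is needed, by showing the well-known finite-sample bias of logistic regression MLEs in a small sample. The sample size, the true coefficients, and the crude handling of separation are illustrative assumptions.

```python
# Illustrative simulation: average logistic-regression MLE vs. the true slope
# for a small sample, showing finite-sample bias (replicates with complete
# separation, where the MLE does not exist, are simply skipped here).
import numpy as np

rng = np.random.default_rng(3)

def fit_logistic(X, y, n_iter=50):
    """Plain Newton-Raphson logistic fit; returns None if it fails to converge."""
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        mu = 1.0 / (1.0 + np.exp(-(X @ beta)))
        W = mu * (1.0 - mu)
        try:
            step = np.linalg.solve(X.T @ (X * W[:, None]), X.T @ (y - mu))
        except np.linalg.LinAlgError:
            return None
        beta = beta + step
        if not np.all(np.isfinite(beta)) or np.max(np.abs(beta)) > 50:
            return None                       # crude separation / divergence check
        if np.max(np.abs(step)) < 1e-8:
            return beta
    return None

true_slope, n, estimates = 1.0, 25, []
for _ in range(2000):
    x = rng.normal(size=n)
    X = np.column_stack([np.ones(n), x])
    y = rng.binomial(1, 1.0 / (1.0 + np.exp(-(X @ np.array([0.0, true_slope])))))
    beta = fit_logistic(X, y)
    if beta is not None:
        estimates.append(beta[1])

print(f"true slope {true_slope}, mean MLE {np.mean(estimates):.3f} "
      f"over {len(estimates)} converged replicates")
```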

    Line transect abundance estimation with uncertain detection on the trackline

    After critically reviewing developments in line transect estimation theory to date, general likelihood functions are derived for the case in which detection probabilities are modelled as functions of any number of explanatory variables and detection of animals on the trackline (i.e. directly in the observer's path) is not certain. Existing models are shown to correspond to special cases of the general models. Maximum likelihood estimators are derived for some special cases of the general model, and some existing line transect estimators are shown to correspond to maximum likelihood estimators for other special cases. The likelihoods are shown to be extensions of existing mark-recapture likelihoods as well as generalizations of existing line transect likelihoods. Two new abundance estimators are developed. The first is a Horvitz-Thompson-like estimator which utilizes the fact that, for point estimation of abundance, the density of perpendicular distances in the population can be treated as known in appropriately designed line transect surveys. The second is based on modelling the probability density function of detection probabilities in the population. Existing line transect estimators are shown to correspond to special cases of the new Horvitz-Thompson-like estimator, so that this estimator, together with the general likelihoods, provides a unifying framework for estimating abundance from line transect surveys.
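
    As a hedged sketch of the Horvitz-Thompson-like idea summarized above (each detected animal weighted by the inverse of an estimated detection probability, with the perpendicular-distance distribution treated as known by design), the example below fits a half-normal detection function to simulated perpendicular distances and sums inverse average detection probabilities over detections. The half-normal form, the truncation width, and the simulated data are assumptions for illustration, not the thesis's general likelihood.

```python
# Illustrative Horvitz-Thompson-style abundance estimate from a line transect:
# each detected animal contributes 1 / p_bar, with p_bar from a fitted
# half-normal detection function. Detection function and data are assumptions.
import numpy as np
from scipy import optimize, stats

rng = np.random.default_rng(4)
w = 100.0                                   # truncation half-width (m), assumed
sigma_true = 40.0
# simulate perpendicular distances of detected animals (half-normal detection)
candidates = rng.uniform(0, w, size=2000)
detected = candidates[rng.random(2000) < np.exp(-candidates**2 / (2 * sigma_true**2))]

def neg_loglik(log_sigma, d, w):
    """Negative log-likelihood of perpendicular distances under half-normal detection."""
    sigma = np.exp(log_sigma)
    # density of observed distances: g(d) / integral_0^w g(u) du, g half-normal
    denom = sigma * np.sqrt(2 * np.pi) * (stats.norm.cdf(w / sigma) - 0.5)
    return -np.sum(-d**2 / (2 * sigma**2) - np.log(denom))

# fit the detection-function scale by maximum likelihood
res = optimize.minimize_scalar(neg_loglik, bounds=(0.0, 6.0), method="bounded",
                               args=(detected, w))
sigma_hat = np.exp(res.x)

# average detection probability in the covered strip (perpendicular distances
# are uniform on [0, w] by survey design), then a Horvitz-Thompson-style sum
p_bar = np.sqrt(2 * np.pi) * sigma_hat * (stats.norm.cdf(w / sigma_hat) - 0.5) / w
N_hat_covered = np.sum(np.full(detected.size, 1.0 / p_bar))   # = n / p_bar here
print(f"sigma_hat={sigma_hat:.1f} m, detected n={detected.size}, "
      f"estimated animals in covered strip: {N_hat_covered:.0f}")
```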

    Spot Volatility Estimation of Ito Semimartingales Using Delta Sequences

    This thesis studies a unifying class of nonparametric spot volatility estimators proposed by Mancini et al. (2013). The method is based on delta sequences and is conceived to include many of the existing estimators in the field as special cases. The thesis first surveys the asymptotic theory of the proposed estimators under an infill asymptotic scheme with a fixed time horizon, when the state variable follows a Brownian semimartingale. Some extensions to include jumps and financial microstructure noise in the observed price process are also presented. The main goal of the thesis is to assess the suitability of the proposed methods with both high-frequency simulated data and real transaction data from the stock market. In conclusion, the double exponential kernel shows the best estimation properties. Moreover, the results are robust to the presence of jumps and microstructure noise, and the U-shaped pattern of intraday spot volatility is recovered.
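
    Kernel-type estimators are among the special cases covered by a delta-sequence class of spot volatility estimators; as an illustrative sketch of one such estimator, the snippet below computes a Gaussian-kernel spot volatility estimate from simulated high-frequency increments of a Brownian semimartingale. The deterministic volatility path, the kernel, and the bandwidth are assumptions, not choices taken from the thesis.

```python
# Illustrative kernel-type spot volatility estimator (one member of the
# delta-sequence class): sigma2_hat(t0) = sum_i K_h(t_{i-1} - t0) * (dX_i)^2,
# with a Gaussian kernel K_h. Model, kernel and bandwidth are assumptions.
import numpy as np

rng = np.random.default_rng(5)
n = 23400                                    # one "trading day" of 1-second returns
t = np.linspace(0.0, 1.0, n + 1)
sigma = 0.2 + 0.1 * np.sin(2 * np.pi * t)    # deterministic spot-vol path (illustrative)
dX = sigma[:-1] * np.sqrt(np.diff(t)) * rng.normal(size=n)   # semimartingale increments

def spot_vol(t0, dX, t, h=0.02):
    """Kernel-weighted sum of squared increments around time t0."""
    weights = np.exp(-0.5 * ((t[:-1] - t0) / h) ** 2) / (h * np.sqrt(2 * np.pi))
    # weights * dX^2 ~ weights * sigma^2 * dt, and the kernel integrates to 1,
    # so the sum approximates sigma^2(t0) without an extra 1/dt factor
    return np.sum(weights * dX**2)

for t0 in (0.25, 0.5, 0.75):
    print(f"t={t0:.2f}: sigma_hat={np.sqrt(spot_vol(t0, dX, t)):.3f}, "
          f"true sigma={0.2 + 0.1 * np.sin(2 * np.pi * t0):.3f}")
```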

    Mixtures of tails in clustered automobile collision claims

    Knowledge of the tail shape of claim distributions provides important actuarial information. This paper discusses how two techniques commonly used in assessing the most appropriate underlying distribution can be usefully combined. The maximum likelihood approach is theoretically appealing, since it is preferable to many other estimators in the sense of best asymptotic normality. Likelihood-based tests are, however, not always capable of discriminating among non-nested classes of distributions. Extreme value theory offers an attractive tool to overcome this problem: it shows that a much larger set of distributions is nested in their tails by the so-called tail parameter. This paper shows that both estimation strategies can be usefully combined when the data-generating process is characterized by strong clustering in time and size. We find that extreme value theory is a useful starting point for detecting the appropriate distribution class. Once that has been achieved, a likelihood-based EM algorithm is proposed to capture the clustering phenomena. Clustering is particularly pervasive in actuarial data. An empirical application to a four-year data set of Dutch automobile collision claims is therefore used to illustrate the approach.
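
    As an illustration of the "extreme value theory first" step described above, the snippet below applies the standard Hill estimator to simulated heavy-tailed claim sizes to recover the tail parameter before any distribution class is chosen. The Pareto simulation, the threshold choices, and the use of the Hill estimator are textbook tools used for illustration, not the paper's EM-based mixture fit.

```python
# Illustrative first step: estimate the tail parameter of claim sizes with the
# Hill estimator before choosing a distribution class. Data, thresholds and
# estimator choice are illustrative; the paper follows EVT with an EM fit.
import numpy as np

rng = np.random.default_rng(6)
alpha_true = 2.5                                   # true Pareto tail index
claims = (rng.pareto(alpha_true, size=5000) + 1.0) * 1000.0   # simulated claim sizes

def hill_estimator(x, k):
    """Hill estimate of the tail index from the k largest order statistics."""
    x_sorted = np.sort(x)[::-1]                    # descending order
    logs = np.log(x_sorted[:k]) - np.log(x_sorted[k])
    return 1.0 / np.mean(logs)

for k in (50, 100, 200):
    print(f"k={k}: Hill tail index = {hill_estimator(claims, k):.2f} (true {alpha_true})")
```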