    Algorithmic Advances for the Design and Analysis of Randomized Experiments

    Randomized experiments are the gold standard for investigating the causal effect of treatment on a population. In this dissertation, we present algorithmic advances for three different problems arising in the design and analysis of randomized experiments: covariate balancing, variance estimation, and bipartite experiments. In the first chapter, we describe an inherent trade-off between covariate balancing and robustness, which we formulate as a distributional discrepancy problem. In order to navigate this trade-off, we present the Gram–Schmidt Walk Design, which is based on the recent discrepancy algorithm of Bansal, Dadush, Garg, and Lovett (2019). By tightening the algorithmic analysis, we derive bounds on the mean squared error of the Horvitz–Thompson estimator under this design in terms of a ridge regression of the outcomes on the covariates, which we interpret as regression by design. We carry out further analysis, including tail bounds on the effect estimator, methods for constructing confidence intervals, and an extension of the design which accommodates non-linear responses via kernel methods. In the second chapter, we study the problem of estimating the variance of treatment effect estimators under interference. It is well known that unbiased variance estimation is impossible without strong assumptions on the outcomes, due to the fundamental problem of causal inference. Thus, we study a class of conservative estimators which are based on variance bounds. We identify conditions under which the variance bounds themselves are admissible and provide a general algorithmic framework to construct admissible variance bounds according to the experimenter's preferences and prior substantive knowledge. In the final chapter, we present methodology for the newly proposed bipartite experimental framework, where units which receive treatment are distinct from units on which outcomes are measured, and the two are connected via a bipartite graph. We investigate a linear exposure-response assumption which allows for more complex interactions. We propose the Exposure Re-weighted Linear (ERL) estimator, which we show is unbiased in finite samples and consistent and asymptotically normal in large samples, provided the bipartite graph is sufficiently sparse. We provide a variance estimator which facilitates confidence intervals based on the normal approximation. Finally, we present Exposure-Design, a correlation-clustering-based design for improving the precision of the ERL estimator.
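    For reference, a minimal sketch of the Horvitz–Thompson point estimator analyzed in the first chapter is given below. It assumes a design with known, balanced marginal treatment probabilities and uses plain NumPy; it does not implement the Gram–Schmidt Walk design itself, and the function names are illustrative.

```python
import numpy as np

def horvitz_thompson_ate(y_obs, z, p_treat=0.5):
    """Horvitz-Thompson estimate of the average treatment effect.

    y_obs   : observed outcomes (one per unit, under the realized assignment)
    z       : binary assignment vector (1 = treated, 0 = control)
    p_treat : marginal treatment probability under the design
              (1/2 for designs with balanced marginals)
    """
    y_obs = np.asarray(y_obs, dtype=float)
    z = np.asarray(z)
    n = len(y_obs)
    treated = np.where(z == 1, y_obs / p_treat, 0.0)
    control = np.where(z == 0, y_obs / (1.0 - p_treat), 0.0)
    return (treated.sum() - control.sum()) / n

# Toy usage with a completely randomized assignment (not the GSW design itself).
rng = np.random.default_rng(0)
y = rng.normal(size=100)
z = rng.integers(0, 2, size=100)
print(horvitz_thompson_ate(y, z))
```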

    Topics at the interface of optimization and statistics

    Optimization has been an important tool in statistics for a long time. For example, the problem of parameter estimation in a statistical model, either by maximizing a likelihood function or using a least squares approach, reduces to solving an optimization problem. Not only has optimization been utilized in solving traditional statistical problems, but it also plays a crucial role in more recent areas such as statistical learning. In particular, in most statistical learning models, one learns the best parameters for the model by minimizing some cost function under certain constraints. In the past decade or so, there has been an increasing trend in the reverse direction: using statistics as a powerful tool in optimization. As learning algorithms become more efficient, researchers have focused on finding ways to apply learning models to improve the performance of existing optimization algorithms. Following in their footsteps, in this thesis, we study a recent algorithm for generating cutting planes in mixed integer linear programming problems and show how one can apply learning algorithms to improve it. In addition, we use the decision theory framework to evaluate whether the solution given by the sample average approximation, a commonly used method to solve stochastic programming problems, is "good". In particular, we show that the sample average solution is admissible for an uncertain linear objective over a fixed compact set, and for a convex quadratic function with an uncertain linear term over box constraints when the dimension is less than 4. Finally, we combine tools from mixed integer programming and Bayesian statistics to solve the catalog matching problem in astronomy, which tries to associate an object's detections coming from independent catalogs. This problem has been studied by many researchers. However, the most recent algorithm to tackle the problem has only been shown to work with 3 catalogs. In this thesis, we extend this algorithm to allow for matching across a higher number of catalogs. In addition, we introduce a new algorithm that is more efficient and scales much better with a large number of catalogs.
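    As an illustration of the sample average approximation discussed above, the following sketch treats the simplest case mentioned in the abstract, an uncertain linear objective over box constraints; the function names and toy data are illustrative rather than taken from the thesis.

```python
import numpy as np

def saa_linear_over_box(cost_samples, lower, upper):
    """Sample average approximation for min E[c(xi)]^T x over a box.

    cost_samples : array of shape (N, d), each row a sampled cost vector c(xi_i)
    lower, upper : box constraints lower <= x <= upper (arrays of length d)

    The SAA problem replaces the expected cost by the sample mean; for a
    linear objective over a box, the minimizer is attained coordinate-wise
    at a box endpoint determined by the sign of the averaged coefficient.
    """
    c_bar = np.asarray(cost_samples, dtype=float).mean(axis=0)
    return np.where(c_bar >= 0, lower, upper)

# Toy usage with a hypothetical 3-dimensional uncertain cost vector.
rng = np.random.default_rng(1)
samples = rng.normal(loc=[1.0, -2.0, 0.5], scale=1.0, size=(200, 3))
x_saa = saa_linear_over_box(samples, lower=np.zeros(3), upper=np.ones(3))
print(x_saa)
```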

    On the efficiency and consistency of likelihood estimation in multivariate conditionally heteroskedastic dynamic regression models

    We rank the efficiency of several likelihood-based parametric and semiparametric estimators of conditional mean and variance parameters in multivariate dynamic models with i.i.d. spherical innovations, and show that Gaussian pseudo maximum likelihood estimators are inefficient except under normality. We also provide conditions for partial adaptivity of semiparametric procedures, and relate them to the consistency of distributionally misspecified maximum likelihood estimators. We propose Hausman tests that compare Gaussian pseudo maximum likelihood estimators with more efficient but less robust competitors. We also study the efficiency of sequential estimators of the shape parameters. Finally, we provide finite sample results through Monte Carlo simulations.
    Keywords: Adaptivity, ARCH, Elliptical Distributions, Financial Returns, Hausman tests, Semiparametric Estimators, Sequential Estimators.
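    The Hausman tests mentioned above contrast a robust estimator with a more efficient one that is consistent only under the null. The snippet below is a generic sketch of the standard Hausman statistic, not the specific contrasts developed in the paper; the function name is illustrative.

```python
import numpy as np
from scipy import stats

def hausman_test(theta_robust, theta_efficient, cov_robust, cov_efficient):
    """Generic Hausman-type test statistic.

    Compares a robust estimator (consistent under misspecification) with a
    more efficient one that is consistent only under the null.  Under the
    null the statistic is asymptotically chi-squared; degrees of freedom are
    taken here as the number of compared parameters (in practice, the rank
    of the covariance difference is often used instead).
    """
    diff = np.asarray(theta_robust) - np.asarray(theta_efficient)
    cov_diff = np.asarray(cov_robust) - np.asarray(cov_efficient)
    # pinv guards against a singular covariance difference.
    stat = float(diff @ np.linalg.pinv(cov_diff) @ diff)
    p_value = stats.chi2.sf(stat, df=diff.size)
    return stat, p_value
```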

    Regression modelling with I-priors

    We introduce the I-prior methodology as a unifying framework for estimating a variety of regression models, including varying coefficient, multilevel, longitudinal models, and models with functional covariates and responses. It can also be used for multi-class classification, with low- or high-dimensional covariates. The I-prior is generally defined as a maximum entropy prior. For a regression function, the I-prior is Gaussian with covariance kernel proportional to the Fisher information on the regression function, which is estimated by its posterior distribution under the I-prior. The I-prior has the intuitively appealing property that the more information is available on a linear functional of the regression function, the larger the prior variance, and the smaller the influence of the prior mean on the posterior distribution. Advantages compared to competing methods, such as Gaussian process regression or Tikhonov regularization, are ease of estimation and model comparison. In particular, we develop an EM algorithm with a simple E and M step for estimating hyperparameters, facilitating estimation for complex models. We also propose a novel parsimonious model formulation, requiring a single scale parameter for each (possibly multidimensional) covariate and no further parameters for interaction effects. This simplifies estimation because fewer hyperparameters need to be estimated, and also simplifies comparison of models with the same covariates but different interaction effects; in this case, the model with the highest estimated likelihood can be selected. Using a number of widely analyzed real data sets, we show that the predictive performance of our methodology is competitive. An R package implementing the methodology is available (Jamil, 2019).
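    The following sketch illustrates the flavor of posterior computation under such a prior, assuming (as a simplification) a Gaussian prior on the vector of function values whose covariance is proportional to the squared kernel matrix, i.e. to the Fisher information at the design points; the exact I-prior parameterization and the EM-based hyperparameter estimation are in the cited references, and all names here are illustrative.

```python
import numpy as np

def iprior_posterior_mean(K, y, noise_var=1.0, prior_scale=1.0):
    """Posterior mean of the regression function values under a Gaussian
    prior with covariance proportional to K @ K (a simplified reading of
    the I-prior construction; see Jamil (2019) for the exact setup).

    K         : n x n kernel (Gram) matrix of the covariates
    y         : observed responses, length n
    noise_var : error variance of the Gaussian regression model
    """
    n = len(y)
    prior_cov = prior_scale * K @ K          # covariance ∝ Fisher information
    # Standard Gaussian conjugate update with a zero prior mean.
    return prior_cov @ np.linalg.solve(prior_cov + noise_var * np.eye(n), y)

# Toy usage with a linear (dot-product) kernel.
rng = np.random.default_rng(2)
X = rng.normal(size=(50, 3))
y = X @ np.array([1.0, -0.5, 0.0]) + 0.1 * rng.normal(size=50)
K = X @ X.T
print(iprior_posterior_mean(K, y, noise_var=0.01)[:5])
```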

    Computational Methods for the Analysis of Complex Data

    This PhD dissertation bridges the disciplines of Operations Research and Statistics to develop novel computational methods for the extraction of knowledge from complex data. In this research, complex data stands for datasets with many instances and/or variables, with different types of variables, with dependence structures among the variables, collected from different sources (heterogeneous), possibly with non-identical population class sizes, with different misclassification costs, or characterized by extreme instances (heavy-tailed data), among others. The complexity of the raw data, together with new requirements posed by practitioners (interpretable models, cost-sensitive models, or models that are efficient in terms of running time), poses a challenge from a scientific perspective. The main contributions of this PhD dissertation are encompassed in three different research frameworks: Regression, Classification and Bayesian inference. Concerning the first, we consider linear regression models, where a continuous outcome variable is to be predicted by a set of features. On the one hand, seeking interpretable solutions in heterogeneous datasets, we propose a novel version of the Lasso in which the performance of the method on groups of interest is controlled. On the other hand, we use mathematical optimization tools to propose a sparse linear regression model (that is, a model whose solution only depends on a subset of predictors) specifically designed for datasets with categorical and hierarchical features. Regarding the task of Classification, in this PhD dissertation we explore the Naïve Bayes classifier in depth. This method has been adapted to obtain a sparse solution, and it has also been modified to deal with cost-sensitive datasets. For both problems, novel strategies for reducing high running times are presented. Finally, the last contribution of this dissertation concerns Bayesian inference methods. In particular, in the setting of heavy-tailed data, we consider a semi-parametric Bayesian approach to estimate the Elliptical distribution.
    The structure of this dissertation is as follows. Chapter 1 contains the theoretical background needed to develop the following chapters. In particular, two main research areas are reviewed: sparse and cost-sensitive statistical learning and Bayesian Statistics. Chapter 2 proposes a Lasso-based method in which quadratic performance constraints, bounding the prediction errors on the individuals of interest, are added to the Lasso objective function. This constrained sparse regression model is defined by a nonlinear optimization problem. Specifically, it has a direct application in heterogeneous samples where data are collected from distinct sources, as is standard in many biomedical contexts. Chapter 3 studies linear regression models built on categorical predictor variables that have a hierarchical structure. The model is flexible in the sense that the user decides the level of detail in the information used to build it, taking into account data privacy considerations. To trade off the accuracy of the linear regression model and its complexity, a Mixed Integer Convex Quadratic Problem with Linear Constraints is solved. In Chapter 4, a sparse version of the Naïve Bayes classifier, characterized by the following three properties, is proposed. First, the selection of the subset of variables is done in terms of the correlation structure of the predictor variables. Second, such selection can be based on different performance measures. Additionally, performance constraints on groups of higher interest can be included. This smart search offers flexibility in terms of classification performance while yielding competitive running times. The approach introduced in Chapter 2 is also explored in Chapter 5 for improving the performance of the Naïve Bayes classifier in the classes of most interest to the user. Unlike the traditional version of the classifier, which is a two-step classifier (estimation first and classification next), the novel approach integrates both stages. The method is formulated via an optimization problem where the likelihood function is maximized with constraints on the classification rates for the groups of interest. When dealing with datasets with special characteristics (for example, heavy tails in contexts such as Economics and Finance), Bayesian statistical techniques have shown their potential in the literature. In Chapter 6, Elliptical distributions, which generalize the multivariate normal distribution to allow for longer tails while retaining elliptical contours, are examined, and Bayesian methods are used to perform semi-parametric inference for them. Finally, Chapter 7 closes the thesis with general conclusions and future lines of research.
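    As a rough illustration of the constrained Lasso idea of Chapter 2, the sketch below adds quadratic performance constraints on groups of interest to a Lasso objective. It assumes the cvxpy modelling library, and the function and variable names are illustrative; the exact formulation is in the dissertation.

```python
import cvxpy as cp
import numpy as np

def group_constrained_lasso(X, y, groups, error_bounds, lam=1.0):
    """Lasso with quadratic performance constraints on groups of interest.

    Minimizes the usual Lasso objective while requiring the sum of squared
    prediction errors within each group of interest to stay below a
    user-specified bound.

    groups       : list of index arrays, one per group of interest
    error_bounds : list of error bounds, one per group
    """
    n, d = X.shape
    beta = cp.Variable(d)
    residual = y - X @ beta
    constraints = [
        cp.sum_squares(residual[idx]) <= bound
        for idx, bound in zip(groups, error_bounds)
    ]
    objective = cp.Minimize(cp.sum_squares(residual) + lam * cp.norm1(beta))
    cp.Problem(objective, constraints).solve()
    return beta.value

# Toy usage: a tighter error bound on the first 20 observations.
rng = np.random.default_rng(3)
X = rng.normal(size=(100, 10))
y = X[:, 0] - 2 * X[:, 1] + 0.5 * rng.normal(size=100)
print(group_constrained_lasso(X, y, groups=[np.arange(20)], error_bounds=[10.0]))
```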

    Least Ambiguous Set-Valued Classifiers with Bounded Error Levels

    In most classification tasks there are observations that are ambiguous and therefore difficult to correctly label. Set-valued classifiers output sets of plausible labels rather than a single label, thereby giving a more appropriate and informative treatment to the labeling of ambiguous instances. We introduce a framework for multiclass set-valued classification, where the classifiers guarantee user-defined levels of coverage or confidence (the probability that the true label is contained in the set) while minimizing the ambiguity (the expected size of the output). We first derive oracle classifiers assuming the true distribution to be known. We show that the oracle classifiers are obtained from level sets of the functions that define the conditional probability of each class. Then we develop estimators with good asymptotic and finite sample properties. The proposed estimators build on existing single-label classifiers. The optimal classifier can sometimes output the empty set, but we provide two solutions to fix this issue that are suitable for various practical needs.
    Comment: Final version to be published in the Journal of the American Statistical Association at https://www.tandfonline.com/doi/abs/10.1080/01621459.2017.1395341?journalCode=uasa2
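    A rough plug-in sketch of the level-set construction described above: estimate class-conditional probabilities with any base classifier, choose class-specific thresholds on a held-out calibration set to meet the coverage targets, and output every label whose estimated probability reaches its threshold. The names below are illustrative, and this is not the exact estimator of the paper.

```python
import numpy as np

def fit_thresholds(prob_cal, y_cal, alpha):
    """Class-specific thresholds for a set-valued classifier.

    prob_cal : (n, K) estimated class probabilities on a calibration set
    y_cal    : true labels (0..K-1) on the calibration set
    alpha    : target miscoverage level per class

    For each class k, pick the threshold so that roughly a 1 - alpha
    fraction of calibration points with true label k satisfy
    p_k(x) >= threshold.
    """
    n_classes = prob_cal.shape[1]
    thresholds = np.empty(n_classes)
    for k in range(n_classes):
        scores_k = prob_cal[y_cal == k, k]
        thresholds[k] = np.quantile(scores_k, alpha)  # lower alpha-quantile
    return thresholds

def predict_sets(prob_test, thresholds):
    """Return, for each test point, the set of labels whose estimated
    conditional probability reaches its class threshold."""
    return [np.flatnonzero(p >= thresholds) for p in prob_test]
```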

    Model-Oriented Data Analysis: Proceedings of an IIASA Workshop, Eisenach, GDR, March 9-13, 1987

    The main topics of this workshop were (1) optimal experimental design, (2) regression analysis, and (3) model testing and applications. Under the topic "Optimal experimental design", new optimality criteria based on asymptotic properties of relevant statistics were discussed. The use of additional restrictions on the designs was also discussed, inadequate and nonlinear models were considered, and Bayesian approaches to the design problem in the nonlinear case were a focal point of the special session. It was emphasized that experimental design is a field of much current interest. During the sessions devoted to "Regression analysis", it became clear that there has been substantial progress in statistics for nonlinear models. Here, besides the asymptotic behavior of several estimators, the non-asymptotic properties of some interesting statistics were discussed. The distribution of the maximum-likelihood (ML) estimator in normal models and alternative estimators to the least-squares or ML estimators were discussed intensively. Several approaches to "resampling" were considered in connection with linear, nonlinear and semiparametric models. Some new results were reported concerning simulated likelihoods, which provide a powerful tool for statistics in several types of models. The advantages and problems of bootstrapping, jackknifing and related methods were considered in a number of papers. Under the topic of "Model testing and applications", the papers covered a broad spectrum of problems. Methods for the detection of outliers and the consequences of transformations of data were discussed. Furthermore, robust regression methods, empirical Bayesian approaches and the stability of estimators were considered, together with numerical problems in data analysis and the use of computer packages.