Algorithmic Advances for the Design and Analysis of Randomized Experiments
Randomized experiments are the gold standard for investigating the causal effect of a treatment on a population. In this dissertation, we present algorithmic advances for three different problems arising in the design and analysis of randomized experiments: covariate balancing, variance estimation, and bipartite experiments.
In the first chapter, we describe an inherent trade-off between covariate balancing and robustness, which we formulate as a distributional discrepancy problem. To navigate this trade-off, we present the Gram–Schmidt Walk Design, which is based on the recent discrepancy algorithm of Bansal, Dadush, Garg, and Lovett (2019). By tightening the algorithmic analysis, we derive bounds on the mean squared error of the Horvitz–Thompson estimator under this design in terms of a ridge regression of the outcomes on the covariates, which we interpret as regression by design. We carry out further analysis, including tail bounds on the effect estimator, methods for constructing confidence intervals, and an extension of the design which accommodates non-linear responses via kernel methods.
In the second chapter, we study the problem of estimating the variance of treatment effect estimators under interference. It is well known that unbiased variance estimation is impossible without strong assumptions on the outcomes, due to the fundamental problem of causal inference. Thus, we study a class of conservative estimators which are based on variance bounds. We identify conditions under which the variance bounds themselves are admissible and provide a general algorithmic framework to construct admissible variance bounds, according to the experimenter's preferences and prior substantive knowledge.
In the final chapter, we present methodology for the newly proposed bipartite experimental framework, where units which receive treatment are distinct from units on which outcomes are measured, and the two are connected via a bipartite graph.
We investigate a linear exposure-response assumption which allows for more complex interactions. We propose the Exposure Re-weighted Linear (ERL) estimator, which we show is unbiased in finite samples and consistent and asymptotically normal in large samples provided the bipartite graph is sufficiently sparse. We provide a variance estimator which facilitates confidence intervals based on the normal approximation. Finally, we present Exposure-Design, a correlation-clustering-based design for improving the precision of the ERL estimator.
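As a minimal illustration of the inverse-probability weighting behind the Horvitz–Thompson estimator discussed above (a generic sketch, not the dissertation's implementation; the function name and interface are illustrative):

```python
import numpy as np

def horvitz_thompson_ate(y, z, p):
    """Horvitz-Thompson estimate of the average treatment effect.

    y: observed outcomes; z: binary treatment indicators;
    p: treatment assignment probabilities P(z_i = 1).
    """
    y, z, p = (np.asarray(a, dtype=float) for a in (y, z, p))
    # Weight each observed outcome by the inverse probability of the
    # assignment under which it was observed, then average the contrasts.
    return float(np.mean(z * y / p - (1 - z) * y / (1 - p)))
```

The designs studied in the first chapter control how the assignment vector z is drawn; the estimator itself is unchanged across designs.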
Topics at the interface of optimization and statistics
Optimization has long been an important tool in statistics. For example, the problem of parameter estimation in a statistical model, either by maximizing a likelihood function or using a least squares approach, reduces to solving an optimization problem. Not only has optimization been utilized in solving traditional statistical problems, it also plays a crucial role in more recent areas such as statistical learning. In particular, in most statistical learning models, one learns the best parameters for the model by minimizing some cost function under certain constraints.
In the past decade or so, there has been an increasing trend in the reverse direction: using statistics as a powerful tool in optimization. As learning algorithms become more efficient, researchers have focused on finding ways to apply learning models to improve the performance of existing optimization algorithms. Following their footsteps, in this thesis, we study a recent algorithm for generating cutting planes in mixed integer linear programming problems and show how one can apply learning algorithms to improve it.
In addition, we use the decision theory framework to evaluate whether the solution given by the sample average approximation, a commonly used method to solve stochastic programming problems, is "good". In particular, we show that the sample average solution is admissible for an uncertain linear objective over a fixed compact set, and for a convex quadratic function with an uncertain linear term over box constraints when the dimension is less than 4.
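The sample average approximation discussed above can be sketched for the simplest case mentioned, an uncertain linear objective over a fixed compact set, here taken to be a box (names are illustrative):

```python
import numpy as np

def saa_linear_over_box(cost_samples, lo, hi):
    """Sample average approximation for min over x in [lo, hi]^d of E[c]^T x.

    The unknown expected cost is replaced by the sample mean; for a linear
    objective over a box, the minimizer is attained coordinatewise at a
    bound, chosen by the sign of the averaged cost.
    """
    c_bar = np.mean(np.asarray(cost_samples, dtype=float), axis=0)
    return np.where(c_bar >= 0, lo, hi)
```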
Finally, we combine tools from mixed integer programming and Bayesian statistics to solve the catalog matching problem in astronomy, which seeks to associate an object's detections across independent catalogs. This problem has been studied by many researchers; however, the most recent algorithm to tackle it has only been shown to work with 3 catalogs. In this thesis, we extend this algorithm to allow matching across a higher number of catalogs. In addition, we introduce a new algorithm that is more efficient and scales much better with a large number of catalogs.
On the efficiency and consistency of likelihood estimation in multivariate conditionally heteroskedastic dynamic regression models
We rank the efficiency of several likelihood-based parametric and semiparametric estimators of conditional mean and variance parameters in multivariate dynamic models with i.i.d. spherical innovations, and show that Gaussian pseudo maximum likelihood estimators are inefficient except under normality. We also provide conditions for partial adaptivity of semiparametric procedures, and relate them to the consistency of distributionally misspecified maximum likelihood estimators. We propose Hausman tests that compare Gaussian pseudo maximum likelihood estimators with more efficient but less robust competitors. We also study the efficiency of sequential estimators of the shape parameters. Finally, we provide finite sample results through Monte Carlo simulations.
Keywords: Adaptivity, ARCH, Elliptical Distributions, Financial Returns, Hausman tests, Semiparametric Estimators, Sequential Estimators.
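The Hausman comparison mentioned above has a standard generic form (a sketch only; the actual tests in this work contrast Gaussian pseudo maximum likelihood estimators with semiparametric competitors, with covariances estimated accordingly):

```python
import numpy as np

def hausman_statistic(theta_eff, theta_rob, v_eff, v_rob):
    """Generic Hausman statistic.

    theta_eff / v_eff: estimate efficient under the null and its covariance;
    theta_rob / v_rob: robust estimate and its covariance. Under the null
    that both are consistent, d^T (v_rob - v_eff)^{-1} d with
    d = theta_rob - theta_eff is asymptotically chi-squared with
    dim(theta) degrees of freedom.
    """
    d = np.asarray(theta_rob, dtype=float) - np.asarray(theta_eff, dtype=float)
    v_diff = np.asarray(v_rob, dtype=float) - np.asarray(v_eff, dtype=float)
    return float(d @ np.linalg.solve(v_diff, d))
```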
Regression modelling with I-priors
We introduce the I-prior methodology as a unifying framework for estimating a
variety of regression models, including varying coefficient, multilevel, and
longitudinal models, as well as models with functional covariates and responses. It
can also be used for multi-class classification, with low or high dimensional
covariates.
The I-prior is generally defined as a maximum entropy prior. For a regression
function, the I-prior is Gaussian with covariance kernel proportional to the
Fisher information on the regression function, which is estimated by its
posterior distribution under the I-prior. The I-prior has the intuitively
appealing property that the more information is available on a linear
functional of the regression function, the larger the prior variance, and the
smaller the influence of the prior mean on the posterior distribution.
Advantages compared to competing methods, such as Gaussian process regression
or Tikhonov regularization, are ease of estimation and model comparison. In
particular, we develop an EM algorithm with a simple E and M step for
estimating hyperparameters, facilitating estimation for complex models. We also
propose a novel parsimonious model formulation, requiring a single scale
parameter for each (possibly multidimensional) covariate and no further
parameters for interaction effects. This simplifies estimation because fewer
hyperparameters need to be estimated, and also simplifies model comparison of
models with the same covariates but different interaction effects; in this
case, the model with the highest estimated likelihood can be selected.
Using a number of widely analyzed real data sets, we show that the predictive performance of our methodology is competitive. An R package implementing the methodology is available (Jamil, 2019).
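For intuition, the posterior mean under an I-prior takes the familiar Gaussian-conjugate form once the Fisher-information covariance is written over the sample. A minimal sketch, assuming a prior covariance proportional to the squared Gram matrix and fixed hyperparameters (the methodology described above estimates these via the EM algorithm; names and scalings here are illustrative):

```python
import numpy as np

def iprior_posterior_mean(H, y, lam=1.0, noise_var=1.0):
    """Posterior mean of the regression function at the training points.

    H: (n, n) Gram matrix of the chosen kernel (e.g. linear: X @ X.T).
    Assumed prior covariance: lam**2 * H @ H; i.i.d. Gaussian noise.
    """
    y = np.asarray(y, dtype=float)
    K = lam ** 2 * H @ H  # assumed I-prior covariance over the sample
    # Gaussian-conjugate posterior mean: K (K + noise_var * I)^{-1} y
    return K @ np.linalg.solve(K + noise_var * np.eye(len(y)), y)
```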
Computational Methods for the Analysis of Complex Data
This PhD dissertation bridges the disciplines of Operations Research and Statistics to develop
novel computational methods for the extraction of knowledge from complex data. In this research,
complex data stands for datasets with many instances and/or variables, with different
types of variables, with dependence structures among the variables, collected from different
sources (heterogeneous), possibly with non-identical population class sizes, with different misclassification
costs, or characterized by extreme instances (heavy-tailed data), among others.
Recently, the complexity of the raw data, together with new requests posed by practitioners (interpretable models, cost-sensitive models, or models which are efficient in terms of running times), has posed a challenge from a scientific perspective. The main contributions of this PhD dissertation fall within three different research frameworks: Regression, Classification
and Bayesian inference. Concerning the first, we consider linear regression models, where a
continuous outcome variable is to be predicted by a set of features. On the one hand, seeking
interpretable solutions in heterogeneous datasets, we propose a novel version of the Lasso
in which the performance of the method on groups of interest is controlled. On the other hand,
we use mathematical optimization tools to propose a sparse linear regression model (that is, a
model whose solution only depends on a subset of predictors) specifically designed for datasets
with categorical and hierarchical features. Regarding the task of Classification, in this PhD dissertation
we have explored the Naïve Bayes classifier in depth. The method has been adapted
to obtain a sparse solution and modified to handle cost-sensitive datasets.
For both problems, novel strategies for reducing high running times are presented. Finally, the
last contribution of this dissertation concerns Bayesian inference methods. In particular, in the
setting of heavy-tailed data, we consider a semi-parametric Bayesian approach to estimate the
Elliptical distribution.
The structure of this dissertation is as follows. Chapter 1 contains the theoretical background
needed to develop the following chapters. In particular, two main research areas are
reviewed: sparse and cost-sensitive statistical learning and Bayesian Statistics.
Chapter 2 proposes a Lasso-based method in which quadratic performance constraints, which bound the prediction errors for the individuals of interest, are added to the Lasso objective function. This constrained sparse regression model is defined by a nonlinear optimization problem. Specifically, it has a direct application in heterogeneous samples where data are collected from distinct sources, as is standard in many biomedical contexts.
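For reference, the unconstrained Lasso that Chapter 2 builds on can be solved by cyclic coordinate descent. The sketch below covers the plain Lasso only and omits the chapter's quadratic performance constraints, which turn the problem into a nonlinear optimization problem:

```python
import numpy as np

def lasso_cd(X, y, lam, n_iter=200):
    """Plain Lasso via cyclic coordinate descent.

    Minimizes (1 / (2n)) * ||y - X b||^2 + lam * ||b||_1.
    """
    X, y = np.asarray(X, dtype=float), np.asarray(y, dtype=float)
    n, p = X.shape
    b = np.zeros(p)
    col_sq = (X ** 2).sum(axis=0) / n
    for _ in range(n_iter):
        for j in range(p):
            r_j = y - X @ b + X[:, j] * b[j]  # partial residual excluding j
            rho = X[:, j] @ r_j / n
            # Soft-thresholding update for coordinate j.
            b[j] = np.sign(rho) * max(abs(rho) - lam, 0.0) / col_sq[j]
    return b
```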
Chapter 3 studies linear regression models built on categorical predictor variables that have
a hierarchical structure. The model is flexible in the sense that the user decides the level of
detail in the information used to build it, taking into account data privacy considerations. To
trade off the accuracy of the linear regression model and its complexity, a Mixed Integer Convex
Quadratic Problem with Linear Constraints is solved.
In Chapter 4, a sparse version of the Naïve Bayes classifier, which is characterized by the
following three properties, is proposed. On the one hand, the selection of the subset of variables
is done in terms of the correlation structure of the predictor variables. On the other hand, such
selection can be based on different performance measures. Additionally, performance constraints
on groups of higher interest can be included. This guided search retains flexibility
in classification performance while yielding competitive running times.
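As a baseline for the sparse and constrained variants described above, the plain Gaussian Naïve Bayes rule can be sketched as follows (no variable selection or performance constraints; a generic illustration, not the dissertation's method):

```python
import numpy as np

def gnb_fit(X, y):
    """Per-class means, variances and priors for Gaussian Naive Bayes."""
    X, y = np.asarray(X, dtype=float), np.asarray(y)
    return {c: (X[y == c].mean(axis=0),
                X[y == c].var(axis=0) + 1e-9,  # small variance floor for stability
                np.mean(y == c))
            for c in np.unique(y)}

def gnb_predict(stats, x):
    """Assign x to the class maximizing the log joint density, assuming
    conditionally independent Gaussian features."""
    def log_joint(c):
        mu, var, prior = stats[c]
        return np.log(prior) - 0.5 * np.sum(np.log(2 * np.pi * var) + (x - mu) ** 2 / var)
    return max(stats, key=log_joint)
```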
The approach introduced in Chapter 2 is also explored in Chapter 5 for improving the performance
of the Naïve Bayes classifier in the classes of most interest to the user. Unlike the traditional
version of the classifier, which proceeds in two steps (estimation first, classification
next), the novel approach integrates both stages. The method is formulated as an optimization
problem where the likelihood function is maximized with constraints on the classification rates
for the groups of interest.
When dealing with datasets with special characteristics (for example, heavy tails in fields
such as Economics and Finance), Bayesian statistical techniques have shown their potential in the
literature. In Chapter 6, Elliptical distributions, which are generalizations of the multivariate
normal distribution to both longer tails and elliptical contours, are examined, and Bayesian
methods to perform semi-parametric inference for them are used.
Finally, Chapter 7 closes the thesis with general conclusions and future lines of research.
Least Ambiguous Set-Valued Classifiers with Bounded Error Levels
In most classification tasks there are observations that are ambiguous and
therefore difficult to correctly label. Set-valued classifiers output sets of
plausible labels rather than a single label, thereby giving a more appropriate
and informative treatment to the labeling of ambiguous instances. We introduce
a framework for multiclass set-valued classification, where the classifiers
guarantee user-defined levels of coverage or confidence (the probability that
the true label is contained in the set) while minimizing the ambiguity (the
expected size of the output). We first derive oracle classifiers assuming the
true distribution to be known. We show that the oracle classifiers are obtained
from level sets of the functions that define the conditional probability of
each class. Then we develop estimators with good asymptotic and finite sample
properties. The proposed estimators build on existing single-label classifiers.
The optimal classifier can sometimes output the empty set, but we provide two
solutions to fix this issue that are suitable for various practical needs.
Comment: Final version to be published in the Journal of the American Statistical Association at https://www.tandfonline.com/doi/abs/10.1080/01621459.2017.1395341?journalCode=uasa2
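The level-set construction described in this abstract can be imitated with a simple split-calibration plug-in (a sketch in the same spirit, assuming estimated class probabilities are available; this is not the authors' exact estimator, and the threshold rule here is one simple choice):

```python
import numpy as np

def class_thresholds(cal_probs, cal_labels, alpha):
    """Per-class thresholds: the alpha-quantile of the estimated true-class
    probability among calibration points of each class, targeting coverage
    of roughly 1 - alpha within each class."""
    return {c: np.quantile(cal_probs[cal_labels == c, c], alpha)
            for c in np.unique(cal_labels)}

def prediction_set(probs, thresholds):
    """All labels whose estimated probability clears their class threshold;
    note the output set can be empty, as the abstract points out."""
    return {c for c, t in thresholds.items() if probs[c] >= t}
```

With ambiguous inputs whose probabilities fall below every threshold, `prediction_set` returns the empty set, which is exactly the issue the paper's two proposed fixes address.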
Model-Oriented Data Analysis: Proceedings of an IIASA Workshop, Eisenach, GDR, March 9-13, 1987
The main topics of this workshop were (1) optimal experimental design, (2) regression analysis, and (3) model testing and applications.
Under the topic "Optimal experimental design", new optimality criteria based on asymptotic properties of relevant statistics were discussed. The use of additional restrictions on the designs was also discussed, inadequate and nonlinear models were considered, and Bayesian approaches to the design problem in the nonlinear case were a focal point of the special session. It was emphasized that experimental design is a field of much current interest.
During the sessions devoted to "Regression analysis", it became clear that there has been essential progress in statistics for nonlinear models. Here, besides the asymptotic behavior of several estimators, the non-asymptotic properties of some interesting statistics were discussed. The distribution of the maximum-likelihood (ML) estimator in normal models, and alternative estimators to the least-squares or ML estimators, were discussed intensively.
Several approaches to "resampling" were considered in connection with linear, nonlinear and semiparametric models. Some new results were reported concerning simulated likelihoods which provide a powerful tool for statistics in several types of models. The advantages and problems of bootstrapping, jackknifing and related methods were considered in a number of papers.
Under the topic "Model testing and applications", the papers covered a broad spectrum of problems. Methods for the detection of outliers and the consequences of transformations of data were discussed. Furthermore, robust regression methods, empirical Bayesian approaches and the stability of estimators were considered, together with numerical problems in data analysis and the use of computer packages.