9 research outputs found
A Mathematical Programming Approach for Integrated Multiple Linear Regression Subset Selection and Validation
Subset selection for multiple linear regression aims to construct a
regression model that minimizes errors by selecting a small number of
explanatory variables. Once a model is built, various statistical tests and
diagnostics are conducted to validate the model and to determine whether the
regression assumptions are met. Most traditional approaches require human
decisions at this step; for example, the user adds or removes variables until a
satisfactory model is obtained. However, this trial-and-error strategy
cannot guarantee that a subset that minimizes the errors while satisfying all
regression assumptions will be found. In this paper, we propose a fully
automated model building procedure for multiple linear regression subset
selection that integrates model building and validation based on mathematical
programming. The proposed model minimizes mean squared errors while ensuring
that the majority of the important regression assumptions are met. We also
propose an efficient constraint to approximate the constraint for the
coefficient t-test. When no subset satisfies all of the considered regression
assumptions, our model provides an alternative subset that satisfies most of
these assumptions. Computational results show that our model yields better
solutions (i.e., satisfying more regression assumptions) compared to the
state-of-the-art benchmark models while maintaining similar explanatory power.
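The integrated selection-and-validation idea can be illustrated with a deliberately simple sketch. This is not the paper's mathematical program: it is a brute-force search over k-subsets that minimizes MSE while rejecting any subset whose residuals fail a Durbin-Watson autocorrelation check, which stands in for the paper's regression-assumption constraints. Function names and the DW bounds are illustrative choices, not taken from the paper.

```python
import itertools
import numpy as np

def fit_ols(X, y):
    # Ordinary least squares via lstsq (no intercept, for brevity)
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return beta

def best_subset(X, y, k, dw_bounds=(1.5, 2.5)):
    """Brute-force subset selection: minimize MSE over all k-subsets,
    keeping only subsets whose residuals pass a Durbin-Watson check
    (a stand-in for the assumption constraints in the paper)."""
    n, p = X.shape
    best = None
    for cols in itertools.combinations(range(p), k):
        Xs = X[:, cols]
        r = y - Xs @ fit_ols(Xs, y)
        mse = float(r @ r) / n
        # Durbin-Watson statistic; ~2 means no residual autocorrelation
        dw = float(np.sum(np.diff(r) ** 2) / np.sum(r ** 2))
        if not (dw_bounds[0] <= dw <= dw_bounds[1]):
            continue  # subset violates the assumption check
        if best is None or mse < best[1]:
            best = (cols, mse)
    return best  # (column indices, mse) or None if no subset passes
```

The exhaustive loop is exponential in p, which is exactly why the paper replaces it with a mathematical-programming formulation; the sketch only shows how a validation constraint can be folded into the selection criterion itself.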
Specification of Mixed Logit Models Using an Optimization Approach
Mixed logit models are a widely used tool for studying discrete outcome problems. Model development entails answering three important questions that strongly affect the quality of the specification: (i) which variables are considered in the analysis? (ii) what will the coefficients for these variables be? and (iii) what density function will these coefficients follow? The literature provides guidance; however, a strong statistical background and an ad hoc search process are required to obtain the best model specification, along with knowledge of the problem context and data. Given a dataset of discrete outcomes and associated characteristics, the problem addressed in this thesis is to investigate to what extent a relatively simple metaheuristic such as simulated annealing can determine the best model specification for a mixed logit model and answer the above questions. A mathematical programming formulation is proposed, and simulated annealing is implemented to find solutions for the proposed formulation. Three experiments were performed to test the effectiveness of the proposed algorithm, including a comparison with existing model specifications for the same datasets. The results suggest that the proposed algorithm is able to find an adequate model specification in terms of goodness of fit, thereby reducing the involvement of the analyst.
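The search strategy described above can be sketched as simulated annealing over variable-inclusion bitmasks. As an assumption for compactness, the objective below is the AIC of an OLS fit rather than the mixed-logit likelihood the thesis actually optimizes, and the neighborhood move (flip one inclusion bit) and cooling schedule are illustrative choices:

```python
import math
import random
import numpy as np

def aic(X, y, mask):
    # AIC of an OLS fit on the included columns (stand-in objective)
    cols = [j for j, m in enumerate(mask) if m]
    if not cols:
        return float("inf")
    Xs = X[:, cols]
    beta, *_ = np.linalg.lstsq(Xs, y, rcond=None)
    rss = float(np.sum((y - Xs @ beta) ** 2))
    n = len(y)
    return n * math.log(rss / n) + 2 * len(cols)

def anneal_spec(X, y, iters=2000, t0=5.0, seed=0):
    """Simulated annealing over which variables enter the model.
    Worse moves are accepted with probability exp(-delta/T), with a
    linearly decaying temperature T."""
    rng = random.Random(seed)
    p = X.shape[1]
    mask = [rng.random() < 0.5 for _ in range(p)]
    cur = aic(X, y, mask)
    best_mask, best = list(mask), cur
    for i in range(iters):
        t = t0 * (1 - i / iters) + 1e-9
        j = rng.randrange(p)            # neighbor: flip one inclusion bit
        mask[j] = not mask[j]
        cand = aic(X, y, mask)
        if cand < cur or rng.random() < math.exp((cur - cand) / t):
            cur = cand                  # accept (always accept improvements)
        else:
            mask[j] = not mask[j]       # reject: undo the flip
        if cur < best:
            best_mask, best = list(mask), cur
    return best_mask, best
```

Swapping `aic` for a simulated-likelihood evaluation of a mixed logit specification recovers the shape of the thesis's approach, with the bitmask extended to encode coefficient types and mixing densities.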
Massively-Parallel Feature Selection for Big Data
We present the Parallel, Forward-Backward with Pruning (PFBP) algorithm for
feature selection (FS) in Big Data settings (high dimensionality and/or sample
size). To tackle the challenges of Big Data FS, PFBP partitions the data matrix
both in terms of rows (samples, training examples) as well as columns
(features). By employing the concepts of p-values of conditional independence
tests and meta-analysis techniques, PFBP manages to rely only on computations
local to a partition while minimizing communication costs. Then, it employs
powerful and safe (asymptotically sound) heuristics to make early, approximate
decisions, such as Early Dropping of features from consideration in subsequent
iterations, Early Stopping of consideration of features within the same
iteration, or Early Return of the winner in each iteration. PFBP provides
asymptotic guarantees of optimality for data distributions faithfully
representable by a causal network (Bayesian network or maximal ancestral
graph). Our empirical analysis confirms a super-linear speedup of the algorithm
with increasing sample size, linear scalability with respect to the number of
features and processing cores, while dominating other competitive algorithms in
its class.
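The Early Dropping heuristic can be sketched in a serial, single-partition form: forward selection where any feature whose conditional-independence p-value exceeds the threshold is removed from all later iterations, shrinking the candidate set quickly. The partial-correlation test via Fisher's z below is a simple stand-in for the tests and meta-analysis combination PFBP runs across partitions; names and the alpha value are illustrative:

```python
import math
import numpy as np

def pcor_pvalue(x, y, Z):
    """Two-sided p-value for the partial correlation of x and y given
    the columns of Z, via residualization + Fisher's z transform."""
    n = len(x)
    if Z.shape[1] > 0:
        bx, *_ = np.linalg.lstsq(Z, x, rcond=None)
        by, *_ = np.linalg.lstsq(Z, y, rcond=None)
        x, y = x - Z @ bx, y - Z @ by
    r = float(np.corrcoef(x, y)[0, 1])
    r = max(min(r, 0.999999), -0.999999)
    z = 0.5 * math.log((1 + r) / (1 - r)) * math.sqrt(n - Z.shape[1] - 3)
    return math.erfc(abs(z) / math.sqrt(2))

def forward_early_drop(X, y, alpha=0.05):
    """Forward selection with Early Dropping: features judged
    conditionally independent of y given the current selection are
    dropped from all subsequent iterations."""
    n, p = X.shape
    selected, alive = [], set(range(p))
    while alive:
        Z = X[:, selected]
        pvals = {j: pcor_pvalue(X[:, j], y, Z) for j in alive}
        alive = {j for j, pv in pvals.items() if pv <= alpha}  # Early Dropping
        if not alive:
            break
        j = min(alive, key=pvals.get)   # admit the most significant feature
        selected.append(j)
        alive.discard(j)
    return selected
```

In PFBP itself these p-values are computed locally per data block and combined with meta-analysis, and the Early Stopping / Early Return heuristics further prune work within an iteration; none of that machinery is shown here.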
Learning Sparse Classifiers: Continuous and Mixed Integer Optimization Perspectives
We consider a discrete optimization formulation for learning sparse
classifiers, where the outcome depends upon a linear combination of a small
subset of features. Recent work has shown that mixed integer programming (MIP)
can be used to solve (to optimality) ℓ₀-regularized regression problems
at scales much larger than what was conventionally considered possible. Despite
their usefulness, MIP-based global optimization approaches are significantly
slower compared to the relatively mature algorithms for ℓ₁-regularization
and heuristics for nonconvex regularized problems. We aim to bridge this gap in
computation times by developing new MIP-based algorithms for
ℓ₀-regularized classification. We propose two classes of scalable
algorithms: an exact algorithm that can handle p ≈ 50,000 features in a
few minutes, and approximate algorithms that can address instances with
p ≈ 10⁶ in times comparable to the fast ℓ₁-based algorithms. Our
exact algorithm is based on the novel idea of \textsl{integrality generation},
which solves the original problem (with binary variables) via a sequence of
mixed integer programs that involve a small number of binary variables. Our
approximate algorithms are based on coordinate descent and local combinatorial
search. In addition, we present new estimation error bounds for a class of
ℓ₀-regularized estimators. Experiments on real and synthetic data
demonstrate that our approach leads to models with considerably improved
statistical performance (especially variable selection) when compared to
competing methods. Comment: To appear in JMLR.
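The coordinate-descent component of the approximate algorithms can be sketched for the squared-error case. Each cyclic update is a closed-form hard-thresholding step: a coordinate is kept only if the loss reduction it buys exceeds the ℓ₀ penalty λ. This is a least-squares stand-in for the classification losses in the paper, and it omits the local combinatorial search and integrality-generation machinery entirely:

```python
import numpy as np

def l0_coordinate_descent(X, y, lam, iters=50):
    """Cyclic coordinate descent for  min 0.5*||y - Xb||^2 + lam*||b||_0.
    Each coordinate update is a hard-thresholding step."""
    n, p = X.shape
    col_sq = np.sum(X ** 2, axis=0)    # per-column squared norms
    b = np.zeros(p)
    r = y.copy()                       # running residual y - X @ b
    for _ in range(iters):
        for j in range(p):
            r += X[:, j] * b[j]        # remove coordinate j's contribution
            bj = X[:, j] @ r / col_sq[j]
            # keeping b_j = bj lowers the loss by 0.5*||x_j||^2*bj^2;
            # keep it only if that beats the lam penalty for a nonzero
            b[j] = bj if 0.5 * col_sq[j] * bj ** 2 > lam else 0.0
            r -= X[:, j] * b[j]
    return b
```

Maintaining the residual incrementally keeps each coordinate update O(n), which is what makes this class of methods competitive with ℓ₁ solvers in practice.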
Identification of Factors Contributing to Traffic Crashes by Analysis of Text Narratives
The fatalities, injuries, and property damage that result from traffic crashes impose a significant burden on society. Current research and practice in traffic safety rely on analysis of quantitative data from crash reports to understand crash severity contributors and develop countermeasures. Despite advances from this effort, quantitative crash data suffers from drawbacks, such as the limited ability to capture all the information relevant to the crashes and the potential errors introduced during data collection. Crash narratives can help address these limitations, as they contain detailed descriptions of the context and sequence of events of the crash. However, the unstructured nature of text data within narratives has made exploration of crash narratives challenging. In response, this dissertation aims to develop an analysis framework and methods to enable the extraction of insights from crash narratives and thus raise our understanding of traffic crashes to a new level. The methodological development of this dissertation is split into three objectives. The first objective is to devise an approach for extracting severity-contributing insights from crash narratives by investigating interpretable machine learning and text mining techniques. The second objective is to enable an enhanced identification of crash severity contributors in the form of meaningful phrases by integrating recent advancements in Natural Language Processing (NLP). The third objective is to develop an approach for semantic search of information of interest in crash narratives. The obtained results indicate that the developed approaches enable the extraction of valuable insights from crash narratives to 1) uncover factors that quantitative data may not reveal, 2) confirm results from classic statistical analysis on crash data, and 3) fix inconsistencies in quantitative data.
The outcomes of this dissertation add substantial value to traffic safety, as the developed approaches allow analysts to exploit the rich information in crash narratives for a more comprehensive and accurate diagnosis of traffic crashes.
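The first objective's flavor of interpretable text mining can be illustrated with a minimal sketch: rank tokens by their smoothed log-odds of appearing in severe versus non-severe narratives. This is only a stand-in for the interpretable machine learning models the dissertation develops; the function name, tokenization, and smoothing are all illustrative assumptions:

```python
import math
from collections import Counter

def severity_keywords(narratives, labels, k=3):
    """Rank tokens by smoothed log-odds of appearing in severe (label 1)
    vs non-severe (label 0) crash narratives; return the top k."""
    pos, neg = Counter(), Counter()
    for text, severe in zip(narratives, labels):
        toks = set(text.lower().split())   # document frequency, not raw counts
        (pos if severe else neg).update(toks)
    vocab = set(pos) | set(neg)
    def score(t):
        # add-one smoothing avoids log(0) for class-exclusive tokens
        return math.log((pos[t] + 1) / (neg[t] + 1))
    return sorted(vocab, key=score, reverse=True)[:k]
```

Tokens surfaced this way point an analyst toward candidate severity contributors mentioned in free text, which is the kind of signal quantitative crash-report fields may not capture.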