9 research outputs found

    Feature Subset Selection for Logistic Regression via Mixed Integer Optimization


    A Mathematical Programming Approach for Integrated Multiple Linear Regression Subset Selection and Validation

    Subset selection for multiple linear regression aims to construct a regression model that minimizes errors by selecting a small number of explanatory variables. Once a model is built, various statistical tests and diagnostics are conducted to validate the model and to determine whether the regression assumptions are met. Most traditional approaches require human decisions at this step: for example, the user adds or removes variables until a satisfactory model is obtained. However, this trial-and-error strategy cannot guarantee that a subset that minimizes the errors while satisfying all regression assumptions will be found. In this paper, we propose a fully automated model-building procedure for multiple linear regression subset selection that integrates model building and validation based on mathematical programming. The proposed model minimizes mean squared error while ensuring that the majority of the important regression assumptions are met. We also propose an efficient constraint to approximate the coefficient t-test constraint. When no subset satisfies all of the considered regression assumptions, our model provides an alternative subset that satisfies most of them. Computational results show that our model yields better solutions (i.e., satisfying more regression assumptions) than state-of-the-art benchmark models while maintaining similar explanatory power.
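The combinatorial problem this abstract describes can be illustrated with a minimal sketch: exhaustive best-subset selection for linear regression on a tiny invented dataset, using only the Python standard library. This is not the paper's mathematical program (which adds assumption-checking constraints and scales via a solver); it only shows the underlying "pick the subset minimizing MSE" objective.

```python
# Illustrative sketch, not the paper's MIP model: brute-force best-subset
# selection for linear regression. The dataset below is made up.
import itertools

def solve_normal_equations(X, y):
    """Least-squares coefficients via Gaussian elimination on X'X b = X'y."""
    n, p = len(X), len(X[0])
    A = [[sum(X[i][j] * X[i][k] for i in range(n)) for k in range(p)]
         for j in range(p)]
    b = [sum(X[i][j] * y[i] for i in range(n)) for j in range(p)]
    for col in range(p):                      # forward elimination, partial pivoting
        piv = max(range(col, p), key=lambda r: abs(A[r][col]))
        A[col], A[piv] = A[piv], A[col]
        b[col], b[piv] = b[piv], b[col]
        for r in range(col + 1, p):
            f = A[r][col] / A[col][col]
            for c in range(col, p):
                A[r][c] -= f * A[col][c]
            b[r] -= f * b[col]
    coef = [0.0] * p
    for r in range(p - 1, -1, -1):            # back substitution
        coef[r] = (b[r] - sum(A[r][c] * coef[c] for c in range(r + 1, p))) / A[r][r]
    return coef

def mse(X, y, coef):
    return sum((y[i] - sum(X[i][j] * coef[j] for j in range(len(coef)))) ** 2
               for i in range(len(y))) / len(y)

def best_subset(X, y, k):
    """Best feature subset of size <= k by exhaustive enumeration."""
    best = (None, float("inf"))
    for size in range(1, k + 1):
        for subset in itertools.combinations(range(len(X[0])), size):
            Xs = [[row[j] for j in subset] for row in X]
            err = mse(Xs, y, solve_normal_equations(Xs, y))
            if err < best[1]:
                best = (subset, err)
    return best

# toy data: y depends only on features 0 and 2; feature 1 is noise
X = [[1, 5, 2], [2, 3, 1], [3, 8, 4], [4, 1, 3], [5, 7, 5], [6, 2, 2]]
y = [2 * r[0] + 3 * r[2] for r in X]
subset, err = best_subset(X, y, k=2)
```

Enumeration is exponential in the number of candidate features, which is exactly why the paper encodes the search as a mixed integer program instead.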

    Specification of Mixed Logit Models Using an Optimization Approach

    Mixed logit models are a widely used tool for studying discrete outcome problems. Model development entails answering three important questions that strongly affect the quality of the specification: (i) which variables are considered in the analysis? (ii) what will the coefficients for these variables be? and (iii) what density function will these coefficients follow? The literature provides guidance; however, a strong statistical background and an ad hoc search process are required to obtain the best model specification, along with knowledge of the problem context and data. Given a dataset of discrete outcomes and associated characteristics, the problem addressed in this thesis is to investigate to what extent a relatively simple metaheuristic, such as simulated annealing, can determine the best model specification for a mixed logit model and answer the above questions. A mathematical programming formulation is proposed, and simulated annealing is implemented to find solutions for it. Three experiments were performed to test the effectiveness of the proposed algorithm, including a comparison with existing model specifications for the same datasets. The results suggest that the proposed algorithm is able to find an adequate model specification in terms of goodness of fit, thereby reducing the involvement of the analyst.
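The search idea can be sketched as follows: simulated annealing over a binary "include this variable" vector. The `score` function below is a toy stand-in (an invented objective rewarding inclusion of assumed-relevant variables with a size penalty), not the mixed logit log-likelihood the thesis actually optimizes, and it also omits the choice of coefficient density.

```python
# Illustrative sketch of the search strategy only; score() is a toy
# assumption, not a mixed logit estimator.
import math
import random

def score(spec, true_vars, penalty=0.5):
    """Toy objective (lower is better): miss count plus a size penalty."""
    missed = sum(1 for v in true_vars if not spec[v])
    return missed + penalty * sum(spec)

def simulated_annealing(n_vars, objective, t0=2.0, cooling=0.995,
                        steps=4000, seed=0):
    rng = random.Random(seed)
    spec = [rng.random() < 0.5 for _ in range(n_vars)]
    best, best_val = spec[:], objective(spec)
    cur_val, t = best_val, t0
    for _ in range(steps):
        cand = spec[:]
        cand[rng.randrange(n_vars)] ^= True   # neighbor: toggle one variable
        cand_val = objective(cand)
        # accept improvements always, worse moves with Boltzmann probability
        if cand_val <= cur_val or rng.random() < math.exp((cur_val - cand_val) / t):
            spec, cur_val = cand, cand_val
            if cur_val < best_val:
                best, best_val = spec[:], cur_val
        t *= cooling                           # geometric cooling schedule
    return best, best_val

TRUE_VARS = {0, 3}   # assumed ground truth for the toy score
best_spec, best_val = simulated_annealing(8, lambda s: score(s, TRUE_VARS))
```

Because worse moves are accepted with a temperature-dependent probability, the search can escape poor specifications early on and settles into greedy improvement as the temperature decays.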

    Massively-Parallel Feature Selection for Big Data

    We present the Parallel, Forward-Backward with Pruning (PFBP) algorithm for feature selection (FS) in Big Data settings (high dimensionality and/or sample size). To tackle the challenges of Big Data FS, PFBP partitions the data matrix both in terms of rows (samples, training examples) and columns (features). By employing p-values of conditional independence tests and meta-analysis techniques, PFBP relies only on computations local to a partition while minimizing communication costs. It then employs powerful and safe (asymptotically sound) heuristics to make early, approximate decisions, such as Early Dropping of features from consideration in subsequent iterations, Early Stopping of consideration of features within the same iteration, and Early Return of the winner in each iteration. PFBP provides asymptotic guarantees of optimality for data distributions faithfully representable by a causal network (Bayesian network or maximal ancestral graph). Our empirical analysis confirms a super-linear speedup of the algorithm with increasing sample size and linear scalability with respect to the number of features and processing cores, while dominating other competitive algorithms in its class.
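One ingredient of this design can be sketched concretely: combining per-partition p-values with Fisher's method (a classic meta-analysis technique) and then "early dropping" features whose combined evidence indicates independence from the target. The local p-values below are invented for illustration; in PFBP they come from conditional independence tests run on each data partition.

```python
# Illustrative sketch of p-value meta-analysis + early dropping; the
# local p-values are made up, and real PFBP uses conditional tests.
import math

def chi2_sf_even_df(x, df):
    """Chi-square survival function for even df.

    For df = 2k there is a closed form: exp(-x/2) * sum_{i<k} (x/2)^i / i!.
    Fisher's method always yields even df, so this suffices here.
    """
    assert df % 2 == 0
    term, total = 1.0, 1.0
    for i in range(1, df // 2):
        term *= (x / 2) / i
        total += term
    return math.exp(-x / 2) * total

def fisher_combine(pvalues):
    """Combined p-value across partitions: -2*sum(ln p) ~ chi2 with 2k df."""
    stat = -2.0 * sum(math.log(p) for p in pvalues)
    return chi2_sf_even_df(stat, 2 * len(pvalues))

def early_drop(partition_pvalues, alpha=0.05):
    """Keep only features whose combined p-value is below alpha."""
    return {f: fisher_combine(ps)
            for f, ps in partition_pvalues.items()
            if fisher_combine(ps) < alpha}

# made-up local p-values for three features over four partitions
local_p = {
    "x1": [0.001, 0.004, 0.002, 0.010],   # consistently dependent -> kept
    "x2": [0.600, 0.450, 0.700, 0.520],   # consistently independent -> dropped
    "x3": [0.030, 0.900, 0.040, 0.800],   # mixed evidence
}
kept = early_drop(local_p)
```

The key property is that each partition contributes only a scalar p-value, so combining evidence costs almost no communication regardless of partition size.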

    Learning Sparse Classifiers: Continuous and Mixed Integer Optimization Perspectives

    We consider a discrete optimization formulation for learning sparse classifiers, where the outcome depends upon a linear combination of a small subset of features. Recent work has shown that mixed integer programming (MIP) can be used to solve (to optimality) $\ell_0$-regularized regression problems at scales much larger than what was conventionally considered possible. Despite their usefulness, MIP-based global optimization approaches are significantly slower than the relatively mature algorithms for $\ell_1$-regularization and heuristics for nonconvex regularized problems. We aim to bridge this gap in computation times by developing new MIP-based algorithms for $\ell_0$-regularized classification. We propose two classes of scalable algorithms: an exact algorithm that can handle $p \approx 50{,}000$ features in a few minutes, and approximate algorithms that can address instances with $p \approx 10^6$ in times comparable to the fast $\ell_1$-based algorithms. Our exact algorithm is based on the novel idea of integrality generation, which solves the original problem (with $p$ binary variables) via a sequence of mixed integer programs that involve a small number of binary variables. Our approximate algorithms are based on coordinate descent and local combinatorial search. In addition, we present new estimation error bounds for a class of $\ell_0$-regularized estimators. Experiments on real and synthetic data demonstrate that our approach leads to models with considerably improved statistical performance (especially variable selection) compared to competing methods. Comment: To appear in JML
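To make the $\ell_0$-constrained classification problem concrete, here is a minimal iterative-hard-thresholding (IHT) heuristic on invented toy data. This is not the paper's integrality-generation or local-search algorithm; it is a simpler heuristic for the same constraint, alternating a gradient step on the logistic loss with projection onto the set of vectors with at most k nonzeros.

```python
# Illustrative IHT sketch for \ell_0-constrained logistic regression;
# not the paper's algorithm, and the toy data below are made up.
import math

def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    return math.exp(z) / (1.0 + math.exp(z))

def iht_logistic(X, y, k, lr=0.1, iters=500):
    """Gradient step on average logistic loss, then keep the k largest |beta_j|."""
    n, p = len(X), len(X[0])
    beta = [0.0] * p
    for _ in range(iters):
        grad = [0.0] * p
        for i in range(n):
            err = sigmoid(sum(X[i][j] * beta[j] for j in range(p))) - y[i]
            for j in range(p):
                grad[j] += err * X[i][j] / n
        beta = [beta[j] - lr * grad[j] for j in range(p)]
        # hard-thresholding: project onto {beta : at most k nonzeros}
        keep = set(sorted(range(p), key=lambda j: -abs(beta[j]))[:k])
        beta = [beta[j] if j in keep else 0.0 for j in range(p)]
    return beta

# toy data: labels in {0,1} depend on features 0 and 1 only
X = [[1, 2, 0.1], [2, 1, -0.2], [-1, -2, 0.3],
     [-2, -1, 0.0], [3, 1, -0.1], [-3, -1, 0.2]]
y = [1, 1, 0, 0, 1, 0]
beta = iht_logistic(X, y, k=2)
support = [j for j, b in enumerate(beta) if b != 0.0]
```

IHT gives no global optimality certificate, which is precisely the gap the MIP-based exact methods in the abstract close.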

    Identification of Factors Contributing to Traffic Crashes by Analysis of Text Narratives

    The fatalities, injuries, and property damage that result from traffic crashes impose a significant burden on society. Current research and practice in traffic safety rely on analysis of quantitative data from crash reports to understand crash severity contributors and develop countermeasures. Despite advances from this effort, quantitative crash data suffer from drawbacks, such as a limited ability to capture all the information relevant to a crash and the potential errors introduced during data collection. Crash narratives can help address these limitations, as they contain detailed descriptions of the context and sequence of events of a crash. However, the unstructured nature of the text within narratives has hindered their exploration. In response, this dissertation aims to develop an analysis framework and methods that enable the extraction of insights from crash narratives and thus raise our understanding of traffic crashes to a new level. The methodological development is split into three objectives. The first is to devise an approach for extracting severity-contributing insights from crash narratives by investigating interpretable machine learning and text mining techniques. The second is to enable enhanced identification of crash severity contributors in the form of meaningful phrases by integrating recent advancements in Natural Language Processing (NLP). The third is to develop an approach for semantic search of information of interest in crash narratives. The results indicate that the developed approaches enable the extraction of valuable insights from crash narratives to 1) uncover factors that quantitative data may not reveal, 2) confirm results from classic statistical analysis of crash data, and 3) fix inconsistencies in quantitative data. The outcomes of this dissertation add substantial value to traffic safety, as the developed approaches allow analysts to exploit the rich information in crash narratives for a more comprehensive and accurate diagnosis of traffic crashes.
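A much simpler version of the text-mining direction can be sketched with TF-IDF scoring, which surfaces terms that distinguish one narrative from the rest of a corpus. This is only a baseline illustration, not the dissertation's interpretable-ML or NLP pipeline, and the narratives below are invented examples.

```python
# Illustrative TF-IDF sketch on made-up crash narratives; the actual
# dissertation uses interpretable ML and modern NLP, not plain TF-IDF.
import math
from collections import Counter

def tfidf(narratives):
    """Per-document {term: tf-idf} with raw-count tf and log(N/df) idf."""
    docs = [doc.lower().split() for doc in narratives]
    n = len(docs)
    df = Counter(term for doc in docs for term in set(doc))
    scores = []
    for doc in docs:
        tf = Counter(doc)
        scores.append({t: (c / len(doc)) * math.log(n / df[t])
                       for t, c in tf.items()})
    return scores

narratives = [
    "driver ran red light and struck pedestrian in crosswalk",
    "vehicle hydroplaned on wet pavement and struck guardrail",
    "driver fell asleep and vehicle drifted off roadway",
]
scores = tfidf(narratives)
top = max(scores[1], key=scores[1].get)   # most distinctive term of narrative 1
```

Terms shared by every narrative (such as "and") score zero, while terms unique to one narrative (such as "hydroplaned") score highest, which is the basic mechanism behind surfacing crash-specific contributing factors from free text.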