Computational Subset Model Selection Algorithms and Applications

Abstract

This dissertation develops new computationally efficient algorithms for identifying the subset of variables that minimizes any desired information criterion in model selection.

In recent years, the statistical literature has placed increasing emphasis on information-theoretic model selection criteria. A model selection criterion chooses the model that "closely approximates" the true underlying model. Recent years have also seen many exciting developments in model selection techniques. As data mining of massive data sets with many variables becomes more common, the demand for model selection techniques is growing stronger. To this end, we introduce a new Implicit Enumeration (IE) algorithm and a hybrid of IE with the Genetic Algorithm (GA) in this dissertation.

The proposed Implicit Enumeration algorithm is the first algorithm that explicitly uses an information criterion as the objective function. The algorithm works with a variety of information criteria, including some for which the existing branch and bound algorithms developed by Furnival and Wilson (1974) and Gatu and Kontoghiorghes (2003) are not applicable. It also finds the "best" subset model directly, without first finding the "best" subset of each size as the branch and bound techniques do.

The proposed methods are demonstrated in multiple, multivariate, and logistic regression and discriminant analysis problems. The implicit enumeration algorithm converged to the optimal solution on real and simulated data sets with up to 80 predictors, and thus 2^80 ≈ 1,208,925,819,614,630,000,000,000 possible subset models in the model portfolio. To our knowledge, none of the existing exact algorithms is capable of optimally solving problems of this size.
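To make the size of the search space concrete, the following is a minimal sketch of exhaustive (explicit) subset enumeration with an information criterion as the objective. It is not the Implicit Enumeration algorithm of this dissertation, which is designed to avoid visiting every subset. The function names `rss`, `aic`, `best_subset`, and `count_subsets` are illustrative assumptions, and the criterion shown is AIC for a Gaussian linear model, written up to an additive constant.

```python
import itertools
import math


def count_subsets(p):
    """Number of candidate subset models for p predictors."""
    return 2 ** p


def rss(columns, y):
    """Residual sum of squares of a least-squares fit of y on an
    intercept plus the given predictor columns (assumed not collinear)."""
    n = len(y)
    cols = [[1.0] * n] + [list(c) for c in columns]
    k = len(cols)
    # Augmented normal equations [Z'Z | Z'y].
    M = [[sum(a * b for a, b in zip(cols[i], cols[j])) for j in range(k)]
         + [sum(a * b for a, b in zip(cols[i], y))] for i in range(k)]
    # Gaussian elimination with partial pivoting.
    for i in range(k):
        pivot = max(range(i, k), key=lambda r: abs(M[r][i]))
        M[i], M[pivot] = M[pivot], M[i]
        for r in range(i + 1, k):
            f = M[r][i] / M[i][i]
            for j in range(i, k + 1):
                M[r][j] -= f * M[i][j]
    # Back substitution for the coefficients.
    beta = [0.0] * k
    for i in reversed(range(k)):
        tail = sum(M[i][j] * beta[j] for j in range(i + 1, k))
        beta[i] = (M[i][k] - tail) / M[i][i]
    fitted = [sum(beta[j] * cols[j][t] for j in range(k)) for t in range(n)]
    return sum((yi - fi) ** 2 for yi, fi in zip(y, fitted))


def aic(rss_value, n, n_params):
    """Gaussian AIC up to an additive constant; guarded against log(0)."""
    return n * math.log(max(rss_value, 1e-12) / n) + 2 * n_params


def best_subset(columns, y):
    """Score every subset of predictors and return
    (best score, best subset of column indices, number of models scored)."""
    n, p = len(y), len(columns)
    best_score, best_idx, evaluated = math.inf, (), 0
    for size in range(p + 1):
        for subset in itertools.combinations(range(p), size):
            score = aic(rss([columns[j] for j in subset], y), n, size + 1)
            evaluated += 1
            if score < best_score:
                best_score, best_idx = score, subset
    return best_score, best_idx, evaluated
```

This brute-force search scores all 2^p models, which is feasible only for small p; at p = 80 the count returned by `count_subsets(80)` is far beyond what explicit enumeration can visit.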
