3 research outputs found

    Efficient Model-Free Subsampling Method for Massive Data

    Subsampling plays a crucial role in tackling problems associated with the storage and statistical learning of massive datasets. However, most existing subsampling methods are model-based, meaning their performance can drop significantly when the underlying model is misspecified. This calls for model-free subsampling methods that are robust under diverse model specifications. Several model-free subsampling methods have been developed recently, but their computing time grows explosively with the sample size, making them impractical for massive data. In this article, an efficient model-free subsampling method is proposed, which segments the original data into regular data blocks and obtains a subsample from each block by a data-driven subsampling method. Compared with existing model-free subsampling methods, the proposed method has a significant speed advantage and performs more robustly on datasets with complex underlying distributions. As demonstrated in simulation experiments, the proposed method is an order of magnitude faster than other commonly used model-free subsampling methods when the sample size of the original dataset reaches the order of 10^7. Moreover, simulation experiments and case studies show that the proposed method is more robust than other model-free subsampling methods under diverse model specifications and subsample sizes.
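    The divide-and-subsample idea above can be illustrated with a minimal sketch. This is not the authors' data-driven criterion (the abstract does not specify it); as a stand-in assumption, the data are split into contiguous blocks along one coordinate and a uniform subsample is drawn from each block, so the cost per block stays small:

    ```python
    import numpy as np

    def block_subsample(x, n_blocks=10, m_per_block=20, rng=None):
        """Illustrative block-wise subsampling (the per-block rule here is
        uniform sampling, an assumption standing in for the paper's
        data-driven method): split the rows into regular blocks along the
        first coordinate, then draw a subsample from each block."""
        rng = np.random.default_rng(rng)
        order = np.argsort(x[:, 0])              # sort so blocks cover contiguous regions
        blocks = np.array_split(order, n_blocks)  # regular data blocks
        picks = [rng.choice(b, size=min(m_per_block, len(b)), replace=False)
                 for b in blocks]                 # cheap subsample within each block
        return x[np.concatenate(picks)]

    x = np.random.default_rng(0).normal(size=(100_000, 3))
    sub = block_subsample(x, n_blocks=50, m_per_block=20)
    print(sub.shape)  # (1000, 3)
    ```

    Because each block is processed independently, the work scales with the number of blocks rather than with pairwise comparisons over the full dataset, which is the source of the speed advantage the abstract describes.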

    Augmenting definitive screening designs: Going outside the box

    Definitive screening designs (DSDs) have grown rapidly in popularity since their introduction by Jones and Nachtsheim (2011). Their appeal is that the second-order response surface (RS) model can be estimated in any subset of three factors without having to perform a follow-up experiment. However, their usefulness as a one-step RS modeling strategy depends heavily on the sparsity of second-order effects and the dominance of first-order terms over pure quadratic terms. To address these limitations, we show how viewing a projection of the design region as spherical and augmenting the DSD with axial points in the factors found to involve second-order effects remedies the deficiencies of a stand-alone DSD. We show that augmentation with a second design consisting of axial points is often the Ds-optimal augmentation and also minimizes the average prediction variance. Supplemented by this strategy, DSDs are highly effective initial screening designs that support estimation of the second-order RS model in three or four factors.
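    The axial-point augmentation can be sketched as follows. The construction is the standard one from central composite designs: for each of the k factors flagged as having second-order effects, add the pair of points at +/- alpha on that axis with the other flagged factors at zero. The choice alpha = sqrt(k), which places the axial points on the sphere through the factorial corners, is an assumption here; the paper's optimal axial distance may differ:

    ```python
    import numpy as np

    def axial_augmentation(k_active, alpha=None):
        """Illustrative axial-point augmentation for the k_active factors
        found to involve second-order effects. alpha = sqrt(k_active) is the
        spherical-region convention, used here as an assumption."""
        if alpha is None:
            alpha = np.sqrt(k_active)
        pts = []
        for i in range(k_active):
            for sign in (+1.0, -1.0):
                p = np.zeros(k_active)
                p[i] = sign * alpha  # one factor at +/- alpha, others at center
                pts.append(p)
        return np.vstack(pts)  # 2 * k_active follow-up runs

    print(axial_augmentation(3))
    ```

    For three active factors this yields six follow-up runs, enough to separate the pure quadratic effects that a stand-alone DSD struggles to estimate.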

    Fast Approximation of the Shapley Values Based on Order-of-Addition Experimental Designs

    The Shapley value is originally a concept in econometrics for fairly distributing both gains and costs among the players of a coalition game. In recent decades, its application has been extended to other areas such as marketing, engineering, and machine learning. For example, it produces reasonable solutions for problems in sensitivity analysis, local model explanation in interpretable machine learning, node importance in social networks, attribution models, etc. However, the Shapley value can be very expensive to compute. Specifically, in a d-player coalition game, calculating a Shapley value requires the evaluation of d! or 2^d marginal contribution values, depending on whether we take the permutation or combination formulation of the Shapley value. Hence it becomes infeasible to calculate the Shapley value when d is reasonably large. A common remedy is to take a random sample of the permutations as a surrogate for the complete list. We find that an advanced sampling scheme can be designed to yield much more accurate estimates of the Shapley value than simple random sampling (SRS). Our sampling scheme is based on combinatorial structures from the field of design of experiments (DOE), particularly order-of-addition experimental designs, which study how the order in which components are added affects the output. We show that the obtained estimates are unbiased and can sometimes deterministically recover the original Shapley value. Both theoretical and simulation results show that our DOE-based sampling scheme outperforms SRS in estimation accuracy. Surprisingly, it is also slightly faster than SRS. Lastly, real data analysis is conducted for the C. elegans nervous system and the 9/11 terrorist network.
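    The permutation formulation and the SRS baseline it motivates can be sketched concretely. The exact value averages each player's marginal contribution over all d! orderings; the SRS estimator (the baseline the paper improves on, not the paper's DOE-based scheme) averages over a random sample of orderings instead. The toy additive game below is an illustrative assumption, chosen because its Shapley values are known in closed form:

    ```python
    import itertools
    import math
    import random

    def shapley_exact(players, value):
        """Exact Shapley value via the permutation formula: average each
        player's marginal contribution over all d! orderings (cost d! * d
        value evaluations, infeasible for large d)."""
        d = len(players)
        phi = {p: 0.0 for p in players}
        for perm in itertools.permutations(players):
            coalition, prev = [], value(frozenset())
            for p in perm:
                coalition.append(p)
                cur = value(frozenset(coalition))
                phi[p] += cur - prev  # marginal contribution of p in this ordering
                prev = cur
        return {p: v / math.factorial(d) for p, v in phi.items()}

    def shapley_srs(players, value, n_perms, seed=0):
        """Simple-random-sampling estimator: the same average, taken over a
        uniform random sample of n_perms orderings."""
        rng = random.Random(seed)
        phi = {p: 0.0 for p in players}
        for _ in range(n_perms):
            perm = list(players)
            rng.shuffle(perm)
            coalition, prev = [], value(frozenset())
            for p in perm:
                coalition.append(p)
                cur = value(frozenset(coalition))
                phi[p] += cur - prev
                prev = cur
        return {p: v / n_perms for p, v in phi.items()}

    # Toy additive game (an assumption for illustration): v(S) = sum of
    # weights, so each player's Shapley value equals its own weight.
    w = {"a": 1.0, "b": 2.0, "c": 3.0}
    v = lambda S: sum(w[p] for p in S)
    print(shapley_exact(list(w), v))  # {'a': 1.0, 'b': 2.0, 'c': 3.0}
    ```

    Each sampled permutation contributes one marginal-contribution estimate per player; replacing the uniform sample of permutations with a structured order-of-addition design is where the paper's accuracy gain over SRS comes from.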