Multivariate-Sign-Based High-Dimensional Tests for the Two-Sample Location Problem
This article concerns tests for the two-sample location problem when the data dimension is larger than the sample size. Existing multivariate-sign-based procedures are not robust against high dimensionality, producing tests with Type I error rates far from nominal levels. This is mainly due to the bias incurred in estimating location parameters. We propose a novel test that overcomes this issue by using the "leave-one-out" idea. The proposed test statistic is scalar-invariant and thus is particularly useful when different components of high-dimensional data have different scales. Asymptotic properties of the test statistic are studied. Simulation studies show that, compared with existing approaches, the proposed method behaves well in terms of size and power. Supplementary materials for this article are available online.
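The leave-one-out centering at the heart of this test can be illustrated with a toy sketch (not the authors' exact statistic, whose centering and standardization are more refined): each observation is mapped to its spatial sign after being centered by a location estimate that excludes the observation itself, which removes the plug-in bias the abstract describes. All names below are illustrative.

```python
import numpy as np

def spatial_sign(x):
    # Multivariate (spatial) sign: the direction of x; zero maps to zero.
    norm = np.linalg.norm(x)
    return x / norm if norm > 0 else np.zeros_like(x)

def loo_sign_statistic(X, Y):
    # Toy two-sample sign statistic: each observation is centered by a
    # leave-one-out mean, so the location estimate never involves the
    # point being transformed (the source of the bias described above).
    n, m = len(X), len(Y)
    total = 0.0
    for i in range(n):
        mu_x = (X.sum(axis=0) - X[i]) / (n - 1)
        for j in range(m):
            mu_y = (Y.sum(axis=0) - Y[j]) / (m - 1)
            total += spatial_sign(X[i] - mu_x) @ spatial_sign(Y[j] - mu_y)
    return total / (n * m)
```

Under equal locations the statistic fluctuates around zero; under a location shift the sign directions align and it drifts away from zero.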
Data-Driven Determination of the Number of Jumps in Regression Curves
In nonparametric regression with jump discontinuities, one major challenge is to determine the number of jumps in a regression curve. Most existing methods for this problem are based on either a sequence of hypothesis tests or model selection, and they introduce extra tuning parameters that may not be easy to choose in practice. This article develops a new data-driven methodology for determining the number of jumps, using an order-preserved sample-splitting strategy together with a cross-validation-based criterion. The statistical consistency of the number of jumps determined by the proposed method is established. More interestingly, the method moves beyond point estimation and can quantify the uncertainty of the resulting estimate. The key idea is the construction of a series of statistics with a marginal symmetry property, which can be used to choose a data-driven threshold that controls the false discovery rate of the method. The proposed method is computationally efficient. Numerical experiments indicate that it performs reliably in finite samples. An R package, jra, implements the proposed method.
A Distribution-Free Multivariate Control Chart
Monitoring multivariate quality variables or data streams remains an important and challenging problem in statistical process control (SPC). Although multivariate SPC has been extensively studied in the literature, designing distribution-free control schemes is still challenging and has yet to be addressed well. This paper develops a new nonparametric methodology for monitoring location parameters when only a small reference dataset is available. The key idea is to construct a series of conditionally distribution-free test statistics, in the sense that their distributions are free of the underlying distribution given the empirical distribution functions. The conditional probability that the charting statistic exceeds the control limit at the present time, given that there has been no alarm before the current time point, can be guaranteed to attain a specified false alarm rate. The success of the proposed method lies in the use of data-dependent control limits, which are determined from the observations online rather than fixed before monitoring. Theoretical and numerical studies show that the proposed control chart delivers satisfactory in-control run-length performance for any distribution and any dimension. It is also very efficient in detecting multivariate process shifts when the process distribution is heavy-tailed or skewed. Supplementary materials for this article are available online.
A new multivariate EWMA scheme for monitoring covariance matrices
To monitor covariance matrices, most existing control charts are based on some omnibus test and thus are usually not powerful when one is interested in detecting shifts that occur in a small number of elements of the covariance matrix. A new multivariate exponentially weighted moving average (EWMA) control chart is developed for monitoring covariance matrices by integrating a classical norm-based omnibus test with a maximum-norm-based test. Numerical studies show that the new control chart affords more balanced performance across shift directions than existing ones and is thus an effective tool for multivariate SPC applications. The implementation of the proposed control chart is demonstrated with an example from the health care industry.
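The idea of pairing an omnibus norm with a maximum norm can be sketched as follows; the EWMA smoothing of outer products, the choice of norms, and all constants are assumptions for illustration, not the chart's actual design:

```python
import numpy as np

def ewma_cov_stats(X, sigma0, lam=0.1):
    # Track an EWMA of outer products x_t x_t' and, at each step, report
    # two deviation statistics from the in-control covariance sigma0:
    #   - a Frobenius-norm (omnibus) statistic, sensitive to diffuse shifts;
    #   - a maximum-norm statistic, sensitive to shifts in a few entries.
    S = sigma0.astype(float).copy()
    frob, maxn = [], []
    for x in X:
        S = lam * np.outer(x, x) + (1 - lam) * S
        D = S - sigma0
        frob.append(np.linalg.norm(D, "fro"))
        maxn.append(np.abs(D).max())
    return np.array(frob), np.array(maxn)
```

A chart in this spirit would signal when either statistic (suitably standardized) exceeds its control limit; the standardization and limits are beyond this sketch.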
A Change Point Approach for Phase-I Analysis in Multivariate Profile Monitoring and Diagnosis
Process monitoring and fault diagnosis using profile data remain an important and challenging problem in statistical process control (SPC). Although the analysis of profile data has been extensively studied in the SPC literature, the challenges associated with monitoring and diagnosing multichannel (multiple) nonlinear profiles have yet to be addressed. Motivated by an application in multi-operation forging processes, we propose a new modeling, monitoring, and diagnosis framework for Phase-I analysis of multichannel profiles. The framework is developed under the assumption that different profile channels have a similar structure, so that strength can be gained by borrowing information across channels. Multi-dimensional functional principal component analysis is incorporated into change-point models to construct monitoring statistics. Simulation results show that the proposed approach performs well in identifying change-points in various situations compared with some existing methods. Code implementing the proposed procedure is available in the supplementary material.
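A much-simplified version of the change-point ingredient: once profiles are reduced to principal component scores, a classical scan over candidate split points locates a mean change. This one-dimensional sketch stands in for the paper's multichannel change-point model and is not its actual statistic:

```python
import numpy as np

def changepoint_scan(scores):
    # CUSUM-type scan for a single mean change in a 1-D score sequence:
    # maximize the sample-size-weighted squared gap between segment means.
    n = len(scores)
    best_tau, best_stat = None, -np.inf
    for tau in range(1, n):
        m1 = scores[:tau].mean()
        m2 = scores[tau:].mean()
        stat = tau * (n - tau) / n * (m1 - m2) ** 2
        if stat > best_stat:
            best_tau, best_stat = tau, stat
    return best_tau, best_stat
```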
Threshold Selection in Feature Screening for Error Rate Control
The hard thresholding rule is commonly adopted in feature screening procedures to screen out unimportant predictors in ultrahigh-dimensional data. However, different thresholds are required in different screening contexts, and an appropriate thresholding magnitude usually varies with the model and error distribution. With an ad hoc choice, it is unclear whether all of the important predictors are selected, and the procedure is likely to include many unimportant features. We introduce a data-adaptive threshold selection procedure with error rate control, which is applicable to most popular screening methods. The key idea is to apply a sample-splitting strategy to construct a series of statistics with a marginal symmetry property and then to exploit the symmetry to approximate the number of false discoveries. We show that the proposed method asymptotically controls the false discovery rate and the per-family error rate under certain conditions while retaining all of the important predictors. Three important examples illustrate the merits of the proposed procedures. Numerical experiments indicate that the methodology works well with many existing screening methods. Supplementary materials for this article are available online.
Model-Free Statistical Inference on High-Dimensional Data
This paper aims to develop an effective model-free inference procedure for high-dimensional data. We first reformulate the hypothesis testing problem within a sufficient dimension reduction framework. With the aid of this reformulation, we propose a new test statistic and show that its asymptotic distribution is a χ² distribution whose degrees of freedom do not depend on the unknown population distribution. We further conduct a power analysis under local alternative hypotheses. In addition, we study how to control the false discovery rate of the proposed χ² tests, which are correlated, so as to identify important predictors under a model-free framework. To this end, we propose a multiple testing procedure and establish its theoretical guarantees. Monte Carlo simulation studies assess the performance of the proposed tests, and an empirical analysis of a real-world dataset illustrates the methodology.
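As a baseline for the multiple-testing step (plainly, this is the standard Benjamini-Hochberg step-up rule, not the paper's tailored procedure for correlated χ² tests):

```python
import numpy as np

def benjamini_hochberg(pvals, q=0.05):
    # Step-up rule: find the largest k with p_(k) <= q*k/m and reject
    # the k hypotheses with the smallest p-values.
    p = np.asarray(pvals, dtype=float)
    m = p.size
    order = np.argsort(p)
    thresh = q * np.arange(1, m + 1) / m
    below = p[order] <= thresh
    k = np.max(np.nonzero(below)[0]) + 1 if below.any() else 0
    reject = np.zeros(m, dtype=bool)
    reject[order[:k]] = True
    return reject
```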
Data-driven selection of the number of change-points via error rate control
In multiple change-point analysis, one of the main difficulties is determining the number of change-points. Various consistent selection methods, including the Schwarz information criterion and cross-validation, have been proposed to balance model fit and complexity. However, there is a lack of systematic approaches that provide a theoretical guarantee of significance when determining the number of changes. In this paper, we introduce a data-adaptive selection procedure with error rate control based on order-preserving sample-splitting, which is applicable to most existing change-point methods. The key idea is to construct a series of statistics with a global symmetry property and then utilize the symmetry to derive a data-driven threshold. Under this general framework, we rigorously investigate control of the false discovery proportion and show that the proposed method controls the false discovery rate (FDR) asymptotically under mild conditions while retaining the true change-points. Numerical experiments indicate that the selection procedure works well with many change-detection methods and yields accurate FDR control in finite samples.
Keywords: Empirical distribution; False discovery rate; Multiple change-point model; Sample-splitting; Symmetry; Uniform convergence.
Optimal Subsampling via Predictive Inference
In the big data era, subsampling or sub-data selection techniques are often adopted to extract a fraction of informative individuals from massive data. Existing subsampling algorithms focus mainly on obtaining a representative subset that achieves the best estimation accuracy under a given class of models. In this paper, we consider a semi-supervised setting in which a small- or moderate-sized "labeled" dataset is available in addition to a much larger "unlabeled" one. The goal is to sample from the unlabeled data, within a given budget, individuals that are informative as characterized by their unobserved responses. We propose an optimal subsampling procedure that maximizes the diversity of the selected subsample while controlling the false selection rate (FSR), allowing us to extract as much reliable information as possible. The key ingredients of our method are the use of predictive inference to quantify the uncertainty of response predictions and a reformulation of the objective as a constrained optimization problem. We show that the proposed method is asymptotically optimal in the sense that the diversity of the subsample converges to its oracle counterpart under FSR control. Numerical simulations and a real-data example validate the superior performance of the proposed strategy.
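The predictive-inference ingredient can be sketched with a generic split-conformal interval; the paper's subsampling objective and FSR control are not reproduced here, and the interface below is hypothetical:

```python
import numpy as np

def split_conformal_interval(model_predict, X_cal, y_cal, X_new, alpha=0.1):
    # Calibrate on held-out residuals: with probability about 1 - alpha,
    # a new response falls within qhat of its point prediction.
    resid = np.abs(y_cal - model_predict(X_cal))
    n = len(resid)
    # Finite-sample-adjusted quantile rank of the calibration residuals.
    k = int(np.ceil((n + 1) * (1 - alpha)))
    qhat = np.sort(resid)[min(k, n) - 1]
    mu = model_predict(X_new)
    return mu - qhat, mu + qhat
```

Wide intervals flag unlabeled points whose responses are uncertain, which is the kind of uncertainty signal a predictive-inference-based sampler can exploit.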
Self-starting monitoring scheme for Poisson count data with varying population sizes
In this paper we consider the problem of monitoring Poisson rates when the population sizes are time-varying and the nominal value of the process parameter is unavailable. Almost all previous control schemes for detecting increases in a Poisson rate in Phase II are constructed under assumed knowledge of the process parameters, e.g., the expected count of a rare event when the process of interest is in control. In practice, however, this parameter is usually unknown and cannot be estimated from a sufficiently large number of reference samples. We propose a self-starting EWMA control scheme based on a parametric bootstrap method. The success of the proposed method lies in the use of probability control limits, which are determined from the observations during, rather than before, monitoring. Simulation studies show that the proposed scheme has good in-control and out-of-control performance in various situations. In particular, it is useful in rare event studies during the start-up stage of a monitoring process.
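The bootstrap-based probability control limit can be caricatured as follows: estimate the in-control rate from the counts seen so far (self-starting, so no Phase-I reference sample is needed), simulate the next count for the coming population size from the fitted Poisson model, and set the limit at a high quantile of the simulated counts. The actual scheme applies this idea to an EWMA statistic with conditional false-alarm guarantees; all names and constants here are illustrative assumptions:

```python
import numpy as np

def bootstrap_poisson_limit(counts, sizes, new_size, alpha=0.005,
                            B=2000, rng=None):
    # Self-starting: the rate is estimated only from counts observed so far.
    rng = np.random.default_rng(rng)
    rate_hat = counts.sum() / sizes.sum()         # pooled rate estimate
    # Parametric bootstrap of the next count for population size new_size.
    sims = rng.poisson(rate_hat * new_size, size=B)
    # Probability control limit at conditional false-alarm rate alpha.
    return np.quantile(sims, 1 - alpha)
```

An observed count above the returned limit would trigger an alarm; the limit is then recomputed as each new in-control observation updates the rate estimate.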