H-relative error estimation approach for multiplicative regression model with random effect
Relative error approaches are often preferable to absolute error ones, such as
least squares and least absolute deviation, when scale invariance of the output
variable is required, for example when analyzing stock and survival
data. An h-relative error estimation method via the h-likelihood is developed
to avoid heavy and intractable integration for a multiplicative regression
model with random effect. Statistical properties of the parameters and random
effect in the model are studied. To estimate the parameters, we propose an
h-relative error computation procedure. Numerical studies, including simulations
and real examples, show that the proposed method performs well.
Comment: 16 pages
Evaluating the diagnostic powers of variables and their linear combinations when the gold standard is continuous
The receiver operating characteristic (ROC) curve is a very useful tool for
analyzing the diagnostic/classification power of instruments/classification
schemes as long as a binary-scale gold standard is available. When the gold
standard is continuous and there is no confirmative threshold, the ROC curve
becomes less useful. Hence, several extensions have been proposed for
evaluating the diagnostic potential of variables of interest. However, due to
the computational difficulties of these nonparametric extensions, they
are not easy to use for finding the optimal combination of variables to
improve the individual diagnostic power. Therefore, we propose a new measure,
which extends the AUC index for identifying variables with good potential to be
used in a diagnostic scheme. In addition, we propose a threshold gradient
descent based algorithm for finding the best linear combination of variables
that maximizes this new measure, which is applicable even when the number of
variables is huge. The estimate of the proposed index and its asymptotic
property are studied. The performance of the proposed method is illustrated
using both synthesized and real data sets.
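For orientation, the binary-standard AUC index that the abstract says its new measure extends can be sketched as follows. This is only the classical Mann-Whitney form for a binary gold standard, not the proposed continuous-standard measure; the function name `auc` and its arguments are illustrative assumptions.

```python
import numpy as np

def auc(scores, labels):
    """Empirical AUC: probability that a positive case outscores a negative one.

    Ties count as one half. `labels` must be binary (truthy = positive).
    """
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels, dtype=bool)
    pos = scores[labels]
    neg = scores[~labels]
    # Compare every positive score with every negative score.
    diff = pos[:, None] - neg[None, :]
    return (np.sum(diff > 0) + 0.5 * np.sum(diff == 0)) / diff.size
```

A linear combination of variables can be scored the same way by passing `X @ w` as `scores`, where `w` is a hypothetical weight vector.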
Extended T-process Regression Models
The Gaussian process regression (GPR) model has been widely used to fit data when
the regression function is unknown, and its nice properties have been well
established. In this article, we introduce an extended t-process regression
(eTPR) model, which gives a robust best linear unbiased predictor (BLUP). Owing
to its succinct construction, it inherits many attractive properties from the
GPR model, such as closed forms of the marginal and predictive distributions,
which give an explicit form for robust BLUP procedures, and easy handling of
high-dimensional covariates through an efficient implementation that slightly
modifies existing BLUP procedures. Properties of the robust BLUP are studied.
Simulation studies and real data applications show that the eTPR model gives a
robust fit in the presence of outliers in both input and output spaces and has
a good performance in prediction, compared with the GPR and locally weighted
scatterplot smoothing (LOESS) methods.
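As a point of reference for the GPR baseline mentioned above, the closed-form GPR predictive mean can be sketched as below. This is the standard Gaussian case only, not the eTPR model; the squared-exponential kernel, the function names, and the default hyperparameters are assumptions made for illustration.

```python
import numpy as np

def rbf_kernel(a, b, length_scale=1.0):
    """Squared-exponential kernel between two 1-D input arrays (assumed choice)."""
    d2 = (a[:, None] - b[None, :]) ** 2
    return np.exp(-0.5 * d2 / length_scale**2)

def gpr_predict_mean(x_train, y_train, x_new, length_scale=1.0, noise_var=1e-2):
    """Closed-form GPR predictive mean: K_* (K + noise_var * I)^{-1} y."""
    K = rbf_kernel(x_train, x_train, length_scale)
    K = K + noise_var * np.eye(len(x_train))
    K_star = rbf_kernel(x_new, x_train, length_scale)
    # Solve the linear system instead of forming an explicit inverse.
    return K_star @ np.linalg.solve(K, y_train)
```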
A robust estimation for the extended t-process regression model
Robust estimation and variable selection procedures are developed for the
extended t-process regression model with functional data. Statistical
properties such as consistency of estimators and predictions are obtained.
Numerical studies show that the proposed method performs well.
Comment: 20 pages
Distributed sequential method for analyzing massive data
To analyse a very large data set containing lengthy variables, we adopt a
sequential estimation idea and propose a parallel divide-and-conquer method. We
conduct several conventional sequential estimation procedures separately, and
properly integrate their results while maintaining the desired statistical
properties. Additionally, using a criterion from the statistical experiment
design, we adopt an adaptive sample selection, together with an adaptive
shrinkage estimation method, to simultaneously accelerate the estimation
procedure and identify the effective variables. We confirm the cogency of our
methods through theoretical justifications and numerical results derived from
synthesized data sets. We then apply the proposed method to three real data
sets, including those pertaining to appliance energy use and particulate matter
concentration.
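The divide-and-conquer idea can be illustrated with ordinary least squares, where per-block summaries combine exactly into the full-data estimate. This is a generic sketch under that simplifying assumption; it does not implement the sequential stopping rule, adaptive sample selection, or shrinkage described in the abstract, and the function name is hypothetical.

```python
import numpy as np

def divide_and_conquer_ols(X, y, n_blocks):
    """Combine per-block least-squares summaries into one coefficient estimate."""
    p = X.shape[1]
    xtx = np.zeros((p, p))
    xty = np.zeros(p)
    blocks = zip(np.array_split(X, n_blocks), np.array_split(y, n_blocks))
    for X_b, y_b in blocks:
        # Each block's summaries could be computed on a separate worker.
        xtx += X_b.T @ X_b
        xty += X_b.T @ y_b
    # Solving the pooled normal equations reproduces the full-data OLS fit.
    return np.linalg.solve(xtx, xty)
```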
Modeling Function-Valued Processes with Nonseparable and/or Nonstationary Covariance Structure
We discuss a general Bayesian framework on modeling multidimensional
function-valued processes by using a Gaussian process or a heavy-tailed process
as a prior, enabling us to handle nonseparable and/or nonstationary covariance
structure. The nonstationarity is introduced by a convolution-based approach
through a varying anisotropy matrix, whose parameters vary along the input
space and are estimated via a local empirical Bayesian method. For the varying
matrix, we propose to use a spherical parametrization, leading to unconstrained
and interpretable parameters. The unconstrained nature allows the parameters to
be modeled as a nonparametric function of time, spatial location or other
covariates. The interpretation of the parameters is based on closed-form
expressions, providing valuable insights into nonseparable covariance
structures. Furthermore, to extract important information in data with complex
covariance structure, the Bayesian framework can decompose the function-valued
processes using the eigenvalues and eigensurfaces calculated from the estimated
covariance structure. The results are demonstrated by simulation studies and by
an application to wind intensity data. Supplementary materials for this article
are available online.
Comment: Added subsection 2.2.1: Local Interpretation of the Varying
Anisotropy Matrix; Replaced simulation studies; Replaced application by two
new ones; Corrected typo
Least Product Relative Error Estimation
A least product relative error criterion is proposed for multiplicative
regression models. It is invariant under scale transformation of the outcome
and covariates. In addition, the objective function is smooth and convex,
resulting in a simple and uniquely defined estimator of the regression
parameter. It is shown that the estimator is asymptotically normal and that the
simple plug-in variance estimation is valid. Simulation results confirm
that the proposed method performs well. An application to body fat calculation
is presented to illustrate the new method.
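A minimal sketch of such a criterion, assuming the multiplicative model y = exp(x'beta) * error and the product form sum of |1 - y/fit| * |1 - fit/y|, which simplifies to sum of (y/fit + fit/y - 2). The Newton solver, the log-scale least-squares start, and the names `lpre_loss` and `lpre_fit` are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def lpre_loss(beta, X, y):
    """Product relative error criterion: sum(y/fit + fit/y - 2), fit = exp(X @ beta)."""
    eta = X @ beta
    return np.sum(y * np.exp(-eta) + np.exp(eta) / y - 2.0)

def lpre_fit(X, y, n_iter=50, tol=1e-10):
    """Minimise the smooth convex criterion by Newton's method (hypothetical helper)."""
    # Start from the least-squares fit on the log scale (requires y > 0).
    beta = np.linalg.lstsq(X, np.log(y), rcond=None)[0]
    for _ in range(n_iter):
        eta = X @ beta
        a = y * np.exp(-eta)   # observed / fitted
        b = np.exp(eta) / y    # fitted / observed
        grad = X.T @ (b - a)
        hess = X.T @ (X * (a + b)[:, None])
        step = np.linalg.solve(hess, grad)
        beta = beta - step
        if np.max(np.abs(step)) < tol:
            break
    return beta
```

With an intercept column in `X`, rescaling `y` by a constant c should shift only the intercept by log(c), which is one way to check the scale invariance stated above.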
Sequential estimation for GEE with adaptive variables and subject selection
Modeling correlated or highly stratified multiple-response data has become a
common data analysis task due to modern data monitoring facilities and methods.
Generalized estimating equations (GEE) are one of the most popular statistical
methods for analyzing this kind of data. In this paper, we present a sequential
estimation procedure for obtaining GEE-based estimates. In addition to the
conventional random sampling, the proposed method features adaptive subject
recruiting and variable selection. Moreover, we equip our method with an
adaptive shrinkage property so that it can decide the effective variables
during the estimation procedure and build a confidence set with a pre-specified
precision for the corresponding parameters. In addition to the statistical
properties of the proposed procedure, we assess our method using both simulated
data and real data sets.
Nearly Semiparametric Efficient Estimation of Quantile Regression
As a competitive alternative to least squares regression, quantile regression
is popular in analyzing heterogeneous data. For a quantile regression model
specified for a single quantile level, the major difficulties of
semiparametric efficient estimation are the unavailability of a parametric
efficient score and the conditional density estimation. In this paper, with the
help of the least favorable submodel technique, we first derive the
semiparametric efficient scores for linear quantile regression models that are
assumed for a single quantile level, multiple quantile levels, and all
quantile levels, respectively. Our main discovery is a one-step
(nearly) semiparametric efficient estimation for the regression coefficients of
the quantile regression models assumed for multiple quantile levels, which has
several advantages: it could be regarded as an optimal way to pool information
across multiple/other quantiles for efficiency gain; it is computationally
feasible and easy to implement, as the initial estimator is easily available;
due to the nature of quantile regression models under investigation, the
conditional density estimation is straightforward by plugging in an initial
estimator. The resulting estimator is proved to achieve the corresponding
semiparametric efficiency lower bound under regularity conditions. Numerical
studies, including simulations and an example of birth weight of children,
confirm that the proposed estimator leads to higher efficiency compared with
the Koenker-Bassett quantile regression estimator for all quantiles of
interest.
Comment: 33 pages
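The Koenker-Bassett estimator used as the benchmark above minimizes the check (pinball) loss. A minimal sketch of that loss for the intercept-only case follows; it does not implement the one-step efficient estimator from the abstract, and the helper names are assumptions.

```python
import numpy as np

def check_loss(residual, tau):
    """Check loss rho_tau(u) = u * (tau - 1{u < 0}), applied elementwise."""
    residual = np.asarray(residual, dtype=float)
    return residual * (tau - (residual < 0))

def sample_quantile_by_check_loss(y, tau):
    """Return the sample point with the smallest total check loss (hypothetical helper)."""
    y = np.asarray(y, dtype=float)
    # A minimiser of the total check loss always exists among the sample points.
    losses = [check_loss(y - q, tau).sum() for q in y]
    return y[int(np.argmin(losses))]
```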
Active learning for binary classification with variable selection
Modern computing and communication technologies can make data collection
procedures very efficient. However, our ability to analyze large data sets
and/or to extract information from them is hard-pressed to keep up with our
capacities for data collection. Some of these huge data sets are
not collected for any particular research purpose. For a classification
problem, this means that the essential label information may not be readily
obtainable in the data set at hand, and an extra labeling procedure is
required so that we have enough label information for
constructing a classification model. When a data set is huge,
labeling each subject in it is costly in both money and time. Thus, it is
important to decide which subjects should be labeled first in order to
efficiently reduce the training cost/time. Active learning is a
promising approach in this situation because, with active learning ideas, we
can select the unlabeled subjects sequentially without knowing their label
information. In addition, there will be no confirmed information about the
essential variables for constructing an efficient classification rule. Thus,
how to merge a variable selection scheme with an active learning procedure is
of interest. In this paper, we propose a procedure for building binary
classification models when the complete label information is not available at
the beginning of the training stage. We study a model-based active learning
procedure with sequential variable selection schemes, and discuss the results
of the proposed procedure from both theoretical and numerical aspects.
Comment: 16 pages, 1 figure
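One common way to choose which unlabeled subject to label next is uncertainty sampling: pick the subject whose predicted class probability is closest to 0.5. The sketch below assumes this generic rule, which may differ from the selection criterion in the paper; the function name and arguments are hypothetical.

```python
import numpy as np

def most_uncertain(probabilities, labeled):
    """Index of the unlabeled subject whose predicted probability is nearest 0.5.

    `probabilities` are current model predictions for all subjects;
    `labeled` is a collection of indices that already have labels.
    """
    probabilities = np.asarray(probabilities, dtype=float)
    uncertainty = -np.abs(probabilities - 0.5)
    # Exclude subjects that are already labeled from the search.
    uncertainty[list(labeled)] = -np.inf
    return int(np.argmax(uncertainty))
```

In a sequential procedure, the model would be refitted after each new label, and the predicted probabilities recomputed before the next selection.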