Optimal Shrinkage Estimation of Fixed Effects in Linear Panel Data Models
Shrinkage methods are frequently used to estimate fixed effects to reduce the
noisiness of the least squares estimators. However, widely used shrinkage
estimators guarantee such noise reduction only under strong distributional
assumptions. I develop an estimator for the fixed effects that obtains the best
possible mean squared error within a class of shrinkage estimators. This class
includes conventional shrinkage estimators and the optimality does not require
distributional assumptions. The estimator has an intuitive form and is easy to
implement. Moreover, the fixed effects are allowed to vary with time and to be
serially correlated, and the shrinkage optimally incorporates the underlying
correlation structure in this case. In such a context, I also provide a method
to forecast fixed effects one period ahead.
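The paper's optimal estimator is derived under weaker assumptions than this, but the generic idea it builds on can be sketched as shrinking noisy per-unit least-squares estimates toward their grand mean with a data-driven weight. The function name and the simple moment-based weight below are illustrative, not the paper's estimator:

```python
import random

def shrink_fixed_effects(alpha_hat, sigma2, T):
    """Shrink noisy per-unit fixed-effect estimates toward their grand mean.

    alpha_hat : list of least-squares fixed-effect estimates (one per unit)
    sigma2    : error variance of the noise in each underlying observation
    T         : number of time periods averaged into each estimate
    """
    n = len(alpha_hat)
    grand = sum(alpha_hat) / n
    # Dispersion of the estimates = signal variance + noise variance
    s2 = sum((a - grand) ** 2 for a in alpha_hat) / (n - 1)
    noise = sigma2 / T
    # Shrinkage weight: estimated fraction of dispersion due to signal
    w = max(0.0, 1.0 - noise / s2) if s2 > 0 else 0.0
    return [grand + w * (a - grand) for a in alpha_hat]
```

With equal signal and noise variance the weight is about one half, and the shrunk estimates have markedly lower mean squared error than the raw ones.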
Optimal Shrinkage Estimation of Mean Parameters in Family of Distributions With Quadratic Variance
This paper discusses the simultaneous inference of mean parameters in a family of distributions with quadratic variance function. We first introduce a class of semiparametric/parametric shrinkage estimators and establish their asymptotic optimality properties. Two specific cases, the location-scale family and the natural exponential family with quadratic variance function, are then studied in detail. We conduct a comprehensive simulation study to compare the performance of the proposed methods with existing shrinkage estimators. We also apply the method to real data and obtain encouraging results.
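The natural exponential family case can be illustrated with Poisson data, whose quadratic variance function is V(theta) = theta. The toy moment-based weight below exploits that relationship (the sampling variance is estimated by the grand mean itself); it is an illustration of the idea, not the paper's asymptotically optimal weight:

```python
def poisson_shrinkage(x):
    """Shrink Poisson counts toward their grand mean.

    For the Poisson member of the NEF-QVF family, V(theta) = theta, so the
    average sampling variance of the counts is estimated by xbar itself.
    """
    n = len(x)
    xbar = sum(x) / n
    s2 = sum((v - xbar) ** 2 for v in x) / (n - 1)
    # Estimated share of the observed dispersion that is pure noise
    b = min(1.0, xbar / s2) if s2 > 0 else 1.0
    return [(1.0 - b) * v + b * xbar for v in x]
```

Each shrunk value lies between the raw count and the grand mean, and the overall mean is preserved.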
A clustering algorithm for multivariate data streams with correlated components
Common clustering algorithms require multiple scans of all the data to
achieve convergence, and this is prohibitive when large databases, with data
arriving in streams, must be processed. Algorithms extending the popular
K-means method to the analysis of streaming data have appeared in the
literature since 1998 (Bradley et al. in Scaling clustering algorithms to
large databases. In: KDD. p. 9-15, 1998; O'Callaghan et al. in
Streaming-data algorithms for high-quality clustering. In: Proceedings of
IEEE international conference on data engineering. p. 685, 2001). These are
based on the memorization and recursive update of a small number of summary
statistics, but they either do not account for the cluster-specific
variability or assume that the random vectors being processed and grouped
have uncorrelated components. Unfortunately, this is not the case in many
practical situations. We here
propose a new algorithm to process data streams, with data having correlated
components and coming from clusters with different covariance matrices. Such
covariance matrices are estimated via an optimal double shrinkage method, which
provides positive-definite estimates even in the presence of only a few data points, or
of data having components with small variance. This is needed to invert the
matrices and compute the Mahalanobis distances that we use for the data
assignment to the clusters. We also estimate the total number of clusters from
the data.
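The role of the shrinkage covariance estimate can be illustrated in two dimensions: shrinking a possibly singular sample covariance toward a scaled identity target restores positive definiteness, which in turn makes the Mahalanobis distance used for cluster assignment computable. This single-target shrinkage is a simplification of the paper's optimal double-shrinkage estimator:

```python
def shrink_cov(cov, lam):
    """Shrink a 2x2 sample covariance toward a scaled identity target.

    For 0 < lam <= 1 the result is positive definite even when the input
    is singular (e.g. estimated from very few points)."""
    mu = (cov[0][0] + cov[1][1]) / 2.0  # average variance
    return [[(1 - lam) * cov[0][0] + lam * mu, (1 - lam) * cov[0][1]],
            [(1 - lam) * cov[1][0], (1 - lam) * cov[1][1] + lam * mu]]

def mahalanobis2(x, mean, cov):
    """Squared Mahalanobis distance of a 2-d point, via the explicit 2x2 inverse."""
    a, b = cov[0]
    c, d = cov[1]
    det = a * d - b * c
    inv = [[d / det, -b / det], [-c / det, a / det]]
    dx = [x[0] - mean[0], x[1] - mean[1]]
    return (dx[0] * (inv[0][0] * dx[0] + inv[0][1] * dx[1])
            + dx[1] * (inv[1][0] * dx[0] + inv[1][1] * dx[1]))
```

A perfectly correlated (singular) sample covariance becomes invertible after shrinkage, so each incoming point can be assigned to the cluster minimizing this distance.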
Nonparametric estimation of genewise variance for microarray data
Estimation of genewise variance arises from two important applications in
microarray data analysis: selecting significantly differentially expressed
genes and validation tests for normalization of microarray data. We approach
the problem by introducing a two-way nonparametric model, which is an extension
of the famous Neyman--Scott model and is applicable beyond microarray data. The
problem itself poses interesting challenges because the number of nuisance
parameters is proportional to the sample size and it is not obvious how the
variance function can be estimated when measurements are correlated. In such a
high-dimensional nonparametric problem, we propose two novel nonparametric
estimators for genewise variance function and semiparametric estimators for
measurement correlation, via solving a system of nonlinear equations. Their
asymptotic normality is established. The finite sample property is demonstrated
by simulation studies. The estimators also improve the power of the tests for
detecting statistically differentially expressed genes. The methodology is
illustrated with data from the microarray quality control (MAQC) project.
Published in the Annals of Statistics (http://dx.doi.org/10.1214/10-AOS802)
by the Institute of Mathematical Statistics (http://www.imstat.org/aos/).
Mixture Modeling and Outlier Detection in Microarray Data Analysis
Microarray technology has become a dynamic tool in gene expression analysis
because it allows for the simultaneous measurement of thousands of gene expressions.
The uniqueness of experimental units and microarray data platforms, coupled with how
gene expressions are obtained, makes the field open to interesting research questions.
In this dissertation, we present our investigations of two independent studies related
to microarray data analysis.
First, we study a recent platform in biology and bioinformatics that compares
the quality of genetic information from exfoliated colonocytes in fecal matter with
genetic material from mucosa cells within the colon. Using the intraclass correlation
coefficient (ICC) as a measure of reproducibility, we assess the reliability of density
estimation obtained from preliminary analysis of fecal and mucosa data sets. Numerical findings clearly show that the distribution is composed of two components.
For measurements between 0 and 1, it is natural to assume that the data points are
from a beta-mixture distribution. We explore whether ICC values should be modeled
with a beta mixture or transformed first and fit with a normal mixture. We find that
the use of a mixture of normals on the inverse-probit transformed scale is less sensitive to model mis-specification; otherwise, a biased conclusion could be reached. By
using the normal mixture approach to compare the ICC distributions of fecal and
mucosa samples, we observe the quality of reproducible genes in fecal array data to
be comparable with that in mucosa arrays.
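The transform-then-fit approach described above can be sketched as mapping ICC values in (0, 1) to the real line with the standard normal quantile function and then fitting a two-component normal mixture by EM. The quartile-based initialisation and plain EM loop here are illustrative choices, not the dissertation's fitting procedure:

```python
import math
from statistics import NormalDist

def probit(u):
    """Standard normal quantile; maps ICC values in (0, 1) to the real line."""
    return NormalDist().inv_cdf(u)

def em_two_normals(x, iters=100):
    """Fit a two-component 1-d normal mixture by EM (a minimal sketch)."""
    x = sorted(x)
    n = len(x)
    m1, m2 = x[n // 4], x[3 * n // 4]        # crude initialisation by quartiles
    s1 = s2 = (x[-1] - x[0]) / 4 or 1.0
    p = 0.5
    for _ in range(iters):
        r = []  # responsibilities of component 1
        for v in x:
            d1 = p * math.exp(-((v - m1) ** 2) / (2 * s1 ** 2)) / s1
            d2 = (1 - p) * math.exp(-((v - m2) ** 2) / (2 * s2 ** 2)) / s2
            r.append((d1 + 1e-300) / (d1 + d2 + 2e-300))
        w = sum(r)
        p = w / n
        m1 = sum(ri * v for ri, v in zip(r, x)) / w
        m2 = sum((1 - ri) * v for ri, v in zip(r, x)) / (n - w)
        s1 = max(math.sqrt(sum(ri * (v - m1) ** 2 for ri, v in zip(r, x)) / w), 1e-6)
        s2 = max(math.sqrt(sum((1 - ri) * (v - m2) ** 2
                               for ri, v in zip(r, x)) / (n - w)), 1e-6)
    return p, (m1, s1), (m2, s2)
```

On transformed ICC values drawn from a low-reproducibility group and a high-reproducibility group, the two fitted means land on opposite sides of zero.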
For microarray data, within-gene variance estimation is often challenging due
to the high frequency of low replication studies. Several methodologies have been
developed to strengthen variance terms by borrowing information across genes. However, even with such accommodations, variance estimates may be inflated by the presence of
outliers. For our second study, we propose a robust modification of optimal shrinkage variance estimation to improve outlier detection. In order to increase power, we
suggest grouping standardized data so that information shared across genes is similar
in distribution. Simulation studies and analysis of real colon cancer microarray data
reveal that our methodology provides a technique which is insensitive to outliers, free of distributional assumptions, effective for small sample sizes, and data-adaptive.
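The idea of shrinking gene-level variance estimates toward a shared value while resisting outliers can be sketched as follows. The MAD-based scale and the fixed shrinkage weight are illustrative stand-ins for the dissertation's robust modification and its optimal data-driven weight:

```python
import statistics

def robust_shrinkage_variances(groups, weight=0.5):
    """Shrink per-gene robust variance estimates toward a pooled value.

    Each group is a list of replicate measurements for one gene.  A squared
    MAD-based scale replaces the sample variance to resist outliers, and each
    gene's estimate is shrunk toward the median scale across genes.
    """
    k = 1.4826  # MAD-to-sigma consistency factor under normality
    scales = []
    for g in groups:
        med = statistics.median(g)
        mad = statistics.median(abs(v - med) for v in g)
        scales.append((k * mad) ** 2)
    pooled = statistics.median(scales)
    return [(1 - weight) * s + weight * pooled for s in scales]
```

A gene whose replicates contain an extreme outlier keeps an elevated but moderated variance estimate, while well-behaved genes stay close to the pooled value.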
A Nonparametric Mean-Variance Smoothing Method to Assess Arabidopsis Cold Stress Transcriptional Regulator CBF2 Overexpression Microarray Data
Microarray is a powerful tool for genome-wide gene expression analysis. In microarray expression data, mean and variance often have certain relationships. We present a nonparametric mean-variance smoothing method (NPMVS) to analyze differentially expressed genes. In this method, a nonlinear smoothing curve is fitted to estimate the relationship between mean and variance. Inference is then made upon shrinkage estimation of posterior means, assuming variances are known. Different methods have been applied to simulated datasets, in which a variety of mean-variance relationships were imposed. The simulation study showed that NPMVS outperformed the other two popular shrinkage estimation methods under some mean-variance relationships and was competitive with them under others. A real biological dataset, in which a cold stress transcription factor gene, CBF2, was overexpressed, has also been analyzed with the three methods. Gene ontology and cis-element analysis showed that NPMVS identified more cold and stress responsive genes than the other two methods did. The good performance of NPMVS is mainly due to its shrinkage estimation of both means and variances. In addition, NPMVS exploits a nonparametric regression between mean and variance, instead of assuming a specific parametric relationship between them. The source code, written in R, is available from the authors on request.
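The core of the approach, borrowing variance information across genes with similar means, can be sketched with a running-mean smoother standing in for the fitted nonlinear smoothing curve (the function and window choice are illustrative, not the NPMVS implementation):

```python
def smooth_variance_by_mean(means, variances, window=5):
    """Nonparametric mean-variance smoothing sketch.

    For each gene, re-estimate its variance as the average of the variances
    of the `window` genes whose means are closest in rank, i.e. a running-mean
    smoother of variance against mean."""
    order = sorted(range(len(means)), key=lambda i: means[i])
    smoothed = [0.0] * len(means)
    half = window // 2
    for rank, i in enumerate(order):
        lo = max(0, rank - half)
        hi = min(len(order), rank + half + 1)
        neigh = [variances[order[j]] for j in range(lo, hi)]
        smoothed[i] = sum(neigh) / len(neigh)
    return smoothed
```

When variance is an exact linear function of the mean, the smoother reproduces it in the interior and averages over the truncated window at the boundaries.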