Penalized model-based clustering with cluster-specific diagonal covariance matrices and grouped variables
Clustering analysis is one of the most widely used statistical tools in many
emerging areas such as microarray data analysis. For microarray and other
high-dimensional data, the presence of many noise variables may mask underlying
clustering structures. Hence removing noise variables via variable selection is
necessary. For simultaneous variable selection and parameter estimation,
existing penalized likelihood approaches in model-based clustering analysis all
assume a common diagonal covariance matrix across clusters, which, however, may
not hold in practice. To analyze high-dimensional data, particularly those with
relatively low sample sizes, this article introduces a novel approach that
shrinks the variances together with the means, in the more general setting of
cluster-specific (diagonal) covariance matrices. Furthermore, selection of
grouped variables via inclusion or exclusion of a group of variables altogether
is permitted by a specific form of penalty, which facilitates incorporating
subject-matter knowledge, such as gene functions in clustering microarray
samples for disease subtype discovery. For implementation, EM algorithms are
derived for parameter estimation, in which the M-steps clearly demonstrate the
effects of shrinkage and thresholding. Numerical examples, including an
application to acute leukemia subtype discovery with microarray gene expression
data, are provided to demonstrate the utility and advantage of the proposed
method.

Comment: Published at http://dx.doi.org/10.1214/08-EJS194 in the Electronic
Journal of Statistics (http://www.i-journals.org/ejs/) by the Institute of
Mathematical Statistics (http://www.imstat.org).
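The shrinkage and thresholding effect described in the M-steps can be illustrated with the soft-thresholding operator that an L1-type penalty on cluster means typically induces (a minimal Python sketch; the function and the numbers are illustrative, not the paper's exact estimator):

```python
import numpy as np

def soft_threshold(x, lam):
    # Generic soft-thresholding S(x, lam) = sign(x) * max(|x| - lam, 0):
    # the update form that a lasso-type penalty induces in an M-step.
    return np.sign(x) * np.maximum(np.abs(x) - lam, 0.0)

# Hypothetical cluster means (2 clusters x 3 variables). A variable whose
# penalized mean estimates are shrunk exactly to zero in every cluster
# carries no clustering information and is effectively deselected.
means = np.array([[0.05, 2.3, -0.01],
                  [-0.04, -1.9, 0.02]])
shrunk = soft_threshold(means, lam=0.1)
# Columns 0 and 2 are zeroed out in both clusters; column 1 survives shrunk.
```

The same operator, applied groupwise to a block of coefficients, gives the all-in-or-all-out behavior that grouped-variable penalties aim for.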
J Comput Biol
Gene expression measurements make it possible to determine sets of up-regulated, down-regulated, or unchanged genes in a particular experimental condition. Additional biological knowledge can suggest examples of genes from one of these sets. For instance, known target genes of a transcriptional activator are expected, but not certain, to go down after this activator is knocked out. Available differential expression analysis tools do not take such imprecise examples into account. Here we put forward a novel partially supervised mixture modeling methodology for differential expression analysis. Our approach, guided by imprecise examples, clusters expression data into differentially expressed and unchanged genes. The partially supervised methodology is implemented by two methods: a newly introduced belief-based mixture modeling, and soft-label mixture modeling, a method that has proved effective in other applications. Using synthetic data, we investigate which input example settings favor each method. In our tests, both the belief-based and soft-label methods prove their advantage over semi-supervised mixture modeling in correcting for erroneous examples. We also compare them to alternative differential expression analysis approaches, showing that incorporating knowledge yields better performance. We present a broad range of knowledge sources and data to which our partially supervised methodology can be applied. First, we determine targets of Ste12 from yeast knockout data, guided by a Ste12 DNA-binding experiment. Second, we distinguish miR-1 from miR-124 targets in human by clustering expression data from transfection experiments with both microRNAs, using their computationally predicted targets as examples. Finally, we utilize literature knowledge to improve the clustering of time-course expression profiles.
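The E-step reweighting behind soft-label mixture modeling can be sketched as follows (an illustrative Python fragment with hypothetical values; this is not the authors' implementation):

```python
import numpy as np

def soft_label_responsibilities(lik, label_weights):
    # E-step of a two-component (differential / unchanged) mixture in which
    # each gene carries soft label weights: component likelihoods are
    # multiplied by per-gene beliefs before normalization, so imprecise
    # examples bias, but never fix, the component assignment.
    w = lik * label_weights
    return w / w.sum(axis=1, keepdims=True)

# Two genes: the first is a suspected target (belief 0.9 "differential"),
# the second has no prior knowledge (uniform weights).
lik = np.array([[0.5, 0.5],
                [0.2, 0.8]])
beliefs = np.array([[0.9, 0.1],
                    [0.5, 0.5]])
resp = soft_label_responsibilities(lik, beliefs)
```

Because the weights only multiply the likelihoods, a strongly contradicting observation can still override an erroneous example, which is the robustness property tested against semi-supervised (hard-label) modeling.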
Variable selection for the multicategory SVM via adaptive sup-norm regularization
The Support Vector Machine (SVM) is a popular classification paradigm in
machine learning and has achieved great success in real applications. However,
the standard SVM cannot select variables automatically and therefore its
solution typically utilizes all the input variables without discrimination.
This makes it difficult to identify important predictor variables, which is
often one of the primary goals in data analysis. In this paper, we propose two
novel types of regularization in the context of the multicategory SVM (MSVM)
for simultaneous classification and variable selection. The MSVM generally
requires estimation of multiple discriminating functions and applies the argmax
rule for prediction. For each individual variable, we propose to characterize
its importance by the sup-norm of its coefficient vector across the
different functions, and then minimize the MSVM hinge loss function subject to
a penalty on the sum of sup-norms. To further improve the sup-norm penalty, we
propose adaptive regularization, which allows different weights to be imposed on
different variables according to their relative importance. Both types of
regularization automate variable selection in the process of building
classifiers, and lead to sparse multi-classifiers with enhanced
interpretability and improved accuracy, especially for high-dimensional,
low-sample-size data. One major advantage of the sup-norm penalty is its easy
implementation via standard linear programming. Several simulated examples and
one real gene expression data analysis demonstrate the outstanding performance of the
adaptive sup-norm penalty in various data settings.

Comment: Published at http://dx.doi.org/10.1214/08-EJS122 in the Electronic
Journal of Statistics (http://www.i-journals.org/ejs/) by the Institute of
Mathematical Statistics (http://www.imstat.org).
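The penalty term itself is easy to sketch numerically: with K discriminating functions, variable j contributes max_k |w_jk| to the penalty, so its entire coefficient row must vanish for the variable to drop out of all classifiers at once (a hedged Python illustration of the penalty term only, not the full linear program):

```python
import numpy as np

def supnorm_penalty(W):
    # W has one row per variable and one column per discriminating function.
    # Each variable's importance is the sup-norm of its coefficient row;
    # the penalty is the sum of these sup-norms over all variables.
    importance = np.abs(W).max(axis=1)
    return importance, importance.sum()

# Hypothetical coefficients for 3 variables and K = 3 functions: variable 0
# is excluded from every classifier; variables 1 and 2 remain selected.
W = np.array([[0.0, 0.0, 0.0],
              [1.2, -0.5, 0.3],
              [-0.4, 0.4, 0.1]])
importance, penalty = supnorm_penalty(W)
```

In the adaptive variant, each sup-norm would additionally be multiplied by a data-driven weight, shrinking unimportant variables more aggressively.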