Sequential Logistic Principal Component Analysis (SLPCA): Dimensional Reduction in Streaming Multivariate Binary-State System
Sequential or online dimensional reduction is of interest due to the
explosion of streaming-data-based applications and the requirement of adaptive
statistical modeling in many emerging fields, such as the modeling of energy
end-use profiles. Principal Component Analysis (PCA) is the classical way of
dimensional reduction. However, traditional Singular Value Decomposition (SVD)
based PCA fails to model data which deviates substantially from a Gaussian
distribution. The Bregman Divergence was recently introduced to achieve a
generalized PCA framework. If the random variable under dimensional reduction
follows a Bernoulli distribution, which occurs in many emerging fields, the
generalized PCA is called Logistic PCA (LPCA). In this paper, we extend the
batch LPCA to a sequential version (i.e. SLPCA), based on the sequential convex
optimization theory. The convergence properties of this algorithm are discussed
and compared with the batch version of LPCA (i.e. BLPCA), as is its performance
in reducing the dimension of multivariate binary-state systems. Its
application in building energy end-use profile modeling is also investigated.
Comment: 6 pages, 4 figures, conference submission
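To make the Logistic PCA idea in the abstract above concrete, the following is a minimal batch sketch, not the paper's SLPCA algorithm. It assumes the Bernoulli natural-parameter matrix is factored as `U @ V.T` and fits it by plain gradient descent on the Bernoulli negative log-likelihood. The function names, step size, and initialization are illustrative assumptions.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def bernoulli_nll(X, U, V):
    """Negative log-likelihood of binary X under logits Theta = U @ V.T."""
    theta = U @ V.T
    # log(1 + exp(theta)) - x * theta, summed over all entries
    return np.sum(np.logaddexp(0.0, theta) - X * theta)

def logistic_pca(X, rank, steps=500, lr=0.01, seed=0):
    """Hypothetical batch LPCA fit: X is n x d with entries in {0, 1}."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    U = 0.1 * rng.standard_normal((n, rank))   # per-sample scores
    V = 0.1 * rng.standard_normal((d, rank))   # shared loadings
    for _ in range(steps):
        R = sigmoid(U @ V.T) - X               # gradient of the NLL w.r.t. Theta
        # simultaneous gradient step on both factors (right side uses old U, V)
        U, V = U - lr * (R @ V), V - lr * (R.T @ U)
    return U, V

if __name__ == "__main__":
    rng = np.random.default_rng(1)
    P = sigmoid(rng.standard_normal((60, 2)) @ rng.standard_normal((2, 10)))
    X = (rng.random((60, 10)) < P).astype(float)
    U, V = logistic_pca(X, rank=2)
    print("NLL at zero logits:", X.size * np.log(2.0))
    print("NLL after fitting: ", bernoulli_nll(X, U, V))
```

A sequential variant would process rows of `X` as they arrive instead of sweeping the full matrix at every step; how the paper does this is not specified in the abstract, so it is not sketched here.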
Exponential Family Matrix Completion under Structural Constraints
We consider the matrix completion problem of recovering a structured matrix
from noisy and partial measurements. Recent works have proposed tractable
estimators with strong statistical guarantees for the case where the underlying
matrix is low--rank, and the measurements consist of a subset, either of the
exact individual entries, or of the entries perturbed by additive Gaussian
noise, which is thus implicitly suited for thin--tailed continuous data.
Arguably, common applications of matrix completion require estimators for (a)
heterogeneous data--types, such as skewed--continuous, count, binary, etc., (b)
for heterogeneous noise models (beyond Gaussian), which capture varied
uncertainty in the measurements, and (c) heterogeneous structural constraints
beyond low--rank, such as block--sparsity, or a superposition structure of
low--rank plus elementwise sparseness, among others. In this paper, we provide
a vastly unified framework for generalized matrix completion by considering a
matrix completion setting wherein the matrix entries are sampled from any
member of the rich family of exponential family distributions; and impose
general structural constraints on the underlying matrix, as captured by a
general regularizer $\mathcal{R}(\cdot)$. We propose a simple convex regularized
$M$--estimator for the generalized framework, and provide a unified and novel
statistical analysis for this general class of estimators. We finally
corroborate our theoretical results on simulated datasets.
Comment: 20 pages, 9 figures
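As a rough illustration of the kind of estimator the abstract above describes, a regularized negative log-likelihood over the observed entries can be written as follows. This is a generic sketch under stated assumptions, not the paper's exact formulation: the observed index set $\Omega$, the base measure $h$, the log-partition function $G$, and the regularization weight $\lambda$ are notation introduced here, and any additional constraints the paper imposes are omitted.

```latex
% Assumed observation model: each observed entry follows a natural
% exponential family with parameter \Theta^*_{ij}
\[
  p\left(X_{ij} \mid \Theta^*_{ij}\right)
    = h(X_{ij}) \exp\left( X_{ij}\,\Theta^*_{ij} - G(\Theta^*_{ij}) \right),
  \qquad (i,j) \in \Omega .
\]
% Sketch of a convex regularized M-estimator: average negative
% log-likelihood over observed entries plus a structural penalty
\[
  \widehat{\Theta} \in \arg\min_{\Theta}\;
    -\frac{1}{|\Omega|} \sum_{(i,j) \in \Omega}
      \left[ X_{ij}\,\Theta_{ij} - G(\Theta_{ij}) \right]
    + \lambda\, \mathcal{R}(\Theta) .
\]
```

With $G(\theta) = \theta^2/2$ the loss reduces to the squared error of the Gaussian case, and with $G(\theta) = \log(1 + e^{\theta})$ it is the logistic loss for binary entries. Choosing $\mathcal{R}$ as the nuclear norm encourages a low-rank solution.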
Discovering Patient Phenotypes Using Generalized Low Rank Models
The practice of medicine is predicated on discovering commonalities or distinguishing characteristics among patients
to inform corresponding treatment. Given a patient grouping (hereafter referred to as a phenotype), clinicians can
implement a treatment pathway accounting for the underlying cause of disease in that phenotype. Traditionally,
phenotypes have been discovered by intuition, experience in practice, and advancements in basic science, but these
approaches are often heuristic, labor intensive, and can take decades to produce actionable knowledge. Although our
understanding of disease has progressed substantially in the past century, there are still important domains in which
our phenotypes are murky, such as in behavioral health or in hospital settings. To accelerate phenotype discovery,
researchers have used machine learning to find patterns in electronic health records, but have often been thwarted by
missing data, sparsity, and data heterogeneity. In this study, we use a flexible framework called Generalized Low
Rank Modeling (GLRM) to overcome these barriers and discover phenotypes in two sources of patient data. First, we
analyze data from the 2010 Healthcare Cost and Utilization Project National Inpatient Sample (NIS), which contains
upwards of 8 million hospitalization records consisting of administrative codes and demographic information. Second,
we analyze a small (N=1746), local dataset documenting the clinical progression of autism spectrum disorder patients using granular features from the electronic health record, including text from physician notes. We demonstrate that
low rank modeling successfully captures known and putative phenotypes in these vastly different datasets.