On clustering procedures and nonparametric mixture estimation
This paper deals with nonparametric estimation of conditional densities in
mixture models when additional covariates are available. The proposed approach
first performs a preliminary clustering algorithm on the additional covariates
to guess the mixture component of each observation. Conditional densities of
the mixture model are then estimated using kernel density estimates applied
separately to each cluster. We investigate the expected L1-error of the
resulting estimates and derive optimal rates of convergence over classical
nonparametric density classes, provided the clustering method is accurate. The
performance of a clustering algorithm is measured by its maximal
misclassification error, and we obtain upper bounds on this quantity for a
single-linkage hierarchical clustering algorithm. Lastly, applications of the
proposed method to mixture models involving electricity distribution data and
simulated data are presented.
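The two-stage procedure described above can be sketched in a few lines of Python. This is a minimal illustration using scikit-learn estimators as stand-ins for the paper's method, with simulated data (the variable names and the bandwidth are assumptions, not taken from the paper):

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering
from sklearn.neighbors import KernelDensity

rng = np.random.default_rng(0)
# Simulated two-component mixture: the additional covariates separate
# the components more cleanly than the observations themselves.
cov = np.concatenate([rng.normal(-3, 0.5, 200), rng.normal(3, 0.5, 200)]).reshape(-1, 1)
obs = np.concatenate([rng.normal(0, 1.0, 200), rng.normal(5, 1.0, 200)]).reshape(-1, 1)

# Step 1: single-linkage clustering on the covariates guesses the
# mixture component of each observation.
labels = AgglomerativeClustering(n_clusters=2, linkage="single").fit_predict(cov)

# Step 2: a kernel density estimate is fitted separately to each cluster,
# giving estimates of the conditional densities of the mixture components.
densities = [KernelDensity(bandwidth=0.5).fit(obs[labels == k]) for k in (0, 1)]
grid = np.linspace(-4, 9, 50).reshape(-1, 1)
estimates = [np.exp(d.score_samples(grid)) for d in densities]
```

If the clustering step recovers the true components (as it does here, since the covariates are well separated), each per-cluster estimate targets one conditional density of the mixture.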
Switching Regression Models and Causal Inference in the Presence of Discrete Latent Variables
Given a response Y and a vector X of predictors,
we investigate the problem of inferring the direct causes of Y among the
components of X. Models for Y that use all of its causal covariates as predictors enjoy
the property of being invariant across different environments or interventional
settings. Given data from such environments, this property has been exploited
for causal discovery. Here, we extend this inference principle to situations in
which some (discrete-valued) direct causes of are unobserved. Such cases
naturally give rise to switching regression models. We provide sufficient
conditions for the existence, consistency and asymptotic normality of the MLE
in linear switching regression models with Gaussian noise, and construct a test
for the equality of such models. These results allow us to prove that the
proposed causal discovery method obtains asymptotic false discovery control
under mild conditions. We provide an algorithm, make code available, and test
our method on simulated data. It is robust against model violations and
outperforms state-of-the-art approaches. We further apply our method to a real
data set, where we show that it not only outputs causal predictors, but
also a process-based clustering of data points, which could be of additional
interest to practitioners.
Comment: 46 pages, 14 figures; real-world application added in Section 5.2;
additional numerical experiments added in the Appendix
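The invariance property this abstract builds on can be illustrated numerically. In the toy linear model below (hypothetical variables, not the paper's data or its switching-regression test), regressing the response on its causal parent yields approximately the same coefficient in every environment, while a non-causal predictor does not:

```python
import numpy as np

rng = np.random.default_rng(2)

def environment(shift, n=500):
    # X1 causes Y with the same coefficient everywhere; X2 is an effect
    # of Y whose relation to Y changes across environments.
    x1 = rng.normal(0, 1, n) + shift
    y = 2.0 * x1 + rng.normal(0, 0.5, n)
    x2 = shift * y + rng.normal(0, 0.5, n)
    return x1, x2, y

def slope(x, y):
    # Ordinary least-squares slope of y regressed on x.
    return np.polyfit(x, y, 1)[0]

envs = [environment(1.0), environment(3.0)]
# Coefficients of the causal predictor X1 are (approximately) invariant...
s1 = [slope(e[0], e[2]) for e in envs]
# ...while those of the non-causal X2 shift with the environment.
s2 = [slope(e[1], e[2]) for e in envs]
```

Methods in this line of work exploit exactly this asymmetry: predictor sets whose regression models are invariant across environments are candidate sets of direct causes.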
Bayesian Cluster Enumeration Criterion for Unsupervised Learning
We derive a new Bayesian Information Criterion (BIC) by formulating the
problem of estimating the number of clusters in an observed data set as
maximization of the posterior probability of the candidate models. Given that
some mild assumptions are satisfied, we provide a general BIC expression for a
broad class of data distributions. This serves as a starting point when
deriving the BIC for specific distributions. Along this line, we provide a
closed-form BIC expression for multivariate Gaussian distributed variables. We
show that incorporating the data structure of the clustering problem into the
derivation of the BIC results in an expression whose penalty term is different
from that of the original BIC. We propose a two-step cluster enumeration
algorithm. First, a model-based unsupervised learning algorithm partitions the
data according to a given set of candidate models. Subsequently, the number of
clusters is determined as the one associated with the model for which the
proposed BIC is maximal. The performance of the proposed two-step algorithm is
tested using synthetic and real data sets.
Comment: 14 pages, 7 figures
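The two-step enumeration scheme can be sketched with scikit-learn's GaussianMixture, using its classic BIC as a stand-in for the paper's criterion (the proposed BIC differs in its penalty term, and sklearn's bic is minimized rather than maximized; the data here are simulated):

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(1)
# Three well-separated Gaussian clusters in two dimensions.
X = np.concatenate([rng.normal(m, 0.3, (100, 2)) for m in ((0, 0), (4, 0), (0, 4))])

# Step 1: a model-based unsupervised learning algorithm partitions the
# data under each candidate model (here, each candidate cluster count).
candidates = list(range(1, 7))
models = [GaussianMixture(n_components=k, random_state=0).fit(X) for k in candidates]

# Step 2: the number of clusters is the one whose fitted model optimizes
# the criterion (classic BIC here; the paper's BIC would replace it).
bics = [m.bic(X) for m in models]
best_k = candidates[int(np.argmin(bics))]
```

With well-separated clusters the classic BIC already recovers the true count; the paper's contribution is a penalty term derived from the clustering structure itself, which matters in harder regimes.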