Unsupervised feature learning with discriminative encoder
In recent years, deep discriminative models have achieved extraordinary
performance on supervised learning tasks, significantly outperforming their
generative counterparts. However, their success relies on the presence of a
large amount of labeled data. How can one use the same discriminative models
for learning useful features in the absence of labels? We address this question
in this paper, by jointly modeling the distribution of data and latent features
in a manner that explicitly assigns zero probability to unobserved data. Rather
than maximizing the marginal probability of observed data, we maximize the
joint probability of the data and the latent features using a two-step EM-like
procedure. To prevent the model from overfitting to our initial selection of
latent features, we use adversarial regularization. Depending on the task, we
allow the latent features to be one-hot or real-valued vectors and define a
suitable prior on the features. For instance, one-hot features correspond to
class labels and are directly used for the unsupervised and semi-supervised
classification task, whereas real-valued feature vectors are fed as input to
simple classifiers for auxiliary supervised discrimination tasks. The proposed
model, which we dub discriminative encoder (or DisCoder), is flexible in the
type of latent features that it can capture. It achieves
state-of-the-art performance on several challenging tasks.
Comment: 10 pages, 4 figures, International Conference on Data Mining, 201
Categorical Dimensions of Human Odor Descriptor Space Revealed by Non-Negative Matrix Factorization
In contrast to most other sensory modalities, the basic perceptual dimensions of olfaction remain unclear. Here, we use non-negative matrix factorization (NMF), a dimensionality reduction technique, to uncover structure in a panel of odor profiles, with each odor defined as a point in multi-dimensional descriptor space. The properties of NMF are favorable for the analysis of such lexical and perceptual data, and lead to a high-dimensional account of odor space. We further provide evidence that odor dimensions apply categorically. That is, odor space is not occupied homogeneously, but rather in a discrete and intrinsically clustered manner. We discuss the potential implications of these results for the neural coding of odors, as well as for developing classifiers on larger datasets that may be useful for predicting perceptual qualities from chemical structures.
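As a rough illustration of the factorization step described in this abstract, the sketch below applies NMF with the standard Lee-Seung multiplicative updates to a small odor-by-descriptor matrix. The toy ratings, the rank of 2, and the function name nmf are assumptions made for illustration only; they are not taken from the paper, and the authors' actual NMF variant and data may differ.

```python
import numpy as np

def nmf(V, rank, n_iter=200, eps=1e-9, seed=0):
    """Approximate a non-negative matrix V as W @ H with W, H >= 0.

    Uses Lee-Seung multiplicative updates for the Frobenius-norm objective.
    """
    rng = np.random.default_rng(seed)
    n_rows, n_cols = V.shape
    W = rng.random((n_rows, rank)) + eps   # odor loadings on each dimension
    H = rng.random((rank, n_cols)) + eps   # descriptor weights per dimension
    for _ in range(n_iter):
        # Multiplicative updates keep every entry non-negative.
        H *= (W.T @ V) / (W.T @ W @ H + eps)
        W *= (V @ H.T) / (W @ H @ H.T + eps)
    return W, H

# Hypothetical toy data: 6 odors (rows) rated on 4 descriptors (columns).
V = np.array([
    [5.0, 4.0, 0.0, 0.0],
    [4.0, 5.0, 0.0, 1.0],
    [5.0, 5.0, 1.0, 0.0],
    [0.0, 0.0, 4.0, 5.0],
    [1.0, 0.0, 5.0, 4.0],
    [0.0, 1.0, 5.0, 5.0],
])

W, H = nmf(V, rank=2)

# One simple way to read the result categorically: assign each odor to the
# dimension on which it loads most strongly.
labels = W.argmax(axis=1)
print("reconstruction error:", np.linalg.norm(V - W @ H))
print("dominant dimension per odor:", labels)
```

Assigning each odor to its dominant column of W is only one possible reading of the abstract's claim that odor dimensions apply categorically; it is used here because it is the simplest check to run.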
Scaling Nonparametric Bayesian Inference via Subsample-Annealing
We describe an adaptation of the simulated annealing algorithm to
nonparametric clustering and related probabilistic models. This new algorithm
learns nonparametric latent structure over a growing and constantly churning
subsample of training data, where the portion of data subsampled can be
interpreted as the inverse temperature beta(t) in an annealing schedule. Gibbs
sampling at high temperature (i.e., with a very small subsample) can more
quickly explore sketches of the final latent state by (a) making longer jumps
around latent space (as in block Gibbs) and (b) lowering energy barriers (as in
simulated annealing). We prove that subsample annealing reduces mixing time from
N^2 to N in a simple clustering model and from exp(N) to N in another class of
models, where N is the data size. Empirically, subsample annealing outperforms naive
Gibbs sampling in accuracy per unit of wall-clock time, and can scale to larger datasets and
deeper hierarchical models. We demonstrate improved inference on million-row
subsamples of US Census data and network log data and a 307-row hospital rating
dataset, using a Pitman-Yor generalization of the Cross Categorization model.
Comment: To appear in AISTATS 201
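The growing, churning subsample described in this abstract can be pictured with the sketch below. It covers the subsample schedule only, not the Gibbs sampler that would run on each subsample. The linear schedule for beta(t), the churn fraction, and the name subsample_schedule are assumptions for illustration and are not taken from the paper.

```python
import random

def subsample_schedule(n_rows, n_steps, churn=0.2, seed=0):
    """Yield (beta, subsample) pairs for a growing, churning subsample.

    beta is the fraction of rows in the subsample, read as an inverse
    temperature. A linear ramp from 1/n_steps to 1 is assumed here.
    """
    rng = random.Random(seed)
    all_rows = list(range(n_rows))
    current = set()
    for t in range(1, n_steps + 1):
        beta = t / n_steps
        target = max(1, round(beta * n_rows))
        # Churn: drop a fraction of the rows currently in the subsample.
        n_drop = int(churn * len(current))
        for row in rng.sample(sorted(current), n_drop):
            current.remove(row)
        # Grow: add rows from outside the subsample up to the target size.
        outside = [r for r in all_rows if r not in current]
        current.update(rng.sample(outside, target - len(current)))
        # A Gibbs sweep restricted to `current` would run here (not shown).
        yield beta, set(current)

schedule = list(subsample_schedule(n_rows=100, n_steps=10))
for beta, rows in schedule:
    print(f"beta={beta:.1f} subsample size={len(rows)}")
```

Early steps use few rows (high temperature) and the final step uses every row (beta = 1), which matches the abstract's reading of the subsampled portion as an inverse temperature.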
Automatic Bayesian Density Analysis
Making sense of a dataset in an automatic and unsupervised fashion is a
challenging problem in statistics and AI. Classical approaches for exploratory
data analysis are usually not flexible enough to deal with the uncertainty
inherent to real-world data: they are often restricted to fixed latent
interaction models and homogeneous likelihoods; they are sensitive to missing,
corrupt and anomalous data; moreover, their expressiveness generally comes at
the price of intractable inference. As a result, supervision from statisticians
is usually needed to find the right model for the data. However, since domain
experts are not necessarily also experts in statistics, we propose Automatic
Bayesian Density Analysis (ABDA) to make exploratory data analysis accessible
at large. Specifically, ABDA allows for automatic and efficient missing value
estimation, statistical data type and likelihood discovery, anomaly detection
and dependency structure mining, on top of providing accurate density
estimation. Extensive empirical evidence shows that ABDA is a suitable tool for
automatic exploratory analysis of mixed continuous and discrete tabular data.
Comment: In proceedings of the Thirty-Third AAAI Conference on Artificial
Intelligence (AAAI-19
A taxonomy framework for unsupervised outlier detection techniques for multi-type data sets
The term "outlier" can generally be defined as an observation that is significantly different from
the other values in a data set. The outliers may be instances of error or indicate events. The
task of outlier detection aims at identifying such outliers in order to improve the analysis of
data and further discover interesting and useful knowledge about unusual events within numerous
application domains. In this paper, we report on contemporary unsupervised outlier detection
techniques for multiple types of data sets and provide a comprehensive taxonomy framework and
two decision trees to select the most suitable technique based on the data set. Furthermore, we
highlight the advantages, disadvantages and performance issues of each class of outlier detection
techniques under this taxonomy framework.
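To make the definition of an outlier in this abstract concrete, the sketch below flags values that lie far from the rest of a one-dimensional data set, using the modified z-score based on the median absolute deviation (MAD). This is one common unsupervised rule chosen for illustration. It is not claimed to be one of the techniques in the paper's taxonomy, and the sample readings and the name mad_outliers are hypothetical.

```python
import statistics

def mad_outliers(values, threshold=3.5):
    """Return the values whose modified z-score exceeds the threshold.

    Modified z-score: 0.6745 * |x - median| / MAD, with the conventional
    cutoff of 3.5.
    """
    med = statistics.median(values)
    mad = statistics.median(abs(v - med) for v in values)
    if mad == 0:
        # At least half the values equal the median; the score is undefined.
        return []
    return [v for v in values if 0.6745 * abs(v - med) / mad > threshold]

# Hypothetical sensor readings with one value far from the others.
readings = [10.1, 9.8, 10.3, 10.0, 9.9, 25.0, 10.2]
print(mad_outliers(readings))  # prints [25.0]
```

The median and MAD are used instead of the mean and standard deviation because a single extreme value distorts them far less.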