A note on the evaluation of generative models
Probabilistic generative models can be used for compression, denoising, inpainting, texture synthesis, semi-supervised learning, unsupervised feature learning, and other tasks. Given this wide range of applications, it is not surprising that a lot of heterogeneity exists in the way these models are formulated, trained, and evaluated. As a consequence, direct comparison between models is often difficult. This article reviews mostly known but often underappreciated properties relating to the evaluation and interpretation of generative models, with a focus on image models. In particular, we show that three of the currently most commonly used criteria---average log-likelihood, Parzen window estimates, and visual fidelity of samples---are largely independent of each other when the data is high-dimensional. Good performance with respect to one criterion therefore need not imply good performance with respect to the other criteria. Our results show that extrapolation from one criterion to another is not warranted, and generative models need to be evaluated directly with respect to the application(s) they were intended for. In addition, we provide examples demonstrating that Parzen window estimates should generally be avoided.
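To make the criticized criterion concrete, here is a minimal NumPy sketch of the Parzen window (kernel density) log-likelihood estimate, assuming a Gaussian kernel whose bandwidth sigma would normally be tuned on a validation set; none of the specifics below come from the paper.

```python
import numpy as np

def parzen_log_likelihood(samples, test_data, sigma):
    """Average log-likelihood of test_data under a Gaussian Parzen window
    estimate fit to model samples.

    samples:   (n, d) array of samples drawn from the generative model
    test_data: (m, d) array of held-out data points
    sigma:     kernel bandwidth (typically tuned on a validation set)
    """
    n, d = samples.shape
    # Squared distances between every test point and every sample: (m, n)
    sq_dists = ((test_data[:, None, :] - samples[None, :, :]) ** 2).sum(-1)
    # Log of the (unnormalized) Gaussian kernel at each pair
    log_kernel = -sq_dists / (2 * sigma ** 2)
    # log p(x) = logsumexp_i(log_kernel_i) - log n - (d/2) log(2*pi*sigma^2)
    log_norm = np.log(n) + 0.5 * d * np.log(2 * np.pi * sigma ** 2)
    max_lk = log_kernel.max(axis=1, keepdims=True)
    log_p = (max_lk.squeeze(1)
             + np.log(np.exp(log_kernel - max_lk).sum(axis=1))
             - log_norm)
    return log_p.mean()
```

In high-dimensional image spaces such an estimate is largely determined by the bandwidth and the nearest model samples, which gives some intuition for why it correlates so poorly with the other criteria.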
Transfer learning by supervised pre-training for audio-based music classification
Very few large-scale music research datasets are publicly available. There is an increasing need for such datasets, because the shift from physical to digital distribution in the music industry has given the listener access to a large body of music, which needs to be cataloged efficiently and be easily browsable. Additionally, deep learning and feature learning techniques are becoming increasingly popular for music information retrieval applications, and they typically require large amounts of training data to work well. In this paper, we propose to exploit an available large-scale music dataset, the Million Song Dataset (MSD), for classification tasks on other datasets, by reusing models trained on the MSD for feature extraction. This transfer learning approach, which we refer to as supervised pre-training, was previously shown to be very effective for computer vision problems. We show that features learned from MSD audio fragments in a supervised manner, using tag labels and user listening data, consistently outperform features learned in an unsupervised manner in this setting, provided that the learned feature extractor is of limited complexity. We evaluate our approach on the GTZAN, 1517-Artists, Unique, and Magnatagatune datasets.
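The recipe described above can be sketched in a few lines of PyTorch: pre-train a feature extractor with supervision on a large source task, then freeze it and fit only a small classifier on the target dataset. All layer sizes, label counts, and the synthetic stand-in data below are illustrative assumptions, not the paper's architecture.

```python
import torch
import torch.nn as nn

# Toy stand-ins for the datasets (shapes are assumptions, not the paper's):
# 1,000-dim audio summary features, 50 source tags, 10 target genres.
Xs, Ts = torch.randn(256, 1000), (torch.rand(256, 50) > 0.9).float()
Xt, yt = torch.randn(64, 1000), torch.randint(0, 10, (64,))

feature_net = nn.Sequential(nn.Linear(1000, 512), nn.ReLU(),
                            nn.Linear(512, 256), nn.ReLU())
tag_head = nn.Linear(256, 50)           # source task: multi-label tags

# Supervised pre-training on the large source (MSD-like) dataset.
opt = torch.optim.Adam([*feature_net.parameters(), *tag_head.parameters()])
bce = nn.BCEWithLogitsLoss()
for _ in range(100):
    opt.zero_grad()
    bce(tag_head(feature_net(Xs)), Ts).backward()
    opt.step()

# Transfer: freeze the pre-trained feature extractor and fit only a small
# classifier on the target dataset (e.g. GTZAN genres).
for p in feature_net.parameters():
    p.requires_grad_(False)
genre_head = nn.Linear(256, 10)
opt2 = torch.optim.Adam(genre_head.parameters())
ce = nn.CrossEntropyLoss()
for _ in range(100):
    opt2.zero_grad()
    ce(genre_head(feature_net(Xt)), yt).backward()
    opt2.step()
```

Keeping the target-side classifier this simple mirrors the paper's observation that transfer works best when the learned feature extractor is of limited complexity.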
Factoring variations in natural images with deep Gaussian mixture models
Generative models can be seen as the Swiss Army knives of machine learning, as many problems can be written probabilistically in terms of the distribution of the data, including prediction, reconstruction, imputation, and simulation. One of the most promising directions for unsupervised learning may lie in deep learning methods, given their success in supervised learning. However, one of the current problems with deep unsupervised learning methods is that they are often harder to scale. As a result, some simpler, more scalable shallow methods, such as the Gaussian Mixture Model and the Student-t Mixture Model, remain surprisingly competitive. In this paper we propose a new scalable deep generative model for images, called the Deep Gaussian Mixture Model, which is a straightforward but powerful generalization of GMMs to multiple layers. The parametrization of a Deep GMM allows it to efficiently capture products of variations in natural images. We propose a new EM-based algorithm that scales well to large datasets, and we show that both the Expectation and the Maximization steps can easily be distributed over multiple machines. In our density estimation experiments we show that deeper GMM architectures generalize better than shallower ones, with results in the same ballpark as the state of the art.
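A toy NumPy/SciPy sketch of the model structure, assuming uniform path probabilities for simplicity (the paper learns all parameters with its EM algorithm): a sample is a standard normal pushed through one randomly chosen affine transformation per layer, and the density collapses to a shallow GMM with one Gaussian component per path through the layers.

```python
import numpy as np
from itertools import product
from scipy.stats import multivariate_normal

rng = np.random.default_rng(0)
d = 2                                    # toy data dimensionality
layers = [3, 2]                          # transformations per layer

# Each node is an affine map (A, b); random but well-conditioned here.
params = [[(rng.normal(size=(d, d)) + 2 * np.eye(d), rng.normal(size=d))
           for _ in range(k)] for k in layers]

def sample(n):
    """Pick one transformation per layer uniformly at random and push a
    standard normal sample through the composed affine map."""
    z = rng.normal(size=(n, d))
    for layer in params:
        choices = rng.integers(len(layer), size=n)
        z = np.stack([layer[c][0] @ v + layer[c][1]
                      for c, v in zip(choices, z)])
    return z

def log_density(x):
    """Exact log p(x): the Deep GMM collapses to a shallow GMM with one
    Gaussian component per path, whose mean/covariance come from
    composing the affine maps along that path."""
    comps = []
    for path in product(*[range(k) for k in layers]):
        A, b = np.eye(d), np.zeros(d)
        for layer, c in zip(params, path):
            A, b = layer[c][0] @ A, layer[c][0] @ b + layer[c][1]
        comps.append(multivariate_normal.logpdf(x, mean=b, cov=A @ A.T))
    return np.logaddexp.reduce(comps) - np.log(len(comps))

print(log_density(sample(5)))            # log-likelihoods of 5 samples
```

The number of equivalent shallow components grows as the product of the layer widths, which is where the parameter sharing of the deep parametrization pays off.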
Parallel one-versus-rest SVM training on the GPU
Linear SVMs are a popular choice of binary classifier. On a multiclass dataset it is often necessary to train many different classifiers in a one-versus-rest fashion, and to do so for several values of the regularization constant. We propose to harness GPU parallelism by training as many classifiers as possible at the same time. We optimize the primal L2-loss SVM objective using the conjugate gradient method, with an adapted backtracking line search strategy. We compare our approach to liblinear and achieve speedups of up to 17 times on our available hardware.
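A sketch of the batching idea in PyTorch: every one-versus-rest classifier, for every class and every value of C, is a column of one weight matrix, so a single matrix product over the data updates them all simultaneously. Plain gradient descent is substituted here for the paper's conjugate gradient method with backtracking line search, and all sizes and hyperparameters are illustrative.

```python
import torch

def train_ovr_l2svm(X, y, num_classes, Cs, steps=500, lr=1e-4):
    """Train one-versus-rest L2-loss linear SVMs for every class and every
    regularization value C at once: all classifiers live in one (d, K*R)
    weight matrix, so each pass over X updates them all in parallel."""
    n, d = X.shape
    K, R = num_classes, len(Cs)
    Y = -torch.ones(n, K, device=X.device)
    Y[torch.arange(n), y] = 1.0                      # targets in {-1, +1}
    Yt = Y.repeat(1, R)                              # (n, K*R)
    C = torch.tensor(Cs, device=X.device).repeat_interleave(K)  # (K*R,)
    W = torch.zeros(d, K * R, device=X.device)
    for _ in range(steps):
        # hinge residuals max(0, 1 - y * w.x) for every classifier at once
        m = torch.clamp(1.0 - Yt * (X @ W), min=0.0)  # (n, K*R)
        # gradient of 0.5*||w||^2 + C * sum_i max(0, 1 - y_i w.x_i)^2
        W -= lr * (W - 2.0 * C * (X.T @ (Yt * m)))
    return W         # column r*K + k: classifier for class k with C=Cs[r]

device = "cuda" if torch.cuda.is_available() else "cpu"
X = torch.randn(1000, 20, device=device)
y = torch.randint(0, 5, (1000,), device=device)
W = train_ovr_l2svm(X, y, num_classes=5, Cs=[0.01, 0.1, 1.0])
```

The point of the layout is that the GPU sees one large dense problem instead of K*R small ones, which is what makes the parallel training worthwhile.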
Maximizing CNN Accelerator Efficiency Through Resource Partitioning
Convolutional neural networks (CNNs) are revolutionizing machine learning, but they present significant computational challenges. Recently, many FPGA-based accelerators have been proposed to improve the performance and efficiency of CNNs. Current approaches construct a single processor that computes the CNN layers one at a time; the processor is optimized to maximize the throughput at which the collection of layers is computed. However, this approach leads to inefficient designs, because the same processor structure is used to compute CNN layers of radically varying dimensions.
We present a new CNN accelerator paradigm and an accompanying automated design methodology that partitions the available FPGA resources into multiple processors, each of which is tailored for a different subset of the CNN convolutional layers. Using the same FPGA resources as a single large processor, multiple smaller specialized processors increase computational efficiency and lead to a higher overall throughput. Our design methodology achieves 3.8x higher throughput than the state-of-the-art approach when evaluating the popular AlexNet CNN on a Xilinx Virtex-7 FPGA. For the more recent SqueezeNet and GoogLeNet, the speedups are 2.2x and 2.0x, respectively.
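The flavor of the partitioning problem can be illustrated with a toy search in Python: enumerate contiguous assignments of layers to processors under a crude, assumed cost model (operations per DSP per cycle, with DSPs split in proportion to each partition's work) and keep the assignment with the smallest pipeline bottleneck. This is only a sketch of the search space; it is not the paper's cost model or design methodology.

```python
from itertools import combinations

def best_partition(layer_ops, total_dsp, num_procs):
    """Toy search over contiguous partitions of CNN layers among
    num_procs processors. Hypothetical cost model: a processor with
    `dsp` units sustains `dsp` ops per cycle on its layers, and the
    pipeline throughput is limited by the slowest partition."""
    n = len(layer_ops)
    best = None
    for cuts in combinations(range(1, n), num_procs - 1):
        bounds = [0, *cuts, n]
        parts = [layer_ops[a:b] for a, b in zip(bounds, bounds[1:])]
        work = [sum(p) for p in parts]
        # split DSPs in proportion to each partition's share of the work
        dsp = [max(1, round(total_dsp * w / sum(work))) for w in work]
        latency = max(w / d for w, d in zip(work, dsp))
        if best is None or latency < best[0]:
            best = (latency, bounds)
    return best  # (bottleneck latency in cycles, partition boundaries)

# Hypothetical per-layer op counts (arbitrary units, purely illustrative)
print(best_partition([105, 224, 150, 112, 75], total_dsp=2800, num_procs=3))
```

Even this crude model shows why specialization helps: a single processor must pay the worst-case structure for every layer, whereas partitions can each match their own layers' dimensions.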
Deep content-based music recommendation
Automatic music recommendation has become an increasingly relevant problem in recent years, since a lot of music is now sold and consumed digitally. Most recommender systems rely on collaborative filtering. However, this approach suffers from the cold start problem: it fails when no usage data is available, so it is not effective for recommending new and unpopular songs. In this paper, we propose to use a latent factor model for recommendation, and to predict the latent factors from music audio when they cannot be obtained from usage data. We compare a traditional approach using a bag-of-words representation of the audio signals with deep convolutional neural networks, and evaluate the predictions quantitatively and qualitatively on the Million Song Dataset. We show that using predicted latent factors produces sensible recommendations, despite the fact that there is a large semantic gap between the characteristics of a song that affect user preference and the corresponding audio signal. We also show that recent advances in deep learning translate very well to the music recommendation setting, with deep convolutional neural networks significantly outperforming the traditional approach.
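A minimal PyTorch sketch of the proposed pipeline, with assumed names, shapes, and a much smaller network than the paper's: item latent factors obtained from usage data (e.g. by weighted matrix factorization) serve as regression targets for a convolutional network over mel-spectrograms, and a cold-start song is scored by the dot product of a user's factor vector with the factors predicted from audio alone.

```python
import torch
import torch.nn as nn

# Assumed setup (illustrative): latent factors from usage data as targets,
# mel-spectrogram excerpts as inputs.
n_items, f = 128, 50
mel = torch.randn(n_items, 1, 128, 96)     # (items, channels, bins, frames)
item_factors = torch.randn(n_items, f)     # targets from usage data

net = nn.Sequential(                       # stand-in for the paper's CNN
    nn.Conv2d(1, 32, 4), nn.ReLU(), nn.MaxPool2d((2, 4)),
    nn.Conv2d(32, 64, 4), nn.ReLU(),
    nn.AdaptiveAvgPool2d(1), nn.Flatten(),
    nn.Linear(64, f),
)
opt = torch.optim.Adam(net.parameters())
mse = nn.MSELoss()
for _ in range(50):                        # regress audio -> latent factors
    opt.zero_grad()
    mse(net(mel), item_factors).backward()
    opt.step()

# Cold start: score a brand-new song for a user via the dot product of the
# user's factor vector with the factors predicted from the audio alone.
user_factors = torch.randn(f)
new_song = torch.randn(1, 1, 128, 96)
score = user_factors @ net(new_song).squeeze(0)
```

Because scoring only needs the predicted item factors, songs with no usage data at all can still be ranked for every user, which is exactly the cold-start case the abstract targets.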