
    A moment-matching metric for latent variable generative models

    It can be difficult to assess the quality of a fitted model when facing unsupervised learning problems. Latent variable models, such as variational autoencoders and Gaussian mixture models, are often trained with likelihood-based approaches. In the spirit of Goodhart's law, when a metric becomes a target it ceases to be a good metric, and therefore likelihood should not be used to assess the quality of fit of these models. The solution we propose is a new metric for model comparison or regularization that relies on moments. The idea is to measure the difference between the data moments and the model moments using a matrix norm, such as the Frobenius norm. We show how to use this new metric for model comparison and then for regularization. It is common to draw samples from the fitted distribution when evaluating latent variable models, and we show that our proposed metric is faster to compute and has a smaller variance than this alternative. We conclude the article with a proof of concept of both applications and a discussion of future work.
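    As a concrete illustration of the idea, the sketch below compares the first two empirical moments of a dataset against the closed-form moments of a fitted Gaussian mixture and measures the gaps with Frobenius-type norms. The function names, the choice of the first two moment orders, and the way the per-order gaps are combined are illustrative assumptions, not the paper's exact construction.

```python
import numpy as np

def gmm_moments(weights, means, covs):
    # Closed-form raw moments of a Gaussian mixture:
    #   E[X]     = sum_i w_i mu_i
    #   E[X X^T] = sum_i w_i (Sigma_i + mu_i mu_i^T)
    m1 = sum(w * mu for w, mu in zip(weights, means))
    m2 = sum(w * (S + np.outer(mu, mu)) for w, mu, S in zip(weights, means, covs))
    return m1, m2

def moment_metric(X, weights, means, covs):
    # Gap between empirical and model moments, measured with the
    # Euclidean/Frobenius norm; how to weight the orders is a design choice.
    m1, m2 = gmm_moments(weights, means, covs)
    gap1 = np.linalg.norm(X.mean(axis=0) - m1)
    gap2 = np.linalg.norm(X.T @ X / len(X) - m2, ord="fro")
    return gap1 + gap2
```

    Because the model moments are available in closed form here, no sampling from the fitted distribution is needed, which is the source of the speed and variance advantage the abstract describes.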

    Robust subspace learning for static and dynamic affect and behaviour modelling

    Machine analysis of human affect and behavior in naturalistic contexts has witnessed growing attention in the last decade from various disciplines ranging from social and cognitive sciences to machine learning and computer vision. Endowing machines with the ability to seamlessly detect, analyze, model, predict as well as simulate and synthesize manifestations of internal emotional and behavioral states in real-world data is deemed essential for the deployment of next-generation, emotionally and socially competent human-centered interfaces. In this thesis, we are primarily motivated by the problem of modeling, recognizing and predicting spontaneous expressions of non-verbal human affect and behavior manifested through either low-level facial attributes in static images or high-level semantic events in image sequences. Both visual data and annotations of naturalistic affect and behavior naturally contain noisy measurements of unbounded magnitude at random locations, commonly referred to as ‘outliers’. We present here machine learning methods that are robust to such gross, sparse noise.

    First, we deal with static analysis of face images, viewing the latter as a superposition of mutually incoherent, low-complexity components corresponding to facial attributes, such as facial identity, expressions and activation of atomic facial muscle actions. We develop a robust, discriminant dictionary learning framework to extract these components from grossly corrupted training data and combine it with sparse representation to recognize the associated attributes. We demonstrate that our framework can jointly address interrelated classification tasks such as face and facial expression recognition.

    Inspired by the well-documented importance of the temporal aspect in perceiving affect and behavior, we direct the bulk of our research efforts into continuous-time modeling of dimensional affect and social behavior. Having identified a gap in the literature, namely the lack of data containing annotations of social attitudes in continuous time and scale, we first curate a new audio-visual database of multi-party conversations from political debates, annotated frame-by-frame in terms of real-valued conflict intensity, and use it to conduct the first study on continuous-time conflict intensity estimation. Our experimental findings corroborate previous evidence indicating the inability of existing classifiers to capture the hidden temporal structures of affective and behavioral displays. We present a novel dynamic behavior analysis framework which models temporal dynamics in an explicit way, based on the natural assumption that continuous-time annotations of smoothly varying affect or behavior can be viewed as outputs of a low-complexity linear dynamical system when behavioral cues (features) act as system inputs. A novel robust structured rank minimization framework is proposed to estimate the system parameters in the presence of gross corruptions and partially missing data. Experiments on prediction of dimensional conflict and affect, as well as multi-object tracking from detection, validate the effectiveness of our predictive framework and demonstrate for the first time that complex human behavior and affect can be learned and predicted based on small training sets of person(s)-specific observations.
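    The dynamical assumption at the heart of that framework can be written out explicitly. In a standard discrete-time state-space form (generic notation assumed here, not necessarily the thesis's own):

```latex
x_{t+1} = A\,x_t + B\,u_t, \qquad y_t = C\,x_t + D\,u_t
```

    Here u_t collects the behavioral cues (features) acting as system inputs, y_t is the continuous-time annotation being predicted, and x_t is a hidden state whose small dimension encodes the "low complexity" of the system; the robust structured rank minimization estimates the system matrices (A, B, C, D) from grossly corrupted and partially missing observations.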

    Preconditioned Spectral Descent for Deep Learning

    Deep learning presents notorious computational challenges. These challenges include, but are not limited to, the non-convexity of learning objectives and estimating the quantities needed for optimization algorithms, such as gradients. While we do not address the non-convexity, we present an optimization solution that exploits the so far unused “geometry” in the objective function in order to best make use of the estimated gradients. Previous work attempted similar goals with preconditioned methods in the Euclidean space, such as L-BFGS, RMSprop, and Adagrad. In stark contrast, our approach combines a non-Euclidean gradient method with preconditioning. We provide evidence that this combination more accurately captures the geometry of the objective function compared to prior work. We theoretically formalize our arguments and derive novel preconditioned non-Euclidean algorithms. The results are promising in both computational time and quality when applied to Restricted Boltzmann Machines, Feedforward Neural Nets, and Convolutional Neural Nets.
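    For intuition, the sketch below shows a plain (unpreconditioned) spectral descent step: steepest descent with respect to the Schatten-infinity (spectral) norm replaces the gradient G = U S V^T by its sharpened version, the nuclear norm of G times U V^T, so all singular directions receive equal weight. The preconditioned variant is a hypothetical diagonal change-of-variables scheme included only to show where preconditioning enters; the paper's actual preconditioners differ.

```python
import numpy as np

def spectral_step(W, G, lr):
    # Steepest descent w.r.t. the spectral norm: the '#-operator' maps
    # G = U S V^T to ||G||_nuclear * U V^T (all singular values flattened).
    U, s, Vt = np.linalg.svd(G, full_matrices=False)
    return W - lr * s.sum() * (U @ Vt)

def preconditioned_spectral_step(W, G, d_left, d_right, lr):
    # Hypothetical diagonal preconditioning: rescale the gradient, take the
    # spectral step in the rescaled space, and map the direction back.
    Dl, Dr = np.diag(1.0 / d_left), np.diag(1.0 / d_right)
    U, s, Vt = np.linalg.svd(Dl @ G @ Dr, full_matrices=False)
    return W - lr * s.sum() * (Dl @ (U @ Vt) @ Dr)
```

    A full SVD per step is expensive for large weight matrices, so practical versions of such non-Euclidean updates typically rely on approximate or randomized factorizations.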

    Constructive Approximation and Learning by Greedy Algorithms

    This thesis develops several kernel-based greedy algorithms for different machine learning problems and analyzes their theoretical and empirical properties. Greedy approaches have been extensively used in the past for tackling problems in combinatorial optimization where finding even a feasible solution can be a computationally hard problem (i.e., not solvable in polynomial time). A key feature of greedy algorithms is that a solution is constructed recursively from the smallest constituent parts. In each step of the constructive process a component is added to the partial solution from the previous step and, thus, the size of the optimization problem is reduced. The selected components are given by optimization problems that are simpler and easier to solve than the original problem. As such schemes are typically fast at constructing a solution, they can be very effective on complex optimization problems where finding an optimal or good solution has a high computational cost. Moreover, greedy solutions are rather intuitive and the schemes themselves are simple to design and easy to implement. There is a large class of problems for which greedy schemes generate an optimal solution or a good approximation of the optimum.

    In the first part of the thesis, we develop two deterministic greedy algorithms for optimization problems in which a solution is given by a set of functions mapping an instance space to the space of reals. The first of the two approaches facilitates data understanding through interactive visualization by providing means for experts to incorporate their domain knowledge into otherwise static kernel principal component analysis. This is achieved by greedily constructing embedding directions that maximize the variance at data points (unexplained by the previously constructed embedding directions) while adhering to specified domain knowledge constraints. The second deterministic greedy approach is a supervised feature construction method capable of addressing the problem of kernel choice. The goal of the approach is to construct a feature representation for which a set of linear hypotheses is of sufficient capacity: large enough to contain a satisfactory solution to the considered problem and small enough to allow good generalization from a small number of training examples. The approach mimics functional gradient descent and constructs features by fitting squared error residuals. We show that the constructive process is consistent and provide conditions under which it converges to the optimal solution.

    In the second part of the thesis, we investigate two problems for which deterministic greedy schemes can fail to find an optimal solution or a good approximation of the optimum. This happens as a result of making a sequence of choices that take into account only the immediate reward, without considering the consequences for future decisions. To address this shortcoming of deterministic greedy schemes, we propose two efficient randomized greedy algorithms which are guaranteed to find effective solutions to the corresponding problems. In the first of the two approaches, we provide a means to scale kernel methods to problems with millions of instances. An approach frequently used in practice for this type of problem is the Nyström method for low-rank approximation of kernel matrices. A crucial step in this method is the choice of landmarks, which determine the quality of the approximation. We tackle this problem with a randomized greedy algorithm based on the K-means++ cluster seeding scheme and provide a theoretical and empirical study of its effectiveness. In the second problem for which a deterministic strategy can fail to find a good solution, the goal is to find a set of objects from a structured space that are likely to exhibit an unknown target property. This discrete optimization problem is of significant interest to cyclic discovery processes such as de novo drug design. We propose to address it with an adaptive Metropolis–Hastings approach that samples candidates from the posterior distribution of structures conditioned on them having the target property. The proposed constructive scheme defines a consistent random process, and our empirical evaluation demonstrates its effectiveness across several different application domains.
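    The landmark-selection step of the first randomized approach lends itself to a compact sketch. Below, landmarks are chosen by the K-means++ seeding rule (each new landmark is drawn with probability proportional to its squared distance from the nearest landmark chosen so far) and then plugged into the standard Nyström approximation; the function names and the generic `kernel` callable are illustrative assumptions.

```python
import numpy as np

def kmeanspp_landmarks(X, m, seed=None):
    # K-means++ seeding: sample each new landmark with probability
    # proportional to its squared distance to the nearest existing landmark.
    rng = np.random.default_rng(seed)
    idx = [rng.integers(len(X))]
    d2 = np.sum((X - X[idx[0]]) ** 2, axis=1)
    for _ in range(m - 1):
        idx.append(rng.choice(len(X), p=d2 / d2.sum()))
        d2 = np.minimum(d2, np.sum((X - X[idx[-1]]) ** 2, axis=1))
    return np.asarray(idx)

def nystrom(X, m, kernel, seed=None):
    # Rank-m Nystrom approximation K ~= C W^+ C^T built from the landmarks.
    idx = kmeanspp_landmarks(X, m, seed)
    C = kernel(X, X[idx])        # n x m cross-kernel block
    W = kernel(X[idx], X[idx])   # m x m landmark block
    return C, np.linalg.pinv(W)  # evaluate as C @ W_pinv @ C.T
```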

    Tensor Networks for Dimensionality Reduction and Large-Scale Optimizations. Part 2 Applications and Future Perspectives

    Part 2 of this monograph builds on the introduction to tensor networks and their operations presented in Part 1. It focuses on tensor network models for super-compressed higher-order representation of data/parameters and related cost functions, while providing an outline of their applications in machine learning and data analytics. A particular emphasis is on the tensor train (TT) and Hierarchical Tucker (HT) decompositions, and their physically meaningful interpretations which reflect the scalability of the tensor network approach. Through a graphical approach, we also elucidate how, by virtue of the underlying low-rank tensor approximations and sophisticated contractions of core tensors, tensor networks have the ability to perform distributed computations on otherwise prohibitively large volumes of data/parameters, thereby alleviating or even eliminating the curse of dimensionality. The usefulness of this concept is illustrated over a number of applied areas, including generalized regression and classification (support tensor machines, canonical correlation analysis, higher order partial least squares), generalized eigenvalue decomposition, Riemannian optimization, and in the optimization of deep neural networks. Part 1 and Part 2 of this work can be used either as stand-alone separate texts, or indeed as a conjoint comprehensive review of the exciting field of low-rank tensor networks and tensor decompositions. (232 pages)
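    To make the tensor train (TT) format concrete, the sketch below implements the standard TT-SVD scheme with a fixed maximum rank: the tensor is repeatedly matricized and a truncated SVD splits off one three-way core at a time. It is a minimal illustration, not the monograph's tooling, and it uses a hard rank cap rather than an error-based truncation.

```python
import numpy as np

def tt_svd(T, max_rank):
    # Decompose a d-way tensor into TT cores G_k of shape (r_{k-1}, n_k, r_k)
    # so that T[i1,...,id] ~= G_1[:, i1, :] @ G_2[:, i2, :] @ ... @ G_d[:, id, :].
    dims, cores, r_prev = T.shape, [], 1
    C = T.reshape(dims[0], -1)
    for k in range(len(dims) - 1):
        C = C.reshape(r_prev * dims[k], -1)
        U, s, Vt = np.linalg.svd(C, full_matrices=False)
        r = min(max_rank, len(s))              # truncate to the rank cap
        cores.append(U[:, :r].reshape(r_prev, dims[k], r))
        C = s[:r, None] * Vt[:r]               # remainder carried forward
        r_prev = r
    cores.append(C.reshape(r_prev, dims[-1], 1))
    return cores
```

    Storage drops from the product of all mode sizes to a sum of small core sizes, which is the "super-compression" the text refers to whenever the TT-ranks stay small.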

    On the Statistical Approximation of Conditional Expectation Operators

    This dissertation discusses the data-driven approximation of the so-called conditional expectation operator, which describes the expected value of a real-valued transformation of a random variable conditioned on a second random variable. It presents this classical numerical problem in a new theoretical context and examines it using various selected methods from modern statistical learning theory. Both a well-known parametric projection approach from the numerical domain and a nonparametric model based on a reproducing kernel Hilbert space are investigated. The investigations of this work are motivated by the special case in which the conditional expectation operator describes the transition probabilities of a Markov process. In this context, the spectral properties of the resulting Markov transition operator are of great practical interest for the data-based study of complex dynamics. The presented estimators are used in practice in this scenario. Various new convergence and approximation results are shown for both stochastically independent and dependent data. Concepts from the theories of inverse problems, weakly dependent stochastic processes, spectral perturbation, and concentration of measure serve as tools for these results. For the theoretical justification of the nonparametric model, the estimation of kernel autocovariance operators of stationary time series is investigated. This consideration can additionally be used in a variety of ways in other contexts, which is demonstrated in terms of new results on the consistency of kernel-based principal component analysis with weakly dependent data. This dissertation is theoretical in nature and does not serve to directly implement new numerical methods. It does, however, provide a direct link from known approaches in this field to relevant statistical work from recent years, which will make both stronger theoretical results and more efficient practical estimators for this problem possible in the future.
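    A minimal nonparametric estimator of the conditional expectation can be written down in a few lines. The sketch below uses the standard kernel-ridge (reproducing kernel Hilbert space) form k_x^T (K + n lam I)^{-1} f(y) to estimate E[f(Y) | X = x] from samples; the Gaussian kernel, its bandwidth, and the function names are illustrative choices, not the dissertation's exact setup.

```python
import numpy as np

def rbf(A, B, gamma=1.0):
    # Gaussian RBF kernel matrix between the rows of A and the rows of B.
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * d2)

def conditional_expectation(X, Y, f, x_query, lam=1e-3, gamma=1.0):
    # Kernel-ridge estimate of E[f(Y) | X = x]: solve (K + n*lam*I) a = f(y),
    # then evaluate k_x^T a at the query points.
    n = len(X)
    K = rbf(X, X, gamma)
    alpha = np.linalg.solve(K + n * lam * np.eye(n), f(Y))
    return rbf(x_query, X, gamma) @ alpha

# Illustrative check: with Y = X^2 + noise, E[Y | X = 1] should be close to 1.
rng = np.random.default_rng(0)
X = rng.uniform(-2, 2, size=(500, 1))
Y = X[:, 0] ** 2 + 0.1 * rng.normal(size=500)
print(conditional_expectation(X, Y, lambda y: y, np.array([[1.0]])))
```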

    Simultaneous subspace clustering and cluster number estimating based on triplet relationship

    In this paper we propose a unified framework that simultaneously discovers the number of clusters and groups the data points into different clusters using subspace clustering. Real data distributed in a high-dimensional space can be disentangled into a union of low-dimensional subspaces, which can benefit various applications. To explore such intrinsic structure, state-of-the-art subspace clustering approaches often optimize a self-representation problem among all samples to construct a pairwise affinity graph for spectral clustering. However, a graph with pairwise similarities lacks robustness for segmentation, especially for samples which lie on the intersection of two subspaces. To address this problem, we design a hyper-correlation-based data structure termed the triplet relationship, which reveals high relevance and local compactness among three samples. The triplet relationship can be derived from the self-representation matrix and utilized to iteratively assign the data points to clusters. Based on the triplet relationship, we propose a unified optimization scheme to automatically calculate clustering assignments. Specifically, we optimize a model selection reward and a fusion reward by simultaneously maximizing the similarity of triplets from different clusters while minimizing the correlation of triplets from the same cluster. The proposed algorithm also automatically reveals the number of clusters and fuses groups to avoid over-segmentation. Extensive experimental results on both synthetic and real-world datasets validate the effectiveness and robustness of the proposed method.
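    The pipeline the abstract builds on can be sketched compactly: solve a self-representation problem to obtain the matrix the triplets are read from, then pair each sample with its strongest neighbours in the symmetrized affinity. The ridge-regularized least-squares representation, the zeroed diagonal as a stand-in for a diag(C) = 0 constraint, and the top-2 neighbour rule are all simplifying assumptions; the paper's exact triplet definition and reward optimization are more involved.

```python
import numpy as np

def self_representation(X, lam=1e-2):
    # Express each column of X (one sample per column) as a linear
    # combination of the others: C = argmin ||X - XC||^2 + lam ||C||^2.
    n = X.shape[1]
    C = np.linalg.solve(X.T @ X + lam * np.eye(n), X.T @ X)
    np.fill_diagonal(C, 0.0)  # crude surrogate for the diag(C) = 0 constraint
    return C

def triplets(C):
    # Form one triplet per sample from its two strongest neighbours in the
    # symmetrized affinity |C| + |C|^T (a simplified triplet rule).
    A = np.abs(C) + np.abs(C).T
    nbrs = np.argsort(-A, axis=1)[:, :2]
    return [(i, int(nbrs[i, 0]), int(nbrs[i, 1])) for i in range(len(A))]
```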