183 research outputs found

    Deep Structured Layers for Instance-Level Optimization in 2D and 3D Vision

    The approach we present in this thesis is that of integrating optimization problems as layers in deep neural networks. Optimization-based modeling provides an additional set of tools enabling the design of powerful neural networks for a wide battery of computer vision tasks. This thesis shows formulations and experiments for vision tasks ranging from image reconstruction to 3D reconstruction. We first propose an unrolled optimization method with implicit regularization properties for reconstructing images from noisy camera readings. The method resembles an unrolled majorization minimization framework with convolutional neural networks acting as regularizers. We report state-of-the-art performance in image reconstruction on both noisy and noise-free evaluation setups across many datasets. We further focus on the task of monocular 3D reconstruction of articulated objects using video self-supervision. The proposed method uses a structured layer for accurate object deformation that controls a 3D surface by displacing a small number of learnable handles. While relying on a small set of training data per category for self-supervision, the method obtains state-of-the-art reconstruction accuracy with diverse shapes and viewpoints for multiple articulated objects. We finally address the shortcomings of the previous method that revolve around regressing the camera pose using multiple hypotheses. We propose a method that recovers a 3D shape from a 2D image by relying solely on 3D-2D correspondences regressed from a convolutional neural network. These correspondences are used in conjunction with an optimization problem to estimate per sample the camera pose and deformation. We quantitatively show the effectiveness of the proposed method on self-supervised 3D reconstruction on multiple categories without the need for multiple hypotheses

    International Conference on Continuous Optimization (ICCOPT) 2019 Conference Book

    The Sixth International Conference on Continuous Optimization took place on the campus of the Technical University of Berlin, August 3-8, 2019. The ICCOPT is a flagship conference of the Mathematical Optimization Society (MOS), organized every three years. ICCOPT 2019 was hosted by the Weierstrass Institute for Applied Analysis and Stochastics (WIAS) Berlin. It included a Summer School and a Conference with a series of plenary and semi-plenary talks, organized and contributed sessions, and poster sessions. This book comprises the full conference program. It contains, in particular, the scientific program in survey style as well as with all details, and information on the social program, the venue, special meetings, and more

    Weakly-Labeled Data and Identity-Normalization for Facial Image Analysis

    SUMMARY This thesis addresses the improvement of facial recognition and facial expression analysis using weak sources of information. Labeled data is often scarce, but unlabeled data often contains information that is useful for learning a model. This thesis describes two examples of using this idea. The first is a new method for facial recognition based on exploiting weakly or noisily labeled data. Unlabeled data can be acquired in a way that provides additional features. These features, while not available for the labeled data, can still be useful with some foresight. This thesis discusses combining a labeled facial recognition dataset with face images extracted from YouTube videos and face images obtained from a search engine. The web search engine and the video search engine can be viewed as very weak alternative classifiers that provide weak labels. Using the results of these two types of search queries as different forms of weak labels, a robust classification method can be developed. This method is based on graphical models but also incorporates a probabilistic margin. More specifically, using a model inspired by the variational relevance vector machine (RVM), a probabilistic alternative to the support vector machine (SVM) is developed. In contrast to previous formulations of the RVM, an exponential prior is introduced to produce an approximation to the L1 penalty.
Experimental results in which noisy labels are simulated, and two separate experiments in which noisy labels come from image and video search results using names as queries, indicate that the weak information in the labels can be exploited successfully. Since the model depends heavily on sparse kernel regression methods, these methods are reviewed and discussed in detail. Several different algorithms that use prior distributions to encourage sparse models are described in detail. Experiments are shown that illustrate the behavior of each of these distributions. Used in conjunction with logistic regression, the effects of each distribution on model fit and model complexity are shown. Extensions to other machine learning methods are straightforward, since the approach is grounded in Bayesian probability. An experiment in structured prediction using a conditional random field for a medical imaging task is shown to illustrate how these prior distributions can easily be incorporated into other tasks and can yield better results. Labeled data may also contain weak sources of information that are not necessarily used to maximum effect. For example, facial image datasets for tasks such as performance-driven facial animation, emotion recognition, and facial key-point or landmark prediction often contain labels other than those for the main task of interest. In emotion recognition data, for example, emotion labels are often scarce. This may be because the images are extracted from a video in which only a small segment depicts the labeled emotion.
As a result, many images of the subject in the same setting, taken with the same camera, go unused. However, this data can be used to improve the ability of learning techniques to generalize to new, unseen people by explicitly modeling previously seen variations related to identity and expression. Once identity and expression variation are separated, simple supervised approaches can generalize better to new identities. More specifically, in this thesis, probabilistic modeling of these sources of variation is used to identity-normalize various facial image representations. A variety of experiments are described in which performance is consistently improved, including emotion recognition, markerless performance-driven facial animation, and facial key-point tracking. In many cases involving facial images, additional sources of information may be available that can be used to improve the tasks of interest. These include weak labels provided during data collection, such as the search query used to acquire the data, as well as identity information in the case of many experimental image databases. This thesis argues principally that this information should be used and describes methods for doing so using the tools of probability.
ABSTRACT This thesis deals with improving facial recognition and facial expression analysis using weak sources of information. Labeled data is often scarce, but unlabeled data often contains information which is helpful to learning a model. This thesis describes two examples of using this insight. 
The first is a novel method for face recognition based on leveraging weakly or noisily labeled data. Unlabeled data can be acquired in a way which provides additional features. These features, while not being available for the labeled data, may still be useful with some foresight. This thesis discusses combining a labeled facial recognition dataset with face images extracted from videos on YouTube and face images returned from using a search engine. The web search engine and the video search engine can be viewed as very weak alternative classifiers which provide “weak labels.” Using the results from these two different types of search queries as forms of weak labels, a robust method for classification can be developed. This method is based on graphical models, but also incorporates a probabilistic margin. More specifically, using a model inspired by the variational relevance vector machine (RVM), a probabilistic alternative to transductive support vector machines (TSVM) is further developed. In contrast to previous formulations of RVMs, the choice of an Exponential hyperprior is introduced to produce an approximation to the L1 penalty. Experimental results where noisy labels are simulated and separate experiments where noisy labels from image and video search results using names as queries both indicate that weak label information can be successfully leveraged. Since the model depends heavily on sparse kernel regression methods, these methods are reviewed and discussed in detail. Several different sparse prior algorithms are described in detail. Experiments are shown which illustrate the behavior of each of these sparse priors. Used in conjunction with logistic regression, each sparsity-inducing prior is shown to have varying effects in terms of sparsity and model fit. Extending this to other machine learning methods is straightforward since it is grounded firmly in Bayesian probability. 
An experiment in structured prediction using Conditional Random Fields on a medical image task is shown to illustrate how sparse priors can easily be incorporated in other tasks, and can yield improved results. Labeled data may also contain weak sources of information that may not necessarily be used to maximum effect. For example, facial image datasets for the tasks of performance driven facial animation, emotion recognition, and facial key-point or landmark prediction often contain alternative labels from the task at hand. In emotion recognition data, for example, emotion labels are often scarce. This may be because these images are extracted from a video, in which only a small segment depicts the emotion label. As a result, many images of the subject in the same setting using the same camera are unused. However, this data can be used to improve the ability of learning techniques to generalize to new and unseen individuals by explicitly modeling previously seen variations related to identity and expression. Once identity and expression variation are separated, simpler supervised approaches can work quite well to generalize to unseen subjects. More specifically, in this thesis, probabilistic modeling of these sources of variation is used to “identity-normalize” various facial image representations. A variety of experiments are described in which performance on emotion recognition, markerless performance-driven facial animation and facial key-point tracking is consistently improved. This includes an algorithm which shows how this kind of normalization can be used for facial key-point localization. In many cases in facial images, sources of information may be available that can be used to improve tasks. This includes weak labels which are provided during data gathering, such as the search query used to acquire data, as well as identity information in the case of many experimental image databases. 
This thesis argues in the main that this information should be used and describes methods for doing so using the tools of probability.
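
The link between an exponential hyperprior and the L1 penalty mentioned in this abstract can be illustrated with a standard scale-mixture identity. The abstract does not give the thesis's parameterization, so the sketch below rests on assumptions: the symbols $w_i$ (a weight), $\tau_i$ (its prior variance), and $\lambda$ (a rate parameter) are chosen here for illustration, and the exponential is placed on the variance. The thesis describes its own construction only as an approximation to the L1 penalty, so it may differ from this exact form.

```latex
% Assumed hierarchy (illustrative, not taken from the thesis):
%   w_i | tau_i ~ N(0, tau_i),   tau_i ~ Exponential(rate = lambda^2 / 2)
\begin{align}
p(w_i)
  &= \int_0^\infty \frac{1}{\sqrt{2\pi\tau_i}}
     \exp\!\left(-\frac{w_i^2}{2\tau_i}\right)
     \frac{\lambda^2}{2}\exp\!\left(-\frac{\lambda^2\tau_i}{2}\right) d\tau_i
   = \frac{\lambda}{2}\exp\!\left(-\lambda\,|w_i|\right), \\
-\log p(\mathbf{w})
  &= \lambda \sum_i |w_i| + \text{const}.
\end{align}
```

Under these assumptions the marginal prior on each weight is Laplace, so maximizing the posterior adds the penalty $\lambda \sum_i |w_i|$ to the data-fit term, which is the L1 penalty.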

    Sparse and Deep Representations for Face Recognition and Object Detection

    Face recognition and object detection are two very fundamental visual recognition applications in computer vision. How to learn “good” feature representations using machine learning has become the cornerstone of perception-based systems. A good feature representation is often the one that is robust and discriminative to multiple instances of the same category. Starting from features such as intensity, histogram, etc. in the image, followed by hand-crafted features, to the most recent sophisticated deep feature representations, we have witnessed the remarkable improvement in the ability of a feature learning algorithm to perform pattern recognition tasks such as face recognition and object detection. One of the conventional feature learning methods, dictionary learning, has been proposed to learn discriminative and sparse representations for visual recognition. These dictionary learning methods can learn both representative and discriminative dictionaries, and the associated sparse representations are effective for vision tasks such as face recognition. More recently, deep features have been widely adopted by the computer vision community owing to the powerful deep neural network, which is capable of distilling information from high dimensional input spaces to a low dimensional semantic space. The research problems which comprise this dissertation lie at the intersection of conventional feature and deep feature learning approaches. Thus, in this dissertation, we study both sparse and deep representations for face recognition and object detection. First, we begin by studying the topic of sparse representations. We present a simple thresholded feature learning algorithm under sparse support recovery. We show that under certain conditions, the thresholded feature exactly recovers the nonzero support of the sparse code. 
Secondly, based on the theoretical guarantees, we derive the model and algorithm named Dictionary Learning for Thresholded Features (DLTF), to learn the dictionary that is optimized for the thresholded feature. The DLTF dictionaries are specifically designed for using the thresholded feature at inference, which prioritize simplicity, efficiency, general usability and theoretical guarantees. Both synthetic simulations and real-data experiments (i.e. image clustering and unsupervised hashing) verify the competitive quantitative results and remarkable efficiency of applying thresholded features with DLTF dictionaries. Continuing our focus on investigating the sparse representation and its application to computer vision tasks, we address the sparse representations for unconstrained face verification/recognition problem. In the first part, we address the video-based face recognition problem since it brings more challenges due to the fact that the videos are often acquired under significant variations in poses, expressions, lighting conditions and backgrounds. In order to extract representations that are robust to these variations, we propose a structured dictionary learning framework. Specifically, we employ dictionary learning and low-rank approximation methods to preserve the invariant structure of face images in videos. The learned structured dictionary is both discriminative and reconstructive. We demonstrate the effectiveness of our approach through extensive experiments on three video-based face recognition datasets. Recently, template-based face verification has gained more popularity. Unlike traditional verification tasks, which evaluate on image-to-image or video-to-video pairs, template-based face verification/recognition methods can exploit training and/or gallery data containing a mixture of both images or videos from the person of interest. In the second part, we propose a regularized sparse coding approach for template-based face verification. 
First, we construct a reference dictionary, which represents the training set. Then we learn the discriminative sparse codes of the templates for verification through the proposed template regularized sparse coding approach. Finally, we measure the similarity between templates. However, in real world scenarios, training and test data are sampled from different distributions. Therefore, we also extend the dictionary learning techniques to tackle the domain adaptation problem, where the data from the training set (source domain) and test set (target domain) have different underlying distributions (domain shift). We propose a domain-adaptive dictionary learning framework to model the domain shift by generating a set of intermediate domains. These intermediate domains bridge the gap between the source and target domains. Specifically, we not only learn a common dictionary to encode the domain-shared features but also learn a set of domain specific dictionaries to model the domain shift. This separation enables us to learn more compact and reconstructive dictionaries for domain adaptation. The domain-adaptive features for recognition are finally derived by aligning all the recovered feature representations of both source and target along the domain path. We evaluate our approach on both cross-domain face recognition and object classification tasks. Finally, we study another fundamental problem in computer vision: generic object detection. Object detection has become one of the most valuable pattern recognition tasks, with great benefits in scene understanding, face recognition, action recognition, robotics and self-driving vehicles, etc. We propose a novel object detector named "Deep Regionlets" by blending deep learning and the traditional regionlet method. 
The proposed framework "Deep Regionlets" is able to address the limitations of traditional regionlet methods, leading to significant precision improvement by exploiting the power of deep convolutional neural networks. Furthermore, we conduct a detailed analysis of our approach to understand its merits and properties. Extensive experiments on two detection benchmark datasets show that the proposed deep regionlet approach outperforms several state-of-the-art competitors
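
The claim in this abstract that a thresholded feature can exactly recover the nonzero support of a sparse code can be illustrated with a minimal sketch. Everything concrete below is an assumption made for illustration: the abstract does not state its recovery conditions, so the sketch uses the simplest case of an orthonormal dictionary (a scaled Hadamard matrix), a soft-threshold with a hand-picked level `tau`, and hypothetical helper names. It is not the DLTF algorithm itself.

```python
import math

def hadamard(n):
    """Sylvester construction of an n x n Hadamard matrix (n a power of 2)."""
    H = [[1.0]]
    while len(H) < n:
        H = [row + row for row in H] + [row + [-v for v in row] for row in H]
    return H

def make_dictionary(n):
    """Assumed dictionary: Hadamard matrix scaled so its columns are orthonormal."""
    scale = 1.0 / math.sqrt(n)
    return [[scale * v for v in row] for row in hadamard(n)]

def synthesize(D, z):
    """Build the signal x = D z from the sparse code z."""
    return [sum(D[i][k] * z[k] for k in range(len(z))) for i in range(len(D))]

def thresholded_feature(D, x, tau):
    """Soft-threshold the correlations D^T x at level tau."""
    n_atoms = len(D[0])
    corr = [sum(D[i][k] * x[i] for i in range(len(D))) for k in range(n_atoms)]
    return [math.copysign(max(abs(c) - tau, 0.0), c) for c in corr]

def support(v):
    """Indices of the nonzero entries of v."""
    return {k for k, val in enumerate(v) if val != 0.0}

D = make_dictionary(8)
z = [0.0, 2.0, 0.0, 0.0, -1.5, 0.0, 0.0, 1.0]  # sparse code with 3 nonzeros
x = synthesize(D, z)
f = thresholded_feature(D, x, tau=0.5)          # tau is below the smallest |z_k|
recovered = support(f)
print(sorted(recovered))  # [1, 4, 7]
```

With orthonormal atoms the correlations equal the code up to rounding, so any threshold between the rounding noise and the smallest nonzero magnitude recovers the support. For overcomplete dictionaries, recovery would depend on conditions such as low coherence between atoms, which this sketch does not model.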

    Quantum-Inspired Machine Learning: a Survey

    Quantum-inspired Machine Learning (QiML) is a burgeoning field, receiving global attention from researchers for its potential to leverage principles of quantum mechanics within classical computational frameworks. However, current review literature often presents a superficial exploration of QiML, focusing instead on the broader Quantum Machine Learning (QML) field. In response to this gap, this survey provides an integrated and comprehensive examination of QiML, exploring QiML's diverse research domains including tensor network simulations, dequantized algorithms, and others, showcasing recent advancements, practical applications, and illuminating potential future research avenues. Further, a concrete definition of QiML is established by analyzing various prior interpretations of the term and their inherent ambiguities. As QiML continues to evolve, we anticipate a wealth of future developments drawing from quantum mechanics, quantum computing, and classical machine learning, enriching the field further. This survey serves as a guide for researchers and practitioners alike, providing a holistic understanding of QiML's current landscape and future directions. Comment: 56 pages, 13 figures, 8 tables

    Towards Generalized Frameworks for Object Recognition

    Over the past few years, deep convolutional neural network (DCNN) based approaches have been immensely successful in tackling a diverse range of object recognition problems. Popular DCNN architectures like deep residual networks (ResNets) are highly generic, not just for classification, but also for high level tasks like detection/tracking which rely on classification DCNNs as their backbone. The generality of DCNNs, however, does not extend to image-to-image (Im2Im) regression tasks (e.g., super-resolution, denoising, RGB-to-depth, relighting, etc.). For such tasks, DCNNs are often highly task-specific and require specific ancillary post-processing methods. The major issue plaguing the design of generic architectures for such tasks is the tradeoff between context/locality given a fixed computation/memory budget. We first present a generic DCNN architecture for Im2Im regression that can be trained end-to-end without any further machinery. Our proposed architecture, the Recursively Branched Deconvolutional Network (RBDN), features a cheap early multi-context image representation, an efficient recursive branching scheme with extensive parameter sharing, and learnable upsampling. We provide qualitative/quantitative results on 3 diverse tasks: relighting, denoising and colorization, and show that our proposed RBDN architecture obtains comparable results to the state-of-the-art on each of these tasks when used off-the-shelf without any post processing or task-specific architectural modifications. Second, we focus on gradient flow and optimization in ResNets. In particular, we theoretically analyze why pre-activation (v2) ResNets outperform the original ResNets (v1) on CIFAR datasets but not on ImageNet. Our analysis reveals that although v1-ResNets lack ensembling properties, they can have a higher effective depth in comparison to v2-ResNets. 
Subsequently, we show that downsampling projections (while only few in number) have a significantly detrimental effect on performance. We show that by simply replacing downsampling-projections with identity-like dense-reshape shortcuts, the classification results of standard residual architectures like ResNets, ResNeXts and SE-Nets improve by up to 1.2% on ImageNet, without any increase in computational complexity (FLOPs). Finally, we present a robust non-parametric probabilistic ensemble method for multi-classification, which outperforms the state-of-the-art ensemble methods on several machine learning and computer vision datasets for object recognition with statistically significant improvements. The approach is particularly geared towards multi-classification problems with very low training data and/or a fairly high proportion of outliers, for which training end-to-end DCNNs is not very beneficial