
    Domain Adaptation and Privileged Information for Visual Recognition

    The automatic identification of entities such as objects, people, or their actions in visual data, such as images or video, has improved significantly and is now being deployed in access control, social media, online retail, autonomous vehicles, and several other applications. This visual recognition capability relies on supervised learning techniques, which require large amounts of labeled training data from the target distribution, representative of the particular task at hand. However, collecting such training data might be expensive, require too much time, or even be impossible. In this work, we introduce several novel approaches that aim to compensate for the lack of target training data. Rather than leveraging prior knowledge to build task-specific models, which are typically easier to train, we focus on developing general visual recognition techniques, where the notion of prior knowledge is better identified by additional information available during training. Depending on the nature of such information, the learning problem may turn into domain adaptation (DA), domain generalization (DG), learning using privileged information (LUPI), or domain adaptation with privileged information (DAPI).
    When some target data samples are available and additional information in the form of labeled data from a different source is also available, the learning problem becomes domain adaptation. Unlike previous DA work, we introduce two novel approaches for the few-shot learning scenario, which require only very few labeled target samples; even a single one can be very effective. The first method exploits a Siamese deep neural network architecture for learning an embedding where visual categories from the source and target distributions are semantically aligned and yet maximally separated. The second approach instead extends adversarial learning to simultaneously maximize the confusion between source and target domains while achieving semantic alignment.
    In the complete absence of target data, several cheaply available source datasets related to the target distribution can be leveraged as additional information for learning a task. This is the domain generalization setting. We introduce the first deep learning approach to address the DG problem, by extending a Siamese network architecture for learning a representation of visual categories that is invariant with respect to the sources, while imposing semantic alignment and class separation to maximize generalization performance on unseen target domains.
    There are situations in which target data for training might come equipped with additional information that can be modeled as an auxiliary view of the data, and that unfortunately is not available during testing. This is the LUPI scenario. We introduce a novel framework based on the information bottleneck that leverages the auxiliary view to improve the performance of visual classifiers. The formulation is general, in the sense that it can be used with any visual classifier.
    Finally, when the available target data is unlabeled, and there is closely related labeled source data that is also equipped with an auxiliary view as additional information, we pose the question of how to leverage the source data views to train visual classifiers for unseen target data. This is the DAPI scenario. We extend the LUPI framework based on the information bottleneck to learn visual classifiers in DAPI settings and show that privileged information can be leveraged to improve learning on new domains. The novel DAPI framework is also general and can be used with any visual classifier.
    Every use of auxiliary information has been validated extensively on publicly available benchmark datasets, and several new state-of-the-art accuracy values have been set. Examples of application domains include visual object recognition from RGB images and from depth data, handwritten digit recognition, and gesture recognition from video
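
    The idea of an embedding in which source and target categories are "semantically aligned and yet maximally separated" can be illustrated with a contrastive-style pairwise loss. The sketch below is only an assumption about the general shape of such an objective, not the thesis's actual formulation: the function names, the squared-distance form, and the `margin` parameter are all hypothetical. Same-class source/target pairs are pulled together, and different-class pairs are pushed at least `margin` apart.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two equal-length embedding vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def alignment_separation_loss(source, target, margin=1.0):
    """Average contrastive-style loss over all source/target pairs.

    source, target: lists of (embedding, label) tuples.
    margin: hypothetical minimum distance wanted between different classes.
    """
    total, count = 0.0, 0
    for emb_s, label_s in source:
        for emb_t, label_t in target:
            d = euclidean(emb_s, emb_t)
            if label_s == label_t:
                # Semantic alignment: same class, pull embeddings together.
                total += 0.5 * d ** 2
            else:
                # Separation: different class, penalise only if closer than margin.
                total += 0.5 * max(0.0, margin - d) ** 2
            count += 1
    return total / count if count else 0.0
```

    Because the loss is defined over source/target pairs, even a single labeled target sample contributes one pair per source sample, which is one plausible reading of why such pairing suits the few-shot setting.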

    IST Austria Thesis

    The human ability to recognize objects in complex scenes has driven research in the computer vision field for a couple of decades. This thesis focuses on the object recognition task in images. That is, given an image, we want the computer system to be able to predict the class of the object that appears in it. A recent successful attempt to bridge the semantic understanding of images by humans and by computers uses attribute-based models. Attributes are semantic properties of objects that are shared across different categories and that both humans and computers can decide on. To explore attribute-based models we take a statistical machine learning approach and address two key learning challenges for the object recognition task: learning augmented attributes as a mid-level discriminative feature representation, and learning with attributes as privileged information. Our main contributions are parametric and non-parametric models and algorithms that address these challenges. In the parametric approach, we explore an autoencoder model combined with the large margin nearest neighbor principle for mid-level feature learning, and linear support vector machines for learning with privileged information. In the non-parametric approach, we propose a supervised Indian Buffet Process for automatic augmentation of semantic attributes, and explore the Gaussian Processes classification framework for learning with privileged information. A thorough experimental analysis shows the effectiveness of the proposed models in both parametric and non-parametric views

    Multi-task CNN Model for Attribute Prediction

    This paper proposes a joint multi-task learning algorithm to better predict attributes in images using deep convolutional neural networks (CNN). We consider learning binary semantic attributes through a multi-task CNN model, where each CNN will predict one binary attribute. The multi-task learning allows CNN models to simultaneously share visual knowledge among different attribute categories. Each CNN will generate attribute-specific feature representations, and then we apply multi-task learning on the features to predict their attributes. In our multi-task framework, we propose a method to decompose the overall model's parameters into a latent task matrix and a combination matrix. Furthermore, under-sampled classifiers can leverage shared statistics from other classifiers to improve their performance. Natural grouping of attributes is applied such that attributes in the same group are encouraged to share more knowledge. Meanwhile, attributes in different groups will generally compete with each other, and consequently share less knowledge. We show the effectiveness of our method on two popular attribute datasets. Comment: 11 pages, 3 figures, ieee transaction pape
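
    The decomposition of the model's parameters into a latent task matrix and a combination matrix can be pictured as a matrix factorisation. The sketch below is a minimal illustration under assumptions that are not stated in the abstract: the overall weight matrix is taken to be the product `latent @ combination`, each column of the product is one attribute classifier, and a positive score means the attribute is present. All function and variable names are hypothetical.

```python
def matmul(a, b):
    """Multiply matrix a (n x k) by matrix b (k x m), both lists of rows."""
    return [
        [sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
        for i in range(len(a))
    ]


def task_weights(latent, combination, task):
    """Weight vector of one attribute classifier.

    latent: d x k matrix of shared latent tasks (assumed layout).
    combination: k x T matrix; column `task` says how that attribute
    mixes the latent tasks (assumed layout).
    """
    weights = matmul(latent, combination)
    return [row[task] for row in weights]


def predict_attribute(features, latent, combination, task):
    """Linear score for one binary attribute; True if the score is positive."""
    w = task_weights(latent, combination, task)
    score = sum(f * wi for f, wi in zip(features, w))
    return score > 0
```

    Under this reading, two attributes whose combination columns put weight on the same latent tasks share parameters, while attributes that use disjoint latent tasks share nothing, which is one way the grouping behaviour described above could arise.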

    Deep spiking neural networks with applications to human gesture recognition

    Spiking neural networks (SNNs), the third generation of artificial neural networks (ANNs), are a class of event-driven neuromorphic algorithms that potentially have a wide range of application domains and are applicable to a variety of extremely low-power neuromorphic hardware. The work presented in this thesis addresses the challenges of human gesture recognition using novel SNN algorithms. It discusses the design of these algorithms for human gesture recognition in both the visual and auditory domains, as well as event-based pre-processing toolkits for audio signals. On the visual side, a novel SNN-based, event-driven hand gesture recognition system is proposed. This system is shown to be effective in a hand gesture recognition experiment with its spiking recurrent convolutional neural network (SCRNN) design, which combines a purpose-designed convolution operation with recurrent connectivity to maintain spatial and temporal relations in address-event-representation (AER) data. The proposed SCRNN architecture can achieve arbitrary temporal resolution, which means it can exploit temporal correlations between event collections. This design uses a backpropagation-based training algorithm and does not suffer from vanishing or exploding gradients. On the audio side, a novel end-to-end spiking speech emotion recognition (SER) system is proposed. This system employs MFCC as its main speech feature extractor, together with a self-designed latency coding algorithm that efficiently converts the raw signal to AER input that can be used by an SNN. A two-layer spiking recurrent architecture is proposed to address temporal correlations between spike trains. The robustness of this system is supported by experiments on several public datasets, which demonstrate state-of-the-art recognition accuracy and a significant reduction in network size, computational cost, and training time.
    In addition to directly contributing to neuromorphic SER, this thesis proposes a novel speech-coding algorithm based on the working mechanism of the human auditory organ system. The algorithm mimics the functionality of the cochlea and provides an alternative method of event-data acquisition for audio-based data. The algorithm is then further simplified and extended into a speech-enhancement application that is used jointly with the proposed SER system. This speech-enhancement method uses the lateral inhibition mechanism as a frequency coincidence detector to remove uncorrelated noise in the time-frequency spectrum. Experiments show the method to be effective for up to six types of noise.
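
    Latency coding, which the abstract uses to turn speech features into AER input, generally maps a stronger feature value to an earlier spike. The thesis describes its coder as self-designed and gives no details here, so the sketch below shows only the generic textbook form under stated assumptions: min-max normalisation over one frame, a linear value-to-time mapping, and a hypothetical time window `t_max`. The function names are illustrative and do not come from the thesis.

```python
def latency_encode(values, t_max=10.0):
    """Map each feature value to a spike time in [0, t_max].

    Larger values spike earlier. Values are min-max normalised within the
    frame; a constant frame maps every channel to t_max (an assumption).
    """
    lo, hi = min(values), max(values)
    span = hi - lo
    times = []
    for v in values:
        norm = (v - lo) / span if span > 0 else 0.0
        times.append(t_max * (1.0 - norm))
    return times


def to_address_events(values, t_max=10.0):
    """Return (spike_time, channel_address) events sorted by time."""
    times = latency_encode(values, t_max)
    return sorted((t, channel) for channel, t in enumerate(times))
```

    Each event carries only a time and a channel address, which is the essence of an address-event representation: the feature magnitude is carried by when the channel fires rather than by a stored value.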

    Semantic Attributes for Transfer Learning in Visual Recognition

    Driven by the success of deep learning methods, artificial intelligence has made considerable progress in machine understanding. However, thousands of manually annotated training examples are required to ensure the generalization ability of such models. Moreover, the model has to be retrained from scratch every time it is applied to a new problem class. This in turn means that the very costly process of collecting and annotating training data must be repeated, which considerably limits the scalability of such models. We humans, on the other hand, do not tackle new tasks in isolation; we have the remarkable ability to draw on previously acquired knowledge when solving new problems. This ability is referred to as transfer learning. It enables us to learn new things faster, better, and from only very few examples. There is therefore great interest in imitating this ability with algorithms, especially in areas where training data is very scarce or not available at all. In this thesis we investigate transfer learning in the context of computer vision. In particular, we study how visual recognition (e.g., object or action classification) can be performed when only few or no training examples exist. A promising solution in this direction is the framework of semantic attributes, in which visual categories are described in terms of attributes such as color, pattern, and shape. These attributes can be learned from a disjoint set of training examples. Since the attributes have a dual interpretation, both visual and semantic, language can be used effectively to guide the transfer process. This means that models for a new visual category can be built from its linguistic description alone, by selecting relevant attributes and transferring them to the new category. This process, however, removes the need for training images entirely. In this thesis we present new solutions for modeling semantic attributes, transferring them, associating them automatically with visual categories, and recognizing them from linguistic descriptions. To this end, we examine attribute-based recognition from the following four perspectives:
    1) Unlike the common model, in which attributes must be learned globally, we present a hierarchical approach that makes it possible to learn attributes at different levels of abstraction. We also show how the structure among categories can be exploited effectively to guide the learning and transfer process and thus build discriminative models for new categories. In a thorough experimental analysis, we demonstrate a clear improvement of our model over the global approach, especially for the recognition of fine-grained categories.
    2) In prevailing attribute-based transfer approaches, the user supervises the mapping between attributes and categories. In this thesis we propose to establish the connection between the two automatically and without user intervention. Our model captures the semantic relations that couple attributes with objects in order to predict their associations and to select, without supervision, which attributes should be transferred.
    3) We circumvent the need for a predefined vocabulary of attributes. Instead, we propose to use encyclopedia articles, which describe object categories in free text, to automatically discover a set of discriminative, salient, and diverse attributes. Removing the need for a user-defined vocabulary allows us to fully exploit the potential of attribute-based models in the context of very large datasets.
    4) We present a novel real-world application of semantic attributes. We propose the first method that automatically learns fashion styles and predicts how their popularity will develop in the near future. We show that semantic attributes yield interpretable fashion styles and lead to better prediction of the popularity of visual styles compared with other representations
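
    The claim that a model for a new category can be built from its description alone can be made concrete with a small zero-shot example. The sketch below is a generic attribute-based classifier, not the thesis's method: it assumes attribute detectors (trained elsewhere on a disjoint set of categories) output independent probabilities, and that each unseen category is described by a binary attribute signature. The category names, attribute order, and function names are all hypothetical.

```python
def class_score(attr_probs, signature):
    """Likelihood of a category given predicted attribute probabilities.

    attr_probs: probability that each attribute is present in the image.
    signature: 1/0 per attribute, taken from the category's description.
    Attributes are treated as independent (a simplifying assumption).
    """
    score = 1.0
    for p, present in zip(attr_probs, signature):
        score *= p if present else (1.0 - p)
    return score


def zero_shot_classify(attr_probs, signatures):
    """Return the category whose attribute signature best matches the image."""
    return max(signatures, key=lambda name: class_score(attr_probs, signatures[name]))


# Hypothetical attribute order: [striped, four_legged, lives_in_water]
signatures = {
    "zebra": [1, 1, 0],
    "whale": [0, 0, 1],
}
prediction = zero_shot_classify([0.9, 0.8, 0.1], signatures)
```

    No training image of either category is used at classification time; only the signatures are needed, which is why replacing a hand-written vocabulary with attributes mined from text, as in point 3, would extend the same scheme to many more categories.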

    To Make Their Journey Back To Nature: Zoo Captivity and Post / Humanism

    Humans detain Other species in zoos for various publicly stated reasons. In this way, zoo captives are entangled within human rationalities. The Adelaide Zoo’s mission statement expresses utopian ideals of conservation and connection, which it aims to achieve through combating the dystopian realities of Other species extinction and human separation from nature. This binary is formulated through the western belief system of humanism, a dualistic rationality, which is used by the zoo to explain its purpose. In this dualism, humans are superior, in control, and progressing, while all Other matter/s are ranked, and compartmentalised, as lesser. Based on fieldwork at the Adelaide Zoo from July 2011 to December 2013, my research examines the representations of zooing and claims of control and progress. To research western humanism, I employ four key theorists: Val Plumwood, an eco-feminist philosopher; Tim Ingold, a relational anthropologist (my categorisation); Karen Barad, a feminist physicist; and Gregory Bateson, an eco-anthropologist-scientist. These theorists utilise relational, connected systems to explain western miscalculations / conceptions about reality, that is, a posthuman perspective. I also draw on the work of a number of other pertinent scholars. Western humanism organises zooing, but zoo-workers (paid and unpaid) demonstrated relational comprehensions about life at the zoo, for themselves and captives. The captives also demonstrated divergent becomings from those officially narrated. My Baradian intra-actions with captives, including Bears, Superb Lyrebirds, a Cassowary, South American Primates, Lions and Tigers, and one sonically-beautiful little brown bird, are all documented here. I attended the Adelaide Zoo as a volunteer and ethnographer, enjoying the experience of observing hands-on keeping, and participating in volunteering. My research included observing, participating, interviewing and hanging out with people, as well as burrowing through the current and historical trails of official and unofficial narratives, within which the multi-non-linear connections between past, present, and hoped-for-future zooing emerged. The disparity between humanist and posthumanist perspectives is foundational to understanding zoo conservation efforts, and how people are entangled in zooing. A key contention developed within the thesis is the tension between the representation/s of zooing and the experienced life emergences of zooing; that is, the possibility of performing zooing alternately. At the zoo, representations are utilised to explain zooing to the populace, where representations allow the appearance of movement, or change, in the stasis that is captivity. By reading each key theorist through my research, the mechanisms of western humanism clearly cohered into a perceivable, yet questionable, rationale. I argue that western dualistic essentialism, as a pathway to knowledge, has overshadowed, but not abolished, relational thinking, where connectivity and recursivity (that is, response ability) better explain matter/s reality. In this thesis I articulate humanism to humanists, by making perceptible what often goes without thought, the normative values and beliefs of my culture, along with the mechanisms by which these are implicated in the processes and practices of zooing.
    Thesis (Ph.D.) -- University of Adelaide, School of Social Sciences, 202