
    Learning Features by Watching Objects Move

    Full text link
    This paper presents a novel yet intuitive approach to unsupervised feature learning. Inspired by the human visual system, we explore whether low-level motion-based grouping cues can be used to learn an effective visual representation. Specifically, we use unsupervised motion-based segmentation on videos to obtain segments, which we use as 'pseudo ground truth' to train a convolutional network to segment objects from a single frame. Given the extensive evidence that motion plays a key role in the development of the human visual system, we hope that this straightforward approach to unsupervised learning will be more effective than cleverly designed 'pretext' tasks studied in the literature. Indeed, our extensive experiments show that this is the case. When used for transfer learning on object detection, our representation significantly outperforms previous unsupervised approaches across multiple settings, especially when training data for the target task is scarce. Comment: CVPR 2017
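A minimal sketch of the training setup the abstract describes, using a motion-derived mask as the pseudo ground truth for a single-frame segmentation network; seg_net, optimizer, frame and motion_mask are hypothetical placeholders, not the authors' released code:

```python
import torch.nn.functional as F

def train_step(seg_net, optimizer, frame, motion_mask):
    """One step of pseudo-supervised training: the input is a single RGB frame,
    the target is the unsupervised motion-based segment mask ('pseudo ground truth')."""
    optimizer.zero_grad()
    logits = seg_net(frame)                        # (B, 2, H, W) foreground/background scores
    loss = F.cross_entropy(logits, motion_mask)    # motion_mask: (B, H, W) with values {0, 1}
    loss.backward()
    optimizer.step()
    return loss.item()
```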

    Deep learning for large-scale fine-grained recognition of cars

    Get PDF
    Deep learning (DL) is widely used nowadays, with several applications in image classification and object detection. Among many of these applications is the use of Convolutional Neural Networks (CNNs), whose operation is: for a given input (image) and output (label/class), generate representations that define and allow distinguishing different kinds of objects. Neural networks are computationally demanding, taking hours to train. Convolutional Neural Networks are even more demanding since their input data are usually images – a rich data type that holds a lot of information. The fast evolution in Computer Vision using deep learning techniques, together with growing computing power, has only recently made it possible to train CNNs that classify images with high precision. On car classifieds websites, images are one of the most important types of content. However, until today, little knowledge/metadata has been produced from such images. In order to insert an advert on the platform, the user must upload an image of the car for sale and fill in a number of fields, among them the vehicle category, the color of the car, and its respective make, model and version. In this dissertation, CNNs are used for the recognition of the make, model and version of cars, where transfer learning and fine-tuning are the two approaches used for transferring the knowledge learned in one task and adapting it to another. We extend the work to also validate the efficacy of these neural networks on the tasks of vehicle category and car color recognition. We intend to validate how CNNs behave in these different tasks. Approaches like background removal and data augmentation are explored for reducing overfitting. We collected one of the largest datasets to date for the task of make, model and version recognition of cars, composed of 1.2 million images belonging to 790 labels. The results obtained in the scope of this dissertation set a new state-of-the-art performance for this type of task (accuracy of 92.7% with an ensemble method), considering the number of classes to classify and the number of images used. This demonstrates the efficacy of recent advances in CNN architectures for fine-grained classification, where intra-class variation is small and viewpoint variation is high, when a large-scale dataset is used.
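A minimal sketch of the transfer-learning and fine-tuning recipe the abstract refers to, assuming an ImageNet-pretrained torchvision backbone and the 790-label make/model/version task; this is illustrative, not the dissertation's actual pipeline:

```python
import torch.nn as nn
from torchvision import models

NUM_CLASSES = 790  # make/model/version labels, as reported in the abstract

# Transfer learning: start from ImageNet weights and replace the classifier head.
model = models.resnet50(weights=models.ResNet50_Weights.IMAGENET1K_V1)
model.fc = nn.Linear(model.fc.in_features, NUM_CLASSES)

# Fine-tuning: freeze the pretrained trunk and train only the new head first ...
for param in model.parameters():
    param.requires_grad = False
for param in model.fc.parameters():
    param.requires_grad = True
# ... then optionally unfreeze later blocks and continue with a small learning rate.
```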

    Grounding deep models of visual data

    Get PDF
    Deep models are state-of-the-art for many computer vision tasks, including object classification, action recognition, and captioning. As Artificial Intelligence systems that utilize deep models are becoming ubiquitous, it is also becoming crucial to explain why they make certain decisions: grounding model decisions. In this thesis, we study: 1) Improving Model Classification. We show that by utilizing web action images along with videos in training for action recognition, significant performance boosts of convolutional models can be achieved. Without explicit grounding, labeled web action images tend to contain discriminative action poses, which highlight discriminative portions of a video's temporal progression. 2) Spatial Grounding. We visualize spatial evidence of deep model predictions using a discriminative top-down attention mechanism, called Excitation Backprop. We show how such visualizations are equally informative for correct and incorrect model predictions, and highlight the shift of focus when different training strategies are adopted. 3) Spatial Grounding for Improving Model Classification at Training Time. We propose a guided dropout regularizer for deep networks based on the evidence of a network prediction. This approach penalizes neurons that are most relevant for model prediction. By dropping such high-saliency neurons, the network is forced to learn alternative paths in order to maintain loss minimization. We demonstrate better generalization ability, an increased utilization of network neurons, and a higher resilience to network compression. 4) Spatial Grounding for Improving Model Classification at Test Time. We propose Guided Zoom, an approach that utilizes spatial grounding to make more informed predictions at test time. Guided Zoom compares the evidence used to make a preliminary decision with the evidence of correctly classified training examples to ensure evidence-prediction consistency, and otherwise refines the prediction. We demonstrate accuracy gains for fine-grained classification. 5) Spatiotemporal Grounding. We devise a formulation that simultaneously grounds evidence in space and time, in a single pass, using top-down saliency. We visualize the spatiotemporal cues that contribute to a deep recurrent neural network's classification/captioning output. Based on these spatiotemporal cues, we are able to localize segments within a video that correspond to a specific action, or a phrase from a caption, without explicitly optimizing/training for these tasks.
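A hedged sketch of the guided-dropout idea in point 3: find the feature units most relevant to the current prediction and drop them, forcing the network to learn alternative evidence paths. Relevance here is approximated by gradient magnitude rather than the thesis' Excitation Backprop:

```python
import torch

def guided_dropout(features, loss, drop_ratio=0.1):
    """Zero out the highest-saliency feature units for this batch.
    Saliency is approximated by |d loss / d features| (a stand-in for
    the top-down attention mechanism used in the thesis)."""
    grads = torch.autograd.grad(loss, features, retain_graph=True)[0]
    saliency = grads.abs().flatten(1)               # (B, N)
    k = max(1, int(drop_ratio * saliency.shape[1]))
    topk = saliency.topk(k, dim=1).indices          # indices of the most relevant units
    mask = torch.ones_like(saliency)
    mask.scatter_(1, topk, 0.0)                     # drop the high-saliency units
    return features * mask.view_as(features)
```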

    Semantic Attributes for Transfer Learning in Visual Recognition

    Get PDF
    Driven by the success of deep learning methods, considerable progress in machine understanding has been made in artificial intelligence. However, thousands of manually annotated training examples are strictly required to ensure the generalization ability of such models. Moreover, the model must be retrained from scratch every time it is applied to a new class of problems. This in turn means that the very costly process of collecting and annotating training data has to be repeated, which severely limits the scalability of such models. We humans, on the other hand, do not tackle new tasks in isolation, but have the remarkable ability to draw on previously acquired knowledge when solving new problems. This ability is called transfer learning. It allows us to learn new things faster, better, and from only very few examples. There is therefore great interest in mimicking this ability with algorithms, especially in domains where training data is very scarce or even unavailable. In this thesis we study transfer learning in the context of computer vision. In particular, we investigate how visual recognition (e.g., object or action classification) can be performed when only few or no training examples exist. A promising solution in this direction is the framework of semantic attributes, in which visual categories are described in terms of attributes such as color, pattern, and shape. These attributes can be learned from a disjoint set of training examples. Since the attributes have a dual, i.e. both visual and semantic, interpretation, language can be used effectively to steer the transfer process. This means that models for a new visual category can be built from its linguistic description alone, by selecting the relevant attributes and transferring them to the new category; the need for training images is thereby removed entirely. In this thesis we present new solutions for modeling semantic attributes, transferring them, associating them automatically with visual categories, and recognizing them from linguistic descriptions. To this end, we examine attribute-based recognition from the following four viewpoints: 1) Unlike the common model, in which attributes must be learned globally, we introduce a hierarchical approach that allows attributes to be learned at different levels of abstraction. We also show how the structure between categories can be exploited effectively to steer the learning and transfer process and thus build discriminative models for new categories. With a thorough experimental analysis we demonstrate a clear improvement of our model over the global approach, especially for recognizing fine-grained categories. 2) In prevailing attribute-based transfer approaches, the user supervises the association between attributes and categories. In this work we propose to establish the link between the two automatically and without user intervention. Our model captures the semantic relations that couple attributes to objects in order to predict their associations and to select, in an unsupervised manner, which attributes to transfer. 3) We circumvent the need for a predefined vocabulary of attributes. Instead, we propose to use encyclopedia articles that describe object categories in free text in order to automatically discover a set of discriminative, salient, and diverse attributes. Removing the need for a user-defined vocabulary allows us to fully exploit the potential of attribute-based models at very large scale. 4) We present a novel real-world application of semantic attributes: the first method that automatically learns fashion styles and predicts how their popularity will evolve in the near future. We show that semantic attributes yield interpretable fashion styles and lead to better predictions of the popularity of visual styles compared with other representations.
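A minimal sketch of attribute-based zero-shot transfer in the spirit described above: attribute classifiers are trained on seen categories, and a new category is recognised purely from its semantic attribute signature (the signatures and probabilities below are made-up illustrations):

```python
import numpy as np

# Per-class attribute signatures, e.g. derived from a linguistic description (hypothetical values).
class_signatures = {
    "zebra":   np.array([1, 1, 0]),   # striped, four-legged, not red
    "ladybug": np.array([0, 0, 1]),   # not striped, not four-legged, red
}

def classify_zero_shot(attribute_probs):
    """attribute_probs: predicted probability of each attribute for one image,
    produced by attribute classifiers trained on a disjoint set of seen classes."""
    scores = {c: np.prod(np.where(sig == 1, attribute_probs, 1 - attribute_probs))
              for c, sig in class_signatures.items()}
    return max(scores, key=scores.get)

print(classify_zero_shot(np.array([0.9, 0.8, 0.1])))  # -> "zebra", with no zebra training images
```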

    Person Re-Identification by Deep Joint Learning of Multi-Loss Classification

    Full text link
    Existing person re-identification (re-id) methods rely mostly on either localised or global feature representations alone. This ignores their joint benefit and mutual complementary effects. In this work, we show the advantages of jointly learning local and global features in a Convolutional Neural Network (CNN) by aiming to discover correlated local and global features in different contexts. Specifically, we formulate a method for joint learning of local and global feature selection losses designed to optimise person re-id when using only generic matching metrics such as the L2 distance. We design a novel CNN architecture for Jointly Learning Multi-Loss (JLML) classification, in which local and global discriminative features are optimised concurrently subject to the same re-id label information. Extensive comparative evaluations demonstrate the advantages of this new JLML model for person re-id over a wide range of state-of-the-art re-id methods on five benchmarks (VIPeR, GRID, CUHK01, CUHK03, Market-1501). Comment: Accepted by IJCAI 2017
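A hedged sketch of the joint local/global multi-loss idea: two branches share one backbone, each branch gets its own classification loss on the same identity labels, and the pooled features can be compared with a plain L2 distance at test time. This is a simplification, not the exact JLML architecture:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class JointLocalGlobal(nn.Module):
    def __init__(self, backbone, feat_dim, num_ids, num_stripes=4):
        super().__init__()
        self.backbone = backbone                             # shared conv trunk -> (B, C, H, W)
        self.num_stripes = num_stripes
        self.global_head = nn.Linear(feat_dim, num_ids)
        self.local_head = nn.Linear(feat_dim * num_stripes, num_ids)

    def forward(self, x):
        fmap = self.backbone(x)                              # (B, C, H, W)
        global_feat = fmap.mean(dim=(2, 3))                  # global average pooling
        stripes = fmap.chunk(self.num_stripes, dim=2)        # horizontal body stripes
        local_feat = torch.cat([s.mean(dim=(2, 3)) for s in stripes], dim=1)
        return global_feat, local_feat

def multi_loss(model, x, labels):
    g, l = model(x)
    # Two classification losses, optimised jointly on the same re-id labels.
    return (F.cross_entropy(model.global_head(g), labels) +
            F.cross_entropy(model.local_head(l), labels))
```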

    An overview of mixing augmentation methods and augmentation strategies

    Full text link
    Deep Convolutional Neural Networks have made incredible progress in many Computer Vision tasks. This progress, however, often relies on the availability of large amounts of training data, required to prevent over-fitting, which in many domains entails a significant cost of manual data labeling. An alternative approach is the application of data augmentation (DA) techniques that aim at model regularization by creating additional observations from the available ones. This survey focuses on two DA research streams: image mixing and automated selection of augmentation strategies. First, the presented methods are briefly described, and then qualitatively compared with respect to their key characteristics. Various quantitative comparisons are also included, based on the results reported in the recent DA literature. This review mainly covers methods published in the proceedings of top-tier conferences and in leading journals in the years 2017-2021.
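One of the image-mixing methods this survey stream covers is mixup-style blending; a minimal sketch, assuming a standard PyTorch batch of images and integer class labels:

```python
import torch
import torch.nn.functional as F

def mixup_batch(images, labels, num_classes, alpha=0.2):
    """Create additional training observations by convexly combining pairs of
    images and their one-hot labels with a Beta-sampled mixing coefficient."""
    lam = torch.distributions.Beta(alpha, alpha).sample().item()
    perm = torch.randperm(images.size(0))
    mixed_images = lam * images + (1 - lam) * images[perm]
    one_hot = F.one_hot(labels, num_classes).float()
    mixed_labels = lam * one_hot + (1 - lam) * one_hot[perm]
    return mixed_images, mixed_labels
```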

    Recent Advances of Local Mechanisms in Computer Vision: A Survey and Outlook of Recent Work

    Full text link
    Inspired by the fact that human brains can emphasize discriminative parts of the input and suppress irrelevant ones, a substantial number of local mechanisms have been designed to boost the development of computer vision. They can not only focus on target parts to learn discriminative local representations, but also process information selectively to improve efficiency. Local mechanisms have different characteristics depending on the application scenario and paradigm. In this survey, we provide a systematic review of local mechanisms for various computer vision tasks and approaches, including fine-grained visual recognition, person re-identification, few-/zero-shot learning, multi-modal learning, self-supervised learning, Vision Transformers, and so on. A categorization of the local mechanisms in each field is summarized. Then, the advantages and disadvantages of each category are analyzed in depth, leaving room for exploration. Finally, future research directions for local mechanisms that may benefit future work are also discussed. To the best of our knowledge, this is the first survey of local mechanisms in computer vision. We hope that this survey can shed light on future research in the computer vision field.
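A minimal sketch of the kind of local mechanism the survey is about: a simple spatial attention gate that re-weights feature-map locations so a network can emphasise discriminative parts and suppress irrelevant ones (illustrative only, not any specific surveyed method):

```python
import torch
import torch.nn as nn

class SpatialAttention(nn.Module):
    """Predict a per-location weight in [0, 1] and re-weight the feature map."""
    def __init__(self, channels):
        super().__init__()
        self.score = nn.Conv2d(channels, 1, kernel_size=1)

    def forward(self, fmap):                        # fmap: (B, C, H, W)
        attn = torch.sigmoid(self.score(fmap))      # (B, 1, H, W), high = discriminative location
        return fmap * attn                          # irrelevant locations are suppressed
```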