
    Unsupervised learning of object landmarks by factorized spatial embeddings

    Automatically learning the structure of object categories remains an important open problem in computer vision. In this paper, we propose a novel unsupervised approach that can discover and learn landmarks in object categories, thus characterizing their structure. Our approach is based on factorizing image deformations, as induced by a viewpoint change or an object deformation, by learning a deep neural network that detects landmarks consistently with such visual effects. Furthermore, we show that the learned landmarks establish meaningful correspondences between different object instances in a category without having to impose this requirement explicitly. We assess the method qualitatively on a variety of object types, natural and man-made. We also show that our unsupervised landmarks are highly predictive of manually-annotated landmarks in face benchmark datasets, and can be used to regress these with a high degree of accuracy. Comment: To be published in ICCV 2017.
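
    The consistency requirement described in the abstract can be written as an equivariance constraint: if a warp g (a viewpoint change or object deformation) is applied to the image, the detected landmarks should move by exactly that warp. Below is a minimal PyTorch-style sketch of such a loss, not the authors' implementation; `net` is a hypothetical detector returning K (x, y) landmark coordinates per image, and `sample_warp` is a hypothetical sampler returning a warp object that can transform both images and point coordinates.

        import torch

        def equivariance_loss(net, images, sample_warp):
            """Penalise landmarks that fail to move consistently with an image warp."""
            g = sample_warp()                              # random deformation g (hypothetical API)
            points = net(images)                           # (B, K, 2) landmarks on x
            points_on_warped = net(g.warp_image(images))   # landmarks detected on g(x)
            # Detection should commute with the warp: net(g(x)) == g(net(x))
            return torch.mean((points_on_warped - g.warp_points(points)) ** 2)

    In practice a separation or diversity term is usually added alongside such a loss, since all K landmarks tracking the same equivariant point would otherwise be a valid solution.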

    Weakly- and Semi-Supervised Panoptic Segmentation

    We present a weakly supervised model that jointly performs both semantic and instance segmentation -- a particularly relevant problem given the substantial cost of obtaining pixel-perfect annotation for these tasks. In contrast to many popular instance segmentation approaches based on object detectors, our method does not predict any overlapping instances. Moreover, we are able to segment both "thing" and "stuff" classes, and thus explain all the pixels in the image. "Thing" classes are weakly supervised with bounding boxes, and "stuff" classes with image-level tags. We obtain state-of-the-art results on Pascal VOC for both full and weak supervision (the latter achieving about 95% of fully-supervised performance). Furthermore, we present the first weakly-supervised results on Cityscapes for both semantic and instance segmentation. Finally, we use our weakly supervised framework to analyse the relationship between annotation quality and predictive performance, which is of interest to dataset creators. Comment: ECCV 2018. The first two authors contributed equally.
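
    As an aside on how box-level weak supervision can drive a per-pixel loss: one common recipe (an illustrative sketch, not necessarily this paper's exact scheme) turns the boxes into a pseudo ground-truth map, claiming each box interior for its class and marking pixels where boxes of different classes overlap as ignore, so the training loss skips them.

        import numpy as np

        IGNORE = 255   # label value skipped by the training loss

        def boxes_to_pseudo_labels(height, width, boxes):
            """boxes: iterable of (class_id, x0, y0, x1, y1); class ids in 1..254."""
            label = np.zeros((height, width), dtype=np.uint8)   # 0 = unclaimed/background
            for cls, x0, y0, x1, y1 in boxes:
                region = label[y0:y1, x0:x1]                    # view into `label`
                conflict = (region != 0) & (region != cls)      # claimed by another class
                region[region == 0] = cls                       # claim unlabelled pixels
                region[conflict] = IGNORE                       # ambiguous: do not train here
            return label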

    Object Detection in Data Acquired From Aerial Devices

    The object detection task, both in images and in videos, has seen extraordinary advances, with state-of-the-art architectures achieving close to perfect precision on large modern datasets. Since these models are trained on large-scale datasets, most of them can adapt to almost any other real-world scenario if given enough data. Nevertheless, there is a specific scenario, aerial images, in which these models tend to perform worse due to the natural characteristics of such imagery. The main difference between typical object detection datasets and aerial object detection datasets is the scale of the objects that need to be located and identified. Moreover, factors such as the image's brightness, object rotation and detail, and background colours also play a crucial role in the model's performance, no matter its architecture. Deep learning models make decisions based on the features they can extract from the training data. This works particularly well in standard scenarios, where images portray the object at a scale at which its details are clear and allow the model to distinguish it from other objects and the background. However, when the image is captured from 50 meters above, the object's details diminish considerably, making it harder for deep learning models to extract meaningful features for identifying and localizing the object. Nowadays, many surveillance systems use static cameras placed in pre-defined locations; however, a more appropriate approach for some scenarios would be to use drones that surveil a particular area along a specific route. Such surveillance is adequate for scenarios where covering the whole area with static cameras is not feasible, such as wild forests.
    The first objective of this dissertation is to gather a dataset that focuses on detecting people and vehicles in wild-forest scenarios. The dataset was captured using a DJI drone in four distinct zones of Serra da Estrela. It contains instances captured under different weather conditions – sunny and foggy – and during different parts of the day – morning, afternoon and evening. In addition, it covers four different types of terrain (earth, tar, forest, and gravel) and two classes of objects (person and vehicle).
    The second objective of this dissertation is to analyze precisely how state-of-the-art single-frame and video object detectors perform on the previously described dataset. The analysis focuses on each model's performance for each object class in every terrain. With this, we can show the exact situations in which the different models stand out and those in which they tend to perform the worst.
    Finally, we propose two methods based on the results obtained during the first phase of experiments, each aiming to solve a different problem that emerged from applying state-of-the-art models to aerial images. The first method aims to improve the performance of the video object detector models in certain situations by using background removal algorithms to delineate specific areas in which the detectors' predictions are considered valid (a sketch of this idea is given after the abstract). One of the main problems with creating a high-quality dataset from scratch is the intensive and time-consuming annotation process that follows data gathering.
    Regarding this, the second method we propose consists of a self-supervised architecture that aims to tackle the particular scarcity of high-quality aerial datasets. The main idea is to analyze the usefulness of unlabelled data in these problems and thus avoid the immensely time-consuming process of labelling the entirety of a full-scale aerial dataset. The reported results show that, even with only a partially labelled dataset, it is possible to use the unlabelled data in a self-supervised manner to further improve the model's performance.
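
    A minimal sketch of the first method's idea, assuming OpenCV: run a background-subtraction algorithm over the video to delineate foreground regions, then treat a detection as valid only if enough of its box overlaps the foreground mask. The MOG2 parameters and the overlap threshold here are illustrative choices, not the dissertation's settings.

        import cv2

        subtractor = cv2.createBackgroundSubtractorMOG2(history=500, detectShadows=False)

        def filter_detections(frame, detections, min_overlap=0.2):
            """detections: list of (x0, y0, x1, y1, score); returns the valid subset."""
            fg_mask = subtractor.apply(frame) > 0               # per-pixel foreground flags
            kept = []
            for x0, y0, x1, y1, score in detections:
                inside = fg_mask[y0:y1, x0:x1]
                # Keep the detection only if enough of the box lies on foreground.
                if inside.size and inside.mean() >= min_overlap:
                    kept.append((x0, y0, x1, y1, score))
            return kept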
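    The second method's use of unlabelled data can be illustrated with a generic pseudo-labelling loop (one common self-/semi-supervised recipe; the dissertation's architecture may differ): train on the labelled portion, let the model annotate the unlabelled frames, keep only confident predictions, and retrain. `detector` here is a hypothetical object with `train` and `predict` methods.

        def self_training(detector, labelled, unlabelled, confidence=0.9, rounds=3):
            """Iteratively grow the training set with the model's own confident boxes."""
            detector.train(labelled)                    # 1. fit on the labelled portion
            for _ in range(rounds):
                pseudo = []
                for image in unlabelled:                # 2. annotate the unlabelled frames
                    boxes = [b for b in detector.predict(image) if b.score >= confidence]
                    if boxes:
                        pseudo.append((image, boxes))
                detector.train(labelled + pseudo)       # 3. retrain on real + pseudo labels
            return detector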

    Automatic annotation for weakly supervised learning of detectors

    Object detection in images and action detection in videos are among the most widely studied computer vision problems, with applications in consumer photography, surveillance, and automatic media tagging. Typically, these standard detectors are fully supervised, that is, they require a large body of training data where the locations of the objects/actions in images/videos have been manually annotated. With the emergence of digital media and the rise of high-speed internet, raw images and video are available for little to no cost. However, the manual annotation of object and action locations remains tedious, slow, and expensive. As a result, there has been great interest in training detectors with weak supervision, where only the presence or absence of the object/action in an image/video is needed, not its location. This thesis presents approaches for weakly supervised learning of object/action detectors, with a focus on automatically annotating object and action locations in images/videos using only binary weak labels indicating the presence or absence of the object/action.
    First, a framework for weakly supervised learning of object detectors in images is presented. In the proposed approach, a variation of the multiple instance learning (MIL) technique for automatically annotating object locations in weakly labelled data is presented which, unlike existing approaches, uses inter-class and intra-class cue fusion to obtain the initial annotation. The initial annotation is then used to start an iterative process in which standard object detectors refine the location annotation. To ensure that the iterative training of detectors does not drift from the object of interest, a scheme for detecting model drift is also presented. Furthermore, unlike most other methods, our weakly supervised approach is evaluated on data without manual pose (object orientation) annotation.
    Second, an analysis of the initial annotation of objects, using inter-class and intra-class cues, is carried out. From the analysis, a new method based on negative mining (NegMine) is presented for the initial annotation of both object and action data. The NegMine-based approach is a much simpler formulation, using only an inter-class measure and requiring no complex combinatorial optimisation, yet it can still meet or outperform existing approaches, including the previously presented inter-intra class cue fusion approach. Furthermore, NegMine can be fused with existing approaches to boost their performance.
    Finally, the thesis takes a step back and looks at the use of generic object detectors as prior knowledge in weakly supervised learning of object detectors. These generic object detectors are typically based on sampling saliency maps that indicate whether a pixel belongs to the background or foreground. A new approach to generating saliency maps is presented that, unlike existing approaches, looks beyond the current image of interest and into images similar to it. We show that our generic object proposal method can be used by itself to annotate the weakly labelled object data with surprisingly high accuracy.
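
    The NegMine idea described above, using only an inter-class measure, can be sketched as follows: among candidate windows in a weakly labelled positive image, pick the one whose appearance is least similar to windows drawn from negative (object-absent) images. This is an illustrative reconstruction, not the thesis's exact formulation; `features` is a hypothetical embedding function, and cosine similarity is an arbitrary choice of measure.

        import numpy as np

        def negmine_annotate(pos_windows, neg_windows, features):
            """Return the candidate window least explained by the negative set."""
            neg = np.stack([features(w) for w in neg_windows])       # (N, D) embeddings
            neg /= np.linalg.norm(neg, axis=1, keepdims=True)        # unit-normalise rows
            best, best_score = None, float("inf")
            for w in pos_windows:
                f = features(w)
                f = f / np.linalg.norm(f)
                score = float((neg @ f).max())    # similarity to the closest negative window
                if score < best_score:            # low similarity => likely the object
                    best, best_score = w, score
            return best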