Unsupervised learning of object landmarks by factorized spatial embeddings
Automatically learning the structure of object categories remains an
important open problem in computer vision. In this paper, we propose a novel
unsupervised approach that can discover and learn landmarks in object
categories, thus characterizing their structure. Our approach is based on
factorizing image deformations, as induced by a viewpoint change or an object
deformation, by learning a deep neural network that detects landmarks
consistently with such visual effects. Furthermore, we show that the learned
landmarks establish meaningful correspondences between different object
instances in a category without having to impose this requirement explicitly.
We assess the method qualitatively on a variety of object types, natural and
man-made. We also show that our unsupervised landmarks are highly predictive of
manually-annotated landmarks in face benchmark datasets, and can be used to
regress these with a high degree of accuracy.
Comment: To be published in ICCV 2017.
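The core idea above, detecting landmarks consistently with image deformations, can be phrased as an equivariance objective: landmarks detected in a warped image should coincide with the warped landmarks of the original image. The sketch below illustrates only that objective with a toy brightest-pixel "detector" and a translation warp; the function names (`detect`, `warp`, `warp_points`) are assumptions, not the paper's API.

```python
import numpy as np

def equivariance_loss(detect, image, warp, warp_points):
    """Equivariance objective sketch: landmarks found in a deformed image
    should equal the deformed landmarks of the original image.
    `detect`: maps an image to (K, 2) landmark coordinates (hypothetical).
    `warp`: applies a deformation to an image.
    `warp_points`: applies the same deformation to (K, 2) coordinates."""
    pts_orig = detect(image)            # landmarks in the original image
    pts_warped = detect(warp(image))    # landmarks in the deformed image
    return float(np.mean((pts_warped - warp_points(pts_orig)) ** 2))

# Toy example: the "detector" returns the brightest pixel; the warp is a shift.
def detect(img):
    idx = np.unravel_index(np.argmax(img), img.shape)
    return np.array([idx], dtype=float)  # (1, 2) landmark

def warp(img):
    return np.roll(img, shift=(2, 3), axis=(0, 1))

def warp_points(pts):
    return pts + np.array([2.0, 3.0])

img = np.zeros((10, 10))
img[4, 5] = 1.0
loss = equivariance_loss(detect, img, warp, warp_points)
# A detector consistent with the deformation yields zero loss.
```

In the paper this loss would be minimized over a deep network and many random deformations; the toy detector here is already perfectly equivariant, so the loss is zero.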
Weakly- and Semi-Supervised Panoptic Segmentation
We present a weakly supervised model that jointly performs both semantic- and
instance-segmentation -- a particularly relevant problem given the substantial
cost of obtaining pixel-perfect annotation for these tasks. In contrast to many
popular instance segmentation approaches based on object detectors, our method
does not predict any overlapping instances. Moreover, we are able to segment
both "thing" and "stuff" classes, and thus explain all the pixels in the image.
"Thing" classes are weakly-supervised with bounding boxes, and "stuff" with
image-level tags. We obtain state-of-the-art results on Pascal VOC, for both
full and weak supervision (which achieves about 95% of fully-supervised
performance). Furthermore, we present the first weakly-supervised results on
Cityscapes for both semantic- and instance-segmentation. Finally, we use our
weakly supervised framework to analyse the relationship between annotation
quality and predictive performance, which is of interest to dataset creators.
Comment: ECCV 2018. The first two authors contributed equally.
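One way to picture weak supervision of "thing" classes with bounding boxes is to convert each box into per-pixel pseudo labels, ignoring ambiguous pixels covered by several boxes. This is only an illustrative simplification of box-based weak supervision, not the paper's actual pipeline; the `IGNORE` convention and function name are assumptions.

```python
import numpy as np

IGNORE = 255  # pixels with ambiguous weak supervision are excluded from the loss

def boxes_to_pseudo_labels(h, w, boxes):
    """Turn bounding-box weak labels into a per-pixel pseudo label map.
    `boxes` is a list of (class_id, y0, x0, y1, x1) tuples.
    Pixels covered by exactly one box get that box's class; pixels covered
    by several boxes are marked IGNORE; the rest stay 0 (background/stuff)."""
    label = np.zeros((h, w), dtype=np.uint8)
    count = np.zeros((h, w), dtype=np.int32)
    for cls, y0, x0, y1, x1 in boxes:
        label[y0:y1, x0:x1] = cls
        count[y0:y1, x0:x1] += 1
    label[count > 1] = IGNORE  # ambiguous overlap region
    return label

labels = boxes_to_pseudo_labels(8, 8, [(1, 0, 0, 4, 4), (2, 2, 2, 6, 6)])
```

A segmentation network can then be trained on such pseudo labels with the IGNORE pixels masked out of the loss.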
Object Detection in Data Acquired From Aerial Devices
The object detection task, both in images and in videos, has seen extraordinary advances, with state-of-the-art architectures achieving close to perfect precision on large modern datasets. Because these models are trained on such large-scale datasets, most of them can adapt to almost any other real-world scenario if given enough data. Nevertheless, there is one specific scenario, aerial images, in which these models tend to perform worse due to its natural characteristics. The main difference between typical object detection datasets and aerial ones is the scale of the objects that need to be located and identified. Moreover, factors such as image brightness, object rotation and detail, and background colours also play a crucial role in the model's performance, regardless of its architecture.
Deep learning models make decisions based on the features they can extract from the training data. This works particularly well in standard scenarios, where images portray the object at a scale at which its details are sharp, allowing the model to distinguish it from other objects and from the background. However, when the image is captured from 50 meters above, the object's details diminish considerably, making it harder for deep learning models to extract meaningful features for identifying and localizing the object. Nowadays, many surveillance systems use static cameras at pre-defined locations; for some scenarios, however, a more appropriate approach would be to use drones to surveil a particular area along a specific route. This type of surveillance is particularly adequate where covering the whole area with static cameras is not feasible, such as in wild forests.
The first objective of this dissertation is to gather a dataset that focuses on detecting people
and vehicles in wild-forest scenarios. The dataset was captured using a DJI drone in four
distinct zones of Serra da Estrela. It contains instances captured under different weather
conditions – sunny and foggy – and during different parts of the day – morning, afternoon
and evening. In addition, it includes four different types of terrain (earth, tar, forest, and gravel) and two object classes: person and vehicle.
Later on, the second objective of this dissertation is to analyze precisely how state-of-the-art single-frame-based and video object detectors perform on the previously described dataset. The analysis focuses on each model's performance for every object class in every terrain. Given this, we can demonstrate the exact situations in which the different models stand out and those in which they tend to perform the worst.
Finally, we propose two methods based on the results obtained during the first phase of experiments, each aiming to solve a different problem that emerged from applying state-of-the-art models to aerial images. The first method aims to improve the performance of video object detector models in certain situations by using background removal algorithms to delineate specific areas in which the detectors' predictions are considered valid.
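The idea of restricting valid predictions to foreground areas can be sketched as follows. This is a minimal stand-in using simple frame differencing rather than the dissertation's actual background removal algorithms; the threshold and overlap parameters are assumptions.

```python
import numpy as np

def foreground_mask(prev_frame, frame, thresh=25):
    """Crude background removal by frame differencing: a pixel is foreground
    if its intensity changed by more than `thresh` between frames."""
    return np.abs(frame.astype(int) - prev_frame.astype(int)) > thresh

def filter_detections(boxes, mask, min_overlap=0.2):
    """Keep a detection (y0, x0, y1, x1) only if enough of its box lies on
    foreground pixels, i.e. inside the area where predictions are valid."""
    kept = []
    for (y0, x0, y1, x1) in boxes:
        region = mask[y0:y1, x0:x1]
        if region.size and region.mean() >= min_overlap:
            kept.append((y0, x0, y1, x1))
    return kept

prev = np.zeros((20, 20), dtype=np.uint8)
cur = prev.copy()
cur[5:10, 5:10] = 200                       # a moving object appears
mask = foreground_mask(prev, cur)
dets = filter_detections([(4, 4, 11, 11), (12, 12, 18, 18)], mask)
# Only the box overlapping the moving region survives the filter.
```

In practice a proper background subtractor (e.g. a Gaussian-mixture model over many frames) would replace the two-frame difference used here.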
One of the main problems with creating a high-quality dataset from scratch is the intensive and time-consuming annotation process that follows gathering the data. To address this, the second method we propose is a self-supervised architecture that aims to tackle the particular scarcity of high-quality aerial datasets. The main idea is to analyze the usefulness of unlabelled data in these problems and thus avoid the immensely time-consuming process of labelling the entirety of a full-scale aerial dataset. The reported results show that even with only a partially labelled dataset, it is possible to use the unlabelled data in a self-supervised
manner to improve the model's performance further.
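The self-supervised use of partially labelled data described above can be illustrated with a pseudo-labelling loop: train on the labelled portion, predict on the unlabelled portion, and absorb confident predictions as new labels. The nearest-centroid "model" and the margin-based confidence rule below are illustrative stand-ins for the dissertation's detector, not its actual architecture.

```python
import numpy as np

def self_train(X_lab, y_lab, X_unlab, rounds=3, conf=0.8):
    """Minimal pseudo-labelling loop (nearest-centroid stand-in for the
    detector): repeatedly fit on labelled data, predict on unlabelled data,
    and move confident predictions into the labelled set."""
    X, y = X_lab.copy(), y_lab.copy()
    for _ in range(rounds):
        # fit: one centroid per class
        centroids = np.stack([X[y == c].mean(axis=0) for c in np.unique(y)])
        # predict: nearest centroid for each unlabelled sample
        d = np.linalg.norm(X_unlab[:, None] - centroids[None], axis=2)
        pred = d.argmin(axis=1)
        # confidence: margin between best and second-best centroid distance
        sorted_d = np.sort(d, axis=1)
        confident = (sorted_d[:, 1] - sorted_d[:, 0]) > conf
        if not confident.any():
            break
        X = np.vstack([X, X_unlab[confident]])
        y = np.concatenate([y, pred[confident]])
        X_unlab = X_unlab[~confident]
    return X, y

X_lab = np.array([[0.0, 0.0], [10.0, 10.0]])
y_lab = np.array([0, 1])
X_unlab = np.array([[1.0, 1.0], [9.0, 9.0]])
X_all, y_all = self_train(X_lab, y_lab, X_unlab)
# Both unlabelled points are confidently pseudo-labelled and absorbed.
```

The same pattern applies to a detector: high-confidence detections on unlabelled frames become pseudo ground truth for the next training round.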
Automatic annotation for weakly supervised learning of detectors
Object detection in images and action detection in videos are among the most widely studied
computer vision problems, with applications in consumer photography, surveillance, and automatic
media tagging. Typically, these standard detectors are fully supervised, that is they require
a large body of training data where the locations of the objects/actions in images/videos have
been manually annotated. With the emergence of digital media and the rise of high-speed internet, raw images and video are available at little to no cost. However, the manual annotation of object and action locations remains tedious, slow, and expensive. As a result, there has been great interest in training detectors with weak supervision, where only the presence or absence
of object/action in image/video is needed, not the location. This thesis presents approaches for
weakly supervised learning of object/action detectors with a focus on automatically annotating
object and action locations in images/videos using only binary weak labels indicating the presence
or absence of object/action in images/videos.
First, a framework for weakly supervised learning of object detectors in images is presented. In the proposed approach, a variation of the multiple instance learning (MIL) technique automatically annotates object locations in weakly labelled data and, unlike existing approaches, uses inter-class and intra-class cue fusion to obtain the initial annotation. The initial annotation is then used to start an iterative process in which standard object detectors are used to
refine the location annotation. Finally, to ensure that the iterative training of detectors does not drift
from the object of interest, a scheme for detecting model drift is also presented. Furthermore,
unlike most other methods, our weakly supervised approach is evaluated on data without manual
pose (object orientation) annotation.
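The alternating structure of MIL-style annotation refinement can be sketched in a few lines: each positive image is a bag of candidate window features, and the loop alternates between scoring windows with a simple model and re-fitting the model on each bag's current best window. This is a bare-bones illustration, not the cue-fusion method above; the linear scoring rule is an assumption.

```python
import numpy as np

def mil_select(pos_bags, neg_windows, iters=5):
    """Bare-bones multiple-instance learning loop: alternately
    (1) score candidate windows with a linear model and
    (2) re-fit the model on each positive bag's best-scoring window.
    Returns the index of the chosen window in each positive bag."""
    # initialise the object model as the mean of all candidate windows
    w = np.mean([win for bag in pos_bags for win in bag], axis=0)
    chosen = None
    for _ in range(iters):
        neg_mean = neg_windows.mean(axis=0)
        scores = [bag @ (w - neg_mean) for bag in pos_bags]  # linear scores
        chosen = [int(np.argmax(s)) for s in scores]         # best window per bag
        # re-fit the model on the currently selected windows
        w = np.mean([bag[i] for bag, i in zip(pos_bags, chosen)], axis=0)
    return chosen

# Toy features: the first window in each bag looks like the object.
pos_bags = [np.array([[10.0, 0.0], [0.0, 1.0]]),
            np.array([[9.0, 1.0], [1.0, 0.0]])]
neg = np.array([[0.0, 5.0], [1.0, 6.0]])
picked = mil_select(pos_bags, neg)
```

In the thesis, the selected windows would then train a standard object detector, whose detections refine the selection in the next iteration, with a drift check guarding the loop.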
Second, an analysis of the initial annotation of objects, using inter-class and intra-class cues,
is carried out. From the analysis, a new method based on negative mining (NegMine) is presented
for the initial annotation of both object and action data. The NegMine-based approach is a much simpler formulation that uses only an inter-class measure and requires no complex combinatorial optimisation, yet it can still match or outperform existing approaches, including the previously presented inter-intra class cue fusion approach. Furthermore, NegMine can be fused with existing
approaches to boost their performance.
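An inter-class-only selection rule in the spirit of negative mining can be sketched as: pick the candidate window least explained by the negative (object-absent) images. The nearest-neighbour distance used below is an assumed stand-in for the thesis's actual inter-class measure.

```python
import numpy as np

def negmine_select(bag, neg_windows):
    """Negative-mining initial annotation sketch: choose the candidate
    window whose nearest negative window is farthest away, i.e. the window
    least similar to anything seen in object-absent images."""
    # distance from every candidate window to every negative window
    d = np.linalg.norm(bag[:, None] - neg_windows[None], axis=2)
    nearest_neg = d.min(axis=1)          # how negative-like each window is
    return int(np.argmax(nearest_neg))   # least negative-like window

bag = np.array([[0.0, 0.0], [5.0, 5.0]])   # candidate window features
neg = np.array([[0.0, 1.0], [1.0, 0.0]])   # windows from negative images
idx = negmine_select(bag, neg)
# The window far from all negative windows is selected as the object.
```

Note how this needs no intra-class comparison or combinatorial search: each bag is processed independently against the shared negative set.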
Finally, the thesis will take a step back and look at the use of generic object detectors as prior
knowledge in weakly supervised learning of object detectors. These generic object detectors are
typically based on sampling saliency maps that indicate if a pixel belongs to the background
or foreground. A new approach to generating saliency maps is presented that, unlike existing
approaches, looks beyond the current image of interest and into images similar to the current
image. We show that our generic object proposal method can be used by itself to annotate the
weakly labelled object data with surprisingly high accuracy.
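One simple way to "look beyond the current image" when scoring saliency, in the spirit described above (the exact formulation is not given here and is assumed), is to score each patch by how poorly it matches patches from similar images: recurring background textures find close matches and score low, while distinctive foreground patches score high.

```python
import numpy as np

def cross_image_saliency(patches, other_patches):
    """Saliency sketch using other images: a patch is salient (likely
    foreground) if no close match exists among patches drawn from similar
    images, since background textures tend to recur across images."""
    d = np.linalg.norm(patches[:, None] - other_patches[None], axis=2)
    return d.min(axis=1)  # per-patch saliency: distance to nearest match

patches = np.array([[0.0, 0.0], [9.0, 9.0]])   # patches of the current image
others = np.array([[0.0, 0.5], [0.2, 0.0]])    # patches from similar images
sal = cross_image_saliency(patches, others)
# The first patch matches the other images well (background-like, low score);
# the second finds no close match (foreground-like, high score).
```

Thresholding such a map would yield the foreground/background proposals used to annotate the weakly labelled data.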