156 research outputs found
Scale Invariant Interest Points with Shearlets
Shearlets are a relatively new directional multi-scale framework for signal
analysis that has been shown to be effective at enhancing signal discontinuities
such as edges and corners at multiple scales. In this work we address the
problem of detecting and describing blob-like features in the shearlets
framework. We derive a measure which is very effective for blob detection and
closely related to the Laplacian of Gaussian. We demonstrate that the measure
satisfies perfect scale invariance in the continuous case. In the
discrete setting, we derive algorithms for blob detection and keypoint
description. Finally, we provide qualitative justifications of our findings as
well as a quantitative evaluation on benchmark data. We also report
experimental evidence that our method is well suited to compressed
and noisy images, thanks to the sparsity property of shearlets.
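As a point of reference for the stated relation to the Laplacian of Gaussian, the sketch below illustrates scale selection with the classical scale-normalized Laplacian. It is not the shearlet-based measure from the abstract. It assumes an ideal, unit-mass 2D Gaussian blob of variance t0, for which the normalized response at the blob centre has the closed form -t / (pi (t0 + t)^2); the function names are hypothetical.

```python
import math

def normalized_log_at_center(t, t0):
    """Scale-normalized Laplacian t * (Lxx + Lyy) at the centre of an ideal
    2D Gaussian blob of variance t0, smoothed to scale t (also a variance)."""
    return -t / (math.pi * (t0 + t) ** 2)

def select_scale(t0, scales):
    """Return the candidate scale with the strongest absolute response."""
    return max(scales, key=lambda t: abs(normalized_log_at_center(t, t0)))

# Candidate scales 0.25, 0.5, ..., 20.0 (all exactly representable as floats).
scales = [0.25 * k for k in range(1, 81)]

small = select_scale(2.0, scales)
large = select_scale(8.0, scales)  # same blob shape, variance scaled by 4
print(small, large)  # 2.0 8.0
```

The selected scale follows the blob size, which is the behaviour that scale invariance refers to in the continuous case.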
Depth-based descriptor for matching keypoints in 3D scenes
Keypoint detection is a basic step in many computer vision algorithms aimed at object recognition, automatic navigation and the analysis of biomedical images. The success of higher-level image analysis tasks, however, depends on the reliable detection of characteristic local image regions termed keypoints. A large number of keypoint detection algorithms have been proposed and verified. In this paper we discuss the most important of them. The main part of this work describes a keypoint detection algorithm we propose that incorporates depth information computed from stereovision cameras or other depth-sensing devices. We show that filtering out keypoints that are context dependent, e.g. those located at object boundaries, can improve the matching performance of the keypoints, which is the basis for object recognition tasks. This improvement is shown quantitatively by comparing the proposed algorithm to the widely accepted SIFT keypoint detector. Our study is motivated by the development of a system aimed at aiding the visually impaired in space perception and object identification.
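The idea of discarding keypoints that sit on object boundaries can be sketched as follows. This is a minimal illustration, not the authors' algorithm: it assumes a dense depth map given as a list of rows, treats a large depth range in a small window as a boundary, and uses hypothetical names and thresholds (`radius`, `max_jump`).

```python
def depth_range(depth, row, col, radius):
    """Max minus min depth inside a square window clipped to the image."""
    rows = range(max(0, row - radius), min(len(depth), row + radius + 1))
    cols = range(max(0, col - radius), min(len(depth[0]), col + radius + 1))
    values = [depth[r][c] for r in rows for c in cols]
    return max(values) - min(values)

def filter_boundary_keypoints(keypoints, depth, radius=1, max_jump=0.2):
    """Keep keypoints whose neighbourhood has no large depth discontinuity."""
    return [(r, c) for (r, c) in keypoints
            if depth_range(depth, r, c, radius) <= max_jump]

# Toy depth map in metres: a near object (1.0) in front of a far wall (3.0).
depth = [
    [3.0, 3.0, 3.0, 3.0, 3.0],
    [3.0, 3.0, 1.0, 1.0, 1.0],
    [3.0, 3.0, 1.0, 1.0, 1.0],
    [3.0, 3.0, 1.0, 1.0, 1.0],
    [3.0, 3.0, 1.0, 1.0, 1.0],
]
keypoints = [(0, 0), (2, 2), (3, 3)]  # (row, col) positions
kept = filter_boundary_keypoints(keypoints, depth)
print(kept)  # [(0, 0), (3, 3)]
```

The keypoint at (2, 2) lies on the edge between the object and the wall, so its appearance depends on the background and it is dropped.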
Biologically motivated keypoint detection for RGB-D data
With the emerging interest in active vision, computer vision researchers have been increasingly
concerned with the mechanisms of attention. Therefore, several computational models of visual
attention inspired by the human visual system have been developed, aiming at
the detection of regions of interest in images.
This thesis is focused on selective visual attention, which provides a mechanism for the
brain to focus computational resources on one object at a time, guided by low-level image properties
(bottom-up attention). The task of recognizing objects in different locations is achieved
by focusing on different locations, one at a time. Given the computational requirements of the
models proposed, the research in this area has been mainly of theoretical interest. More recently,
psychologists, neurobiologists and engineers have begun to collaborate, and this has
resulted in considerable benefits. The first objective of this doctoral work is to bring together
concepts and ideas from these different research areas, providing a study of the biological research
on human visual system and a discussion of the interdisciplinary knowledge in this area, as
well as the state of the art in computational models of (bottom-up) visual attention. Engineers
usually refer to visual attention as saliency: when people fix their gaze on a particular
region of an image, it is because that region is salient. In this research work, saliency methods
are presented according to their classification (biologically plausible, computational or hybrid)
and in chronological order.
A few salient structures can be used for applications such as object registration, retrieval or
data simplification, and these few salient structures can be treated as keypoints when
performing object recognition. Generally, object recognition algorithms use a large
number of descriptors extracted at a dense set of points, which comes with a very high computational
cost and prevents real-time processing. To reduce this computational
complexity, the features have to be extracted from a small set of points, usually called
keypoints. The use of keypoint-based detectors reduces both the processing time and
the redundancy in the data. Local descriptors extracted from images have been extensively
reported in the computer vision literature. The large number of available keypoint detectors
calls for a comparative evaluation. We therefore provide a
description of 2D and 3D keypoint detectors and 3D descriptors, and an evaluation of existing 3D keypoint
detectors in a publicly available point cloud library on real 3D objects. The invariance of
the 3D keypoint detectors was evaluated with respect to rotations, scale changes and translations.
This evaluation reports the robustness of each detector to changes of viewpoint;
the criteria used are the absolute and the relative repeatability rate. In our experiments, the
method that achieved the best repeatability rate was ISS3D.
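The two repeatability criteria can be illustrated with a small sketch. The abstract does not give the exact definitions, so this assumes a common convention: a reference keypoint is repeated if a keypoint detected on the transformed cloud, once mapped back to the original frame, lies within a distance `eps` of it; the absolute rate is the count and the relative rate is the count divided by the number of reference keypoints. All names are hypothetical.

```python
import math

def repeatability(reference_kps, mapped_kps, eps):
    """Return (absolute, relative) repeatability.

    reference_kps: keypoints detected on the original cloud.
    mapped_kps: keypoints detected on the transformed cloud, already mapped
                back into the original frame with the known transformation.
    """
    repeated = sum(
        1 for p in reference_kps
        if any(math.dist(p, q) <= eps for q in mapped_kps)
    )
    relative = repeated / len(reference_kps) if reference_kps else 0.0
    return repeated, relative

reference = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (0.0, 1.0, 0.0), (5.0, 5.0, 5.0)]
mapped = [(0.01, 0.0, 0.0), (1.0, 0.02, 0.0), (9.0, 9.0, 9.0)]
absolute, relative = repeatability(reference, mapped, eps=0.05)
print(absolute, relative)  # 2 0.5
```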
The analysis of the human visual system and of biologically inspired saliency map detectors
led to the idea of extending a keypoint detector with the color
information available in the retina. This proposal produced a 2D keypoint detector inspired by the behavior
of the early visual system. Our method is a color extension of the BIMP keypoint detector
that includes both the color and intensity channels of an image: color information is included
in a biologically plausible way and multi-scale image features are combined into a single keypoint
map. This detector is compared against state-of-the-art detectors and found to be particularly
well suited for tasks such as category and object recognition. The recognition process is performed
by comparing the 3D descriptors extracted at the locations indicated by the keypoints, after mapping the 2D keypoint locations to 3D space. This mapping is possible because the dataset used provides the location of each point in both 2D and 3D space. The evaluation allowed us to identify
the best keypoint detector/descriptor pair on an RGB-D object dataset. Using our keypoint detector
and the SHOTCOLOR descriptor, we obtain a good category recognition rate, while for object recognition
it is the PFHRGB descriptor that gives the best results.
A 3D recognition system involves the choice of a keypoint detector and a descriptor. A new
method for the detection of 3D keypoints on point clouds is presented, and a benchmark is
performed between each pair of 3D keypoint detector and 3D descriptor to evaluate their performance
on object and category recognition. These evaluations are done on a public database
of real 3D objects. Our keypoint detector is inspired by the behavior and neural architecture
of the primate visual system: the 3D keypoints are extracted based on a bottom-up 3D saliency
map, which is a map that encodes the saliency of objects in the visual environment. The saliency
map is determined by computing conspicuity maps (a combination across different modalities)
of the orientation, intensity and color information, in a bottom-up and purely stimulus-driven
manner. These three conspicuity maps are fused into a 3D saliency map and, finally, the
focus of attention (or "keypoint location") is sequentially directed to the most salient points in
this map. Inhibiting the current location automatically allows the system to attend to the next most
salient location. The main conclusions are: with a similar average number of keypoints, our 3D
keypoint detector outperforms the other eight 3D keypoint detectors evaluated, achieving the
best result in 32 of the evaluated metrics in the category and object recognition experiments,
whereas the second-best detector obtained the best result in only 8 of these metrics. The only
drawback is the computational time, since BIK-BUS is slower than the other detectors. Given
that the differences in recognition performance, size and time requirements are large, the
selection of the keypoint detector and descriptor has to be matched to the desired task, and we
give some directions to facilitate this choice.
After proposing the 3D keypoint detector, the research focused on a robust detection and
tracking method for 3D objects that uses keypoint information in a particle filter. This method
consists of three distinct steps: Segmentation, Tracking Initialization and Tracking. The segmentation
removes all the background information, reducing the number of points for
further processing. In the initialization, we use a biologically inspired keypoint detector.
The information about the object that we want to follow is given by the extracted keypoints. The
particle filter tracks the keypoints, which lets us predict where the keypoints
will be in the next frame. In a recognition system, one of the problems is the computational cost
of keypoint detectors, which this method is intended to address. The experiments with the PFBIK-Tracking
method are done indoors in an office/home environment, where personal robots are
expected to operate. We also evaluate this method quantitatively using a "Tracking Error",
which measures the stability of the overall tracking method and is computed
from the keypoint and particle centroids. Compared with the tracking
method available in the Point Cloud Library, our system achieves better results with a much smaller
number of points and less computational time. Our method is faster and more robust to occlusion
when compared to the OpenniTracker.
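The abstract says only that the "Tracking Error" is computed from the keypoint and particle centroids. One plausible reading, sketched below as an assumption, is the Euclidean distance between the centroid of the detected keypoints and the centroid of the particles in a given frame. The function names and the use of unweighted particles are hypothetical.

```python
import math

def centroid(points):
    """Arithmetic mean of a non-empty list of 3D points."""
    n = len(points)
    return tuple(sum(p[i] for p in points) / n for i in range(3))

def tracking_error(keypoints, particles):
    """Distance between the keypoint centroid and the particle centroid."""
    return math.dist(centroid(keypoints), centroid(particles))

keypoints = [(0.0, 0.0, 0.0), (2.0, 0.0, 0.0)]                    # centroid (1, 0, 0)
particles = [(1.0, 0.0, 0.0), (1.0, 2.0, 0.0), (1.0, 4.0, 0.0)]   # centroid (1, 2, 0)
error = tracking_error(keypoints, particles)
print(error)  # 2.0
```

A value that stays small over a sequence would indicate that the particle cloud remains centred on the tracked object.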
What is Holding Back Convnets for Detection?
Convolutional neural networks have recently shown excellent results in
general object detection and many other tasks. Albeit very effective, they
involve many user-defined design choices. In this paper we want to better
understand these choices by inspecting two key aspects: "what did the network
learn?" and "what can the network learn?". We exploit new annotations
(Pascal3D+) to enable a new empirical analysis of the R-CNN detector. Contrary to
common belief, our results indicate that existing state-of-the-art convnet
architectures are not invariant to various appearance factors. In fact, all
considered networks have similar weak points which cannot be mitigated by
simply increasing the training data (architectural changes are needed). We show
that overall performance can improve when using image renderings for data
augmentation. We report the best known results on the Pascal3D+ detection and
view-point estimation tasks.
Sparse Binary Features for Image Classification
In this work a new method for automatic image classification is proposed. It relies on a compact representation of images using sets of sparse binary features. This work first evaluates the Fast Retina Keypoint binary descriptor and proposes improvements based on an efficient descriptor representation. The efficient representation is created using dimensionality reduction techniques, entropy analysis and decorrelated sampling. In the second part, the problem of image classification is tackled. The traditional approach uses machine learning algorithms to create classifiers, and some works already propose using a compact image representation obtained by feature extraction as a preprocessing step. The second contribution of this work is to show that binary features, while very compact and low dimensional (compared to traditional image representations), still provide very high discriminative power. This is shown using various learning algorithms and binary descriptors. In recent years, one scheme has been widely used to perform object recognition on images, or equivalently image classification. It is based on the concept of Bag of Visual Words. More precisely, an image is described using an unordered set of visual words, which are generally represented by feature descriptions. The last contribution of this work is to use binary features with a simple Bag of Visual Words classifier. Classification performance is tested on a large database of images.
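The combination of binary features with a Bag of Visual Words representation can be sketched as follows. This is an illustrative toy, not the method evaluated in the work: descriptors are modelled as 8-bit Python integers, the vocabulary is given rather than learned, and each descriptor is assigned to its nearest visual word under the Hamming distance. All names are hypothetical.

```python
def hamming(a, b):
    """Number of differing bits between two binary descriptors."""
    return bin(a ^ b).count("1")

def bovw_histogram(descriptors, vocabulary):
    """Normalized histogram of nearest visual words (Hamming distance)."""
    counts = [0] * len(vocabulary)
    for d in descriptors:
        nearest = min(range(len(vocabulary)),
                      key=lambda i: hamming(d, vocabulary[i]))
        counts[nearest] += 1
    total = sum(counts)
    return [c / total for c in counts] if total else counts

vocabulary = [0b00000000, 0b11111111, 0b11110000]   # three toy visual words
descriptors = [0b00000001, 0b11111110, 0b11110001, 0b00000011]
histogram = bovw_histogram(descriptors, vocabulary)
print(histogram)  # [0.5, 0.25, 0.25]
```

The resulting fixed-length histogram is what a standard classifier would consume, regardless of how many descriptors the image produced.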
Single and multiple stereo view navigation for planetary rovers
© Cranfield University
This thesis deals with the challenge of autonomous navigation of the ExoMars rover.
The absence of global positioning systems (GPS) in space, added to the limitations
of wheel odometry, makes autonomous navigation based on these two techniques, as
done in the literature, an unviable solution and necessitates the use of other approaches.
That, among other reasons, motivates this work to use solely visual data to solve the
robot’s Egomotion problem.
The homogeneity of Mars’ terrain makes the robustness of the low-level image
processing technique a critical requirement. In the first part of the thesis, novel solutions
are presented to tackle this specific problem. Detecting features that are robust to
illumination changes, and matching and associating them uniquely, are sought-after
capabilities. A solution for robustness of features against illumination variation is proposed
combining Harris corner detection together with moment image representation.
Whereas the first provides a technique for efficient feature detection, the moment images
add the necessary brightness invariance. Moreover, a bucketing strategy is used
to guarantee that features are homogeneously distributed within the images. Then, the
addition of local feature descriptors guarantees the unique identification of image cues.
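The bucketing strategy mentioned above can be sketched as follows. This is a minimal illustration under stated assumptions, not the thesis implementation: features are `(x, y, score)` tuples (for example Harris responses), the image is split into a regular grid, and only the strongest `per_cell` features in each cell are kept. The function and parameter names are hypothetical.

```python
def bucket_features(features, width, height, cells_x, cells_y, per_cell):
    """Keep at most per_cell strongest features in each grid cell."""
    buckets = {}
    for x, y, score in features:
        # Clamp so that points on the right/bottom border fall in the last cell.
        cx = min(int(x * cells_x / width), cells_x - 1)
        cy = min(int(y * cells_y / height), cells_y - 1)
        buckets.setdefault((cx, cy), []).append((x, y, score))
    kept = []
    for cell in sorted(buckets):
        strongest = sorted(buckets[cell], key=lambda f: f[2], reverse=True)
        kept.extend(strongest[:per_cell])
    return kept

features = [(10, 10, 0.9), (20, 15, 0.5), (30, 40, 0.7),
            (80, 20, 0.3), (60, 70, 0.8)]
kept = bucket_features(features, width=100, height=100,
                       cells_x=2, cells_y=2, per_cell=1)
print(kept)  # [(10, 10, 0.9), (80, 20, 0.3), (60, 70, 0.8)]
```

A weak feature such as (80, 20, 0.3) survives because it is the only one in its cell, which is what spreads the features evenly over the image.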
In the second part, reliable and precise motion estimation for the Mars rover is
studied. A number of successful approaches are thoroughly analysed. Visual Simultaneous
Localisation And Mapping (VSLAM) is investigated, proposing enhancements
and integrating it with the robust feature methodology. Then, linear and nonlinear optimisation
techniques are explored. Alternative photogrammetry reprojection concepts
are tested. Lastly, data fusion techniques are proposed to deal with the integration of
multiple stereo view data.
Our robust visual scheme allows good feature repeatability. Because of this,
dimensionality reduction of the feature data can be used without compromising the
overall performance of the proposed solutions for motion estimation. Also, the developed
Egomotion techniques have been extensively validated using both simulated and
real data collected at ESA-ESTEC facilities. Multiple stereo view solutions for robot
motion estimation are introduced, presenting interesting benefits. The obtained results
prove the innovative methods presented here to be accurate and reliable approaches
capable of solving the Egomotion problem in a Mars environment.
On the Design and Analysis of Multiple View Descriptors
We propose an extension of popular descriptors based on gradient orientation
histograms (HOG, computed in a single image) to multiple views. It hinges on
interpreting HOG as a conditional density in the space of sampled images, where
the effects of nuisance factors such as viewpoint and illumination are
marginalized. However, such marginalization is performed with respect to a very
coarse approximation of the underlying distribution. Our extension leverages
the fact that multiple views of the same scene allow separating intrinsic from
nuisance variability, and thus afford better marginalization of the latter. The
result is a descriptor that has the same complexity as single-view HOG, and can
be compared in the same manner, but exploits multiple views to better trade off
insensitivity to nuisance variability with specificity to intrinsic
variability. We also introduce a novel multi-view wide-baseline matching
dataset, consisting of a mixture of real and synthetic objects with
ground-truth camera motion and dense three-dimensional geometry.
Vision for Social Robots: Human Perception and Pose Estimation
In order to extract the underlying meaning from a scene captured from the surrounding world in a single still image, social robots will need to learn the human ability to detect different objects, understand their arrangement and relationships relative both to their own parts and to each other, and infer the dynamics under which they are evolving. Furthermore, they will need to develop and hold a notion of context to allow assigning different meanings (semantics) to the same visual configuration (syntax) of a scene.
The underlying thread of this Thesis is the investigation of new ways for enabling interactions between social robots and humans, by advancing the visual perception capabilities of robots when they process images and videos in which humans are the main focus of attention.
First, we analyze the general problem of scene understanding, as social robots moving through the world need to be able to interpret scenes without having been assigned a specific preset goal. Throughout this line of research, i) we observe that human actions and interactions which can be visually discriminated from an image follow a very heavy-tailed distribution; ii) we develop an algorithm that can obtain a spatial understanding of a scene by only using cues arising from the effect of perspective on a picture of a person’s face; and iii) we define a novel taxonomy of errors for the task of estimating the 2D body pose of people in images to better explain the behavior of algorithms and highlight their underlying causes of error.
Second, we focus on the specific task of 3D human pose and motion estimation from monocular 2D images using weakly supervised training data, as accurately predicting human pose will open up the possibility of richer interactions between humans and social robots. We show that when 3D ground-truth data is only available in small quantities, or not at all, it is possible to leverage knowledge about the physical properties of the human body, along with additional constraints related to alternative types of supervisory signals, to learn models that can regress the full 3D pose of the human body and predict its motions from monocular 2D images.
Taken in its entirety, the intent of this Thesis is to highlight the importance of, and provide novel methodologies for, social robots' ability to interpret their surrounding environment, learn in a way that is robust to low data availability, and generalize previously observed behaviors to unknown situations in a similar way to humans.
Unsupervised landmark discovery via self-training correspondence
Object parts, also known as landmarks, convey information about an object’s shape and spatial configuration in 3D space, especially for deformable objects. The goal of landmark detection is to have a model that, for a particular object instance, can estimate the locations of its parts. Research in this field is mainly driven by supervised approaches, where a sufficient amount of human-annotated data is available. As annotating landmarks for all objects is impractical, this thesis focuses on learning landmark detectors without supervision. Despite good performance in limited scenarios (objects showing minor rigid deformation), unsupervised landmark discovery largely remains an open problem. Existing work fails to capture semantic landmarks, i.e. points similar to the ones assigned by human annotators, and may not generalise well to highly articulated objects like the human body, complicated backgrounds or large viewpoint variations.
In this thesis, we propose a novel self-training framework for the unsupervised discovery of landmarks. Contrary to existing methods that build on auxiliary tasks such as image generation or equivariance, we start from generic keypoints and train a landmark detector and descriptor to improve itself, tuning the keypoints into distinctive landmarks. We propose an iterative algorithm that alternates between producing new pseudo-labels through feature clustering and learning distinctive features for each pseudo-class through contrastive learning. Our detector can discover highly semantic landmarks that are more flexible in terms of capturing large viewpoint changes and out-of-plane rotations (3D rotations). New state-of-the-art performance is achieved on multiple challenging datasets.