HVC-Net: Unifying Homography, Visibility, and Confidence Learning for Planar Object Tracking
Robust and accurate planar tracking over a whole video sequence is vitally
important for many vision applications. The key to planar object tracking is to
find object correspondences, modeled by homography, between the reference image
and the tracked image. Existing methods tend to produce wrong correspondences
under appearance variations, camera-object relative motion, and occlusion. To
alleviate this problem, we present a unified convolutional
neural network (CNN) model that jointly considers homography, visibility, and
confidence. First, we introduce correlation blocks that explicitly account for
the local appearance changes and camera-object relative motions as the base of
our model. Second, we jointly learn the homography and the visibility, which
links camera-object relative motion with occlusion. Third, we propose a confidence
module that actively monitors the estimation quality from the pixel correlation
distributions obtained in correlation blocks. All these modules are plugged
into a Lucas-Kanade (LK) tracking pipeline to obtain both accurate and robust
planar object tracking. Our approach outperforms the state-of-the-art methods
on public POT and TMT datasets. Its superior performance is also verified on a
real-world application, synthesizing high-quality in-video advertisements.
Comment: Accepted to ECCV 2022
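As context for the LK pipeline mentioned above, here is a minimal sketch of a classical LK-based planar tracking baseline in OpenCV (an illustration of the kind of pipeline being extended, not the HVC-Net model itself): reference keypoints are tracked with pyramidal Lucas-Kanade optical flow, and a homography is then fitted robustly to the surviving tracks.

```python
import cv2
import numpy as np

def track_homography(ref_gray, cur_gray, ref_pts):
    """ref_pts: (N, 1, 2) float32 keypoints detected on the reference plane."""
    # pyramidal LK optical flow from the reference to the current frame
    cur_pts, status, _ = cv2.calcOpticalFlowPyrLK(
        ref_gray, cur_gray, ref_pts, None, winSize=(21, 21), maxLevel=3
    )
    good = status.ravel() == 1
    # robust reference-to-current homography fit on the surviving tracks
    H, inliers = cv2.findHomography(ref_pts[good], cur_pts[good], cv2.RANSAC, 3.0)
    return H, inliers
```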
Vehicle Trajectories from Unlabeled Data through Iterative Plane Registration
One of the most complex aspects of autonomous driving concerns understanding the surrounding environment. In particular, the interest falls on detecting which agents populate it and how they move. The capacity to predict how these agents may act in the near future would allow an autonomous vehicle to safely plan its trajectory, minimizing the risks for itself and others. In this work we propose an automatic trajectory annotation method exploiting an Iterative Plane Registration algorithm based on homographies and semantic segmentations. The output of our technique is a set of holistic trajectories (past-present-future) paired with a single image context, useful to train a predictive model.
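To make the homography-plus-segmentation idea concrete, a single registration step might look like the sketch below (the function name and the use of ORB features are illustrative assumptions, not the paper's implementation; in practice a per-frame road mask would be used):

```python
import cv2
import numpy as np

def register_to_reference(ref_gray, cur_gray, road_mask, footpoints):
    """road_mask: uint8 binary mask from semantic segmentation (road pixels).

    footpoints: (N, 1, 2) float32 agent footpoints in the current frame.
    """
    orb = cv2.ORB_create(2000)
    k1, d1 = orb.detectAndCompute(ref_gray, road_mask)
    k2, d2 = orb.detectAndCompute(cur_gray, road_mask)
    matches = cv2.BFMatcher(cv2.NORM_HAMMING, crossCheck=True).match(d1, d2)
    src = np.float32([k2[m.trainIdx].pt for m in matches]).reshape(-1, 1, 2)
    dst = np.float32([k1[m.queryIdx].pt for m in matches]).reshape(-1, 1, 2)
    # robust plane-induced homography from the current to the reference frame
    H, _ = cv2.findHomography(src, dst, cv2.RANSAC, 3.0)
    # map agent footpoints into the reference image plane
    return cv2.perspectiveTransform(footpoints, H)
```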
E3CM: Epipolar-Constrained Cascade Correspondence Matching
Accurate and robust correspondence matching is of utmost importance for
various 3D computer vision tasks. However, traditional explicit
programming-based methods often struggle to handle challenging scenarios, and
deep learning-based methods require large well-labeled datasets for network
training. In this article, we introduce Epipolar-Constrained Cascade
Correspondence (E3CM), a novel approach that addresses these limitations.
Unlike traditional methods, E3CM leverages pre-trained convolutional neural
networks to match correspondences, without requiring annotated data for any
network training or fine-tuning. Our method utilizes epipolar constraints to
guide the matching process and incorporates a cascade structure for progressive
refinement of matches. We extensively evaluate the performance of E3CM through
comprehensive experiments and demonstrate its superiority over existing
methods. To promote further research and facilitate reproducibility, we make
our source code publicly available at https://mias.group/E3CM.
Comment: accepted to Neurocomputing
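While the cascade itself is learned, the epipolar constraint E3CM relies on is classical: a correspondence (x1, x2) must satisfy x2^T F x1 ≈ 0. A minimal sketch of such a filter, assuming a known fundamental matrix F:

```python
import numpy as np

def epipolar_filter(pts1, pts2, F, thresh=1.5):
    """Keep matches whose symmetric epipolar distance is below thresh pixels.

    pts1, pts2: (N, 2) candidate correspondences; F: 3x3 fundamental matrix.
    """
    ones = np.ones((len(pts1), 1))
    x1 = np.hstack([pts1, ones])  # homogeneous image coordinates
    x2 = np.hstack([pts2, ones])
    l2 = x1 @ F.T                 # epipolar lines of pts1 in image 2
    l1 = x2 @ F                   # epipolar lines of pts2 in image 1
    d2 = np.abs(np.sum(x2 * l2, axis=1)) / np.linalg.norm(l2[:, :2], axis=1)
    d1 = np.abs(np.sum(x1 * l1, axis=1)) / np.linalg.norm(l1[:, :2], axis=1)
    return (d1 < thresh) & (d2 < thresh)
```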
Deep Homography Estimation for Dynamic Scenes
Homography estimation is an important step in many computer vision problems.
Recently, deep neural network methods have been shown to compare favorably with
traditional methods for this problem. However, these methods do not consider
dynamic content in the input images: they are trained only on image pairs that
can be perfectly aligned using homographies. This paper
investigates and discusses how to design and train a deep neural network that
handles dynamic scenes. We first collect a large video dataset with dynamic
content. We then develop a multi-scale neural network and show that when
properly trained using our new dataset, this neural network can already handle
dynamic scenes to some extent. To estimate a homography of a dynamic scene in a
more principled way, we need to identify the dynamic content. Since dynamic
content detection and homography estimation are two tightly coupled tasks, we
follow the multi-task learning principles and augment our multi-scale network
such that it jointly estimates the dynamics masks and homographies. Our
experiments show that our method can robustly estimate homography for
challenging scenarios with dynamic scenes, blur artifacts, or lack of textures.
Comment: CVPR 2020, https://github.com/lcmhoang/hmg-dynamics
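A hedged sketch of the multi-task idea (the names and exact loss form are illustrative, not the paper's implementation): the photometric alignment error is downweighted wherever the predicted dynamics mask flags moving content, with a small regularizer preventing the trivial all-dynamic solution.

```python
import torch

def masked_alignment_loss(warped, target, dyn_mask, beta=0.1):
    """warped: source image warped by the predicted homography, [B, C, H, W].

    dyn_mask: predicted probability of dynamic content, [B, 1, H, W].
    """
    photo = (warped - target).abs().mean(dim=1, keepdim=True)
    static_weight = 1.0 - dyn_mask
    # downweight photometric error on dynamic pixels; the regularizer keeps
    # the mask from trivially declaring the whole image dynamic
    return (static_weight * photo).mean() + beta * dyn_mask.mean()
```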
TVCalib: Camera Calibration for Sports Field Registration in Soccer
Sports field registration in broadcast videos is typically interpreted as the
task of homography estimation, which provides a mapping between a planar field
and the corresponding visible area of the image. In contrast to previous
approaches, we consider the task as a camera calibration problem. First, we
introduce a differentiable objective function that is able to learn the camera
pose and focal length from segment correspondences (e.g., lines, point clouds),
based on pixel-level annotations for segments of a known calibration object.
The calibration module iteratively minimizes the segment reprojection error
induced by the estimated camera parameters. Second, we propose a novel approach
for 3D sports field registration from broadcast soccer images. In contrast to
the typical solution, which refines an initial estimate in a subsequent step,
our solution performs the registration in a single step. The proposed method is evaluated for sports field
registration on two datasets and achieves superior results compared to two
state-of-the-art approaches.
Comment: Accepted for publication at WACV'23
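The calibration-as-optimization idea can be sketched as a toy example (illustrative, not the TVCalib code; the real objective uses point-to-segment distances over annotated segments rather than fixed point targets):

```python
import torch

def rot(ax, ay, az):
    """Differentiable rotation matrix from Euler angles (x, y, z order)."""
    z, o = torch.zeros(()), torch.ones(())
    Rx = torch.stack([o, z, z, z, ax.cos(), -ax.sin(), z, ax.sin(), ax.cos()]).reshape(3, 3)
    Ry = torch.stack([ay.cos(), z, ay.sin(), z, o, z, -ay.sin(), z, ay.cos()]).reshape(3, 3)
    Rz = torch.stack([az.cos(), -az.sin(), z, az.sin(), az.cos(), z, z, z, o]).reshape(3, 3)
    return Rz @ Ry @ Rx

# known field points (meters) and their hypothetical pixel annotations
pts3d = torch.tensor([[0., 0., 0.], [52.5, 0., 0.], [0., 34., 0.], [52.5, 34., 0.]])
uv_obs = torch.tensor([[310., 420.], [930., 400.], [350., 180.], [900., 175.]])

angles = torch.zeros(3, requires_grad=True)
t = torch.tensor([-26.0, -17.0, 40.0], requires_grad=True)  # rough initialization
f = torch.tensor(1000.0, requires_grad=True)
opt = torch.optim.Adam([angles, t, f], lr=1e-2)

for _ in range(500):  # toy optimization loop
    opt.zero_grad()
    Xc = pts3d @ rot(*angles).T + t                 # world -> camera
    uv = f * Xc[:, :2] / Xc[:, 2:3] + torch.tensor([640.0, 360.0])
    loss = ((uv - uv_obs) ** 2).mean()              # reprojection error
    loss.backward()
    opt.step()
```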
Pose estimation system based on monocular cameras
Our world is full of wonders. It is filled with mysteries and challenges which,
through the ages, have inspired human civilization to grow, philosophically and
sociologically. In time, humans reached their own physical limitations;
nevertheless, we created technology to help us overcome them. Like the ancient
undiscovered lands, we are drawn to the discovery and innovation of our time.
All of this is possible due to a very human characteristic: our imagination.
The world that surrounds us is mostly already discovered, but with the power of
computer vision (CV) and augmented reality (AR), we are able to live in multiple
hidden universes alongside our own. With the increasing performance and
capabilities of current mobile devices, AR can become what we dream it to be.
There are still many obstacles, but this future is already our reality, and with
evolving technologies closing the gap between the real and the virtual world, it
will soon be possible to immerse ourselves in other dimensions, or to fuse them
with our own.
This thesis focuses on the development of a system to estimate the camera's pose
in the real world with respect to the virtual-world axes. The work was developed
as a sub-module integrated in the M5SAR project: Mobile Five Senses Augmented
Reality System for Museums, aiming at a more immersive experience through the
total or partial replacement of the environment's surroundings. It targets
mainly indoor man-made buildings and their typical rectangular cuboid shape.
Knowing the direction of the user's camera, we can superimpose dynamic AR content, inviting the user to explore the hidden worlds.
The M5SAR project introduced a new way to explore existing historical museums
through the five human senses: hearing, smell, taste, touch, and vision. With
this innovative technology, the user can enhance their visit and immerse
themselves in a virtual world blended with our reality. A mobile application was
built around an innovative framework, MIRAR (Mobile Image Recognition based
Augmented Reality), which provides object recognition, navigation, and the
projection of additional AR information to enrich the user's visit, offering
intuitive and compelling information about the available artworks and exploring
the senses of hearing and vision. A specially designed device was built to
explore the three remaining senses: smell, taste, and touch. When attached to a
mobile device, either smartphone or tablet, it pairs with it and automatically
reacts in sync with the narrative offered for the artwork, immersing the user in
a sensorial experience.
As mentioned above, the work presented in this thesis concerns a sub-module of
MIRAR responsible for environment detection and the superimposition of AR
content. With the main goal being the full replacement of the walls' contents,
while optionally keeping the artwork visible, an additional challenge arose from
the limitation of using only monocular cameras. Without depth information, a 2D
image of an environment does not, by itself, convey to a computer the
three-dimensional layout of the real world. Nevertheless, man-made buildings
tend to follow a rectangular approach to room construction, which makes it
possible to predict where the vanishing point of an environment image lies (a
minimal sketch of this step follows below), allowing the reconstruction of an
environment's layout from a single 2D image. Furthermore, combining this
information with an initial localization, obtained through an improved image
recognition step, to retrieve the camera's spatial position with respect to
real-world and virtual-world coordinates, i.e., pose estimation, made it
possible to superimpose localized AR content onto the user's mobile device
frame, immersing a museum visitor in another era correlated with the historical
period of the artworks on display. The work developed for this thesis also
contributed an improved method for rectifying and retrieving planar surfaces in
space, a hybrid and scalable multiple-image matching system, a more stable
outlier filter applied to the camera axes, and a continuous tracking system that
works with uncalibrated cameras and maintains the surface superimposition even
at particularly obtuse viewing angles.
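A minimal sketch of the vanishing-point step, assuming line segments of one dominant direction have already been detected (e.g., with an LSD detector): each segment defines a homogeneous line, and their least-squares intersection is the smallest singular vector of the stacked line equations.

```python
import numpy as np

def vanishing_point(segments):
    """segments: (N, 4) endpoints x1, y1, x2, y2 of one dominant direction."""
    n = len(segments)
    p1 = np.hstack([segments[:, :2], np.ones((n, 1))])
    p2 = np.hstack([segments[:, 2:], np.ones((n, 1))])
    lines = np.cross(p1, p2)  # homogeneous line through each segment
    lines /= np.linalg.norm(lines[:, :2], axis=1, keepdims=True)
    # the point minimizing squared point-line distances is the smallest
    # singular vector of the stacked line equations
    _, _, Vt = np.linalg.svd(lines)
    v = Vt[-1]
    return v[:2] / v[2]  # note: v[2] ~ 0 means the point is near infinity
```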
Furthermore, a novel method using deep learning models for semantic segmentation
was introduced for indoor layout estimation from monocular images. Contrary to
the previously developed methods, no geometric calculations are needed to
achieve near state-of-the-art performance with a fraction of the parameters
required by similar methods. Unlike the earlier work presented in this thesis,
this method performs well even in unseen and cluttered rooms, provided they
follow the Manhattan assumption. An additional lightweight application that
retrieves the camera pose estimate using the proposed method is also presented.
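As a toy illustration of turning a semantic segmentation into a layout cue (the label ids and the per-column rule are assumptions for illustration, not the thesis method):

```python
import numpy as np

FLOOR, WALL = 1, 2  # hypothetical label ids

def wall_floor_boundary(seg):
    """seg: (H, W) integer label map; returns boundary row per column (-1 if none)."""
    height, width = seg.shape
    boundary = np.full(width, -1)
    for u in range(width):
        col = seg[:, u]
        # lowest row where a wall pixel sits directly above a floor pixel
        idx = np.where((col[:-1] == WALL) & (col[1:] == FLOOR))[0]
        if len(idx):
            boundary[u] = idx[-1] + 1
    return boundary
```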
Learning Rotation-Equivariant Features for Visual Correspondence
Extracting discriminative local features that are invariant to imaging
variations is an integral part of establishing correspondences between images.
In this work, we introduce a self-supervised learning framework to extract
discriminative rotation-invariant descriptors using group-equivariant CNNs.
Thanks to employing group-equivariant CNNs, our method effectively learns to
obtain rotation-equivariant features and their orientations explicitly, without
having to perform sophisticated data augmentations. The resultant features and
their orientations are further processed by group aligning, a novel invariant
mapping technique that shifts the group-equivariant features by their
orientations along the group dimension. Our group aligning technique achieves
rotation-invariance without any collapse of the group dimension and thus
eschews loss of discriminability. The proposed method is trained end-to-end in
a self-supervised manner, where we use an orientation alignment loss for the
orientation estimation and a contrastive descriptor loss for local descriptors
robust to geometric/photometric variations. Our method demonstrates
state-of-the-art matching accuracy among existing rotation-invariant
descriptors under varying rotation and also shows competitive results when
transferred to the task of keypoint matching and camera pose estimation.
Comment: Accepted to CVPR 2023, Project webpage at http://cvlab.postech.ac.kr/research/RELF
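The group-aligning step admits a compact sketch (a simplification assuming a cyclic rotation group, not the authors' code): equivariant features are circularly shifted along the group dimension by the estimated orientation index, achieving rotation invariance while preserving the group dimension.

```python
import torch

def group_align(feats, ori_logits):
    """feats: [B, C, G, H, W] features equivariant over a cyclic group C_G.

    ori_logits: [B, G] orientation distribution (e.g., pooled from feats).
    """
    shift = ori_logits.argmax(dim=1)  # dominant orientation index per sample
    aligned = torch.stack([
        torch.roll(f, shifts=-int(s), dims=1)  # circular shift along group dim
        for f, s in zip(feats, shift)
    ])
    return aligned  # invariant up to group discretization; group dim preserved
```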
Semantics and Planar Geometry for self-supervised Road Scene Understanding
In this thesis we leverage domain knowledge, specifically of road scenes, to provide a self-supervision signal, reduce the labelling requirements, improve the convergence of training and introduce interpretable parameters based on vastly simplified models. Specifically, we chose to research the value of applying domain knowledge to the popular tasks of semantic segmentation and relative pose estimation towards better understanding road scenes. In particular we leverage semantic and geometric scene understanding separately in the first two contributions and then seek to combine them in the third contribution.
Firstly, we show that hierarchical structure in class labels for training networks for tasks such as semantic segmentation can be useful for boosting performance and accelerating training. Moreover, we present a hierarchical loss implementation which differentiates between minor and serious errors, and evaluate our method on the Vistas road scene dataset.
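One way such a hierarchical loss can be realized (a sketch assuming a two-level class hierarchy; the thesis implementation may differ): fine-class probabilities are aggregated into their coarse parents, so confusions within the correct parent ("minor" errors) are penalized less than confusions across parents ("serious" errors).

```python
import torch
import torch.nn.functional as F

def hierarchical_loss(logits, target_fine, fine_to_coarse, lam=0.5):
    """logits: [B, C_fine] (flatten pixels into B for segmentation).

    fine_to_coarse: LongTensor of length C_fine mapping each fine class to
    its coarse parent; lam weighs the "serious error" term.
    """
    probs = logits.softmax(dim=1)
    n_coarse = int(fine_to_coarse.max()) + 1
    coarse_probs = probs.new_zeros(probs.shape[0], n_coarse)
    coarse_probs.index_add_(1, fine_to_coarse, probs)  # sum fine probs per parent
    target_coarse = fine_to_coarse[target_fine]
    loss_fine = F.cross_entropy(logits, target_fine)
    loss_coarse = F.nll_loss(torch.log(coarse_probs + 1e-8), target_coarse)
    return loss_fine + lam * loss_coarse
```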
Secondly, for the task of self-supervised monocular relative pose estimation, we propose a ground-relative formulation for the network output which roots our problem in a locally planar geometry. Current self-supervised methods generally require over-parameterised training of both a pose and a depth network; our method entirely removes the need for depth estimation, dramatically simplifying the problem, while obtaining competitive results on the KITTI visual odometry dataset.
Thirdly, we combine semantics with our geometric formulation by extracting the road plane with semantic segmentation and robustly fitting homographies to fine-scale correspondences between coarsely aligned image pairs. We show that, with the aid of our geometric knowledge and a known analytical method, we can decompose these homographies into camera-relative pose, providing a self-supervision signal that significantly improves our visual odometry performance at both training and test time. In particular, we form a non-differentiable module which computes pseudo-labels in real time, avoiding training complexity and additionally allowing for test-time performance boosting, helping to tackle the bias present in deep learning methods.
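The decomposition mentioned above is a standard analytical result: a homography induced by a plane satisfies H ∝ K (R + t nᵀ / d) K⁻¹ and yields up to four relative-pose candidates. A minimal sketch on synthetic correspondences, assuming known intrinsics K (OpenCV exposes the decomposition directly):

```python
import cv2
import numpy as np

K = np.array([[800.0, 0.0, 320.0], [0.0, 800.0, 240.0], [0.0, 0.0, 1.0]])  # assumed intrinsics

# hypothetical road-plane correspondences generated from a known homography
pts_ref = (np.random.rand(50, 2) * [640, 480]).astype(np.float32)
H_true = np.array([[1.02, 0.01, 4.0], [0.0, 1.01, -3.0], [1e-5, 0.0, 1.0]])
pts_cur = cv2.perspectiveTransform(pts_ref.reshape(-1, 1, 2), H_true).reshape(-1, 2)

H, inliers = cv2.findHomography(pts_ref, pts_cur, cv2.RANSAC, 3.0)
n_sols, Rs, ts, normals = cv2.decomposeHomographyMat(H, K)
# up to four (R, t/d, n) candidates; keep the one whose plane normal best
# matches the expected road-plane orientation in the camera frame
```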