2,068 research outputs found
Latent-Class Hough Forests for 3D object detection and pose estimation of rigid objects
In this thesis we propose a novel framework, Latent-Class Hough Forests, for 3D object detection and pose estimation in heavily cluttered and occluded scenes. First, we adapt the state-of-the-art template-based representation, LINEMOD [34, 36], into a scale-invariant patch descriptor and integrate it into a regression forest using a novel template-based split function. Rather than explicitly collecting representative negative samples, our method is trained on positive samples only, and we treat the class distributions at the leaf nodes as latent variables. During inference we iteratively update these distributions, providing accurate estimation of background clutter and foreground occlusions and thus a better detection rate. As a by-product, the latent class distributions can provide accurate occlusion-aware segmentation masks, even in the multi-instance scenario. In addition to an existing public dataset, which contains only single-instance sequences with large amounts of clutter, we have collected a new, more challenging dataset for multiple-instance detection containing heavy 2D and 3D clutter as well as foreground occlusions. We evaluate the Latent-Class Hough Forest on both datasets, where we outperform state-of-the-art methods. Open Access
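As a rough illustration of the voting step that Hough-based detectors build on, the sketch below accumulates weighted object-centre votes from image patches and returns the strongest bin. It is a generic Hough-voting example, not the Latent-Class Hough Forest described in the abstract: the function name `hough_vote`, the patch tuple layout, and the `bin_size` parameter are all assumptions made for illustration.

```python
from collections import defaultdict


def hough_vote(patches, bin_size=4):
    """Accumulate centre votes from patches and return the strongest bin.

    Each patch is a tuple (x, y, dx, dy, weight): a patch located at (x, y)
    votes for an object centre at (x + dx, y + dy) with the given weight.
    The tuple layout is hypothetical, chosen only for this sketch.
    """
    accumulator = defaultdict(float)
    for x, y, dx, dy, weight in patches:
        cx, cy = x + dx, y + dy
        # Quantise the voted centre into a square bin of side bin_size.
        key = (int(cx // bin_size), int(cy // bin_size))
        accumulator[key] += weight
    if not accumulator:
        return None
    best_bin = max(accumulator, key=accumulator.get)
    # Report the centre of the winning bin together with its total weight.
    centre = ((best_bin[0] + 0.5) * bin_size, (best_bin[1] + 0.5) * bin_size)
    return centre, accumulator[best_bin]


patches = [
    (10, 10, 10, 10, 1.0),  # votes for (20, 20)
    (30, 20, -9, 1, 1.0),   # votes for (21, 21), same bin as above
    (0, 0, 50, 50, 0.5),    # an outlier vote elsewhere
]
result = hough_vote(patches)
```

In this toy input the two agreeing patches outweigh the outlier, which is the basic reason voting schemes tolerate clutter and partial occlusion.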
Pose estimation system based on monocular cameras
Our world is full of wonders. It is filled with mysteries and challenges which, through
the ages, have inspired and driven human civilization to grow, either philosophically
or sociologically. In time, humans reached their own physical limitations;
nevertheless, we created technology to help us overcome them. As with the undiscovered
lands of old, we are drawn to the discovery and innovation of our time. All of this is
possible due to a very human characteristic: our imagination.
The world that surrounds us is mostly already discovered, but with the power of
computer vision (CV) and augmented reality (AR), we are able to live in multiple hidden
universes alongside our own. With the increasing performance and capabilities of
current mobile devices, AR can be what we dream it to be. There are still many obstacles,
but this future is already our reality, and with evolving technologies closing
the gap between the real and the virtual world, it will soon be possible for us to surround
ourselves with other dimensions, or fuse them with our own.
This thesis focuses on the development of a system to estimate the camera’s pose
in the real world relative to the axes of the virtual world. The work was developed
as a sub-module integrated into the M5SAR project: Mobile Five Senses Augmented
Reality System for Museums, aiming at a more immersive experience through the
total or partial replacement of the surrounding environment. It is based mainly on
man-made building interiors and their typical rectangular cuboid shape. Knowing
the direction of the user’s camera, we can then superimpose dynamic AR content, inviting the user to explore the hidden worlds.
The M5SAR project introduced a new way to explore existing historical museums
by engaging the five human senses: hearing, smell, taste, touch, and vision. With
this innovative technology, users are able to enhance their visit and immerse
themselves in a virtual world blended with our reality. A mobile application
was built containing an innovative framework: MIRAR - Mobile Image Recognition
based Augmented Reality - which provides object recognition, navigation, and the
projection of additional AR information in order to enrich the users’ visit, offering intuitive
and compelling information about the available artworks through the senses of hearing
and vision. A specially designed device was built to engage the remaining
three senses: smell, taste, and touch. When attached to a mobile device, either a
smartphone or a tablet, it pairs with it and reacts automatically to the
narrative related to the artwork, immersing the user in a sensory experience.
As mentioned above, the work presented in this thesis concerns a sub-module
of MIRAR dealing with environment detection and the superimposition of AR content.
The main goal, the full replacement of the walls’ contents with the
option of keeping the artwork visible or not, presented an additional challenge
given the limitation of using only monocular cameras. Without depth information,
a 2D image of an environment does not convey to a computer the three-dimensional
layout of the real world. Nevertheless, man-made buildings tend to follow
a rectangular approach to room construction, which allows a prediction
of where the vanishing point in any image of the environment lies, enabling the reconstruction
of the environment’s layout from a 2D image. Furthermore, combining
this information with an initial localization through improved image recognition,
to retrieve the camera’s spatial position relative to the real-world and
virtual-world coordinates, that is, pose estimation, made it possible to superimpose
specifically localized AR content over the frame of the user’s mobile device, in order to immerse,
e.g., a museum visitor in another era related to the historical period of the artworks
on display. This thesis also presents improved rectification and retrieval of planar surfaces in space, a hybrid and scalable multiple-image
matching system, a more stable outlier filter applied to the camera’s axes,
and a continuous tracking system that works with uncalibrated cameras and is able to
handle particularly obtuse angles while still maintaining the surface superimposition.
Furthermore, a novel method using deep learning models for semantic segmentation
was introduced for indoor layout estimation based on monocular images. Unlike
previously developed methods, it requires no geometric calculations
to achieve near state-of-the-art performance, with a fraction of the parameters
required by similar methods. Unlike the earlier work presented in this thesis,
this method performs well even in unseen and cluttered rooms, provided they follow the Manhattan
assumption. An additional lightweight application to retrieve the camera pose
estimation is presented using the proposed method.
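The vanishing-point reasoning mentioned in the abstract above can be illustrated with a small sketch. Two image segments that are parallel in the scene (for example, two wall edges) generally converge at a vanishing point in the image, which can be computed as the intersection of their supporting lines in homogeneous coordinates. This is a textbook construction, not the thesis's actual pipeline; the function names and the two-segment input are assumptions made for illustration.

```python
def cross(a, b):
    """Cross product of two 3-vectors given as tuples."""
    return (
        a[1] * b[2] - a[2] * b[1],
        a[2] * b[0] - a[0] * b[2],
        a[0] * b[1] - a[1] * b[0],
    )


def line_through(p, q):
    """Homogeneous line through two 2D image points p and q."""
    return cross((p[0], p[1], 1.0), (q[0], q[1], 1.0))


def vanishing_point(seg_a, seg_b, eps=1e-9):
    """Intersect the lines supporting two image segments.

    Each segment is a pair of 2D points. Returns (x, y), or None when the
    lines are parallel in the image (the intersection lies at infinity).
    """
    x, y, w = cross(line_through(*seg_a), line_through(*seg_b))
    if abs(w) < eps:
        return None
    return (x / w, y / w)


# Two segments converging towards the same image point.
vp = vanishing_point(((0, 0), (1, 1)), ((4, 0), (3, 1)))
# Two segments that stay parallel in the image: no finite vanishing point.
no_vp = vanishing_point(((0, 0), (1, 0)), ((0, 1), (1, 1)))
```

Under the Manhattan assumption a room has three dominant, mutually orthogonal directions, so a real system would typically estimate one vanishing point per direction from many noisy segments rather than from a single pair.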
T-LESS: An RGB-D Dataset for 6D Pose Estimation of Texture-less Objects
We introduce T-LESS, a new public dataset for estimating the 6D pose, i.e.
translation and rotation, of texture-less rigid objects. The dataset features
thirty industry-relevant objects with no significant texture and no
discriminative color or reflectance properties. The objects exhibit symmetries
and mutual similarities in shape and/or size. Compared to other datasets, a
unique property is that some of the objects are parts of others. The dataset
includes training and test images that were captured with three synchronized
sensors, specifically a structured-light and a time-of-flight RGB-D sensor and
a high-resolution RGB camera. There are approximately 39K training and 10K test
images from each sensor. Additionally, two types of 3D models are provided for
each object, i.e. a manually created CAD model and a semi-automatically
reconstructed one. Training images depict individual objects against a black
background. Test images originate from twenty test scenes having varying
complexity, which increases from simple scenes with several isolated objects to
very challenging ones with multiple instances of several objects and with a
high amount of clutter and occlusion. The images were captured from a
systematically sampled view sphere around the object/scene, and are annotated
with accurate ground truth 6D poses of all modeled objects. Initial evaluation
results indicate that the state of the art in 6D object pose estimation has
ample room for improvement, especially in difficult cases with significant
occlusion. The T-LESS dataset is available online at cmp.felk.cvut.cz/t-less. Comment: WACV 201
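For readers unfamiliar with the term, a 6D pose combines a 3D rotation and a 3D translation, and applying it maps a point from the object's model frame into the camera frame as R p + t. The sketch below shows this with plain Python lists; the helper names `rotation_z` and `apply_pose` are hypothetical and are not part of the T-LESS toolkit.

```python
import math


def rotation_z(theta):
    """3x3 rotation matrix for a rotation of theta radians about the z axis."""
    c, s = math.cos(theta), math.sin(theta)
    return [
        [c, -s, 0.0],
        [s, c, 0.0],
        [0.0, 0.0, 1.0],
    ]


def apply_pose(R, t, p):
    """Map a model-frame point p into the camera frame: R @ p + t."""
    return tuple(
        sum(R[i][j] * p[j] for j in range(3)) + t[i]
        for i in range(3)
    )


# Rotate the point (1, 0, 0) by 90 degrees about z, then shift it by 1 along z.
q = apply_pose(rotation_z(math.pi / 2), (0.0, 0.0, 1.0), (1.0, 0.0, 0.0))
```

Here `q` is approximately (0, 1, 1), up to floating-point rounding. A ground-truth annotation of the kind the abstract describes would supply one such rotation and translation per object instance per image.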
Dynamic Scene Reconstruction and Understanding
Traditional approaches to 3D reconstruction have achieved remarkable progress in static scene acquisition. The acquired data serves as priors or benchmarks for many vision and graphics tasks, such as object detection and robotic navigation. Thus, obtaining interpretable and editable representations from a raw monocular RGB-D video sequence is an outstanding goal in scene understanding. However, acquiring an interpretable representation becomes significantly more challenging when a scene contains dynamic activities; for example, a moving camera, rigid object movement, and non-rigid motions. These dynamic scene elements introduce a scene factorization problem, i.e., dividing a scene into elements and jointly estimating elements’ motion and geometry. Moreover, the monocular setting brings in the problems of tracking and fusing partially occluded objects as they are scanned from one viewpoint at a time.
This thesis explores several ideas for acquiring an interpretable model in dynamic environments. Firstly, we utilize synthetic assets such as floor plans and object meshes to generate dynamic data for training and evaluation. Then, we explore the idea of learning geometry priors with an instance segmentation module, which predicts the location and grouping of indoor objects. We use the learned geometry priors to infer the occluded object geometry for tracking and reconstruction. While instance segmentation modules usually have a generalization issue, i.e., struggling to handle unknown objects, we observe that the empty space information in the background geometry is more reliable for detecting moving objects. Thus, we propose a segmentation-by-reconstruction strategy for acquiring rigidly-moving objects and backgrounds. Finally, we present a novel neural representation to learn a factorized scene representation, reconstructing every dynamic element. The proposed model supports both rigid and non-rigid motions without pre-trained templates. We demonstrate that our systems and representation improve the reconstruction quality on synthetic test sets and real-world scans.
DTF-Net: Category-Level Pose Estimation and Shape Reconstruction via Deformable Template Field
Estimating 6D poses and reconstructing 3D shapes of objects in open-world
scenes from RGB-depth image pairs is challenging. Many existing methods rely on
learning geometric features that correspond to specific templates while
disregarding shape variations and pose differences among objects in the same
category. As a result, these methods underperform when handling unseen object
instances in complex environments. In contrast, other approaches aim to achieve
category-level estimation and reconstruction by leveraging normalized geometric
structure priors, but the static prior-based reconstruction struggles with
substantial intra-class variations. To solve these problems, we propose the
DTF-Net, a novel framework for pose estimation and shape reconstruction based
on implicit neural fields of object categories. In DTF-Net, we design a
deformable template field to represent the general category-wise shape latent
features and intra-category geometric deformation features. The field
establishes continuous shape correspondences, deforming the category template
into arbitrary observed instances to accomplish shape reconstruction. We
introduce a pose regression module that shares the deformation features and
template codes from the fields to estimate the accurate 6D pose of each object
in the scene. We integrate a multi-modal representation extraction module to
extract object features and semantic masks, enabling end-to-end inference.
Moreover, during training, we implement a shape-invariant training strategy and
a viewpoint sampling method to further enhance the model's capability to
extract object pose features. Extensive experiments on the REAL275 and CAMERA25
datasets demonstrate the superiority of DTF-Net in both synthetic and real
scenes. Furthermore, we show that DTF-Net effectively supports grasping tasks
with a real robot arm. Comment: The first two authors contributed equally. Paper accepted by
ACM MM 202
Contribuciones a la estimación de la pose de la cámara en aplicaciones industriales de realidad aumentada
Augmented Reality (AR) aims to complement the user's visual perception of the environment by superimposing virtual elements. The main challenge of this technology is to combine the virtual and real worlds in a precise and natural way. To achieve this, estimating the user's position and orientation in both worlds at all times is a crucial task. Currently, there are numerous techniques and algorithms for camera pose estimation. However, the use of synthetic square markers has become the fastest, most robust and simplest solution in these cases. In this scope, a large number of marker detection systems have been developed. Nevertheless, most of them present some limitations: (1) their unattractive and non-customizable visual appearance prevents their use in industrial products, and (2) the detection rate is drastically reduced in the presence of noise, blurring and occlusions. This doctoral dissertation addresses the above-mentioned limitations. First, a comparison has been made between the different marker detection systems currently available in the literature, emphasizing the limitations of each. Second, a novel approach to design, detect and track customized markers capable of easily adapting to the visual constraints of commercial products has been developed. Third, a method that combines the detection of black and white square markers with keypoints and contours has been implemented to estimate the camera position in AR applications. The main motivation of this work is to offer a versatile alternative (based on contours and keypoints) in cases where, due to noise, blurring or occlusions, it is not possible to identify markers in the images.
Finally, a method for reconstruction and semantic segmentation of 3D objects using square markers in photogrammetry processes has been presented.