Active object recognition for 2D and 3D applications
Includes bibliographical references. Active object recognition provides a mechanism for selecting informative viewpoints to complete recognition tasks as quickly and accurately as possible. One can manipulate the position of the camera or the object of interest to obtain more useful information. This approach can improve the computational efficiency of the recognition task by only processing viewpoints selected based on the amount of relevant information they contain. Active object recognition methods are based around how to select the next best viewpoint and the integration of the extracted information. Most active recognition methods do not use local interest points, which have been shown to work well in other recognition tasks, and are tested on images containing a single object with no occlusions or clutter. In this thesis we investigate using local interest points (SIFT) in probabilistic and non-probabilistic settings for active single and multiple object and viewpoint/pose recognition. Test images used contain objects that are occluded and occur in significant clutter. Visually similar objects are also included in our dataset. Initially we introduce a non-probabilistic 3D active object recognition system which consists of a mechanism for selecting the next best viewpoint and an integration strategy to provide feedback to the system. A novel approach to weighting the uniqueness of features extracted is presented, using a vocabulary tree data structure. This process is then used to determine the next best viewpoint by selecting the one with the highest number of unique features. A Bayesian framework uses the modified statistics from the vocabulary structure to update the system's confidence in the identity of the object. New test images are only captured when the belief hypothesis is below a predefined threshold.
This vocabulary tree method is tested against randomly selecting the next viewpoint and a state-of-the-art active object recognition method by Kootstra et al. Our approach outperforms both methods by correctly recognizing more objects with less computational expense. This vocabulary tree method is extended for use in a probabilistic setting to improve the object recognition accuracy. We introduce Bayesian approaches for object recognition and object and pose recognition. Three likelihood models are introduced which incorporate various parameters and levels of complexity. The occlusion model, which includes geometric information and variables that cater for the background distribution and occlusion, correctly recognizes all objects on our challenging database. This probabilistic approach is further extended for recognizing multiple objects and poses in a test image. We show through experiments that this model can recognize multiple objects which occur in close proximity to distractor objects. Our viewpoint selection strategy is also extended to the multiple object application and performs well when compared to randomly selecting the next viewpoint, the activation model and mutual information. We also study the impact of using active vision for shape recognition. Fourier descriptors are used as input to our shape recognition system with mutual information as the active vision component. We build multinomial and Gaussian distributions using this information, which correctly recognize a sequence of objects. We demonstrate the effectiveness of active vision in object recognition systems. We show that even in different recognition applications using different low level inputs, incorporating active vision improves the overall accuracy and decreases the computational expense of object recognition systems.
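The loop described in this abstract (pick the viewpoint with the most unique features, update the belief with Bayes' rule, stop once the belief passes a threshold) can be sketched roughly as follows. This is a minimal illustration, not the thesis's implementation: the function names, the per-view likelihood tables, the unique-feature counts, and the threshold value are all hypothetical stand-ins for quantities the thesis derives from SIFT features and a vocabulary tree.

```python
from typing import Dict, List, Tuple


def bayes_update(belief: Dict[str, float],
                 likelihood: Dict[str, float]) -> Dict[str, float]:
    """Posterior is proportional to prior times likelihood, renormalised."""
    unnorm = {obj: belief[obj] * likelihood[obj] for obj in belief}
    total = sum(unnorm.values())
    if total == 0.0:
        return dict(belief)  # uninformative observation: keep the prior
    return {obj: p / total for obj, p in unnorm.items()}


def recognise(view_likelihoods: Dict[int, Dict[str, float]],
              unique_counts: Dict[int, int],
              threshold: float = 0.9) -> Tuple[str, Dict[str, float], List[int]]:
    """Capture views until the strongest hypothesis reaches the threshold.

    view_likelihoods: view id -> {object: P(observation | object)} (assumed given)
    unique_counts:    view id -> number of unique features (assumed given)
    """
    objects = list(next(iter(view_likelihoods.values())))
    belief = {obj: 1.0 / len(objects) for obj in objects}  # uniform prior
    remaining = set(view_likelihoods)
    used: List[int] = []
    # A new view is only captured while the belief is below the threshold.
    while remaining and max(belief.values()) < threshold:
        # Next best view: the one with the highest number of unique features.
        view = max(sorted(remaining), key=lambda v: unique_counts[v])
        remaining.remove(view)
        used.append(view)
        belief = bayes_update(belief, view_likelihoods[view])
    best = max(belief, key=belief.get)
    return best, belief, used
```

With two candidate objects and three views, the sketch stops after a single view if that view is discriminative enough, which is the source of the computational saving the abstract describes.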
Model-Based Environmental Visual Perception for Humanoid Robots
The visual perception of a robot should answer two fundamental questions: What? and Where? In order to properly and efficiently reply to these questions, it is essential to establish a bidirectional coupling between the external stimuli and the internal representations. This coupling links the physical world with the inner abstraction models by sensor transformation, recognition, matching and optimization algorithms. The objective of this PhD is to establish this sensor-model coupling
Research on Symbolic Inference in Computational Vision
This paper provides an overview of ongoing research in the GRASP laboratory which focuses on the general problem of symbolic inference in computational vision. In this report we describe a conceptual framework for this research, and describe our current research programs in the component areas which support this work
3D scanning of cultural heritage with consumer depth cameras
Three-dimensional reconstruction of cultural heritage objects is an expensive and time-consuming process. Recent consumer real-time depth acquisition devices, like Microsoft Kinect, allow very fast and simple acquisition of 3D views. However, 3D scanning with such devices is a challenging task due to the limited accuracy and reliability of the acquired data. This paper introduces a 3D reconstruction pipeline suited to use consumer depth cameras as hand-held scanners for cultural heritage objects. Several new contributions have been made to achieve this result. They include an ad-hoc filtering scheme that exploits the model of the error on the acquired data and a novel algorithm for the extraction of salient points exploiting both depth and color data. Then the salient points are used within a modified version of the ICP algorithm that exploits both geometry and color distances to precisely align the views even when geometry information is not sufficient to constrain the registration. The proposed method, although applicable to generic scenes, has been tuned to the acquisition of sculptures, and the experimental results indicate that it performs well in this setting
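To illustrate why a color term helps when geometry alone is ambiguous, the correspondence step of an ICP-style alignment can be sketched with a combined distance. This is only a toy sketch under stated assumptions: the additive form of the distance, the `color_weight` parameter, and the brute-force search are illustrative choices, not the formulation used in the paper.

```python
import math
from typing import List, Sequence, Tuple

# A salient point as (xyz position, rgb color); both layouts are assumptions.
ColoredPoint = Tuple[Sequence[float], Sequence[float]]


def combined_distance(a: ColoredPoint, b: ColoredPoint,
                      color_weight: float = 0.5) -> float:
    """Euclidean geometric distance plus a weighted Euclidean color distance."""
    geometric = math.dist(a[0], b[0])
    chromatic = math.dist(a[1], b[1])
    return geometric + color_weight * chromatic


def match_points(source: List[ColoredPoint], target: List[ColoredPoint],
                 color_weight: float = 0.5) -> List[int]:
    """For each source point, return the index of its closest target point."""
    return [
        min(range(len(target)),
            key=lambda j: combined_distance(s, target[j], color_weight))
        for s in source
    ]
```

When two target points are equally far from a source point in space, a zero color weight leaves the match arbitrary (here, the first candidate), whereas a positive weight selects the candidate with the matching color.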
ImageManip: Image-based Robotic Manipulation with Affordance-guided Next View Selection
In the realm of future home-assistant robots, 3D articulated object
manipulation is essential for enabling robots to interact with their
environment. Many existing studies make use of 3D point clouds as the primary
input for manipulation policies. However, this approach encounters challenges
due to data sparsity and the significant cost associated with acquiring point
cloud data, which can limit its practicality. In contrast, RGB images offer
high-resolution observations using cost-effective devices but lack spatial 3D
geometric information. To overcome these limitations, we present a novel
image-based robotic manipulation framework. This framework is designed to
capture multiple perspectives of the target object and infer depth information
to complement its geometry. Initially, the system employs an eye-on-hand RGB
camera to capture an overall view of the target object. It predicts the initial
depth map and a coarse affordance map. The affordance map indicates actionable
areas on the object and serves as a constraint for selecting subsequent
viewpoints. Based on the global visual prior, we adaptively identify the
optimal next viewpoint for a detailed observation of the potential manipulation
success area. We leverage geometric consistency to fuse the views, resulting in
a refined depth map and a more precise affordance map for robot manipulation
decisions. By comparing with prior works that adopt point clouds or RGB images
as inputs, we demonstrate the effectiveness and practicality of our method. On
the project webpage (https://sites.google.com/view/imagemanip), real world
experiments further highlight the potential of our method for practical
deployment
On the Design and Analysis of Multiple View Descriptors
We propose an extension of popular descriptors based on gradient orientation
histograms (HOG, computed in a single image) to multiple views. It hinges on
interpreting HOG as a conditional density in the space of sampled images, where
the effects of nuisance factors such as viewpoint and illumination are
marginalized. However, such marginalization is performed with respect to a very
coarse approximation of the underlying distribution. Our extension leverages
the fact that multiple views of the same scene allow separating intrinsic from
nuisance variability, and thus afford better marginalization of the latter. The
result is a descriptor that has the same complexity as single-view HOG, and can
be compared in the same manner, but exploits multiple views to better trade off
insensitivity to nuisance variability with specificity to intrinsic
variability. We also introduce a novel multi-view wide-baseline matching
dataset, consisting of a mixture of real and synthetic objects with ground
truthed camera motion and dense three-dimensional geometry
Learning Correspondence Structures for Person Re-identification
This paper addresses the problem of handling spatial misalignments due to
camera-view changes or human-pose variations in person re-identification. We
first introduce a boosting-based approach to learn a correspondence structure
which indicates the patch-wise matching probabilities between images from a
target camera pair. The learned correspondence structure can not only capture
the spatial correspondence pattern between cameras but also handle the
viewpoint or human-pose variation in individual images. We further introduce a
global constraint-based matching process. It integrates a global matching
constraint over the learned correspondence structure to exclude cross-view
misalignments during the image patch matching process, hence achieving a more
reliable matching score between images. Finally, we also extend our approach by
introducing a multi-structure scheme, which learns a set of local
correspondence structures to capture the spatial correspondence sub-patterns
between a camera pair, so as to handle the spatial misalignments between
individual images in a more precise way. Experimental results on various
datasets demonstrate the effectiveness of our approach. Comment: IEEE Trans. Image Processing, vol. 26, no. 5, pp. 2438-2453, 2017.
The project page for this paper is available at
http://min.sjtu.edu.cn/lwydemo/personReID.htm arXiv admin note: text overlap
with arXiv:1504.0624
Scene understanding for autonomous robots operating in indoor environments
International Mention in the doctoral degree. The idea of having robots among us is not new. Great efforts are continually made to
replicate human intelligence, with the vision of having robots performing different activities,
including hazardous, repetitive, and tedious tasks. Research has demonstrated that robots are
good at many tasks that are hard for us, mainly in terms of precision, efficiency, and speed.
However, there are some tasks that humans do without much effort that are challenging for
robots. Especially robots in domestic environments are far from satisfactorily fulfilling some
tasks, mainly because these environments are unstructured, cluttered, and with a variety of
environmental conditions to control.
This thesis addresses the problem of scene understanding in the context of autonomous
robots operating in everyday human environments. Furthermore, this thesis is developed
under the HEROITEA research project that aims to develop a robot system to help
elderly people in domestic environments as an assistant. Our main objective is to develop
different methods that allow robots to acquire more information from the environment to
progressively build knowledge that allows them to improve the performance on high-level
robotic tasks. In this way, scene understanding is a broad research topic, and it is considered
a complex task due to the multiple sub-tasks that are involved. In that context, in this thesis,
we focus on three sub-tasks: object detection, scene recognition, and semantic segmentation
of the environment.
Firstly, we implement methods to recognize objects considering real indoor environments.
We applied machine learning techniques incorporating uncertainties and more modern
techniques based on deep learning. Besides, apart from detecting objects, it is essential to
comprehend the scene where they can occur. For this reason, we propose an approach
for scene recognition that considers the influence of the detected objects in the prediction
process. We demonstrate that the existing objects and their relationships can improve the
inference about the scene class. We also consider that a scene recognition model can
benefit from the advantages of other models. We propose a multi-classifier model for scene
recognition based on weighted voting schemes. The experiments carried out in real-world
indoor environments demonstrate that the adequate combination of independent classifiers
allows obtaining a more robust and precise model for scene recognition.
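A weighted voting scheme of the kind mentioned above can be sketched in a few lines. This is a generic illustration, not the thesis's model: the classifier names, their weights, and the tie-breaking rule are hypothetical.

```python
from collections import defaultdict
from typing import Dict


def weighted_vote(predictions: Dict[str, str],
                  weights: Dict[str, float]) -> str:
    """Return the scene label with the largest total classifier weight.

    predictions: classifier name -> predicted scene label
    weights:     classifier name -> vote weight (e.g. validation accuracy)
    """
    scores: Dict[str, float] = defaultdict(float)
    for name, label in predictions.items():
        scores[label] += weights[name]
    # Sorting first makes ties resolve deterministically (alphabetical order).
    return max(sorted(scores), key=scores.get)
```

Two weaker classifiers that agree can outvote a single stronger one, which is how independent models can compensate for each other's errors.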
Moreover, to increase the understanding of a robot about its surroundings, we propose
a new division of the environment based on regions to build a useful representation of
the environment. Object and scene information is integrated in a probabilistic fashion
generating a semantic map of the environment containing meaningful regions within each
room. The proposed system has been assessed on simulated and real-world domestic
scenarios, demonstrating its ability to generate consistent environment representations.
Lastly, full knowledge of the environment can enhance more complex robotic tasks; that is
why in this thesis, we try to study how a complete knowledge of the environment influences
the robot’s performance in high-level tasks. To do so, we select an essential task, which
is searching for objects. This mundane task can be considered a precondition to perform
many complex robotic tasks such as fetching and carrying, manipulation, user requirements,
among others. The execution of these activities by service robots needs full knowledge of
the environment to perform each task efficiently. In this thesis, we propose two searching
strategies that consider prior information, semantic representation of the environment, and
the relationships between known objects and the type of scene. All our developments are
evaluated in simulated and real-world environments, integrated with other systems, and
operating in real platforms, demonstrating their feasibility to implement in real scenarios, and
in some cases outperforming other approaches. We also demonstrate how our representation
of the environment can boost the performance of more complex robotic tasks compared to
more standard environmental representations.
Doctoral Programme in Electrical, Electronic and Automation Engineering, Universidad Carlos III de Madrid. President: Carlos Balaguer Bernaldo de Quirós. Secretary: Fernando Matía Espada. Member: Klaus Strob