Benchmarking Visual-Inertial Deep Multimodal Fusion for Relative Pose Regression and Odometry-aided Absolute Pose Regression
Visual-inertial localization is a key problem in computer vision and robotics
applications such as virtual reality, self-driving cars, and aerial vehicles.
The goal is to estimate an accurate pose of an object when either the
environment or the dynamics are known. Recent methods directly regress the pose
using convolutional and spatio-temporal networks. Absolute pose regression
(APR) techniques predict the absolute camera pose from an image input in a
known scene. Odometry methods perform relative pose regression (RPR), which
predicts the relative pose from known object dynamics (visual or inertial
inputs). The localization task can be improved by retrieving information from
both data sources in a cross-modal setup, which is a challenging problem due
to contradictory tasks. In this work, we conduct a benchmark to evaluate deep
multimodal fusion based on PGO and attention networks. Auxiliary and Bayesian
learning are integrated for the APR task. We show accuracy improvements for the
RPR-aided APR task and for the RPR-RPR task for aerial vehicles and hand-held
devices. We conduct experiments on the EuRoC MAV and PennCOSYVIO datasets, and
record a novel industry dataset. Comment: Under review
On Building Generalizable Learning Agents
It has been a long-standing goal in Artificial Intelligence (AI) to build machines that can solve the tasks that humans can. Thanks to recent rapid progress in data-driven methods, which train agents to solve tasks by learning from massive training data, there have been many successes in applying such learning approaches to handle and even solve a number of extremely challenging tasks, including image classification, language generation, robotics control, and several multi-player games. The key factor in all these data-driven successes is that the trained agents can generalize to test scenarios that are unseen during training. This generalization capability is the foundation for building any practical AI system. This thesis studies generalization, the fundamental challenge in AI, and proposes solutions to improve the generalization performance of learning agents in a variety of problems. We start by providing a formal formulation of the generalization problem in the context of reinforcement learning and proposing 4 principles within this formulation to guide the design of training techniques for improved generalization. We validate the effectiveness of our proposed principles by considering 4 different domains, from simple to complex, and developing domain-specific techniques following these principles. In particular, we begin with the simplest domain, i.e., path-finding on graphs (Part I), then consider visual navigation in a 3D world (Part II) and competition in complex multi-agent games (Part III), and lastly tackle some natural language processing tasks (Part IV). Empirical evidence demonstrates that the proposed principles generally lead to much improved generalization performance across a wide range of problems.
Core Challenges in Embodied Vision-Language Planning
Recent advances in the areas of multimodal machine learning and artificial
intelligence (AI) have led to the development of challenging tasks at the
intersection of Computer Vision, Natural Language Processing, and Embodied AI.
Whereas many approaches and previous survey pursuits have characterised one or
two of these dimensions, there has not been a holistic analysis at the center
of all three. Moreover, even when combinations of these topics are considered,
more focus is placed on describing, e.g., current architectural methods, as
opposed to also illustrating high-level challenges and opportunities for the
field. In this survey paper, we discuss Embodied Vision-Language Planning
(EVLP) tasks, a family of prominent embodied navigation and manipulation
problems that jointly use computer vision and natural language. We propose a
taxonomy to unify these tasks and provide an in-depth analysis and comparison
of the new and current algorithmic approaches, metrics, simulated environments,
as well as the datasets used for EVLP tasks. Finally, we present the core
challenges that we believe new EVLP works should seek to address, and we
advocate for task construction that enables model generalizability and furthers
real-world deployment. Comment: 35 pages
Recognising, Representing and Mapping Natural Features in Unstructured Environments
This thesis addresses the problem of building statistical models for multi-sensor perception in unstructured outdoor environments. The perception problem is divided into three distinct tasks: recognition, representation and association. Recognition is cast as a statistical classification problem where inputs are images or a combination of images and ranging information. Given the complexity and variability of natural environments, this thesis investigates the use of Bayesian statistics and supervised dimensionality reduction to incorporate prior information and fuse sensory data. A compact probabilistic representation of natural objects is essential for many problems in field robotics. This thesis presents techniques for combining non-linear dimensionality reduction with parametric learning through Expectation Maximisation to build general representations of natural features. Once created these models need to be rapidly processed to account for incoming information. To this end, techniques for efficient probabilistic inference are proposed. The robustness of localisation and mapping algorithms is directly related to reliable data association. Conventional algorithms employ only geometric information which can become inconsistent for large trajectories. A new data association algorithm incorporating visual and geometric information is proposed to improve the reliability of this task. The method uses a compact probabilistic representation of objects to fuse visual and geometric information for the association decision. 
The main contributions of this thesis are: 1) a stochastic representation of objects through non-linear dimensionality reduction; 2) a landmark recognition system using visual and ranging sensors; 3) a data association algorithm combining appearance and position properties; 4) a real-time algorithm for detection and segmentation of natural objects from few training images; and 5) a real-time place recognition system combining dimensionality reduction and Bayesian learning. The theoretical contributions of this thesis are demonstrated with a series of experiments in unstructured environments. In particular, the combination of recognition, representation and association algorithms is applied to the Simultaneous Localisation and Mapping (SLAM) problem to close large loops in outdoor trajectories, proving the benefits of the proposed methodology.
Sea Ice Extraction via Remote Sensed Imagery: Algorithms, Datasets, Applications and Challenges
Deep learning, a dominant technique in artificial intelligence, has
completely changed image understanding over the past decade. As a
consequence, the sea ice extraction (SIE) problem has reached a
new era. We present a comprehensive review of four important aspects of SIE:
algorithms, datasets, applications, and future trends. Our review
focuses on research published from 2016 to the present, with a specific focus
on deep learning-based approaches in the last five years. We divide all
related algorithms into 3 categories: classical image segmentation
approaches, machine learning-based approaches, and deep learning-based methods. We
review the accessible ice datasets, including SAR-based datasets,
optical-based datasets, and others. The applications are presented in 4 aspects:
climate research, navigation, geographic information systems (GIS)
production, and others. The review also provides insightful observations and inspiring
future research directions. Comment: 24 pages, 6 figures
Vulnerable road users and connected autonomous vehicles interaction: a survey
There is a group of users within the vehicular traffic ecosystem known as Vulnerable Road Users (VRUs). VRUs include pedestrians, cyclists, and motorcyclists, among others. Connected autonomous vehicles (CAVs), in turn, are a set of technologies that combine communication technologies, to stay ubiquitously connected, with automated technologies that assist or replace the human driver during the driving process. Autonomous vehicles are seen as a viable way to reduce road accidents, providing a generally safe environment for all road users, especially the most vulnerable. One of the problems facing autonomous vehicles is to develop mechanisms that facilitate their integration, not only within the mobility environment but also into road society, in a safe and efficient way. In this paper, we analyze and discuss how this integration can take place, reviewing the work that has been developed in recent years in each of the stages of vehicle-human interaction, analyzing the challenges of vulnerable users, and proposing solutions that contribute to solving these challenges. This work was partially funded by the Ministry of Economy, Industry, and Competitiveness
of Spain under Grant: Supervision of drone fleet and optimization of commercial operations flight
plans, PID2020-116377RB-C21. Peer Reviewed. Postprint (published version)
HUMAN ROBOT INTERACTION THROUGH SEMANTIC INTEGRATION OF MULTIPLE MODALITIES, DIALOG MANAGEMENT, AND CONTEXTS
The hypothesis for this research is that applying the Human Computer Interaction (HCI) concepts of using multiple modalities, dialog management, context, and semantics to Human Robot Interaction (HRI) will improve the performance of Instruction Based Learning (IBL) compared to only using speech. We tested the hypothesis by simulating a domestic robot that can be taught to clean a house using a multi-modal interface. We used a method of semantically integrating the inputs from multiple modalities and contexts that multiplies a confidence score for each input by a Fusion Weight, sums the products, and then uses the input with the highest product sum. We developed an algorithm for determining the Fusion Weights. We concluded that different modalities, contexts, and modes of dialog management impact human robot interaction; however, which combination is better depends on the importance of the accuracy of learning what is taught versus the succinctness of the dialog between the user and the robot.
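The weighted fusion rule described in this abstract can be illustrated with a minimal sketch. It is one possible reading, not the authors' implementation: it assumes each candidate input carries one confidence score per source (modality or context), and the source names, weights, and candidate commands below are hypothetical values chosen only for illustration. The abstract does not say how the Fusion Weights are determined, so they are simply given here.

```python
from typing import Dict


def weighted_score(confidences: Dict[str, float],
                   fusion_weights: Dict[str, float]) -> float:
    """Sum of confidence * Fusion Weight over all sources of one candidate.

    Sources without a configured weight contribute nothing (an assumption).
    """
    return sum(fusion_weights.get(source, 0.0) * confidence
               for source, confidence in confidences.items())


def fuse(candidates: Dict[str, Dict[str, float]],
         fusion_weights: Dict[str, float]) -> str:
    """Return the candidate input with the highest weighted product sum."""
    return max(candidates,
               key=lambda name: weighted_score(candidates[name], fusion_weights))


# Hypothetical Fusion Weights per source (not taken from the thesis).
fusion_weights = {"speech": 0.5, "gesture": 0.3, "context": 0.2}

# Hypothetical candidate interpretations with per-source confidence scores.
candidates = {
    "clean kitchen": {"speech": 0.6, "gesture": 0.9, "context": 0.8},
    "clean bedroom": {"speech": 0.7, "gesture": 0.2, "context": 0.1},
}

selected = fuse(candidates, fusion_weights)
```

In this example speech alone slightly favours "clean bedroom", but the gesture and context scores outweigh it once the products are summed, so the fused choice is "clean kitchen".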