PLAR: Prompt Learning for Action Recognition
We present a new general learning approach, Prompt Learning for Action
Recognition (PLAR), which leverages the strengths of prompt learning to guide
the learning process. Our approach is designed to predict the action label by
helping the models focus on the descriptions or instructions associated with
actions in the input videos. Our formulation uses various prompts, including
learnable prompts, auxiliary visual information, and large vision models to
improve the recognition performance. In particular, we design a learnable
prompt method that learns to dynamically generate prompts from a pool of prompt
experts under different inputs. By sharing the same objective with the task,
our proposed PLAR can optimize prompts that guide the model's predictions while
explicitly learning input-invariant (prompt experts pool) and input-specific
(data-dependent) prompt knowledge. We evaluate our approach on datasets
consisting of both ground camera videos and aerial videos, and scenes with
single-agent and multi-agent actions. In practice, we observe a 3.17-10.2%
accuracy improvement on the aerial multi-agent dataset Okutama and a 1.0-3.6%
improvement on the ground camera single-agent dataset Something Something V2.
We plan to release our code on the WWW.
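
The abstract does not say how prompts are generated from the pool of prompt experts, so the following is only a minimal sketch of one plausible reading: a shared pool (input-invariant) is mixed with softmax weights computed from the input feature (input-specific). The names `expert_keys`, `expert_prompts`, and `generate_prompt`, and the dot-product scoring, are assumptions for illustration and are not taken from PLAR.

```python
import math

def softmax(scores):
    # Subtract the maximum for numerical stability.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def generate_prompt(feature, expert_keys, expert_prompts):
    """Mix the shared expert prompts using weights that depend on the input.

    Assumption: each expert has a key vector, and the input feature is scored
    against every key with a dot product. This is a hypothetical scheme.
    """
    weights = softmax([dot(feature, key) for key in expert_keys])
    dim = len(expert_prompts[0])
    prompt = [sum(w * p[d] for w, p in zip(weights, expert_prompts))
              for d in range(dim)]
    return prompt, weights

# Toy pool with two experts (input-invariant, shared across all inputs).
expert_keys = [[1.0, 0.0], [0.0, 1.0]]
expert_prompts = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]

# Two different inputs yield two different prompts from the same pool.
prompt_a, weights_a = generate_prompt([2.0, 0.0], expert_keys, expert_prompts)
prompt_b, weights_b = generate_prompt([0.0, 2.0], expert_keys, expert_prompts)
```

In a trained model the keys and expert prompts would be learnable parameters optimized with the recognition objective; here they are fixed lists so the sketch runs without any dependencies.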
GANav: Group-wise Attention Network for Classifying Navigable Regions in Unstructured Outdoor Environments
We present a new learning-based method for identifying safe and navigable
regions in off-road terrains and unstructured environments from RGB images. Our
approach consists of classifying groups of terrain classes based on their
navigability levels using coarse-grained semantic segmentation. We propose a
bottleneck transformer-based deep neural network architecture that uses a novel
group-wise attention mechanism to distinguish between navigability levels of
different terrains. Our group-wise attention heads enable the network to
explicitly focus on the different groups and improve the accuracy. In addition,
we propose a dynamic weighted cross entropy loss function to handle the
long-tailed nature of the dataset. We show through extensive evaluations on the
RUGD and RELLIS-3D datasets that our learning algorithm improves the accuracy
of visual perception in off-road terrains for navigation. We compare our
approach with prior work on these datasets and achieve an improvement over the
state-of-the-art mIoU by 6.74-39.1% on RUGD and 3.82-10.64% on RELLIS-3D.
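
The abstract names a dynamic weighted cross entropy loss for long-tailed data but does not define it. As a rough illustration only, the sketch below recomputes inverse-frequency class weights from each batch and applies them to a standard cross entropy. The per-batch inverse-frequency rule and the function names are assumptions and may differ from the GANav formulation.

```python
import math
from collections import Counter

def batch_class_weights(labels, num_classes):
    """Inverse-frequency weights recomputed from the labels of the current batch."""
    counts = Counter(labels)
    total = len(labels)
    # Classes absent from the batch get weight 0 so they do not affect the loss.
    return [total / (num_classes * counts[c]) if counts[c] else 0.0
            for c in range(num_classes)]

def weighted_cross_entropy(probs, labels, weights):
    """Weighted mean of -log p(true class) over all pixels."""
    num = sum(weights[y] * -math.log(p[y]) for p, y in zip(probs, labels))
    den = sum(weights[y] for y in labels)
    return num / den

# Toy batch: class 0 is frequent and well predicted, class 1 is rare and poorly predicted.
labels = [0, 0, 0, 1]
probs = [[0.9, 0.1], [0.9, 0.1], [0.9, 0.1], [0.4, 0.6]]

weights = batch_class_weights(labels, num_classes=2)
loss = weighted_cross_entropy(probs, labels, weights)
uniform_loss = weighted_cross_entropy(probs, labels, [1.0, 1.0])
```

With these toy values the rare class receives three times the weight of the frequent class, so its error contributes more to `loss` than it does to `uniform_loss`.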
iPLAN: Intent-Aware Planning in Heterogeneous Traffic via Distributed Multi-Agent Reinforcement Learning
Navigating safely and efficiently in dense and heterogeneous traffic
scenarios is challenging for autonomous vehicles (AVs) due to their inability
to infer the behaviors or intentions of nearby drivers. In this work, we
introduce a distributed multi-agent reinforcement learning (MARL) algorithm
that can predict trajectories and intents in dense and heterogeneous traffic
scenarios. Our approach for intent-aware planning, iPLAN, allows agents to
infer nearby drivers' intents solely from their local observations. We model
two distinct incentives for agents' strategies: Behavioral Incentive for
high-level decision-making based on their driving behavior or personality and
Instant Incentive for motion planning for collision avoidance based on the
current traffic state. Our approach enables agents to infer their opponents'
behavior incentives and integrate this inferred information into their
decision-making and motion-planning processes. We perform experiments on two
simulation environments, Non-Cooperative Navigation and Heterogeneous Highway.
In Heterogeneous Highway, results show that, compared with centralized training
decentralized execution (CTDE) MARL baselines such as QMIX and MAPPO, our
method yields 4.3% and 38.4% higher episodic reward in mild and chaotic
traffic, respectively, with 48.1% higher success rate and 80.6% longer survival
time in chaotic traffic. We also compare with a decentralized training
decentralized execution (DTDE) baseline, IPPO, and demonstrate 12.7% and 6.3%
higher episodic reward in mild and chaotic traffic, respectively, a 25.3%
higher success rate, and 13.7% longer survival time.
GrASPE: Graph based Multimodal Fusion for Robot Navigation in Unstructured Outdoor Environments
We present a novel trajectory traversability estimation and planning
algorithm for robot navigation in complex outdoor environments. We incorporate
multimodal sensory inputs from an RGB camera, a 3D LiDAR, and the robot's odometry
sensor to train a prediction model to estimate candidate trajectories' success
probabilities based on partially reliable multi-modal sensor observations. We
encode high-dimensional multi-modal sensory inputs to low-dimensional feature
vectors using encoder networks and represent them as a connected graph to train
an attention-based Graph Neural Network (GNN) model to predict trajectory
success probabilities. We further analyze the image and point cloud data
separately to quantify sensor reliability to augment the weights of the feature
graph representation used in our GNN. During runtime, our model utilizes
multi-sensor inputs to predict the success probabilities of the trajectories
generated by a local planner to avoid potential collisions and failures. Our
algorithm demonstrates robust predictions when one or more sensor modalities
are unreliable or unavailable in complex outdoor environments. We evaluate our
algorithm's navigation performance using a Spot robot in real-world outdoor
environments.
CrossLoc3D: Aerial-Ground Cross-Source 3D Place Recognition
We present CrossLoc3D, a novel 3D place recognition method that solves a
large-scale point matching problem in a cross-source setting. Cross-source
point cloud data corresponds to point sets captured by depth sensors with
different accuracies or from different distances and perspectives. We address
the challenges in terms of developing 3D place recognition methods that account
for the representation gap between points captured by different sources. Our
method handles cross-source data by utilizing multi-grained features and
selecting convolution kernel sizes that correspond to the most prominent features.
Inspired by diffusion models, our method uses a novel iterative refinement
process that gradually shifts the embedding spaces from different sources to a
single canonical space for better metric learning. In addition, we present
CS-Campus3D, the first 3D aerial-ground cross-source dataset consisting of
point cloud data from both aerial and ground LiDAR scans. The point clouds in
CS-Campus3D have representation gaps and other features like different views,
point densities, and noise patterns. We show that our CrossLoc3D algorithm can
achieve an improvement of 4.74%-15.37% in terms of the top-1 average recall
on our CS-Campus3D benchmark and achieve performance comparable to
state-of-the-art 3D place recognition methods on the Oxford RobotCar. We will
release the code and CS-Campus3D benchmark.
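
For readers unfamiliar with the metric, top-1 recall in place recognition is commonly computed by retrieving the nearest database embedding for each query and checking whether it is a true match. The sketch below follows that common definition with Euclidean distance on toy 2D embeddings. The exact evaluation protocol behind the numbers above may differ, and `top1_recall`, `positives`, and the data are illustrative assumptions.

```python
import math

def top1_recall(queries, database, positives):
    """Fraction of queries whose nearest database embedding is a true match.

    positives[i] is the set of database indices that count as the same place
    as query i.
    """
    hits = 0
    for i, q in enumerate(queries):
        # Index of the database embedding closest to this query.
        nearest = min(range(len(database)),
                      key=lambda j: math.dist(q, database[j]))
        hits += nearest in positives[i]
    return hits / len(queries)

# Toy example: database embeddings (e.g. aerial) and query embeddings (e.g. ground).
database = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
queries = [[0.1, 0.0], [0.9, 0.1], [0.6, 0.6]]
positives = [{0}, {1}, {0}]

recall = top1_recall(queries, database, positives)
```

Here the first two queries retrieve their correct place and the third does not, so `recall` is 2/3. An average recall would then be the mean of this value over several evaluation sets.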
Terrain Classification and Navigability Analysis in Unstructured Outdoor Environments
We present a new learning-based method for identifying safe and navigable regions in
off-road terrains and unstructured environments from RGB images. Our approach consists of
classifying groups of terrains based on their navigability levels using coarse-grained semantic
segmentation. We propose a transformer-based deep neural network architecture that uses a
novel group-wise attention mechanism to distinguish between navigability levels of different
terrains. Our group-wise attention heads enable the network to explicitly focus on the different
groups and improve the accuracy. We show through extensive evaluations on the RUGD and
RELLIS-3D datasets that our learning algorithm improves visual perception accuracy in off-road
terrains for navigation. We compare our approach with prior work on these datasets and achieve
an improvement over the state-of-the-art mIoU by 6.74-39.1% on RUGD and 3.82-10.64% on
RELLIS-3D. In addition, we deploy our method on a Clearpath Jackal robot. Our approach
improves the navigation algorithm's average progress towards the goal by 54.73% and
reduces false positives in forbidden regions by 29.96%.
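
The coarse-grained evaluation described above can be sketched as two steps: map fine-grained terrain classes to navigability groups, then compute mIoU over the groups. The class-to-group mapping and the label lists below are made up for illustration and are not the groupings used for RUGD or RELLIS-3D.

```python
def to_groups(labels, class_to_group):
    """Map fine-grained terrain class ids to coarse navigability group ids."""
    return [class_to_group[c] for c in labels]

def mean_iou(pred, target, num_groups):
    """Mean intersection-over-union over groups that appear in pred or target."""
    ious = []
    for g in range(num_groups):
        inter = sum(p == g and t == g for p, t in zip(pred, target))
        union = sum(p == g or t == g for p, t in zip(pred, target))
        if union:  # skip groups that occur in neither prediction nor target
            ious.append(inter / union)
    return sum(ious) / len(ious)

# Hypothetical mapping: classes 0 and 1 -> group 0 (smooth),
# class 2 -> group 1 (rough), class 3 -> group 2 (obstacle).
class_to_group = {0: 0, 1: 0, 2: 1, 3: 2}

# Per-pixel fine-grained labels, flattened to lists for simplicity.
target = to_groups([0, 1, 2, 3, 3, 2], class_to_group)
pred = to_groups([1, 0, 2, 3, 2, 2], class_to_group)

miou = mean_iou(pred, target, num_groups=3)
```

Confusing class 0 with class 1 costs nothing in this example because both belong to the same navigability group, which is the practical effect of evaluating at the group level.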