Unbiased Directed Object Attention Graph for Object Navigation
Object navigation tasks require agents to locate specific objects in unknown
environments based on visual information. Previously, graph convolutions were
used to implicitly explore the relationships between objects. However, due to
differences in visibility among objects, it is easy to generate biases in
object attention. Thus, in this paper, we propose a directed object attention
(DOA) graph to guide the agent in explicitly learning the attention
relationships between objects, thereby reducing the object attention bias. In
particular, we use the DOA graph to perform unbiased adaptive object attention
(UAOA) on the object features and unbiased adaptive image attention (UAIA) on
the raw images, respectively. To distinguish features in different branches, a
concise adaptive branch energy distribution (ABED) method is proposed. We
assess our methods on the AI2-Thor dataset. Compared with the state-of-the-art
(SOTA) method, our method reports 7.4%, 8.1%, and 17.6% increases in success
rate (SR), success weighted by path length (SPL), and success weighted by
action efficiency (SAE), respectively.
Comment: 13 pages, ready for ACM Multimedia, under review
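The SPL metric reported above is the standard embodied-navigation measure of success weighted by path efficiency (Anderson et al., 2018). A minimal sketch of how it is computed over a set of episodes (variable names are illustrative, not the paper's code):

```python
# Success weighted by Path Length (SPL):
#   SPL = (1/N) * sum_i( S_i * l_i / max(p_i, l_i) )
# where S_i is the success flag, l_i the shortest-path length,
# and p_i the length of the path the agent actually took.

def spl(episodes):
    """episodes: list of (success, shortest_path_len, agent_path_len)."""
    total = 0.0
    for success, shortest, taken in episodes:
        if success:
            # A successful episode contributes its path efficiency (<= 1).
            total += shortest / max(taken, shortest)
        # Failed episodes contribute 0.
    return total / len(episodes)

# Two episodes: one success with a detour (efficiency 0.8), one failure.
print(spl([(True, 10.0, 12.5), (False, 8.0, 20.0)]))  # 0.4
```

Note that SPL penalizes detours even on successful episodes, which is why it can diverge noticeably from the plain success rate.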
The Development of LLMs for Embodied Navigation
In recent years, the rapid advancement of Large Language Models (LLMs) such
as the Generative Pre-trained Transformer (GPT) has attracted increasing
attention due to their potential in a variety of practical applications. The
application of LLMs with Embodied Intelligence has emerged as a significant
area of focus. Among the myriad applications of LLMs, navigation tasks are
particularly noteworthy because they demand a deep understanding of the
environment and quick, accurate decision-making. LLMs can augment embodied
intelligence systems with sophisticated environmental perception and
decision-making support, leveraging their robust language and image-processing
capabilities. This article offers an exhaustive summary of the symbiosis
between LLMs and embodied intelligence, with a focus on navigation. It reviews
state-of-the-art models and research methodologies, and assesses the advantages
and disadvantages of existing embodied navigation models and datasets. Finally,
the article elucidates the role of LLMs in embodied intelligence, based on
current research, and forecasts future directions in the field. A comprehensive
list of studies in this survey is available at
https://github.com/Rongtao-Xu/Awesome-LLM-E
A Survey of Embodied AI: From Simulators to Research Tasks
There has been an emerging paradigm shift from the era of "internet AI" to
"embodied AI", where AI algorithms and agents no longer learn from datasets of
images, videos or text curated primarily from the internet. Instead, they learn
through interactions with their environments from an egocentric perception
similar to humans. Consequently, there has been substantial growth in the
demand for embodied AI simulators to support various embodied AI research
tasks. This growing interest in embodied AI is beneficial to the greater
pursuit of Artificial General Intelligence (AGI), but there has not been a
contemporary and comprehensive survey of this field. This paper aims to provide
an encyclopedic survey for the field of embodied AI, from its simulators to its
research. By evaluating nine current embodied AI simulators with our proposed
seven features, this paper examines what each simulator provides for embodied
AI research and where its limitations lie. This paper then
surveys the three main research tasks in embodied AI -- visual exploration,
visual navigation and embodied question answering (QA), covering the
state-of-the-art approaches, evaluation metrics and datasets. Finally, with the
new insights revealed through surveying the field, the paper will provide
suggestions for simulator-for-task selections and recommendations for the
future directions of the field.
Comment: Under Review for IEEE TETC
mSpace: What do Numbers and Totals Mean in a Flexible Semantic Browser
With the Semantic Web community’s growing interest in Human-Computer Interaction, this paper addresses a challenge for user interface design and future shifts in search paradigms. Where browsers using current search paradigms often use numeric values to indicate the volume of sub-hierarchies, future semantic browsers will not be limited to fixed hierarchical datasets, but will allow flexible exploration through multiple intersecting domains. With the future use of similar numeric indicators uncertain, the research here suggests that the inclusion of such indicators should be based around focal data objects within each information domain. Further research is required, as a significant number of contradicting participant expectations were present. It is the concern of the Semantic Web community to make sure that future semantic search paradigms can best support their users.
End-to-End (Instance)-Image Goal Navigation through Correspondence as an Emergent Phenomenon
Most recent work in goal-oriented visual navigation resorts to large-scale
machine learning in simulated environments. The main challenge lies in learning
compact representations generalizable to unseen environments and in learning
high-capacity perception modules capable of reasoning on high-dimensional
input. The latter is particularly difficult when the goal is not given as a
category ("ObjectNav") but as an exemplar image ("ImageNav"), as the perception
module needs to learn a comparison strategy that requires solving an underlying
visual correspondence problem. This has been shown to be difficult from reward
alone or with standard auxiliary tasks. We address this problem through a
sequence of two pretext tasks, which serve as a prior for what we argue is one
of the main bottlenecks in perception: extremely wide-baseline relative pose
estimation and visibility prediction in complex scenes. The first pretext task,
cross-view completion, is a proxy for the underlying visual correspondence
problem, while the second task addresses goal detection and finding directly.
We propose a new dual encoder with a large-capacity binocular ViT model and
show that correspondence solutions naturally emerge from the training signals.
Experiments show significant improvements and SOTA performance on the two
benchmarks, ImageNav and the Instance-ImageNav variant, where camera intrinsics
and height differ between observation and goal.
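The visual correspondence problem at the heart of this abstract can be illustrated with a toy version: matching patch features between two views by nearest-neighbor cosine similarity. This is a generic sketch, not the paper's architecture; feature extraction by a ViT is mocked with random vectors, and all names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

def match_patches(feats_a, feats_b):
    """For each patch feature in view A, return the index of the most
    similar patch in view B by cosine similarity.
    feats_a, feats_b: (num_patches, dim) arrays."""
    a = feats_a / np.linalg.norm(feats_a, axis=1, keepdims=True)
    b = feats_b / np.linalg.norm(feats_b, axis=1, keepdims=True)
    sim = a @ b.T          # (num_a, num_b) matrix of cosine similarities
    return sim.argmax(axis=1)

# View B is a shuffled copy of view A's 16 patch features, so each
# patch's best match should be its own shuffled copy.
feats_a = rng.standard_normal((16, 64))
perm = rng.permutation(16)
feats_b = feats_a[perm]
matches = match_patches(feats_a, feats_b)
print(all(perm[matches] == np.arange(16)))  # True
```

In a real ImageNav setting the two views come from very different camera poses, so features of the same scene point are no longer identical, which is what makes learning a correspondence-aware encoder (e.g., via cross-view completion) necessary.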