NavGPT: Explicit Reasoning in Vision-and-Language Navigation with Large Language Models
Trained with an unprecedented scale of data, large language models (LLMs)
like ChatGPT and GPT-4 exhibit the emergence of significant reasoning abilities
from model scaling. Such a trend underscores the potential of training LLMs
with unlimited language data, advancing the development of a universal embodied
agent. In this work, we introduce NavGPT, a purely LLM-based
instruction-following navigation agent, to reveal the reasoning capability of
GPT models in complex embodied scenes by performing zero-shot sequential action
prediction for vision-and-language navigation (VLN). At each step, NavGPT takes
the textual descriptions of visual observations, navigation history, and future
explorable directions as inputs, reasons about the agent's current status, and
makes the decision to approach the target. Through comprehensive experiments, we
demonstrate NavGPT can explicitly perform high-level planning for navigation,
including decomposing instructions into sub-goals, integrating commonsense
knowledge relevant to navigation task resolution, identifying landmarks from
observed scenes, tracking navigation progress, and adapting to exceptions with
plan adjustment. Furthermore, we show that LLMs are capable of generating
high-quality navigational instructions from observations and actions along a
path, as well as drawing accurate top-down metric trajectories given the agent's
navigation history. Although the performance of NavGPT on zero-shot R2R tasks
still falls short of trained models, we suggest adapting multi-modal inputs for
LLMs to use as visual navigation agents and applying the explicit reasoning of
LLMs to benefit learning-based models.
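The per-step loop such an agent runs can be sketched as follows. This is a minimal illustration of the idea, not NavGPT's actual implementation: the prompt wording, function names, and the stubbed `choose_action` (which a real agent would replace with a GPT-4 API call) are all hypothetical.

```python
# Sketch of one step of an LLM-driven navigation agent, loosely in the
# spirit of NavGPT. Prompt template and helper names are hypothetical.

def build_step_prompt(instruction, history, observations, directions):
    """Assemble the textual context the LLM reasons over at each step."""
    obs_text = "\n".join(f"- {o}" for o in observations)
    dir_text = "\n".join(f"({i}) {d}" for i, d in enumerate(directions))
    return (
        f"Instruction: {instruction}\n"
        f"History: {' -> '.join(history) if history else 'none'}\n"
        f"Current observations:\n{obs_text}\n"
        f"Explorable directions:\n{dir_text}\n"
        "Think step by step, then answer with the index of the chosen direction."
    )

def choose_action(prompt):
    """Stand-in for the LLM call; a real agent would query GPT-4 here."""
    # Trivial placeholder heuristic: always pick the first direction.
    return 0

prompt = build_step_prompt(
    instruction="Walk past the sofa and stop at the kitchen door.",
    history=["start"],
    observations=["a sofa on the left", "a doorway ahead"],
    directions=["doorway ahead", "hallway to the right"],
)
action = choose_action(prompt)
```

The agent repeats this loop, appending the chosen direction to the history, until the LLM decides to stop.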
Preparation, oxidation and ablation resistance of IrAl intermetallic coating
Iridium (Ir) has been selected as the protective coating on the rhenium thruster chamber of liquid rocket engines, due to its high melting point, excellent corrosion resistance and very low oxygen permeability. However, Ir forms gaseous oxides rather than a protective oxide barrier above 1100 °C in oxidizing environments, leading to a limited lifetime at high temperatures. To improve the oxidation resistance, in the present work pure Ir was modified by pack cementation to produce a single-phase IrAl intermetallic coating. The bond strength of the coating was examined by a coating pull-off test. The oxidation and ablation resistance was assessed by a cyclic oxidation test at 1800 °C and a high-frequency plasma wind tunnel test (heat flux: 2.03 MW/m², enthalpy: 19 MJ/kg), respectively. It was found that the IrAl coating is well bonded to the substrate, with a bond strength above 30 MPa. The oxidation and ablation resistance of the Ir was significantly enhanced after the pack cementation treatment (see Figure 1). The improvement in oxidation and ablation resistance can be ascribed to the excellent comprehensive properties of the in-situ formed Al2O3 barrier and the outstanding physical and chemical compatibility among the phases in the multilayer coating system.
Bi-directional Training for Composed Image Retrieval via Text Prompt Learning
Composed image retrieval searches for a target image based on a multi-modal
user query comprised of a reference image and modification text describing the
desired changes. Existing approaches to solving this challenging task learn a
mapping from the (reference image, modification text)-pair to an image
embedding that is then matched against a large image corpus. One area that has
not yet been explored is the reverse direction, which asks: what reference
image, when modified as described by the text, would produce the given
target image? In this work we propose a bi-directional training scheme that
leverages such reversed queries and can be applied to existing composed image
retrieval architectures. To encode the bi-directional query we prepend a
learnable token to the modification text that designates the direction of the
query and then finetune the parameters of the text embedding module. We make no
other changes to the network architecture. Experiments on two standard datasets
show that our novel approach achieves improved performance over a baseline
BLIP-based model that itself already achieves state-of-the-art performance.
Comment: 12 pages, 5 figures
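The direction-token idea can be sketched with toy NumPy embeddings. This is a minimal illustration under stated assumptions: the vocabulary, dimensions, and random "learnable" tokens are hypothetical stand-ins, whereas the actual method finetunes the text embedding module of a BLIP-based retrieval model.

```python
import numpy as np

# Toy embedding table: each word maps to a d-dimensional vector.
d = 4
rng = np.random.default_rng(0)
vocab = {"make": 0, "the": 1, "dress": 2, "red": 3}
word_emb = rng.normal(size=(len(vocab), d))

# Two "learnable" direction tokens: forward (reference -> target) and
# reversed (target -> reference). In the real model these are trained
# parameters; here they are fixed random vectors for illustration.
fwd_token = rng.normal(size=(1, d))
rev_token = rng.normal(size=(1, d))

def encode_query(text, reversed_query=False):
    """Prepend the direction token to the modification-text embeddings."""
    tokens = np.stack([word_emb[vocab[w]] for w in text.split()])
    direction = rev_token if reversed_query else fwd_token
    return np.concatenate([direction, tokens], axis=0)

seq_fwd = encode_query("make the dress red")
seq_rev = encode_query("make the dress red", reversed_query=True)
```

Because only the prepended token differs, the same text encoder can serve both query directions, which is why no other architectural change is needed.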
Effect of infiltration time on the microstructure and mechanical properties of C/C-SiC composite prepared by Si-Zr10 alloyed melt infiltration
Low-cost C/C-SiC composites were prepared through reactive melt infiltration with a Si-Zr10 alloy infiltrant for different infiltration times. The effect of infiltration time on the microstructure and mechanical properties of the composites was investigated. According to the X-ray diffraction results, ZrC tended to form in the composite and the amount of carbon phase decreased as the infiltration time was extended. Phase transformation of the C/C-SiC composite was analyzed based on the C-Si-Zr phase diagram. The flexural strength of the composite prepared from the 0.9 g/cm³ preform decreased with increasing infiltration time, while that of the composite prepared from the 1.38 g/cm³ preform first increased and then decreased. The highest flexural strength of the composite was about 324 MPa. The flexural strength of the composite is considered to depend on its phase composition and the fiber-matrix interface.
This work was supported by the National Natural Science Foundation of China (51302315), the Innovation Foundation for Excellent Postgraduates of the National University of Defense Technology and the Hunan Provincial Innovation Foundation for Postgraduates. Yonggang Tong also thanks the China Scholarship Council for its support.
Microstructure, mechanical property and oxidation behavior of HfZrTiTaBx HEAs
The unique structural and thermal features of high-entropy alloys (HEAs) contribute to their excellent stability and mechanical properties. Recent research has suggested that high-entropy alloys composed of refractory metals exhibit competitive phase stability and strength at elevated temperatures, which makes them promising candidate materials for high-temperature structural applications at even higher temperatures than those served by Ni-based superalloys. However, alloys consisting solely of refractory metal elements are usually easily oxidized in oxidizing environments at high temperatures. This work aims to prepare a refractory HEA with both excellent mechanical properties and outstanding oxidation resistance by alloying with B. In this study, an equimolar quaternary HfZrTiTa alloy and three HfZrTiTaBx (x = 1.1, 2.3, 4.7) alloys with different amounts of B addition were produced by the vacuum arc melting technique in an argon atmosphere. The structures of the prepared alloys were characterized via X-ray diffraction and TEM. The oxidation behaviors of these alloys were investigated by differential scanning calorimetry (DSC) from 25 °C to 1300 °C in air. Their mechanical properties at room temperature and phase stability at annealing temperatures from 800 °C to 1600 °C were also examined. The results show that the HfZrTiTa alloy consists of a fully disordered body-centered cubic (BCC) solid solution phase due to the high mixing entropy, while the alloys with B addition have nanoparticles uniformly distributed in the BCC solid solution matrix. The lattice parameters and Vickers hardness of the B-containing alloys increase with increasing B content due to interstitial solid solution strengthening by B and nanoprecipitation strengthening. The BCC structure of all alloy samples remains stable up to 1200 °C.
The quaternary HfZrTiTa alloy has a flexural strength of 2.3 GPa with a typical dimple fracture morphology, indicating that the alloy shows some ductility. The oxidation rates of the HfZrTiTaBx (x = 1.1, 2.3, 4.7) alloys at 1300 °C were about 0.13-0.15 g·mm⁻²·h⁻¹, obviously lower than that of the HfZrTiTa alloy (0.454 g·mm⁻²·h⁻¹).
Learning Navigational Visual Representations with Semantic Map Supervision
Being able to perceive the semantics and the spatial structure of the
environment is essential for visual navigation of a household robot. However,
most existing works only employ visual backbones pre-trained either with
independent images for classification or with self-supervised learning methods
to adapt to the indoor navigation domain, neglecting the spatial relationships
that are essential to the learning of navigation. Inspired by the behavior that
humans naturally build semantically and spatially meaningful cognitive maps in
their brains during navigation, in this paper we propose a novel
navigation-specific visual representation learning method by contrasting the
agent's egocentric views and semantic maps (Ego-Map). We use a vision
transformer as the backbone encoder and train the model with data collected
from the large-scale Habitat-Matterport3D environments. Ego-Map learning
transfers the compact and rich information from a map, such as objects,
structure and transition, to the agent's egocentric representations for
navigation. Experiments show that agents using our learned representations on
object-goal navigation outperform recent visual pre-training methods. Moreover,
our representations significantly improve vision-and-language navigation in
continuous environments for both high-level and low-level action spaces,
achieving new state-of-the-art results of 47% SR and 41% SPL on the test
server.
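The contrast between egocentric views and their corresponding semantic maps can be sketched as a symmetric InfoNCE objective. This is a generic illustration of that family of losses using random NumPy vectors, not Ego-Map's actual training code; the temperature and dimensions are arbitrary assumptions.

```python
import numpy as np

def info_nce(view_emb, map_emb, temperature=0.07):
    """Symmetric InfoNCE: the i-th view should match the i-th map."""
    v = view_emb / np.linalg.norm(view_emb, axis=1, keepdims=True)
    m = map_emb / np.linalg.norm(map_emb, axis=1, keepdims=True)
    logits = v @ m.T / temperature  # (batch, batch) similarity matrix
    idx = np.arange(len(v))
    # Cross-entropy with the diagonal as the positive pairs, both directions.
    log_p_vm = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    log_p_mv = logits.T - np.log(np.exp(logits.T).sum(axis=1, keepdims=True))
    return -(log_p_vm[idx, idx].mean() + log_p_mv[idx, idx].mean()) / 2

rng = np.random.default_rng(0)
views = rng.normal(size=(8, 16))                  # egocentric view embeddings
maps_ = views + 0.01 * rng.normal(size=(8, 16))   # nearly aligned map embeddings
loss_aligned = info_nce(views, maps_)
loss_random = info_nce(views, rng.normal(size=(8, 16)))
```

As expected for a contrastive objective, the loss is much lower when each view embedding is close to its paired map embedding than when the pairing is random.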
Contrastive Video Question Answering via Video Graph Transformer
We propose to perform video question answering (VideoQA) in a Contrastive
manner via a Video Graph Transformer model (CoVGT). CoVGT's uniqueness and
superiority are three-fold: 1) It proposes a dynamic graph transformer module
which encodes video by explicitly capturing the visual objects, their relations
and dynamics, for complex spatio-temporal reasoning. 2) It designs separate
video and text transformers for contrastive learning between the video and text
to perform QA, instead of a multi-modal transformer for answer classification.
Fine-grained video-text communication is done by additional cross-modal
interaction modules. 3) It is optimized by the joint fully- and self-supervised
contrastive objectives between the correct and incorrect answers, as well as
the relevant and irrelevant questions, respectively. With its superior video
encoding and QA solution, we show that CoVGT achieves much better
performance than previous arts on video reasoning tasks. Its performance even
surpasses that of models pretrained with millions of external data samples. We
further show that CoVGT can also benefit from cross-modal pretraining, yet with
orders of magnitude smaller data. The results demonstrate the effectiveness and
superiority of CoVGT, and additionally reveal its potential for more
data-efficient pretraining. We hope our success can advance VideoQA beyond
coarse recognition/description towards fine-grained relation reasoning of video
contents. Our code is available at https://github.com/doc-doc/CoVGT.
Comment: Accepted by IEEE T-PAMI'2
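The contrast between correct and incorrect answers can be sketched as a cross-entropy over question-answer similarities. This is a generic NumPy illustration of that kind of objective, not CoVGT's implementation; the embedding sizes and temperature are arbitrary assumptions.

```python
import numpy as np

def answer_contrastive_loss(q_emb, answer_embs, correct_idx, temperature=0.1):
    """Cross-entropy over question-answer similarities: pull the correct
    answer's embedding toward the question, push the incorrect ones away."""
    q = q_emb / np.linalg.norm(q_emb)
    a = answer_embs / np.linalg.norm(answer_embs, axis=1, keepdims=True)
    logits = a @ q / temperature                     # one score per candidate
    log_probs = logits - np.log(np.exp(logits).sum())  # log-softmax
    return -log_probs[correct_idx]

rng = np.random.default_rng(1)
q = rng.normal(size=32)                    # question (+ video) embedding
answers = rng.normal(size=(5, 32))         # five candidate answer embeddings
answers[2] = q + 0.05 * rng.normal(size=32)  # candidate 2 resembles the question
loss_good = answer_contrastive_loss(q, answers, correct_idx=2)
loss_bad = answer_contrastive_loss(q, answers, correct_idx=0)
```

Labeling the similar candidate as correct yields a much smaller loss than labeling a random one, which is the gradient signal that separates correct from incorrect answers during training.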