46 research outputs found

    NavGPT: Explicit Reasoning in Vision-and-Language Navigation with Large Language Models

    Trained on an unprecedented scale of data, large language models (LLMs) such as ChatGPT and GPT-4 exhibit significant reasoning abilities that emerge with model scale. This trend underscores the potential of training LLMs on virtually unlimited language data, advancing the development of a universal embodied agent. In this work, we introduce NavGPT, a purely LLM-based instruction-following navigation agent, to reveal the reasoning capability of GPT models in complex embodied scenes by performing zero-shot sequential action prediction for vision-and-language navigation (VLN). At each step, NavGPT takes textual descriptions of visual observations, navigation history, and future explorable directions as inputs, reasons about the agent's current status, and decides how to approach the target. Through comprehensive experiments, we demonstrate that NavGPT can explicitly perform high-level planning for navigation, including decomposing instructions into sub-goals, integrating commonsense knowledge relevant to the navigation task, identifying landmarks in observed scenes, tracking navigation progress, and adapting to exceptions by adjusting the plan. Furthermore, we show that LLMs are capable of generating high-quality navigational instructions from the observations and actions along a path, as well as drawing accurate top-down metric trajectories given the agent's navigation history. Although the zero-shot performance of NavGPT on R2R tasks still falls short of trained models, we suggest adapting multi-modality inputs so that LLMs can be used as visual navigation agents, and applying the explicit reasoning of LLMs to benefit learning-based models.
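
    The per-step loop the abstract describes can be sketched as follows. This is a minimal toy sketch under assumed names: `build_prompt`, `navgpt_step`, and the prompt wording are hypothetical, and a stub stands in for the actual GPT-4 call; the paper's real prompts and parsing differ.

```python
# Minimal sketch of a NavGPT-style zero-shot navigation step. All helper
# names and the prompt format are illustrative assumptions, not the paper's.

def build_prompt(instruction, history, observations, candidates):
    """Assemble the textual state the LLM reasons over at each step."""
    lines = [
        f"Instruction: {instruction}",
        "History: " + ("; ".join(history) if history else "none"),
        "Current observations:",
    ]
    lines += [f"  - {o}" for o in observations]
    lines.append("Explorable directions: " + ", ".join(candidates))
    lines.append("Reply with exactly one direction from the list above.")
    return "\n".join(lines)

def navgpt_step(llm, instruction, history, observations, candidates):
    """One zero-shot action prediction: prompt the LLM, parse its choice."""
    answer = llm(build_prompt(instruction, history, observations, candidates))
    # Pick the first candidate direction mentioned in the reply; fall back
    # to the first candidate if the reply names no valid direction.
    for c in candidates:
        if c in answer:
            return c
    return candidates[0]

# Stub standing in for a GPT-4 call, so the loop is runnable offline.
def stub_llm(prompt):
    return "I can see the stairs ahead, so I choose: go upstairs"

action = navgpt_step(
    stub_llm,
    "Walk up the stairs and stop at the bedroom door.",
    ["turned left at the kitchen"],
    ["a staircase directly ahead", "a sofa to the right"],
    ["go upstairs", "turn right", "stop"],
)
print(action)  # → go upstairs
```

    In the real system, the observation descriptions would themselves be produced by a captioning model, and the history summary would be maintained across steps.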

    Preparation, oxidation and ablation resistance of IrAl intermetallic coating

    Iridium (Ir) has been selected as the protective coating on the rhenium thruster chamber of liquid rocket engines because of its high melting point, excellent corrosion resistance, and very low oxygen permeability. However, above 1100°C in oxidizing environments Ir forms gaseous oxides rather than a protective oxide barrier, limiting its lifetime at high temperatures. To improve the oxidation resistance, in the present work pure Ir was modified by pack cementation to produce a single-phase IrAl intermetallic coating. The bond strength of the coating was examined by a coating pull-off test. The oxidation and ablation resistance were assessed by a cyclic oxidation test at 1800°C and a high-frequency plasma wind tunnel test (heat flux: 2.03 MW/m²; enthalpy: 19 MJ/kg), respectively. It was found that the IrAl coating is well bonded to the substrate, with a bond strength above 30 MPa. The oxidation and ablation resistance of the Ir was significantly enhanced after the pack cementation treatment (see Figure 1). The improvement can be ascribed to the excellent comprehensive properties of the in-situ formed Al2O3 barrier and the outstanding physical and chemical compatibility among the phases in the multilayer coating system.

    Bi-directional Training for Composed Image Retrieval via Text Prompt Learning

    Composed image retrieval searches for a target image based on a multi-modal user query comprising a reference image and modification text describing the desired changes. Existing approaches to this challenging task learn a mapping from the (reference image, modification text) pair to an image embedding that is then matched against a large image corpus. One direction that has not yet been explored is the reverse, which asks: what reference image, when modified as described by the text, would produce the given target image? In this work we propose a bi-directional training scheme that leverages such reversed queries and can be applied to existing composed image retrieval architectures. To encode the bi-directional query we prepend a learnable token to the modification text that designates the direction of the query, and then fine-tune the parameters of the text embedding module. We make no other changes to the network architecture. Experiments on two standard datasets show that our approach improves over a baseline BLIP-based model that itself already achieves state-of-the-art performance. (12 pages, 5 figures.)
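
    The direction-token mechanism can be sketched with toy embeddings. This is an illustrative sketch only: the shapes, the mean-pooling `text_encoder` stand-in, and the function names are assumptions, whereas the actual method fine-tunes a BLIP text encoder.

```python
import numpy as np

# Toy sketch of the bi-directional query encoding: one learnable token per
# query direction is prepended to the modification-text token embeddings,
# and the (otherwise unchanged) text encoder processes the result.

rng = np.random.default_rng(0)
d = 8  # embedding dimension (toy size)

# Two learnable direction tokens: forward (reference -> target) and reverse.
direction_tokens = {
    "forward": rng.normal(size=(1, d)),
    "reverse": rng.normal(size=(1, d)),
}

def text_encoder(token_embs):
    # Stand-in for the fine-tuned text embedding module: mean-pool tokens.
    return token_embs.mean(axis=0)

def encode_query(modification_embs, direction):
    """Prepend the direction token, then run the unchanged encoder."""
    tokens = np.concatenate([direction_tokens[direction], modification_embs])
    return text_encoder(tokens)

mod = rng.normal(size=(5, d))          # toy embeddings of the modification text
q_fwd = encode_query(mod, "forward")   # used to retrieve the target image
q_rev = encode_query(mod, "reverse")   # used to retrieve the reference image
print(q_fwd.shape)  # → (8,)
```

    The point of the design is that the same text tower serves both query directions, with only the prepended token (and fine-tuned text-encoder weights) distinguishing them.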

    Effect of infiltration time on the microstructure and mechanical properties of C/C-SiC composite prepared by Si-Zr10 alloyed melt infiltration

    Low-cost C/C-SiC composites were prepared through reactive melt infiltration with a Si-Zr10 alloy infiltrant under different infiltration times. The effect of infiltration time on the microstructure and mechanical properties of the composite was investigated. According to the X-ray diffraction results, ZrC tended to form in the composite, and the amount of the carbon phase decreased as the infiltration time was extended. The phase transformation of the C/C-SiC composite was analyzed based on the C-Si-Zr phase diagram. The flexural strength of the composite prepared from the 0.9 g/cm³ preform decreased with increasing infiltration time, while that of the composite prepared from the 1.38 g/cm³ preform increased initially and then decreased. The highest flexural strength of the composite was about 324 MPa. The flexural strength is considered to depend on the phase composition and the fiber-matrix interface. This work was supported by the National Natural Science Foundation of China (51302315), the Innovation Foundation for Excellent Postgraduates of the National University of Defense Technology, and the Hunan Provincial Innovation Foundation for Postgraduates. Yonggang Tong also thanks the China Scholarship Council for its support.

    Microstructure, mechanical properties and oxidation behavior of HfZrTiTaBx HEAs

    The unique structural and thermal features of high-entropy alloys (HEAs) contribute to their excellent stability and mechanical properties. Recent research has suggested that high-entropy alloys composed of refractory metals exhibit competitive phase stability and strength at elevated temperatures, making them promising candidate materials for high-temperature structural applications at even higher temperatures than Ni-based superalloys. However, alloys consisting solely of refractory metal elements are usually easily oxidized in oxidizing environments at high temperatures. This work aims to prepare a refractory HEA with both excellent mechanical properties and outstanding oxidation resistance by alloying with boron. In this study, an equimolar quaternary HfZrTiTa alloy and three HfZrTiTaBx (x = 1.1, 2.3, 4.7) alloys with different amounts of B addition were produced by vacuum arc melting in an argon atmosphere. The structures of the prepared alloys were characterized by X-ray diffraction and TEM. The oxidation behavior of these alloys was investigated by differential scanning calorimetry (DSC) from 25°C to 1300°C in air. Their mechanical properties at room temperature and phase stability at annealing temperatures from 800°C to 1600°C were also examined. The results show that the HfZrTiTa alloy consists of a fully disordered body-centered cubic (BCC) solid-solution phase due to the high mixing entropy, while the alloys with B addition contain nanoparticles uniformly distributed in the BCC solid-solution matrix. The lattice parameters and Vickers hardness of the B-containing alloys increase with increasing B content, owing to interstitial solid-solution strengthening by B and nano-precipitation strengthening. The BCC structure of all alloy samples remains stable up to 1200°C. The quaternary HfZrTiTa alloy has a flexural strength of 2.3 GPa with a typical dimple fracture morphology, indicating that the alloy is ductile to some extent. The oxidation rates of the HfZrTiTaBx (x = 1.1, 2.3, 4.7) alloys at 1300°C were about 0.13-0.15 g·mm⁻²·h⁻¹, clearly lower than that of the HfZrTiTa alloy (0.454 g·mm⁻²·h⁻¹).

    Learning Navigational Visual Representations with Semantic Map Supervision

    Being able to perceive the semantics and the spatial structure of the environment is essential for the visual navigation of a household robot. However, most existing works employ visual backbones pre-trained either on independent images for classification or with self-supervised learning methods, adapted to the indoor navigation domain, neglecting the spatial relationships that are essential to learning navigation. Inspired by the way humans naturally build semantically and spatially meaningful cognitive maps in their brains during navigation, in this paper we propose a novel navigation-specific visual representation learning method that contrasts the agent's egocentric views and semantic maps (Ego²-Map). We apply a visual transformer as the backbone encoder and train the model with data collected from the large-scale Habitat-Matterport3D environments. Ego²-Map learning transfers the compact and rich information from a map, such as objects, structure, and transitions, to the agent's egocentric representations for navigation. Experiments show that agents using our learned representations on object-goal navigation outperform recent visual pre-training methods. Moreover, our representations significantly improve vision-and-language navigation in continuous environments for both high-level and low-level action spaces, achieving new state-of-the-art results of 47% SR and 41% SPL on the test server.
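
    The view-map contrast at the heart of this kind of pre-training is typically an InfoNCE-style objective over paired embeddings. The sketch below is a generic toy version under assumed settings (embedding sizes, temperature, random data); it is not the paper's actual loss implementation or architecture.

```python
import numpy as np

# Toy sketch of a symmetric InfoNCE objective over paired egocentric-view
# and semantic-map embeddings: matched pairs are pulled together, all other
# pairs in the batch serve as negatives.

def info_nce(view_embs, map_embs, temperature=0.07):
    """Symmetric InfoNCE loss over a batch of (view, map) pairs."""
    v = view_embs / np.linalg.norm(view_embs, axis=1, keepdims=True)
    m = map_embs / np.linalg.norm(map_embs, axis=1, keepdims=True)
    logits = v @ m.T / temperature      # pairwise cosine similarities
    idx = np.arange(len(v))             # the i-th map matches the i-th view

    def xent(l):
        # Cross-entropy with the diagonal as the positive class.
        l = l - l.max(axis=1, keepdims=True)
        log_p = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -log_p[idx, idx].mean()

    # Average the view->map and map->view directions.
    return 0.5 * (xent(logits) + xent(logits.T))

rng = np.random.default_rng(0)
maps = rng.normal(size=(4, 16))
# Views that nearly match their maps should give a much lower loss than
# views unrelated to the maps.
aligned_loss = info_nce(maps + 0.01 * rng.normal(size=(4, 16)), maps)
random_loss = info_nce(rng.normal(size=(4, 16)), maps)
print(aligned_loss < random_loss)
```

    In the actual method, the two embedding streams come from the visual-transformer backbone over egocentric views and an encoder over rendered semantic maps; the toy arrays above stand in for both.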

    Contrastive Video Question Answering via Video Graph Transformer

    We propose to perform video question answering (VideoQA) in a contrastive manner via a Video Graph Transformer model (CoVGT). CoVGT's uniqueness and superiority are three-fold: 1) it proposes a dynamic graph transformer module that encodes video by explicitly capturing visual objects, their relations, and their dynamics for complex spatio-temporal reasoning; 2) it designs separate video and text transformers for contrastive learning between the video and text to perform QA, instead of a multi-modal transformer for answer classification, with fine-grained video-text communication handled by additional cross-modal interaction modules; 3) it is optimized with joint fully- and self-supervised contrastive objectives between correct and incorrect answers, and between relevant and irrelevant questions, respectively. With its superior video encoding and QA solution, we show that CoVGT achieves much better performance than previous methods on video reasoning tasks, even surpassing models pretrained on millions of external examples. We further show that CoVGT can also benefit from cross-modal pretraining, with orders of magnitude less data. The results demonstrate the effectiveness and superiority of CoVGT, and reveal its potential for more data-efficient pretraining. We hope our success can advance VideoQA beyond coarse recognition/description toward fine-grained relational reasoning over video contents. Our code is available at https://github.com/doc-doc/CoVGT. (Accepted by IEEE T-PAMI.)
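
    The answer-level contrastive objective (point 3 above) amounts to scoring the video representation against the correct answer and the incorrect candidates, and applying a cross-entropy over those scores. The following is a generic toy sketch of that idea with assumed dot-product scoring and random stand-in embeddings, not CoVGT's actual model code.

```python
import numpy as np

# Toy sketch of an answer-level contrastive objective for VideoQA: the video
# embedding is scored against every candidate answer embedding, and a softmax
# cross-entropy favors the correct answer over the incorrect ones.

def contrastive_qa_loss(video_emb, answer_embs, correct_idx):
    """Cross-entropy over video-answer similarity scores."""
    sims = answer_embs @ video_emb            # one score per candidate answer
    sims = sims - sims.max()                  # numerical stability
    log_probs = sims - np.log(np.exp(sims).sum())
    return -log_probs[correct_idx]

rng = np.random.default_rng(1)
video = rng.normal(size=16)
answers = rng.normal(size=(5, 16))
answers[2] = video + 0.1 * rng.normal(size=16)  # make answer 2 the best match

# The loss is small when the designated correct answer is the best match,
# and large when an unrelated candidate is labeled correct.
loss_correct = contrastive_qa_loss(video, answers, correct_idx=2)
loss_wrong = contrastive_qa_loss(video, answers, correct_idx=0)
print(loss_correct < loss_wrong)
```

    CoVGT additionally applies the same contrastive idea between relevant and irrelevant questions; structurally that term looks the same, with question embeddings in place of answer embeddings.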