2,521 research outputs found

    VLN-PETL: Parameter-Efficient Transfer Learning for Vision-and-Language Navigation

    The performance of Vision-and-Language Navigation (VLN) tasks has progressed rapidly in recent years thanks to the use of large pre-trained vision-and-language models. However, fully fine-tuning the pre-trained model for every downstream VLN task is becoming costly due to the considerable model size. Parameter-Efficient Transfer Learning (PETL), a recent research hotspot, shows great potential for efficiently tuning large pre-trained models on common CV and NLP tasks: it exploits most of the representation knowledge implied in the pre-trained model while tuning only a minimal set of parameters. However, naively applying existing PETL methods to the more challenging VLN tasks can bring non-trivial performance degradation. We therefore present the first study of PETL methods for VLN tasks and propose a VLN-specific PETL method named VLN-PETL. Specifically, we design two PETL modules: the Historical Interaction Booster (HIB) and the Cross-modal Interaction Booster (CIB). We then combine these two modules with several existing PETL methods into the integrated VLN-PETL. Extensive experimental results on four mainstream VLN tasks (R2R, REVERIE, NDH, RxR) demonstrate the effectiveness of the proposed VLN-PETL, which achieves comparable or even better performance than full fine-tuning and outperforms other PETL methods by promising margins.

    Comment: Accepted by ICCV 2023
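
    To make the PETL idea above concrete, the following is a minimal PyTorch sketch of the bottleneck-adapter pattern that methods in this family share: a small residual module is trained while the backbone stays frozen. This is not the paper's HIB or CIB module; all names below are illustrative assumptions.

        import torch
        import torch.nn as nn

        class BottleneckAdapter(nn.Module):
            """Generic residual bottleneck adapter, the common building block
            of PETL methods (a sketch, not the paper's HIB/CIB modules)."""

            def __init__(self, dim: int, bottleneck: int = 64):
                super().__init__()
                self.down = nn.Linear(dim, bottleneck)
                self.up = nn.Linear(bottleneck, dim)
                self.act = nn.GELU()
                nn.init.zeros_(self.up.weight)  # start as an identity mapping
                nn.init.zeros_(self.up.bias)

            def forward(self, x: torch.Tensor) -> torch.Tensor:
                return x + self.up(self.act(self.down(x)))

        def tune_adapters_only(model: nn.Module) -> None:
            # Freeze everything except parameters inside adapter modules, so
            # only a minimal set of parameters receives gradients.
            for name, param in model.named_parameters():
                param.requires_grad = "adapter" in name

    Zero-initializing the up-projection makes each adapter start as an identity function, so training begins from the frozen backbone's behavior.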

    March in Chat: Interactive Prompting for Remote Embodied Referring Expression

    Many Vision-and-Language Navigation (VLN) tasks have been proposed in recent years, from room-based to object-based and from indoor to outdoor. REVERIE (Remote Embodied Referring Expression) is interesting in that it only provides high-level instructions to the agent, which are closer to human commands in practice. Nevertheless, this poses more challenges than other VLN tasks, since it requires the agent to infer a navigation plan from only a short instruction. Large Language Models (LLMs) show great potential in robot action planning when given proper prompts, but this strategy has not been explored under the REVERIE setting, which raises several new challenges. For example, the LLM should be environment-aware so that the navigation plan can be adjusted based on the current visual observation, and the LLM-planned actions should be adaptable to the much larger and more complex REVERIE environments. This paper proposes a March-in-Chat (MiC) model that talks to the LLM on the fly and plans dynamically based on a newly proposed Room-and-Object Aware Scene Perceiver (ROASP). Our MiC model outperforms the previous state-of-the-art by large margins on the SPL and RGSPL metrics of the REVERIE benchmark.

    Comment: Accepted by ICCV 2023
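
    A hypothetical sketch of the "talking to the LLM on the fly" loop described above: a scene perceiver (standing in for ROASP) summarizes the current observation, the summary is folded into the prompt, and the LLM returns the next sub-goal. The interfaces are illustrative assumptions, not the paper's actual prompts or code.

        def plan_next_step(llm, perceive_scene, instruction: str, observation) -> str:
            # Summarize the current view, e.g. "a kitchen with a sink and a fridge".
            scene = perceive_scene(observation)
            prompt = (
                f"Instruction: {instruction}\n"
                f"Current scene: {scene}\n"
                "What should the agent do next? Answer with one short action."
            )
            return llm(prompt)  # e.g. "walk to the fridge"

        def navigate(agent, llm, perceive_scene, instruction: str, max_steps: int = 20):
            # Re-plan after every step so the plan stays environment-aware.
            for _ in range(max_steps):
                action = plan_next_step(llm, perceive_scene, instruction, agent.observe())
                if agent.execute(action):  # assume True once the goal is reached
                    break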

    DS-SLAM: A Semantic Visual SLAM towards Dynamic Environments

    Simultaneous Localization and Mapping (SLAM) is considered a fundamental capability for intelligent mobile robots. Over the past decades, many impressive SLAM systems have been developed that achieve good performance under certain circumstances. However, some problems remain unsolved, for example how to handle moving objects in dynamic environments and how to make robots truly understand their surroundings and accomplish advanced tasks. In this paper, a robust semantic visual SLAM system for dynamic environments, named DS-SLAM, is proposed. Five threads run in parallel in DS-SLAM: tracking, semantic segmentation, local mapping, loop closing, and dense semantic map creation. DS-SLAM combines a semantic segmentation network with a moving-consistency check to reduce the impact of dynamic objects, which greatly improves localization accuracy in dynamic environments. Meanwhile, a dense semantic octo-tree map is produced, which can be employed for high-level tasks. We conduct experiments both on the TUM RGB-D dataset and in real-world environments. The results demonstrate that the absolute trajectory accuracy of DS-SLAM improves by one order of magnitude compared with ORB-SLAM2, making it one of the state-of-the-art SLAM systems in highly dynamic environments. The code is available at https://github.com/ivipsourcecode/DS-SLAM

    Comment: 7 pages, accepted at the 2018 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS 2018)
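
    A minimal sketch of a moving-consistency check in the spirit of the one described above: matched keypoints whose current positions violate the epipolar constraint of a RANSAC-estimated fundamental matrix are treated as lying on moving objects. Thresholds and the interaction with the segmentation mask are simplified assumptions, not the paper's implementation.

        import cv2
        import numpy as np

        def static_point_mask(pts_prev: np.ndarray, pts_curr: np.ndarray,
                              epipolar_thresh: float = 1.0) -> np.ndarray:
            """Return a boolean mask over matched (N, 2) point arrays marking
            points consistent with a static scene. Needs at least 8 matches."""
            F, _ = cv2.findFundamentalMat(pts_prev, pts_curr, cv2.FM_RANSAC,
                                          ransacReprojThreshold=epipolar_thresh)
            ones = np.ones((len(pts_prev), 1))
            # Epipolar line in the current image for each previous point.
            lines = (F @ np.hstack([pts_prev, ones]).T).T
            # Point-to-line distance |a*x + b*y + c| / sqrt(a^2 + b^2).
            num = np.abs(np.sum(lines * np.hstack([pts_curr, ones]), axis=1))
            den = np.sqrt(lines[:, 0] ** 2 + lines[:, 1] ** 2)
            return (num / den) < epipolar_thresh

    Points failing the check would then be excluded from tracking, with the semantic segmentation result deciding whether a whole object (e.g., a person) should be discarded.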

    Aβ Damages Learning and Memory in Alzheimer's Disease Rats with Kidney-Yang Deficiency

    Previous studies suggested that Alzheimer's disease can be considered a consequence of deficiency of Kidney essence; however, the mechanism underlying the symptoms remains elusive. Here we report that spatial learning and memory, escape, and swimming capacities were significantly impaired in Kidney-yang deficiency rats. Increases in both hippocampal Aβ40 and Aβ42 in Kidney-yang deficiency contribute to these learning and memory impairments. Specifically, damage to synaptic plasticity is involved in the learning and memory impairment of Kidney-yang deficiency rats. We determined that the learning and memory damage in Kidney-yang deficiency, due to synaptic plasticity impairment and increases in Aβ40 and Aβ42, was not caused by NMDA receptor internalization induced by the Aβ increase. A β-adrenergic receptor agonist can rescue the impaired long-term potentiation (LTP) in Kidney-yang rats. Taken together, our results suggest that the inhibition of spatial learning and memory in Kidney-yang deficiency may be induced by the Aβ increase and a decrease of β2 receptor function in glia

    ResFormer: Scaling ViTs with Multi-Resolution Training

    Vision Transformers (ViTs) have achieved overwhelming success, yet they suffer from fragile resolution scalability: performance drops drastically when they are presented with input resolutions unseen during training. We introduce ResFormer, a framework built upon the seminal idea of multi-resolution training, for improved performance on a wide spectrum of, mostly unseen, testing resolutions. In particular, ResFormer operates on replicated images of different resolutions and enforces a scale consistency loss to engage interactive information across scales. More importantly, to alternate among varying resolutions effectively, especially novel ones at test time, we propose a global-local positional embedding strategy that changes smoothly conditioned on input size. We conduct extensive image classification experiments on ImageNet. The results provide strong quantitative evidence that ResFormer scales promisingly to a wide range of resolutions: for instance, ResFormer-B-MR achieves Top-1 accuracies of 75.86% and 81.72% when evaluated on relatively low and high resolutions respectively (i.e., 96 and 640), which are 48% and 7.49% better than DeiT-B. We also demonstrate that ResFormer is flexible and can be easily extended to semantic segmentation, object detection, and video action recognition. Code is available at https://github.com/ruitian12/resformer.

    Comment: CVPR 2023
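
    To illustrate the multi-resolution training idea, here is a minimal sketch of one plausible form of a scale consistency loss: each image is replicated at several resolutions, and lower-resolution predictions are pulled toward the highest-resolution ones. The exact loss and the positional embedding strategy in the paper differ; this is an assumption-laden sketch, and the model interface is hypothetical.

        import torch
        import torch.nn.functional as F

        def scale_consistency_loss(model, images: torch.Tensor,
                                   resolutions=(96, 224, 384)) -> torch.Tensor:
            # Run the same batch at several resolutions.
            logits = [model(F.interpolate(images, size=(r, r), mode="bilinear",
                                          align_corners=False))
                      for r in resolutions]
            # Treat the highest resolution as the (detached) target distribution.
            teacher = logits[-1].detach().softmax(dim=-1)
            # KL-divergence from each lower-resolution prediction to the target.
            return sum(F.kl_div(l.log_softmax(dim=-1), teacher,
                                reduction="batchmean")
                       for l in logits[:-1]) / (len(logits) - 1)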

    Experiments and simulations of MEMS thermal sensors for wall shear-stress measurements in aerodynamic control applications

    MEMS thermal shear-stress sensors exploit heat-transfer effects to measure the shear stress exerted by an air flow on its solid boundary, and have promising applications in aerodynamic control. Classical theory for conventional, macroscale thermal shear-stress sensors states that the rate of heat removed by the flow from the sensor is proportional to the 1/3-power of the shear stress. However, we have observed that this theory is inconsistent with experimental data from MEMS sensors. This paper seeks to develop an understanding of MEMS thermal shear-stress sensors through a study including both experimental and theoretical investigations. We first obtain experimental data that confirm the inadequacy of the classical theory by wind-tunnel testing of prototype MEMS shear-stress sensors of different dimensions and materials. A theoretical analysis then identifies that this inadequacy is due to the lack of a thin thermal boundary layer in the fluid flow at the sensor surface, and a two-dimensional MEMS shear-stress sensor theory is presented. This theory incorporates important heat-transfer effects that are ignored by the classical theory, and consistently explains the experimental data obtained from the prototype MEMS sensors. Moreover, the prototype MEMS sensors are studied with three-dimensional simulations, yielding results that quantitatively agree with the experimental data. This work demonstrates that classical assumptions made for conventional thermal devices should be carefully examined for miniature MEMS devices.
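
    For concreteness, the classical 1/3-power law named above can be used as a calibration: fit the proportionality constant from measurements, then invert it to read shear stress from the heat-removal rate. The sketch below uses placeholder numbers, not data from the paper, and is exactly the relation the paper shows to break down at MEMS scales.

        import numpy as np

        # Placeholder wind-tunnel-style data; NOT measurements from the paper.
        tau = np.array([0.1, 0.5, 1.0, 2.0, 4.0])     # wall shear stress, Pa
        q = np.array([0.47, 0.80, 1.00, 1.26, 1.59])  # heat removal rate, arb. units

        # Classical law: q = C * tau**(1/3). Least-squares fit of C through origin.
        x = tau ** (1.0 / 3.0)
        C = np.sum(x * q) / np.sum(x * x)

        # Invert the calibration to infer shear stress from a heat reading.
        tau_inferred = (q / C) ** 3
        print(f"C = {C:.3f}, max relative error = "
              f"{np.max(np.abs(tau_inferred - tau) / tau):.2%}")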

    Shrinking Class Space for Enhanced Certainty in Semi-Supervised Learning

    Semi-supervised learning is attracting growing attention due to its success in leveraging unlabeled data. To mitigate potentially incorrect pseudo labels, recent frameworks mostly set a fixed confidence threshold to discard uncertain samples. This practice ensures high-quality pseudo labels but results in relatively low utilization of the whole unlabeled set. In this work, our key insight is that these uncertain samples can be turned into certain ones, as long as the confusion classes competing with the top-1 class are detected and removed. Motivated by this, we propose a novel method, dubbed ShrinkMatch, to learn from uncertain samples. For each uncertain sample, it adaptively seeks a shrunk class space that merely contains the original top-1 class and the remaining less likely classes. Since the confusing classes are removed from this space, the re-calculated top-1 confidence can satisfy the pre-defined threshold. We then impose a consistency regularization between a pair of strongly and weakly augmented samples in the shrunk space to strive for discriminative representations. Furthermore, considering the varied reliability among uncertain samples and the gradually improving model during training, we design two corresponding reweighting principles for our uncertain loss. Our method exhibits impressive performance on widely adopted benchmarks. Code is available at https://github.com/LiheYoung/ShrinkMatch.

    Comment: Accepted by ICCV 2023
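
    A minimal sketch of the shrinking step described above, under my reading of the abstract: starting from the full class ranking, the strongest remaining competitor of the top-1 class is dropped until the renormalized top-1 confidence clears the threshold. Reweighting and the strong/weak consistency loss are omitted; the function name and interface are illustrative.

        import torch

        def shrink_class_space(probs: torch.Tensor, threshold: float = 0.95):
            """probs: (num_classes,) softmax output for one weakly augmented
            sample. Returns the retained class indices, top-1 first."""
            keep = torch.argsort(probs, descending=True).tolist()
            while len(keep) > 1:
                kept = probs[torch.tensor(keep)]
                # Renormalized top-1 confidence inside the shrunk space.
                if (kept[0] / kept.sum()).item() >= threshold:
                    break
                keep.pop(1)  # drop the strongest remaining confusion class
            return keep

        # Toy usage: an uncertain prediction becomes certain once class 1 is removed.
        p = torch.tensor([0.50, 0.40, 0.06, 0.04])
        print(shrink_class_space(p))  # -> [0, 2, 3]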

    Scaling Data Generation in Vision-and-Language Navigation

    Recent research in language-guided visual navigation has demonstrated a significant demand for diverse traversable environments and large quantities of supervision for training generalizable agents. To tackle the common data scarcity issue in existing vision-and-language navigation datasets, we propose an effective paradigm for generating large-scale training data, which draws on 1200+ photo-realistic environments from the HM3D and Gibson datasets and synthesizes 4.9 million instruction-trajectory pairs using fully accessible resources on the web. Importantly, we investigate the influence of each component of this paradigm on the agent's performance and study how best to apply the augmented data for pre-training and fine-tuning an agent. Thanks to our large-scale dataset, the performance of an existing agent can be pushed up (+11% absolute over the previous SoTA) to a new best of 80% single-run success rate on the R2R test split by simple imitation learning. The long-standing generalization gap between navigating seen and unseen environments is also reduced to less than 1% (versus 8% in the previous best method). Moreover, our paradigm enables different models to achieve new state-of-the-art navigation results on CVDN, REVERIE, and R2R in continuous environments.

    Comment: ICCV 2023
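
    The "simple imitation learning" mentioned above is, in its plainest form, behavior cloning on the synthesized instruction-trajectory pairs: cross-entropy between the agent's action distribution and the ground-truth action at each step. The sketch below uses hypothetical agent and batch interfaces, not the paper's training code.

        import torch
        import torch.nn.functional as F

        def behavior_cloning_step(agent, optimizer, batch) -> float:
            # agent maps (instructions, observations) to per-step action logits
            # of shape (batch, time, num_actions); batch["actions"] holds the
            # ground-truth action indices of shape (batch, time).
            logits = agent(batch["instructions"], batch["observations"])
            loss = F.cross_entropy(logits.flatten(0, 1),
                                   batch["actions"].flatten())
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
            return loss.item()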