VLN-PETL: Parameter-Efficient Transfer Learning for Vision-and-Language Navigation
The performance on Vision-and-Language Navigation (VLN) tasks has
witnessed rapid progress recently thanks to the use of large pre-trained
vision-and-language models. However, fully fine-tuning the pre-trained model for
every downstream VLN task is becoming costly due to the considerable model
size. The recent research hotspot of Parameter-Efficient Transfer Learning
(PETL) shows great potential for efficiently tuning large pre-trained models on
common CV and NLP tasks: it exploits most of the representation knowledge
implied in the pre-trained model while tuning only a minimal set of
parameters. However, simply applying existing PETL methods to the more
challenging VLN tasks may bring non-trivial degradation in performance.
Therefore, we present the first study to explore PETL methods for VLN tasks and
propose a VLN-specific PETL method named VLN-PETL. Specifically, we design two
PETL modules: Historical Interaction Booster (HIB) and Cross-modal Interaction
Booster (CIB). Then we combine these two modules with several existing PETL
methods as the integrated VLN-PETL. Extensive experimental results on four
mainstream VLN tasks (R2R, REVERIE, NDH, RxR) demonstrate the effectiveness of
our proposed VLN-PETL, where VLN-PETL achieves performance comparable to or
even better than full fine-tuning and outperforms other PETL methods by
promising margins.
Comment: Accepted by ICCV 202
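As a rough illustration of the adapter-style idea behind PETL (a hypothetical sketch in NumPy, not the VLN-PETL implementation): the pre-trained weights stay frozen, and only a small zero-initialized low-rank bottleneck on a residual path is trained, so the tuned parameters are a tiny fraction of the total.

```python
import numpy as np

rng = np.random.default_rng(0)
d, r = 256, 4  # hidden size and adapter bottleneck width (r << d)

# Frozen pre-trained weight: never updated during PETL.
W_frozen = rng.standard_normal((d, d)) / np.sqrt(d)

# Trainable adapter: a zero-initialized low-rank residual path, so the
# adapted layer starts out identical to the frozen one.
A_down = np.zeros((d, r))
A_up = np.zeros((r, d))

def adapted_layer(x):
    """Frozen layer output plus the residual adapter path."""
    h = x @ W_frozen
    return h + np.maximum(x @ A_down, 0.0) @ A_up  # ReLU bottleneck

trainable = A_down.size + A_up.size          # 2 * d * r parameters
total = W_frozen.size + trainable
print(f"trainable fraction: {trainable / total:.1%}")  # ~3% of all parameters
```

With the adapter zero-initialized, the adapted layer initially reproduces the frozen layer exactly, which is why tuning only the bottleneck preserves the pre-trained representation knowledge.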
March in Chat: Interactive Prompting for Remote Embodied Referring Expression
Many Vision-and-Language Navigation (VLN) tasks have been proposed in recent
years, from room-based to object-based and indoor to outdoor. The REVERIE
(Remote Embodied Referring Expression) is interesting since it only provides
high-level instructions to the agent, which are closer to human commands in
practice. Nevertheless, this poses more challenges than other VLN tasks since
it requires agents to infer a navigation plan only based on a short
instruction. Large Language Models (LLMs) show great potential in robot action
planning by providing proper prompts. Still, this strategy has not been
explored under the REVERIE settings. There are several new challenges. For
example, the LLM should be environment-aware so that the navigation plan can be
adjusted based on the current visual observation. Moreover, the LLM planned
actions should be adaptable to the much larger and more complex REVERIE
environment. This paper proposes a March-in-Chat (MiC) model that can talk to
the LLM on the fly and plan dynamically based on a newly proposed
Room-and-Object Aware Scene Perceiver (ROASP). Our MiC model outperforms the
previous state-of-the-art by large margins on the SPL and RGSPL metrics on the
REVERIE benchmark.
Comment: Accepted by ICCV 202
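For reference, the SPL metric mentioned above (Success weighted by Path Length) is commonly computed as follows; the episode numbers in the usage line are made up for illustration.

```python
def spl(successes, shortest_lens, taken_lens):
    """Success weighted by Path Length: mean over episodes of
    S_i * l_i / max(p_i, l_i), where S_i is binary success, l_i the
    shortest-path length, and p_i the length of the agent's path."""
    terms = [s * l / max(p, l)
             for s, l, p in zip(successes, shortest_lens, taken_lens)]
    return sum(terms) / len(terms)

# Three episodes: optimal success, success with a 2x-longer path, failure.
print(spl([1, 1, 0], [10.0, 8.0, 12.0], [10.0, 16.0, 12.0]))  # -> 0.5
```

The metric rewards both reaching the goal and doing so efficiently: a successful episode on a path twice the shortest length contributes only half credit.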
DS-SLAM: A Semantic Visual SLAM towards Dynamic Environments
Simultaneous Localization and Mapping (SLAM) is considered to be a
fundamental capability for intelligent mobile robots. Over the past decades,
many impressive SLAM systems have been developed and have achieved good
performance under certain circumstances. However, some problems are still not
well solved, for example, how to handle moving objects in dynamic environments
and how to make robots truly understand their surroundings and accomplish
advanced tasks. In this paper, a robust semantic visual SLAM for dynamic
environments, named DS-SLAM, is proposed. Five threads run in parallel in
DS-SLAM: tracking, semantic segmentation, local mapping, loop closing, and
dense semantic map creation. DS-SLAM combines a semantic segmentation network
with a moving consistency check method to reduce the impact of dynamic objects,
greatly improving localization accuracy in dynamic environments. Meanwhile, a
dense semantic octree map is produced, which can be employed for high-level
tasks. We conduct experiments both on the TUM RGB-D dataset and in a real-world
environment. The results demonstrate that the absolute trajectory accuracy of
DS-SLAM can be improved by one order of magnitude compared with ORB-SLAM2,
making it one of the state-of-the-art SLAM systems in highly dynamic
environments. Now the code is available at our github:
https://github.com/ivipsourcecode/DS-SLAM
Comment: 7 pages, accepted at the 2018 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS 2018).
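A moving consistency check of the kind described above can be sketched with epipolar geometry: matched keypoints on static scene structure should satisfy the epipolar constraint, so a large point-to-line distance flags a likely moving point. This is an illustrative simplification (not the DS-SLAM code), using a toy fundamental matrix for a pure sideways camera translation.

```python
import numpy as np

def moving_point_mask(pts1, pts2, F, thresh=1.0):
    """Flag matched keypoints whose distance to the epipolar line exceeds
    thresh: static points should satisfy x2^T F x1 = 0, so a large
    point-to-line distance suggests the point itself is moving.
    pts1, pts2: (N, 2) pixel coordinates; F: 3x3 fundamental matrix."""
    ones = np.ones((len(pts1), 1))
    x1 = np.hstack([pts1, ones])              # homogeneous coordinates
    x2 = np.hstack([pts2, ones])
    lines = x1 @ F.T                          # epipolar lines in image 2
    num = np.abs(np.sum(x2 * lines, axis=1))  # |x2 . l|
    den = np.hypot(lines[:, 0], lines[:, 1])  # line normalization
    return num / den > thresh                 # True = likely moving

# Toy fundamental matrix for a pure sideways translation: epipolar lines
# are horizontal, so static points keep their y coordinate.
F = np.array([[0.0, 0.0, 0.0],
              [0.0, 0.0, -1.0],
              [0.0, 1.0, 0.0]])
pts1 = np.array([[10.0, 20.0], [30.0, 40.0]])
pts2 = np.array([[15.0, 20.0], [30.0, 55.0]])  # second match drifts in y
print(moving_point_mask(pts1, pts2, F))  # -> [False  True]
```

Keypoints flagged as moving (here typically those falling inside segmented dynamic objects such as people) would then be excluded from tracking and mapping.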
Aβ Damages Learning and Memory in Alzheimer's Disease Rats with Kidney-Yang Deficiency
Previous studies suggested that Alzheimer's disease can be regarded as a consequence of deficiency of Kidney essence. However, the mechanism underlying the symptoms remains elusive. Here we report that spatial learning and memory, escape, and swimming capacities were significantly impaired in Kidney-yang deficiency rats. Indeed, increases of both hippocampal Aβ40 and Aβ42 in Kidney-yang deficiency contribute to the learning and memory impairments. Specifically, damage to synaptic plasticity is involved in the learning and memory impairment of Kidney-yang deficiency rats. We determined that the learning and memory damage in Kidney-yang deficiency, due to synaptic plasticity impairment and increases of Aβ40 and Aβ42, was not caused by NMDA receptor internalization induced by the Aβ increase. A β-adrenergic receptor agonist can rescue the impaired long-term potentiation (LTP) in Kidney-yang rats. Taken together, our results suggest that the spatial learning and memory inhibition in Kidney-yang deficiency might be induced by the Aβ increase and decreased β2 receptor function in glia.
ResFormer: Scaling ViTs with Multi-Resolution Training
Vision Transformers (ViTs) have achieved overwhelming success, yet they
suffer from vulnerable resolution scalability, i.e., the performance drops
drastically when presented with input resolutions that are unseen during
training. We introduce ResFormer, a framework that is built upon the seminal
idea of multi-resolution training for improved performance on a wide spectrum
of, mostly unseen, testing resolutions. In particular, ResFormer operates on
replicated images of different resolutions and enforces a scale consistency
loss to engage interactive information across different scales. More
importantly, to alternate among varying resolutions effectively, especially
novel ones in testing, we propose a global-local positional embedding strategy
that changes smoothly conditioned on input sizes. We conduct extensive
experiments for image classification on ImageNet. The results provide strong
quantitative evidence that ResFormer has promising scaling abilities towards a
wide range of resolutions. For instance, ResFormer-B-MR achieves a Top-1
accuracy of 75.86% and 81.72% when evaluated on relatively low and high
resolutions respectively (i.e., 96 and 640), which are 48% and 7.49% better
than DeiT-B. We also demonstrate that ResFormer is flexible and can be easily
extended to semantic segmentation, object detection, and video action
recognition. Code is available at https://github.com/ruitian12/resformer.
Comment: CVPR 202
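A scale consistency loss of the kind described above can be sketched roughly as follows (an illustrative formulation, not ResFormer's exact loss): pool the higher-resolution feature map down to the lower-resolution grid and penalize the mean squared difference between the two.

```python
import numpy as np

def scale_consistency_loss(feat_lo, feat_hi):
    """Illustrative scale consistency loss: average-pool the
    high-resolution feature map down to the low-resolution grid,
    then penalize the mean squared difference. Shapes: (H, W, C)."""
    Hl, Wl, C = feat_lo.shape
    Hh, Wh, _ = feat_hi.shape
    fh, fw = Hh // Hl, Wh // Wl  # assume an integer stride between scales
    pooled = feat_hi.reshape(Hl, fh, Wl, fw, C).mean(axis=(1, 3))
    return float(np.mean((feat_lo - pooled) ** 2))

lo = np.ones((4, 4, 8))
hi = np.ones((8, 8, 8))
print(scale_consistency_loss(lo, hi))  # identical constant maps -> 0.0
```

Minimizing such a term pushes the features extracted from replicated images at different resolutions toward agreement, which is the "interactive information across scales" the abstract refers to.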
Experiments and simulations of MEMS thermal sensors for wall shear-stress measurements in aerodynamic control applications
MEMS thermal shear-stress sensors exploit heat-transfer effects to measure the shear stress exerted by an air flow on its solid boundary, and have promising applications in aerodynamic control. Classical theory for conventional, macroscale thermal shear-stress sensors states that the rate of heat removed by the flow from the sensor is proportional to the 1/3-power of the shear stress. However, we have observed that this theory is inconsistent with experimental data from MEMS sensors. This paper seeks to develop an understanding of MEMS thermal shear-stress sensors through a study including both experimental and theoretical investigations. We first obtain experimental data that confirm the inadequacy of the classical theory by wind-tunnel testing of prototype MEMS shear-stress sensors with different dimensions and materials. A theoretical analysis is performed to identify that this inadequacy is due to the lack of a thin thermal boundary layer in the fluid flow at the sensor surface, and then a two-dimensional MEMS shear-stress sensor theory is presented. This theory incorporates important heat-transfer effects that are ignored by the classical theory, and consistently explains the experimental data obtained from prototype MEMS sensors. Moreover, the prototype MEMS sensors are studied with three-dimensional simulations, yielding results that quantitatively agree with experimental data. This work demonstrates that classical assumptions made for conventional thermal devices should be carefully examined for miniature MEMS devices.
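The classical 1/3-power law discussed above lends itself to a simple calibration-and-inversion exercise (illustrative numbers only; real sensors, as the abstract argues, deviate from this law at MEMS scales):

```python
import numpy as np

# Classical macroscale theory: heat removed by the flow scales as
# Q = a + b * tau**(1/3), with tau the wall shear stress.
tau_cal = np.array([0.5, 4.0])              # Pa, calibration shear stresses
Q_cal = 1.0 + 2.0 * tau_cal ** (1.0 / 3.0)  # synthetic sensor readings

# A linear fit of Q against tau^(1/3) recovers the calibration constants.
b, a = np.polyfit(tau_cal ** (1.0 / 3.0), Q_cal, 1)

def tau_from_Q(q):
    """Invert the 1/3-power law: tau = ((Q - a) / b)**3."""
    return ((q - a) / b) ** 3

# A reading taken at tau = 2.0 Pa should be recovered by the inversion.
print(round(tau_from_Q(1.0 + 2.0 * 2.0 ** (1.0 / 3.0)), 6))  # -> 2.0
```

The cubing in the inversion amplifies calibration errors threefold in relative terms, which is one practical reason the validity of the 1/3-power assumption matters for miniature sensors.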
Shrinking Class Space for Enhanced Certainty in Semi-Supervised Learning
Semi-supervised learning is attracting blooming attention due to its success
in leveraging unlabeled data. To mitigate potentially incorrect pseudo labels,
recent frameworks mostly set a fixed confidence threshold to discard uncertain
samples. This practice ensures high-quality pseudo labels, but incurs a
relatively low utilization of the whole unlabeled set. In this work, our key
insight is that these uncertain samples can be turned into certain ones, as
long as the confusion classes for the top-1 class are detected and removed.
Motivated by this, we propose a novel method dubbed ShrinkMatch to learn from
uncertain samples. For each uncertain sample, it adaptively seeks a shrunk
class space, which merely contains the original top-1 class and the remaining
less likely classes. Since the confusion classes are removed in this
space, the re-calculated top-1 confidence can satisfy the pre-defined
threshold. We then impose a consistency regularization between a pair of
strongly and weakly augmented samples in the shrunk space to strive for
discriminative representations. Furthermore, considering the varied reliability
among uncertain samples and the gradually improved model during training, we
correspondingly design two reweighting principles for our uncertain loss. Our
method exhibits impressive performance on widely adopted benchmarks. Code is
available at https://github.com/LiheYoung/ShrinkMatch.
Comment: Accepted by ICCV 202
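The shrinking step described above can be sketched as follows (a minimal illustration assuming softmax probabilities, not the authors' code): drop the strongest confusion classes, those ranked just below the top-1, until the re-normalized top-1 confidence passes the threshold.

```python
import numpy as np

def shrink_to_certain(probs, tau=0.95):
    """Remove the strongest confusion classes (those ranked just below
    the top-1) until the re-normalized top-1 confidence reaches tau.
    Returns (kept class indices, new confidence), or None if the
    sample cannot be made certain."""
    order = np.argsort(probs)[::-1]          # classes sorted by confidence
    top1 = order[0]
    # Try dropping the k strongest non-top-1 classes, smallest k first.
    for k in range(len(probs)):
        kept = np.concatenate(([top1], order[1 + k:]))
        conf = probs[top1] / probs[kept].sum()
        if conf >= tau:
            return kept, conf
    return None

probs = np.array([0.48, 0.40, 0.06, 0.04, 0.02])  # uncertain: top-1 < tau
kept, conf = shrink_to_certain(probs, tau=0.95)
print(kept, round(conf, 3))  # -> [0 4] 0.96
```

In this toy example, the sample is uncertain only because classes 1-3 compete with the top-1; once they are removed, the renormalized confidence (0.96) clears the 0.95 threshold, so the sample can be used in the shrunk space.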
Scaling Data Generation in Vision-and-Language Navigation
Recent research in language-guided visual navigation has demonstrated a
significant demand for the diversity of traversable environments and the
quantity of supervision for training generalizable agents. To tackle the common
data scarcity issue in existing vision-and-language navigation datasets, we
propose an effective paradigm for generating large-scale data for learning,
which leverages 1200+ photo-realistic environments from the HM3D and Gibson
datasets and synthesizes 4.9 million instruction-trajectory pairs using
fully accessible resources on the web. Importantly, we investigate the influence of each
component in this paradigm on the agent's performance and study how to
adequately apply the augmented data to pre-train and fine-tune an agent. Thanks
to our large-scale dataset, the performance of an existing agent can be pushed
up (+11% absolute over the previous SoTA) to a new best of 80% single-run
success rate on the R2R test split by simple imitation learning.
The long-lasting generalization gap between navigating in seen and unseen
environments is also reduced to less than 1% (versus 8% in the previous best
method). Moreover, our paradigm also facilitates different models to achieve
new state-of-the-art navigation results on CVDN, REVERIE, and R2R in continuous
environments.
Comment: ICCV 202