SLUA: A Super Lightweight Unsupervised Word Alignment Model via Cross-Lingual Contrastive Learning
Word alignment is essential for downstream cross-lingual language understanding and generation tasks. Recently, neural word alignment models have surpassed statistical models in performance; however, they rely heavily on sophisticated translation models. In this study, we propose a super lightweight unsupervised word alignment (SLUA) model, which introduces bidirectional symmetric attention trained with a contrastive learning objective and employs an agreement loss to bind the two attention maps, so that the alignments follow the mirror-like symmetry hypothesis. Experimental results on several public benchmarks demonstrate that our model achieves competitive, if not better, performance compared to the state of the art in word alignment while significantly reducing training and decoding time on average. Further ablation analyses and case studies show the superiority of the proposed SLUA. Notably, we regard our model as a pioneering attempt to unify bilingual word embeddings and word alignment. Encouragingly, our approach achieves a 16.4x speedup over GIZA++ and 50x parameter compression compared with Transformer-based alignment methods. We will release our code to facilitate the community. Comment: Work in progress
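To make the mirror-symmetry idea concrete, here is a minimal PyTorch sketch of how an agreement loss could bind the two directional attention maps; the function name and the MSE form are illustrative assumptions, not the authors' released implementation.

```python
import torch
import torch.nn.functional as F

def agreement_loss(attn_src2tgt: torch.Tensor, attn_tgt2src: torch.Tensor) -> torch.Tensor:
    """Penalize disagreement between the two directional soft alignment maps.

    attn_src2tgt: (batch, src_len, tgt_len), each row a distribution over target tokens.
    attn_tgt2src: (batch, tgt_len, src_len), each row a distribution over source tokens.
    """
    # Transpose the target-to-source map so both maps live in (src_len, tgt_len) space,
    # then measure how far they are from being mirror images of each other.
    mirrored = attn_tgt2src.transpose(1, 2)
    return F.mse_loss(attn_src2tgt, mirrored)

# Toy usage with random soft attention maps (batch of 2, 5 source and 7 target tokens).
a_fwd = torch.softmax(torch.randn(2, 5, 7), dim=-1)
a_bwd = torch.softmax(torch.randn(2, 7, 5), dim=-1)
print(agreement_loss(a_fwd, a_bwd))
```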
ST-P3: End-to-end Vision-based Autonomous Driving via Spatial-Temporal Feature Learning
Many existing autonomous driving paradigms involve a multi-stage discrete
pipeline of tasks. To better predict the control signals and enhance user
safety, an end-to-end approach that benefits from joint spatial-temporal
feature learning is desirable. While there are some pioneering works on
LiDAR-based input or implicit design, in this paper we formulate the problem in
an interpretable vision-based setting. In particular, we propose a spatial-temporal feature learning scheme, called ST-P3, that yields a set of more representative features for the perception, prediction and planning tasks simultaneously. Specifically, an egocentric-aligned accumulation technique is proposed to preserve geometric information in 3D space before the bird's-eye-view transformation for perception; dual-pathway modeling is devised to take past motion variations into account for future prediction; and a temporal-based refinement unit is introduced to improve the recognition of vision-based elements for planning. To the best of our knowledge, we are the first to
systematically investigate each part of an interpretable end-to-end
vision-based autonomous driving system. We benchmark our approach against
previous state-of-the-art methods on both the open-loop nuScenes dataset and the closed-loop CARLA simulation. The results show the effectiveness of our method.
Source code, model and protocol details are made publicly available at
https://github.com/OpenPerceptionX/ST-P3. Comment: ECCV 2022
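As a rough illustration of egocentric-aligned accumulation, the sketch below (a hypothetical helper, simplified to 2D affine warps rather than the paper's 3D treatment) aligns past BEV feature maps to the current ego frame and averages them.

```python
import torch
import torch.nn.functional as F

def accumulate_bev(bev_feats, past_to_current):
    """Align past BEV features to the current ego frame and accumulate them.

    bev_feats:       list of T tensors, each (B, C, H, W), oldest first.
    past_to_current: list of T affine matrices (B, 2, 3) approximating the ego
                     motion from each past frame to the current one (a 2D
                     simplification of the real pose handling).
    """
    accumulated = torch.zeros_like(bev_feats[-1])
    for feat, theta in zip(bev_feats, past_to_current):
        # Resample the past BEV map on the current frame's grid.
        grid = F.affine_grid(theta, list(feat.shape), align_corners=False)
        accumulated = accumulated + F.grid_sample(feat, grid, align_corners=False)
    return accumulated / len(bev_feats)

# Toy usage: three past frames, identity ego motion.
feats = [torch.randn(1, 64, 200, 200) for _ in range(3)]
eye = torch.tensor([[[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]])
print(accumulate_bev(feats, [eye, eye, eye]).shape)
```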
DA-STC: Domain Adaptive Video Semantic Segmentation via Spatio-Temporal Consistency
Video semantic segmentation is a pivotal aspect of video representation
learning. However, significant domain shifts present a challenge in effectively
learning invariant spatio-temporal features across the labeled source domain
and the unlabeled target domain for video semantic segmentation. To address this challenge, we propose a novel DA-STC method for domain adaptive video semantic
segmentation, which incorporates a bidirectional multi-level spatio-temporal
fusion module and a category-aware spatio-temporal feature alignment module to
facilitate consistent learning for domain-invariant features. Firstly, we
perform bidirectional spatio-temporal fusion at the image sequence level and
shallow feature level, leading to the construction of two fused intermediate
video domains. This prompts the video semantic segmentation model to
consistently learn spatio-temporal features of shared patch sequences which are
influenced by domain-specific contexts, thereby mitigating the feature gap
between the source and target domain. Secondly, we propose a category-aware
feature alignment module to promote the consistency of spatio-temporal
features, facilitating adaptation to the target domain. Specifically, we
adaptively aggregate the domain-specific deep features of each category along
spatio-temporal dimensions, which are further constrained to achieve
cross-domain intra-class feature alignment and inter-class feature separation.
Extensive experiments demonstrate the effectiveness of our method, which
achieves state-of-the-art mIoU scores on multiple challenging benchmarks.
Furthermore, we extend the proposed DA-STC to the image domain, where it also
exhibits superior performance for domain adaptive semantic segmentation. The
source code and models will be made available at
\url{https://github.com/ZHE-SAPI/DA-STC}. Comment: 18 pages, 9 figures
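A minimal sketch of what category-aware feature alignment across domains could look like, assuming flattened per-pixel features with (pseudo-)labels; the helper names, cosine/margin losses, and aggregation below are illustrative assumptions, not the authors' code.

```python
import torch
import torch.nn.functional as F

def class_prototypes(feats, labels, num_classes):
    """Average features per semantic class. feats: (N, D), labels: (N,) int."""
    protos = torch.zeros(num_classes, feats.size(1), device=feats.device)
    for c in range(num_classes):
        mask = labels == c
        if mask.any():
            protos[c] = feats[mask].mean(dim=0)
    return F.normalize(protos, dim=1)

def align_and_separate(src_protos, tgt_protos, margin=0.3):
    """Intra-class alignment plus inter-class separation across domains."""
    # Pull each class's source prototype towards its target-domain counterpart.
    intra = (1.0 - F.cosine_similarity(src_protos, tgt_protos, dim=1)).mean()
    # Push prototypes of different classes apart (off-diagonal similarities).
    sim = src_protos @ tgt_protos.t()
    off_diag = sim - torch.diag_embed(torch.diagonal(sim))
    inter = F.relu(off_diag - margin).mean()
    return intra + inter

# Toy usage: 19 classes (a Cityscapes-style label space), 128-d features.
src = class_prototypes(torch.randn(4096, 128), torch.randint(0, 19, (4096,)), 19)
tgt = class_prototypes(torch.randn(4096, 128), torch.randint(0, 19, (4096,)), 19)
print(align_and_separate(src, tgt))
```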
Towards Robust Referring Image Segmentation
Referring Image Segmentation (RIS), a fundamental vision-language task, aims to connect image and language by outputting the object mask corresponding to a given text description. Although many works have achieved considerable progress on RIS, in this work we explore an essential question: what if the text description is wrong or misleading? We term such a sentence a negative sentence, and we find that existing works cannot handle this setting. To this end, we propose a novel formulation of RIS, named Robust Referring Image Segmentation (R-RIS), which considers negative sentence inputs in addition to the regular text inputs. We present three different datasets, built by augmenting the inputs with negative sentences, and a new metric to unify both input types. Furthermore, we design a new
transformer-based model named RefSegformer, where we introduce a token-based
vision-language fusion module that can be easily extended to our R-RIS setting by adding extra blank tokens. Our proposed RefSegformer achieves new state-of-the-art results on three regular RIS datasets and three R-RIS datasets, serving as a solid new baseline for further research. The
project page is at \url{https://lxtgh.github.io/project/robust_ref_seg/}. Comment: technical report
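The blank-token idea can be sketched as a toy cross-attention module in which visual tokens attend over language tokens concatenated with a few learnable blank tokens, giving a negative sentence somewhere to "attend to nothing"; the module below is an assumption-laden toy, not RefSegformer itself.

```python
import torch
import torch.nn as nn

class TokenFusionWithBlank(nn.Module):
    """Toy token-based vision-language fusion with extra learnable blank tokens."""

    def __init__(self, dim=256, num_blank=4, num_heads=8):
        super().__init__()
        self.blank = nn.Parameter(torch.randn(num_blank, dim) * 0.02)
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, vis_tokens, lang_tokens):
        # vis_tokens: (B, Nv, D) flattened image features, lang_tokens: (B, Nl, D).
        blank = self.blank.unsqueeze(0).expand(lang_tokens.size(0), -1, -1)
        memory = torch.cat([lang_tokens, blank], dim=1)   # (B, Nl + num_blank, D)
        fused, _ = self.cross_attn(vis_tokens, memory, memory)
        return vis_tokens + fused                          # residual fusion

# Toy usage: 2 images with 1024 visual tokens and 20 language tokens each.
module = TokenFusionWithBlank()
out = module(torch.randn(2, 1024, 256), torch.randn(2, 20, 256))
print(out.shape)  # torch.Size([2, 1024, 256])
```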
Learning to Learn from APIs: Black-Box Data-Free Meta-Learning
Data-free meta-learning (DFML) aims to enable efficient learning of new tasks
by meta-learning from a collection of pre-trained models without access to the
training data. Existing DFML work can only meta-learn from (i) white-box and
(ii) small-scale pre-trained models (iii) with the same architecture,
neglecting the more practical setting where users only have inference access to APIs whose underlying model architectures and scales are arbitrary.
To solve this issue, we propose a Bi-level Data-free Meta Knowledge
Distillation (BiDf-MKD) framework to transfer more general meta knowledge from
a collection of black-box APIs to a single meta model. Specifically, by merely querying the APIs, we invert each API to recover its training data via a
zero-order gradient estimator and then perform meta-learning via a novel
bi-level meta knowledge distillation structure, in which we design a boundary
query set recovery technique to recover a more informative query set near the
decision boundary. In addition, to encourage better generalization within the
setting of limited API budgets, we propose task memory replay to diversify the
underlying task distribution by covering more interpolated tasks. Extensive
experiments in various real-world scenarios show the superior performance of
our BiDf-MKD framework.
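Because the APIs are black boxes, gradients for inverting them must be estimated from forward queries alone. Below is a minimal sketch of a two-point zeroth-order gradient estimator that could drive such data recovery; BiDf-MKD's actual estimator and losses may differ.

```python
import torch

def zero_order_grad(loss_fn, x, num_samples=20, mu=1e-2):
    """Two-point zeroth-order estimate of d loss_fn / d x.

    loss_fn: black-box scalar function (e.g. a loss computed only from API
             outputs), queried by forward passes with no autograd available.
    x:       tensor of synthetic inputs being recovered.
    """
    grad = torch.zeros_like(x)
    for _ in range(num_samples):
        u = torch.randn_like(x)                          # random probe direction
        delta = (loss_fn(x + mu * u) - loss_fn(x - mu * u)) / (2.0 * mu)
        grad = grad + delta * u                          # finite difference along u
    return grad / num_samples

# Toy usage with a known quadratic: the true gradient of 0.5 * ||x||^2 is x itself.
x = torch.randn(8)
est = zero_order_grad(lambda z: 0.5 * (z ** 2).sum(), x, num_samples=2000)
print(est, x)  # the estimate should be close to x
```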
TD^2-Net: Toward Denoising and Debiasing for Dynamic Scene Graph Generation
Dynamic scene graph generation (SGG) focuses on detecting objects in a video
and determining their pairwise relationships. Existing dynamic SGG methods usually suffer from several issues, including 1) contextual noise, as some frames contain occluded and blurred objects, and 2) label bias, primarily due to the high imbalance between a few positive relationship samples and numerous negative ones; in addition, the distribution of relationships exhibits a long-tailed pattern. To address the above problems, in this paper we introduce a network named TD^2-Net that aims at denoising and debiasing for dynamic
SGG. Specifically, we first propose a denoising spatio-temporal transformer
module that enhances object representation with robust contextual information.
This is achieved by designing a differentiable Top-K object selector that
utilizes the Gumbel-Softmax sampling strategy to select the relevant
neighborhood for each object. Second, we introduce an asymmetrical reweighting
loss to alleviate the issue of label bias. This loss function integrates
asymmetry focusing factors and the volume of samples to adjust the weights
assigned to individual samples. Systematic experimental results demonstrate the
superiority of our proposed TD^2-Net over existing state-of-the-art approaches on the Action Genome database. In more detail, TD^2-Net outperforms the second-best competitor by 12.7% on mean-Recall@10 for predicate
classification. Comment: Accepted by AAAI 2024
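The differentiable Top-K selector can be approximated with straight-through Gumbel-Softmax draws, as in the rough sketch below (assumed tensor shapes, not the released TD^2-Net code): k one-hot samples without replacement are accumulated into a k-hot mask over candidate neighbour objects.

```python
import torch
import torch.nn.functional as F

def soft_topk_select(scores, k, tau=0.5):
    """Relaxed Top-K neighbour selection via repeated Gumbel-Softmax draws.

    scores: (B, N) relevance scores of candidate neighbour objects.
    Returns a (B, N) approximately k-hot mask; the straight-through estimator
    keeps the selection differentiable with respect to the scores.
    """
    selected = torch.zeros_like(scores)
    logits = scores.clone()
    for _ in range(k):
        one_hot = F.gumbel_softmax(logits, tau=tau, hard=True)      # one draw per row
        selected = selected + one_hot
        # Mask out the chosen candidates so later draws sample without replacement.
        logits = logits.masked_fill(one_hot.bool(), float("-inf"))
    return selected

# Toy usage: pick 3 neighbours out of 10 candidates for each of 2 objects.
mask = soft_topk_select(torch.randn(2, 10), k=3)
print(mask.sum(dim=-1))  # tensor([3., 3.])
```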