ReLER@ZJU Submission to the Ego4D Moment Queries Challenge 2022
In this report, we present the ReLER@ZJU submission to the Ego4D Moment
Queries Challenge in ECCV 2022. In this task, the goal is to retrieve and
localize all instances of possible activities in egocentric videos. The Ego4D
dataset is challenging for the temporal action localization task as the
temporal duration of the videos is quite long and each video contains multiple
action instances with fine-grained action classes. To address these problems,
we utilize a multi-scale transformer to classify different action categories
and predict the boundary of each instance. Moreover, in order to better capture
the long-term temporal dependencies in the long videos, we propose a
segment-level recurrence mechanism. Compared with directly feeding all video
features to the transformer encoder, the proposed segment-level recurrence
mechanism alleviates the optimization difficulties and achieves better
performance. The final submission achieved a Recall@1, tIoU=0.5 score of 37.24 and
an average mAP of 17.67, and took 3rd place on the leaderboard.
Comment: Accepted to ECCV 2022 Ego4D Workshop; 3rd place in Ego4D Moment Query Challenge
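The Recall@1, tIoU=0.5 figure above depends on the temporal intersection-over-union between a predicted segment and a ground-truth segment. The following is a minimal sketch of that measure, not code from the report; the function names `temporal_iou` and `is_hit` and the example segments are illustrative assumptions.

```python
def temporal_iou(pred, gt):
    """Temporal IoU between two (start, end) segments, e.g. in seconds."""
    (ps, pe), (gs, ge) = pred, gt
    # Overlap is zero when the segments do not intersect.
    inter = max(0.0, min(pe, ge) - max(ps, gs))
    union = (pe - ps) + (ge - gs) - inter
    return inter / union if union > 0 else 0.0


def is_hit(pred, gt, threshold=0.5):
    """A prediction counts as correct when its tIoU reaches the threshold."""
    return temporal_iou(pred, gt) >= threshold
```

For instance, a prediction of (2, 10) against a ground truth of (0, 10) overlaps for 8 of 10 seconds and would count as a hit at tIoU=0.5, while (0, 10) against (5, 15) would not.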
Traceable and authenticated key negotiations via blockchain for vehicular communications
While key negotiation schemes, such as those based on Diffie–Hellman, have been the subject of ongoing research, designing an efficient and secure scheme remains challenging. In this paper, we propose a novel key negotiation scheme based on blockchain, which can be deployed in blockchain-enabled contexts such as data sharing or facilitating electric transactions between vehicles (e.g., unmanned vehicles). We propose three candidates for flexible selection, namely, key exchanges via transaction currency values through value channels (such as the amount in transactions), automated key exchanges through static scripts, and automated key exchanges through dynamic scripts, which can not only guarantee key availability with timeliness but also defend against MITM (man-in-the-middle) attacks, packet-dropping attacks, and decryption failure attacks.
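As background for the Diffie–Hellman baseline mentioned above, here is a minimal textbook sketch of the exchange. It does not implement the blockchain-based scheme from the paper. The helper names are hypothetical, and the tiny parameters (p = 23, g = 5) are insecure and chosen only so the arithmetic can be checked by hand.

```python
def dh_public(g, p, private):
    """Public value g^private mod p, sent to the other party."""
    return pow(g, private, p)


def dh_shared(other_public, p, private):
    """Shared secret derived from the other party's public value."""
    return pow(other_public, private, p)


# Toy parameters for illustration only; real deployments use large primes.
p, g = 23, 5
alice_private, bob_private = 4, 3

alice_public = dh_public(g, p, alice_private)
bob_public = dh_public(g, p, bob_private)

# Both sides arrive at the same secret without ever transmitting it.
alice_secret = dh_shared(bob_public, p, alice_private)
bob_secret = dh_shared(alice_public, p, bob_private)
```

Plain Diffie–Hellman does not authenticate either party, which is why man-in-the-middle attacks are a concern for schemes of this kind.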
Slimmable Networks for Contrastive Self-supervised Learning
Self-supervised learning makes great progress in large model pre-training but
suffers in training small models. Previous solutions to this problem mainly
rely on knowledge distillation and therefore involve a two-stage learning procedure:
first train a large teacher model, then distill it to improve the
generalization ability of small ones. In this work, we present a new one-stage
solution to obtain pre-trained small models without extra teachers: slimmable
networks for contrastive self-supervised learning (SlimCLR). A slimmable
network contains a full network and several weight-sharing sub-networks. We can
pre-train only once and obtain various networks, including small ones
with low computation costs. However, in self-supervised cases, the interference
between weight-sharing networks leads to severe performance degradation. One
sign of the interference is gradient imbalance: a small proportion
of parameters produces dominant gradients during backpropagation, and the main
parameters may not be fully optimized. The divergence in gradient directions of
various networks may also cause interference between networks. To overcome
these problems, we make the main parameters produce dominant gradients and
provide consistent guidance for sub-networks via three techniques: slow start
training of sub-networks, online distillation, and loss re-weighting according
to model sizes. Besides, a switchable linear probe layer is applied during
linear evaluation to avoid the interference of weight-sharing linear layers. We
instantiate SlimCLR with typical contrastive learning frameworks and achieve
better performance than prior art with fewer parameters and FLOPs.
Comment: preprint, work in progress
Test-Time Adaptation with CLIP Reward for Zero-Shot Generalization in Vision-Language Models
Misalignment between the outputs of a vision-language (VL) model and the task
goal hinders its deployment. This issue can worsen when there are distribution
shifts between the training and test data. To address this problem, prevailing
fully test-time adaptation (TTA) methods bootstrap themselves through entropy
minimization. However, minimizing the entropy of the predictions makes the
model overfit to its own incorrect output distributions. In this work, we
propose TTA with feedback to avoid such overfitting and align the model with
task goals. Specifically, we adopt CLIP as a reward model to provide feedback for
VL models during test time in various tasks, including image classification,
image-text retrieval, and image captioning. Given a single test sample, the
model aims to maximize CLIP reward through reinforcement learning. We adopt a
reward design with the average CLIP score of sampled candidates as the
baseline. This design is simple and surprisingly effective when combined with
various task-specific sampling strategies. The entire system is flexible,
allowing the reward model to be extended with multiple CLIP models. In addition, a
momentum buffer can be used to memorize and leverage the learned knowledge from
multiple test samples. Extensive experiments demonstrate that our method
significantly improves different VL models after TTA.
Comment: preprint, work in progress; project URL
https://github.com/mzhaoshuai/RLC
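The reward design described in this abstract, where the average CLIP score of the sampled candidates serves as the baseline, can be sketched as below. This is an illustrative assumption about the arithmetic only: the function name is hypothetical, the scores are made-up numbers, and the actual CLIP scoring and the reinforcement-learning update are omitted.

```python
def baseline_adjusted_rewards(clip_scores):
    """Subtract the mean score of the sampled candidates from each score.

    Candidates scoring above the average get a positive reward and
    those below get a negative one.
    """
    baseline = sum(clip_scores) / len(clip_scores)
    return [score - baseline for score in clip_scores]


# Made-up CLIP scores for three sampled candidates of one test sample.
example_rewards = baseline_adjusted_rewards([0.2, 0.3, 0.4])
```

With a mean baseline the adjusted rewards sum to zero, so only candidates that beat the average are reinforced.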
Bird's-Eye-View Scene Graph for Vision-Language Navigation
Vision-language navigation (VLN), which requires an agent to navigate 3D
environments following human instructions, has shown great advances. However,
current agents are built upon panoramic observations, which hinders their
ability to perceive 3D scene geometry and easily leads to ambiguous selection
of panoramic views. To address these limitations, we present a BEV Scene Graph
(BSG), which leverages multi-step BEV representations to encode scene layouts
and geometric cues of indoor environments under the supervision of 3D detection.
During navigation, BSG builds a local BEV representation at each step and
maintains a BEV-based global scene map, which stores and organizes all the
online collected local BEV representations according to their topological
relations. Based on BSG, the agent predicts a local BEV grid-level decision
score and a global graph-level decision score, combined with a sub-view
selection score on panoramic views, for more accurate action prediction. Our
approach significantly outperforms state-of-the-art methods on REVERIE, R2R,
and R4R, showing the potential of BEV perception in VLN.
Comment: Accepted at ICCV 2023; Project page:
https://github.com/DefaultRui/BEV-Scene-Grap
Relieving Triplet Ambiguity: Consensus Network for Language-Guided Image Retrieval
Language-guided image retrieval enables users to search for images and
interact with the retrieval system more naturally and expressively by using a
reference image and a relative caption as a query. Most existing studies mainly
focus on designing image-text composition architecture to extract
discriminative visual-linguistic relations. Despite great success, we identify
an inherent problem that obstructs the extraction of discriminative features
and considerably compromises model training: triplet ambiguity. This
problem stems from the annotation process wherein annotators view only one
triplet at a time. As a result, they often describe simple attributes, such as
color, while neglecting fine-grained details like location and style. This
leads to multiple false-negative candidates matching the same modification
text. We propose a novel Consensus Network (Css-Net) that self-adaptively
learns from noisy triplets to minimize the negative effects of triplet
ambiguity. Inspired by the psychological finding that groups perform better
than individuals, Css-Net comprises 1) a consensus module featuring four
distinct compositors that generate diverse fused image-text embeddings and 2) a
Kullback-Leibler divergence loss, which fosters learning among the compositors,
enabling them to reduce biases learned from noisy triplets and reach a
consensus. The decisions from four compositors are weighted during evaluation
to further achieve consensus. Comprehensive experiments on three datasets
demonstrate that Css-Net can alleviate triplet ambiguity, achieving competitive
performance on benchmarks, such as R@10 and R@50 on
FashionIQ.
Comment: 11 pages
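The Kullback-Leibler divergence that the abstract uses to encourage agreement among compositors can be illustrated with a small sketch. The exact loss in Css-Net is not given here, so the mean pairwise form below is an assumption, as are the function names and the example distributions.

```python
import math


def kl_divergence(p, q):
    """KL(p || q) for two discrete distributions with positive entries."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def mean_pairwise_kl(distributions):
    """Average KL over all ordered pairs of distinct distributions.

    A hypothetical consensus measure: zero when all distributions agree.
    """
    pairs = [
        (p, q)
        for i, p in enumerate(distributions)
        for j, q in enumerate(distributions)
        if i != j
    ]
    return sum(kl_divergence(p, q) for p, q in pairs) / len(pairs)
```

Minimizing a quantity of this kind pulls the compositors' output distributions toward each other, which matches the consensus idea described in the abstract.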
Action Sensitivity Learning for the Ego4D Episodic Memory Challenge 2023
This report presents the ReLER submission to two tracks in the Ego4D Episodic
Memory Benchmark in CVPR 2023, including Natural Language Queries and Moment
Queries. This solution inherits from our proposed Action Sensitivity Learning
framework (ASL) to better capture discrepant information of frames. Further, we
incorporate a series of stronger video features and fusion strategies. Our
method achieves an average mAP of 29.34, ranking 1st in the Moment Queries
Challenge, and garners 19.79 mean R1, ranking 2nd in the Natural Language Queries
Challenge. Our code will be released.
Comment: Accepted to CVPR 2023 Ego4D Workshop; 1st in Ego4D Moment Queries
Challenge; 2nd in Ego4D Natural Language Queries Challenge
Machine-Learned Invertible Coarse Graining for Multiscale Molecular Modeling
Multiscale molecular modeling is widely applied in scientific research of
molecular properties over large time and length scales. Two specific challenges
are commonly present in multiscale modeling, provided that information between
the coarse and fine representations of molecules needs to be properly
exchanged: one is to construct coarse-grained (CG) models by passing
information from the fine to coarse levels; the other is to restore finer
molecular details given CG configurations. Although these two problems are
commonly addressed independently, in this work, we present a theory connecting
them, and develop a methodology called Cycle Coarse Graining (CCG) to solve
both problems in a unified manner. In CCG, reconstruction can be achieved via a
tractable optimization process, leading to a general method to retrieve fine
details from CG simulations, which in turn, delivers a new solution to the CG
problem, yielding an efficient way to calculate free energies in a
rare-event-free manner. CCG thus provides a systematic way for multiscale
molecular modeling, where the finer details of CG simulations can be
efficiently retrieved, and the CG models can be improved consistently.
Comment: 10 pages, 5 figures, plus S