AVSegFormer: Audio-Visual Segmentation with Transformer
The combination of audio and vision has long been a topic of interest in the
multi-modal community. Recently, a new audio-visual segmentation (AVS) task has
been introduced, aiming to locate and segment the sounding objects in a given
video. This task demands audio-driven pixel-level scene understanding for the
first time, posing significant challenges. In this paper, we propose
AVSegFormer, a novel framework for AVS tasks that leverages the transformer
architecture. Specifically, we introduce audio queries and learnable queries
into the transformer decoder, enabling the network to selectively attend to
visual features of interest. In addition, we present an audio-visual mixer, which
can dynamically adjust visual features by amplifying relevant and suppressing
irrelevant spatial channels. Additionally, we devise an intermediate mask loss
to enhance the supervision of the decoder, encouraging the network to produce
more accurate intermediate predictions. Extensive experiments demonstrate that
AVSegFormer achieves state-of-the-art results on the AVS benchmark. The code is
available at https://github.com/vvvb-github/AVSegFormer.
Comment: 9 pages, 7 figures
Champion Solution for the WSDM2023 Toloka VQA Challenge
In this report, we present our champion solution to the WSDM2023 Toloka
Visual Question Answering (VQA) Challenge. Unlike common VQA and
visual grounding (VG) tasks, this challenge involves a more complex scenario,
i.e., inferring and locating the object implicitly specified by the given
interrogative question. For this task, we leverage ViT-Adapter, a
pre-training-free adapter network, to adapt multi-modal pre-trained
Uni-Perceiver for better cross-modal localization. Our method ranks first on
the leaderboard, achieving 77.5 and 76.347 IoU on public and private test sets,
respectively. It shows that ViT-Adapter is also an effective paradigm for
adapting the unified perception model to vision-language downstream tasks. Code
and models will be released at
https://github.com/czczup/ViT-Adapter/tree/main/wsdm2023.
Comment: Technical report in WSDM Cup 2023
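The leaderboard metric quoted above is IoU (intersection over union) between a predicted and a ground-truth box. As a minimal sketch of that metric, assuming axis-aligned boxes given as `(x1, y1, x2, y2)` corner coordinates (the box format and the function name `box_iou` are illustrative assumptions, not taken from the report):

```python
def box_iou(a, b):
    """IoU of two axis-aligned boxes given as (x1, y1, x2, y2).

    The corner-coordinate format is an assumption made for this sketch.
    """
    ax1, ay1, ax2, ay2 = a
    bx1, by1, bx2, by2 = b

    # Width and height of the overlap rectangle (zero if the boxes are disjoint).
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h

    # Union = sum of the two areas minus the overlap counted twice.
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    return inter / union if union > 0 else 0.0


if __name__ == "__main__":
    print(box_iou((0, 0, 2, 2), (1, 1, 3, 3)))
```

This sketch returns a value in [0, 1]; the scores in the abstract appear to be reported on a 0 to 100 scale.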
Consensus seeking in multi-agent systems with an active leader and communication delays
In this paper, we consider a multi-agent consensus problem with an active leader and variable interconnection topology. The dynamics of the active leader are given in the general form of a linear system. A switching interconnection topology with communication delays among the agents is taken into consideration. A neighbor-based estimator is designed for each agent to obtain the unmeasurable state variables of the dynamic leader, and a distributed feedback control law is then developed to achieve consensus. The feedback parameters are obtained by solving a Riccati equation. By constructing a common Lyapunov function, sufficient conditions are established to guarantee that each agent can track the active leader, under the assumption that the interconnection topology is undirected and connected. We also point out that some results can be generalized to a class of directed interaction topologies. Moreover, input-to-state stability (ISS) is obtained for the multi-agent system with variable interconnection topology and communication delays in a disturbed environment.
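To illustrate the basic idea of neighbor-based leader following, here is a heavily simplified sketch. It is not the paper's control law: it assumes scalar first-order agents, a static leader, a fixed undirected connected graph, and no communication delay, and it omits the estimator and the Riccati-based gains. The names `simulate`, `neighbors`, and `pinned` are hypothetical.

```python
def simulate(x_init, neighbors, pinned, leader, dt=0.05, steps=2000):
    """Euler simulation of x_i' = -sum_j (x_i - x_j) - b_i (x_i - leader).

    neighbors: dict mapping agent index -> list of neighbor indices
    pinned:    set of agents that can sense the leader (b_i = 1, else 0)
    All simplifications (static leader, no delay, fixed graph) are
    assumptions of this sketch, not of the paper.
    """
    x = list(x_init)
    for _ in range(steps):
        new_x = []
        for i, xi in enumerate(x):
            # Disagreement with neighbors pulls the agent toward them.
            u = -sum(xi - x[j] for j in neighbors[i])
            # Agents connected to the leader are also pulled toward it.
            if i in pinned:
                u -= xi - leader
            new_x.append(xi + dt * u)
        x = new_x
    return x


if __name__ == "__main__":
    # Path graph 0 - 1 - 2; only agent 0 senses the leader.
    graph = {0: [1], 1: [0, 2], 2: [1]}
    print(simulate([0.0, 10.0, -4.0], graph, {0}, leader=5.0))
```

With a connected undirected graph and at least one agent sensing the leader, all states in this toy model approach the leader's value.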
Beyond Myopia: Learning from Positive and Unlabeled Data through Holistic Predictive Trends
Learning binary classifiers from positive and unlabeled data (PUL) is vital
in many real-world applications, especially when verifying negative examples is
difficult. Despite the impressive empirical performance of recent PUL methods,
challenges like accumulated errors and increased estimation bias persist due to
the absence of negative labels. In this paper, we unveil an intriguing yet
long-overlooked observation in PUL: resampling the positive data in
each training iteration to ensure a balanced distribution between positive and
unlabeled examples results in strong early-stage performance. Furthermore,
predictive trends for positive and negative classes display distinctly
different patterns. Specifically, the scores (output probability) of unlabeled
negative examples consistently decrease, while those of unlabeled positive
examples show largely chaotic trends. Instead of focusing on classification
within individual time frames, we innovatively adopt a holistic approach,
interpreting the scores of each example as a temporal point process (TPP). This
reformulates the core problem of PUL as recognizing trends in these scores. We
then propose a novel TPP-inspired measure for trend detection and prove its
asymptotic unbiasedness in predicting changes. Notably, our method accomplishes
PUL without requiring additional parameter tuning or prior assumptions,
offering an alternative perspective for tackling this problem. Extensive
experiments verify the superiority of our method, particularly in a highly
imbalanced real-world setting, where it achieves improvements of up to
in key metrics. The code is available at
https://github.com/wxr99/HolisticPU.
Comment: 25 pages
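The two ingredients described in this abstract, balanced resampling of positives and reading each example's score history as a trend, can be sketched as follows. This is an illustration under stated assumptions, not the authors' implementation: `balanced_batch` and `trend_score` are hypothetical names, and the mean-sign statistic is a simple stand-in for the paper's TPP-inspired measure, which the abstract does not specify.

```python
import random


def balanced_batch(positives, unlabeled, batch_size, rng):
    """Draw a batch with equal numbers of positive and unlabeled examples.

    Positives are sampled with replacement, since they are usually the
    smaller set; unlabeled examples are sampled without replacement.
    """
    half = batch_size // 2
    pos = [rng.choice(positives) for _ in range(half)]
    unl = rng.sample(unlabeled, half)
    return pos, unl


def trend_score(scores):
    """Mean sign of consecutive score differences over training.

    -1.0 means the score decreased at every step, +1.0 that it always
    increased. A stand-in statistic, not the measure from the paper.
    """
    diffs = [b - a for a, b in zip(scores, scores[1:])]
    if not diffs:
        return 0.0
    return sum((d > 0) - (d < 0) for d in diffs) / len(diffs)


if __name__ == "__main__":
    rng = random.Random(0)
    pos, unl = balanced_batch([1, 2, 3], list(range(100)), 8, rng)
    print(len(pos), len(unl))
    print(trend_score([0.9, 0.7, 0.5, 0.2]))
```

Under the abstract's observation, an unlabeled example whose score decreases consistently (a trend score near -1 here) would be a candidate negative, while chaotic trends would point to a positive.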
Self-compression of femtosecond pulses in normally dispersive media
Self-compression is a simple method to achieve ultrashort and ultraintense
pulses. By solving a modified nonlinear Schrödinger equation that includes the
fifth-order susceptibility, it is found that self-compression appears even in
normally dispersive media, because the negative fifth-order susceptibility
induces a large negative frequency chirp. Furthermore, negatively
pre-chirped pulses help to achieve pulse self-compression at lower input peak
intensity, which avoids damage to the medium. The optimal choice of
pre-chirp, input intensity, and medium length is analyzed numerically.
Proof-of-principle experiments confirm the above theoretical
findings. It is expected that petawatt laser pulses with 25 fs/15 fs
transform-limited pulse duration can be self-compressed to about 10.7 fs/8.8 fs in
normally dispersive media such as fused silica glass plates.
Comment: 24 pages, 8 figures, 1 table
Unlocking the Power of Open Set: A New Perspective for Open-Set Noisy Label Learning
Learning from noisy data has attracted much attention, where most methods
focus on closed-set label noise. However, a more common scenario in the real
world is the presence of both open-set and closed-set noise. Existing methods
typically identify and handle these two types of label noise separately by
designing a specific strategy for each type. However, in many real-world
scenarios, it would be challenging to identify open-set examples, especially
when the dataset has been severely corrupted. Unlike previous work, we
explore how models behave when faced with open-set examples, and find that
some open-set examples gradually get integrated into certain known
classes, which is beneficial for the separation among known classes. Motivated
by this phenomenon, we propose a novel two-step contrastive learning method, CECL
(Class Expansion Contrastive Learning), which aims to deal with both types of
label noise by exploiting the useful information of open-set examples.
Specifically, we incorporate some open-set examples into closed-set classes to
enhance performance while treating others as delimiters to improve
representative ability. Extensive experiments on synthetic and real-world
datasets with diverse label noise demonstrate the effectiveness of CECL
- …