    AVSegFormer: Audio-Visual Segmentation with Transformer

    The combination of audio and vision has long been a topic of interest in the multi-modal community. Recently, a new audio-visual segmentation (AVS) task has been introduced, aiming to locate and segment the sounding objects in a given video. This task demands audio-driven pixel-level scene understanding for the first time, posing significant challenges. In this paper, we propose AVSegFormer, a novel framework for AVS tasks that leverages the transformer architecture. Specifically, we introduce audio queries and learnable queries into the transformer decoder, enabling the network to selectively attend to the visual features of interest. In addition, we present an audio-visual mixer, which can dynamically adjust visual features by amplifying relevant and suppressing irrelevant spatial channels. We also devise an intermediate mask loss to enhance the supervision of the decoder, encouraging the network to produce more accurate intermediate predictions. Extensive experiments demonstrate that AVSegFormer achieves state-of-the-art results on the AVS benchmark. The code is available at https://github.com/vvvb-github/AVSegFormer. Comment: 9 pages, 7 figures

    Champion Solution for the WSDM2023 Toloka VQA Challenge

    In this report, we present our champion solution to the WSDM2023 Toloka Visual Question Answering (VQA) Challenge. Unlike the common VQA and visual grounding (VG) tasks, this challenge involves a more complex scenario, i.e., inferring and locating the object implicitly specified by the given interrogative question. For this task, we leverage ViT-Adapter, a pre-training-free adapter network, to adapt the multi-modal pre-trained Uni-Perceiver for better cross-modal localization. Our method ranks first on the leaderboard, achieving 77.5 and 76.347 IoU on the public and private test sets, respectively. This shows that ViT-Adapter is also an effective paradigm for adapting the unified perception model to vision-language downstream tasks. Code and models will be released at https://github.com/czczup/ViT-Adapter/tree/main/wsdm2023. Comment: Technical report in WSDM Cup 2023
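
The leaderboard scores above are reported as IoU (intersection over union) between a predicted and a reference box. As a minimal sketch of that metric, assuming axis-aligned boxes in an `(x_min, y_min, x_max, y_max)` layout (the layout and the name `box_iou` are illustrative choices, not taken from the challenge code):

```python
from typing import Tuple

# Assumed box layout: (x_min, y_min, x_max, y_max)
Box = Tuple[float, float, float, float]


def box_iou(pred: Box, target: Box) -> float:
    """Return the intersection-over-union of two axis-aligned boxes."""
    # Corners of the overlapping rectangle (may be empty).
    ix_min = max(pred[0], target[0])
    iy_min = max(pred[1], target[1])
    ix_max = min(pred[2], target[2])
    iy_max = min(pred[3], target[3])

    # Clamp to zero when the boxes do not overlap.
    inter_w = max(0.0, ix_max - ix_min)
    inter_h = max(0.0, iy_max - iy_min)
    inter = inter_w * inter_h

    area_pred = (pred[2] - pred[0]) * (pred[3] - pred[1])
    area_target = (target[2] - target[0]) * (target[3] - target[1])
    union = area_pred + area_target - inter

    return inter / union if union > 0 else 0.0


if __name__ == "__main__":
    # Two 2x2 boxes sharing a 1x1 corner: intersection 1, union 7.
    print(box_iou((0, 0, 2, 2), (1, 1, 3, 3)))
```

A benchmark score would typically average this value over all test questions; how the challenge aggregates it is not stated in the abstract.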

    Consensus seeking in multi-agent systems with an active leader and communication delays

    In this paper, we consider a multi-agent consensus problem with an active leader and variable interconnection topology. The dynamics of the active leader are given in the general form of a linear system. A switching interconnection topology with communication delays among the agents is taken into consideration. A neighbor-based estimator is designed for each agent to obtain the unmeasurable state variables of the dynamic leader, and a distributed feedback control law is then developed to achieve consensus. The feedback parameters are obtained by solving a Riccati equation. By constructing a common Lyapunov function, we establish sufficient conditions guaranteeing that each agent can track the active leader, under the assumption that the interconnection topology is undirected and connected. We also point out that some results can be generalized to a class of directed interaction topologies. Moreover, input-to-state stability (ISS) is obtained for the multi-agent system with variable interconnection topology and communication delays in a disturbed environment.
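
The neighbor-based feedback idea can be illustrated with a much simpler setting than the one in the abstract. The sketch below assumes scalar first-order agents, a fixed undirected topology, a stationary leader, and no delays or estimator; the gain is hand-picked rather than derived from a Riccati equation. All names (`simulate_leader_following`, `leader_links`, and so on) are hypothetical.

```python
from typing import List, Sequence


def simulate_leader_following(
    adjacency: Sequence[Sequence[float]],
    leader_links: Sequence[float],
    x_init: Sequence[float],
    leader_state: float,
    gain: float = 1.0,
    dt: float = 0.05,
    steps: int = 2000,
) -> List[float]:
    """Euler-integrate x_i' = gain * (sum_j a_ij (x_j - x_i) + b_i (x0 - x_i))."""
    x = list(x_init)
    n = len(x)
    for _ in range(steps):
        controls = []
        for i in range(n):
            # Disagreement with neighbors in the follower graph.
            neighbor_term = sum(adjacency[i][j] * (x[j] - x[i]) for j in range(n))
            # Tracking error, nonzero only for agents that see the leader.
            leader_term = leader_links[i] * (leader_state - x[i])
            controls.append(gain * (neighbor_term + leader_term))
        # Update all agents simultaneously.
        x = [x[i] + dt * controls[i] for i in range(n)]
    return x


if __name__ == "__main__":
    # Path graph 0 - 1 - 2; only agent 0 is connected to the leader.
    adjacency = [[0, 1, 0], [1, 0, 1], [0, 1, 0]]
    leader_links = [1, 0, 0]
    final = simulate_leader_following(adjacency, leader_links, [0.0, 10.0, -3.0], 5.0)
    print(final)
```

Because the follower graph is connected and at least one agent is linked to the leader, every agent in this toy example converges to the leader's state, even though two of the three agents never observe the leader directly.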

    Beyond Myopia: Learning from Positive and Unlabeled Data through Holistic Predictive Trends

    Learning binary classifiers from positive and unlabeled data (PUL) is vital in many real-world applications, especially when verifying negative examples is difficult. Despite the impressive empirical performance of recent PUL methods, challenges like accumulated errors and increased estimation bias persist due to the absence of negative labels. In this paper, we unveil an intriguing yet long-overlooked observation in PUL: resampling the positive data in each training iteration to ensure a balanced distribution between positive and unlabeled examples results in strong early-stage performance. Furthermore, predictive trends for positive and negative classes display distinctly different patterns. Specifically, the scores (output probability) of unlabeled negative examples consistently decrease, while those of unlabeled positive examples show largely chaotic trends. Instead of focusing on classification within individual time frames, we adopt a holistic approach, interpreting the scores of each example as a temporal point process (TPP). This reformulates the core problem of PUL as recognizing trends in these scores. We then propose a novel TPP-inspired measure for trend detection and prove its asymptotic unbiasedness in predicting changes. Notably, our method accomplishes PUL without requiring additional parameter tuning or prior assumptions, offering an alternative perspective for tackling this problem. Extensive experiments verify the superiority of our method, particularly in a highly imbalanced real-world setting, where it achieves improvements of up to 11.3% in key metrics. The code is available at https://github.com/wxr99/HolisticPU. Comment: 25 pages
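
The two observations in this abstract, balanced resampling of positives and per-example score trends, can be sketched as follows. This is an illustration under assumptions, not the authors' implementation: `resample_positives` and `mean_score_change` are hypothetical names, and the average first difference used here is a plain stand-in for the TPP-inspired trend measure, which the abstract does not define.

```python
import random
from typing import List, Sequence, TypeVar

T = TypeVar("T")


def resample_positives(
    positives: Sequence[T], unlabeled_batch: Sequence[T], rng: random.Random
) -> List[T]:
    """Draw positives with replacement so the batch is balanced 1:1."""
    return [rng.choice(positives) for _ in unlabeled_batch]


def mean_score_change(scores: Sequence[float]) -> float:
    """Average per-epoch change in one example's predicted score.

    Stand-in trend statistic (not the paper's measure): a clearly
    negative value suggests a steadily decreasing score.
    """
    if len(scores) < 2:
        return 0.0
    # The mean of consecutive differences telescopes to this expression.
    return (scores[-1] - scores[0]) / (len(scores) - 1)


if __name__ == "__main__":
    rng = random.Random(0)
    positives = ["p1", "p2", "p3"]
    unlabeled_batch = ["u1", "u2", "u3", "u4", "u5", "u6"]
    batch_pos = resample_positives(positives, unlabeled_batch, rng)
    print(len(batch_pos), len(unlabeled_batch))

    # Scores recorded for one unlabeled example across training epochs.
    print(mean_score_change([0.9, 0.7, 0.5]))
```

Under the abstract's observation, an unlabeled example whose score history yields a consistently negative trend would be a candidate negative, while an erratic history points toward a hidden positive.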

    Self-compression of femtosecond pulses in normally dispersive media

    Self-compression is a simple method for obtaining ultrashort, ultraintense pulses. By solving a modified nonlinear Schrödinger equation that includes the fifth-order susceptibility, it is found that self-compression occurs even in normally dispersive media, because the negative fifth-order susceptibility induces a large negative frequency chirp. Furthermore, negatively pre-chirped pulses help to achieve pulse self-compression at lower input peak intensity, which avoids damage to the media. The optimal choice of pre-chirp, input intensity, and medium length is analyzed numerically. Proof-of-principle experiments confirm the above theoretical findings. It is expected that petawatt laser pulses with 25 fs/15 fs transform-limited pulse durations can be self-compressed to about 10.7 fs/8.8 fs in normally dispersive media such as fused silica glass plates. Comment: 24 pages, 8 figures, 1 table

    Unlocking the Power of Open Set: A New Perspective for Open-Set Noisy Label Learning

    Learning from noisy data has attracted much attention, and most methods focus on closed-set label noise. However, a more common scenario in the real world is the presence of both open-set and closed-set noise. Existing methods typically identify and handle these two types of label noise separately by designing a specific strategy for each type. In many real-world scenarios, though, it is challenging to identify open-set examples, especially when the dataset has been severely corrupted. Unlike previous works, we explore how models behave when faced with open-set examples, and find that some open-set examples gradually become integrated into certain known classes, which is beneficial for the separation among known classes. Motivated by this phenomenon, we propose a novel two-step contrastive learning method, CECL (Class Expansion Contrastive Learning), which aims to deal with both types of label noise by exploiting the useful information in open-set examples. Specifically, we incorporate some open-set examples into closed-set classes to enhance performance while treating others as delimiters to improve representation ability. Extensive experiments on synthetic and real-world datasets with diverse label noise demonstrate the effectiveness of CECL.