
    Audio-Visual Glance Network for Efficient Video Recognition

    Deep learning has made significant strides in video understanding tasks, but the computation required to classify lengthy and massive videos using clip-level video classifiers remains impractical and prohibitively expensive. To address this issue, we propose the Audio-Visual Glance Network (AVGN), which leverages the commonly available audio and visual modalities to efficiently process the spatio-temporally important parts of a video. AVGN first divides the video into snippets of image-audio clip pairs and employs lightweight unimodal encoders to extract global visual and audio features. To identify the important temporal segments, we use an Audio-Visual Temporal Saliency Transformer (AV-TeST) that estimates the saliency score of each frame. To further increase efficiency in the spatial dimension, AVGN processes only the important patches instead of the whole images. We use an Audio-Enhanced Spatial Patch Attention (AESPA) module to produce a set of enhanced coarse visual features, which are fed to a policy network that produces the coordinates of the important patches. This approach enables us to focus only on the most important spatio-temporal parts of the video, leading to more efficient video recognition. Moreover, we incorporate various training techniques and multi-modal feature fusion to enhance the robustness and effectiveness of AVGN. By combining these strategies, AVGN sets a new state of the art on multiple video recognition benchmarks while achieving faster processing speed. Comment: ICCV 2023
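
    The temporal-saliency step described above can be illustrated with a small PyTorch sketch: per-snippet visual and audio features are fused, passed through a lightweight transformer encoder standing in for AV-TeST, and the top-k highest-scoring snippets are kept. The class name, feature dimension, additive fusion, and top-k value are illustrative assumptions, not the paper's implementation.

```python
import torch
import torch.nn as nn

class TemporalSaliencySelector(nn.Module):
    """Toy stand-in for the AV-TeST step: score each image-audio snippet, keep the top-k."""
    def __init__(self, dim=256, num_heads=4, top_k=8):
        super().__init__()
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=num_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)
        self.scorer = nn.Linear(dim, 1)
        self.top_k = top_k

    def forward(self, visual_feats, audio_feats):
        # visual_feats, audio_feats: (B, T, dim) global per-snippet features
        fused = visual_feats + audio_feats             # simple additive fusion (assumption)
        ctx = self.encoder(fused)                      # temporal context across snippets
        scores = self.scorer(ctx).squeeze(-1)          # (B, T) saliency score per snippet
        keep = scores.topk(self.top_k, dim=1).indices  # indices of the most salient snippets
        return scores, keep

selector = TemporalSaliencySelector()
v, a = torch.randn(2, 16, 256), torch.randn(2, 16, 256)
scores, keep = selector(v, a)
print(scores.shape, keep.shape)  # torch.Size([2, 16]) torch.Size([2, 8])
```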

    Towards Good Practices for Missing Modality Robust Action Recognition

    Standard multi-modal models assume that the same modalities are available at training and inference time. In practice, however, the environment in which multi-modal models operate may not satisfy this assumption, and their performance degrades drastically if any modality is missing at inference time. We ask: how can we train a model that is robust to missing modalities? This paper seeks a set of good practices for multi-modal action recognition, with a particular interest in circumstances where some modalities are not available at inference time. First, we study how to effectively regularize the model during training (e.g., with data augmentation). Second, we investigate fusion methods for robustness to missing modalities: we find that transformer-based fusion is more robust to missing modalities than summation or concatenation. Third, we propose a simple modular network, ActionMAE, which learns missing-modality predictive coding by randomly dropping modality features and reconstructing them from the remaining modality features. Coupling these good practices, we build a model that is not only effective for multi-modal action recognition but also robust to missing modalities. Our model achieves state-of-the-art results on multiple benchmarks and maintains competitive performance even in missing-modality scenarios. Code is available at https://github.com/sangminwoo/ActionMAE. Comment: AAAI 2023
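
    A minimal sketch of the ActionMAE idea, randomly dropping one modality's feature and reconstructing it from the remaining ones, might look as follows. The module name, feature dimension, mean-pooling fusion, and decoder are assumptions for illustration only, not the released code.

```python
import torch
import torch.nn as nn

class ModalityMAE(nn.Module):
    """Minimal sketch: drop one modality's feature at random and reconstruct it
    from the remaining modality features (missing-modality predictive coding)."""
    def __init__(self, dim=512):
        super().__init__()
        self.mask_token = nn.Parameter(torch.zeros(1, dim))
        self.decoder = nn.Sequential(nn.Linear(dim, dim), nn.GELU(), nn.Linear(dim, dim))

    def forward(self, feats):
        # feats: dict of modality name -> (B, dim) feature vectors
        names = list(feats)
        dropped = names[torch.randint(len(names), (1,)).item()]  # random modality to drop
        target = feats[dropped]
        kept = [f for n, f in feats.items() if n != dropped]
        pooled = torch.stack(kept, dim=0).mean(0)                # fuse the remaining modalities
        recon = self.decoder(pooled + self.mask_token)           # predict the missing feature
        loss = nn.functional.mse_loss(recon, target)
        return loss, dropped

mae = ModalityMAE()
feats = {"rgb": torch.randn(4, 512), "depth": torch.randn(4, 512), "ir": torch.randn(4, 512)}
loss, dropped = mae(feats)
print(dropped, loss.item())
```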

    Adversarial Fine-tuning using Generated Respiratory Sound to Address Class Imbalance

    Deep generative models have emerged as a promising approach in the medical image domain to address data scarcity. However, their use for sequential data such as respiratory sounds is less explored. In this work, we propose a straightforward approach to augment imbalanced respiratory sound data using an audio diffusion model as a conditional neural vocoder. We also demonstrate a simple yet effective adversarial fine-tuning method that aligns features between synthetic and real respiratory sound samples to improve respiratory sound classification performance. Our experimental results on the ICBHI dataset demonstrate that the proposed adversarial fine-tuning is effective, whereas using only the conventional augmentation method degrades performance. Moreover, our method outperforms the baseline by 2.24% on the ICBHI Score and improves the accuracy of the minority classes by up to 26.58%. For the supplementary material, we provide the code at https://github.com/kaen2891/adversarial_fine-tuning_using_generated_respiratory_sound. Comment: accepted at the NeurIPS 2023 Workshop on Deep Generative Models for Health (DGM4H)
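
    The adversarial fine-tuning described above can be sketched as a two-step update: a discriminator learns to separate features of real recordings from diffusion-generated ones, while the feature extractor and classifier are trained to classify both and to make synthetic features indistinguishable from real ones. The toy networks, dimensions, and loss weighting below are assumptions for illustration, not the released implementation.

```python
import torch
import torch.nn as nn

# Hypothetical stand-ins; the real pipeline uses an audio diffusion model as a
# conditional vocoder to synthesize minority-class respiratory sounds.
feature_extractor = nn.Sequential(nn.Linear(400, 128), nn.ReLU())
classifier = nn.Sequential(nn.Linear(128, 64), nn.ReLU(), nn.Linear(64, 4))   # 4 ICBHI classes
discriminator = nn.Sequential(nn.Linear(128, 64), nn.ReLU(), nn.Linear(64, 1))

opt_f = torch.optim.Adam(list(feature_extractor.parameters()) + list(classifier.parameters()), lr=1e-4)
opt_d = torch.optim.Adam(discriminator.parameters(), lr=1e-4)
bce, ce = nn.BCEWithLogitsLoss(), nn.CrossEntropyLoss()

def train_step(real_x, real_y, synth_x, synth_y, lam=0.1):
    # 1) train the discriminator to tell real features from synthetic ones
    with torch.no_grad():
        fr, fs = feature_extractor(real_x), feature_extractor(synth_x)
    d_loss = bce(discriminator(fr), torch.ones(len(fr), 1)) + \
             bce(discriminator(fs), torch.zeros(len(fs), 1))
    opt_d.zero_grad(); d_loss.backward(); opt_d.step()

    # 2) train classifier + extractor; push synthetic features to look "real"
    fr, fs = feature_extractor(real_x), feature_extractor(synth_x)
    cls_loss = ce(classifier(fr), real_y) + ce(classifier(fs), synth_y)
    adv_loss = bce(discriminator(fs), torch.ones(len(fs), 1))   # fool the discriminator
    loss = cls_loss + lam * adv_loss
    opt_f.zero_grad(); loss.backward(); opt_f.step()
    return d_loss.item(), loss.item()

print(train_step(torch.randn(8, 400), torch.randint(4, (8,)),
                 torch.randn(8, 400), torch.randint(4, (8,))))
```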

    Sketch-based Video Object Localization

    We introduce Sketch-based Video Object Localization (SVOL), a new task that aims to localize spatio-temporal object boxes in a video queried by an input sketch. We first outline the challenges of the SVOL task and build the Sketch-Video Attention Network (SVANet) with the following design principles: (i) to consider the temporal information of video and bridge the domain gap between sketch and video; (ii) to accurately identify and localize multiple objects simultaneously; (iii) to handle various styles of sketches; (iv) to be classification-free. In particular, SVANet is equipped with a Cross-modal Transformer that models the interaction between learnable object tokens, the query sketch, and the video through attention operations, and it is trained with a per-frame set matching strategy that enables frame-wise prediction while utilizing global video context. We evaluate SVANet on a newly curated SVOL dataset. By design, SVANet successfully learns the mapping between query sketches and video objects, achieving state-of-the-art results on the SVOL benchmark. We further confirm the effectiveness of SVANet through extensive ablation studies and visualizations. Lastly, we demonstrate its transfer capability on unseen datasets and novel categories, suggesting high scalability to real-world applications.
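
    A rough, DETR-style sketch of the cross-modal decoding idea: learnable object tokens attend over the sketch feature concatenated with per-frame video features and are decoded into boxes and objectness scores (classification-free). The real SVANet makes frame-wise predictions with per-frame set matching; the architecture, names, and dimensions below are illustrative assumptions.

```python
import torch
import torch.nn as nn

class CrossModalLocalizer(nn.Module):
    """Illustrative sketch of learnable object tokens attending to sketch + video features."""
    def __init__(self, dim=256, num_tokens=10, num_heads=8):
        super().__init__()
        self.object_tokens = nn.Parameter(torch.randn(num_tokens, dim))
        layer = nn.TransformerDecoderLayer(d_model=dim, nhead=num_heads, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, num_layers=2)
        self.box_head = nn.Linear(dim, 4)     # (cx, cy, w, h) in normalized coordinates
        self.score_head = nn.Linear(dim, 1)   # objectness only, since the task is classification-free

    def forward(self, sketch_feat, video_feats):
        # sketch_feat: (B, dim) global sketch feature; video_feats: (B, T, dim) per-frame features
        B = video_feats.size(0)
        memory = torch.cat([sketch_feat.unsqueeze(1), video_feats], dim=1)  # (B, 1+T, dim)
        queries = self.object_tokens.unsqueeze(0).expand(B, -1, -1)
        out = self.decoder(queries, memory)                                 # (B, num_tokens, dim)
        return self.box_head(out).sigmoid(), self.score_head(out)

loc = CrossModalLocalizer()
boxes, scores = loc(torch.randn(2, 256), torch.randn(2, 12, 256))
print(boxes.shape, scores.shape)  # torch.Size([2, 10, 4]) torch.Size([2, 10, 1])
```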

    Patch-Mix Contrastive Learning with Audio Spectrogram Transformer on Respiratory Sound Classification

    Respiratory sounds contain crucial information for the early diagnosis of fatal lung diseases. Since the COVID-19 pandemic, there has been growing interest in contact-free medical care based on electronic stethoscopes. To this end, cutting-edge deep learning models have been developed to diagnose lung diseases; however, the task remains challenging due to the scarcity of medical data. In this study, we demonstrate that models pretrained on large-scale visual and audio datasets can be generalized to the respiratory sound classification task. In addition, we introduce a straightforward Patch-Mix augmentation, which randomly mixes patches between different samples, with the Audio Spectrogram Transformer (AST). We further propose a novel and effective Patch-Mix Contrastive Learning objective to distinguish the mixed representations in the latent space. Our method achieves state-of-the-art performance on the ICBHI dataset, outperforming the prior leading score by 4.08%. Comment: INTERSPEECH 2023, Code URL: https://github.com/raymin0223/patch-mix_contrastive_learning
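
    The Patch-Mix augmentation can be sketched as swapping a random subset of patch embeddings between samples in a batch, keeping both labels and the mixing ratio for the loss. The function below is an illustrative assumption about how such mixing could be applied to AST patch tokens, not the released code.

```python
import torch

def patch_mix(patch_tokens, labels, mix_ratio=0.5):
    """Swap a random fraction of patch tokens between each sample and a random partner."""
    B, N, D = patch_tokens.shape
    perm = torch.randperm(B)                            # partner sample for each item
    num_mix = int(N * mix_ratio)
    idx = torch.rand(B, N).argsort(dim=1)[:, :num_mix]  # random patch positions per sample
    idx = idx.unsqueeze(-1).expand(-1, -1, D)
    mixed = patch_tokens.clone()
    mixed.scatter_(1, idx, torch.gather(patch_tokens[perm], 1, idx))  # copy partner's patches in
    lam = 1.0 - mix_ratio                                # fraction of original patches kept
    return mixed, labels, labels[perm], lam

tokens = torch.randn(4, 196, 768)   # e.g. spectrogram patch embeddings of an AST
labels = torch.randint(0, 4, (4,))
mixed, y_a, y_b, lam = patch_mix(tokens, labels)
print(mixed.shape, lam)             # torch.Size([4, 196, 768]) 0.5
```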

    SALM5 trans-synaptically interacts with LAR-RPTPs in a splicing-dependent manner to regulate synapse development

    Synaptogenic adhesion molecules play critical roles in synapse formation. SALM5/Lrfn5, a SALM/Lrfn family adhesion molecule implicated in autism spectrum disorders (ASDs) and schizophrenia, induces presynaptic differentiation in contacting axons, but its presynaptic ligand remains unknown. We found that SALM5 interacts with the Ig domains of LAR family receptor protein tyrosine phosphatases (LAR-RPTPs; LAR, PTPδ, and PTPσ). These interactions are strongly inhibited by the splice insert B in the Ig domain region of LAR-RPTPs, and mediate SALM5-dependent presynaptic differentiation in contacting axons. In addition, SALM5 regulates AMPA receptor-mediated synaptic transmission through mechanisms involving the interaction of postsynaptic SALM5 with presynaptic LAR-RPTPs. These results suggest that postsynaptic SALM5 promotes synapse development by trans-synaptically interacting with presynaptic LAR-RPTPs and is important for the regulation of excitatory synaptic strength.

    Protective Effects of Gabapentin on Allodynia and α2δ1-Subunit of Voltage-dependent Calcium Channel in Spinal Nerve-Ligated Rats

    This study was designed to determine whether early gabapentin treatment has a protective analgesic effect on neuropathic pain, comparing it with late treatment in a rat neuropathic model; as the potential mechanism of this protective action, the α2δ1-subunit of the voltage-dependent calcium channel (α2δ1-subunit) was evaluated in both sides of the L5 dorsal root ganglia (DRG). Neuropathic pain was induced in male Sprague-Dawley rats by surgical ligation of the left L5 nerve. For the early treatment group, rats were injected with gabapentin (100 mg/kg) intraperitoneally 15 min prior to surgery and then every 24 hr during postoperative days (POD) 1-4. For the late treatment group, the same dose of gabapentin was injected every 24 hr during POD 8-12. For the control group, the L5 nerve was ligated but no gabapentin was administered. In the early treatment group, the development of allodynia was delayed up to POD 10, whereas allodynia developed by POD 2 in the control and late treatment groups (p<0.05). The α2δ1-subunit was up-regulated in all groups; however, there was no difference in the level of the α2δ1-subunit among the three groups. These results suggest that early treatment with gabapentin offers some protection against neuropathic pain, but it is unlikely that this action is mediated through modulation of the α2δ1-subunit in the DRG.