Audio-Visual Glance Network for Efficient Video Recognition
Deep learning has made significant strides in video understanding tasks, but
the computation required to classify lengthy and massive videos using
clip-level video classifiers remains impractical and prohibitively expensive.
To address this issue, we propose Audio-Visual Glance Network (AVGN), which
leverages the commonly available audio and visual modalities to efficiently
process the spatio-temporally important parts of a video. AVGN first divides
the video into snippets of image-audio clip pairs and employs lightweight
unimodal encoders to extract global visual features and audio features. To
identify the important temporal segments, we use an Audio-Visual Temporal
Saliency Transformer (AV-TeST) that estimates a saliency score for each
frame. To further increase efficiency in the spatial dimension, AVGN processes
only the important patches instead of the whole images. We use an
Audio-Enhanced Spatial Patch Attention (AESPA) module to produce a set of
enhanced coarse visual features, which are fed to a policy network that
produces the coordinates of the important patches. This approach enables us to
focus only on the most important spatio-temporal parts of the video, leading
to more efficient video recognition. Moreover, we incorporate various training
techniques and multi-modal feature fusion to enhance the robustness and
effectiveness of our AVGN. By combining these strategies, our AVGN sets new
state-of-the-art performance on multiple video recognition benchmarks while
achieving faster processing speed.
Comment: ICCV 2023
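The two-stage glance procedure above (temporal saliency scoring, then spatial patch selection) can be illustrated with a toy sketch. Everything below is a hypothetical stand-in: the dot-product saliency, the norm-based patch ranking, and all shapes and names are simplifications for illustration, not the paper's actual AV-TeST, AESPA, or policy networks.

```python
import numpy as np

def glance_select(visual, audio, k_frames=4, k_patches=2):
    """Toy sketch of AVGN-style selection (all scoring rules hypothetical):
    score each frame's audio-visual saliency, keep the top-k frames, then
    within each kept frame keep only the top-k patch positions."""
    # visual: (T, P, D) per-frame patch features; audio: (T, D) audio features
    frame_feat = visual.mean(axis=1)              # coarse global visual feature
    saliency = (frame_feat * audio).sum(axis=1)   # stand-in for AV-TeST scores
    top_frames = np.sort(np.argsort(saliency)[-k_frames:])
    # stand-in for AESPA + the patch policy network: rank patches by norm
    patch_scores = np.linalg.norm(visual[top_frames], axis=2)  # (k_frames, P)
    top_patches = np.argsort(patch_scores, axis=1)[:, -k_patches:]
    return top_frames, top_patches

rng = np.random.default_rng(0)
visual = rng.normal(size=(16, 9, 8))   # 16 frames, 9 patches, 8-dim features
audio = rng.normal(size=(16, 8))
frames, patches = glance_select(visual, audio)
```

The efficiency argument is visible in the shapes: only k_frames × k_patches patches would reach a heavy downstream classifier instead of all T × P.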
Towards Good Practices for Missing Modality Robust Action Recognition
Standard multi-modal models assume the use of the same modalities in training
and inference stages. However, in practice, the environment in which
multi-modal models operate may not satisfy such an assumption. As such, their
performance degrades drastically if any modality is missing in the inference
stage. We ask: how can we train a model that is robust to missing modalities?
This paper seeks a set of good practices for multi-modal action recognition,
with a particular interest in circumstances where some modalities are not
available at inference time. First, we study how to effectively regularize
the model during training (e.g., data augmentation). Second, we investigate
fusion methods for robustness to missing modalities: we find that
transformer-based fusion is more robust to missing modalities than
summation or concatenation. Third, we propose a simple modular network,
ActionMAE, which learns missing modality predictive coding by randomly dropping
modality features and tries to reconstruct them with the remaining modality
features. Coupling these good practices, we build a model that is not only
effective in multi-modal action recognition but also robust to missing
modalities. Our model achieves state-of-the-art results on multiple benchmarks
and maintains competitive performance even in missing-modality scenarios. Code
is available at https://github.com/sangminwoo/ActionMAE.
Comment: AAAI 2023
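The drop-and-reconstruct objective described above can be sketched minimally. The shapes, the zero "mask token", and the trivial mean-pooling "decoder" below are all hypothetical simplifications, shown only to make the training signal concrete; the actual ActionMAE modules differ.

```python
import numpy as np

def modality_dropout_targets(feats, rng):
    """Toy sketch of an ActionMAE-style objective (shapes hypothetical):
    for each sample, drop one modality's feature vector, replace it with a
    mask token (here just zeros), and expose the dropped vector as the
    reconstruction target for the remaining modalities."""
    B, M, D = feats.shape
    drop = rng.integers(0, M, size=B)            # which modality goes missing
    masked = feats.copy()
    masked[np.arange(B), drop] = 0.0             # stand-in for a mask token
    targets = feats[np.arange(B), drop]          # what the decoder must predict
    return masked, targets, drop

rng = np.random.default_rng(0)
feats = rng.normal(size=(4, 3, 16))              # (batch, modalities, dim)
masked, targets, drop = modality_dropout_targets(feats, rng)
recon = masked.mean(axis=1)                      # trivial stand-in "decoder"
loss = float(((recon - targets) ** 2).mean())    # MSE reconstruction loss
```

Because the model is trained to predict any randomly missing modality from the rest, dropping a modality at inference time matches a condition it has already seen during training.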
Adversarial Fine-tuning using Generated Respiratory Sound to Address Class Imbalance
Deep generative models have emerged as a promising approach in the medical
image domain to address data scarcity. However, their use for sequential data
like respiratory sounds is less explored. In this work, we propose a
straightforward approach to augment imbalanced respiratory sound data using an
audio diffusion model as a conditional neural vocoder. We also demonstrate a
simple yet effective adversarial fine-tuning method to align features between
the synthetic and real respiratory sound samples to improve respiratory sound
classification performance. Our experimental results on the ICBHI dataset
demonstrate that the proposed adversarial fine-tuning is effective, whereas
using only the conventional augmentation method degrades performance.
Moreover, our method outperforms the baseline by 2.24% on the ICBHI Score and
improves the accuracy of the minority classes up to 26.58%. For the
supplementary material, we provide the code at
https://github.com/kaen2891/adversarial_fine-tuning_using_generated_respiratory_sound.
Comment: accepted at the NeurIPS 2023 Workshop on Deep Generative Models for
Health (DGM4H)
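The adversarial alignment idea above can be sketched with a linear domain discriminator. This is a hypothetical illustration only: the linear discriminator, feature dimensions, and loss bookkeeping are stand-ins for the paper's actual fine-tuning setup, but the sign structure (discriminator minimizes, feature extractor maximizes the same loss) is the core of the technique.

```python
import numpy as np

def adversarial_losses(real_feats, synth_feats, w):
    """Toy sketch of adversarial feature alignment (all names hypothetical):
    a linear domain discriminator tries to tell real from diffusion-generated
    features; the feature extractor would be trained on the negated loss so
    the two feature distributions become indistinguishable."""
    feats = np.vstack([real_feats, synth_feats])
    labels = np.concatenate([np.ones(len(real_feats)),
                             np.zeros(len(synth_feats))])
    logits = feats @ w
    probs = 1.0 / (1.0 + np.exp(-logits))        # sigmoid domain probability
    eps = 1e-7                                   # numerical safety for log
    disc_loss = -np.mean(labels * np.log(probs + eps)
                         + (1 - labels) * np.log(1 - probs + eps))
    gen_loss = -disc_loss                        # extractor maximizes confusion
    return disc_loss, gen_loss

rng = np.random.default_rng(0)
real = rng.normal(size=(8, 5))                   # features of real recordings
synth = rng.normal(loc=0.5, size=(8, 5))         # features of generated audio
w = rng.normal(size=5)                           # discriminator weights
d_loss, g_loss = adversarial_losses(real, synth, w)
```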
Sketch-based Video Object Localization
We introduce Sketch-based Video Object Localization (SVOL), a new task aimed
at localizing spatio-temporal object boxes in a video queried by an input
sketch. We first outline the challenges in the SVOL task and build the
Sketch-Video Attention Network (SVANet) with the following design principles:
(i) to consider temporal information of video and bridge the domain gap between
sketch and video; (ii) to accurately identify and localize multiple objects
simultaneously; (iii) to handle various styles of sketches; (iv) to be
classification-free. In particular, SVANet is equipped with a Cross-modal
Transformer that models the interaction between learnable object tokens, query
sketch, and video through attention operations, and is trained with a per-frame
set matching strategy that enables frame-wise prediction while utilizing global
video context. We evaluate SVANet on a newly curated SVOL dataset. By design,
SVANet successfully learns the mapping between the query sketches and video
objects, achieving state-of-the-art results on the SVOL benchmark. We further
confirm the effectiveness of SVANet via extensive ablation studies and
visualizations. Lastly, we demonstrate its transfer capability on unseen
datasets and novel categories, suggesting its high scalability in real-world
applications.
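The per-frame set matching mentioned above can be illustrated on a single frame. The brute-force assignment, L1 box cost, and all box values below are hypothetical simplifications (a real implementation would use Hungarian matching and a richer matching cost); the sketch only shows how ground-truth boxes get paired with predicted object tokens.

```python
import numpy as np
from itertools import permutations

def per_frame_set_match(pred_boxes, gt_boxes):
    """Toy sketch of per-frame set matching (cost function hypothetical):
    assign each ground-truth box to a distinct predicted box so the total
    L1 cost is minimal, brute-forcing assignments for small set sizes."""
    n_pred, n_gt = len(pred_boxes), len(gt_boxes)
    best_cost, best_assign = np.inf, None
    for perm in permutations(range(n_pred), n_gt):
        cost = sum(np.abs(pred_boxes[p] - gt_boxes[g]).sum()
                   for g, p in enumerate(perm))
        if cost < best_cost:
            best_cost, best_assign = cost, perm
    return best_assign, best_cost

# Boxes as (x1, y1, x2, y2) in normalized coordinates (values illustrative).
pred = np.array([[0.1, 0.1, 0.3, 0.3],
                 [0.6, 0.6, 0.9, 0.9],
                 [0.4, 0.2, 0.5, 0.5]])
gt = np.array([[0.62, 0.58, 0.88, 0.91],
               [0.12, 0.08, 0.31, 0.33]])
assign, cost = per_frame_set_match(pred, gt)   # gt[i] matched to pred[assign[i]]
```

Matching independently per frame keeps supervision frame-wise while the transformer's attention still injects global video context into each prediction.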
Patch-Mix Contrastive Learning with Audio Spectrogram Transformer on Respiratory Sound Classification
Respiratory sound contains crucial information for the early diagnosis of
fatal lung diseases. Since the COVID-19 pandemic, there has been a growing
interest in contact-free medical care based on electronic stethoscopes. To this
end, cutting-edge deep learning models have been developed to diagnose lung
diseases; however, it is still challenging due to the scarcity of medical data.
In this study, we demonstrate that models pretrained on large-scale visual
and audio datasets can generalize to the respiratory sound classification
task. In addition, we introduce a straightforward Patch-Mix augmentation, which
randomly mixes patches between different samples, with Audio Spectrogram
Transformer (AST). We further propose a novel and effective Patch-Mix
Contrastive Learning to distinguish the mixed representations in the latent
space. Our method achieves state-of-the-art performance on the ICBHI dataset,
outperforming the prior leading score by 4.08%.
Comment: INTERSPEECH 2023, Code URL:
https://github.com/raymin0223/patch-mix_contrastive_learning
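The Patch-Mix augmentation described above (randomly mixing patches between samples) can be sketched on raw spectrogram arrays. The patch size, mix ratio, and returned mixing coefficient are hypothetical illustration choices, not the paper's settings; in the actual method the mixing happens on AST patch embeddings and the coefficient weights the contrastive/label target.

```python
import numpy as np

def patch_mix(spec_a, spec_b, patch=4, ratio=0.5, rng=None):
    """Toy sketch of Patch-Mix (parameters hypothetical): replace a random
    subset of spec_a's non-overlapping patches with the corresponding
    patches from spec_b; lam is the fraction taken from spec_b."""
    if rng is None:
        rng = np.random.default_rng(0)
    H, W = spec_a.shape
    rows, cols = H // patch, W // patch            # patch grid dimensions
    n_patches = rows * cols
    n_mix = int(round(ratio * n_patches))
    chosen = rng.choice(n_patches, size=n_mix, replace=False)
    mixed = spec_a.copy()
    for idx in chosen:
        r, c = divmod(idx, cols)                   # grid cell -> pixel offset
        rs, cs = r * patch, c * patch
        mixed[rs:rs + patch, cs:cs + patch] = spec_b[rs:rs + patch, cs:cs + patch]
    return mixed, n_mix / n_patches

spec_a = np.zeros((8, 8))                          # stand-in log-mel spectrograms
spec_b = np.ones((8, 8))
mixed, lam = patch_mix(spec_a, spec_b)
```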
Cell Labeling and Tracking Method without Distorted Signals by Phagocytosis of Macrophages
Cell labeling and tracking are important processes in understanding biological mechanisms and the therapeutic effect of inoculated cells in vivo. Numerous attempts have been made to label and track inoculated cells in vivo; however, these methods have limitations as a result of their biological effects, including secondary phagocytosis of macrophages and genetic modification. Here, we investigated a new cell labeling and tracking strategy based on metabolic glycoengineering and bioorthogonal click chemistry. We first treated cells with tetra-acetylated N-azidoacetyl-D-mannosamine to generate unnatural sialic acids with azide groups on the surface of the target cells. The azide-labeled cells were then transplanted into mouse liver, and dibenzyl cyclooctyne-conjugated Cy5 (DBCO-Cy5) was intravenously injected into mice to chemically bind with the azide groups on the surface of the target cells in vivo for target cell visualization. Unnatural sialic acids with azide groups could be artificially induced on the surface of target cells by glycoengineering. We then tracked the azide groups on the surface of the cells with DBCO-Cy5 in vivo using bioorthogonal click chemistry. Importantly, labeling efficacy was enhanced and false signals from phagocytosis by macrophages were reduced. This strategy will be highly useful for cell labeling and tracking.
SALM5 trans-synaptically interacts with LAR-RPTPs in a splicing-dependent manner to regulate synapse development
Synaptogenic adhesion molecules play critical roles in synapse formation. SALM5/Lrfn5, a SALM/Lrfn family adhesion molecule implicated in autism spectrum disorders (ASDs) and schizophrenia, induces presynaptic differentiation in contacting axons, but its presynaptic ligand remains unknown. We found that SALM5 interacts with the Ig domains of LAR family receptor protein tyrosine phosphatases (LAR-RPTPs; LAR, PTPδ, and PTPσ). These interactions are strongly inhibited by the splice insert B in the Ig domain region of LAR-RPTPs, and mediate SALM5-dependent presynaptic differentiation in contacting axons. In addition, SALM5 regulates AMPA receptor-mediated synaptic transmission through mechanisms involving the interaction of postsynaptic SALM5 with presynaptic LAR-RPTPs. These results suggest that postsynaptic SALM5 promotes synapse development by trans-synaptically interacting with presynaptic LAR-RPTPs and is important for the regulation of excitatory synaptic strength.
Protective Effects of Gabapentin on Allodynia and α2δ1-Subunit of Voltage-dependent Calcium Channel in Spinal Nerve-Ligated Rats
This study was designed to determine whether early gabapentin treatment has a protective analgesic effect on neuropathic pain, compared with late treatment, in a rat neuropathic pain model; as the potential mechanism of protective action, the α2δ1-subunit of the voltage-dependent calcium channel (α2δ1-subunit) was evaluated in both sides of the L5 dorsal root ganglia (DRG). Neuropathic pain was induced in male Sprague-Dawley rats by surgical ligation of the left L5 nerve. For the early treatment group, rats were injected with gabapentin (100 mg/kg) intraperitoneally 15 min prior to surgery and then every 24 hr during postoperative day (POD) 1-4. For the late treatment group, the same dose of gabapentin was injected every 24 hr during POD 8-12. For the control group, the L5 nerve was ligated but no gabapentin was administered. In the early treatment group, the development of allodynia was delayed up to POD 10, whereas allodynia developed on POD 2 in the control and late treatment groups (p<0.05). The α2δ1-subunit was up-regulated in all groups; however, there was no difference in the level of the α2δ1-subunit among the three groups. These results suggest that early treatment with gabapentin offers some protection against neuropathic pain, but it is unlikely that this action is mediated through modulation of the α2δ1-subunit in the DRG.