Modality-Aware Contrastive Instance Learning with Self-Distillation for Weakly-Supervised Audio-Visual Violence Detection
Weakly-supervised audio-visual violence detection aims to distinguish
snippets containing multimodal violence events with video-level labels. Many
prior works perform audio-visual integration and interaction in an early or
intermediate manner, yet overlook modality heterogeneity in the
weakly-supervised setting. In this paper, we analyze the modality asynchrony
and undifferentiated instances phenomena of the multiple instance learning
(MIL) procedure, and further investigate their negative impact on
weakly-supervised audio-visual learning. To address these issues, we propose a
modality-aware contrastive instance learning with self-distillation (MACIL-SD)
strategy. Specifically, we leverage a lightweight two-stream network to
generate audio and visual bags, in which unimodal background, violent, and
normal instances are clustered into semi-bags in an unsupervised way. Then
audio and visual violent semi-bag representations are assembled as positive
pairs, and violent semi-bags are combined with background and normal instances
in the opposite modality as contrastive negative pairs. Furthermore, a
self-distillation module is applied to transfer unimodal visual knowledge to
the audio-visual model, which alleviates noise and closes the semantic gap
between unimodal and multimodal features. Experiments show that our framework
outperforms previous methods with lower complexity on the large-scale
XD-Violence dataset. Results also demonstrate that our proposed approach can be
used as plug-in modules to enhance other networks. Code is available at
https://github.com/JustinYuu/MACIL_SD.
Comment: ACM MM 202
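The cross-modal pairing described in the abstract above can be illustrated with a minimal sketch. The abstract does not state the loss function, so the InfoNCE-style form, the temperature value, and the function names `l2_normalize` and `cross_modal_info_nce` are all assumptions made for illustration; the authors' implementation is in the linked repository and may differ.

```python
import numpy as np

def l2_normalize(x, axis=-1, eps=1e-8):
    # Scale vectors to (approximately) unit length so dot products act as cosine similarity.
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + eps)

def cross_modal_info_nce(anchor, positive, negatives, temperature=0.1):
    """InfoNCE-style loss for a single anchor (assumed form, not from the paper).

    anchor:    (d,)   e.g. the audio violent semi-bag representation
    positive:  (d,)   e.g. the visual violent semi-bag representation
    negatives: (n, d) background/normal instances from the opposite modality
    """
    a = l2_normalize(anchor)
    p = l2_normalize(positive)
    n = l2_normalize(negatives)
    pos_logit = np.dot(a, p) / temperature
    neg_logits = (n @ a) / temperature
    logits = np.concatenate([[pos_logit], neg_logits])
    # Numerically stable log-softmax; the positive pair sits at index 0.
    logits = logits - logits.max()
    log_prob_pos = logits[0] - np.log(np.exp(logits).sum())
    return -log_prob_pos

# Toy vectors (hypothetical data): an aligned pair versus a misaligned pair.
audio_violent = np.array([1.0, 0.0])
loss_aligned = cross_modal_info_nce(
    audio_violent, np.array([1.0, 0.0]), np.array([[0.0, 1.0], [-1.0, 0.0]]))
loss_misaligned = cross_modal_info_nce(
    audio_violent, np.array([0.0, 1.0]), np.array([[1.0, 0.0], [-1.0, 0.0]]))
```

In this toy setup the loss is small when the audio and visual violent representations point the same way and large when the visual positive resembles a negative instead, which is the behavior a contrastive objective of this kind is meant to reward.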
MM-Pyramid: Multimodal Pyramid Attentional Network for Audio-Visual Event Localization and Video Parsing
Recognizing and localizing events in videos is a fundamental task for video
understanding. Since events may occur in auditory and visual modalities,
multimodal detailed perception is essential for complete scene comprehension.
Most previous works attempted to analyze videos from a holistic perspective.
However, they do not consider semantic information at multiple scales, which
makes it difficult for the model to localize events of different lengths. In this
paper, we present a Multimodal Pyramid Attentional Network
(MM-Pyramid) for event localization. Specifically, we first propose
the attentive feature pyramid module. This module captures temporal pyramid
features via several stacked pyramid units, each composed of a
fixed-size attention block and a dilated convolution block. We also design an
adaptive semantic fusion module, which leverages a unit-level attention block
and a selective fusion block to integrate pyramid features interactively.
Extensive experiments on audio-visual event localization and weakly-supervised
audio-visual video parsing tasks verify the effectiveness of our approach.
Comment: ACM MM 202
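To make the multi-scale idea in the abstract above concrete, here is a minimal sketch of how dilated convolutions widen the temporal receptive field from one pyramid level to the next. It covers only the dilated convolution part; the attention block and the fusion module are omitted. The dilation rates (1, 2, 4), the kernel, and the function names are assumptions for illustration, not values taken from the paper.

```python
import numpy as np

def dilated_conv1d(x, kernel, dilation):
    """'Same'-padded 1D dilated convolution (cross-correlation) over time.

    x: (T,) temporal signal; kernel: (k,) with odd k.
    """
    k = len(kernel)
    pad = dilation * (k - 1) // 2  # exact for odd k
    xp = np.pad(x, pad)
    out = np.zeros(len(x))
    for t in range(len(x)):
        for j in range(k):
            # Taps are spaced `dilation` steps apart in time.
            out[t] += kernel[j] * xp[t + j * dilation]
    return out

def temporal_pyramid(x, kernel, dilations=(1, 2, 4)):
    # One output per level; larger dilations see longer temporal context.
    return [dilated_conv1d(x, kernel, d) for d in dilations]

# A single impulse at t = 4 shows which time steps each level connects to it.
signal = np.zeros(9)
signal[4] = 1.0
levels = temporal_pyramid(signal, np.ones(3))
touched = [np.nonzero(level)[0].tolist() for level in levels]
```

With a 3-tap kernel, the positions that respond to the impulse spread from adjacent snippets at dilation 1 to snippets four steps away at dilation 4, so short and long events can be covered without enlarging the kernel.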
Alkylation of phosphorothioated thrombin binding aptamers improves the selectivity of inhibition of tumor cell proliferation upon anticoagulation
Background: Recently, aptamers have been extensively researched for therapeutic and diagnostic applications. Thrombin-binding aptamer (TBA) is a 15 nt deoxyribonucleic acid screened by SELEX; it can specifically bind to thrombin and inhibit blood coagulation. Since it is also endowed with excellent antitumor activity, its intrinsic anticoagulation activity becomes a major potential side effect for its further application in antiproliferative therapy. Methods: Site-specific alkylation was conducted through nucleophilic reaction of phosphorothioated TBAs using bromide reagents. Circular dichroism (CD) spectroscopy and surface plasmon resonance (SPR) measurements were used to evaluate anticoagulation activity, and a CCK-8 assay was used to determine cell proliferation activity. Results: The CD spectra of the modified TBAs were weakened, and their affinity for thrombin was dramatically reduced, as reflected by the K_D values. On the other hand, their inhibition of A549 cells was retained. Conclusions: Incorporation of different alkyls apparently disrupted the binding of TBA to thrombin while maintaining the antitumor activity. General significance: A new modification strategy was established for the use of TBA as a more selective antitumor agent.
Funding: National Natural Science Foundation of China [21332010, 21572013]; Ministry of Science and Technology of the People's Republic of China [2012CB720604]
Identification of Nonlinear Latent Hierarchical Models
Identifying latent variables and causal structures from observational data is
essential to many real-world applications involving biological data, medical
data, and unstructured data such as images and languages. However, this task
can be highly challenging, especially when observed variables are generated by
causally related latent variables and the relationships are nonlinear. In this
work, we investigate the identification problem for nonlinear latent
hierarchical causal models in which observed variables are generated by a set
of causally related latent variables, and some latent variables may not have
observed children.
We show that the identifiability of causal structures and latent variables
(up to invertible transformations) can be achieved under mild assumptions: on
causal structures, we allow for multiple paths between any pair of variables in
the graph, which relaxes latent tree assumptions in prior work; on structural
functions, we permit general nonlinearity and multi-dimensional continuous
variables, alleviating existing work's parametric assumptions. Specifically, we
first develop an identification criterion in the form of novel identifiability
guarantees for an elementary latent variable model. Leveraging this criterion,
we show that both causal structures and latent variables of the hierarchical
model can be identified asymptotically by explicitly constructing an estimation
procedure. To the best of our knowledge, our work is the first to establish
identifiability guarantees for both causal structures and latent variables in
nonlinear latent hierarchical models.
Comment: NeurIPS 202
Tag2Text: Guiding Vision-Language Model via Image Tagging
This paper presents Tag2Text, a vision language pre-training (VLP) framework,
which introduces image tagging into vision-language models to guide the
learning of visual-linguistic features. In contrast to prior works which
utilize object tags either manually labeled or automatically detected with a
limited detector, our approach uses tags parsed from each image's paired text to
learn an image tagger and meanwhile provides guidance to vision-language
models. Given that, Tag2Text can utilize large-scale annotation-free image tags
in accordance with image-text pairs, and provides more diverse tag categories
beyond objects. As a result, Tag2Text achieves a superior image tag recognition
ability by exploiting fine-grained text information. Moreover, by leveraging
tagging guidance, Tag2Text effectively enhances the performance of
vision-language models on both generation-based and alignment-based tasks.
Across a wide range of downstream benchmarks, Tag2Text achieves
state-of-the-art or competitive results with similar model sizes and data
scales, demonstrating the efficacy of the proposed tagging guidance
- …