88 research outputs found
On the Need for a Language Describing Distribution Shifts: Illustrations on Tabular Datasets
Different distribution shifts require different algorithmic and operational
interventions. Methodological research must be grounded in the specific shifts
it addresses. Although nascent benchmarks provide a promising empirical
foundation, they implicitly focus on covariate shifts, and the validity of
empirical findings depends on the type of shift, e.g., previous observations on
algorithmic performance can fail to be valid when the distribution
changes. We conduct a thorough investigation of natural shifts in 5 tabular
datasets over 86,000 model configurations, and find that Y|X-shifts are most
prevalent. To encourage researchers to develop a refined language for
distribution shifts, we build WhyShift, an empirical testbed of curated
real-world shifts where we characterize the type of shift we benchmark
performance over. Since Y|X-shifts are prevalent in tabular settings, we
identify covariate regions that suffer the biggest Y|X-shifts and discuss
implications for algorithmic and data-based interventions. Our testbed
highlights the importance of future research that builds an understanding of
how distributions differ.
Comment: 41 pages
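As an illustrative aside (ours, not the paper's): the distinction between a covariate shift, where only P(X) moves, and a Y|X-shift, where the relationship between outcome and covariates changes, can be made concrete on synthetic tabular data. The sketch below uses made-up distributions and slopes purely for illustration.

    # Minimal sketch (not from the paper): covariate shift vs. Y|X-shift.
    import numpy as np

    rng = np.random.default_rng(0)
    n = 10_000

    def sample(x_mean, slope, noise=0.1):
        """Draw (X, Y) pairs with Y = slope * X + Gaussian noise."""
        x = rng.normal(loc=x_mean, scale=1.0, size=n)
        y = slope * x + rng.normal(scale=noise, size=n)
        return x, y

    x_src, y_src = sample(x_mean=0.0, slope=1.0)   # source distribution
    x_cov, y_cov = sample(x_mean=2.0, slope=1.0)   # covariate shift: P(X) moves, P(Y|X) fixed
    x_yx,  y_yx  = sample(x_mean=0.0, slope=0.5)   # Y|X-shift: P(X) fixed, P(Y|X) moves

    # A model fit on the source stays accurate under the covariate shift but is
    # systematically biased under the Y|X-shift.
    w = np.polyfit(x_src, y_src, deg=1)
    for name, (x, y) in {"covariate": (x_cov, y_cov), "Y|X": (x_yx, y_yx)}.items():
        mse = np.mean((np.polyval(w, x) - y) ** 2)
        print(f"{name} shift: test MSE = {mse:.3f}")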
Modality-Aware Contrastive Instance Learning with Self-Distillation for Weakly-Supervised Audio-Visual Violence Detection
Weakly-supervised audio-visual violence detection aims to distinguish
snippets containing multimodal violence events with video-level labels. Many
prior works perform audio-visual integration and interaction in an early or
intermediate manner, yet overlooking the modality heterogeneousness over the
weakly-supervised setting. In this paper, we analyze the modality asynchrony
and undifferentiated instances phenomena of the multiple instance learning
(MIL) procedure, and further investigate their negative impact on
weakly-supervised audio-visual learning. To address these issues, we propose a
modality-aware contrastive instance learning with self-distillation (MACIL-SD)
strategy. Specifically, we leverage a lightweight two-stream network to
generate audio and visual bags, in which unimodal background, violent, and
normal instances are clustered into semi-bags in an unsupervised way. Then
audio and visual violent semi-bag representations are assembled as positive
pairs, and violent semi-bags are combined with background and normal instances
in the opposite modality as contrastive negative pairs. Furthermore, a
self-distillation module is applied to transfer unimodal visual knowledge to
the audio-visual model, which alleviates noise and closes the semantic gap
between unimodal and multimodal features. Experiments show that our framework
outperforms previous methods with lower complexity on the large-scale
XD-Violence dataset. Results also demonstrate that our proposed approach can be
used as a plug-in module to enhance other networks. Code is available at
https://github.com/JustinYuu/MACIL_SD.
Comment: ACM MM 202
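As a rough illustration (not the authors' released code), the cross-modal pairing described above, with violent semi-bags from the two modalities as positives and background/normal instances from the opposite modality as negatives, can be written as an InfoNCE-style loss. The tensor shapes, the pooling into semi-bag embeddings, and the temperature are assumptions.

    # Minimal sketch (assumed shapes and pooling, not the authors' code):
    # cross-modal contrastive pairing of violent semi-bags, with background/normal
    # instances from the opposite modality serving as negatives.
    import torch
    import torch.nn.functional as F

    def semi_bag_contrastive(audio_violent, visual_violent, visual_negatives, tau=0.1):
        """audio_violent, visual_violent: (B, D) pooled semi-bag embeddings.
        visual_negatives: (B, K, D) background/normal instances from the visual stream."""
        a = F.normalize(audio_violent, dim=-1)
        v_pos = F.normalize(visual_violent, dim=-1)
        v_neg = F.normalize(visual_negatives, dim=-1)

        pos = (a * v_pos).sum(-1, keepdim=True) / tau         # (B, 1) positive similarity
        neg = torch.einsum("bd,bkd->bk", a, v_neg) / tau       # (B, K) negative similarities
        logits = torch.cat([pos, neg], dim=1)                  # the positive sits at index 0
        labels = torch.zeros(logits.size(0), dtype=torch.long)
        return F.cross_entropy(logits, labels)

    loss = semi_bag_contrastive(torch.randn(4, 128), torch.randn(4, 128), torch.randn(4, 16, 128))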
Rethinking the Evaluation Protocol of Domain Generalization
Domain generalization aims to solve the challenge of Out-of-Distribution
(OOD) generalization by leveraging common knowledge learned from multiple
training domains to generalize to unseen test domains. To accurately evaluate
the OOD generalization ability, it is necessary to ensure that test data
information is unavailable. However, the current domain generalization protocol
may still leak test data information. This paper examines the
potential risks of test data information leakage in two aspects of the current
protocol: pretraining on ImageNet and oracle model selection. We propose that
training from scratch and using multiple test domains would result in a more
precise evaluation of OOD generalization ability. We also rerun the algorithms
with the modified protocol and introduce a new leaderboard to encourage fairer
comparisons in future domain generalization research.
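To make the proposed protocol concrete, a schematic evaluation loop might look like the following; train_fn and eval_fn are placeholders for an actual from-scratch training routine and per-domain accuracy computation, and are not tied to any specific benchmark.

    # Schematic only (placeholder callables, no fixed benchmark): evaluate OOD
    # generalization by training from random initialization (no ImageNet weights)
    # and averaging accuracy over several held-out test domains, avoiding oracle
    # (test-domain) model selection.
    from itertools import combinations

    def evaluate_protocol(domains, train_fn, eval_fn, n_test_domains=2):
        """domains: dict name -> dataset; train_fn trains a model from scratch on
        a list of training datasets; eval_fn returns accuracy on one dataset."""
        scores = []
        for test_names in combinations(domains, n_test_domains):
            train_names = [d for d in domains if d not in test_names]
            model = train_fn([domains[d] for d in train_names])       # from scratch
            accs = [eval_fn(model, domains[d]) for d in test_names]   # multiple test domains
            scores.append(sum(accs) / len(accs))
        return sum(scores) / len(scores)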
Distributionally Robust Learning with Stable Adversarial Training
Machine learning algorithms based on empirical risk minimization are vulnerable
to distributional shifts because they greedily adopt all the correlations
found in the training data. There is an emerging literature on tackling this
problem by minimizing the worst-case risk over an uncertainty set. However,
existing methods mostly construct ambiguity sets by treating all variables
equally regardless of the stability of their correlations with the target,
resulting in an overwhelmingly large uncertainty set and low confidence in the
learned model. In this paper, we propose a novel Stable Adversarial Learning (SAL)
algorithm that leverages heterogeneous data sources to construct a more
practical uncertainty set and conduct differentiated robustness optimization,
where covariates are differentiated according to the stability of their
correlations with the target. We theoretically show that our method is
tractable for stochastic gradient-based optimization and provide performance
guarantees for our method. Empirical studies on both simulated and
real datasets validate the effectiveness of our method in terms of uniformly
good performance across unknown distributional shifts.
Comment: arXiv admin note: substantial text overlap with arXiv:2006.0441
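For orientation, the worst-case objective referred to above can be written generically as below; the notation (cost D_c, radius rho) is ours and only schematically indicates the covariate-differentiated uncertainty set, not the paper's exact formulation.

    % Generic worst-case risk over an uncertainty set around the training
    % distribution P_0; the idea sketched in the abstract is to make the set
    % anisotropic, so that covariates with unstable correlations to the target
    % admit larger perturbations than stable ones.
    \min_{\theta}\; \sup_{Q \in \mathcal{U}(P_0)}
        \mathbb{E}_{(X,Y)\sim Q}\bigl[\ell(f_\theta(X), Y)\bigr],
    \qquad
    \mathcal{U}(P_0) = \bigl\{\, Q \;:\; D_c(Q, P_0) \le \rho \,\bigr\}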
Improving Multi-turn Emotional Support Dialogue Generation with Lookahead Strategy Planning
Providing Emotional Support (ES) to soothe people in emotional distress is an
essential capability in social interactions. Most existing research on building
ES conversation systems considers only single-turn interactions with users,
which is an over-simplification. In comparison, multi-turn ES conversation
systems can provide ES more effectively, but face several new technical
challenges, including: (1) how to adopt appropriate support strategies to
achieve the long-term dialogue goal of comforting the user's emotion; (2) how
to dynamically model the user's state. In this paper, we propose a novel system
MultiESC to address these issues. For strategy planning, drawing inspiration
from the A* search algorithm, we propose lookahead heuristics to estimate the
future user feedback after using particular strategies, which helps to select
strategies that can lead to the best long-term effects. For user state
modeling, MultiESC focuses on capturing users' subtle emotional expressions and
understanding their emotion causes. Extensive experiments show that MultiESC
significantly outperforms competitive baselines in both dialogue generation and
strategy planning. Our codes are available at
https://github.com/lwgkzl/MultiESC.
Comment: Accepted by the main conference of EMNLP 202
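The A*-inspired planning described above can be summarized as scoring each candidate support strategy by a history-based score plus a lookahead estimate of future user feedback. The sketch below uses placeholder scorers and example strategy names; it is not the MultiESC implementation.

    # Sketch only (placeholder scorers, not the MultiESC model): A*-style strategy
    # selection, combining the plausibility of a strategy given the dialogue
    # history ("g") with a lookahead estimate of future user feedback ("h").
    def plan_strategy(history, strategies, strategy_scorer, feedback_estimator):
        """Pick the strategy maximizing g(history, s) + h(history, s)."""
        def score(s):
            g = strategy_scorer(history, s)        # how plausible s is right now
            h = feedback_estimator(history, s)     # predicted future user feedback
            return g + h
        return max(strategies, key=score)

    best = plan_strategy(
        history=["I failed my exam and feel hopeless."],
        strategies=["Question", "Reflection of Feelings", "Providing Suggestions"],
        strategy_scorer=lambda hist, s: 0.0,       # placeholder scorer
        feedback_estimator=lambda hist, s: 0.0,    # placeholder lookahead estimator
    )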
InternVid: A Large-scale Video-Text Dataset for Multimodal Understanding and Generation
This paper introduces InternVid, a large-scale video-centric multimodal
dataset that enables learning powerful and transferable video-text
representations for multimodal understanding and generation. The InternVid
dataset contains over 7 million videos lasting nearly 760K hours, yielding 234M
video clips accompanied by detailed descriptions totaling 4.1B words. Our core
contribution is to develop a scalable approach to autonomously build a
high-quality video-text dataset with large language models (LLMs), thereby
showcasing its efficacy in learning video-language representation at scale.
Specifically, we utilize a multi-scale approach to generate video-related
descriptions. Furthermore, we introduce ViCLIP, a video-text representation
learning model based on ViT-L. Learned on InternVid via contrastive learning,
this model demonstrates leading zero-shot action recognition and competitive
video retrieval performance. Beyond basic video understanding tasks like
recognition and retrieval, our dataset and model have broad applications. They
are particularly beneficial for generating interleaved video-text data for
learning a video-centric dialogue system, advancing video-to-text and
text-to-video generation research. These proposed resources provide a tool for
researchers and practitioners interested in multimodal video understanding and
generation.
Comment: Data and Code: https://github.com/OpenGVLab/InternVideo/tree/main/Data/InternVi
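ViCLIP is described as a ViT-L-based model trained contrastively on InternVid; a generic CLIP-style video-text contrastive objective, written against assumed encoder outputs rather than the released ViCLIP code, looks roughly like this.

    # Generic CLIP-style video-text contrastive objective (assumed encoder
    # interfaces; not the released ViCLIP code).
    import torch
    import torch.nn.functional as F

    def video_text_contrastive(video_emb, text_emb, temperature=0.07):
        """video_emb, text_emb: (B, D) embeddings of paired clips and captions."""
        v = F.normalize(video_emb, dim=-1)
        t = F.normalize(text_emb, dim=-1)
        logits = v @ t.T / temperature                 # (B, B) similarity matrix
        labels = torch.arange(v.size(0))               # matched pairs lie on the diagonal
        return 0.5 * (F.cross_entropy(logits, labels) +
                      F.cross_entropy(logits.T, labels))

    loss = video_text_contrastive(torch.randn(8, 512), torch.randn(8, 512))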
- …