Multimodal Grounding for Language Processing
This survey discusses how recent developments in multimodal processing
facilitate conceptual grounding of language. We categorize the information flow
in multimodal processing with respect to cognitive models of human information
processing and analyze different methods for combining multimodal
representations. Based on this methodological inventory, we discuss the benefit
of multimodal grounding for a variety of language processing tasks and the
challenges that arise. We particularly focus on multimodal grounding of verbs,
which play a crucial role in the compositional power of language.

Comment: The paper has been published in the Proceedings of the 27th
International Conference on Computational Linguistics (COLING 2018). Please
refer to that version for citations:
https://www.aclweb.org/anthology/papers/C/C18/C18-1197
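The survey's inventory of methods for combining multimodal representations is described only abstractly above, so here is a minimal sketch, assuming PyTorch and illustrative dimensions, of two common combination strategies (simple concatenation vs. gated fusion); the module names and sizes are ours, not the survey's.

```python
# Minimal sketch of two ways to combine a textual and a visual representation.
# Dimensions (300-d text, 2048-d image, 512-d output) are illustrative.
import torch
import torch.nn as nn

class ConcatFusion(nn.Module):
    """Simple fusion: concatenate modality vectors, then project."""
    def __init__(self, d_text=300, d_img=2048, d_out=512):
        super().__init__()
        self.proj = nn.Linear(d_text + d_img, d_out)

    def forward(self, t, v):
        return self.proj(torch.cat([t, v], dim=-1))

class GatedFusion(nn.Module):
    """Gated fusion: a learned gate decides how much each modality contributes."""
    def __init__(self, d_text=300, d_img=2048, d_out=512):
        super().__init__()
        self.t_proj = nn.Linear(d_text, d_out)
        self.v_proj = nn.Linear(d_img, d_out)
        self.gate = nn.Linear(d_text + d_img, d_out)

    def forward(self, t, v):
        g = torch.sigmoid(self.gate(torch.cat([t, v], dim=-1)))
        return g * self.t_proj(t) + (1 - g) * self.v_proj(v)

text = torch.randn(4, 300)    # e.g. word embeddings
image = torch.randn(4, 2048)  # e.g. CNN features
print(ConcatFusion()(text, image).shape)  # torch.Size([4, 512])
print(GatedFusion()(text, image).shape)   # torch.Size([4, 512])
```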
AmicroN: A Framework for Generating Annotations for Human Activity Recognition with Granular Micro-Activities
Efficient human activity recognition (HAR) using sensor data needs a
significant volume of annotated data. The growing volume of unlabelled sensor
data has challenged conventional practices for gathering HAR annotations with
human-in-the-loop approaches, often leading to the collection of shallower
annotations. These shallower annotations ignore the fine-grained
micro-activities that constitute any complex activities of daily living (ADL).
Motivated by this, we first analyze the lack of granular annotations in
available pre-annotated datasets to understand the practical inconsistencies,
and we conduct a detailed survey of human perceptions surrounding annotations.
Drawing motivation from these findings, we next
develop the framework AmicroN that can automatically generate micro-activity
annotations using locomotive signatures and the available coarse-grain
macro-activity labels. In the backend, AmicroN applies change-point detection
followed by zero-shot learning with activity embeddings to identify the unseen
micro-activities in an unsupervised manner. Rigorous evaluation on publicly
available datasets shows that AmicroN can accurately generate micro-activity
annotations with a median F1-score of >0.75. Additionally, we show that
AmicroN can be used in a plug-and-play manner with Large Language Models (LLMs)
to obtain micro-activity labels, making it more practical for realistic
applications.

Comment: 27 pages, 5 tables, 9 figures
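To make the described pipeline concrete, here is a minimal sketch of a change-point-then-match loop, assuming the `ruptures` library for change-point detection; the `embed` placeholder, label set, and segment descriptor are illustrative stand-ins, not AmicroN's actual components.

```python
# Sketch of an AmicroN-style pipeline: segment a sensor stream with
# change-point detection, then assign each segment the nearest
# micro-activity label in an embedding space.
import numpy as np
import ruptures as rpt  # pip install ruptures

def embed(text: str) -> np.ndarray:
    """Placeholder text embedder; swap in any sentence/label encoder."""
    rng = np.random.default_rng(abs(hash(text)) % (2**32))
    v = rng.standard_normal(64)
    return v / np.linalg.norm(v)

micro_labels = ["sit down", "stand up", "walk", "open door"]  # hypothetical
label_vecs = np.stack([embed(label) for label in micro_labels])

signal = np.random.randn(1000, 3)  # e.g. a 3-axis accelerometer window

# 1) Change-point detection splits the stream into candidate segments.
breakpoints = rpt.Pelt(model="rbf").fit(signal).predict(pen=10)

# 2) Zero-shot step: embed each segment (here via a toy descriptor) and
#    pick the closest micro-activity label by cosine similarity.
start = 0
for end in breakpoints:
    descriptor = f"segment mean {signal[start:end].mean():.2f}"
    seg_vec = embed(descriptor)
    best = micro_labels[int(np.argmax(label_vecs @ seg_vec))]
    print(f"[{start:4d}:{end:4d}] -> {best}")
    start = end
```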
Action Modifiers: Learning from Adverbs in Instructional Videos
We present a method to learn a representation for adverbs from instructional
videos using weak supervision from the accompanying narrations. Key to our
method is the fact that the visual representation of the adverb is highly
dependent on the action to which it applies, although the same adverb will
modify multiple actions in a similar way. For instance, while 'spread quickly'
and 'mix quickly' will look dissimilar, we can learn a common representation
that allows us to recognize both, among other actions. We formulate this as an
embedding problem, and use scaled dot-product attention to learn from
weakly-supervised video narrations. We jointly learn adverbs as invertible
transformations operating on the embedding space, so as to add or remove the
effect of the adverb. As there is no prior work on weakly supervised learning
from adverbs, we gather paired action-adverb annotations from a subset of the
HowTo100M dataset for 6 adverbs: quickly/slowly, finely/coarsely, and
partially/completely. Our method outperforms all baselines for video-to-adverb
retrieval with a performance of 0.719 mAP. We also demonstrate our model's
ability to attend to the relevant video parts in order to determine the adverb
for a given action.

Comment: CVPR 2020
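As an illustration of the two ingredients named above, the following sketch (assuming PyTorch) shows an action query attending over video segments via scaled dot-product attention, and an adverb modeled as an invertible transformation of the embedding; the shapes, and the choice of an orthogonal linear map for the adverb, are our illustrative simplifications, not the paper's exact parameterization.

```python
# Sketch: action-conditioned attention over video segments, plus an
# invertible adverb transformation that can be applied and removed.
import torch
import torch.nn.functional as F

d = 128
segments = torch.randn(1, 20, d)      # 20 video-segment features
action_query = torch.randn(1, 1, d)   # embedding of e.g. "spread"

# Scaled dot-product attention: the action query attends to the
# segments where the action occurs.
scores = action_query @ segments.transpose(1, 2) / d**0.5
video_emb = (F.softmax(scores, dim=-1) @ segments).squeeze(1)

# Adverb as an invertible transformation: applying W adds the effect of
# "quickly"; applying its inverse removes it.
W = torch.linalg.qr(torch.randn(d, d)).Q   # orthogonal => invertible
quickly = video_emb @ W                    # "spread" -> "spread quickly"
recovered = quickly @ W.T                  # inverse of an orthogonal map
print(torch.allclose(recovered, video_emb, atol=1e-5))  # True
```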
MAtch, eXpand and Improve: Unsupervised Finetuning for Zero-Shot Action Recognition with Language Knowledge
Large scale Vision-Language (VL) models have shown tremendous success in
aligning representations between visual and text modalities. This enables
remarkable progress in zero-shot recognition, image generation & editing, and
many other exciting tasks. However, VL models tend to over-represent objects
while paying much less attention to verbs, and require additional tuning on
video data for best zero-shot action recognition performance. While previous
work relied on large-scale, fully-annotated data, in this work we propose an
unsupervised approach. We adapt a VL model for zero-shot and few-shot action
recognition using a collection of unlabeled videos and an unpaired action
dictionary. Based on that, we leverage Large Language Models and VL models to
build a text bag for each unlabeled video via matching, text expansion and
captioning. We use those bags in a Multiple Instance Learning setup to adapt an
image-text backbone to video data. Although finetuned on unlabeled video data,
our resulting models demonstrate high transferability to numerous unseen
zero-shot downstream tasks, improving the base VL model performance by up to
14%, and even comparing favorably to fully-supervised baselines in both
zero-shot and few-shot video recognition transfer. The code will be released
at https://github.com/wlin-at/MAXI.

Comment: Accepted at ICCV 2023
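To make the Multiple Instance Learning setup concrete, here is a minimal sketch of a MIL-NCE-style objective over a text bag, assuming PyTorch; the encoders, bag contents, and negatives are stand-ins for illustration, not the MAXI implementation.

```python
# Sketch: each unlabeled video comes with a bag of candidate texts; training
# pulls the video toward *some* bag member, without knowing which is correct.
import torch
import torch.nn.functional as F

def mil_nce_loss(video_emb, bag_embs, neg_embs, temperature=0.07):
    """video_emb: (d,), bag_embs: (k, d) bag of candidate texts,
    neg_embs: (m, d) negatives, e.g. texts from other videos' bags."""
    pos = F.cosine_similarity(bag_embs, video_emb.unsqueeze(0)) / temperature
    neg = F.cosine_similarity(neg_embs, video_emb.unsqueeze(0)) / temperature
    # Log-sum-exp over the bag: the video only needs to match some member
    # better than it matches the negatives.
    all_sims = torch.cat([pos, neg])
    return -(torch.logsumexp(pos, 0) - torch.logsumexp(all_sims, 0))

video = F.normalize(torch.randn(512), dim=0)   # e.g. pooled frame features
bag = F.normalize(torch.randn(5, 512), dim=1)  # 5 candidate action texts
negs = F.normalize(torch.randn(20, 512), dim=1)
print(mil_nce_loss(video, bag, negs).item())
```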
Learning from Very Few Samples: A Survey
Few sample learning (FSL) is significant and challenging in the field of
machine learning. The capability to learn and generalize successfully from
very few samples is a notable demarcation separating artificial intelligence
from human intelligence, since humans can readily establish cognition of
novel concepts from just a single example or a handful of examples, whereas
machine learning algorithms typically require hundreds or thousands of
supervised samples to guarantee generalization. Despite a long history dating
back to the early 2000s and widespread attention in recent years with the
boom of deep learning technologies, few surveys or reviews of FSL have been
available until now. In this context, we extensively review 300+ papers
of FSL spanning from the 2000s to 2019 and provide a timely and comprehensive
survey for FSL. In this survey, we review the evolution history as well as the
current progress on FSL, categorize FSL approaches into generative model
based and discriminative model based kinds, and place particular emphasis on
meta-learning based FSL approaches. We also summarize
several recently emerging extensional topics of FSL and review the latest
advances on these topics. Furthermore, we highlight the important FSL
applications covering many research hotspots in computer vision, natural
language processing, audio and speech, reinforcement learning, robotics, data
analysis, etc. Finally, we conclude the survey with a discussion of promising
trends, in the hope of providing guidance and insights to follow-up research.

Comment: 30 pages
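As a concrete taste of the meta-learning based FSL approaches the survey emphasizes, here is a minimal sketch of prototypical-network classification, assuming PyTorch and toy embeddings; it is one representative method, not the survey's own code.

```python
# Sketch: classify a query by distance to per-class prototypes, i.e. the
# mean embedding of each class's few support samples.
import torch

def prototypical_predict(support, support_labels, query, n_classes):
    """support: (n, d) embeddings, support_labels: (n,), query: (q, d)."""
    protos = torch.stack([support[support_labels == c].mean(0)
                          for c in range(n_classes)])  # (n_classes, d)
    dists = torch.cdist(query, protos)                 # (q, n_classes)
    return dists.argmin(dim=1)

# Toy 5-way 1-shot episode with random embeddings.
support = torch.randn(5, 64)
labels = torch.arange(5)
query = support[:3] + 0.05 * torch.randn(3, 64)  # noisy copies of classes 0-2
print(prototypical_predict(support, labels, query, 5))  # tensor([0, 1, 2])
```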
A Kind Introduction to Lexical and Grammatical Aspect, with a Survey of Computational Approaches
Aspectual meaning refers to how the internal temporal structure of situations
is presented. This includes whether a situation is described as a state or as
an event, whether the situation is finished or ongoing, and whether it is
viewed as a whole or with a focus on a particular phase. This survey gives an
overview of computational approaches to modeling lexical and grammatical aspect
along with intuitive explanations of the necessary linguistic concepts and
terminology. In particular, we describe the concepts of stativity, telicity,
habituality, perfective and imperfective, as well as influential inventories of
eventuality and situation types. We argue that because aspect is a crucial
component of semantics, especially when it comes to reporting the temporal
structure of situations in a precise way, future NLP approaches need to be able
to handle and evaluate it systematically in order to achieve human-level
language understanding.

Comment: Accepted at EACL 2023, camera-ready version
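As a toy illustration of one lexical-aspect distinction described above, the sketch below tags verbs as stative or dynamic via a small hand-written lexicon; the lexicon and the back-off behavior are hypothetical simplifications, not a system from the survey.

```python
# Toy lexical-aspect tagger: states ("know") vs. events ("run").
STATIVE = {"know", "love", "own", "resemble"}  # describe states
DYNAMIC = {"run", "build", "mix", "spread"}    # describe events

def lexical_aspect(verb: str) -> str:
    if verb in STATIVE:
        return "stative"
    if verb in DYNAMIC:
        return "dynamic"
    return "unknown"  # real systems back off to corpus or contextual cues

for v in ["know", "build", "glorp"]:
    print(v, "->", lexical_aspect(v))
```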