UR-FUNNY: A Multimodal Language Dataset for Understanding Humor
Humor is a unique and creative communicative behavior displayed during social
interactions. It is produced multimodally, through the use of words (text),
gestures (vision), and prosodic cues (acoustic). Understanding humor from
these three modalities falls within the boundaries of multimodal language, a
recent research trend in natural language processing that models natural
language as it occurs in face-to-face communication. Although humor detection
is an established research area in NLP, it remains understudied in a
multimodal context. This paper presents a diverse multimodal dataset, called
UR-FUNNY, to open the door to understanding the multimodal language used in
expressing humor. The dataset and accompanying studies present a framework
for multimodal humor detection for the natural language processing community.
UR-FUNNY is publicly available for research.
TextMI: Textualize Multimodal Information for Integrating Non-verbal Cues in Pre-trained Language Models
Pre-trained large language models have recently achieved ground-breaking
performance in a wide variety of language understanding tasks. However, the
same models cannot be applied to multimodal behavior understanding tasks
(e.g., video sentiment/humor detection) unless non-verbal features (e.g.,
acoustic and visual) can be integrated with language. Jointly modeling
multiple modalities significantly increases model complexity and makes the
training process data-hungry. While an enormous amount of text data is
available via the web, collecting large-scale multimodal behavioral video
datasets is extremely expensive, both in terms of time and money. In this
paper, we investigate whether large language models alone can successfully
incorporate non-verbal information when it is presented in textual form. We
present a way to convert the acoustic and visual information into
corresponding textual descriptions and concatenate them with the spoken text.
We feed this augmented input to a pre-trained BERT model and fine-tune it on
three downstream multimodal tasks: sentiment, humor, and sarcasm detection.
Our approach, TextMI, significantly reduces model complexity, adds
interpretability to the model's decisions, and can be applied to a diverse
set of tasks while achieving superior (multimodal sarcasm detection) or
near-SOTA (multimodal sentiment analysis and multimodal humor detection)
performance. We propose TextMI as a general, competitive baseline for
multimodal behavioral analysis tasks, particularly in low-resource settings.
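
The following is a minimal sketch of the idea described above, not the
authors' released code: non-verbal cues are verbalized into short textual
descriptions (the templates, example cues, and label scheme below are
illustrative assumptions) and concatenated with the spoken text before
fine-tuning a standard pre-trained BERT classifier.

```python
# Hedged sketch of a TextMI-style pipeline: verbalize non-verbal cues,
# concatenate them with the transcript, and fine-tune BERT.
# The cue descriptions and binary label here are made up for illustration.
import torch
from transformers import BertTokenizer, BertForSequenceClassification

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=2)

def textualize(utterance, acoustic_desc, visual_desc):
    # Hypothetical template: append verbalized acoustic and visual cues
    # to the spoken text as extra sentences.
    return f"{utterance} The speaker {acoustic_desc}. The speaker {visual_desc}."

example = textualize(
    "And that is when I realized the cake was gone.",
    "raises pitch and pauses before the last word",
    "smiles and leans toward the camera",
)

inputs = tokenizer(example, return_tensors="pt", truncation=True)
label = torch.tensor([1])  # 1 = humorous (assumed binary label)

# One fine-tuning step; in practice this loops over the full training set
# with an optimizer such as AdamW.
outputs = model(**inputs, labels=label)
outputs.loss.backward()
```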
Using AI to Measure Parkinson's Disease Severity at Home
We present an artificial intelligence system to remotely assess the motor
performance of individuals with Parkinson's disease (PD). Participants
performed a motor task (i.e., finger tapping) in front of a webcam, and
recordings from 250 participants worldwide were rated by three expert
neurologists following
the Movement Disorder Society Unified Parkinson's Disease Rating Scale
(MDS-UPDRS). The neurologists' ratings were highly reliable, with an
intra-class correlation coefficient (ICC) of 0.88. We developed computer
algorithms to obtain objective measurements that align with the MDS-UPDRS
guideline and are strongly correlated with the neurologists' ratings. Our
machine learning model trained on these measures outperformed an MDS-UPDRS
certified rater, with a mean absolute error (MAE) of 0.59 compared to the
rater's MAE of 0.79. However, the model performed slightly worse than the
expert neurologists (MAE of 0.53). The methodology can be replicated for
similar motor tasks, making it possible to evaluate individuals with PD and
other movement disorders remotely, objectively, and in areas with limited
access to neurological care.
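
The headline numbers above are mean absolute errors against expert MDS-UPDRS
ratings. For reference, here is a minimal sketch of how such an MAE is
computed; the score arrays are made-up illustrative values, not the study's
data.

```python
# Mean absolute error between predicted MDS-UPDRS finger-tapping scores
# (0-4 ordinal scale) and expert ratings. Values below are illustrative only.
import numpy as np

expert_ratings = np.array([0, 1, 2, 3, 2, 1])                 # ground truth
model_predictions = np.array([0.4, 1.2, 1.8, 2.5, 2.1, 1.3])  # model output

mae = np.mean(np.abs(model_predictions - expert_ratings))
print(f"MAE = {mae:.2f}")  # lower is better; the abstract reports 0.59 for the model
```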
AI and Machine Learning
A primer on AI and machine learning that also touches on "good" and "bad" AI and its relationship with governments and corporations.
Humor Knowledge Enriched Transformer for Understanding Multimodal Humor
Recognizing humor from a video utterance requires understanding the verbal
and non-verbal components as well as incorporating the appropriate context
and external knowledge. In this paper, we propose the Humor Knowledge
enriched Transformer (HKT), which can capture the gist of a multimodal
humorous expression by integrating the preceding context and external
knowledge. We incorporate humor-centric external knowledge into the model by
capturing the ambiguity and sentiment present in the language. We encode the
language, acoustic, vision, and humor-centric features separately using
Transformer-based encoders, followed by a cross-attention layer to exchange
information among them. Our model achieves 77.36% and 79.41% accuracy in
humorous punchline detection on the UR-FUNNY and MUStARD datasets, setting a
new state of the art on both with margins of 4.93% and 2.94%, respectively.
Furthermore, we demonstrate that our model can capture interpretable,
humor-inducing patterns from all modalities.
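
A minimal sketch of the cross-attention step described above, with
illustrative dimensions and module choices rather than the authors'
implementation: each modality is encoded by its own Transformer encoder, and
one modality's sequence then attends over another's to exchange information
before classification.

```python
# Hedged sketch of per-modality encoding followed by cross attention
# (illustrative sizes and a two-modality example; not the HKT authors' code).
import torch
import torch.nn as nn

d_model = 256
lang_encoder = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True), num_layers=2)
acoustic_encoder = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True), num_layers=2)
cross_attention = nn.MultiheadAttention(d_model, num_heads=4, batch_first=True)

language = lang_encoder(torch.randn(8, 20, d_model))      # (batch, tokens, dim)
acoustic = acoustic_encoder(torch.randn(8, 50, d_model))  # (batch, frames, dim)

# Language queries attend over acoustic keys/values, pulling acoustic
# information into the language stream.
lang_attending_acoustic, _ = cross_attention(
    query=language, key=acoustic, value=acoustic)

# Pool and classify: humorous punchline vs. not (binary head assumed).
fused = torch.cat(
    [language.mean(dim=1), lang_attending_acoustic.mean(dim=1)], dim=-1)
classifier = nn.Linear(2 * d_model, 2)
logits = classifier(fused)
```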