World futures through RT’s eyes: multimodal dataset and interdisciplinary methodology
There is a need to develop new interdisciplinary approaches suitable for a more complete analysis of multimodal data. Such approaches need to go beyond case studies and leverage technology to allow for statistically valid analysis of the data. Our study addresses this need by engaging with the research question of how humans communicate about the future for persuasive and manipulative purposes, and how they do this multimodally. It introduces a new methodology for computer-assisted multimodal analysis of video data. The study also introduces the resulting dataset, featuring annotations for speech (textual and acoustic modalities) and for gesticulation and corporal behaviour (visual modality). To analyse and annotate the data and develop the methodology, the study engages with 23 episodes, each 26 minutes long, of the show ‘SophieCo Visionaries’, broadcast by RT (formerly ‘Russia Today’).
Project Achoo: A Practical Model and Application for COVID-19 Detection from Recordings of Breath, Voice, and Cough
The COVID-19 pandemic created significant interest in, and demand for, infection
detection and monitoring solutions. In this paper we propose a machine learning
method to quickly triage COVID-19 using recordings made on consumer devices.
The approach combines signal processing methods with fine-tuned deep learning
networks and provides methods for signal denoising, cough detection and
classification. We have also developed and deployed a mobile application that
uses a symptom checker together with voice, breath and cough signals to detect
COVID-19 infection. The application showed robust performance both on
open-source datasets and on the noisy data collected from end users during
beta testing.
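
The abstract does not say how cough detection is implemented. As a rough illustration of the kind of signal-processing step that often precedes a learned classifier, the sketch below marks candidate sound events by thresholding frame-level RMS energy. The function names, frame length and threshold are assumptions made for this example, not details taken from the paper.

```python
import math


def frame_energies(samples, frame_len):
    """Return the RMS energy of each non-overlapping frame of `frame_len` samples."""
    energies = []
    for start in range(0, len(samples) - frame_len + 1, frame_len):
        frame = samples[start:start + frame_len]
        energies.append(math.sqrt(sum(x * x for x in frame) / frame_len))
    return energies


def detect_events(samples, frame_len=4, threshold=0.5):
    """Return (start_sample, end_sample) spans whose frame RMS exceeds `threshold`.

    Hypothetical helper: a real pipeline would denoise first and pass each
    span to a trained classifier instead of trusting energy alone.
    """
    events = []
    current_start = None
    energies = frame_energies(samples, frame_len)
    for i, energy in enumerate(energies):
        if energy > threshold and current_start is None:
            current_start = i * frame_len          # a loud region begins
        elif energy <= threshold and current_start is not None:
            events.append((current_start, i * frame_len))  # the region ends
            current_start = None
    if current_start is not None:
        # The recording ended while still inside a loud region.
        events.append((current_start, len(energies) * frame_len))
    return events


# Toy recording: silence, a short burst, silence.
recording = [0.0] * 8 + [1.0, -1.0] * 4 + [0.0] * 8
print(detect_events(recording))
```

Energy thresholding only proposes candidate segments. Deciding whether a segment is a cough, and whether it indicates infection, is the part the paper assigns to its fine-tuned networks.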
Co-Speech Gesture Detection through Multi-phase Sequence Labeling
Gestures are integral components of face-to-face communication. They unfold
over time, often following predictable movement phases of preparation, stroke,
and retraction. Yet, the prevalent approach to automatic gesture detection
treats the problem as binary classification, classifying a segment as either
containing a gesture or not, thus failing to capture its inherently sequential
and contextual nature. To address this, we introduce a novel framework that
reframes the task as a multi-phase sequence labeling problem rather than binary
classification. Our model processes sequences of skeletal movements over time
windows, uses Transformer encoders to learn contextual embeddings, and
leverages Conditional Random Fields to perform sequence labeling. We evaluate
our proposal on a large dataset of diverse co-speech gestures in task-oriented
face-to-face dialogues. The results consistently demonstrate that our method
significantly outperforms strong baseline models in detecting gesture strokes.
Furthermore, applying Transformer encoders to learn contextual embeddings from
movement sequences substantially improves gesture unit detection. These results
highlight our framework's capacity to capture the fine-grained dynamics of
co-speech gesture phases, paving the way for more nuanced and accurate gesture
detection and analysis.
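
To make the contrast between per-frame classification and sequence labeling concrete, here is a minimal sketch of Viterbi decoding, the inference step used by linear-chain Conditional Random Fields. The label set, the hand-written transition table and the emission scores are all assumptions for illustration. In the paper's model the emission scores would come from the Transformer encoder and the transition scores would be learned, not fixed by hand.

```python
NEG_INF = float("-inf")
LABELS = ["O", "prep", "stroke", "retr"]  # hypothetical phase label set

# Assumed phase grammar: outside -> preparation -> stroke -> retraction -> outside.
ALLOWED = {
    ("O", "O"), ("O", "prep"),
    ("prep", "prep"), ("prep", "stroke"),
    ("stroke", "stroke"), ("stroke", "retr"),
    ("retr", "retr"), ("retr", "O"),
}
TRANSITIONS = [
    [0.0 if (a, b) in ALLOWED else NEG_INF for b in LABELS] for a in LABELS
]


def greedy_decode(emissions):
    """Pick the best label for each frame independently, ignoring context."""
    return [max(range(len(frame)), key=lambda j: frame[j]) for frame in emissions]


def viterbi_decode(emissions, transitions):
    """Return the label index sequence with the highest total score."""
    n_labels = len(transitions)
    score = list(emissions[0])
    backpointers = []
    for frame in emissions[1:]:
        new_score, pointers = [], []
        for j in range(n_labels):
            # Best previous label for landing on label j at this frame.
            best_i = max(range(n_labels), key=lambda i: score[i] + transitions[i][j])
            new_score.append(score[best_i] + transitions[best_i][j] + frame[j])
            pointers.append(best_i)
        score = new_score
        backpointers.append(pointers)
    # Trace back from the best final label.
    path = [max(range(n_labels), key=lambda j: score[j])]
    for pointers in reversed(backpointers):
        path.append(pointers[path[-1]])
    path.reverse()
    return path


# Made-up per-frame scores; frame 1 slightly prefers "stroke" over "prep".
emissions = [
    [2.0, 0.0, 0.0, 0.0],
    [0.0, 1.0, 1.5, 0.0],
    [0.0, 0.0, 2.0, 0.0],
    [0.0, 0.0, 0.0, 2.0],
    [2.0, 0.0, 0.0, 0.0],
]
print([LABELS[i] for i in greedy_decode(emissions)])
# ['O', 'stroke', 'stroke', 'retr', 'O']
print([LABELS[i] for i in viterbi_decode(emissions, TRANSITIONS)])
# ['O', 'prep', 'stroke', 'retr', 'O']
```

Frame-by-frame decoding jumps straight from "O" to "stroke", which the assumed grammar forbids. Joint decoding instead labels frame 1 as preparation, because that gives the best score over the whole sequence.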