Augmenting Librispeech with French Translations: A Multimodal Corpus for Direct Speech Translation Evaluation
Recent works in spoken language translation (SLT) have attempted to build
end-to-end speech-to-text translation without using source language
transcription during learning or decoding. However, while large quantities of
parallel texts (such as Europarl, OpenSubtitles) are available for training
machine translation systems, there are no large (>100h), open-source parallel
corpora that include speech in a source language aligned to text in a target
language. This paper tries to fill this gap by augmenting an existing
(monolingual) corpus: LibriSpeech. This corpus, used for automatic speech
recognition, is derived from read audiobooks from the LibriVox project, and has
been carefully segmented and aligned. After gathering French e-books
corresponding to the English audio-books from LibriSpeech, we align speech
segments at the sentence level with their respective translations and obtain
236h of usable parallel data. This paper presents the details of the processing
as well as a manual evaluation conducted on a small subset of the corpus. This
evaluation shows that the automatic alignment scores are reasonably correlated
with the human judgments of the bilingual alignment quality. We believe that
this corpus (which is made available online) is useful for replicable
experiments in direct speech translation or more general spoken language translation.
Comment: LREC 2018, Japan
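To make the sentence-level alignment step concrete, here is a minimal sketch of a Gale-Church-style length-based alignment score for an English/French sentence pair. This is an illustration only, not the paper's actual pipeline: the function name, the mean-ratio and variance constants, and the character-length model are all assumptions.

```python
import math

# A minimal sketch (assumed, not the paper's tooling) of a
# Gale-Church-style length-based score for one English/French
# sentence pair. Values near 0 suggest a plausible alignment.

MEAN_RATIO = 1.1  # assumed: French text runs slightly longer than English
VARIANCE = 6.8    # assumed variance of the per-character length model

def alignment_score(en_sentence: str, fr_sentence: str) -> float:
    """Absolute z-score of the observed length difference."""
    len_en, len_fr = len(en_sentence), len(fr_sentence)
    delta = (len_fr - len_en * MEAN_RATIO) / math.sqrt(len_en * VARIANCE)
    return abs(delta)

print(alignment_score("He opened the door.", "Il ouvrit la porte."))
# low score -> the pair is a plausible sentence-level alignment
```

In practice such length-based scores are typically combined with lexical cues before thresholding, which fits the paper's observation that automatic alignment scores track human judgments of alignment quality.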
Spoken content retrieval: A survey of techniques and technologies
Speech media, that is, digital audio and video containing spoken content, has blossomed in recent years. Large collections are accruing on the Internet as well as in private and enterprise settings. This growth has motivated extensive research on techniques and technologies that facilitate reliable indexing and retrieval. Spoken content retrieval (SCR) requires the combination of audio and speech processing technologies with methods from information retrieval (IR). SCR research initially investigated planned speech structured in document-like units, but has subsequently shifted focus to more informal spoken content produced spontaneously, outside of the studio and in conversational settings. This survey provides an overview of the field of SCR, encompassing component technologies, the relationship of SCR to text IR and automatic speech recognition, and user interaction issues. It is aimed at researchers with backgrounds in speech technology or IR who seek deeper insight into how these fields are integrated to support research and development, thus addressing the core challenges of SCR.
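As a rough, hypothetical illustration of the pipeline the survey describes (ASR transcripts indexed and searched with text-IR methods), the sketch below builds a toy inverted index over two invented transcripts and ranks documents by query-term overlap; a real system would use an ASR decoder and a ranked retrieval model such as BM25.

```python
from collections import defaultdict

# Toy SCR pipeline: (hypothetical) ASR transcripts are indexed and
# queried with plain text-IR term matching. Document ids and
# transcripts are invented; a real system would rank with BM25 or
# similar and cope with ASR errors.

transcripts = {
    "ep01": "the panel discussed speech retrieval and indexing",
    "ep02": "an interview about conversational speech in podcasts",
}

index = defaultdict(set)  # term -> set of document ids
for doc_id, text in transcripts.items():
    for term in text.split():
        index[term].add(doc_id)

def search(query):
    """Rank documents by the number of query terms they contain."""
    hits = defaultdict(int)
    for term in query.lower().split():
        for doc_id in index.get(term, ()):
            hits[doc_id] += 1
    return sorted(hits, key=hits.get, reverse=True)

print(search("speech retrieval"))  # ['ep01', 'ep02']
```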
Harnessing AI for Speech Reconstruction using Multi-view Silent Video Feed
Speechreading, or lipreading, is the technique of understanding speech and extracting phonetic features from a speaker's visual cues, such as the movement of the lips, face, teeth and tongue. It has a wide range of multimedia applications, such as surveillance, Internet telephony, and aids for people with hearing impairments. However, most work in speechreading has been limited to generating text from silent videos. Recently, research has started venturing into generating (audio) speech from silent video sequences, but there have been no developments thus far in handling divergent views and poses of a speaker. Thus, although multiple camera feeds of a speaker are often available, existing methods fail to use these feeds to cope with different poses. To this end, this paper presents the first multi-view speechreading and reconstruction system. This work extends the boundaries of multimedia research by putting forth a model which leverages silent video feeds from multiple cameras recording the same subject to generate intelligible speech for a speaker. Initial results confirm the usefulness of exploiting multiple camera views in building an efficient speechreading and reconstruction system. The results further indicate the optimal placement of cameras for maximum speech intelligibility. Finally, the paper lays out various innovative applications of the proposed system, focusing on its potentially far-reaching impact not just in the security arena but in many other multimedia analytics problems.
Comment: 2018 ACM Multimedia Conference (MM '18), October 22-26, 2018, Seoul, Republic of Korea
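The multi-view idea can be sketched as follows. This is a hypothetical PyTorch stand-in, not the authors' architecture: the per-view encoders, feature dimensions, concatenation fusion, and single-frame decoder are all invented for illustration.

```python
import torch
import torch.nn as nn

# Hypothetical multi-view sketch: each camera view is encoded
# separately, encodings are fused by concatenation, and a decoder
# predicts one frame of acoustic features (e.g. 80 mel bins).
# Layer sizes and the fusion scheme are assumptions.

class MultiViewSpeechReconstructor(nn.Module):
    def __init__(self, n_views=3, in_dim=512, feat_dim=256, audio_dim=80):
        super().__init__()
        self.encoders = nn.ModuleList([
            nn.Sequential(nn.Linear(in_dim, feat_dim), nn.ReLU())
            for _ in range(n_views)
        ])
        self.decoder = nn.Linear(n_views * feat_dim, audio_dim)

    def forward(self, views):
        # views: list of (batch, in_dim) per-view visual features
        fused = torch.cat([enc(v) for enc, v in zip(self.encoders, views)],
                          dim=-1)
        return self.decoder(fused)  # (batch, audio_dim)

model = MultiViewSpeechReconstructor()
views = [torch.randn(4, 512) for _ in range(3)]  # 3 camera feeds
print(model(views).shape)  # torch.Size([4, 80])
```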
Multimodal Speech Recognition for Language-Guided Embodied Agents
Benchmarks for language-guided embodied agents typically assume text-based
instructions, but deployed agents will encounter spoken instructions. While
Automatic Speech Recognition (ASR) models can bridge the input gap, erroneous
ASR transcripts can hurt the agents' ability to complete tasks. In this work,
we propose training a multimodal ASR model to reduce errors in transcribing
spoken instructions by considering the accompanying visual context. We train
our model on a dataset of spoken instructions, synthesized from the ALFRED task
completion dataset, where we simulate acoustic noise by systematically masking
spoken words. We find that utilizing visual observations facilitates masked
word recovery, with multimodal ASR models recovering up to 30% more masked
words than unimodal baselines. We also find that a text-trained embodied agent
successfully completes tasks more often by following transcribed instructions
from multimodal ASR models. github.com/Cylumn/embodied-multimodal-asr
Comment: 5 pages, 5 figures, 24th ISCA Interspeech Conference (INTERSPEECH 2023)
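The noise simulation described above can be illustrated with a short sketch. The uniform random masking policy, mask rate, and `<mask>` token below are assumptions made for illustration; the paper masks spoken words systematically.

```python
import random

# Assumed masking policy for illustration: each spoken word is
# replaced by a <mask> token with fixed probability. The paper
# masks words systematically rather than uniformly at random.

def mask_instruction(words, mask_rate=0.3, seed=0):
    rng = random.Random(seed)
    return ["<mask>" if rng.random() < mask_rate else w for w in words]

instruction = "pick up the mug next to the sink".split()
print(" ".join(mask_instruction(instruction)))
# e.g. "pick up the <mask> next to the sink" -- seeing a mug in the
# visual observation could let a multimodal ASR recover the word
```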
A Multi-modal Approach to Fine-grained Opinion Mining on Video Reviews
Despite recent advances in opinion mining for written reviews, few works
have tackled the problem for other types of reviews. In light of this issue,
we propose a multi-modal approach for mining fine-grained opinions from video
reviews that is able to determine the aspects of the item under review that are
being discussed and the sentiment orientation towards them. Our approach works
at the sentence level without the need for time annotations and uses features
derived from the audio, video and language transcriptions of its contents. We
evaluate our approach on two datasets and show that leveraging the video and
audio modalities consistently provides increased performance over text-only
baselines, providing evidence that these extra modalities are key to better understanding video reviews.
Comment: Second Grand Challenge and Workshop on Multimodal Language, ACL 2020
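A hypothetical sketch of the sentence-level fusion described above: per-sentence text, audio, and video features are concatenated and fed to an aspect head and a sentiment head. The feature dimensions and number of aspect classes are invented for illustration, not taken from the paper.

```python
import torch
import torch.nn as nn

# Hypothetical sentence-level fusion: concatenate text, audio, and
# video features for one sentence; predict the aspect being discussed
# and a sentiment polarity score. Dimensions and class count are
# invented for illustration.

class MultimodalOpinionModel(nn.Module):
    def __init__(self, text_dim=300, audio_dim=74, video_dim=35,
                 n_aspects=12):
        super().__init__()
        fused = text_dim + audio_dim + video_dim
        self.aspect_head = nn.Linear(fused, n_aspects)
        self.sentiment_head = nn.Linear(fused, 1)

    def forward(self, text, audio, video):
        x = torch.cat([text, audio, video], dim=-1)
        return self.aspect_head(x), self.sentiment_head(x)

model = MultimodalOpinionModel()
aspects, sentiment = model(torch.randn(1, 300),
                           torch.randn(1, 74),
                           torch.randn(1, 35))
print(aspects.shape, sentiment.shape)  # (1, 12) (1, 1)
```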
Affect-LM: A Neural Language Model for Customizable Affective Text Generation
Human verbal communication includes affective messages which are conveyed
through use of emotionally colored words. There has been a lot of research in
this direction but the problem of integrating state-of-the-art neural language
models with affective information remains an area ripe for exploration. In this
paper, we propose an extension to an LSTM (Long Short-Term Memory) language
model for generating conversational text, conditioned on affect categories. Our
proposed model, Affect-LM, enables us to customize the degree of emotional
content in generated sentences through an additional design parameter.
Perception studies conducted using Amazon Mechanical Turk show that Affect-LM
generates natural-looking emotional sentences without sacrificing grammatical
correctness. Affect-LM also learns affect-discriminative word representations,
and perplexity experiments show that additional affective information in
conversational text can improve language model prediction.
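The affect conditioning can be sketched as an LSTM language model whose next-word logits receive an affect-dependent bias scaled by a scalar beta, playing the role of the design parameter mentioned above. All sizes and the exact form of the bias are illustrative assumptions, not the paper's published formulation.

```python
import torch
import torch.nn as nn

# Hypothetical Affect-LM-style sketch: an LSTM language model whose
# next-word logits are shifted by an affect-dependent bias, scaled by
# a scalar beta that controls emotional content. Sizes and the exact
# bias form are assumptions, not the paper's formulation.

class AffectConditionedLM(nn.Module):
    def __init__(self, vocab=10000, emb=128, hidden=256, n_affects=5):
        super().__init__()
        self.embed = nn.Embedding(vocab, emb)
        self.lstm = nn.LSTM(emb, hidden, batch_first=True)
        self.context_proj = nn.Linear(hidden, vocab)    # context term
        self.affect_proj = nn.Linear(n_affects, vocab)  # affect bias

    def forward(self, tokens, affect_onehot, beta=1.0):
        h, _ = self.lstm(self.embed(tokens))
        bias = self.affect_proj(affect_onehot).unsqueeze(1)
        return self.context_proj(h) + beta * bias  # softmax -> P(w_t)

lm = AffectConditionedLM()
tokens = torch.randint(0, 10000, (2, 7))
affect = torch.eye(5)[[0, 3]]  # e.g. "happy" and "angry" one-hots
print(lm(tokens, affect, beta=1.5).shape)  # torch.Size([2, 7, 10000])
```

Raising beta at generation time biases sampling toward affect-colored words while the LSTM context term continues to enforce fluency, which mirrors the customizable emotional degree the abstract describes.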