DiffS2UT: A Semantic Preserving Diffusion Model for Textless Direct Speech-to-Speech Translation
While Diffusion Generative Models have achieved great success on image
generation tasks, how to efficiently and effectively incorporate them into
speech generation especially translation tasks remains a non-trivial problem.
Specifically, due to the low information density of speech data, the
transformed discrete speech unit sequence is much longer than the corresponding
text transcription, posing significant challenges to existing auto-regressive
models. Furthermore, naively applying discrete diffusion to the speech unit
sequence while disregarding the structure of the continuous space
significantly degrades generation performance. In this paper, we
propose a novel diffusion model by applying the diffusion forward process in
the \textit{continuous} speech representation space, while employing the
diffusion backward process in the \textit{discrete} speech unit space. In this
way, we preserve the semantic structure of the continuous speech representation
space in the diffusion process and integrate the continuous and discrete
diffusion models. We conduct extensive experiments on the textless direct
speech-to-speech translation task, where the proposed method achieves
comparable results to the computationally intensive auto-regressive baselines
(500 steps on average) with significantly fewer decoding steps (50 steps).
Comment: Accepted at EMNLP 2023 main conference
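The hybrid design described above can be illustrated with a minimal sketch: noise is added to *continuous* embeddings of the discrete speech units (forward process), and a denoising step maps noisy states back to *discrete* units by nearest-codebook lookup (backward process). The codebook, unit sequence, and noise level below are illustrative assumptions, not the paper's actual model.

```python
import numpy as np

rng = np.random.default_rng(0)
K, D = 8, 4                       # codebook size, embedding dimension (assumed)
codebook = rng.normal(size=(K, D))

def forward_noise(units, alpha_bar):
    """Forward process in the continuous space: embed units, add Gaussian noise."""
    x0 = codebook[units]                          # (T, D) continuous embeddings
    eps = rng.normal(size=x0.shape)
    return np.sqrt(alpha_bar) * x0 + np.sqrt(1 - alpha_bar) * eps

def backward_quantize(x_t):
    """Backward step in the discrete space: snap each state to its nearest unit."""
    d2 = ((x_t[:, None, :] - codebook[None, :, :]) ** 2).sum(-1)  # (T, K) sq. dists
    return d2.argmin(-1)

units = np.array([0, 3, 5, 3])
x_t = forward_noise(units, alpha_bar=0.999)   # near-noiseless for illustration
print(backward_quantize(x_t))                 # recovers the original units w.h.p.
```

Quantizing back to unit indices at each backward step is what keeps the process anchored to the discrete unit vocabulary while the noise itself lives in the continuous embedding space.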
Direct Speech-to-Text Translation Models as Students of Text-to-Text Models
Direct speech-to-text translation (ST) is an emerging approach that performs the ST task with a single neural model. Although this paradigm comes with the promise of outperforming traditional pipeline systems, its rise is still limited by the paucity of speech-translation paired corpora compared to the large amounts of speech-transcript and parallel bilingual corpora available to train previous solutions. As such, the research community has focused on techniques to transfer knowledge from automatic speech recognition (ASR) and machine translation (MT) models trained on huge datasets. In this paper, we extend and integrate our recent work (Gaido, Gangi, et al. 2020) analysing the best-performing approach to transfer learning from MT, which is represented by knowledge distillation (KD) in sequence-to-sequence models. After comparing the different KD methods to understand which one is the most effective, we extend our previous analysis of the effects – both in terms of benefits and drawbacks – to different language pairs in high-resource conditions, ensuring the generalisability of our findings. Altogether, these extensions complement and complete our investigation on KD for speech translation, leading to the following overall findings: i) the best training recipe involves word-level KD training followed by a fine-tuning step on the ST task, ii) word-level KD from MT can be detrimental for gender translation and can lead to output truncation (though these problems are alleviated by fine-tuning on the ST task), and iii) the quality of the ST student model strongly depends on the quality of the MT teacher model, although the correlation is not linear.
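The word-level KD recipe identified above as most effective can be sketched as a soft cross-entropy between the student's and the MT teacher's per-token vocabulary distributions. This is a minimal illustration, assuming logit arrays of shape (sequence length, vocabulary size); `T` is a distillation temperature, and none of the names come from the paper's code.

```python
import numpy as np

def word_level_kd_loss(student_logits, teacher_logits, T=1.0):
    """Word-level KD sketch: at each target position, the student is trained to
    match the MT teacher's distribution over the vocabulary (soft cross-entropy).
    student_logits, teacher_logits: arrays of shape (seq_len, vocab_size)."""
    def softmax(z):
        z = (z - z.max(-1, keepdims=True)) / T
        e = np.exp(z)
        return e / e.sum(-1, keepdims=True)
    p_teacher = softmax(teacher_logits)
    log_p_student = np.log(softmax(student_logits) + 1e-12)
    return -(p_teacher * log_p_student).sum(-1).mean()

t = np.array([[2.0, 0.0, 0.0]])
print(word_level_kd_loss(t, t))  # lowest achievable loss: the teacher's entropy
```

By Gibbs' inequality the loss is minimised when the student reproduces the teacher's distribution exactly; in the recipe above, this distillation stage is then followed by fine-tuning on the ST task with ordinary label cross-entropy.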
On the Locality of Attention in Direct Speech Translation
Transformers have achieved state-of-the-art results across multiple NLP
tasks. However, the self-attention mechanism complexity scales quadratically
with the sequence length, creating an obstacle for tasks involving long
sequences, like in the speech domain. In this paper, we discuss the usefulness
of self-attention for Direct Speech Translation. First, we analyze the
layer-wise token contributions in the self-attention of the encoder, unveiling
local diagonal patterns. To prove that some attention weights are avoidable, we
propose to substitute the standard self-attention with a local efficient one,
setting the amount of context used based on the results of the analysis. With
this approach, our model matches the baseline performance, and improves the
efficiency by skipping the computation of those weights that standard attention
discards.
Comment: ACL-SRW 2022. Equal contribution between Belen Alastruey and Javier Ferrando.
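The substitution described in the abstract (full self-attention replaced by a local, windowed one) can be sketched with a banded attention mask; the single-head layout and window parameter below are illustrative assumptions, not the paper's configuration.

```python
import numpy as np

def local_attention(q, k, v, window):
    """Single-head attention sketch where position i attends only to positions j
    with |i - j| <= window; everything outside the band is masked out, skipping
    the weights that standard attention would compute and then discard.
    q, k, v: arrays of shape (T, d)."""
    T, d = q.shape
    scores = q @ k.T / np.sqrt(d)
    i, j = np.indices((T, T))
    scores[np.abs(i - j) > window] = -np.inf   # banded (local) mask
    w = np.exp(scores - scores.max(-1, keepdims=True))
    w = w / w.sum(-1, keepdims=True)
    return w @ v
```

With `window = 0` each position attends only to itself (the output equals `v`), and a window covering the whole sequence recovers standard attention. An efficient implementation would avoid materialising the full (T, T) score matrix; the dense mask here is only for clarity.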
Low-resource speech translation
We explore the task of speech-to-text translation (ST), where speech in one language
(source) is converted to text in a different one (target). Traditional ST systems go
through an intermediate step where the source language speech is first converted to
source language text using an automatic speech recognition (ASR) system, which
is then converted to target language text using a machine translation (MT) system.
However, this pipeline-based approach is impractical for unwritten languages spoken by
millions of people around the world, leaving them without access to free and automated
translation services such as Google Translate. The lack of such translation services can
have important real-world consequences. For example, in the aftermath of a disaster
scenario, easily available translation services can help better co-ordinate relief efforts.
How can we expand the coverage of automated ST systems to include scenarios which
lack source language text? In this thesis we investigate one possible solution: we
build ST systems to directly translate source language speech into target language text,
thereby forgoing the dependency on source language text. To build such a system, we
use only speech data paired with text translations as training data. We also specifically
focus on low-resource settings, where we expect at most tens of hours of training data
to be available for unwritten or endangered languages.
Our work can be broadly divided into three parts. First we explore how we can leverage
prior work to build ST systems. We find that neural sequence-to-sequence models are
an effective and convenient method for ST, but produce poor quality translations when
trained in low-resource settings.
In the second part of this thesis, we explore methods to improve the translation performance
of our neural ST systems which do not require labeling additional speech
data in the low-resource language, a potentially tedious and expensive process. Instead
we exploit labeled speech data for high-resource languages which is widely available
and relatively easier to obtain. We show that pretraining a neural model with ASR data
from a high-resource language, different from both the source and target ST languages,
improves ST performance.
In the final part of our thesis, we study whether ST systems can be used to build
applications which have traditionally relied on the availability of ASR systems, such
as information retrieval, clustering audio documents, or question/answering. We build
proof-of-concept systems for two downstream applications: topic prediction for speech
and cross-lingual keyword spotting. Our results indicate that low-resource ST systems
can still outperform simple baselines for these tasks, leaving the door open for further
exploratory work.
This thesis provides, for the first time, an in-depth study of neural models for the
task of direct ST across a range of training data settings on a realistic multi-speaker
speech corpus. Our contributions include a set of open-source tools to encourage further
research.
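The pretraining-and-transfer recipe from the second part of the thesis can be sketched as follows: train a model on high-resource ASR, then initialise the ST model's encoder from it while the decoder starts from scratch. The parameter names and flat dict layout are hypothetical, purely for illustration.

```python
def init_st_from_asr(asr_params, st_params):
    """Transfer sketch: copy encoder weights from a pretrained ASR model into
    the ST model; decoder (and any other) weights keep their fresh values."""
    out = dict(st_params)
    for name, weight in asr_params.items():
        if name.startswith("encoder.") and name in out:
            out[name] = weight
    return out

asr = {"encoder.layer0.w": 1.0, "decoder.layer0.w": 2.0}
st  = {"encoder.layer0.w": 0.0, "decoder.layer0.w": 0.0}
print(init_st_from_asr(asr, st))  # encoder weights from ASR, decoder untouched
```

The point of the recipe is that the ASR language need not match either ST language: the encoder mainly learns to map acoustics to useful representations, which transfers even across languages.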
Relative Positional Encoding for Speech Recognition and Direct Translation
Transformer models are powerful sequence-to-sequence architectures that are
capable of directly mapping speech inputs to transcriptions or translations.
However, the mechanism for modeling positions in this model was tailored for
text modeling, and thus is less ideal for acoustic inputs. In this work, we
adapt the relative position encoding scheme to the Speech Transformer, where
the key addition is relative distance between input states in the
self-attention network. As a result, the network can better adapt to the
variable distributions present in speech data. Our experiments show that our
resulting model achieves the best recognition result on the Switchboard
benchmark in the non-augmentation condition, and the best published result in
the MuST-C speech translation benchmark. We also show that this model is able
to better utilize synthetic data than the Transformer, and adapts better to
variable sentence segmentation quality for speech translation.
Comment: Submitted to Interspeech 2020
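The key addition the abstract describes (a relative-distance term in the self-attention network) follows the general relative position encoding idea; the sketch below uses a learned scalar bias per clipped relative distance, which is one common formulation and an assumption here, not necessarily the paper's exact parameterisation.

```python
import numpy as np

def rel_attention(q, k, v, rel_bias, max_dist):
    """Self-attention sketch with a relative position term: the score for the
    pair (i, j) gets a learned bias indexed by the clipped distance j - i.
    q, k, v: (T, d); rel_bias: (2 * max_dist + 1,) learned scalars."""
    T, d = q.shape
    scores = q @ k.T / np.sqrt(d)                          # content term
    i, j = np.indices((T, T))
    dist = np.clip(j - i, -max_dist, max_dist) + max_dist  # indices 0..2*max_dist
    scores = scores + rel_bias[dist]                       # relative position term
    w = np.exp(scores - scores.max(-1, keepdims=True))
    w = w / w.sum(-1, keepdims=True)
    return w @ v
```

Because the bias depends only on the distance j - i rather than on absolute positions, the model generalises across the variable segment lengths typical of speech input; with an all-zero bias it reduces to standard scaled dot-product attention.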