Towards Direct Simultaneous Speech Translation

Abstract

Simultaneous speech translation (SimulST) is widely useful in many cross-lingual communication scenarios, including multinational conferences and international travel. While text-based simultaneous machine translation (SimulMT) has achieved great success in recent years, the conventional cascaded approach to SimulST, a pipeline of streaming ASR followed by simultaneous MT, suffers from error propagation and extra latency. Recent efforts attempt to translate the source speech directly into the target text or speech simultaneously, but this is much harder because it combines several separate tasks. In this dissertation, we focus on improving the simultaneous translation model, enabling it to handle speech input and directly generate the translated text in the target language. First, we investigate how to improve simultaneous translation by generating more monotonic pseudo references and incorporating them into training. These pseudo references contain fewer reorderings, require less anticipation, and can substantially improve simultaneous translation quality. Next, we propose an ASR-assisted direct SimulST framework, in which the model translates directly from the given speech with a wait-k policy guided by a synchronized streaming ASR. However, speech translation suffers from data scarcity. To alleviate this issue, we introduce a Fused Acoustic and Text Masked Language Model (FAT-MLM), which jointly learns a unified representation for both acoustic and text input from various types of corpora, including parallel data for speech recognition and machine translation, and even pure speech and text data. By fine-tuning from FAT-MLM, the speech translation model can be substantially improved. Finally, we extend FAT-MLM to cross-lingual speech synthesis: our proposed model clones the voice of the source speaker and generates the corresponding speech in the target language.
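For readers unfamiliar with the wait-k policy mentioned above, the following is a minimal illustrative sketch (not the dissertation's implementation): after reading the first k source tokens, the decoder alternates between emitting one target token and reading one more source token. The names source_stream and translate_step are hypothetical placeholders for a streaming input (e.g., incremental ASR output) and a single decoding step of a translation model.

def wait_k_decode(source_stream, k, translate_step):
    """Illustrative wait-k schedule: READ k source tokens first, then
    alternate WRITE (emit one target token) and READ (consume one more
    source token) until the model produces an end-of-sentence symbol."""
    read, target = [], []
    stream_done = False

    def read_next():
        nonlocal stream_done
        try:
            read.append(next(source_stream))
        except StopIteration:
            stream_done = True

    # READ the first k source tokens before emitting anything.
    for _ in range(k):
        read_next()

    while True:
        # WRITE one target token conditioned on the source prefix read so far.
        token = translate_step(read, target)
        if token == "</s>":  # model signals end of translation
            break
        target.append(token)
        # READ one more source token (if any remain) before the next WRITE.
        if not stream_done:
            read_next()
    return target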