3 research outputs found

    Towards cross-language prosody transfer for dialog

    Speech-to-speech translation systems today do not adequately support use for dialog purposes. In particular, nuances of speaker intent and stance can be lost due to improper prosody transfer. We present an exploration of what needs to be done to overcome this. First, we developed a data collection protocol in which bilingual speakers re-enact utterances from an earlier conversation in their other language, and used this to collect an English-Spanish corpus, so far comprising 1871 matched utterance pairs. Second, we developed a simple prosodic dissimilarity metric based on Euclidean distance over a broad set of prosodic features. We then used these to investigate cross-language prosodic differences, measure the likely utility of three simple baseline models, and identify phenomena which will require more powerful modeling. Our findings should inform future research on cross-language prosody and the design of speech-to-speech translation systems capable of effective prosody transfer.
    Comment: Accepted to Interspeech 202
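The abstract above describes its metric only as Euclidean distance over a set of prosodic features. A minimal sketch of that idea follows. The per-feature standardization step, the function names, and the toy feature values are assumptions made for illustration; they are not taken from the paper, which may normalize or weight its features differently.

```python
import math
from statistics import mean, pstdev
from typing import List, Sequence


def standardize(rows: Sequence[Sequence[float]]) -> List[List[float]]:
    """Z-score each feature column so that no single feature dominates.

    Assumption: the paper's exact normalization is not stated in the
    abstract; z-scoring is used here only as a common default.
    """
    columns = list(zip(*rows))
    means = [mean(col) for col in columns]
    stds = [pstdev(col) for col in columns]
    return [
        [(x - m) / s if s > 0 else 0.0 for x, m, s in zip(row, means, stds)]
        for row in rows
    ]


def prosodic_dissimilarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Euclidean distance between two equal-length prosodic feature vectors."""
    return math.dist(a, b)  # raises ValueError if the lengths differ


# Hypothetical feature vectors (e.g. mean pitch, intensity) for one
# English utterance and its Spanish re-enactment.
english, spanish = standardize([[1.0, 10.0], [3.0, 30.0]])
score = prosodic_dissimilarity(english, spanish)
```

A lower score means the two utterances are closer in the chosen feature space. In practice the standardization statistics would be computed over the whole corpus rather than over a single pair, as done here for brevity.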

    End-to-End Simultaneous Speech Translation

    Speech translation is the task of translating speech in one language to text or speech in another language, while simultaneous translation aims at lower translation latency by starting the translation before the speaker finishes a sentence. The combination of the two, simultaneous speech translation, can be applied in low-latency scenarios such as live video caption translation and real-time interpretation. This thesis focuses on an end-to-end, or direct, approach to simultaneous speech translation. We first define the task of simultaneous speech translation, including its challenges and evaluation metrics. We then progressively introduce our contributions to tackle these challenges. First, we propose a novel simultaneous translation policy, monotonic multihead attention, for transformer models on text-to-text translation. Second, we investigate the issues and potential solutions when adapting text-to-text simultaneous policies to end-to-end speech-to-text translation models. Third, we introduce the augmented memory transformer encoder for simultaneous speech-to-text translation models for better computation efficiency. Fourth, we explore direct simultaneous speech translation with a variational monotonic multihead attention policy, based on recent speech-to-unit models. Finally, we provide some directions for potential future research.