70 research outputs found

    Experiments on Longitudinal and Transverse Bedload Transport in Sine-Generated Meandering Channels

    Bedload grains in consecutive meandering bends either move longitudinally or cross the channel centerline. This study traces and quantifies grain movement in two laboratory sine-generated channels, one with deflection angle θ0 = 30° and the other with θ0 = 110°. The grains initially paved along the channels are uniform in size (D = 1 mm) and are dyed in different colors according to their starting location. The experiments recorded the changes in flow patterns, bed deformation, and the gain-loss distribution of the colored grains in the pool-bar complexes. Two types of erosion zones form as the bed deforms: Zone 1 on the foreside of the point bars and Zone 2 near the concave bank downstream of the bend apexes. Most grains eroded from Zone 1 move longitudinally rather than crossing the channel centerline. In contrast, the dominant direction of the grains eroded from Zone 2 shifts from longitudinal to transverse as the bed topography evolves. Moreover, most of the building material of the point bars comes from the upstream bends, although the low- and highly curved channels behave differently.
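    The centerline of a sine-generated channel is conventionally described by a direction angle that varies sinusoidally with arc length, θ(s) = θ0 sin(2πs/M), where M is the arc length of one meander wavelength. As a minimal sketch of the two planforms studied here (the 2 m wavelength and all variable names are illustrative assumptions, not values from the study), the centerline coordinates can be obtained by integrating the direction angle:

    import numpy as np

    def sine_generated_centerline(theta0_deg, channel_length=2.0, n=1000):
        """Planform (x, y) of one meander wavelength of a sine-generated curve.

        theta0_deg    : maximum deflection angle of the centerline (degrees)
        channel_length: arc length of one meander wavelength (illustrative value)
        """
        theta0 = np.radians(theta0_deg)
        s = np.linspace(0.0, channel_length, n)                     # arc-length coordinate
        theta = theta0 * np.sin(2.0 * np.pi * s / channel_length)   # direction angle along s
        ds = s[1] - s[0]
        x = np.cumsum(np.cos(theta)) * ds                           # integrate dx = cos(theta) ds
        y = np.cumsum(np.sin(theta)) * ds                           # integrate dy = sin(theta) ds
        return x, y

    # The two planforms used in the experiments: mildly and sharply curved bends.
    x30, y30 = sine_generated_centerline(30.0)
    x110, y110 = sine_generated_centerline(110.0)

    Larger θ0 produces tighter, more strongly skewed bends, which is why the two channels develop different pool-bar complexes and erosion zones.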

    Mega-TTS 2: Zero-Shot Text-to-Speech with Arbitrary Length Speech Prompts

    Zero-shot text-to-speech aims to synthesize voices from unseen speech prompts. Previous large-scale multi-speaker TTS models have achieved this goal with an enrolled recording of under 10 seconds, but most of them are designed to use only short speech prompts. The limited information in short speech prompts significantly hinders fine-grained identity imitation. In this paper, we introduce Mega-TTS 2, a generic zero-shot multi-speaker TTS model capable of synthesizing speech for unseen speakers with arbitrary-length prompts. Specifically, we 1) design a multi-reference timbre encoder to extract timbre information from multiple reference speeches and 2) train a prosody language model with arbitrary-length speech prompts. With these designs, our model is suitable for prompts of different lengths, which extends the upper bound of speech quality for zero-shot text-to-speech. Beyond arbitrary-length prompts, we introduce arbitrary-source prompts, which leverage the probabilities derived from multiple P-LLM outputs to produce expressive and controlled prosody. Furthermore, we propose a phoneme-level auto-regressive duration model to bring in-context learning capabilities to duration modeling. Experiments demonstrate that our method not only synthesizes identity-preserving speech from a short prompt of an unseen speaker but also achieves improved performance with longer speech prompts. Audio samples can be found at https://mega-tts.github.io/mega2_demo/
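    The abstract does not spell out how the probabilities from multiple P-LLM outputs are combined. A minimal sketch of one plausible reading, mixing the next-token distributions produced by the prosody language model under several different prompts (the function and tensor names here are illustrative assumptions, not the released code), could look like:

    import torch

    def fuse_prosody_distributions(per_prompt_logits, weights=None):
        """Combine next-token probabilities from multiple P-LLM passes.

        per_prompt_logits: list of tensors, each of shape (vocab_size,),
                           one per reference prompt / prompt source.
        weights          : optional per-prompt weights used to steer prosody
                           toward particular reference styles.
        """
        probs = torch.stack([torch.softmax(l, dim=-1) for l in per_prompt_logits])
        if weights is None:
            weights = torch.full((probs.shape[0],), 1.0 / probs.shape[0])
        weights = weights / weights.sum()                    # normalize the mixture weights
        fused = (weights.unsqueeze(-1) * probs).sum(dim=0)   # mixture of the distributions
        return fused

    # Sample the next prosody token from the fused distribution.
    logits_a = torch.randn(1024)   # P-LLM output conditioned on prompt source A
    logits_b = torch.randn(1024)   # P-LLM output conditioned on prompt source B
    fused = fuse_prosody_distributions([logits_a, logits_b], torch.tensor([0.7, 0.3]))
    next_token = torch.multinomial(fused, num_samples=1)

    Adjusting the mixture weights between prompt sources is one way such a scheme could trade off between the prosodic styles of the different references.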

    AV-TranSpeech: Audio-Visual Robust Speech-to-Speech Translation

    Direct speech-to-speech translation (S2ST) aims to convert speech in one language into speech in another and has made significant progress to date. Despite this recent success, current S2ST models still degrade markedly in noisy environments and fail to translate visual speech (i.e., the movement of lips and teeth). In this work, we present AV-TranSpeech, the first audio-visual speech-to-speech translation (AV-S2ST) model that does not rely on intermediate text. AV-TranSpeech complements the audio stream with visual information to improve robustness and opens up practical applications such as dictation or dubbing archival films. To mitigate the scarcity of parallel AV-S2ST data, we 1) explore self-supervised pre-training with unlabeled audio-visual data to learn contextual representations, and 2) introduce cross-modal distillation from S2ST models trained on an audio-only corpus to further reduce the visual-data requirements. Experimental results on two language pairs demonstrate that AV-TranSpeech outperforms audio-only models under all settings regardless of the type of noise. With low-resource audio-visual data (10 h, 30 h), cross-modal distillation yields an average improvement of 7.6 BLEU over the baselines. Audio samples are available at https://AV-TranSpeech.github.io (Accepted to ACL 2023)
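    The abstract names cross-modal distillation from audio-only S2ST models without giving its exact form. A minimal sketch, assuming the common recipe of a KL divergence between the audio-only teacher's distribution over target speech units and the audio-visual student's (the function and tensor names are placeholders, not the released code), might look like:

    import torch
    import torch.nn.functional as F

    def cross_modal_distillation_loss(student_logits, teacher_logits, temperature=1.0):
        """KL divergence pushing the audio-visual student toward the
        audio-only teacher's distribution over target speech units.

        student_logits, teacher_logits: (batch, seq_len, num_units)
        """
        t = temperature
        student_logp = F.log_softmax(student_logits / t, dim=-1)
        teacher_p = F.softmax(teacher_logits / t, dim=-1)
        # batchmean KL, rescaled by t^2 as in standard knowledge distillation
        return F.kl_div(student_logp, teacher_p, reduction="batchmean") * (t * t)

    # Example: a teacher trained on audio-only S2ST provides soft targets for the
    # low-resource audio-visual student.
    student_logits = torch.randn(4, 50, 1000, requires_grad=True)
    with torch.no_grad():
        teacher_logits = torch.randn(4, 50, 1000)
    loss = cross_modal_distillation_loss(student_logits, teacher_logits, temperature=2.0)
    loss.backward()

    Because the teacher needs only audio, this kind of objective lets the plentiful audio-only corpus supervise the student even when parallel audio-visual data is limited to tens of hours.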