Experiments on Longitudinal and Transverse Bedload Transport in Sine-Generated Meandering Channels
Bedload grains in consecutive meandering bends either move longitudinally or cross the channel centerline. This study traces and quantifies the grains' movement in two laboratory sine-generated channels, one with deflection angle θ0 = 30° and the other with θ0 = 110°. The grains originally paved along the channels are uniform in size (D = 1 mm) and are dyed in various colors according to their initial location. The experiments recorded the changes in the flow patterns, the bed deformation, and the gain-loss distribution of the colored grains in the pool-bar complexes. We observed the formation of two types of erosion zones during the bed deformation: Zone 1 on the foreside of the point bars and Zone 2 near the concave bank downstream of the bend apexes. Most grains eroded from Zone 1 were observed moving longitudinally rather than crossing the channel centerline. By contrast, the dominant moving direction of the grains eroded from Zone 2 shifts from longitudinal to transverse as the bed topography evolves. In addition, most of the building material of the point bars comes from the upstream bends, although low- and highly-curved channels behave differently.
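For context, a sine-generated channel is one whose centerline direction varies sinusoidally with streamwise arc length (the Langbein-Leopold curve); the deflection angle θ0 quoted above is the amplitude of that variation at the crossovers. A standard statement of the geometry, given here as background rather than from the abstract itself:

```latex
% Sine-generated curve: direction angle \theta versus arc length s.
%   \theta_0 : deflection angle at the crossovers (30° or 110° here)
%   L        : arc length of one meander wavelength
\theta(s) = \theta_0 \sin\!\left(\frac{2\pi s}{L}\right),
\qquad
x(s) = \int_0^{s} \cos\theta(u)\,\mathrm{d}u, \quad
y(s) = \int_0^{s} \sin\theta(u)\,\mathrm{d}u
```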
Mega-TTS 2: Zero-Shot Text-to-Speech with Arbitrary Length Speech Prompts
Zero-shot text-to-speech aims to synthesize voices from unseen speech prompts. Previous large-scale multi-speaker TTS models have achieved this goal with an enrolled recording of under 10 seconds, but most of them are designed to use only short speech prompts. The limited information in short speech prompts significantly hinders fine-grained identity imitation. In this paper, we introduce Mega-TTS 2, a generic zero-shot multi-speaker TTS model capable of synthesizing speech for unseen speakers with arbitrary-length prompts. Specifically, we 1) design a multi-reference timbre encoder to extract timbre information from multiple reference utterances, and 2) train a prosody language model with arbitrary-length speech prompts. With these designs, our model is suitable for prompts of different lengths, which extends the upper bound of speech quality for zero-shot text-to-speech. Beyond arbitrary-length prompts, we introduce arbitrary-source prompts, which leverage the probabilities derived from multiple P-LLM outputs to produce expressive and controlled prosody. Furthermore, we propose a phoneme-level auto-regressive duration model to bring in-context learning capabilities to duration modeling. Experiments demonstrate that our method can not only synthesize identity-preserving speech with a short prompt from an unseen speaker but also achieve improved performance with longer speech prompts. Audio samples can be found at https://mega-tts.github.io/mega2_demo/
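The abstract does not spell out how the probabilities from multiple P-LLM outputs are combined. The snippet below is a minimal sketch assuming a simple convex combination of the per-source next-token distributions; the function name and weighting scheme are illustrative, not the paper's actual method.

```python
import numpy as np

def fuse_prosody_distributions(prob_sets, weights=None):
    """Hypothetical arbitrary-source prompting: blend the next-token
    probability distributions produced by running the prosody language
    model (P-LLM) on several different prompt sources.

    prob_sets : list of arrays, each of shape (vocab_size,), one per
                prompt source; each is assumed to already sum to 1.
    weights   : optional per-source mixing weights summing to 1.
    """
    probs = np.stack(prob_sets)                  # (n_sources, vocab_size)
    if weights is None:
        weights = np.full(len(prob_sets), 1.0 / len(prob_sets))
    fused = weights @ probs                      # convex combination
    return fused / fused.sum()                   # renormalize for safety

# Toy usage: steer prosody toward source B by weighting it more heavily.
p_a = np.array([0.7, 0.2, 0.1])
p_b = np.array([0.1, 0.3, 0.6])
print(fuse_prosody_distributions([p_a, p_b], weights=np.array([0.3, 0.7])))
```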
AV-TranSpeech: Audio-Visual Robust Speech-to-Speech Translation
Direct speech-to-speech translation (S2ST) aims to convert speech from one language into another and has demonstrated significant progress to date. Despite this recent success, current S2ST models still suffer from marked degradation in noisy environments and fail to translate visual speech (i.e., the movement of lips and teeth). In this work, we present AV-TranSpeech, the first audio-visual speech-to-speech translation (AV-S2ST) model that does not rely on intermediate text. AV-TranSpeech complements the audio stream with visual information to improve system robustness and opens up a host of practical applications, such as dictation or dubbing archival films. To mitigate the scarcity of parallel AV-S2ST data, we 1) explore self-supervised pre-training with unlabeled audio-visual data to learn contextual representations, and 2) introduce cross-modal distillation from S2ST models trained on audio-only corpora to further reduce the visual-data requirements. Experimental results on two language pairs demonstrate that AV-TranSpeech outperforms audio-only models under all settings regardless of the type of noise. With low-resource audio-visual data (10h, 30h), cross-modal distillation yields an improvement of 7.6 BLEU on average compared with baselines. Audio samples are available at https://AV-TranSpeech.github.io (Accepted to ACL 2023.)
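As a rough illustration of the cross-modal distillation idea, the sketch below uses a generic knowledge-distillation objective in which an audio-only S2ST teacher provides soft targets for the audio-visual student; this is a minimal sketch of the general technique, not the paper's exact formulation.

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, temperature=1.0):
    """Generic cross-modal distillation objective (assumed form): an
    S2ST teacher trained on audio-only data provides soft targets for
    the audio-visual student, reducing the parallel AV-S2ST data needed.

    student_logits, teacher_logits : (batch, seq_len, vocab) logits over
    the discrete target-speech units predicted at each decoding step.
    """
    t = temperature
    student_logp = F.log_softmax(student_logits / t, dim=-1)
    teacher_p = F.softmax(teacher_logits.detach() / t, dim=-1)
    # KL(teacher || student); the t^2 factor keeps gradient scale
    # comparable across temperatures, as in standard distillation.
    return F.kl_div(student_logp, teacher_p, reduction="batchmean") * (t * t)

# Toy usage with random tensors standing in for model outputs.
student = torch.randn(2, 5, 100)  # audio-visual student logits
teacher = torch.randn(2, 5, 100)  # audio-only teacher logits
print(distillation_loss(student, teacher, temperature=2.0).item())
```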
Unveiling the phonon scattering mechanisms in half-Heusler thermoelectric compounds
Half-Heusler (HH) compounds are among the most promising thermoelectric (TE) materials for large-scale applications, owing to superior properties such as a high power factor, excellent mechanical and thermal reliability, and non-toxicity. Their main drawback is their persistently high lattice thermal conductivity. Various mechanisms have been reported, with claimed effectiveness, to enhance phonon scattering in HH compounds, including grain-boundary scattering, phase separation, and electron-phonon interaction. In this work, however, we show that point-defect scattering has been the dominant phonon-scattering mechanism, apart from the intrinsic phonon-phonon interaction, for ZrCoSb and possibly many other HH compounds. Induced by the charge-compensation effect, the formation of Co/4d Frenkel point defects is responsible for the drastic reduction of the lattice thermal conductivity in ZrCoSb1-xSnx. Our work systematically depicts the phonon-scattering profile of HH compounds and illuminates subsequent material optimizations.
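For background, point-defect phonon scattering of the kind invoked here is commonly modeled with the standard Klemens-type rate, which explains why dilute defects can strongly suppress lattice thermal conductivity: the rate grows as the fourth power of phonon frequency. This expression is given for context; the paper's specific fitting procedure may differ.

```latex
% Klemens point-defect scattering rate (standard model):
%   \tau_{PD} : phonon relaxation time due to point defects
%   V_0       : average atomic volume,  v : average sound velocity
%   \Gamma    : disorder parameter from mass and strain-field contrast
\tau_{\mathrm{PD}}^{-1}(\omega) = \frac{V_0\,\Gamma\,\omega^{4}}{4\pi v^{3}}
```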