21 research outputs found

    Virtuoso: Massive Multilingual Speech-Text Joint Semi-Supervised Learning for Text-To-Speech

    This paper proposes Virtuoso, a massively multilingual speech-text joint semi-supervised learning framework for text-to-speech synthesis (TTS) models. Existing multilingual TTS typically supports tens of languages, which are a small fraction of the thousands of languages in the world. One difficulty in scaling multilingual TTS to hundreds of languages is collecting high-quality speech-text paired data in low-resource languages. This study extends Maestro, a speech-text joint pretraining framework for automatic speech recognition (ASR), to speech generation tasks. To train a TTS model from various types of speech and text data, different training schemes are designed to handle supervised (paired TTS and ASR data) and unsupervised (untranscribed speech and unspoken text) datasets. Experimental evaluation shows that 1) multilingual TTS models trained on Virtuoso can achieve significantly better naturalness and intelligibility than baseline models in seen languages, and 2) they can synthesize reasonably intelligible and natural-sounding speech for unseen languages where no high-quality paired TTS data is available.
    Comment: Submitted to ICASSP 202

    LibriTTS-R: A Restored Multi-Speaker Text-to-Speech Corpus

    This paper introduces a new speech dataset called "LibriTTS-R" designed for text-to-speech (TTS) use. It is derived by applying speech restoration to the LibriTTS corpus, which consists of 585 hours of speech data at a 24 kHz sampling rate from 2,456 speakers and the corresponding texts. The constituent samples of LibriTTS-R are identical to those of LibriTTS, with only the sound quality improved. Experimental results show that the LibriTTS-R ground-truth samples have significantly better sound quality than those in LibriTTS. In addition, neural end-to-end TTS trained with LibriTTS-R achieved speech naturalness on par with that of the ground-truth samples. The corpus is freely available for download from http://www.openslr.org/141/.
    Comment: Accepted to Interspeech 202

    Miipher: A Robust Speech Restoration Model Integrating Self-Supervised Speech and Text Representations

    Speech restoration (SR) is the task of converting degraded speech signals into high-quality ones. In this study, we propose a robust SR model called Miipher, and apply Miipher to a new SR application: increasing the amount of high-quality training data for speech generation by converting speech samples collected from the Web to studio quality. To make our SR model robust against various degradations, we use (i) a speech representation extracted from w2v-BERT as the input feature, and (ii) a text representation extracted from transcripts via PnG-BERT as a linguistic conditioning feature. Experiments show that Miipher (i) is robust against various audio degradations and (ii) enables us to train a high-quality text-to-speech (TTS) model from restored speech samples collected from the Web. Audio samples are available at our demo page: google.github.io/df-conformer/miipher/
    Comment: Accepted to WASPAA 202

    Engineering thermal conductance using a two-dimensional phononic crystal

    Abstract. Controlling thermal transport has become relevant in recent years. Traditionally, this control has been achieved by tuning the scattering of phonons by including various types of scattering centres in the material (nanoparticles, impurities, etc). Here we take another approach and demonstrate that one can also use coherent band structure effects to control phonon thermal conductance, with the help of periodically nanostructured phononic crystals. We perform the experiments at low temperatures below 1 K, which not only leads to negligible bulk phonon scattering, but also increases the wavelength of the dominant thermal phonons by more than two orders of magnitude compared to room temperature. Thus, phononic crystals with lattice constants ≥1 μm are shown to strongly reduce the thermal conduction. The observed effect is in quantitative agreement with the theoretical calculation presented, which accurately determined the ballistic thermal conductance in a phononic crystal device.

    Operation of superconducting nano-stripline detector (SSLD) mounted on cryogen-free cryostat

    Recently, various types of superconducting detectors have been applied to time-of-flight mass spectrometers (TOF MS) because they can achieve 100% detection efficiency over a wide mass range, from atoms to huge biomolecules. Such wide mass range coverage is impossible with conventional microchannel plate (MCP) ion detectors. Superconducting stripline detectors (SSLD), which consist of several hundred superconducting nanostrips with a width of < 1 μm and a thickness of a few tens of nm, have a high sensitivity for biomolecules and a response time of ∼1 ns that cannot be achieved by other superconducting detectors. For the practical use of SSLD, an easy operation system is necessary. In this study, we present the proper operation of an SSLD mounted on a cryogen-free pulse tube cryostat.