Speech Enhancement Using Speech Synthesis Techniques
Traditional speech enhancement systems reduce noise by modifying the noisy signal to make it more like a clean signal, an approach that suffers from two problems: under-suppression of noise and over-suppression of speech. These problems create distortions in the enhanced speech and hurt the quality of the enhanced signal. We propose to utilize speech synthesis techniques for a higher-quality speech enhancement system. Synthesizing clean speech based on the noisy signal can produce outputs that are both noise-free and high quality. We first show that we can replace the noisy speech with a clean resynthesis drawn from a previously recorded clean-speech dictionary from the same speaker (concatenative resynthesis). Next, we show that, using a speech synthesizer (vocoder), we can create a clean resynthesis of the noisy speech for more than one speaker. We term this parametric resynthesis (PR). PR can generate better prosody from noisy speech than a TTS system that uses textual information only. Additionally, we can use the high-quality speech generation capability of neural vocoders for better-quality speech enhancement. When trained on data from enough speakers, these vocoders can generate speech from unseen speakers, both male and female, with quality similar to that of speakers seen in training. Finally, we show that using neural vocoders we can achieve better objective signal and overall quality than state-of-the-art speech enhancement systems and better subjective quality than an oracle mask-based system.
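To make the parametric-resynthesis idea concrete, the sketch below shows one way a prediction network could map noisy acoustic features to clean vocoder features, which a separately trained vocoder then turns into a waveform. The module names, feature choices, and sizes are illustrative assumptions, not the paper's exact architecture.

```python
# Minimal sketch of parametric resynthesis (illustrative, not the published model):
# a prediction network maps noisy log-mel frames to clean vocoder features,
# and a separately trained vocoder synthesizes a clean waveform from them.
import torch
import torch.nn as nn

class PredictionModel(nn.Module):
    """Maps noisy log-mel frames to (estimated) clean vocoder features."""
    def __init__(self, n_mels=80, hidden=256):
        super().__init__()
        self.rnn = nn.LSTM(n_mels, hidden, num_layers=2,
                           batch_first=True, bidirectional=True)
        self.proj = nn.Linear(2 * hidden, n_mels)

    def forward(self, noisy_mels):           # (batch, frames, n_mels)
        h, _ = self.rnn(noisy_mels)
        return self.proj(h)                  # predicted clean features

def parametric_resynthesis(noisy_mels, predictor, vocoder):
    """Enhance by resynthesis: predict clean features, then vocode them."""
    with torch.no_grad():
        clean_feats = predictor(noisy_mels)
        return vocoder(clean_feats)          # any mel-to-waveform vocoder

# The predictor would typically be trained with an L1/L2 loss against features
# of the parallel clean recording; `vocoder` stands in for any pretrained
# neural vocoder that accepts mel-spectrogram input.
```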
SpeechLMScore: Evaluating speech generation using speech language model
While human evaluation is the most reliable metric for evaluating speech
generation systems, it is generally costly and time-consuming. Previous studies
on automatic speech quality assessment address the problem by predicting human
evaluation scores with machine learning models. However, they rely on
supervised learning and thus suffer from high annotation costs and domain-shift
problems. We propose SpeechLMScore, an unsupervised metric to evaluate
generated speech using a speech language model. SpeechLMScore maps a speech
signal into discrete tokens and computes the average log-probability of
generating that token sequence under the language model.
Therefore, it requires no human annotation and is a highly scalable
framework. Evaluation results demonstrate that the proposed metric shows a
promising correlation with human evaluation scores on different speech
generation tasks including voice conversion, text-to-speech, and speech
enhancement.
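A minimal sketch of how such a score could be computed, assuming the speech has already been mapped to discrete tokens and that `speech_lm` is an autoregressive token language model returning next-token logits; both are placeholders rather than a specific released API.

```python
# Hedged sketch of the SpeechLMScore idea: average log-probability per token of
# a discrete speech-token sequence under an autoregressive speech LM.
import torch
import torch.nn.functional as F

def speech_lm_score(tokens: torch.LongTensor, speech_lm) -> float:
    """tokens: 1-D tensor of discrete speech units for one utterance."""
    tokens = tokens.unsqueeze(0)                      # (1, T)
    logits = speech_lm(tokens[:, :-1])                # next-token logits, (1, T-1, vocab)
    log_probs = F.log_softmax(logits, dim=-1)
    target = tokens[:, 1:]                            # tokens to be predicted
    token_lp = log_probs.gather(-1, target.unsqueeze(-1)).squeeze(-1)
    return token_lp.mean().item()                     # higher = more "speech-like"
```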
Voxtlm: unified decoder-only models for consolidating speech recognition/synthesis and speech/text continuation tasks
We propose a decoder-only language model, \textit{VoxtLM}, that can perform
four tasks: speech recognition, speech synthesis, text generation, and speech
continuation. VoxtLM integrates text vocabulary with discrete speech tokens
from self-supervised speech features and uses special tokens to enable
multitask learning. Compared to a single-task model, VoxtLM exhibits a
significant improvement in speech synthesis, with improvements in both speech
intelligibility from 28.9 to 5.6 and objective quality from 2.68 to 3.90.
VoxtLM also improves speech generation and speech recognition performance over
the single-task counterpart. VoxtLM is trained with publicly available data,
and the training recipes and model checkpoints will be open-sourced to make
the work fully reproducible.
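The following sketch illustrates how a shared text-plus-speech token vocabulary with special task tokens might be laid out for such a decoder-only model; the token names, offsets, and vocabulary sizes are assumptions for illustration, not VoxtLM's published configuration.

```python
# Illustrative layout of a joint text + discrete-speech vocabulary with special
# task tokens for multitask training (all names and sizes are assumptions).
TEXT_VOCAB_SIZE = 32_000        # e.g. a BPE text vocabulary
SPEECH_CODEBOOK_SIZE = 1_000    # discrete units from self-supervised features

SPECIAL_TOKENS = {
    "<asr>": 0, "<tts>": 1, "<textlm>": 2, "<speechlm>": 3,
    "<start_text>": 4, "<start_speech>": 5, "<eos>": 6,
}
TEXT_OFFSET = len(SPECIAL_TOKENS)
SPEECH_OFFSET = TEXT_OFFSET + TEXT_VOCAB_SIZE

def text_id(bpe_id: int) -> int:
    return TEXT_OFFSET + bpe_id

def speech_id(unit_id: int) -> int:
    return SPEECH_OFFSET + unit_id

def build_tts_sequence(text_ids, speech_units):
    """One speech-synthesis training example: task token, text, then speech units."""
    return ([SPECIAL_TOKENS["<tts>"], SPECIAL_TOKENS["<start_text>"]]
            + [text_id(t) for t in text_ids]
            + [SPECIAL_TOKENS["<start_speech>"]]
            + [speech_id(u) for u in speech_units]
            + [SPECIAL_TOKENS["<eos>"]])
```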
Unsupervised Data Selection for TTS: Using Arabic Broadcast News as a Case Study
Several high-resource Text-to-Speech (TTS) systems currently produce natural,
human-like speech. In contrast, low-resource languages,
including Arabic, have very limited TTS systems due to the lack of resources.
We propose a fully unsupervised method for building TTS, including automatic
data selection and pre-training/fine-tuning strategies for TTS training, using
broadcast news as a case study. We show how careful selection of a smaller
amount of data can yield a TTS system that generates more natural speech than
a system trained on a bigger dataset. We propose different approaches for:
1) the data: we applied automatic annotation using DNSMOS, automatic
vowelization, and automatic speech recognition (ASR) to fix transcription
errors; 2) the model: we used transfer learning from a high-resource-language
TTS model, fine-tuned it with one hour of broadcast recordings, and then used
this model to guide a FastSpeech2-based Conformer model for duration. Our
objective evaluation shows a 3.9% character error rate (CER), while the ground
truth has 1.3% CER. As for the subjective evaluation, where 1 is bad and 5 is
excellent, our FastSpeech2-based Conformer model achieved a mean opinion score
(MOS) of 4.4 for intelligibility and 4.2 for naturalness, and many annotators
recognized the voice of the broadcaster, which demonstrates the effectiveness
of our proposed unsupervised method.
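The data-selection step could look roughly like the sketch below, which keeps only utterances whose automatically estimated quality (e.g. a DNSMOS-style MOS prediction) and ASR agreement with the transcript pass chosen thresholds; the thresholds and field names are illustrative assumptions, not the paper's exact pipeline.

```python
# Hedged sketch of unsupervised data selection for TTS: filter utterances by an
# automatic quality estimate and by ASR agreement with the transcript.
from dataclasses import dataclass

@dataclass
class Utterance:
    wav_path: str
    transcript: str
    mos_estimate: float   # from an automatic quality predictor such as DNSMOS
    asr_cer: float        # character error rate of an ASR pass vs. the transcript

def select_for_tts(utts, min_mos=3.5, max_cer=0.10):
    """Return the subset of utterances deemed suitable for TTS training."""
    return [u for u in utts if u.mos_estimate >= min_mos and u.asr_cer <= max_cer]

# A smaller but cleaner subset selected this way can train a more
# natural-sounding voice than the full, noisier broadcast archive.
```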
SpeechComposer: Unifying Multiple Speech Tasks with Prompt Composition
Recent advancements in language models have significantly enhanced
performance in multiple speech-related tasks. Existing speech language models
typically utilize task-dependent prompt tokens to unify various speech tasks in
a single model. However, this design omits the intrinsic connections between
different speech tasks, which can potentially boost the performance of each
task. In this work, we propose a novel decoder-only speech language model,
SpeechComposer, that can unify common speech tasks by composing a fixed set of
prompt tokens. Built upon four primary tasks -- speech synthesis, speech
recognition, speech language modeling, and text language modeling --
SpeechComposer can easily extend to more speech tasks via compositions of
well-designed prompt tokens, like voice conversion and speech enhancement. The
unification of prompt tokens also makes it possible for knowledge sharing among
different speech tasks in a more structured manner. Experimental results
demonstrate that our proposed SpeechComposer can improve the performance of
both primary tasks and composite tasks, showing the effectiveness of the shared
prompt tokens. Remarkably, the unified decoder-only model achieves a comparable
and even better performance than the baselines which are expert models designed
for single tasks.
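A rough illustration of the prompt-composition idea, assuming each primary task is named by a fixed prompt token and composite tasks reuse those tokens in sequence; the token names and compositions shown are assumptions, not SpeechComposer's exact prompt design.

```python
# Illustrative sketch: a fixed set of prompt tokens for the primary tasks, with
# composite tasks expressed by composing those same tokens.
PROMPTS = {"ASR": "<asr>", "TTS": "<tts>", "SLM": "<speechlm>", "TLM": "<textlm>"}

def compose_prompt(tasks):
    """Build the prompt prefix for a (possibly composite) task."""
    return [PROMPTS[t] for t in tasks]

# Primary task: speech recognition.
asr_prompt = compose_prompt(["ASR"])             # ["<asr>"]
# A speech-to-speech task (e.g. enhancement or voice conversion) could reuse the
# recognition and synthesis tokens instead of introducing a new task token.
s2s_prompt = compose_prompt(["ASR", "TTS"])      # ["<asr>", "<tts>"]
```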
EEND-SS: Joint End-to-End Neural Speaker Diarization and Speech Separation for Flexible Number of Speakers
In this paper, we present a novel framework that jointly performs speaker
diarization, speech separation, and speaker counting. Our proposed method
combines end-to-end speaker diarization and speech separation methods, namely,
End-to-End Neural Speaker Diarization with Encoder-Decoder-based Attractor
calculation (EEND-EDA) and the Convolutional Time-domain Audio Separation
Network (ConvTasNet), as a multi-tasking joint model. We also propose the multiple
1x1 convolutional layer architecture for estimating the separation masks
corresponding to the number of speakers, and a post-processing technique for
refining the separated speech signal with speech activity. Experiments using
the LibriMix dataset show that our proposed method outperforms the baselines in
terms of diarization and separation performance for both fixed and flexible
numbers of speakers, as well as speaker counting performance for flexible
numbers of speakers. All materials will be open-sourced and reproducible in
the ESPnet toolkit.
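A minimal sketch, under stated assumptions, of the two components mentioned above: per-speaker-count 1x1 convolution heads that produce separation masks, and a post-processing step that gates each separated stream with its estimated speech activity. Shapes and names are illustrative rather than the exact EEND-SS implementation.

```python
# Hedged sketch: one 1x1 convolution head per supported speaker count for mask
# estimation, plus activity-based gating of the separated streams.
import torch
import torch.nn as nn

class MaskHeads(nn.Module):
    """One 1x1 conv per supported speaker count, applied to shared features."""
    def __init__(self, feat_dim=512, max_speakers=3):
        super().__init__()
        self.heads = nn.ModuleList(
            nn.Conv1d(feat_dim, n, kernel_size=1) for n in range(1, max_speakers + 1)
        )

    def forward(self, feats, n_speakers):            # feats: (batch, feat_dim, frames)
        logits = self.heads[n_speakers - 1](feats)   # (batch, n_speakers, frames)
        return torch.sigmoid(logits)                 # separation masks in [0, 1]

def refine_with_activity(separated, activity):
    """Zero out separated frames where the diarization branch finds no speech.

    separated: (speakers, frames, feat), activity: (speakers, frames) in [0, 1].
    """
    return separated * (activity > 0.5).float().unsqueeze(-1)
```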