Voice Conversion Using Sequence-to-Sequence Learning of Context Posterior Probabilities
Voice conversion (VC) using sequence-to-sequence learning of context
posterior probabilities is proposed. Conventional VC using shared context
posterior probabilities predicts target speech parameters from the context
posterior probabilities estimated from the source speech parameters. Although
conventional VC can be built from non-parallel data, it is difficult to convert
speaker individuality, such as phonetic properties and speaking rate, contained in
the posterior probabilities, because the source posterior probabilities are
directly used for predicting the target speech parameters. In this work, we assume
that the training data partly include parallel speech data and propose
sequence-to-sequence learning between the source and target posterior
probabilities. The conversion models perform non-linear and variable-length
transformation from the source probability sequence to the target one. Further,
we propose a joint training algorithm for the modules. In contrast to
conventional VC, which separately trains the speech recognition module that estimates
posterior probabilities and the speech synthesis module that predicts target speech
parameters, our proposed method jointly trains these modules together with the
proposed probability conversion modules. Experimental results demonstrate that
our approach outperforms conventional VC.
Comment: Accepted to INTERSPEECH 201
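To make the described pipeline concrete, below is a minimal sketch in PyTorch of the three jointly trained modules: a recognition module estimating context posterior probabilities, a sequence-to-sequence converter between source and target posterior sequences, and a synthesis module predicting target speech parameters. The architectures, dimensionalities, and loss weighting are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

N_CONTEXT = 64    # number of context (phonetic) classes -- assumed value
N_ACOUSTIC = 60   # dimensionality of speech parameters -- assumed value

class Recognizer(nn.Module):
    """Source speech parameters -> context posterior probabilities."""
    def __init__(self):
        super().__init__()
        self.rnn = nn.GRU(N_ACOUSTIC, 128, batch_first=True)
        self.out = nn.Linear(128, N_CONTEXT)

    def forward(self, speech_params):
        h, _ = self.rnn(speech_params)
        return torch.softmax(self.out(h), dim=-1)

class PosteriorConverter(nn.Module):
    """Seq2seq (encoder-decoder) conversion of posterior sequences.
    Performs the non-linear, variable-length mapping; a real model could
    add attention to better handle speaking-rate differences."""
    def __init__(self):
        super().__init__()
        self.encoder = nn.GRU(N_CONTEXT, 128, batch_first=True)
        self.decoder = nn.GRU(N_CONTEXT, 128, batch_first=True)
        self.out = nn.Linear(128, N_CONTEXT)

    def forward(self, src_post, tgt_post_shifted):
        _, state = self.encoder(src_post)
        h, _ = self.decoder(tgt_post_shifted, state)   # teacher forcing
        return torch.softmax(self.out(h), dim=-1)

class Synthesizer(nn.Module):
    """Target context posterior probabilities -> target speech parameters."""
    def __init__(self):
        super().__init__()
        self.rnn = nn.GRU(N_CONTEXT, 128, batch_first=True)
        self.out = nn.Linear(128, N_ACOUSTIC)

    def forward(self, tgt_post):
        h, _ = self.rnn(tgt_post)
        return self.out(h)

recognizer, converter, synthesizer = Recognizer(), PosteriorConverter(), Synthesizer()
optimizer = torch.optim.Adam(
    list(recognizer.parameters()) + list(converter.parameters())
    + list(synthesizer.parameters()), lr=1e-4)

# One joint training step on a dummy parallel source/target utterance pair.
src = torch.randn(1, 100, N_ACOUSTIC)   # 100 source frames
tgt = torch.randn(1, 120, N_ACOUSTIC)   # 120 target frames (different length)

src_post = recognizer(src)
tgt_post_ref = recognizer(tgt)                           # reference target posteriors
tgt_post_in = torch.cat([torch.zeros_like(tgt_post_ref[:, :1]),
                         tgt_post_ref[:, :-1]], dim=1)   # shifted decoder input
tgt_post_hat = converter(src_post, tgt_post_in)
tgt_params_hat = synthesizer(tgt_post_hat)

# A single loss back-propagated through all three modules (joint training).
loss = F.mse_loss(tgt_params_hat, tgt) + F.mse_loss(tgt_post_hat, tgt_post_ref.detach())
optimizer.zero_grad()
loss.backward()
optimizer.step()
```

Because the loss is back-propagated through all three modules at once, the recognizer and synthesizer adapt to the converter's outputs rather than being trained in isolation, which is the core difference from the conventional separately trained pipeline.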
A Technical Comparison of IPSec and SSL
IPSec (IP Security) and SSL (Secure Sockets Layer) have been among the most robust and most promising tools available for securing communications over the Internet. Both IPSec and SSL have advantages and shortcomings. Yet, to our knowledge, no prior work has compared the two protocols in terms of characteristics and functionality. Our objective is to present an analysis of the security and performance properties of IPSec and SSL.
JVNV: A Corpus of Japanese Emotional Speech with Verbal Content and Nonverbal Expressions
We present the JVNV, a Japanese emotional speech corpus with verbal content
and nonverbal vocalizations whose scripts are generated by a large-scale
language model. Existing emotional speech corpora lack not only proper
emotional scripts but also nonverbal vocalizations (NVs), which are essential
expressions in spoken language for conveying emotions. We propose an automatic
script generation method to produce emotional scripts by providing seed words
with sentiment polarity and phrases of nonverbal vocalizations to ChatGPT using
prompt engineering. We select 514 scripts with balanced phoneme coverage from
the generated candidate scripts with the assistance of emotion confidence
scores and language fluency scores. We demonstrate the effectiveness of JVNV by
showing that JVNV has better phoneme coverage and emotion recognizability than
previous Japanese emotional speech corpora. We then benchmark JVNV on emotional
text-to-speech synthesis using discrete codes to represent NVs. We show that
there is still a gap between the performance of synthesizing read-aloud speech
and that of emotional speech, and that adding NVs to the speech makes the task
even harder; this poses new challenges and makes JVNV a valuable resource for
future work in this area. To the best of our knowledge, JVNV is the first
speech corpus whose scripts are automatically generated using large language
models.
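As an illustration of the script-selection step described above, the sketch below greedily picks candidate scripts that add unseen phonemes, breaking ties by the sum of emotion-confidence and fluency scores. The data structures and scoring scheme are hypothetical placeholders, not the authors' pipeline, which generates candidates with ChatGPT and uses its own scoring models.

```python
from dataclasses import dataclass

@dataclass
class Candidate:
    text: str
    phonemes: set          # phoneme inventory of the script
    emotion_conf: float    # emotion-classifier confidence, 0..1 (assumed)
    fluency: float         # language-model fluency score, 0..1 (assumed)

def select_scripts(candidates, n_scripts):
    """Greedy selection balancing phoneme coverage against quality scores."""
    selected, covered = [], set()
    remaining = list(candidates)
    while remaining and len(selected) < n_scripts:
        def gain(c):
            # Prefer scripts that cover new phonemes; break ties by quality.
            return (len(c.phonemes - covered), c.emotion_conf + c.fluency)
        best = max(remaining, key=gain)
        selected.append(best)
        covered |= best.phonemes
        remaining.remove(best)
    return selected

# Toy usage with made-up candidates and romanized phoneme sets.
candidates = [
    Candidate("わあ、すごい!", {"w", "a", "s", "u", "g", "o", "i"}, 0.90, 0.80),
    Candidate("ええと、そうですね。", {"e", "t", "o", "s", "u", "d", "n"}, 0.60, 0.90),
    Candidate("やった、嬉しい!", {"y", "a", "t", "u", "r", "e", "sh", "i"}, 0.95, 0.85),
]
for c in select_scripts(candidates, n_scripts=2):
    print(c.text)
```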
Coco-Nut: Corpus of Japanese Utterance and Voice Characteristics Description for Prompt-based Control
In text-to-speech, controlling voice characteristics is important for
achieving speech synthesis for various purposes. Considering the success of
text-conditioned generation, such as text-to-image generation, free-form text
instructions should be useful for intuitive and complex control of voice
characteristics. A sufficiently large corpus of high-quality and diverse voice
samples with corresponding free-form descriptions can advance such control
research. However, neither an open corpus nor a scalable method is currently
available. To this end, we develop Coco-Nut, a new corpus including diverse
Japanese utterances, along with text transcriptions and free-form voice
characteristics descriptions. Our methodology to construct this corpus consists
of 1) automatic collection of voice-related audio data from the Internet, 2)
quality assurance, and 3) manual annotation using crowdsourcing. Additionally,
we benchmark our corpus on a prompt embedding model trained by contrastive
speech-text learning.
Comment: Submitted to ASRU202
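To illustrate what contrastive speech-text learning of a prompt embedding looks like, here is a minimal CLIP-style sketch in PyTorch; the projection layers, feature dimensionalities, and temperature handling are assumptions for illustration, not the benchmarked model.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

EMB_DIM = 256  # shared embedding size -- assumed value

class ContrastivePromptModel(nn.Module):
    def __init__(self, speech_dim=80, text_dim=768):
        super().__init__()
        # Placeholder linear projections; real encoders would be, e.g., a
        # speech SSL model and a pretrained Japanese text encoder.
        self.speech_proj = nn.Linear(speech_dim, EMB_DIM)
        self.text_proj = nn.Linear(text_dim, EMB_DIM)
        self.log_temp = nn.Parameter(torch.zeros(()))  # learnable temperature

    def forward(self, speech_feats, text_feats):
        s = F.normalize(self.speech_proj(speech_feats), dim=-1)
        t = F.normalize(self.text_proj(text_feats), dim=-1)
        logits = s @ t.T * self.log_temp.exp()
        labels = torch.arange(len(s))
        # Symmetric InfoNCE: matched utterance/description pairs lie on the diagonal.
        return (F.cross_entropy(logits, labels) +
                F.cross_entropy(logits.T, labels)) / 2

model = ContrastivePromptModel()
speech_feats = torch.randn(8, 80)   # pooled speech features for 8 utterances (dummy)
text_feats = torch.randn(8, 768)    # pooled description embeddings (dummy)
loss = model(speech_feats, text_feats)
loss.backward()
```

The symmetric cross-entropy pulls each utterance embedding toward its own free-form description and away from the other descriptions in the batch, yielding a shared space in which text prompts can retrieve or condition on voice characteristics.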