Detection of Prosodic Boundaries in Speech Using Wav2Vec 2.0
Prosodic boundaries in speech are of great relevance to both speech synthesis
and audio annotation. In this paper, we apply the wav2vec 2.0 framework to the
task of detecting these boundaries in the speech signal, using only acoustic
information. We test the approach on a set of recordings of Czech broadcast
news, labeled by phonetic experts, and compare it to an existing text-based
predictor, which uses the transcripts of the same data. Despite using a
relatively small amount of labeled data, the wav2vec2 model achieves an
accuracy of 94% and F1 measure of 83% on within-sentence prosodic boundaries
(or 95% and 89% on all prosodic boundaries), outperforming the text-based
approach. Moreover, by combining the outputs of the two different models we can
improve the results even further.

Comment: This preprint is a pre-review version of the paper and does not
contain any post-submission improvements or corrections. The Version of
Record of this contribution is published in the proceedings of the
International Conference on Text, Speech, and Dialogue (TSD 2022), LNAI
volume 13502, and is available online at
https://doi.org/10.1007/978-3-031-16270-1_3
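The abstract reports that combining the acoustic and text-based predictors improves over either alone, but does not specify the combination method. A minimal hypothetical sketch, assuming simple linear interpolation of per-position boundary probabilities (the weight and all scores below are invented for illustration):

```python
# Hypothetical fusion of boundary probabilities from an acoustic
# (wav2vec2-style) classifier and a text-based predictor.
# The interpolation weight and the probability values are illustrative;
# the paper does not state how the two models' outputs are combined.

def fuse_boundary_scores(acoustic_probs, text_probs, weight=0.7):
    """Linear interpolation of per-word-gap boundary probabilities."""
    assert len(acoustic_probs) == len(text_probs)
    return [weight * a + (1 - weight) * t
            for a, t in zip(acoustic_probs, text_probs)]

def predict_boundaries(probs, threshold=0.5):
    """Threshold fused scores into binary boundary decisions."""
    return [p >= threshold for p in probs]

acoustic = [0.9, 0.2, 0.6, 0.1]   # P(boundary) per word gap, acoustic model
text     = [0.8, 0.4, 0.3, 0.2]   # same positions, text-based model
fused = fuse_boundary_scores(acoustic, text)
print(predict_boundaries(fused))  # boundary decision per word gap
```

In practice the weight (or a small combination model) would be tuned on held-out labeled data.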
Multi-Module G2P Converter for Persian Focusing on Relations between Words
In this paper, we investigate the application of end-to-end and multi-module
frameworks for G2P conversion for the Persian language. The results demonstrate
that our proposed multi-module G2P system outperforms our end-to-end systems in
terms of accuracy and speed. The system consists of a pronunciation dictionary
as our look-up table, along with separate models to handle homographs, OOVs and
ezafe in Persian created using GRU and Transformer architectures. The system is
sequence-level rather than word-level, which allows it to effectively capture
the unwritten relations between words (cross-word information) necessary for
homograph disambiguation and ezafe recognition without the need for any
pre-processing. After evaluation, our system achieved a 94.48% word-level
accuracy, outperforming previous G2P systems for Persian.

Comment: 10 pages, 4 figures
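The multi-module design described above can be illustrated as a dictionary look-up with a fallback for out-of-vocabulary words. This is a toy sketch only: the lexicon entries and the naive fallback are invented, whereas the real system uses trained GRU/Transformer models for homographs, OOVs and ezafe:

```python
# Toy sketch of a multi-module G2P pipeline: pronunciation dictionary as
# the primary look-up table, plus an OOV fallback. Entries and phoneme
# strings are invented for illustration and are not the system's actual
# lexicon or models.

LEXICON = {"salam": "s a l A m", "ketab": "k e t A b"}  # toy entries

def oov_fallback(word):
    # Stand-in for a trained seq2seq model: naive letter-to-sound mapping.
    return " ".join(word)

def g2p(sentence):
    """Transcribe each word: dictionary hit first, fallback otherwise."""
    return [LEXICON.get(w, oov_fallback(w)) for w in sentence.split()]

print(g2p("salam ketab xyz"))
```

The sequence-level processing the abstract describes would additionally condition each word's pronunciation on its neighbors (for homographs and ezafe), which a per-word look-up like this cannot do on its own.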
The SP2 SCOPES Project on Speech Prosody
This is an overview of a Joint Research Project within the Scientific co-operation between Eastern Europe and Switzerland (SCOPES) Program of the Swiss National Science Foundation (SNSF) and the Swiss Agency for Development and Cooperation (SDC). Within the SP2 SCOPES Project on Speech Prosody, over the course of the following two years, the four partners aim to collaborate on the subject of speech prosody and advance the extraction, processing, modeling and transfer of prosody for a large portfolio of European languages: French, German, Italian, English, Hungarian, Serbian, Croatian, Bosnian, Montenegrin, and Macedonian. Through the four intertwined research plans, synergies are expected to emerge that will build a foundation for submitting strong joint proposals for EU funding.
Implementing and Improving a Speech Synthesis System
This work deals with text-to-speech (TTS) synthesis. A general theoretical introduction to TTS is given. The work is based on the MARY TTS system, which allows existing modules to be used to build a custom text-to-speech system, with speech synthesis performed by hidden Markov models trained on a purpose-built speech database. Several simple programs were created to ease database creation, and the process of adding a new language and voice to the MARY TTS system was demonstrated. A Czech language module and voice for the MARY TTS system were created and published. An algorithm for grapheme-to-phoneme transcription was described and implemented.
A Survey on Non-Autoregressive Generation for Neural Machine Translation and Beyond
Non-autoregressive (NAR) generation, first proposed in neural
machine translation (NMT) to speed up inference, has attracted much attention
in both machine learning and natural language processing communities. While NAR
generation can significantly accelerate inference speed for machine
translation, the speedup comes at the cost of sacrificed translation accuracy
compared to its counterpart, auto-regressive (AR) generation. In recent years,
many new models and algorithms have been designed/proposed to bridge the
accuracy gap between NAR generation and AR generation. In this paper, we
conduct a systematic survey with comparisons and discussions of various
non-autoregressive translation (NAT) models from different aspects.
Specifically, we categorize the efforts of NAT into several groups, including
data manipulation, modeling methods, training criterion, decoding algorithms,
and the benefit from pre-trained models. Furthermore, we briefly review other
applications of NAR models beyond machine translation, such as dialogue
generation, text summarization, grammar error correction, semantic parsing,
speech synthesis, and automatic speech recognition. In addition, we
discuss potential directions for future exploration, including reducing the
dependency on knowledge distillation (KD), dynamic length prediction,
pre-training for NAR, and wider applications. We hope this survey can help
researchers capture the latest progress in NAR generation, inspire the design
of advanced NAR models and algorithms, and enable industry practitioners to
choose appropriate solutions for their applications. The web page of this
survey is at
https://github.com/LitterBrother-Xiao/Overview-of-Non-autoregressive-Applications

Comment: 25 pages, 11 figures, 4 tables
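The AR/NAR distinction the survey is built around can be shown with a toy contrast. The "model" below is a word-for-word lookup table invented purely for illustration, not a real NMT system; the point is the dependency structure of the two decoding loops:

```python
# Toy contrast between autoregressive (AR) and non-autoregressive (NAR)
# decoding. The lookup table stands in for a trained translation model
# and is invented for illustration only.

TABLE = {"ich": "i", "mag": "like", "katzen": "cats"}

def ar_decode(source):
    # AR: one target token per step; each step depends on the prefix,
    # so the n steps cannot run in parallel.
    out = []
    for tok in source:            # sequential loop: n dependent steps
        out.append(TABLE[tok])    # a real AR model would also condition on `out`
    return out

def nar_decode(source):
    # NAR: all target positions are predicted independently, so the
    # whole sequence can be generated in one parallel step.
    return [TABLE[tok] for tok in source]

src = ["ich", "mag", "katzen"]
print(ar_decode(src), nar_decode(src))
```

With a word-for-word table the two decoders trivially agree; in a real model, dropping the conditioning on previously generated tokens is exactly what causes the accuracy gap the survey discusses.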
Speech Processing and Prosody
The prosody of the speech signal conveys information beyond the linguistic content of the message: prosody structures the utterance, and also carries information on the speaker's attitude and emotion. The prosodic features are the duration of sounds, energy and fundamental frequency; however, their automatic computation and usage are not obvious. Sound duration features are usually extracted from speech recognition results or from a forced speech-text alignment. Although the resulting segmentation is usually acceptable on clean native speech data, performance degrades on noisy or non-native speech. Many algorithms have been developed for computing the fundamental frequency; they achieve rather good performance on clean speech, but again, performance degrades in noisy conditions. However, in some applications, as for example in computer assisted language learning, the reliability of the prosodic features is critical; indeed, the quality of the diagnosis of the learner's pronunciation will heavily depend on the precision and reliability of the estimated prosodic parameters. The paper considers the computation of prosodic features, shows the limitations of automatic approaches, and discusses the problem of computing confidence measures on such features. It then discusses the role of prosodic features and how they can be handled for automatic processing in tasks such as the detection of discourse particles, the characterization of emotions, the classification of sentence modalities, as well as in computer assisted language learning and in expressive speech synthesis.
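Of the prosodic features named above, fundamental frequency (F0) is the one whose extraction the text flags as non-obvious. A minimal sketch of one classical approach, autocorrelation-based F0 estimation, on a clean synthetic tone (sample rate, pitch range and test frequency are all illustrative choices; as the text notes, real noisy speech is much harder):

```python
import math

# Minimal sketch of F0 estimation by autocorrelation on a clean
# synthetic sine. All parameters (16 kHz sample rate, 50-500 Hz search
# range, 200 Hz test tone) are illustrative assumptions.

SR = 16000                      # sample rate in Hz
F0_TRUE = 200.0                 # pitch of the synthetic test tone
signal = [math.sin(2 * math.pi * F0_TRUE * n / SR) for n in range(SR // 10)]

def estimate_f0(x, sr, fmin=50, fmax=500):
    """Return the lag maximizing the autocorrelation, converted to Hz."""
    best_lag, best_corr = 0, float("-inf")
    # Only search lags corresponding to the plausible pitch range.
    for lag in range(sr // fmax, sr // fmin + 1):
        corr = sum(x[n] * x[n + lag] for n in range(len(x) - lag))
        if corr > best_corr:
            best_lag, best_corr = lag, corr
    return sr / best_lag

print(round(estimate_f0(signal, SR)))  # close to the 200 Hz ground truth
```

On noisy speech, spurious correlation peaks cause octave errors and voicing mistakes, which is why confidence measures on the estimated F0, as the paper discusses, matter in applications like pronunciation feedback.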