Meta-Learning for Phonemic Annotation of Corpora
We apply rule induction, classifier combination and meta-learning (stacked
classifiers) to the problem of bootstrapping high accuracy automatic annotation
of corpora with pronunciation information. The task we address in this paper
consists of generating phonemic representations reflecting the Flemish and
Dutch pronunciations of a word on the basis of its orthographic representation
(which in turn is based on the actual speech recordings). We compare several
possible approaches to achieve the text-to-pronunciation mapping task:
memory-based learning, transformation-based learning, rule induction, maximum
entropy modeling, combination of classifiers in stacked learning, and stacking
of meta-learners. We are interested both in optimal accuracy and in obtaining
insight into the linguistic regularities involved. As far as accuracy is
concerned, an already high accuracy level (93% for Celex and 86% for Fonilex at
word level) for single classifiers is boosted significantly with additional
error reductions of 31% and 38% respectively using combination of classifiers,
and a further 5% using combination of meta-learners, bringing overall word
level accuracy to 96% for the Dutch variant and 92% for the Flemish variant. We
also show that the application of machine learning methods indeed leads to
increased insight into the linguistic regularities determining the variation
between the two pronunciation variants studied. Comment: 8 pages
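The classifier-combination idea in this abstract can be illustrated with a toy sketch: several hypothetical base classifiers each propose a phoneme for a grapheme, and a simple majority vote combines their predictions. The rule tables, classifier names, and example word below are invented for illustration; they are stand-ins for the paper's actual learners (memory-based learning, rule induction, etc.), and majority voting is only the simplest form of the combination schemes the paper studies.

```python
from collections import Counter

# Three toy base "classifiers": hypothetical rule tables mapping a grapheme
# to a phoneme (stand-ins for the trained learners in the paper).
def clf_a(g):
    return {"c": "k", "a": "a", "t": "t"}.get(g, "?")

def clf_b(g):
    return {"c": "s", "a": "a", "t": "t"}.get(g, "?")

def clf_c(g):
    return {"c": "k", "a": "A", "t": "t"}.get(g, "?")

def combine(g, classifiers):
    """Combine base predictions by majority vote -- the simplest stand-in
    for the stacked combination the abstract describes."""
    votes = Counter(clf(g) for clf in classifiers)
    return votes.most_common(1)[0][0]

# Map the word "cat" grapheme by grapheme: for "c", "k" wins 2 votes to 1.
phonemes = [combine(g, [clf_a, clf_b, clf_c]) for g in "cat"]
```

A stacked meta-learner replaces the vote with a second-level classifier trained on the base classifiers' outputs, which is where the paper's additional error reductions come from.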
Radio Oranje: Enhanced Access to a Historical Spoken Word Collection
Access to historical audio collections is typically very restricted:
content is often only available on physical (analog) media and the
metadata is usually limited to keywords, giving access at the level
of relatively large fragments, e.g., an entire tape. Many spoken
word heritage collections are now being digitized, which allows the
introduction of more advanced search technology. This paper presents
an approach that supports online access and search for recordings of
historical speeches. A demonstrator has been built, based on the
so-called Radio Oranje collection, which contains radio speeches by
the Dutch Queen Wilhelmina that were broadcast during World War II.
The audio has been aligned with its original 1940s manual
transcriptions to create a time-stamped index that enables the speeches to be
searched at the word level. Results are presented together with
related photos from an external database.
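A word-level time-stamped index of the kind this abstract describes can be sketched minimally as follows. The entries, words, and timestamps are invented examples, not the actual Radio Oranje data or index schema; the point is only the lookup pattern that lets a player jump to the matching moment in a recording.

```python
# Toy word-level index: (word, start time in seconds) pairs, as produced
# by aligning a transcript with the audio. Entries are illustrative only.
index = [
    ("landgenoten", 12.4),
    ("de", 13.1),
    ("strijd", 13.3),
    ("gaat", 13.9),
    ("voort", 14.2),
]

def search(word, index):
    """Return the start time of every occurrence of `word`, so playback
    can begin at the matching point rather than at the start of a tape."""
    return [t for w, t in index if w == word]

hits = search("strijd", index)
```

This is exactly the granularity gain over tape-level keyword metadata: each query resolves to seconds-level offsets instead of an entire recording.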
MUST&P-SRL: Multi-lingual and Unified Syllabification in Text and Phonetic Domains for Speech Representation Learning
In this paper, we present a methodology for linguistic feature extraction,
focusing particularly on automatically syllabifying words in multiple
languages, with a design to be compatible with a forced-alignment tool, the
Montreal Forced Aligner (MFA). Our method covers the extraction of phonetic
transcriptions from text, stress marks, and a unified automatic
syllabification in both the textual and phonetic domains.
The system was built with open-source components and resources. Through an
ablation study, we demonstrate the efficacy of our approach in automatically
syllabifying words from several languages (English, French and Spanish).
Additionally, we apply the technique to the transcriptions of the CMU ARCTIC
dataset, generating valuable annotations available online
(https://github.com/noetits/MUST_P-SRL) that are ideal for
speech representation learning, speech unit discovery, and disentanglement of
speech factors in several speech-related fields. Comment: Accepted for publication at EMNLP 202
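As a hedged illustration of automatic syllabification in the phonetic domain, the sketch below applies the classic maximal-onset principle: each vowel is a syllable nucleus, and intervocalic consonants are assigned to the following syllable's onset as long as that onset is legal. The vowel set and legal-onset inventory are simplified assumptions for a toy alphabet, not the rules MUST&P-SRL actually uses.

```python
# Simplified inventories; real systems use language-specific tables.
VOWELS = {"a", "e", "i", "o", "u"}
LEGAL_ONSETS = {"", "p", "t", "k", "s", "pr", "tr", "st", "str"}

def syllabify(phones):
    """Split a phoneme list into syllables via the maximal-onset principle:
    consonants between two nuclei go to the next syllable's onset whenever
    the resulting onset is legal, otherwise they stay in the coda."""
    nuclei = [i for i, p in enumerate(phones) if p in VOWELS]
    if not nuclei:
        return ["".join(phones)]
    bounds = [0]
    for left, right in zip(nuclei, nuclei[1:]):
        cluster = phones[left + 1:right]   # consonants between two nuclei
        split = len(cluster)               # default: everything to the coda
        for k in range(len(cluster) + 1):
            if "".join(cluster[k:]) in LEGAL_ONSETS:
                split = k                  # earliest split = maximal onset
                break
        bounds.append(left + 1 + split)
    bounds.append(len(phones))
    return ["".join(phones[a:b]) for a, b in zip(bounds, bounds[1:])]

syllabify(list("pasta"))   # -> ["pa", "sta"]
```

With "pasta", the "st" cluster moves wholesale into the second syllable's onset because "st" is in the legal-onset set; a language without "st" onsets would instead yield "pas-ta".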
An overview of text-to-speech systems and media applications
Producing synthetic voice, similar to human-like sound, is an emerging
novelty of modern interactive media systems. Text-To-Speech (TTS) systems try
to generate synthetic yet authentic-sounding voices from text input. Moreover,
the well-known dubbing, announcing, and narrating voices that are valuable
possessions of any media organization can be preserved indefinitely by using
TTS and Voice Conversion (VC) algorithms. The emergence of deep learning approaches has made
such TTS systems more accurate and accessible. To understand TTS systems
better, this paper investigates the key components of such systems including
text analysis, acoustic modelling and vocoding. The paper then provides details
of important state-of-the-art TTS systems based on deep learning. Finally, a
comparison is made between recently released systems in terms of backbone
architecture, type of input and conversion, vocoder used and subjective
assessment (MOS). Accordingly, Tacotron 2, Transformer TTS, WaveNet and
FastSpeech 1 are among the most successful TTS systems ever released. In the
discussion section, some suggestions are made to develop a TTS system with
regard to the intended application. Comment: Accepted in ABU Technical Review journal 2023/
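The three-stage pipeline this survey describes (text analysis, acoustic modelling, vocoding) can be caricatured in a few lines to show the data flow. Every stage below is a placeholder invented for illustration, not the internals of any real system; in practice the acoustic model would be a network such as Tacotron 2 or FastSpeech and the vocoder a model such as WaveNet.

```python
def text_analysis(text):
    """Frontend: normalize raw text into a symbol sequence (a toy stand-in
    for tokenization, normalization, and grapheme-to-phoneme conversion)."""
    return [c for c in text.lower() if c.isalpha()]

def acoustic_model(symbols):
    """Map symbols to dummy 'spectrogram frames' (one number per symbol);
    real systems predict mel-spectrogram frames with a neural network."""
    return [float(ord(s)) for s in symbols]

def vocoder(frames):
    """Turn frames into a dummy waveform with samples in [0, 1); this is
    the role neural vocoders play in the systems compared above."""
    return [(f % 128.0) / 128.0 for f in frames]

def tts(text):
    # The whole pipeline is just the composition of the three stages.
    return vocoder(acoustic_model(text_analysis(text)))

samples = tts("Hi there")
```

The survey's comparison dimensions (backbone architecture, input type, vocoder) correspond directly to which of these three stages a given system replaces or fuses.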