Sparks of Large Audio Models: A Survey and Outlook
This survey paper provides a comprehensive overview of the recent
advancements and challenges in applying large language models to the field of
audio signal processing. Audio processing, with its diverse signal
representations and a wide range of sources, from human voices to musical
instruments and environmental sounds, poses challenges distinct from those
found in traditional Natural Language Processing scenarios. Nevertheless,
Large Audio Models, epitomized by transformer-based architectures,
have shown marked efficacy in this domain. By leveraging massive amounts of
data, these models have demonstrated prowess in a variety of audio tasks,
spanning from Automatic Speech Recognition and Text-To-Speech to Music
Generation, among others. Notably, Foundational Audio Models such as
SeamlessM4T have recently begun to act as universal translators, supporting
multiple speech tasks across up to 100 languages without relying on separate
task-specific systems. This paper presents an in-depth
analysis of state-of-the-art methodologies for Foundational Large Audio
Models, their performance benchmarks, and their applicability to
real-world scenarios. We also highlight current limitations and provide
insights into potential future research directions in the realm of
Large Audio Models, with the intent to spark further discussion,
thereby fostering innovation in the next generation of audio-processing
systems. Furthermore, to keep pace with the rapid development in this area, we
will continually update the companion repository with recent articles and
their open-source implementations at
https://github.com/EmulationAI/awesome-large-audio-models.
Comment: work in progress
Suprasegmental representations for the modeling of fundamental frequency in statistical parametric speech synthesis
Statistical parametric speech synthesis (SPSS) has seen improvements over
recent years, especially in terms of intelligibility. Synthetic speech is often clear
and understandable, but it can also be bland and monotonous. Proper generation
of natural speech prosody is still a largely unsolved problem. This is
especially relevant in the context of expressive audiobook speech synthesis, where speech
is expected to be fluid and captivating.
In general, prosody can be seen as a layer that is superimposed on the segmental
(phone) sequence. Listeners can perceive the same melody or rhythm
in different utterances, and the same segmental sequence can be uttered with a
different prosodic layer to convey a different message. For this reason, prosody
is commonly accepted to be inherently suprasegmental. It is governed by longer
units within the utterance (e.g. syllables, words, phrases) and beyond the utterance
(e.g. discourse). However, common techniques for the modeling of speech
prosody (and speech in general) operate mainly on very short intervals, either at
the state or frame level, in both hidden Markov model (HMM) and deep neural
network (DNN) based speech synthesis.
This thesis presents contributions supporting the claim that stronger representations
of suprasegmental variation are essential for the natural generation of
fundamental frequency for statistical parametric speech synthesis. We conceptualize
the problem by dividing it into three sub-problems: (1) representations of
acoustic signals, (2) representations of linguistic contexts, and (3) the mapping
of one representation to another. The contributions of this thesis provide novel
methods and insights relating to these three sub-problems.
In terms of sub-problem 1, we propose a multi-level representation of f0 using
the continuous wavelet transform and the discrete cosine transform, as well
as a wavelet-based decomposition strategy that is linguistically and perceptually
motivated. In terms of sub-problem 2, we investigate additional linguistic
features such as text-derived word embeddings and syllable-level bag-of-phones features, and
we propose a novel method for learning word vector representations based on
acoustic counts. Finally, considering sub-problem 3, insights are given regarding
hierarchical models such as parallel and cascaded deep neural networks.
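To ground the sub-problem 1 idea, the following is a minimal, self-contained sketch, assuming a synthetic log-f0 contour at a 5 ms frame shift and hand-picked dyadic scales; the thesis's actual decomposition, wavelet choices, and linguistic alignment are not reproduced here. A Mexican-hat continuous wavelet transform separates the contour into temporal scales, and a discrete cosine transform compactly parameterizes a syllable-sized chunk.

```python
# Minimal sketch (not the thesis code) of a multi-level f0 representation.
# Assumptions: synthetic log-f0 contour, 5 ms frames, hand-picked scales.
import numpy as np
from scipy.fft import dct, idct

def mexican_hat(t, s):
    """Mexican-hat (Ricker) wavelet dilated to scale s (in frames)."""
    x = t / s
    return (1.0 - x ** 2) * np.exp(-0.5 * x ** 2) * 2.0 / (np.sqrt(3.0 * s) * np.pi ** 0.25)

def cwt_decompose(contour, scales):
    """One band per scale: convolve the contour with each dilated wavelet."""
    n = len(contour)
    t = np.arange(n) - n // 2
    return np.stack([np.convolve(contour, mexican_hat(t, s), mode="same")
                     for s in scales])

# Synthetic voiced log-f0 contour (assumption): a slow phrase-level trend
# plus faster syllable-level movement, 400 frames.
frames = np.arange(400)
logf0 = (4.8 + 0.10 * np.sin(2 * np.pi * frames / 400)
             + 0.03 * np.sin(2 * np.pi * frames / 40))

# Dyadic scales chosen (assumption) to span roughly syllable-to-phrase sizes.
bands = cwt_decompose(logf0, scales=[8, 16, 32, 64, 128])
print(bands.shape)  # (5, 400): one temporal scale per row

# DCT parameterization of one syllable-sized chunk (~50 frames, assumption):
# the leading coefficients capture the chunk's mean, slope, and curvature.
chunk = logf0[100:150]
coeffs = dct(chunk, norm="ortho")
approx = idct(np.concatenate([coeffs[:5], np.zeros(len(chunk) - 5)]),
              norm="ortho")
print(np.max(np.abs(chunk - approx)))  # small error from just 5 coefficients
```

Each wavelet band (or each small set of DCT coefficients) can then be modeled and predicted separately, which is the sense in which such a representation is multi-level rather than frame-by-frame.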
IberSPEECH 2020: XI Jornadas en Tecnología del Habla and VII Iberian SLTech
IberSPEECH2020 is a two-day event bringing together the best researchers and practitioners in speech and language technologies for Iberian languages to promote interaction and discussion. The organizing committee has planned a wide variety of scientific and social activities, including technical paper presentations, keynote lectures, project presentations, laboratory activities, recent PhD theses, discussion panels, a round table, and awards for the best thesis and papers. The program of IberSPEECH2020 includes a total of 32 contributions, distributed among 5 oral sessions, a PhD session, and a projects session. To ensure the quality of all contributions, each submitted paper was reviewed by three members of the scientific review committee. All papers in the conference will be accessible through the International Speech Communication Association (ISCA) Online Archive. Paper selection was based on the scores and comments provided by the scientific review committee, which includes 73 researchers from different institutions (mainly from Spain and Portugal, but also from France, Germany, Brazil, Iran, Greece, Hungary, the Czech Republic, Ukraine, and Slovenia). Furthermore, extended versions of selected papers will be published as a special issue, "IberSPEECH 2020: Speech and Language Technologies for Iberian Languages", of the MDPI journal Applied Sciences, with full open access. In addition to the regular paper sessions, the IberSPEECH2020 scientific program features the ALBAYZIN evaluation challenge session.
Red Española de Tecnologías del Habla. Universidad de Valladolid
Unsupervised learning for text-to-speech synthesis
This thesis introduces a general method for incorporating the distributional analysis
of textual and linguistic objects into text-to-speech (TTS) conversion systems.
Conventional TTS conversion uses intermediate layers of representation to bridge
the gap between text and speech. Collecting the annotated data needed to produce
these intermediate layers is a far from trivial task, possibly prohibitively so
for languages in which no such resources exist. Distributional analysis,
in contrast, proceeds in an unsupervised manner, and so enables the creation of
systems using textual data that are not annotated. The method therefore aids
the building of systems for languages in which conventional linguistic resources
are scarce, but is not restricted to these languages.
The distributional analysis proposed here places the textual objects analysed
in a continuous-valued space, rather than specifying a hard categorisation of those
objects. This space is then partitioned during the training of acoustic models for
synthesis, so that the models generalise over objects' surface forms in a way that
is acoustically relevant.
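As a rough illustration of this recipe, here is a minimal sketch, assuming a toy corpus, a one-word co-occurrence window, and a 2-dimensional SVD; the thesis's actual objects, features, and acoustic-model training are not reproduced. Words receive continuous-valued representations from raw co-occurrence counts, and the space is then split by a threshold question of the kind a decision-tree-clustered acoustic model could ask.

```python
# Minimal sketch of the distributional idea (toy corpus and window size
# are assumptions; the thesis's method differs in its details).
import numpy as np

# Unannotated text is the only input: no labels, no linguistic resources.
corpus = "the cat sat on the mat the dog sat on the rug".split()
vocab = sorted(set(corpus))
idx = {w: i for i, w in enumerate(vocab)}

# Co-occurrence counts within a +/-1 word window.
counts = np.zeros((len(vocab), len(vocab)))
for i, w in enumerate(corpus):
    for j in (i - 1, i + 1):
        if 0 <= j < len(corpus):
            counts[idx[w], idx[corpus[j]]] += 1.0

# Truncated SVD places each word in a continuous-valued space
# instead of assigning it a hard category.
u, s, _ = np.linalg.svd(counts, full_matrices=False)
vectors = u[:, :2] * s[:2]  # shape: (vocab_size, 2)

# A binary "question" partitions the space along one learned dimension.
# In training, the threshold would be chosen for acoustic relevance,
# not fixed at the median as in this sketch.
threshold = np.median(vectors[:, 0])
partition = {w: int(vectors[idx[w], 0] > threshold) for w in vocab}
print(partition)
```

The point of deferring the split to training time is visible even in this sketch: the same continuous space admits many partitions, so the acoustic models can pick the ones that matter acoustically rather than inheriting a fixed categorisation from the text analysis.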
The method is applied to three levels of textual analysis: to the characterisation
of sub-syllabic units, word units and utterances. Entire systems for three
languages (English, Finnish and Romanian) are built with no reliance on manually
labelled data or language-specific expertise. Results of a subjective evaluation
are presented.