Multi-Scale Sub-Band Constant-Q Transform Discriminator for High-Fidelity Vocoder
Generative Adversarial Network (GAN) based vocoders are superior in inference
speed and synthesis quality when reconstructing an audible waveform from an
acoustic representation. This study focuses on improving the discriminator to
promote GAN-based vocoders. Most existing time-frequency-representation-based
discriminators are rooted in Short-Time Fourier Transform (STFT), whose
time-frequency resolution in a spectrogram is fixed, making it incompatible
with signals like singing voices that require flexible attention for different
frequency bands. Motivated by this, our study uses the Constant-Q Transform
(CQT), whose resolution varies across frequencies, which helps model pitch
accuracy and harmonic tracking more effectively. Specifically,
we propose a Multi-Scale Sub-Band CQT (MS-SB-CQT) Discriminator, which operates
on the CQT spectrogram at multiple scales and performs sub-band processing
according to different octaves. Experiments conducted on both speech and
singing voices confirm the effectiveness of our proposed method. Moreover, we
also verified that the CQT-based and the STFT-based discriminators could be
complementary under joint training. Specifically, enhanced by the proposed
MS-SB-CQT and the existing MS-STFT Discriminators, the MOS of HiFi-GAN can be
boosted from 3.27 to 3.87 for seen singers and from 3.40 to 3.78 for unseen
singers.
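The contrast between the fixed resolution of the STFT and the frequency-dependent resolution of the CQT can be illustrated with a short sketch. It uses the standard CQT definitions (geometrically spaced center frequencies and a constant quality factor Q) and groups bins into one sub-band per octave. The function names and the parameter values (f_min, bins_per_octave, n_octaves) are illustrative assumptions, not the configuration used in the paper, and the sketch does not reproduce the discriminator itself.

```python
def cqt_bins(f_min, bins_per_octave, n_octaves):
    """Return (center_frequency, bandwidth, octave_index) for each CQT bin."""
    # Constant quality factor: Q = f_k / bandwidth_k is identical for every bin.
    q = 1.0 / (2.0 ** (1.0 / bins_per_octave) - 1.0)
    bins = []
    for k in range(bins_per_octave * n_octaves):
        # Center frequencies are spaced geometrically, not linearly as in the STFT.
        f_k = f_min * 2.0 ** (k / bins_per_octave)
        bins.append((f_k, f_k / q, k // bins_per_octave))
    return bins


def split_by_octave(bins):
    """Group bins into one sub-band per octave."""
    sub_bands = {}
    for f_k, bandwidth, octave in bins:
        sub_bands.setdefault(octave, []).append((f_k, bandwidth))
    return sub_bands


# Hypothetical settings: 3 octaves starting near C1, 12 bins per octave.
bins = cqt_bins(f_min=32.70, bins_per_octave=12, n_octaves=3)
sub_bands = split_by_octave(bins)
for octave, members in sub_bands.items():
    first, last = members[0], members[-1]
    print(f"octave {octave}: {len(members)} bins, "
          f"{first[0]:.1f} Hz (bw {first[1]:.2f} Hz) to "
          f"{last[0]:.1f} Hz (bw {last[1]:.2f} Hz)")
```

Because the bandwidth grows in proportion to the center frequency, low octaves get fine frequency resolution and high octaves get fine time resolution, which is why processing each octave as its own sub-band is a natural choice.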
Disentangling Prosody Representations with Unsupervised Speech Reconstruction
Human speech can be characterized by different components, including semantic
content, speaker identity and prosodic information. Significant progress has
been made in disentangling representations for semantic content and speaker
identity in Automatic Speech Recognition (ASR) and speaker verification tasks
respectively. However, extracting prosodic information remains an open and
challenging research question because of the intrinsic association of different
attributes, such as timbre and rhythm, and because of the need for supervised
training schemes to achieve robust large-scale and speaker-independent ASR. The
aim of this paper is to address the disentanglement of emotional prosody from
speech based on unsupervised reconstruction. Specifically, we identify, design,
implement and integrate three crucial components in our proposed speech
reconstruction model Prosody2Vec: (1) a unit encoder that transforms speech
signals into discrete units for semantic content, (2) a pretrained speaker
verification model to generate speaker identity embeddings, and (3) a trainable
prosody encoder to learn prosody representations. We first pretrain the
Prosody2Vec representations on unlabelled emotional speech corpora, then
fine-tune the model on specific datasets to perform Speech Emotion Recognition
(SER) and Emotional Voice Conversion (EVC) tasks. Both objective (weighted and
unweighted accuracies) and subjective (mean opinion score) evaluations on the
EVC task suggest that Prosody2Vec effectively captures general prosodic
features that can be smoothly transferred to other emotional speech. In
addition, our SER experiments on the IEMOCAP dataset reveal that the prosody
features learned by Prosody2Vec are complementary and beneficial for the
performance of widely used speech pretraining models and surpass the
state-of-the-art methods when combining Prosody2Vec with HuBERT
representations.
Comment: Accepted by IEEE/ACM Transactions on Audio, Speech, and Language
Processing
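The weighted and unweighted accuracies mentioned in the abstract can be sketched as follows. The sketch assumes the convention common in speech emotion recognition work: weighted accuracy (WA) is the overall fraction of correct predictions, and unweighted accuracy (UA) is the mean of the per-class recalls. The paper's exact definitions are not given in the abstract, and the labels below are made-up illustrative data.

```python
from collections import Counter


def weighted_accuracy(y_true, y_pred):
    """Overall accuracy: every sample counts equally."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


def unweighted_accuracy(y_true, y_pred):
    """Mean per-class recall: every class counts equally."""
    totals = Counter(y_true)
    hits = Counter(t for t, p in zip(y_true, y_pred) if t == p)
    return sum(hits[c] / totals[c] for c in totals) / len(totals)


# Illustrative, imbalanced labels (not data from the paper).
y_true = ["neutral"] * 6 + ["angry"] * 2
y_pred = ["neutral"] * 6 + ["neutral", "angry"]

wa = weighted_accuracy(y_true, y_pred)
ua = unweighted_accuracy(y_true, y_pred)
print(f"WA = {wa:.3f}, UA = {ua:.3f}")
```

On imbalanced data the two metrics diverge: a model that favors the majority class keeps a high WA while its UA drops, which is why both are usually reported together.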
Ariel - Volume 10 Number 3
Executive Editors
Madalyn Schaefgen
David Reich
Business Manager
David Reich
News Editors
Medical College
Edward Zurad
CAHS
John Guardiani
World
Mark Zwanger
Features Editors
Meg Trexler
Jim O'Brien
Editorials Editor
Jeffrey Banyas
Photography and Sports Editor
Stuart Singer
Commons Editor
Brenda Peterso
Sourcing Within: A Reflexive Investigation of a Creative Path
My Practice-as-Research doctoral dissertation, Sourcing Within: A Reflexive Investigation of a Creative Path explores potential sources for performer training and the creative process in performance work. I draw extensively from my embodied research in Taijiquan and songs from my Chinese cultural source, which I have explored since the early 1990s. Stemming from this embodied practice, I examine the notion of "care of the self" since Greco-Roman time, and how it can enhance the work of the performer via her physical and vocal presence. My research methodology draws upon Robin Nelson's writings on Practice as Research, Foucault's ideas in Practice of Self and Clark Moustakas's Phenomenological Research, primarily Heuristic Inquiry with the practitioner and her experiences as key resources.
In parallel to care of the self runs the idea of care of the craft as in Konstantin Stanislavski's notion. What I argue for is a persistent practice on/through the source techniques that can lead to a unity of the bodymind, thus elevating the quality of the performer's practice. My hands-on experiences have also nourished my pedagogical work with cross-disciplinary and cross-cultural performing practitioners through my Sourcing Within project (since 2006). All these human interactions and shared experiences have enriched and expanded the scope of my research, as evidenced in my writing.
Part of my Practice-as-Research is a creative synthesis entitled Wandering Sounds which I have created in collaboration with musician Nickolai D. Nickolov. Our collaboration explores the coalescence of text, song, music, and movement within a performance. The performance encompasses "musicality" and asks whether musical and performative integration can yield a renewed path for creativity. It aims at reaching audiences from various cultural and linguistic backgrounds. This piece has served as an important platform in my investigation of whether or not it is possible to transmit the insight gained in a performer's self-practice to the audience in the performance context. Wandering Sounds has also facilitated my inquiry into whether or not the audience can follow the performer's process of self-transformation by watching it in performance.
Application of automatic speech recognition technologies to singing
The research field of Music Information Retrieval is concerned with the automatic analysis of musical characteristics. One aspect that has not received much attention so far is the automatic analysis of sung lyrics. On the other hand, the field of Automatic Speech Recognition has produced many methods for the automatic analysis of speech, but those have rarely been employed for singing. This thesis analyzes the feasibility of applying various speech recognition methods to singing, and suggests adaptations. In addition, the routes to practical applications for these systems are described. Five tasks are considered: Phoneme recognition, language identification, keyword spotting, lyrics-to-audio alignment, and retrieval of lyrics from sung queries. The main bottleneck in almost all of these tasks lies in the recognition of phonemes from sung audio. Conventional models trained on speech do not perform well when applied to singing. Training models on singing is difficult due to a lack of annotated data. This thesis offers two approaches for generating such data sets. For the first one, speech recordings are made more “song-like”. In the second approach, textual lyrics are automatically aligned to an existing singing data set. In both cases, these new data sets are then used for training new acoustic models, offering considerable improvements over models trained on speech. Building on these improved acoustic models, speech recognition algorithms for the individual tasks were adapted to singing by either improving their robustness to the differing characteristics of singing, or by exploiting the specific features of singing performances. Examples of improving robustness include the use of keyword-filler HMMs for keyword spotting, an i-vector approach for language identification, and a method for alignment and lyrics retrieval that allows highly varying durations. 
Features of singing are utilized in various ways: In an approach for language identification that is well-suited for long recordings; in a method for keyword spotting based on phoneme durations in singing; and in an algorithm for alignment and retrieval that exploits known phoneme confusions in singing.
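The idea of exploiting known phoneme confusions during lyrics retrieval can be illustrated with a confusion-weighted edit distance. This is a minimal sketch of one plausible way to use such confusions, not the algorithm described in the thesis. The phoneme symbols, the substitution costs, and the candidate lyrics are all hypothetical.

```python
def confusion_edit_distance(query, reference, sub_cost, indel_cost=1.0):
    """Edit distance in which commonly confused phoneme pairs are cheap to substitute."""
    n, m = len(query), len(reference)
    dist = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        dist[i][0] = i * indel_cost
    for j in range(1, m + 1):
        dist[0][j] = j * indel_cost
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            a, b = query[i - 1], reference[j - 1]
            if a == b:
                substitution = 0.0
            else:
                # Look the pair up in either order; unknown pairs cost a full edit.
                substitution = sub_cost.get((a, b), sub_cost.get((b, a), 1.0))
            dist[i][j] = min(
                dist[i - 1][j] + indel_cost,        # deletion from the query
                dist[i][j - 1] + indel_cost,        # insertion into the query
                dist[i - 1][j - 1] + substitution,  # match or substitution
            )
    return dist[n][m]


# Hypothetical confusion cost: IH and IY are assumed to be easily confused when sung.
sub_cost = {("IH", "IY"): 0.3}

# Hypothetical lyric candidates, given as phoneme sequences.
candidates = {
    "sing": ["S", "IH", "NG"],
    "song": ["S", "AO", "NG"],
}
recognized = ["S", "IY", "NG"]  # noisy phoneme recognition of a sung query

best = min(candidates,
           key=lambda w: confusion_edit_distance(recognized, candidates[w], sub_cost))
print(best)
```

With uniform substitution costs the two candidates would tie; the confusion table breaks the tie in favor of the phonetically closer entry.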
An Overview of Deep-Learning-Based Audio-Visual Speech Enhancement and Separation
Speech enhancement and speech separation are two related tasks, whose purpose
is to extract either one or more target speech signals, respectively, from a
mixture of sounds generated by several sources. Traditionally, these tasks have
been tackled using signal processing and machine learning techniques applied to
the available acoustic signals. Since the visual aspect of speech is
essentially unaffected by the acoustic environment, visual information from the
target speakers, such as lip movements and facial expressions, has also been
used for speech enhancement and speech separation systems. In order to
efficiently fuse acoustic and visual information, researchers have exploited
the flexibility of data-driven approaches, specifically deep learning,
achieving strong performance. The ceaseless proposal of a large number of
techniques to extract features and fuse multimodal information has highlighted
the need for an overview that comprehensively describes and discusses
audio-visual speech enhancement and separation based on deep learning. In this
paper, we provide a systematic survey of this research topic, focusing on the
main elements that characterise the systems in the literature: acoustic
features; visual features; deep learning methods; fusion techniques; training
targets and objective functions. In addition, we review deep-learning-based
methods for speech reconstruction from silent videos and audio-visual sound
source separation for non-speech signals, since these methods can be more or
less directly applied to audio-visual speech enhancement and separation.
Finally, we survey commonly employed audio-visual speech datasets, given their
central role in the development of data-driven approaches, and evaluation
methods, because they are generally used to compare different systems and
determine their performance.
Fundamental frequency modelling: an articulatory perspective with target approximation and deep learning
Current statistical parametric speech synthesis (SPSS) approaches typically aim at state/frame-level acoustic modelling, which leads to a problem of frame-by-frame independence. Besides that, whichever learning technique is used, hidden Markov model (HMM), deep neural network (DNN) or recurrent neural network (RNN), the fundamental idea is to set up a direct mapping from linguistic to acoustic features. Although progress is frequently reported, this idea is questionable in terms of biological plausibility. This thesis aims at addressing the above issues by integrating dynamic mechanisms of human speech production as a core component of F0 generation and thus developing a more human-like F0 modelling paradigm. By introducing an articulatory F0 generation model – target approximation (TA) – between text and speech that controls syllable-synchronised F0 generation, contextual F0 variations are processed in two separate yet integrated stages: linguistic to motor, and motor to acoustic. With the goal of demonstrating that human speech movement can be considered as a dynamic process of target approximation and that the TA model is a valid F0 generation model to be used at the motor-to-acoustic stage, a TA-based pitch control experiment is conducted first to simulate the subtle human behaviour of online compensation for pitch-shifted auditory feedback. Then, the TA parameters are collectively controlled by linguistic features via a deep or recurrent neural network (DNN/RNN) at the linguistic-to-motor stage. We trained the systems on a Mandarin Chinese dataset consisting of both statements and questions. The TA-based systems generally outperformed the baseline systems in both objective and subjective evaluations. Furthermore, the number of required linguistic features was reduced first to syllable level only (with DNN) and then with all positional information removed (with RNN).
Fewer linguistic features as input, with a limited number of TA parameters as output, led to less training data and lower model complexity, which in turn led to more efficient training and faster synthesis.
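The motor-to-acoustic stage can be illustrated with a small sketch of target approximation within a single syllable. The abstract does not state the equations, so the sketch assumes the commonly used quantitative form of the model: a linear pitch target m*t + b approached through a third-order critically damped system with rate lam, with coefficients fixed by the F0 value, velocity and acceleration at syllable onset. All parameter values below are illustrative, not taken from the thesis.

```python
import math


def qta_f0(t, m, b, lam, f0_init, df0_init=0.0, ddf0_init=0.0):
    """F0 at time t (seconds from syllable onset) under the assumed TA form.

    m, b      : slope and height of the linear pitch target x(t) = m*t + b
    lam       : rate at which the target is approached
    f0_init   : F0 at syllable onset (carried over from the previous syllable)
    df0_init  : F0 velocity at onset
    ddf0_init : F0 acceleration at onset
    """
    # Coefficients chosen so that F0 and its first two derivatives
    # match the onset state at t = 0.
    c1 = f0_init - b
    c2 = df0_init + c1 * lam - m
    c3 = (ddf0_init + 2.0 * c2 * lam - c1 * lam ** 2) / 2.0
    target = m * t + b
    transient = (c1 + c2 * t + c3 * t * t) * math.exp(-lam * t)
    return target + transient


# Illustrative syllable: static target at 90 (e.g. semitones), onset at 85.
for t in (0.0, 0.05, 0.10, 0.20):
    print(f"t = {t:.2f} s  F0 = {qta_f0(t, m=0.0, b=90.0, lam=40.0, f0_init=85.0):.2f}")
```

In this form a syllable is described by only three parameters (m, b, lam), which is consistent with the abstract's point that a limited number of TA parameters keeps the network output small.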