47 research outputs found

    Singing Voice Synthesis with Vibrato Modeling and Latent Energy Representation

    This paper proposes an expressive singing voice synthesis system that introduces explicit vibrato modeling and a latent energy representation. Vibrato is an inherent characteristic of human singing and essential to the naturalness of synthesized sound, so a deep learning-based vibrato model is introduced to control the vibrato's likeliness, rate, depth, and phase, where vibrato likeliness represents the probability that vibrato is present and helps improve the naturalness of the singing voice. Because existing singing corpora contain no annotated vibrato-likeliness labels, a novel labeling method is adopted to label vibrato likeliness automatically. Meanwhile, the power spectrogram of the audio contains rich information that can improve the expressiveness of singing, so an autoencoder-based latent energy bottleneck feature is proposed for expressive singing voice synthesis. Experimental results on the open dataset NUS48E show that both the vibrato modeling and the latent energy representation significantly improve the expressiveness of the singing voice. Audio samples are available on the demo website.
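The controllable parameters the abstract names (likeliness, rate, depth, and phase) can be illustrated with a minimal sketch. The sinusoidal-modulation form and every function name below are illustrative assumptions, not the paper's actual model:

```python
import math

def apply_vibrato(f0, frame_rate, rate_hz=5.5, depth_semitones=0.5,
                  phase=0.0, likeliness=1.0):
    """Modulate an F0 contour (Hz) with sinusoidal vibrato.

    likeliness gates the modulation depth: 0 -> no vibrato, 1 -> full depth.
    Depth is expressed in semitones, then converted to a pitch multiplier.
    """
    out = []
    for n, f in enumerate(f0):
        t = n / frame_rate
        semis = likeliness * depth_semitones * math.sin(
            2 * math.pi * rate_hz * t + phase)
        out.append(f * 2 ** (semis / 12))  # semitone offset -> Hz factor
    return out

# A flat 220 Hz contour at 100 frames/s gains a +/-0.5 semitone wobble.
contour = apply_vibrato([220.0] * 200, frame_rate=100)
```

Setting `likeliness` between 0 and 1 scales the vibrato extent continuously, which is one simple way a predicted existence probability could gate the effect.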

    Controllable Singing Voice Synthesis Using a Conditional Autoregressive Neural Network (조건부 자기회귀형 인공신경망을 이용한 제어 가능한 가창 음성 합성)

    Thesis (Ph.D.) -- Graduate School of Convergence Science and Technology, Department of Intelligence and Information Convergence, Seoul National University, August 2022. Advisor: Kyogu Lee (이교구). Singing voice synthesis aims at synthesizing a natural singing voice from given input information. A successful singing synthesis system is important not only because it can significantly reduce the cost of the music production process, but also because it helps creators reflect their intentions more easily and conveniently. However, designing such a system poses three challenges: 1) the various elements that make up the singing should be independently controllable; 2) the system must be able to generate high-quality sound; and 3) sufficient training data is difficult to secure. To address these problems, we first turned to the source-filter theory, a representative model of speech production. We sought to secure training-data efficiency and controllability at the same time by modeling the singing voice as the convolution of a source, carrying the pitch information, and a filter, carrying the pronunciation information, and by designing a structure that can model each independently. In addition, we used a conditional autoregressive deep neural network to model sequential data effectively when conditional inputs such as pronunciation, pitch, and speaker are given. So that the framework generates high-quality audio with a distribution closer to that of real singing, adversarial training was applied during training. Finally, we applied a self-supervised style modeling technique to model detailed, unlabeled musical expression. We confirmed that the proposed model can flexibly control elements such as pronunciation, pitch, timbre, singing style, and musical expression while synthesizing high-quality singing that is difficult to distinguish from ground-truth singing.
Furthermore, we proposed a generation and modification framework that reflects the actual music production process, and confirmed that it can be applied to expand the limits of the creator's imagination, for example in new voice design and cross-generation. The thesis comprises: 1 Introduction; 2 Background; 3 Adversarially Trained End-to-end Korean Singing Voice Synthesis System; 4 Disentangling Timbre and Singing Style with a Multi-singer Singing Synthesis System; 5 Expressive Singing Synthesis Using Local Style Token and Dual-path Pitch Encoder; 6 Conclusion.
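The source-filter decomposition the thesis builds on (singing as the convolution of a pitch-carrying source with a pronunciation-carrying filter) can be sketched numerically. The pulse train and decaying-exponential impulse response below are toy stand-ins, not the thesis's neural source and filter models:

```python
import math

def pulse_train(f0_hz, sr, n_samples):
    """Glottal-like source: unit impulses spaced at the pitch period."""
    period = int(round(sr / f0_hz))
    return [1.0 if n % period == 0 else 0.0 for n in range(n_samples)]

def convolve(x, h):
    """Direct-form convolution of source x with filter impulse response h."""
    y = [0.0] * (len(x) + len(h) - 1)
    for i, xi in enumerate(x):
        for j, hj in enumerate(h):
            y[i + j] += xi * hj
    return y

# A decaying exponential stands in for a vocal-tract impulse response.
h = [math.exp(-n / 4.0) for n in range(16)]
src = pulse_train(f0_hz=200, sr=8000, n_samples=400)
speechlike = convolve(src, h)
```

Because pitch lives entirely in `src` and timbre/pronunciation entirely in `h`, either factor can be changed independently, which is the controllability argument the abstract makes.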

    Models and Analysis of Vocal Emissions for Biomedical Applications

    The International Workshop on Models and Analysis of Vocal Emissions for Biomedical Applications (MAVEBA) was founded in 1999 out of a strongly felt need to share know-how, objectives, and results between areas that until then had seemed quite distinct, such as bioengineering, medicine, and singing. MAVEBA deals with all aspects of the study of the human voice, with applications ranging from the neonate to the adult and the elderly. Over the years the initial topics have grown and spread into other areas of research, such as occupational voice disorders, neurology, rehabilitation, and image and video analysis. MAVEBA takes place every two years in Firenze, Italy.

    A review of differentiable digital signal processing for music and speech synthesis

    The term “differentiable digital signal processing” describes a family of techniques in which loss function gradients are backpropagated through digital signal processors, facilitating their integration into neural networks. This article surveys the literature on differentiable audio signal processing, focusing on its use in music and speech synthesis. We catalogue applications to tasks including music performance rendering, sound matching, and voice transformation, discussing the motivations for and implications of the use of this methodology. This is accompanied by an overview of digital signal processing operations that have been implemented differentiably, which is further supported by a web book containing practical advice on differentiable synthesiser programming (https://intro2ddsp.github.io/). Finally, we highlight open challenges, including optimisation pathologies, robustness to real-world conditions, and design trade-offs, and discuss directions for future research
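The core idea, loss gradients flowing through a signal processor into its parameters, can be sketched with a one-parameter oscillator and an analytic gradient. This toy is an illustrative assumption, not an example from the article:

```python
import math

def render(amp, freq, times):
    """A sinusoidal oscillator: the 'digital signal processor' here."""
    return [amp * math.sin(2 * math.pi * freq * t) for t in times]

def fit_amplitude(target, freq, times, lr=0.01, steps=200):
    """Gradient descent on the amplitude, through the oscillator.

    Fitting frequency the same way runs into the oscillatory loss
    surfaces ("optimisation pathologies") the survey discusses, so this
    toy sticks to the well-behaved amplitude parameter.
    """
    amp = 0.0
    for _ in range(steps):
        pred = render(amp, freq, times)
        # Analytic dL/d(amp) for L = sum (pred - target)^2.
        grad = sum(2 * (p - y) * math.sin(2 * math.pi * freq * t)
                   for p, y, t in zip(pred, target, times))
        amp -= lr * grad
    return amp

times = [n / 100 for n in range(100)]
target = render(0.7, 3.0, times)
est = fit_amplitude(target, freq=3.0, times=times)
```

In a real DDSP system an autodiff framework supplies the gradient instead of the hand-derived expression, but the optimization loop has the same shape.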

    Models and analysis of vocal emissions for biomedical applications: 5th International Workshop: December 13-15, 2007, Firenze, Italy

    The MAVEBA Workshop proceedings, published on a biennial basis, collect the scientific papers presented as oral and poster contributions during the conference. The main subjects are the development of theoretical and mechanical models as an aid to the study of the main phonatory dysfunctions, and biomedical engineering methods for the analysis of voice signals and images in support of the clinical diagnosis and classification of vocal pathologies. The Workshop is sponsored by: Ente Cassa Risparmio di Firenze, COST Action 2103, the Biomedical Signal Processing and Control journal (Elsevier), and the IEEE Biomedical Engineering Society. Special issues of international journals have been, and will be, published collecting selected papers from the conference.

    Modeling Control Parameters for Singing Voice Synthesis (Modélisation des paramètres de contrôle pour la synthèse de voix chantée)

    The state of the art in voice synthesis, and concatenative synthesis in particular, now yields a quality of elocution close to that of the real voice, for speech as well as for singing. But a synthesis that is both natural and expressive cannot be designed without appropriate control, covering many timbral and prosodic aspects and their interdependencies. For singing, the fundamental frequency (F0), which carries the melody as well as certain stylistic aspects, must be considered first. A method for modeling the F0 curve from the score, based on B-splines, has been developed. It provides a parametric representation of the expressive variations of F0, such as vibrato, attacks, and transitions between notes, with intuitive control. A first study established that such a representation can satisfactorily reproduce the variations specific to different singing styles. But manually tuning the full set of parameters remains tedious, so automatic management of these parameters, based on learning and on rules, is needed to reduce the amount of manual tuning required. The parameters considered vary from one singing style to another. Extracting them from recordings, together with score-related contexts, should therefore capture the characteristics of a singer's interpretative style, while preserving the variability and coherence needed to produce natural singing.
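The B-spline parameterization of the F0 curve can be sketched with a uniform cubic B-spline; the control-point values below are arbitrary illustrative pitches, not parameters from the work:

```python
def cubic_bspline(ctrl, n_points=100):
    """Evaluate a uniform cubic B-spline over its control polygon.

    A handful of control points yields a smooth curve, which is what
    makes the representation easy to edit: moving one point reshapes
    only a local stretch of the F0 contour.
    """
    curve = []
    n_seg = len(ctrl) - 3
    for k in range(n_points):
        s = k / n_points * n_seg
        i = min(int(s), n_seg - 1)  # segment index
        u = s - i                   # local parameter in [0, 1)
        # Uniform cubic B-spline basis functions.
        b = ((1 - u) ** 3 / 6,
             (3 * u ** 3 - 6 * u ** 2 + 4) / 6,
             (-3 * u ** 3 + 3 * u ** 2 + 3 * u + 1) / 6,
             u ** 3 / 6)
        curve.append(sum(w * p for w, p in zip(b, ctrl[i:i + 4])))
    return curve

# Control points in Hz: a note attack rising into a held pitch with a dip.
f0 = cubic_bspline([180, 220, 225, 215, 222, 220], n_points=120)
```

Because the basis weights are non-negative and sum to one, the curve stays inside the convex hull of the control points, so edits never produce wild pitch excursions.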

    Conveying expressivity and vocal effort transformation in synthetic speech with Harmonic plus Noise Models

    This thesis was conducted in the Grup en Tecnologies Mèdia (GTM) of the Escola d'Enginyeria i Arquitectura la Salle. The group has a long trajectory in the speech synthesis field and has developed its own Unit-Selection Text-To-Speech (US-TTS) system, which can convey multiple expressive styles using multiple expressive corpora, one for each style. Thus, to convey aggressive speech the US-TTS uses an aggressive corpus, whereas for a sensual style it uses a sensual corpus. Unlike that approach, this dissertation presents a new scheme that enhances the flexibility of the US-TTS system so it can perform multiple expressive styles using a single neutral corpus. The approach is based on applying Digital Signal Processing (DSP) techniques to modify the synthesized speech so that it conveys the desired expressive style. The Harmonics plus Noise Model (HNM) was chosen for these modifications because of its flexibility. Voice Quality (VoQ) has been shown to play an important role in different expressive styles, so low-level VoQ acoustic parameters were first explored for conveying multiple emotions. This raised several problems that set new objectives for the rest of the thesis, among them finding a single parameter with a strong impact on the expressive style conveyed. Vocal Effort (VE) was selected for its salient role in expressive speech. The first approach transferred VE between two parallel utterances using the Adaptive Pre-emphasis Linear Prediction (APLP) technique. It transferred VE successfully, but the model was limited in its flexibility to generate new, intermediate VE levels. To improve the flexibility and control of the conveyed VE, a new approach using a polynomial model of VE was presented. This model not only transferred VE levels between two different utterances, but also generated VE levels other than those present in the speech corpus. This is aligned with the general goal of the thesis: allowing US-TTS systems to convey multiple expressive styles with a single neutral corpus. Moreover, the proposed methodology introduces a parameter for controlling the degree of VE in the synthesized speech, which opens new possibilities for controlling the synthesis process through simple and intuitive graphical interfaces, as in the CreaVeu project, also conducted in the GTM group. The dissertation concludes with a review of the work and a proposal for modifying the US-TTS schema to incorporate the VE modification blocks designed in this dissertation.
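The spectral-tilt intuition behind vocal-effort modification can be sketched with a fixed first-order pre-emphasis filter. This is a deliberately simplified stand-in for the adaptive APLP technique, and the effort-to-coefficient mapping below is an invented illustration:

```python
import math

def tilt_filter(x, alpha):
    """First-order pre-emphasis: y[n] = x[n] - alpha * x[n-1].

    Larger alpha attenuates low frequencies relative to highs, crudely
    mimicking the flatter spectral tilt of high vocal effort. The APLP
    technique in the thesis adapts this coefficient; here it is fixed.
    """
    return [xi - alpha * (x[i - 1] if i else 0.0) for i, xi in enumerate(x)]

def effort_alpha(level, lo=0.3, hi=0.95):
    """Map a vocal-effort level in [0, 1] to a pre-emphasis coefficient.

    The endpoints are arbitrary illustrative values, not thesis results.
    """
    return lo + level * (hi - lo)

def rms(x):
    return math.sqrt(sum(v * v for v in x) / len(x))

# A low-frequency sinusoid loses more energy as the effort level rises.
tone = [math.sin(2 * math.pi * 50 * n / 8000) for n in range(800)]
soft = tilt_filter(tone, effort_alpha(0.0))
loud = tilt_filter(tone, effort_alpha(1.0))
```

A single continuous `level` knob like this is the kind of user-facing control parameter the abstract describes for interfaces such as CreaVeu's.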

    Models and Analysis of Vocal Emissions for Biomedical Applications

    The Models and Analysis of Vocal Emissions for Biomedical Applications (MAVEBA) workshop was founded in 1999 out of a strongly felt need to share know-how, objectives, and results between areas that until then had seemed quite distinct, such as bioengineering, medicine, and singing. MAVEBA deals with all aspects of the study of the human voice, with applications ranging from the neonate to the adult and the elderly. Over the years the initial topics have grown and spread into other areas of research, such as occupational voice disorders, neurology, rehabilitation, and image and video analysis. MAVEBA takes place every two years in Firenze, Italy.

    The Pitch Range of Italians and Americans. A Comparative Study

    Linguistic experiments have investigated the nature of F0 span and level in cross-linguistic comparisons. However, only a few studies have focused on elaborating a generally agreed methodology that might provide a unifying approach to the analysis of pitch range (Ladd, 1996; Patterson and Ladd, 1999; Daly and Warren, 2001; Bishop and Keating, 2010; Mennen et al., 2012). Pitch variation is used in different languages to convey different linguistic and paralinguistic meanings, ranging from the expression of sentence modality to the marking of emotional and attitudinal nuances (Grice and Baumann, 2007). A number of factors must be taken into consideration when determining the existence of measurable and reliable differences in pitch values. Daly and Warren (2001) demonstrated the importance of independent variables such as language, age, body size, speaker sex (female vs. male), socio-cultural background, regional accent, speech task (read sentences vs. spontaneous dialogues), sentence type (questions vs. statements), and measurement scale (Hertz, semitones, ERB, etc.). Consistent with the model proposed by Mennen et al. (2012), my analysis of pitch range is based on the investigation of LTD (long-term distributional) and linguistic measures. LTD measures deal with the F0 distribution within a speaker's contour (e.g. F0 minimum, F0 maximum, F0 mean, F0 median, standard deviation, F0 span), while linguistic measures are linked to specific targets within the contour, such as peaks and valleys (high and low landmarks), and preserve the temporal sequence of pitch contours. This investigation analyzed the characteristics of pitch range production and perception in English sentences uttered by Americans and Italians.
Four experiments were conducted to examine different phenomena: i) the contrast between measures of F0 level and span in utterances produced by Americans and Italians (experiments 1-2); ii) the contrast between the pitch range produced by males and females in L1 and L2 (experiment 1); iii) the F0 patterns in different sentence types, that is, yes-no questions, wh-questions, and exclamations (experiment 2); iv) listeners’ evaluations of pitch span in terms of ±interesting, ±excited, ±credible, ±friendly ratings of different sentence types (experiments 3-4); v) the correlation between pitch span of the sentences and the evaluations given by American and Italian listeners (experiment 3); vi) the listeners’ evaluations of pitch span values in manipulated stimuli, whose F0 span was re-synthesized under three conditions: narrow span, original span, and wide span (experiment 4); vii) the different evaluations given to the sentences by male and female listeners. The results of this investigation supported the following generalizations. First, pitch span more than level was found to be a cue for non-nativeness, because L2 speakers of English used a narrower span, compared to the native norm. What is more, the experimental data in the production studies indicated that the mode of sentences was better captured by F0 span than level. Second, the Italian learners of English were influenced by their L1 and transferred L1 pitch range variation into their L2. The English sentences produced by the Italians had overall higher pitch levels and narrower pitch span than those produced by the Americans. In addition, the Italians used overall higher pitch levels when speaking Italian and lower levels when speaking English. Conversely, their pitch span was generally higher in English and lower in Italian. 
When comparing productions in English, the Italian females used higher F0 levels than the American females, whereas the Italian males showed slightly lower F0 levels than the American males. Third, there was a systematic relation between pitch span values and the listeners' evaluations of the sentences. The two groups of listeners (the Americans and the Italians) rated the stimuli with a larger pitch span as more interesting, exciting, and credible than those with a narrower pitch span; the listeners thus relied on the perceived pitch span to differentiate among the stimuli. Fourth, both the American and the Italian speakers were judged more friendly when the pitch span of their sentences was widened (wide-span manipulation) and less friendly when it was narrowed (narrow-span manipulation). This held for all stimuli regardless of the speakers' native language (American vs. Italian).
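The LTD measures listed above (F0 minimum, maximum, mean, median, standard deviation, and span) can be computed directly from an F0 contour. The semitone conversion is the standard 12 * log2 ratio; the helper itself is a hypothetical sketch, not the study's analysis script:

```python
import math
import statistics

def ltd_measures(f0_hz):
    """Long-term distributional pitch-range measures over an F0 contour.

    Span is reported in semitones, 12 * log2(max / min), the scale on
    which cross-speaker and cross-language comparisons are made.
    Frames with F0 <= 0 are treated as unvoiced and dropped.
    """
    voiced = [f for f in f0_hz if f > 0]
    lo, hi = min(voiced), max(voiced)
    return {
        "min_hz": lo,
        "max_hz": hi,
        "mean_hz": statistics.mean(voiced),
        "median_hz": statistics.median(voiced),
        "sd_hz": statistics.pstdev(voiced),
        "span_st": 12 * math.log2(hi / lo),
    }

# A contour spanning one octave (100-200 Hz) has a 12-semitone span.
m = ltd_measures([100, 0, 125, 150, 0, 175, 200])
```

Reporting span in semitones rather than Hertz is what makes the male/female and cross-language comparisons in the study commensurable, since equal musical intervals correspond to equal semitone distances at any register.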

    Proceedings of the 7th Sound and Music Computing Conference

    Proceedings of the SMC2010 - 7th Sound and Music Computing Conference, July 21st - July 24th 2010