57 research outputs found

    Speech wave-form driven motion synthesis for embodied agents

    Get PDF
    The main objective of this thesis is to synthesise motion from speech, especially in conversation. While previous research has investigated different acoustic features and combinations of them, no one has estimated head motion directly from the waveform, which is the source from which those features are derived. We therefore study the direct use of the speech waveform to generate head motion. We claim that creating a task-specific feature from the waveform leads to better overall performance than using standard acoustic features, and at the same time we abandon the handcrafted feature-extraction process entirely, making the pipeline more efficient. However, using the raw waveform raises two problems: 1) high dimensionality, since the waveform has far more dimensions than common acoustic features, which makes model training more difficult; and 2) irrelevant information, since the full content of the original waveform can encumber neural-network training. To resolve these problems, we apply a deep canonically correlated constrained auto-encoder (DCCCAE) to compress the waveform into low-dimensional embedded features that are highly correlated with head motion. The estimated head motion was evaluated both objectively and subjectively. The objective evaluation confirmed that the DCCCAE yields a feature more correlated with head motion than a standard auto-encoder and popular spectral features such as MFCC and FBank, and that this advantage can be used to achieve state-of-the-art results in predicting natural head motion. Besides this representation learning, we also explored LSTM-based regression models for the proposed feature. The LSTM-based models boosted overall performance in the objective evaluation and adapted better to the proposed feature than to MFCC. Results of a MUSHRA-like subjective evaluation suggest that participants preferred the animations generated by models using the proposed feature over those of the other models, and an A/B test further showed that the LSTM-based regression model adapts better to the proposed feature. Furthermore, we extended the architecture to estimate upper-body motion as well. We submitted our results to the GENEA 2020 challenge, where our model scored higher than the BA baseline in both aspects (human-likeness and appropriateness) according to participants' preferences, suggesting that the highly correlated feature pair and the sequential estimation helped improve the model's generalisation.
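    A minimal sketch of the core idea, assuming PyTorch and toy tensor shapes: an auto-encoder on waveform frames whose bottleneck is also pushed to correlate with head motion. The per-dimension correlation term and the linear motion projection are illustrative stand-ins for the DCCCAE's canonical-correlation constraint, not the thesis implementation.

```python
# Sketch (not the thesis code): autoencoder on waveform frames whose bottleneck
# is encouraged to correlate with head motion via an auxiliary correlation loss.
import torch
import torch.nn as nn

class CorrelatedAE(nn.Module):
    def __init__(self, wav_dim=1600, emb_dim=32, motion_dim=3):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(wav_dim, 256), nn.ReLU(),
                                     nn.Linear(256, emb_dim))
        self.decoder = nn.Sequential(nn.Linear(emb_dim, 256), nn.ReLU(),
                                     nn.Linear(256, wav_dim))
        self.project_motion = nn.Linear(motion_dim, emb_dim)  # motion view for the correlation term

    def forward(self, wav):
        z = self.encoder(wav)
        return z, self.decoder(z)

def correlation_loss(z, m):
    # Negative mean Pearson correlation between matched embedding/motion dimensions.
    z = (z - z.mean(0)) / (z.std(0) + 1e-6)
    m = (m - m.mean(0)) / (m.std(0) + 1e-6)
    return -(z * m).mean()

# toy batch: 1600-sample waveform frames paired with 3-DoF head rotations
wav = torch.randn(8, 1600)
motion = torch.randn(8, 3)
model = CorrelatedAE()
z, recon = model(wav)
loss = nn.functional.mse_loss(recon, wav) + correlation_loss(z, model.project_motion(motion))
loss.backward()
```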

    Head motion synthesis: evaluation and a template motion approach

    Get PDF
    The use of conversational agents has increased across the world. From providing automated support for companies to acting as virtual psychologists, they have moved from an academic curiosity to an application with real-world relevance. While many researchers have focused on the content of the dialogue and on synthetic speech to give the agents a voice, animating these characters has more recently become a topic of interest. An additional use for character animation technology is in the film and video-game industry, where animating characters without paying for expensive labour would save tremendous costs. When animating characters there are many aspects to consider, for example the way they walk. However, to truly assist with communication, automated animation needs to reproduce the body language used when speaking. In particular, conversational agents are often animated only from the upper body, so head motion is one of the keys to a believable agent. While certain linguistic features are obvious, such as nodding to indicate agreement, research has shown that head motion also aids the understanding of speech. Additionally, head motion often carries emotional cues, prosodic information, and other paralinguistic information. In this thesis we present our research into synthesising head motion using only recorded speech as input. During this research we collected a large dataset of head motion synchronised with speech, examined evaluation methodology, and developed a synthesis system. Our dataset is one of the larger ones available, and from it we present some statistics about head motion in general, including differences between read speech and storytelling speech, and differences between speakers. From these we draw conclusions about which types of source data will be most interesting for head-motion research, and whether speaker-dependent models are needed for synthesis. In our examination of head-motion evaluation methodology we introduce Forced Canonical Correlation Analysis (FCCA). FCCA distinguishes head-motion-shaped noise from motion capture better than the standard objective evaluation methods used in the literature. We show that for subjective testing it is best practice to use a variation of MUltiple Stimuli with Hidden Reference and Anchor (MUSHRA) testing, adapted for head motion, and through experimentation we develop guidelines for implementing the test and for constraints on its length. Finally, we present a new system for head-motion synthesis. We make use of simple motion templates, automatically extracted from source data, that are warped to suit the speech features. Our system uses clustering to pick the small motion units, and a combined HMM- and GMM-based approach to determine the warping parameter values at synthesis time. This results in highly natural-looking motion that outperforms other state-of-the-art systems, while requiring minimal human intervention and producing believable motion. The key innovations were the new methods for segmenting head motion and a process, similar to language modelling, for synthesising it.
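    For the evaluation side, the sketch below illustrates the canonical-correlation family of objective measures on which FCCA builds, using scikit-learn's CCA on toy speech/motion streams; the "forced" variant described in the thesis is not reproduced here, and the feature dimensions are assumptions.

```python
# Hedged sketch: first canonical correlation between time-aligned speech
# features and head motion, as a simple CCA-based objective score.
import numpy as np
from sklearn.cross_decomposition import CCA

def canonical_correlation(speech_feats, head_motion, n_components=1):
    """First canonical correlation between two time-aligned feature streams."""
    cca = CCA(n_components=n_components)
    u, v = cca.fit_transform(speech_feats, head_motion)
    return np.corrcoef(u[:, 0], v[:, 0])[0, 1]

# toy streams: 500 frames of 13-dim speech features vs. 3-DoF head rotation
rng = np.random.default_rng(0)
speech = rng.standard_normal((500, 13))
motion = 0.5 * speech[:, :3] + 0.5 * rng.standard_normal((500, 3))
print(canonical_correlation(speech, motion))
```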

    A large, crowdsourced evaluation of gesture generation systems on common data: the GENEA Challenge 2020

    Get PDF
    Co-speech gestures, gestures that accompany speech, play an important role in human communication. Automatic co-speech gesture generation is thus a key enabling technology for embodied conversational agents (ECAs), since humans expect ECAs to be capable of multi-modal communication. Research into gesture generation is rapidly gravitating towards data-driven methods. Unfortunately, individual research efforts in the field are difficult to compare: there are no established benchmarks, and each study tends to use its own dataset, motion visualisation, and evaluation methodology. To address this situation, we launched the GENEA Challenge, a gesture-generation challenge wherein participating teams built automatic gesture-generation systems on a common dataset, and the resulting systems were evaluated in parallel in a large, crowdsourced user study using the same motion-rendering pipeline. Since differences in evaluation outcomes between systems are now solely attributable to differences between the motion-generation methods, this enables benchmarking recent approaches against one another in order to get a better impression of the state of the art in the field. This paper reports on the purpose, design, results, and implications of our challenge. Part of proceedings: ISBN 978-1-4503-8017-1.

    Audio-Visual Speech Enhancement Based on Deep Learning

    Get PDF

    Progressive Perceptual Audio Rendering of Complex Scenes

    Get PDF
    Despite recent advances, including sound-source clustering and perceptual auditory masking, high-quality rendering of complex virtual scenes with thousands of sound sources remains a challenge. Two major bottlenecks appear as the scene complexity increases: the cost of clustering itself, and the cost of pre-mixing source signals within each cluster. In this paper, we first propose an improved hierarchical clustering algorithm that remains efficient for large numbers of sources and clusters while providing progressive refinement capabilities. We then present a lossy pre-mixing method based on a progressive representation of the input audio signals and the perceptual importance of each sound source. Our quality-evaluation user tests indicate that the recently introduced audio saliency map is inappropriate for this task. Consequently, we propose a "pinnacle", loudness-based metric, which gives the best results for a variety of target computing budgets. We also performed a perceptual pilot study which indicates that, in audio-visual environments, it is better to allocate more clusters to visible sound sources, and we propose a new clustering metric using this result. As a result of these three solutions, our system can provide high-quality rendering of thousands of 3D sound sources on a "gamer-style" PC.
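    As a rough illustration of the budget-allocation idea, assuming NumPy and made-up numbers: sources are ranked by a loudness-based importance score, with extra weight for visible sources as the pilot study suggests. The paper's actual "pinnacle" metric and clustering algorithm are not reproduced here.

```python
# Minimal sketch (assumed interface, not the paper's code): split a per-frame
# processing budget across sources by loudness-based importance, biased
# towards sources that are visible on screen.
import numpy as np

def allocate_budget(loudness_db, visible, total_ops, visual_bias=1.5):
    """Split `total_ops` signal-processing operations across sources.

    loudness_db : per-source loudness estimate (dB)
    visible     : boolean mask, True for sources in view
    """
    importance = 10.0 ** (loudness_db / 20.0)                # back to linear amplitude
    importance = importance * np.where(visible, visual_bias, 1.0)
    share = importance / importance.sum()
    return np.floor(share * total_ops).astype(int)

loudness = np.array([-12.0, -30.0, -18.0, -45.0])
visible = np.array([True, False, True, False])
print(allocate_budget(loudness, visible, total_ops=1000))
```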

    Machine Learning for Auditory Hierarchy

    Get PDF
    Coleman, W. (2021). Machine Learning for Auditory Hierarchy. This dissertation is submitted for the degree of Doctor of Philosophy, Technological University Dublin. Audio content is predominantly delivered as a stereo audio file containing a static, pre-formed mix. The content creator makes volume, position, and effects decisions, generally for presentation over stereo speakers, but ultimately has no control over how the content will be consumed. This leads to a poor listener experience when, for example, a feature film is mixed such that the dialogue sits at a low level relative to the sound effects: consumers complain that they must turn the volume up to hear the words, then back down again because the effects are too loud. Addressing this problem requires a television mix optimised for the stereo speakers used in the vast majority of homes, which is not always available.

    Automating the Production of the Balance Mix in Music Production

    Get PDF
    Historically, the junior engineer was an individual who assisted the sound engineer in producing a mix by performing a number of mixing and pre-processing tasks ahead of the main session. With improvements in technology these tasks can be done more efficiently, so many aspects of this role are now assigned to the lead engineer. Similarly, these technological advances mean that amateur producers now have access to similar mixing tools at home, without the need for studio time or record-label investment. As the junior engineer's role is now embedded in the process, it creates a steeper learning curve for these amateur engineers and adds time to the mixing process. In order to build tools that help users overcome the hurdles associated with this increased workload, we first aim to quantify the role of a modern studio engineer. To do this, a production environment was built to collect session data, allowing subjects to construct a balance mix, which is the starting point of the mixing life-cycle. This balance mix is generally designed to ensure that all the recordings in a mix are audible, as well as to build routing structures and apply pre-processing. Improvements in web technologies allow this data-collection system to run in a browser, making remote data acquisition feasible in a short space of time. The data collected in this study were then used to develop a set of assistive tools, designed to be non-intrusive and to provide guidance, allowing the engineer to understand the process. From the data, grouping the audio tracks proved to be one of the most important yet overlooked tasks in the production life-cycle: this step is often misunderstood by novice engineers, and doing it well can enhance the quality of the final product. The first assistive tool we present in this thesis takes multi-track audio sessions and uses semantic information to group and label them. The system can work with any collection of audio tracks and can be embedded in a production environment. It was also apparent from the data that the minimisation of masking is a primary task of the mixing stage. We therefore present a tool which can automatically balance a mix by minimising the masking between separate audio tracks. Using evolutionary computing as a solver, the mix space can be searched effectively without requiring complex models to be trained on production data. The evaluation of these systems shows they are capable of producing a session structure similar to that of a real engineer. This provides a balance mix that is routed and pre-processed before creative mixing takes place, giving the engineer several steps completed for them, much like the work of a junior engineer.
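    A hedged sketch of the masking-minimisation step, assuming NumPy: a small evolutionary search over per-track gains that reduces a crude spectral-overlap proxy. The masking model, population settings, and fitness function used in the thesis will differ.

```python
# Illustrative evolutionary search over per-track gains: the fitness is a toy
# spectral-overlap proxy for masking, not a perceptual masking model.
import numpy as np

rng = np.random.default_rng(42)

def masking_proxy(spectra, gains):
    """Sum of pairwise spectral overlap after applying per-track gains."""
    scaled = spectra * gains[:, None]
    total = 0.0
    for i in range(len(scaled)):
        for j in range(i + 1, len(scaled)):
            total += np.minimum(scaled[i], scaled[j]).sum()
    return total

def evolve_gains(spectra, pop_size=32, generations=200, sigma=0.05):
    n_tracks = spectra.shape[0]
    pop = rng.uniform(0.2, 1.0, size=(pop_size, n_tracks))
    for _ in range(generations):
        fitness = np.array([masking_proxy(spectra, g) for g in pop])
        parents = pop[np.argsort(fitness)[: pop_size // 2]]   # keep best half
        children = parents + rng.normal(0.0, sigma, parents.shape)
        pop = np.clip(np.vstack([parents, children]), 0.05, 1.0)
    return pop[np.argmin([masking_proxy(spectra, g) for g in pop])]

# toy "mix": 4 tracks x 64 magnitude-spectrum bins
spectra = rng.uniform(0.0, 1.0, size=(4, 64))
print(evolve_gains(spectra).round(2))
```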

    Computing Methodologies Supporting the Preservation of Electroacoustic Music from Analog Magnetic Tape

    Get PDF
    Electroacoustic music on analog magnetic tape is characterized by several carrier-related specificities that have to be considered when creating a digital preservation copy of a document. The tape recorder needs to be set up with the correct speed and equalization; moreover, the magnetic tape may present intentional or unintentional alterations. During both the creation and the musicological analysis of a digital preservation copy, the quality of the work can be affected by lapses in human attention. This paper presents a methodology based on neural networks that recognizes and classifies the alterations of a magnetic tape from video of the tape itself flowing over the head of the tape recorder. Furthermore, some machine learning techniques have been tested to recognize the equalization of a tape from its background noise. The encouraging results open the way to innovative tools able to unburden audio technicians and musicologists from repetitive tasks and to improve the quality of their work.
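    The classification stage might look roughly like the following PyTorch sketch: a small CNN labelling video frames of the tape at the playback head with an alteration class. The architecture, input size, and class set are assumptions for illustration only, not the paper's model.

```python
# Illustrative sketch: a small CNN that classifies video frames of the tape
# passing the playback head into assumed alteration classes
# (e.g. splice, damage, marking, none).
import torch
import torch.nn as nn

class TapeFrameClassifier(nn.Module):
    def __init__(self, n_classes=4):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 16, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(16, 32, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.AdaptiveAvgPool2d(1),
        )
        self.classifier = nn.Linear(32, n_classes)

    def forward(self, frames):               # frames: (batch, 3, H, W)
        x = self.features(frames).flatten(1)
        return self.classifier(x)

frames = torch.randn(2, 3, 128, 128)         # two RGB frames of the tape head
logits = TapeFrameClassifier()(frames)
print(logits.shape)                          # torch.Size([2, 4])
```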

    Zero-Shot Blind Audio Bandwidth Extension

    Full text link
    Audio bandwidth extension involves the realistic reconstruction of high-frequency spectra from bandlimited observations. In cases where the lowpass degradation is unknown, such as in restoring historical audio recordings, this becomes a blind problem. This paper introduces a novel method called BABE (Blind Audio Bandwidth Extension) that addresses the blind problem in a zero-shot setting, leveraging the generative priors of a pre-trained unconditional diffusion model. During the inference process, BABE utilizes a generalized version of diffusion posterior sampling, where the degradation operator is unknown but parametrized and inferred iteratively. The performance of the proposed method is evaluated using objective and subjective metrics, and the results show that BABE surpasses state-of-the-art blind bandwidth extension baselines and achieves competitive performance compared to non-blind, filter-informed methods when tested with synthetic data. Moreover, BABE exhibits robust generalization capabilities when enhancing real historical recordings, effectively reconstructing the missing high-frequency content while maintaining coherence with the original recording. Subjective preference tests confirm that BABE significantly improves the audio quality of historical music recordings. Examples of historical recordings restored with the proposed method are available on the companion webpage: http://research.spa.aalto.fi/publications/papers/ieee-taslp-babe/. Comment: Submitted to IEEE/ACM Transactions on Audio, Speech and Language Processing.
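    Structurally, the blind inference loop can be sketched as below (PyTorch, with a placeholder in place of the pretrained unconditional diffusion model): each reverse step combines a prior/denoising update with a data-consistency gradient, while the unknown lowpass cutoff is inferred jointly from the same gradient. This is a structural skeleton under stated assumptions, not the BABE implementation.

```python
# Skeleton of joint posterior sampling and degradation inference. `denoiser`
# stands in for the pretrained diffusion model; the lowpass is a differentiable
# sigmoid mask over FFT bins so the cutoff can be updated by gradient descent.
import torch

def soft_lowpass(x, cutoff_frac, sharpness=80.0):
    spec = torch.fft.rfft(x)
    bins = torch.linspace(0.0, 1.0, spec.shape[-1])
    mask = torch.sigmoid(sharpness * (cutoff_frac - bins))
    return torch.fft.irfft(spec * mask, n=x.shape[-1])

def blind_posterior_sampling(y, denoiser, steps=50, lr_x=0.1, lr_c=0.01):
    x = torch.randn_like(y)                       # start from noise
    cutoff = torch.tensor(0.25, requires_grad=True)
    for t in range(steps, 0, -1):
        x = x.detach().requires_grad_(True)
        x0_hat = denoiser(x, t)                   # prior/denoising step (placeholder)
        residual = soft_lowpass(x0_hat, cutoff) - y
        loss = (residual ** 2).mean()             # data-consistency term
        gx, gc = torch.autograd.grad(loss, (x, cutoff))
        x = x0_hat - lr_x * gx                    # guide the sample towards consistency
        with torch.no_grad():
            cutoff -= lr_c * gc                   # infer the degradation parameter
    return x.detach(), float(cutoff)

# toy run with an identity "denoiser" just to exercise the loop
y = soft_lowpass(torch.randn(1024), torch.tensor(0.2)).detach()
x_hat, cutoff_hat = blind_posterior_sampling(y, denoiser=lambda x, t: x)
print(round(cutoff_hat, 3))
```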

    Gated Recurrent Attention and Multivariate Information Minimization for Controllable Speech Synthesis

    Get PDF
    Doctoral dissertation (Ph.D.) -- Graduate School of Seoul National University, College of Engineering, Department of Electrical and Computer Engineering, 2021.8. μ²œμ„±μ€€.
    Speech is one of the most useful interfaces, enabling a person to communicate with distant others while keeping the hands free for other tasks. With the growing use of speech interfaces in mobile devices, home appliances, and automobiles, research on human-machine speech interfaces is expanding. This thesis deals with speech synthesis, which enables machines to generate speech. With the application of deep learning technology, the quality of synthesized speech has become similar to that of human speech, but natural style control is still a challenging task. In this thesis, we propose novel techniques for expressing various styles such as prosody and emotion, and for controlling the style of synthesized speech factor by factor. First, the conventional style-control techniques proposed for speech synthesis systems are introduced: to control speaker identity, emotion, accent, and prosody, we cover control methods for both statistical parametric and deep learning-based speech synthesis systems. We then propose gated recurrent attention (GRA), a novel attention mechanism with controllable gated recurrence. GRA is suitable for learning various styles because it can control, with two gates, the recurrent attention state corresponding to each output location. In experiments, GRA was found to be more effective at transferring unseen styles, which implies that it generalizes better than conventional techniques. We further propose a multivariate information minimization method that disentangles three or more latent representations. We show that control factors can be disentangled by minimizing their interactive dependency, which can be expressed as a sum of mutual-information upper-bound terms. Since the upper-bound estimate converges from the early training stage, there is little performance degradation due to the auxiliary loss. The proposed technique is applied to train a text-to-speech synthesizer on multi-lingual, multi-speaker, and multi-style corpora. Subjective listening tests with 15 speech experts validate that the proposed method improves the synthesizer in terms of quality as well as controllability.
    Contents: 1 Introduction; 2 Style Modeling Techniques for Speech Synthesis; 3 Gated Recurrent Attention for Multi-Style Speech Synthesis; 4 A Controllable Multi-lingual Multi-speaker Multi-style Text-to-Speech Synthesis with Multivariate Information Minimization; 5 Conclusions.