    Head motion synthesis: evaluation and a template motion approach

    The use of conversational agents has increased across the world. From providing automated support for companies to serving as virtual psychologists, they have moved from an academic curiosity to an application with real-world relevance. While many researchers have focused on the content of the dialogue and on synthetic speech to give the agents a voice, more recently animating these characters has become a topic of interest. An additional use for character animation technology is in the film and video game industries, where animating characters without expensive manual labour would save tremendous costs. When animating characters there are many aspects to consider, for example the way they walk. However, to truly assist with communication, automated animation needs to reproduce the body language used when speaking. In particular, conversational agents are often rendered as only the upper body, so head motion is one of the keys to a believable agent. While certain linguistic functions of head motion are obvious, such as nodding to indicate agreement, research has shown that head motion also aids the understanding of speech. Additionally, head motion often carries emotional cues, prosodic information, and other paralinguistic information. In this thesis we present our research into synthesising head motion using only recorded speech as input. During this research we collected a large dataset of head motion synchronised with speech, examined evaluation methodology, and developed a synthesis system. Our dataset is one of the larger ones available, and from it we present some statistics about head motion in general, including differences between read speech and storytelling speech, and differences between speakers. From this we are able to draw some conclusions as to what type of source data will be the most interesting for head motion research, and whether speaker-dependent models are needed for synthesis.
In our examination of head motion evaluation methodology we introduce Forced Canonical Correlation Analysis (FCCA). FCCA distinguishes head-motion-shaped noise from motion capture better than the standard objective evaluation methods used in the literature. For subjective testing, we show that best practice is a variation of MUltiple Stimuli with Hidden Reference and Anchor (MUSHRA) testing, adapted for head motion. Through experimentation we have developed guidelines for implementing the test and for the constraints on its length. Finally, we present a new system for head motion synthesis. We make use of simple templates of motion, automatically extracted from source data, that are warped to suit the speech features. Our system uses clustering to pick the small motion units, and a combined HMM- and GMM-based approach to determine the values of the warping parameters at synthesis time. This results in highly natural-looking motion that outperforms other state-of-the-art systems. Our system requires minimal human intervention and produces believable motion. The key innovations are the new method for segmenting head motion and a language-modelling-like process for synthesising head motion.
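FCCA itself is the thesis's contribution and is not specified here; as background, its building block, standard canonical correlation analysis between speech features and head motion, can be sketched with numpy. The feature matrices below are synthetic stand-ins, not real speech or motion-capture data:

```python
import numpy as np

def cca_first_correlation(X, Y, reg=1e-6):
    """First canonical correlation between feature matrices X (n, dx)
    and Y (n, dy). A small ridge term keeps the covariances invertible."""
    X = X - X.mean(axis=0)
    Y = Y - Y.mean(axis=0)
    n = X.shape[0]
    Cxx = X.T @ X / n + reg * np.eye(X.shape[1])
    Cyy = Y.T @ Y / n + reg * np.eye(Y.shape[1])
    Cxy = X.T @ Y / n
    # Whiten both sides; the singular values of the whitened
    # cross-covariance are the canonical correlations.
    Lx = np.linalg.cholesky(Cxx)
    Ly = np.linalg.cholesky(Cyy)
    M = np.linalg.solve(Lx, Cxy) @ np.linalg.inv(Ly).T
    return np.linalg.svd(M, compute_uv=False)[0]

rng = np.random.default_rng(0)
speech = rng.normal(size=(500, 4))   # hypothetical per-frame speech features
motion = speech @ rng.normal(size=(4, 3)) + 0.1 * rng.normal(size=(500, 3))
print(cca_first_correlation(speech, motion))  # close to 1 for correlated data
```

An objective evaluation along these lines would compare the correlation achieved by synthesised motion against that of held-out motion capture; the "forcing" in FCCA is the thesis's refinement of this basic measure.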

    Mapping Techniques for Voice Conversion

    Speaker identity plays an important role in human communication. In addition to the linguistic content, speech utterances contain acoustic information about the speaker's characteristics. This thesis focuses on voice conversion, a technique that aims at changing the voice of one speaker (the source speaker) into the voice of another specific speaker (the target speaker) without changing the linguistic information. The relationship between the source and target speaker characteristics is learned from the training data. Voice conversion can be used in various applications and fields: text-to-speech systems, dubbing, speech-to-speech translation, games, voice restoration, voice pathology, etc. Voice conversion poses many challenges: which features to extract from speech, how to find linguistic correspondences (alignment) between source and target features, which machine learning techniques to use for creating a mapping function between the features of the two speakers, and finally, how to make the desired modifications to the speech waveform. The features can be any parameters that describe the speech and the speaker identity, e.g. spectral envelope, excitation, fundamental frequency, and phone durations. The main focus of the thesis is on the design of suitable mapping techniques between frame-level source and target features, but aspects related to parallel data alignment and prosody conversion are also addressed. The perception of the quality and the success of the identity conversion are largely subjective. Conventional statistical techniques are able to produce good similarity between the original and the converted target voices, but the quality is usually degraded. The objective of this thesis is to design conversion techniques that enable successful identity conversion while maintaining the original speech quality. Because of the limited amount of data, statistical techniques are usually used to estimate the mapping function.
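The alignment step for parallel data is commonly performed with dynamic time warping (DTW); the abstract does not name a specific algorithm, so the following is a generic sketch rather than the thesis's implementation, using toy one-dimensional feature sequences:

```python
import numpy as np

def dtw_align(src, tgt):
    """Align two feature sequences (frames x dims) with dynamic time
    warping; returns the list of matched (source, target) frame pairs."""
    n, m = len(src), len(tgt)
    # Pairwise Euclidean distances between all frame pairs.
    dist = np.linalg.norm(src[:, None, :] - tgt[None, :, :], axis=2)
    cost = np.full((n + 1, m + 1), np.inf)
    cost[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost[i, j] = dist[i - 1, j - 1] + min(
                cost[i - 1, j], cost[i, j - 1], cost[i - 1, j - 1])
    # Backtrack from the end to recover the warping path.
    path, i, j = [], n, m
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        step = np.argmin([cost[i - 1, j - 1], cost[i - 1, j], cost[i, j - 1]])
        if step == 0:
            i, j = i - 1, j - 1
        elif step == 1:
            i -= 1
        else:
            j -= 1
    return path[::-1]

src = np.array([[0.0], [1.0], [2.0], [3.0]])
tgt = np.array([[0.0], [0.0], [1.0], [2.0], [3.0]])  # same content, slower start
print(dtw_align(src, tgt))
```

Once such a path is available, the matched source and target frames can be fed to any of the mapping techniques the thesis studies.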
The most popular technique is based on a Gaussian mixture model (GMM). However, conventional GMM-based conversion suffers from many problems that result in degraded speech quality. These problems are analyzed in this thesis, and a technique that combines GMM-based conversion with partial least squares regression is introduced to alleviate them. Additionally, approaches are proposed to solve the time-independent mapping problem associated with many algorithms. The most significant contribution of the thesis is a novel dynamic kernel partial least squares regression technique that allows a non-linear mapping function and improves temporal correlation. The technique is straightforward, efficient, and requires very little tuning. It is shown to outperform the state-of-the-art GMM-based technique in both subjective and objective tests over a variety of speaker pairs. In addition, quality is further improved when aperiodicity and binary voicing values are predicted using the same technique. The vast majority of existing voice conversion algorithms concern the transformation of spectral envelopes. However, prosodic features, such as fundamental frequency movements and speaking rhythm, also carry important cues of identity. It is shown in the thesis that prosody alone can be used, to some extent, to recognize speakers who are familiar to the listeners. Furthermore, a prosody conversion technique is proposed that transforms fundamental frequency contours and durations at the syllable level. The technique is shown to improve similarity to the target speaker's prosody and to reduce roboticness compared to a conventional frame-based conversion technique. Recently, the trend has shifted from text-dependent to text-independent use cases, meaning that no parallel data is available. The techniques proposed in the thesis currently assume parallel data, i.e. that the same texts have been spoken by both speakers.
However, excluding the prosody conversion algorithm, the proposed techniques require no phonetic information and are applicable to small amounts of training data. Moreover, many text-independent approaches are based on extracting some form of alignment as a pre-processing step, so the techniques proposed in the thesis can be exploited after the alignment stage.

    Proceedings of the Fifth Italian Conference on Computational Linguistics CLiC-it 2018 : 10-12 December 2018, Torino

    On behalf of the Program Committee, a very warm welcome to the Fifth Italian Conference on Computational Linguistics (CLiC-it 2018). This edition of the conference is held in Torino. The conference is locally organised by the University of Torino and hosted in its prestigious main lecture hall “Cavallerizza Reale”. The CLiC-it conference series is an initiative of the Italian Association for Computational Linguistics (AILC) which, after five years of activity, has clearly established itself as the premier national forum for research and development in the fields of Computational Linguistics and Natural Language Processing, where leading researchers and practitioners from academia and industry meet to share their research results, experiences, and challenges.

    Annual Report
