218 research outputs found

    Development of Continuous Voice Morphing Using Separated Vocal Tract Area Functions, Glottal Source Waves, and Prosodic Features

    Grants-in-Aid for Scientific Research (KAKENHI) Research Results Report: Grant-in-Aid for Scientific Research (C), 2010-2012, Project Number 2250014

    Behavioural and neural insights into the recognition and motivational salience of familiar voice identities

    The majority of voices encountered in everyday life belong to people we know, such as close friends, relatives, or romantic partners. However, research to date has overlooked this type of familiarity when investigating voice identity perception. This thesis aimed to address this gap in the literature through a detailed investigation of voice perception across different types of familiarity: personally familiar voices, famous voices, and lab-trained voices. The experimental chapters of the thesis cover two broad research topics: 1) measuring the recognition and representation of personally familiar voice identities in comparison with lab-trained identities, and 2) investigating motivation and reward in relation to hearing personally valued voices compared with unfamiliar voice identities. For the first topic, the extent of human voice recognition capabilities was explored using the personally familiar voices of romantic partners. The perceptual benefits of personal familiarity for voice and speech perception were examined, along with how voice identity representations are formed through exposure to new voices. Representations of personally familiar voices proved highly robust in the face of perceptual challenges, greatly exceeding the robustness found for lab-trained voices of varying levels of familiarity. Conclusions are drawn about the relevance of the amount and type of exposure to speaker recognition, the expertise we have with certain voices, and the framing of familiarity as a continuum rather than a binary categorisation. The second topic utilised the voices of famous singers and their “super-fans” as listeners to probe reward and motivational responses to hearing these valued voices, using behavioural and neuroimaging experiments. In an effort-based decision-making task, listeners worked harder, as evidenced by faster reaction times, to hear their musical idol compared with less valued voices; the neural correlates of these effects are reported and examined.

    Voice source characterization for prosodic and spectral manipulation

    The objective of this dissertation is to study and develop techniques to decompose the speech signal into its two main components: voice source and vocal tract. Our main efforts are on glottal pulse analysis and characterization. We want to explore the utility of this model in different areas of speech processing, including speech synthesis, voice conversion, and emotion detection. Thus, we study different techniques for prosodic and spectral manipulation. One of our requirements is that the methods should be robust enough to work with the large databases typical of speech synthesis. We use a speech production model in which the glottal flow produced by the vibrating vocal folds passes through the vocal (and nasal) tract cavities and is radiated by the lips. Removing the effect of the vocal tract from the speech signal to obtain the glottal pulse is known as inverse filtering. We use a parametric model of the glottal pulse directly in the source-filter decomposition phase. To validate the accuracy of the parametrization algorithm, we designed a synthetic corpus using LF glottal parameters reported in the literature, complemented with our own results from the vowel database. The results show that our method performs satisfactorily across a wide range of glottal configurations and at different levels of SNR. In a perceptual evaluation, our method using the whitened residual compared favorably to the reference, achieving high quality ratings (Good-Excellent). Our fully parametrized system scored lower than the other two, ranking in third place, but still above the acceptance threshold (Fair-Good). Next, we proposed two methods for prosody modification, one for each of the residual representations above. The first used our full parametrization system and frame interpolation to perform the desired changes in pitch and duration. The second used resampling of the residual waveform and a frame selection technique to generate a new sequence of frames to be synthesized. Both methods were rated similarly (Fair-Good), and more work is needed to reach quality levels comparable to the reference methods. As part of this dissertation, we studied the application of our models in three areas: voice conversion, voice quality analysis, and emotion recognition. We included our speech production model in a reference voice conversion system to evaluate the impact of our parametrization on this task. Evaluators preferred our method over the original one, rating it higher on the MOS scale. To study voice quality, we recorded a small database of isolated, sustained Spanish vowels in four phonation types (modal, rough, creaky, and falsetto). Comparing the results with those reported in the literature, we found them to generally agree with previous findings; the differences that existed could be attributed to the difficulty of comparing voice qualities produced by different speakers. We also conducted experiments in voice quality identification, with very good results. Finally, we evaluated the performance of an automatic emotion classifier based on GMMs using glottal measures. For each emotion, we trained a specific model using different features, comparing our parametrization to a baseline system using spectral and prosodic characteristics. The test results were very satisfactory, showing a relative error reduction of more than 20% with respect to the baseline system. Detection accuracy for the individual emotions was also high, improving on previously reported results using the same database. Overall, we conclude that the glottal source parameters extracted with our algorithm have a positive impact on automatic emotion classification.
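    The LF model mentioned above can be sketched concretely. Below is a minimal numpy illustration of a Liljencrants-Fant (LF) flow-derivative pulse; the parameter values and the fixed open-phase growth rate are illustrative assumptions (a rigorous fit would solve the growth rate from the LF area-balance constraint), and this is not the dissertation's actual parametrization or inverse-filtering pipeline.

```python
import numpy as np

def lf_pulse(fs=16000, f0=120.0, tp=0.45, te=0.62, ta=0.028, Ee=1.0, alpha=5.0):
    """One period of the LF glottal flow derivative (simplified sketch).

    tp, te, ta are fractions of the period T0, and alpha is the open-phase
    growth rate in units of 1/T0. All values are illustrative; a rigorous
    LF fit would solve alpha from the model's area balance instead of
    fixing it.
    """
    T0 = 1.0 / f0
    t = np.arange(int(fs * T0)) / fs
    Tp, Te, Ta, Tc = tp * T0, te * T0, ta * T0, T0
    wg = np.pi / Tp                    # sinusoid phase: glottal flow peaks at t = Tp
    a = alpha / T0                     # growth rate in 1/s

    # Open phase: exponentially growing sinusoid, scaled so E(Te) = -Ee.
    E0 = -Ee / (np.exp(a * Te) * np.sin(wg * Te))
    open_phase = E0 * np.exp(a * t) * np.sin(wg * t)

    # Return phase: exponential recovery toward zero; eps satisfies
    # eps * Ta = 1 - exp(-eps * (Tc - Te)), solved by fixed-point iteration.
    eps = 1.0 / Ta
    for _ in range(30):
        eps = (1.0 - np.exp(-eps * (Tc - Te))) / Ta
    ret = -(Ee / (eps * Ta)) * (np.exp(-eps * (t - Te)) - np.exp(-eps * (Tc - Te)))

    return np.where(t <= Te, open_phase, ret)
```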

    Cross-linguistic exploration of phonemic representations

    All languages around the world have their own vast sound inventories. Understanding each other through verbal communication requires, first of all, understanding each other's phonemes. This often overlooked constraint is non-trivial already among native speakers of the same language, given the variability with which we all articulate our phonemes. It becomes even more challenging when interacting with non-native speakers, who have developed neural representations of different sets of phonemes. How can the brain make sense of such diversity? It is remarkable that the sounds produced by the vocal tract, which have evolved to serve as symbols in natural languages, fall almost neatly into two classes with very different characteristics: consonants and vowels. Consonants are complex in nature: beyond acoustically defined formant (resonant) frequencies, additional physical parameters are needed to identify them, such as formant transitions, the delay period in those transitions, energy bursts, and the vibrations of the vocal cords occurring before and during the consonant burst, together with the length of those vibrations. Surprisingly, consonants are very quickly categorized through a rather mysterious form of invariant feature extraction. In contrast, vowels can be represented in a simple and transparent manner because, amazingly, only two analog dimensions within a continuous space are essentially enough to characterize a vowel. The first dimension corresponds to the degree to which the vocal tract is open when producing the vowel, and the second is the location of the main occlusion. Surprisingly, these anatomically defined production modes match very precisely the first two acoustically defined formant frequencies, F1 and F2. While some languages require additional features to specify a vowel, such as its length or roundedness, whose nature may be more discrete, for many others F1 and F2 are all there is to it. In this thesis, we use both behavioral measures (phoneme confusion frequencies) and neural measures (the spatio-temporal distribution of phoneme-evoked neural activation) to study the cross-linguistic organization of phoneme perception. In Chapter 2, we study the perception of consonants by replicating and extending a classical study on the sub-phonemic features underlying perceptual differences between phonemes. Comparing the responses of native listeners with those of Italian, Turkish, Hebrew, and (Argentinian) Spanish listeners to a range of American English consonants, we look at the specific patterns of errors that speakers of different languages make, using the metric content index, which was previously used in entirely different contexts with either discrete representations, e.g. in face space, or continuous ones, e.g. of the spatial environment. Beyond the analysis of percent correct scores and transmitted information, we frame the problem in terms of 'place attractors', in analogy to those that have been well studied in spatial memory. Through our experimental paradigm, we try to access distinct attractors in different languages. In the same chapter, we provide auditory evoked potentials of some consonant-vowel syllables, which hint at transparent processing of the vowels regulated by the first two formants that characterize them, and accordingly we then turn to investigating vowel trajectories in the vowel manifold.
    We start our exploration of the vowel space in Chapter 3 by addressing a third dimension that is perceptually important for native Turkish speakers: rounding. Can native Turkish speakers better navigate vowel trajectories in which the second formant changes over a short time, reflecting rounding, compared to native Italian speakers, who are not required to make such fine discriminations along this dimension? We found no mother-tongue effects. We did find, however, that rounding in vowels can be represented with similar efficiency either by fine differences in an F2 peak frequency that is constant in time, or by inverting the temporal dynamics of a changing F2, which makes vowels not mere points in the space but continuous trajectories. We walk through phoneme trajectories every few tens of milliseconds, and it comes to us as naturally as walking around a room, if not more so. In analogy with spatial trajectories, in Chapter 4 we create equidistant continuous vowel trajectories on a vowel wheel positioned in the central region of the two-dimensional vowel space, where some languages, like Italian, have no standard vowel categories and others, like English, do. Is the central region in languages like Italian to be regarded as a flat, empty space with no attractors? Is there any reminiscence of their own phoneme memories? We ask whether this central region is flat, or can at least be flattened through extensive training. If so, would we find a neural substrate that modulates perception in the 2D vowel plane, similar to the grid cell representation involved in the spatial navigation of empty 2D arenas? Our results do not suggest a grid-like representation, but rather point to modulation of the neural signal by the position of the Italian vowels around the outer contour of the wheel. Therefore, in Chapter 5 we ask how our representation of the vowel space, not only in the central region but in the entirety of its linguistically relevant portion, is deformed by the presence of the standard categories of our vowel repertoire. We use 'belts', short stretches along which formant frequencies are varied quasi-continuously, to determine the local metric that best describes, for each language, the vowel manifold as a non-flat space constructed in our brain. As opposed to the 'consonant planes' constructed in Chapter 2, which appear to share a similar structure to a great extent, we find that the vowel plane is subjective and language dependent. In light of such language-specific transformations of the vowel plane, we wonder whether native bilinguals hold multiple maps simultaneously available and use one or the other to interpret linguistic sources depending on context, or whether they instead construct and use a fusion of the two original maps that allows them to efficiently discriminate the vowel contrasts that have to be discriminated in either language. The neural mechanisms underlying physical map switches, known as remapping, have been well studied in the rodent hippocampus; is vowel map alternation governed by similar principles? We show that the perceptual vowel maps of native Norwegian speakers, who are not bilingual but are fluent in English, are unique, probably sculpted by their long-term memory codes, and we leave the curious case of bilinguals for future studies.
    Overall, we attempt to investigate phoneme perception in a framework different from how it has been studied in the literature, where it has interested a large community for many years but has remained largely disconnected from the study of cortical computation. Our aim is to demonstrate that insights into persisting questions in the field may be reached from another well-explored part of cognition.
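    To make the "vowels as points in the F1-F2 plane" picture concrete, here is a minimal sketch of nearest-prototype vowel identification under a flat Euclidean metric. The prototype formant values are textbook approximations, not measurements from the thesis, and the flat metric is precisely the simplifying assumption that the 'belt' experiments above show to be language dependent.

```python
import numpy as np

# Illustrative (F1, F2) prototypes in Hz for five cardinal-like vowels;
# textbook approximations, not measurements from the thesis.
VOWELS = {
    "i": (280, 2250), "e": (400, 2000), "a": (700, 1300),
    "o": (450, 800),  "u": (300, 750),
}

def nearest_vowel(f1, f2):
    """Identify a vowel token as its nearest prototype in the formant plane,
    assuming a flat Euclidean metric (the simplification discussed above)."""
    dists = {v: np.hypot(f1 - p[0], f2 - p[1]) for v, p in VOWELS.items()}
    return min(dists, key=dists.get)

print(nearest_vowel(350, 2100))  # -> 'e' (closest prototype under this metric)
```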

    Vocal emotions on the brain: the role of acoustic parameters and musicality

    The human voice is a powerful transmitter of emotions. This dissertation addresses three main gaps in the field of vocal emotion perception. The first is the quantification of the relative contributions of fundamental frequency (F0) and timbre cues to the perception of different emotions, and their associated electrophysiological correlates. Using parameter-specific voice morphing, the results show that both F0 and timbre carry unique information that allows emotional inferences, although F0 appears to be relatively more important overall. The electrophysiological data revealed F0- and timbre-specific modulations in several ERP components, such as the P200 and the N400. Second, it was explored how musicality affects the processing of emotional voice cues, by reviewing the literature linking musicality to emotion perception and subsequently showing that musicians have an advantage in vocal emotion perception over non-musicians. The present data offer original insight into the special role of pitch cues: musicians outperformed non-musicians when emotions were expressed by the pitch contour only, but not when they were expressed by vocal timbre. Although the electrophysiological patterns were less conclusive, they imply that musicality may modulate brain responses to vocal emotions. Third, this work provides a critical reflection on parameter-specific voice morphing and its suitability for studying the processing of vocal emotions. Distortions in voice naturalness resulting from extreme acoustic manipulations were identified as one of the major threats to the ecological validity of stimulus material produced with this technique. However, the results suggested that while voice morphing does affect the perceived naturalness of the stimuli, behavioral measures of emotion perception were remarkably robust against these distortions. Thus, the present data advocate parameter-specific voice morphing as a valid tool for vocal emotion research.
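    As a toy illustration of the F0 cue whose contribution is quantified above, the sketch below estimates a per-frame F0 by autocorrelation. It is a generic estimator, not the parameter-specific voice morphing pipeline used to create the stimuli (the abstract does not name the underlying tool).

```python
import numpy as np

def estimate_f0(frame, fs, fmin=75.0, fmax=400.0):
    """Toy autocorrelation F0 estimator for one voiced frame.

    Assumes the frame holds at least two pitch periods; real morphing
    pipelines use far more careful F0 extraction and manipulation.
    """
    frame = frame - frame.mean()
    ac = np.correlate(frame, frame, mode="full")[len(frame) - 1:]  # lags >= 0
    lo, hi = int(fs / fmax), int(fs / fmin)   # plausible pitch-lag range
    lag = lo + np.argmax(ac[lo:hi])           # lag of strongest periodicity
    return fs / lag
```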

    Developing Sparse Representations for Anchor-Based Voice Conversion

    Voice conversion is the task of transforming speech from one speaker to sound as if it were produced by another speaker, changing the identity while retaining the linguistic content. There are many methods for performing voice conversion, but oftentimes these methods have onerous training requirements or fail in instances where one speaker has a nonnative accent. To address these issues, this dissertation presents and evaluates a novel “anchor-based” representation of speech that separates speaker content from speaker identity by modeling how speakers form English phonemes. We call the proposed method Sparse, Anchor-Based Representation of Speech (SABR), and explore methods for optimizing the parameters of this model in native-to-native and native-to-nonnative voice conversion contexts. We begin the dissertation by demonstrating how sparse coding in combination with a compact, phoneme-based dictionary can be used to separate speaker identity from content in objective and subjective tests. The formulation of the representation then presents several research questions. First, we propose a method for improving the synthesis quality by using the sparse coding residual in combination with a frequency warping algorithm to convert the residual from the source to the target speaker’s space, and add it to the target speaker’s estimated spectrum. Experimentally, we find that synthesis quality is significantly improved by this transform. Second, we propose and evaluate two methods for selecting and optimizing SABR anchors in native-to-native and native-to-nonnative voice conversion. We find that synthesis quality is significantly improved by the proposed methods, especially in native-to-nonnative voice conversion over baseline algorithms. In a detailed analysis of the algorithms, we find that they focus on phonemes that are difficult for nonnative speakers of English or that naturally have multiple acoustic states. Following this, we examine methods for adding temporal constraints to SABR via the Fused Lasso. The proposed method significantly reduces the inter-frame variance in the sparse codes over other methods that incorporate temporal features into sparse coding representations. Finally, in a case study, we examine the use of the SABR methods and optimizations in the context of a computer-aided pronunciation training system for building “Golden Speakers”, or ideal models for nonnative speakers of a second language to learn correct pronunciation. Under the hypothesis that the optimal “Golden Speaker” was the learner’s voice, synthesized with a native accent, we used SABR to build voice models for nonnative speakers and evaluated the resulting synthesis in terms of quality, identity, and accentedness. We found that even when deployed in the field, the SABR method generated synthesis with low accentedness and similar acoustic identity to the target speaker, validating the use of the method for building “golden speakers”.
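    The anchor-weight idea can be approximated by ordinary lasso-based sparse coding over a phoneme-anchor dictionary. The ISTA routine below is a generic stand-in under that assumption; SABR's actual constraints, dictionary construction, and Fused Lasso temporal coupling are not reproduced here.

```python
import numpy as np

def ista_sparse_code(x, D, lam=0.1, n_iter=200):
    """Solve  min_a 0.5*||x - D @ a||^2 + lam*||a||_1  with ISTA.

    x is a spectral feature vector and D's columns are phoneme anchors;
    the returned sparse weights a play the role of SABR's anchor weights
    (a generic sketch, not the dissertation's exact formulation).
    """
    L = np.linalg.norm(D, 2) ** 2          # Lipschitz constant of the gradient
    a = np.zeros(D.shape[1])
    for _ in range(n_iter):
        a = a - D.T @ (D @ a - x) / L      # gradient step on the squared error
        a = np.sign(a) * np.maximum(np.abs(a) - lam / L, 0.0)  # soft-threshold
    return a
```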

    A Parametric Sound Object Model for Sound Texture Synthesis

    This thesis deals with the analysis and synthesis of sound textures based on parametric sound objects. An overview is provided of the acoustic and perceptual principles of textural acoustic scenes, and technical challenges for analysis and synthesis are considered. Four essential processing steps for sound texture analysis are identified, and existing sound texture systems are reviewed using the four-step model as a guideline. A theoretical framework for analysis and synthesis is proposed. A parametric sound object synthesis (PSOS) model is introduced, which is able to describe individual recorded sounds through a fixed set of parameters. The model, which applies to harmonic and noisy sounds, is an extension of spectral modeling and uses spline curves to approximate spectral envelopes, as well as the evolution of parameters over time. In contrast to standard spectral modeling techniques, this representation uses the concept of objects instead of concatenated frames, and it provides a direct mapping between sounds of different length. Methods for automatic and manual conversion are shown. An evaluation is presented in which the ability of the model to encode a wide range of different sounds has been examined. Although there are aspects of sounds that the model cannot accurately capture, such as polyphony and certain types of fast modulation, the results indicate that high-quality synthesis can be achieved for many different acoustic phenomena, including instruments and animal vocalizations. In contrast to many other forms of sound encoding, the parametric model facilitates various techniques of machine learning and intelligent processing, including sound clustering and principal component analysis. Strengths and weaknesses of the proposed method are reviewed, and possibilities for future development are discussed.
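    As a rough sketch of spline-approximated spectral envelopes, the snippet below fits a SciPy smoothing spline to one frame's log-magnitude spectrum. The window, FFT, and smoothing settings are generic assumptions, not the PSOS model's actual knot or parameter scheme.

```python
import numpy as np
from scipy.interpolate import UnivariateSpline

def spline_envelope(frame, fs, smooth=40.0):
    # Log-magnitude spectrum of a Hann-windowed frame.
    spec = np.abs(np.fft.rfft(frame * np.hanning(len(frame))))
    logmag = 20.0 * np.log10(spec + 1e-9)
    freqs = np.fft.rfftfreq(len(frame), 1.0 / fs)
    # Cubic smoothing spline; 'smooth' trades envelope detail for smoothness.
    env = UnivariateSpline(freqs, logmag, s=smooth * len(freqs))
    return env(freqs)  # smooth envelope evaluated on the original frequency grid
```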

    An exploration of the rhythm of Malay

    In recent years there has been a surge of interest in speech rhythm. However, we still lack a clear understanding of the nature of rhythm and of rhythmic differences across languages. Various metrics have been proposed as means of measuring rhythm at the phonetic level and making typological comparisons between languages (Ramus et al., 1999; Grabe & Low, 2002; Dellwo, 2006), but debate is ongoing about the extent to which these metrics capture the rhythmic basis of speech (Arvaniti, 2009; Fletcher, in press). Furthermore, cross-linguistic studies of rhythm have covered a relatively small number of languages, and research on previously unclassified languages is necessary to fully develop the typology of rhythm. This study examines the rhythmic features of Malay, for which, to date, relatively little work has been carried out on aspects of rhythm and timing. The material for the analysis comprised 10 sentences produced by 20 speakers of standard Malay (10 males and 10 females). The recordings were first analysed using the rhythm metrics proposed by Ramus et al. (1999) and Grabe & Low (2002). These metrics (∆C, %V, rPVI, nPVI) are based on durational measurements of vocalic and consonantal intervals. The results indicated that Malay clusters with other so-called syllable-timed languages like French and Spanish on the basis of all metrics. However, underlying the overall findings for these metrics was a large degree of variability in values across speakers and sentences, with some speakers having values in the range typical of stress-timed languages like English. Further analysis was carried out in light of Fletcher’s (in press) argument that measurements based on duration do not wholly reflect speech rhythm, as many other factors can influence the values of consonantal and vocalic intervals, and Arvaniti’s (2009) suggestion that other features of speech should also be considered in descriptions of rhythm to discover what contributes to listeners’ perception of regularity. Spectrographic analysis of the Malay recordings brought to light two parameters that displayed consistency and regularity across all speakers and sentences: the duration of individual vowels and the duration of intervals between intensity minima. This poster presents the results of these investigations and points to connections between the features that seem to be consistently regulated in the timing of Malay connected speech and aspects of Malay phonology. The results are discussed in light of the current debate on descriptions of rhythm.
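    For reference, the four interval-based metrics named above can be computed directly from segmented durations. The sketch below follows the standard formulas from Ramus et al. (1999) and Grabe & Low (2002); segmentation of the recordings into vocalic and consonantal intervals is assumed to have been done already.

```python
import numpy as np

def rhythm_metrics(vocalic, consonantal):
    """Interval-based rhythm metrics from interval durations in seconds.

    Returns (%V, deltaC, rPVI, nPVI); deltaC and rPVI are often reported
    x100 or in milliseconds in the literature.
    """
    v, c = np.asarray(vocalic), np.asarray(consonantal)
    pct_v = 100.0 * v.sum() / (v.sum() + c.sum())    # %V: vocalic proportion
    delta_c = c.std()                                # deltaC: consonantal variability
    rpvi = np.abs(np.diff(c)).mean()                 # raw PVI over consonantal intervals
    npvi = 100.0 * np.mean(np.abs(np.diff(v)) / ((v[:-1] + v[1:]) / 2.0))  # nPVI, vocalic
    return pct_v, delta_c, rpvi, npvi
```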

    Secure Automatic Speaker Verification Systems

    The growing number of voice-enabled devices and applications makes automatic speaker verification (ASV) a fundamental component. However, maximum outreach for ASV in critical domains, e.g., financial services and health care, is not possible unless we overcome the security breaches caused by voice cloning and replayed audio, collectively known as spoofing attacks. Audio spoofing attacks on ASV systems strictly limit the usability of voice-enabled applications on the one hand; on the other, the counterfeiter remains untraceable. To overcome these vulnerabilities, a secure ASV (SASV) system is presented in this dissertation. The proposed SASV system is based on novel sign-modified acoustic local ternary pattern (sm-ALTP) features and an asymmetric bagging-based classifier ensemble. The proposed audio representation clusters the high- and low-frequency components in audio frames by normally distributing the frequency components against a convex function. Neighborhood statistics are then applied to capture user-specific vocal tract information. This information is utilized by the classifier ensemble, which is based on a weighted normalized voting rule, to detect various spoofing attacks. Unlike existing ASV systems, the proposed SASV system detects not only the conventional spoofing attacks (voice cloning and replays) but also new attacks that are still unexplored by the research community and a requirement for the future. In this regard, the concept of cloned replays is presented, in which replayed audio contains microphone characteristics as well as voice cloning artifacts, reflecting the scenario in which voice cloning is applied in real time. The voice cloning artifacts suppress the microphone characteristics and thereby defeat replay detection modules; conversely, the amalgamation of microphone characteristics deceives voice cloning detection. Furthermore, the proposed scheme can provide a possible clue about the counterfeiter through a voice cloning algorithm detection module, another novel concept proposed in this dissertation, which determines the voice cloning algorithm used to generate fake audio. Overall, the proposed SASV system simultaneously verifies bona fide speakers and detects the voice cloning attack, the cloning algorithm used to synthesize the cloned audio (in the defined settings), and voice replay attacks on the ASVspoof 2019 dataset. In addition, the proposed method detects voice replay and cloned voice replay attacks on the VSDC dataset. Rigorous experimentation against state-of-the-art approaches confirms the robustness of the proposed research.
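    As a rough illustration of the local ternary pattern idea behind the sm-ALTP features, here is a generic 1-D version over adjacent samples of a frame. The sign modification, frequency clustering, and convex-function weighting specific to sm-ALTP are not reproduced, since the abstract does not specify them.

```python
import numpy as np

def local_ternary_pattern(frame, t=0.1):
    """Generic 1-D local ternary pattern (not the dissertation's sm-ALTP).

    Each neighbor of a center sample is coded +1/0/-1 depending on whether
    it exceeds, stays within, or falls below a tolerance band of width t,
    and the ternary codes are split into two binary patterns.
    """
    center = frame[1:-1]
    nbrs = np.stack([frame[:-2], frame[2:]])       # left and right neighbors
    codes = np.where(nbrs > center + t, 1, np.where(nbrs < center - t, -1, 0))
    upper = (codes == 1).astype(int)               # binary pattern of +1 codes
    lower = (codes == -1).astype(int)              # binary pattern of -1 codes
    return upper, lower
```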