387 research outputs found

    Relation between acoustic and articulatory dimensions of speech sounds

    In their daily communication, speakers produce speech by pushing a controlled air stream past their vocal folds and through a vocal tract configuration formed by a set of articulators, which ultimately results in a certain acoustic output. In this sense, speech and, specifically, speech sounds can be understood as a relation between articulatory and acoustic dimensions. This idea is supported by more recent neuroimaging results which suggest that sensory representations of speech sounds are stored across auditory and somatosensory cortices and are characterized by neural auditory-somatosensory mappings. The overall aim of the current dissertation is to improve our understanding of the functional nature of this relation. To this end, this thesis investigates the influence of a stronger linguo-palatal contact on speakers’ ability to employ multiple concurrent compensatory strategies during production of vowels and fricatives. During the data analysis, speakers’ individual as well as average compensatory behavior is investigated by means of generalized additive mixed models (GAMM) and a supervised classification algorithm (random forest). A framework is then developed that makes it possible to estimate the extent of spectral adaptations in vowels and fricatives and to draw a direct comparison between these sounds. The experimental results are discussed in the context of current speech production theories and agree with the overall idea that speech sounds are perceptuo-motor units comprising articulatory movements which are shaped by perceptual properties and selected for their functional value for communication.

    Concatenative speech synthesis: a Framework for Reducing Perceived Distortion when using the TD-PSOLA Algorithm

    This thesis presents the design and evaluation of an approach to concatenative speech synthesis using the Time-Domain Pitch-Synchronous OverLap-Add (TD-PSOLA) signal processing algorithm. Concatenative synthesis systems make use of pre-recorded speech segments stored in a speech corpus. At synthesis time, the 'best' segments available to synthesise the new utterances are chosen from the corpus using a process known as unit selection. During the synthesis process, the pitch and duration of these segments may be modified to generate the desired prosody. The TD-PSOLA algorithm provides an efficient and essentially successful solution to perform these modifications, although some perceptible distortion, in the form of 'buzziness', may be introduced into the speech signal. Despite the popularity of the TD-PSOLA algorithm, little formal research has been undertaken to address this recognised problem of distortion. The approach in the thesis has been developed towards reducing the perceived distortion that is introduced when TD-PSOLA is applied to speech. To investigate the occurrence of this distortion, a psychoacoustic evaluation of the effect of pitch modification using the TD-PSOLA algorithm is presented. Subjective experiments in the form of a set of listening tests were undertaken using word-level stimuli that had been manipulated using TD-PSOLA. The data collected from these experiments were analysed for patterns of co-occurrence or correlations to investigate where this distortion may occur. From this, parameters were identified which may have contributed to increased distortion. These parameters were concerned with the relationship between the spectral content of individual phonemes, the extent of pitch manipulation, and aspects of the original recordings. Based on these results, a framework was designed for use in conjunction with TD-PSOLA to minimise the possible causes of distortion. 
The framework consisted of a novel speech corpus design, a signal processing distortion measure, and a selection process for especially problematic phonemes. Rather than phonetically balanced, the corpus is balanced to the needs of the signal processing algorithm, containing more of the adversely affected phonemes. The aim is to reduce the potential extent of pitch modification of such segments, and hence produce synthetic speech with less perceptible distortion. The signal processing distortion measure was developed to allow the prediction of perceptible distortion in pitch-modified speech. Different weightings were estimated for individual phonemes, trained using the experimental data collected during the listening tests. The potential benefit of such a measure for existing unit selection processes in a corpus-based system using TD-PSOLA is illustrated. Finally, the special-case selection process was developed for highly problematic voiced fricative phonemes to minimise the occurrence of perceived distortion in these segments. The success of the framework, in terms of generating synthetic speech with reduced distortion, was evaluated. A listening test showed that the TD-PSOLA balanced speech corpus may be capable of generating pitch-modified synthetic sentences with significantly less distortion than those generated using a typical phonetically balanced corpus. The voiced fricative selection process was also shown to produce pitch-modified versions of these phonemes with less perceived distortion than a standard selection process. The listening test then indicated that the signal processing distortion measure was able to predict the resulting amount of distortion at the sentence-level after the application of TD-PSOLA, suggesting that it may be beneficial to include such a measure in existing unit selection processes. 
The framework was found to be capable of producing speech with reduced perceptible distortion in certain situations, although the effects seen at the sentence-level were less than those seen in the previous investigative experiments that made use of word-level stimuli. This suggests that the effect of the TD-PSOLA algorithm cannot always be easily anticipated due to the highly dynamic nature of speech, and that the reduction of perceptible distortion in TD-PSOLA-modified speech remains a challenge to the speech community.

    It’s All About Context: Investigating the Effects of Consonant and Vowel Environment on Vowel-Evoked Envelope Following Responses

    The envelope following response (EFR) has proven useful for studying brainstem speech processing. Previous work, however, demonstrates that its amplitude varies across stimuli. This thesis investigates whether this variation is attributable to the consonant or vowel context of the stimulus, or some interaction of the two. Experiment 1 evoked EFRs in 30 participants using seven English vowels embedded in four CVC environments. A strong effect of vowel and a minor effect of consonant on EFR amplitude were found. In Experiment 2, 64 listeners heard four different tokens of one of four possible English vowels (16 participants/vowel), embedded in the same CVC environments as before. A significant three-way interaction between vowel, vowel trial, and consonant was found, indicating that the EFR is highly sensitive to subtle acoustic differences in stimuli. To effectively utilize the EFR in research, future studies should carefully explore the mechanisms driving these complex context effects.

    Neural Attractors and Phonological Grammar

    This volume collects three articles which constitute the bulk of my PhD research. The overarching theme of the volume is the role of attractors, a concept from dynamical systems theory, in the neural realization of phonological grammar. The motivation for this line of inquiry begins with the claim that the study of language should provide some insight into the workings of the human mind/brain. Indeed, this is one of the few mantras shared by linguists of the seemingly irreconcilable “Generative” and “Cognitive” schools (e.g. Chomsky 2002; Lakoff 1988). Given this apparent consensus, it is perhaps surprising that no breakthrough in our understanding of the brain can yet be attributed to some insight from the study of language. An analysis and critique of this state of affairs is given by Poeppel & Embick (2005), who identify (amongst other things) that we currently have no way of relating the ontologies of linguistics and neuroscience. This Ontological Incommensurability Problem (OIP) can be resolved, they argue, by the use of a Linking Hypothesis, which spells out linguistic computations at the relevant level of algorithmic abstraction, such that the neuroscientist need only find the exact implementations of those algorithms in the brain. If such a hypothesis were sufficiently complete then it could, in principle, predict the kinds of neural configurations required for natural language processing, using linguistic theories as its starting point. In this way, we could finally realize the long sought-after goal of cashing in theories of language for understanding of the human brain. Simultaneously, a Linking Hypothesis also has the potential to unearth lower-level explanations for linguistic phenomena, for example where those explanations might depend on purely neurobiological notions (e.g. neuronal morphology, synaptic density, metabolic efficiency, etc.).

    Comparing malleability of phonetic category between [i] and [u]

    This study reports a differential category retuning effect between [i] and [u]. Two groups of American listeners were exposed to ambiguous vowels ([i/u]) within words that index the phoneme /i/ (e.g., athl[i/u]t) (i-group) or /u/ (e.g., aftern[i/u]n) (u-group). Before and after the exposure, these listeners categorized sounds from a [bip]-[bup] continuum. The i-group significantly increased /bip/ responses after exposure, but the u-group did not change their responses significantly. These results suggest that the way mental representations handle phonetic variation may influence the malleability of each category, highlighting the complex relationship among the distribution of sounds, their mental representation, and speech perception.

    Tones in Zhangzhou: Pitch and Beyond

    This study draws on various approaches (field linguistics, auditory and acoustic phonetics, and statistics) to explore and explain the nature of tones in Zhangzhou, an under-described Southern Min variety. Several original findings emerged from the analyses of the data from 21 speakers. The realisations of Zhangzhou tones are multidimensional. The single parameter of pitch/F0 is not sufficient to characterise tonal contrasts in either monosyllabic or polysyllabic settings in Zhangzhou. Instead, various parameters, including pitch/F0, duration, vowel quality, voice quality, and syllable coda type, interact in a complicated but consistent way to code tonal distinctions. Zhangzhou has eight tones rather than seven tones as proposed in previous studies. This finding resulted from examining the realisations of diverse parameters across three different contexts (isolation, phrase-initial, and phrase-final), rather than classifying tones in citation and in terms of the preservation of Middle Chinese tonal categories. Tonal contrasts in Zhangzhou can be neutralised across different linguistic contexts. Identifying the number of tonal contrasts based simply on tonal realisations in the citation environment is not sufficient. Instead, examining tonal realisations across different linguistic contexts beyond monosyllables is imperative for understanding the nature of tone. Tone sandhi in Zhangzhou is syntactically relevant. The tone sandhi domain is not phonologically determined but rather is aligned with a syntactic phrase XP. Within a given XP, the realisations of the tones at non-phrase-final positions undergo alternation phonologically and phonetically. Nevertheless, the alternations are sensitive only to the phrase boundaries and are not affected by the internal structure of syntactic phrases. Tone sandhi in Zhangzhou is phonologically inert but phonetically sensitive. 
The realisations of Zhangzhou tones in disyllabic phrases are not categorically affected by their surrounding tones but are phonetically sensitive to surrounding environments. For instance, the pitch/F0 onsets of phrase-final tones are largely sensitive to pitch/F0 offsets of preceding tones and appear to have diverse variants. The mappings between Zhangzhou citation and disyllabic tones are morphologically conditioned. Phrase-initial tones are largely not related to the citation tones at either the phonological or the phonetic level, while phrase-final tones are categorically related to the citation tones but phonetically are not quite the same because of predictable sensitivity to surrounding environments. Each tone in Zhangzhou can be regarded as a single morpheme having two alternating allomorphs (tonemes), one for non-phrase-final variants and one for variants in citation and phrase-final contexts, both of which are listed in the mental lexicon of native Zhangzhou speakers but are phonetically distant on the surface. In summary, the realisations of Zhangzhou tones are multidimensional, involving a variety of segmental and suprasegmental parameters. The interactions of Zhangzhou tones are complicated, involving phonetics, phonology, syntax, and morphology. Neutralisation of Zhangzhou tonal contrasts occurs across different contexts, including citation, phrase-final, and non-phrase-final. Thus, researchers must go beyond pitch to understand tone thoroughly as a phenomenon in Southern Min.