
    What information is necessary for speech categorization? Harnessing variability in the speech signal by integrating cues computed relative to expectations

    This is the author's accepted manuscript. This article may not exactly replicate the final version published in the APA journal and is not the copy of record. The original publication is available at http://psycnet.apa.org/index.cfm?fa=search.displayrecord&uid=2011-05323-001.
    Most theories of categorization emphasize how continuous perceptual information is mapped to categories. However, equally important are the informational assumptions of a model, that is, the type of information subserving this mapping. This is crucial in speech perception, where the signal is variable and context dependent. This study assessed the informational assumptions of several models of speech categorization, in particular, the number of cues that are the basis of categorization and whether these cues represent the input veridically or have undergone compensation. We collected a corpus of 2,880 fricative productions (Jongman, Wayland, & Wong, 2000) spanning many talker and vowel contexts and measured 24 cues for each. A subset was also presented to listeners in an 8AFC phoneme categorization task. We then trained a common classification model based on logistic regression to categorize the fricative from the cue values and manipulated the information in the training set to contrast (a) models based on a small number of invariant cues, (b) models using all cues without compensation, and (c) models in which cues underwent compensation for contextual factors. Compensation was modeled by computing cues relative to expectations (C-CuRE), a new approach to compensation that preserves fine-grained detail in the signal. Only the compensation model achieved accuracy similar to that of listeners and showed the same effects of context. Thus, even simple categorization metrics can overcome the variability in speech when sufficient information is available and compensation schemes like C-CuRE are employed.
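
    A minimal sketch of the C-CuRE idea described in this abstract: each acoustic cue is re-expressed as a residual after regressing out context-based expectations (here, talker and vowel identity), and the residualized cues feed a multinomial logistic-regression classifier over the eight fricative categories. The data layout (a pandas DataFrame with 'talker', 'vowel', and 'fricative' columns plus one column per cue) and the scikit-learn implementation are assumptions for illustration, not the authors' code.

        # Hedged sketch of C-CuRE-style compensation followed by logistic-regression
        # categorization, as described in the abstract. The data layout (a pandas
        # DataFrame with 'talker', 'vowel', and 'fricative' columns plus one column
        # per acoustic cue) is an assumption for illustration, not the authors' code.
        import pandas as pd
        from sklearn.linear_model import LinearRegression, LogisticRegression

        def c_cure_residuals(df, cue_cols, context_cols=("talker", "vowel")):
            """Replace each raw cue with its residual after regressing out
            context-based expectations (talker and vowel identity)."""
            X_context = pd.get_dummies(df[list(context_cols)].astype(str))
            out = df.copy()
            for cue in cue_cols:
                expectation = LinearRegression().fit(X_context, df[cue])
                out[cue] = df[cue] - expectation.predict(X_context)  # cue relative to expectation
            return out

        def fit_fricative_classifier(df, cue_cols, label_col="fricative"):
            """Multinomial logistic regression over the eight fricative categories."""
            clf = LogisticRegression(max_iter=1000)
            clf.fit(df[cue_cols], df[label_col])
            return clf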

    Acoustic-phonetic and auditory mechanisms of adaptation in the perception of sibilant fricatives

    Listeners are highly proficient at adapting to contextual variation when perceiving speech. In the present study, we examined the effects of brief speech and nonspeech contexts on the perception of sibilant fricatives. We explored three theoretically motivated accounts of contextual adaptation, based on phonetic cue calibration, phonetic cue covariation, and auditory contrast. Under the cue calibration account, listeners adapt by estimating a talker-specific average for each phonetic cue or dimension; under the cue covariation account, listeners adapt by exploiting consistencies in how the realization of speech sounds varies across talkers; under the auditory contrast account, adaptation results from (partial) masking of spectral components that are shared by adjacent stimuli. The spectral center of gravity, a phonetic cue to fricative identity, was manipulated for several types of context sound: /z/-initial syllables, /v/-initial syllables, and white noise matched in long-term average spectrum (LTAS) to the /z/-initial stimuli. Listeners’ perception of the /s/–/ʃ/ contrast was significantly influenced by /z/-initial syllables and LTAS-matched white noise stimuli, but not by /v/-initial syllables. No significant difference in adaptation was observed between exposure to /z/-initial syllables and matched white noise stimuli, and speech did not have a considerable advantage over noise when the two were presented consecutively within a context. The pattern of findings is most consistent with the auditory contrast account of short-term perceptual adaptation. The cue covariation account makes accurate predictions for speech contexts, but not for nonspeech contexts or for the absence of a speech-versus-nonspeech difference.
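
    The manipulated cue, spectral center of gravity, is the power-weighted mean frequency of a sound's spectrum. The snippet below is an illustrative way to compute it for a mono NumPy signal; it is not the authors' stimulus-construction or analysis pipeline.

        # Hedged sketch: spectral center of gravity (CoG) of a windowed signal,
        # i.e. the power-weighted mean frequency -- the cue manipulated in the study.
        # Illustrative computation only, not the authors' stimulus pipeline.
        import numpy as np

        def spectral_center_of_gravity(signal, sample_rate):
            spectrum = np.fft.rfft(signal * np.hanning(len(signal)))
            power = np.abs(spectrum) ** 2
            freqs = np.fft.rfftfreq(len(signal), d=1.0 / sample_rate)
            return np.sum(freqs * power) / np.sum(power)  # in Hz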

    Articulatory and Acoustic Characteristics of German Fricative Clusters

    Background: We investigate the articulatory-acoustic relationship in German fricative sequences. We pursue the possibility that /f/#sibilant and /s#ʃ/ sequences are in principle subject to articulatory overlap in a similar fashion, yet because independent articulators are involved, there is a significant difference in the acoustic consequences. We also investigate the role of vowel context and stress. Methods: We recorded electropalatographic and acoustic data from 9 native speakers of German. Results: Results are compatible with the hypothesis that the temporal organization of fricative clusters is globally independent of cluster type, with differences between clusters appearing mainly in degree. Articulatory overlap may be obscured acoustically by a labiodental constriction, similarly to what has been reported for stops. Conclusion: Our data suggest that similar principles of articulatory coordination underlie German fricative clusters independently of their segmental composition. The general auditory-acoustic patterning of the fricative sequences can be predicted by taking into account that the aerodynamic-acoustic consequences of gestural overlap may vary as a function of the articulators involved. We discuss possible sources for differences in degrees of overlap and place our results in the context of previously reported asymmetries among the fricatives in regressive place assimilation. (C) 2016 S. Karger AG, Basel.
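
    One common way to quantify the degree of articulatory overlap discussed here is the proportion of the first consonant's gestural interval that is shared with the second. The helper below is a hedged sketch under that assumption (intervals given as onset/offset times in seconds); the paper's EPG-based measures may be defined differently.

        # Hedged sketch: articulatory overlap expressed as the proportion of the
        # first consonant's gestural interval that is shared with the second.
        # Onset/offset times (in seconds) are assumed inputs; the paper's
        # EPG-based measures may be defined differently.
        def overlap_proportion(c1_onset, c1_offset, c2_onset, c2_offset):
            shared = max(0.0, min(c1_offset, c2_offset) - max(c1_onset, c2_onset))
            return shared / (c1_offset - c1_onset)

        # overlap_proportion(0.00, 0.12, 0.09, 0.22) -> 0.25 (25% of C1 overlapped by C2)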

    Contingent categorization in speech perception

    This is an Accepted Manuscript of an article published by Taylor & Francis in Language, Cognition and Neuroscience in 2014, available online: http://www.tandfonline.com/10.1080/01690965.2013.824995.
    The speech signal is notoriously variable, with the same phoneme realized differently depending on factors like talker and phonetic context. Variance in the speech signal has led to a proliferation of theories of how listeners recognize speech. A promising approach, supported by computational modeling studies, is contingent categorization, wherein incoming acoustic cues are computed relative to expectations. We tested contingent encoding empirically. Listeners were asked to categorize fricatives in CV syllables constructed by splicing the fricative from one CV syllable with the vowel from another CV syllable. The two spliced syllables always contained the same fricative, providing consistent bottom-up cues; however, on some trials the vowel and/or talker mismatched between these syllables, giving conflicting contextual information. Listeners were less accurate and slower at identifying the fricatives in mismatching splices. This suggests that listeners rely on context information beyond bottom-up acoustic cues during speech perception, providing support for contingent categorization.
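
    The cross-spliced stimuli described above can be sketched as joining the fricative portion of one CV recording to the vowel portion of another. In the sketch below the splice points are assumed to be known (e.g., from hand annotation) and a short linear cross-fade is applied at the joint; this is an illustration, not the authors' stimulus-preparation code.

        # Hedged sketch of the cross-splicing described above: the fricative portion
        # of one CV recording is joined to the vowel portion of another, with a short
        # linear cross-fade at the joint. Splice points (in samples) are assumed to be
        # known; this is an illustration, not the authors' stimulus-preparation code.
        import numpy as np

        def cross_splice(fricative_syllable, vowel_syllable, splice_a, splice_b, fade=32):
            """Join fricative_syllable[:splice_a] to vowel_syllable[splice_b:]."""
            head = fricative_syllable[:splice_a].astype(float)
            tail = vowel_syllable[splice_b:].astype(float)
            ramp = np.linspace(1.0, 0.0, fade)
            head[-fade:] *= ramp           # fade out the end of the fricative portion
            tail[:fade] *= ramp[::-1]      # fade in the start of the vowel portion
            head[-fade:] += tail[:fade]    # overlap-add across the cross-fade region
            return np.concatenate([head, tail[fade:]])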

    Chipping Away at the Perception/Production Interface


    Lexically-guided perceptual learning in speech processing

    During listening to spoken language, the perceptual system needs to adapt frequently to changes in talkers, and thus to considerable interindividual variability in the articulation of a given speech sound. This thesis investigated a learning process which allows listeners to use stored lexical representations to modify the interpretation of a speech sound when a talker's articulation of that sound is consistently unclear or ambiguous. The questions that were addressed in this research concerned the robustness of such perceptual learning, a potential role for sleep, and whether learning is specific to the speech of one talker or, alternatively, generalises to other talkers. A further study aimed to identify the underlying functional neuroanatomy by using magnetic resonance imaging methods. The picture that emerged for lexically-guided perceptual learning is that learning occurs very rapidly, is highly specific, and remains remarkably robust both over time and under exposure to speech from other talkers.

    Speaker Normalization Using Cortical Strip Maps: A Neural Model for Steady State Vowel Categorization

    Auditory signals of speech are speaker-dependent, but representations of language meaning are speaker-independent. The transformation from speaker-dependent to speaker-independent language representations enables speech to be learned and understood from different speakers. A neural model is presented that performs speaker normalization to generate a pitch-independent representation of speech sounds, while also preserving information about speaker identity. This speaker-invariant representation is categorized into unitized speech items, which input to sequential working memories whose distributed patterns can be categorized, or chunked, into syllable and word representations. The proposed model fits into an emerging model of auditory streaming and speech categorization. The auditory streaming and speaker normalization parts of the model both use multiple strip representations and asymmetric competitive circuits, thereby suggesting that these two circuits arose from similar neural designs. The normalized speech items are rapidly categorized and stably remembered by Adaptive Resonance Theory circuits. Simulations use synthesized steady-state vowels from the Peterson and Barney [J. Acoust. Soc. Am. 24, 175-184 (1952)] vowel database and achieve accuracy rates similar to those achieved by human listeners. These results are compared to behavioral data and other speaker normalization models. National Science Foundation (SBE-0354378); Office of Naval Research (N00014-01-1-0624).
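
    As a point of comparison only, the task the model addresses (speaker-normalized categorization of steady-state vowels) can be illustrated with a deliberately simple stand-in: log formant values are re-expressed relative to the speaker's mean log formants, and tokens are assigned to the nearest class centroid. This is not the strip-map/ART model summarized above; the function names and data layout are assumptions for illustration.

        # Hedged illustration of the task only (speaker-normalized steady-state vowel
        # categorization), using a deliberately simple stand-in: log formants are
        # re-expressed relative to the speaker's mean log formants, and tokens are
        # assigned to the nearest class centroid. This is NOT the strip-map/ART model
        # from the abstract; names and data layout are assumptions.
        import numpy as np

        def normalize_formants(formants_hz):
            """formants_hz: array of shape (n_tokens, n_formants) for one speaker."""
            log_f = np.log(formants_hz)
            return log_f - log_f.mean(axis=0, keepdims=True)  # speaker-relative values

        def nearest_centroid_label(token, centroids):
            """centroids: dict mapping vowel label -> normalized formant vector."""
            return min(centroids, key=lambda v: np.linalg.norm(token - centroids[v]))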

    Audio-to-Visual Speech Conversion using Deep Neural Networks

    We study the problem of mapping from acoustic to visual speech with the goal of generating accurate, perceptually natural speech animation automatically from an audio speech signal. We present a sliding window deep neural network that learns a mapping from a window of acoustic features to a window of visual features from a large audio-visual speech dataset. Overlapping visual predictions are averaged to generate continuous, smoothly varying speech animation. We outperform a baseline HMM inversion approach in both objective and subjective evaluations and perform a thorough analysis of our results.
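
    A hedged sketch of the sliding-window scheme described above: a small network maps a window of acoustic feature frames to a window of visual feature frames, and overlapping per-window predictions are averaged into a smooth trajectory. The window length, feature dimensions, and the tiny PyTorch MLP are illustrative assumptions, not the paper's architecture or hyperparameters.

        # Hedged sketch of the sliding-window idea: a network maps a window of acoustic
        # feature frames to a window of visual feature frames, and overlapping per-window
        # predictions are averaged into a smooth trajectory. Window length, feature
        # dimensions, and the tiny MLP are illustrative assumptions, not the paper's
        # architecture or hyperparameters.
        import numpy as np
        import torch
        import torch.nn as nn

        AUDIO_DIM, VISUAL_DIM, WIN = 26, 30, 11   # assumed sizes, for illustration

        model = nn.Sequential(
            nn.Linear(AUDIO_DIM * WIN, 256), nn.ReLU(),
            nn.Linear(256, VISUAL_DIM * WIN),
        )

        def predict_visual_track(audio_frames):
            """audio_frames: (n_frames, AUDIO_DIM) array -> (n_frames, VISUAL_DIM) array."""
            n = len(audio_frames)
            acc = np.zeros((n, VISUAL_DIM))
            counts = np.zeros((n, 1))
            for start in range(0, n - WIN + 1):
                window = torch.tensor(audio_frames[start:start + WIN], dtype=torch.float32)
                with torch.no_grad():
                    pred = model(window.reshape(1, -1)).reshape(WIN, VISUAL_DIM).numpy()
                acc[start:start + WIN] += pred     # overlap-add this window's prediction
                counts[start:start + WIN] += 1
            return acc / np.maximum(counts, 1)     # average the overlapping predictions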

    How visual cues to speech rate influence speech perception

    Spoken words are highly variable and therefore listeners interpret speech sounds relative to the surrounding acoustic context, such as the speech rate of a preceding sentence. For instance, a vowel midway between short /ɑ/ and long /a:/ in Dutch is perceived as short /ɑ/ in the context of preceding slow speech, but as long /a:/ if preceded by a fast context. Despite the well-established influence of visual articulatory cues on speech comprehension, it remains unclear whether visual cues to speech rate also influence subsequent spoken word recognition. In two ‘Go Fish’-like experiments, participants were presented with audio-only (auditory speech + fixation cross), visual-only (muted videos of a talking head), and audiovisual (speech + videos) context sentences, followed by ambiguous target words containing vowels midway between short /ɑ/ and long /a:/. In Experiment 1, target words were always presented auditorily, without visual articulatory cues. Although the audio-only and audiovisual contexts induced a rate effect (i.e., more long /a:/ responses after fast contexts), the visual-only condition did not. When, in Experiment 2, target words were presented audiovisually, rate effects were observed in all three conditions, including visual-only. This suggests that visual cues to speech rate in a context sentence influence the perception of the following visual target cues (e.g., duration of lip aperture), which at an audiovisual integration stage bias participants’ target categorization responses. These findings contribute to a better understanding of how what we see influences what we hear.
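
    The rate-normalization logic behind the /ɑ/-/a:/ effect can be illustrated with a toy decision rule in which a fixed vowel duration is judged relative to the average syllable duration of the preceding context, so the same token counts as long after a fast context and short after a slow one. The criterion and durations below are invented for illustration and are not taken from the study.

        # Hedged toy illustration of the rate-normalization logic: the same physical
        # vowel duration is judged relative to the mean syllable duration of the
        # preceding context, so it counts as long /a:/ after a fast context and as
        # short /ɑ/ after a slow one. Criterion and durations are invented for
        # illustration, not taken from the study.
        def categorize_vowel(target_dur_ms, context_syllable_durs_ms, criterion=0.55):
            context_mean_dur = sum(context_syllable_durs_ms) / len(context_syllable_durs_ms)
            relative_duration = target_dur_ms / context_mean_dur
            return "a:" if relative_duration > criterion else "ɑ"

        print(categorize_vowel(100, [150, 160, 155]))  # fast context -> 'a:'
        print(categorize_vowel(100, [220, 230, 210]))  # slow context -> 'ɑ'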