Development of temporal auditory processing in childhood: Changes in efficiency rather than temporal-modulation selectivity
The ability to detect amplitude modulation (AM) is essential to distinguish the spectro-temporal
features of speech from those of a competing masker. Previous work shows that AM sensitivity
improves until 10 years of age. This may relate to the development of sensory factors (tuning of
AM filters, susceptibility to AM masking) or to changes in processing efficiency (reduction in internal noise, optimization of decision strategies). To disentangle these hypotheses, three groups of
children (5–11 years) and one of young adults completed psychophysical tasks measuring thresholds for detecting sinusoidal AM (with a rate of 4, 8, or 32 Hz) applied to carriers whose inherent
modulations exerted different amounts of AM masking. Results showed that between 5 and 11
years, AM detection thresholds improved and that susceptibility to AM masking slightly increased.
However, the effects of AM rate and carrier were not associated with age, suggesting that sensory
factors are mature by 5 years. Subsequent modelling indicated that reducing internal noise by a factor of 10 accounted for the observed developmental trends. Finally, children’s consonant identification
thresholds in noise were related, to some extent, to AM sensitivity. Increased efficiency in AM detection
may support better use of temporal information in speech during childhood
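The modelling step can be illustrated with a toy signal-detection sketch; the noise values and d' criterion below are illustrative, not the study's fitted parameters:

```python
import numpy as np

def am_threshold_db(internal_noise_sd, criterion_dprime=1.0):
    # Toy observer: the decision variable for AM depth m is m / internal_noise_sd,
    # and detection requires d' >= criterion_dprime, so the threshold depth is
    # criterion_dprime * internal_noise_sd, expressed here in 20*log10(m) dB.
    m_threshold = criterion_dprime * internal_noise_sd
    return 20.0 * np.log10(m_threshold)

child_thr = am_threshold_db(internal_noise_sd=0.5)   # illustrative 5-year-old
adult_thr = am_threshold_db(internal_noise_sd=0.05)  # noise reduced by a factor of 10
# a tenfold reduction in internal noise predicts a 20 dB threshold improvement
```

In such a model, efficiency (internal noise) shifts thresholds uniformly across AM rates, leaving the shape of the AM-rate and carrier effects unchanged, consistent with sensory factors being mature by age 5.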
Learning An Invariant Speech Representation
Recognition of speech, and in particular the ability to generalize and learn
from small sets of labelled examples like humans do, depends on an appropriate
representation of the acoustic input. We formulate the problem of finding
robust speech features for supervised learning with small sample complexity as
a problem of learning representations of the signal that are maximally
invariant to intraclass transformations and deformations. We propose an
extension of a theory for unsupervised learning of invariant visual
representations to the auditory domain and empirically evaluate its validity
for voiced speech sound classification. Our version of the theory requires the
memory-based, unsupervised storage of acoustic templates -- such as specific
phones or words -- together with all the transformations of each that normally
occur. A quasi-invariant representation for a speech segment can be obtained by
projecting it to each template orbit, i.e., the set of transformed signals, and
computing the associated one-dimensional empirical probability distributions.
The computations can be performed by modules of filtering and pooling, and
extended to hierarchical architectures. In this paper, we apply a single-layer,
multicomponent representation for phonemes and demonstrate improved accuracy
and decreased sample complexity for vowel classification compared to standard
spectral, cepstral, and perceptual features. (CBMM Memo No. 022, 5 pages, 2 figures)
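The orbit-projection idea can be sketched compactly; here circular time shifts stand in for the transformation group, and all sizes and names are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)

def orbit_signature(segment, template, shifts, n_bins=8):
    # Project the segment onto every transformed template in the orbit
    # (here: circular time shifts) and pool the projections into a 1D
    # empirical probability distribution, as in the filtering-and-pooling module.
    projections = [float(segment @ np.roll(template, s)) for s in shifts]
    hist, _ = np.histogram(projections, bins=n_bins, range=(-1.0, 1.0))
    return hist / len(projections)

template = rng.standard_normal(64)
template /= np.linalg.norm(template)   # unit norm keeps projections in [-1, 1]
segment = np.roll(template, 5)         # a "transformed" observation
sig_a = orbit_signature(segment, template, shifts=range(64))
sig_b = orbit_signature(np.roll(segment, 11), template, shifts=range(64))
# the signature is (quasi-)invariant to the transformation applied to the segment
```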
The listening talker: A review of human and algorithmic context-induced modifications of speech
Speech output technology is finding widespread application, including in scenarios where intelligibility might be compromised - at least for some listeners - by adverse conditions. Unlike most current algorithms, talkers continually adapt their speech patterns as a response to the immediate context of spoken communication, where the type of interlocutor and the environment are the dominant situational factors influencing speech production. Observations of talker behaviour can motivate the design of more robust speech output algorithms. Starting with a listener-oriented categorisation of possible goals for speech modification, this review article summarises the extensive set of behavioural findings related to human speech modification, identifies which factors appear to be beneficial, and goes on to examine previous computational attempts to improve intelligibility in noise. The review concludes by tabulating 46 speech modifications, many of which have yet to be perceptually or algorithmically evaluated. Consequently, the review provides a roadmap for future work in improving the robustness of speech output
Learning to imitate facial expressions through sound
The question of how young infants learn to imitate others’ facial expressions has been central in developmental psychology for decades. Facial imitation has been argued to constitute a particularly challenging learning task for infants because facial expressions are perceptually opaque: infants cannot see changes in their own facial configuration when they execute a motor program, so how do they learn to match these gestures with those of their interacting partners? Here we argue that this apparent paradox mainly appears if one focuses only on the visual modality, as most existing work in this field has done so far. When considering other modalities, in particular the auditory modality, many facial expressions are not actually perceptually opaque. In fact, every orolabial expression that is accompanied by vocalisations has specific acoustic consequences, which means that it is relatively transparent in the auditory modality. Here, we describe how this relative perceptual transparency can allow infants to accrue experience relevant for orolabial, facial imitation every time they vocalise. We then detail two specific mechanisms that could support facial imitation learning through the auditory modality. First, we review evidence showing that experiencing correlated proprioceptive and auditory feedback when they vocalise – even when they are alone – enables infants to build audio-motor maps that could later support facial imitation of orolabial actions. Second, we show how these maps could also be used by infants to support imitation even for silent, orolabial facial expressions at a later stage. By considering non-visual perceptual domains, this paper expands our understanding of the ontogeny of facial imitation and offers new directions for future investigations
Investigating Speech Perception in Evolutionary Perspective: Comparisons of Chimpanzee (Pan troglodytes) and Human Capabilities
There has been much discussion regarding whether the capability to perceive speech is uniquely human. The “Speech is Special” (SiS) view proposes that humans possess a specialized cognitive module for speech perception (Mann & Liberman, 1983). In contrast, the “Auditory Hypothesis” (Kuhl, 1988) suggests spoken-language evolution took advantage of existing auditory-system capabilities. In support of the Auditory Hypothesis, there is evidence that Panzee, a language-trained chimpanzee (Pan troglodytes), perceives speech in synthetic “sine-wave” and “noise-vocoded” forms (Heimbauer, Beran, & Owren, 2011). Human comprehension of these altered forms of speech has been cited as evidence for specialized cognitive capabilities (Davis, Johnsrude, Hervais-Adelman, Taylor, & McGettigan, 2005).
In light of Panzee’s demonstrated abilities, three experiments extended these investigations of the cognitive processes underlying her speech perception. The first experiment investigated the acoustic cues that Panzee and humans use when identifying sine-wave and noise-vocoded speech. The second experiment examined Panzee’s ability to perceive “time-reversed” speech, in which individual segments of the waveform are reversed in time. Humans are able to perceive such speech if these segments do not much exceed average phoneme length. Finally, the third experiment tested Panzee’s ability to generalize across both familiar and novel talkers, a perceptually challenging task that humans seem to perform effortlessly.
Panzee’s performance was similar to that of humans in all experiments. In Experiment 1, results demonstrated that Panzee likely attends to the same “spectro-temporal” cues in sine-wave and noise-vocoded speech that humans are sensitive to. In Experiment 2, Panzee showed a similar intelligibility pattern as a function of reversal-window length as found in human listeners. In Experiment 3, Panzee readily recognized words not only from a variety of familiar adult males and females, but also from unfamiliar adults and children of both sexes. Overall, results suggest that a combination of general auditory processing and sufficient exposure to meaningful spoken language is sufficient to account for speech-perception evidence previously proposed to require specialized, uniquely human mechanisms. These findings in turn suggest that speech-perception capabilities were already present in latent form in the common evolutionary ancestors of modern chimpanzees and humans
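The time-reversal manipulation from Experiment 2 can be sketched directly: successive fixed-length windows of the waveform are reversed in place (the window length here is in samples and purely illustrative):

```python
import numpy as np

def locally_time_reverse(signal, window_len):
    # Reverse each consecutive window of window_len samples; intelligibility
    # survives this manipulation while windows stay near average phoneme length.
    out = signal.copy()
    for start in range(0, len(out), window_len):
        seg = out[start:start + window_len].copy()
        out[start:start + window_len] = seg[::-1]
    return out

x = np.arange(10)
y = locally_time_reverse(x, 4)
# windows [0,1,2,3] [4,5,6,7] [8,9] become [3,2,1,0] [7,6,5,4] [9,8]
```

Note that applying the manipulation twice with the same window length restores the original signal, which makes the transform easy to verify.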
Mandarin speech perception in combined electric and acoustic stimulation.
For deaf individuals with residual low-frequency acoustic hearing, combined use of a cochlear implant (CI) and hearing aid (HA) typically provides better speech understanding than with either device alone. Because of coarse spectral resolution, CIs do not provide fundamental frequency (F0) information that contributes to understanding of tonal languages such as Mandarin Chinese. The HA can provide good representation of F0 and, depending on the range of aided acoustic hearing, first and second formant (F1 and F2) information. In this study, Mandarin tone, vowel, and consonant recognition in quiet and noise was measured in 12 adult Mandarin-speaking bimodal listeners with the CI-only and with the CI+HA. Tone recognition was significantly better with the CI+HA in noise, but not in quiet. Vowel recognition was significantly better with the CI+HA in quiet, but not in noise. There was no significant difference in consonant recognition between the CI-only and the CI+HA in quiet or in noise. There was a wide range in bimodal benefit, with improvements often greater than 20 percentage points in some tests and conditions. The bimodal benefit was compared to CI subjects' HA-aided pure-tone average (PTA) thresholds between 250 and 2000 Hz; subjects were divided into two groups: "better" PTA (<50 dB HL) or "poorer" PTA (>50 dB HL). The bimodal benefit differed significantly between groups only for consonant recognition. The bimodal benefit for tone recognition in quiet was significantly correlated with CI experience, suggesting that bimodal CI users learn to better combine low-frequency spectro-temporal information from acoustic hearing with temporal envelope information from electric hearing. Given the small number of subjects in this study (n = 12), further research with Chinese bimodal listeners may provide more information regarding the contribution of acoustic and electric hearing to tonal language perception
Local Temporal Regularities in Child-Directed Speech in Spanish
Published online: Oct 4, 2022
Purpose: The purpose of this study is to characterize the local (utterance-level)
temporal regularities of child-directed speech (CDS) that might facilitate phonological
development in Spanish, classically termed a syllable-timed language.
Method: Eighteen female adults addressed their 4-year-old children versus
other adults spontaneously and also read aloud (CDS vs. adult-directed speech
[ADS]). We compared CDS and ADS speech productions using a spectrotemporal
model (Leong & Goswami, 2015), obtaining three temporal metrics: (a) distribution
of modulation energy, (b) temporal regularity of stressed syllables, and
(c) syllable rate.
Results: CDS was characterized by (a) significantly greater modulation energy
in the lower frequencies (0.5–4 Hz), (b) more regular rhythmic occurrence of
stressed syllables, and (c) a slower syllable rate than ADS, across both spontaneous
and read conditions.
Discussion: CDS is characterized by a robust local temporal organization (i.e.,
within utterances), with amplitude modulation bands aligning with the delta and
theta electrophysiological frequency bands and showing greater phase
synchronization than in ADS, which facilitates parsing of stress units and syllables.
These temporal regularities, together with the slower rate of production of CDS,
might support the automatic extraction of phonological units in speech and
hence support the phonological development of children.
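A rough stand-in for metric (a), modulation energy in the 0.5–4 Hz (delta) band, can be sketched as follows; the study itself used the spectro-temporal model of Leong and Goswami (2015), so this rectified-envelope FFT is only a simplified approximation:

```python
import numpy as np

def band_modulation_energy(waveform, fs, lo=0.5, hi=4.0):
    # Crude amplitude envelope: rectify, remove the mean, then measure the
    # fraction of envelope-spectrum energy falling between lo and hi Hz.
    envelope = np.abs(waveform)
    envelope = envelope - envelope.mean()
    spectrum = np.abs(np.fft.rfft(envelope)) ** 2
    freqs = np.fft.rfftfreq(len(envelope), d=1.0 / fs)
    band = (freqs >= lo) & (freqs <= hi)
    return spectrum[band].sum() / spectrum.sum()

fs = 1000
t = np.arange(fs * 4) / fs
carrier = np.sin(2 * np.pi * 200 * t)
slow = (1 + np.sin(2 * np.pi * 2 * t)) * carrier   # 2 Hz envelope (delta band)
fast = (1 + np.sin(2 * np.pi * 20 * t)) * carrier  # 20 Hz envelope (outside band)
# the slowly modulated signal carries far more 0.5-4 Hz modulation energy
```

Under this sketch, the CDS finding corresponds to speech whose envelope behaves more like `slow` than `fast` relative to ADS.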
Supplemental Material: https://doi.org/10.23641/asha.21210893
This study was supported by the Formación de Personal
Investigador Grant BES-2016-078125 from the Ministerio
Español de Economía, Industria y Competitividad and Fondo
Social Europeo awarded to Jose Pérez-Navarro; through
Project RTI2018-096242-B-I00 (Ministerio de Ciencia,
Innovación y Universidades [MCIU]/Agencia Estatal de
Investigación [AEI]/Fondo Europeo de Desarrollo Regional
[FEDER], Unión Europea) funded by MCIU, the AEI,
and FEDER awarded to Marie Lallier; by the Basque
Government through the Basque Excellence Research Centre
2018-2021 Program; and by the Spanish State Research
Agency through Basque Center on Cognition, Brain and
Language Severo Ochoa Excellence Accreditation SEV-
2015-0490. We thank the participants and their
children for their voluntary contribution to our study
Learning spectro-temporal representations of complex sounds with parameterized neural networks
Deep Learning models have become potential candidates for auditory
neuroscience research, thanks to their recent successes on a variety of
auditory tasks. Yet, these models often lack the interpretability needed to fully
understand the exact computations that have been performed. Here, we propose a
parameterized neural network layer that computes specific spectro-temporal
modulations based on Gabor kernels (Learnable STRFs) and that is fully
interpretable. We evaluated the predictive capabilities of this layer on Speech
Activity Detection, Speaker Verification, Urban Sound Classification and Zebra
Finch Call Type Classification. We found that models based on Learnable
STRFs perform on par with task-specific toplines on all tasks, and obtain the best
performance for Speech Activity Detection. As this layer is fully
interpretable, we used quantitative measures to describe the distribution of
the learned spectro-temporal modulations. The filters adapted to each task and
focused mostly on low temporal and spectral modulations. The analyses show that
the filters learned on human speech have similar spectro-temporal parameters as
the ones measured directly in the human auditory cortex. Finally, we observed
that the tasks were organized in a meaningful way: the human vocalization tasks
clustered close to each other, while the bird vocalization task lay far from both
the human vocalization and urban sound tasks
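A minimal fixed (non-learned) Gabor spectro-temporal kernel of the kind the layer parameterizes might look like this; all parameter names and values are illustrative, not the layer's actual implementation:

```python
import numpy as np

def gabor_strf(times, freqs, temporal_mod, spectral_mod, sigma_t, sigma_f):
    # 2D Gabor: a sinusoid tuned to (temporal_mod Hz, spectral_mod cycles/octave)
    # under a separable Gaussian envelope over time and log-frequency.
    t, f = np.meshgrid(times, freqs, indexing="ij")
    envelope = np.exp(-t**2 / (2 * sigma_t**2) - f**2 / (2 * sigma_f**2))
    carrier = np.cos(2 * np.pi * (temporal_mod * t + spectral_mod * f))
    return envelope * carrier

times = np.linspace(-0.1, 0.1, 41)  # seconds around the kernel centre
freqs = np.linspace(-2.0, 2.0, 33)  # octaves relative to the kernel centre
kernel = gabor_strf(times, freqs, temporal_mod=4.0, spectral_mod=0.5,
                    sigma_t=0.04, sigma_f=0.8)
# convolving a spectrogram with this kernel emphasises low spectro-temporal
# modulations, the region the learned filters mostly concentrated on
```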