Development of temporal auditory processing in childhood: Changes in efficiency rather than temporal-modulation selectivity
The ability to detect amplitude modulation (AM) is essential to distinguish the spectro-temporal
features of speech from those of a competing masker. Previous work shows that AM sensitivity
improves until 10 years of age. This may relate to the development of sensory factors (tuning of
AM filters, susceptibility to AM masking) or to changes in processing efficiency (reduction in internal noise, optimization of decision strategies). To disentangle these hypotheses, three groups of
children (5–11 years) and one of young adults completed psychophysical tasks measuring thresholds for detecting sinusoidal AM (with a rate of 4, 8, or 32 Hz) applied to carriers whose inherent
modulations exerted different amounts of AM masking. Results showed that between 5 and 11
years, AM detection thresholds improved and that susceptibility to AM masking slightly increased.
However, the effects of AM rate and carrier were not associated with age, suggesting that sensory
factors are mature by 5 years. Subsequent modelling indicated that reducing internal noise by a factor of 10 accounted for the observed developmental trends. Finally, children’s consonant identification
thresholds in noise were related, to some extent, to AM sensitivity. Increased efficiency in AM detection
may support better use of temporal information in speech during childhood
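The modelling step can be illustrated with a toy signal-detection sketch; the noise values and d' criterion below are illustrative, not the study's fitted parameters:

```python
import numpy as np

def am_threshold_db(internal_noise_sd, criterion_dprime=1.0):
    # Toy observer: the decision variable for AM depth m is m / internal_noise_sd,
    # and detection requires d' >= criterion_dprime, so the threshold depth is
    # criterion_dprime * internal_noise_sd, expressed here in 20*log10(m) dB.
    m_threshold = criterion_dprime * internal_noise_sd
    return 20.0 * np.log10(m_threshold)

child_thr = am_threshold_db(internal_noise_sd=0.5)   # illustrative 5-year-old
adult_thr = am_threshold_db(internal_noise_sd=0.05)  # noise reduced by a factor of 10
# a tenfold reduction in internal noise predicts a 20 dB threshold improvement
```

In such a model, efficiency (internal noise) shifts thresholds uniformly across AM rates, leaving the shape of the AM-rate and carrier effects unchanged, consistent with sensory factors being mature by age 5.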
Learning An Invariant Speech Representation
Recognition of speech, and in particular the ability to generalize and learn
from small sets of labelled examples like humans do, depends on an appropriate
representation of the acoustic input. We formulate the problem of finding
robust speech features for supervised learning with small sample complexity as
a problem of learning representations of the signal that are maximally
invariant to intraclass transformations and deformations. We propose an
extension of a theory for unsupervised learning of invariant visual
representations to the auditory domain and empirically evaluate its validity
for voiced speech sound classification. Our version of the theory requires the
memory-based, unsupervised storage of acoustic templates -- such as specific
phones or words -- together with all the transformations of each that normally
occur. A quasi-invariant representation for a speech segment can be obtained by
projecting it to each template orbit, i.e., the set of transformed signals, and
computing the associated one-dimensional empirical probability distributions.
The computations can be performed by modules of filtering and pooling, and
extended to hierarchical architectures. In this paper, we apply a single-layer,
multicomponent representation for phonemes and demonstrate improved accuracy
and decreased sample complexity for vowel classification compared to standard
spectral, cepstral, and perceptual features. (CBMM Memo No. 022, 5 pages, 2 figures)
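The orbit-projection idea can be sketched compactly; here circular time shifts stand in for the transformation group, and all sizes and names are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)

def orbit_signature(segment, template, shifts, n_bins=8):
    # Project the segment onto every transformed template in the orbit
    # (here: circular time shifts) and pool the projections into a 1D
    # empirical probability distribution, as in the filtering-and-pooling module.
    projections = [float(segment @ np.roll(template, s)) for s in shifts]
    hist, _ = np.histogram(projections, bins=n_bins, range=(-1.0, 1.0))
    return hist / len(projections)

template = rng.standard_normal(64)
template /= np.linalg.norm(template)   # unit norm keeps projections in [-1, 1]
segment = np.roll(template, 5)         # a "transformed" observation
sig_a = orbit_signature(segment, template, shifts=range(64))
sig_b = orbit_signature(np.roll(segment, 11), template, shifts=range(64))
# the signature is (quasi-)invariant to the transformation applied to the segment
```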
The listening talker: A review of human and algorithmic context-induced modifications of speech
Speech output technology is finding widespread application, including in scenarios where intelligibility might be compromised - at least for some listeners - by adverse conditions. Unlike most current algorithms, talkers continually adapt their speech patterns as a response to the immediate context of spoken communication, where the type of interlocutor and the environment are the dominant situational factors influencing speech production. Observations of talker behaviour can motivate the design of more robust speech output algorithms. Starting with a listener-oriented categorisation of possible goals for speech modification, this review article summarises the extensive set of behavioural findings related to human speech modification, identifies which factors appear to be beneficial, and goes on to examine previous computational attempts to improve intelligibility in noise. The review concludes by tabulating 46 speech modifications, many of which have yet to be perceptually or algorithmically evaluated. Consequently, the review provides a roadmap for future work in improving the robustness of speech output
Learning to imitate facial expressions through sound
The question of how young infants learn to imitate others’ facial expressions has been central in developmental psychology for decades. Facial imitation has been argued to constitute a particularly challenging learning task for infants because facial expressions are perceptually opaque: infants cannot see changes in their own facial configuration when they execute a motor program, so how do they learn to match these gestures with those of their interacting partners? Here we argue that this apparent paradox mainly appears if one focuses only on the visual modality, as most existing work in this field has done so far. When considering other modalities, in particular the auditory modality, many facial expressions are not actually perceptually opaque. In fact, every orolabial expression that is accompanied by vocalisations has specific acoustic consequences, which means that it is relatively transparent in the auditory modality. Here, we describe how this relative perceptual transparency can allow infants to accrue experience relevant for orolabial, facial imitation every time they vocalise. We then detail two specific mechanisms that could support facial imitation learning through the auditory modality. First, we review evidence showing that experiencing correlated proprioceptive and auditory feedback when they vocalise – even when they are alone – enables infants to build audio-motor maps that could later support facial imitation of orolabial actions. Second, we show how these maps could also be used by infants to support imitation even for silent, orolabial facial expressions at a later stage. By considering non-visual perceptual domains, this paper expands our understanding of the ontogeny of facial imitation and offers new directions for future investigations
Investigating Speech Perception in Evolutionary Perspective: Comparisons of Chimpanzee (Pan troglodytes) and Human Capabilities
There has been much discussion regarding whether the capability to perceive speech is uniquely human. The “Speech is Special” (SiS) view proposes that humans possess a specialized cognitive module for speech perception (Mann & Liberman, 1983). In contrast, the “Auditory Hypothesis” (Kuhl, 1988) suggests spoken-language evolution took advantage of existing auditory-system capabilities. In support of the Auditory Hypothesis, there is evidence that Panzee, a language-trained chimpanzee (Pan troglodytes), perceives speech in synthetic “sine-wave” and “noise-vocoded” forms (Heimbauer, Beran, & Owren, 2011). Human comprehension of these altered forms of speech has been cited as evidence for specialized cognitive capabilities (Davis, Johnsrude, Hervais-Adelman, Taylor, & McGettigan, 2005).
In light of Panzee’s demonstrated abilities, three experiments extended these investigations of the cognitive processes underlying her speech perception. The first experiment investigated the acoustic cues that Panzee and humans use when identifying sine-wave and noise-vocoded speech. The second experiment examined Panzee’s ability to perceive “time-reversed” speech, in which individual segments of the waveform are reversed in time. Humans are able to perceive such speech if these segments do not much exceed average phoneme length. Finally, the third experiment tested Panzee’s ability to generalize across both familiar and novel talkers, a perceptually challenging task that humans seem to perform effortlessly.
Panzee’s performance was similar to that of humans in all experiments. In Experiment 1, results demonstrated that Panzee likely attends to the same “spectro-temporal” cues in sine-wave and noise-vocoded speech that humans are sensitive to. In Experiment 2, Panzee showed a similar intelligibility pattern as a function of reversal-window length as found in human listeners. In Experiment 3, Panzee readily recognized words not only from a variety of familiar adult males and females, but also from unfamiliar adults and children of both sexes. Overall, results suggest that a combination of general auditory processing and sufficient exposure to meaningful spoken language is sufficient to account for speech-perception evidence previously proposed to require specialized, uniquely human mechanisms. These findings in turn suggest that speech-perception capabilities were already present in latent form in the common evolutionary ancestors of modern chimpanzees and humans
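The time-reversal manipulation from Experiment 2 can be sketched directly: successive fixed-length windows of the waveform are reversed in place (the window length here is in samples and purely illustrative):

```python
import numpy as np

def locally_time_reverse(signal, window_len):
    # Reverse each consecutive window of window_len samples; intelligibility
    # survives this manipulation while windows stay near average phoneme length.
    out = signal.copy()
    for start in range(0, len(out), window_len):
        seg = out[start:start + window_len].copy()
        out[start:start + window_len] = seg[::-1]
    return out

x = np.arange(10)
y = locally_time_reverse(x, 4)
# windows [0,1,2,3] [4,5,6,7] [8,9] become [3,2,1,0] [7,6,5,4] [9,8]
```

Note that applying the manipulation twice with the same window length restores the original signal, which makes the transform easy to verify.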
Mandarin speech perception in combined electric and acoustic stimulation.
For deaf individuals with residual low-frequency acoustic hearing, combined use of a cochlear implant (CI) and hearing aid (HA) typically provides better speech understanding than with either device alone. Because of coarse spectral resolution, CIs do not provide fundamental frequency (F0) information that contributes to understanding of tonal languages such as Mandarin Chinese. The HA can provide good representation of F0 and, depending on the range of aided acoustic hearing, first and second formant (F1 and F2) information. In this study, Mandarin tone, vowel, and consonant recognition in quiet and noise was measured in 12 adult Mandarin-speaking bimodal listeners with the CI-only and with the CI+HA. Tone recognition was significantly better with the CI+HA in noise, but not in quiet. Vowel recognition was significantly better with the CI+HA in quiet, but not in noise. There was no significant difference in consonant recognition between the CI-only and the CI+HA in quiet or in noise. There was a wide range in bimodal benefit, with improvements often greater than 20 percentage points in some tests and conditions. The bimodal benefit was compared to CI subjects' HA-aided pure-tone average (PTA) thresholds between 250 and 2000 Hz; subjects were divided into two groups: "better" PTA (<50 dB HL) or "poorer" PTA (>50 dB HL). The bimodal benefit differed significantly between groups only for consonant recognition. The bimodal benefit for tone recognition in quiet was significantly correlated with CI experience, suggesting that bimodal CI users learn to better combine low-frequency spectro-temporal information from acoustic hearing with temporal envelope information from electric hearing. Given the small number of subjects in this study (n = 12), further research with Chinese bimodal listeners may provide more information regarding the contribution of acoustic and electric hearing to tonal language perception
Local Temporal Regularities in Child-Directed Speech in Spanish
Published online: Oct 4, 2022
Purpose: The purpose of this study is to characterize the local (utterance-level)
temporal regularities of child-directed speech (CDS) that might facilitate phonological
development in Spanish, classically termed a syllable-timed language.
Method: Eighteen female adults addressed their 4-year-old children versus
other adults spontaneously and also read aloud (CDS vs. adult-directed speech
[ADS]). We compared CDS and ADS speech productions using a spectrotemporal
model (Leong & Goswami, 2015), obtaining three temporal metrics: (a) distribution
of modulation energy, (b) temporal regularity of stressed syllables, and
(c) syllable rate.
Results: CDS was characterized by (a) significantly greater modulation energy
in the lower frequencies (0.5–4 Hz), (b) more regular rhythmic occurrence of
stressed syllables, and (c) a slower syllable rate than ADS, across both spontaneous
and read conditions.
Discussion: CDS is characterized by a robust local temporal organization (i.e.,
within utterances), with amplitude modulation bands aligning with the delta and
theta electrophysiological frequency bands and showing greater phase
synchronization than in ADS, which facilitates parsing of stress units and syllables.
These temporal regularities, together with the slower rate of production of CDS,
might support the automatic extraction of phonological units in speech and
hence support the phonological development of children.
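A rough stand-in for metric (a), modulation energy in the 0.5–4 Hz (delta) band, can be sketched as follows; the study itself used the spectro-temporal model of Leong and Goswami (2015), so this rectified-envelope FFT is only a simplified approximation:

```python
import numpy as np

def band_modulation_energy(waveform, fs, lo=0.5, hi=4.0):
    # Crude amplitude envelope: rectify, remove the mean, then measure the
    # fraction of envelope-spectrum energy falling between lo and hi Hz.
    envelope = np.abs(waveform)
    envelope = envelope - envelope.mean()
    spectrum = np.abs(np.fft.rfft(envelope)) ** 2
    freqs = np.fft.rfftfreq(len(envelope), d=1.0 / fs)
    band = (freqs >= lo) & (freqs <= hi)
    return spectrum[band].sum() / spectrum.sum()

fs = 1000
t = np.arange(fs * 4) / fs
carrier = np.sin(2 * np.pi * 200 * t)
slow = (1 + np.sin(2 * np.pi * 2 * t)) * carrier   # 2 Hz envelope (delta band)
fast = (1 + np.sin(2 * np.pi * 20 * t)) * carrier  # 20 Hz envelope (outside band)
# the slowly modulated signal carries far more 0.5-4 Hz modulation energy
```

Under this sketch, the CDS finding corresponds to speech whose envelope behaves more like `slow` than `fast` relative to ADS.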
Supplemental Material: https://doi.org/10.23641/asha.21210893
This study was supported by the Formación de Personal
Investigador Grant BES-2016-078125 from the Ministerio
Español de Economía, Industria y Competitividad and Fondo
Social Europeo awarded to Jose Pérez-Navarro; through
Project RTI2018-096242-B-I00 (Ministerio de Ciencia,
Innovación y Universidades [MCIU]/Agencia Estatal de
Investigación [AEI]/Fondo Europeo de Desarrollo Regional
[FEDER], Unión Europea) funded by MCIU, the AEI,
and FEDER awarded to Marie Lallier; by the Basque
Government through the Basque Excellence Research Centre
2018-2021 Program; and by the Spanish State Research
Agency through Basque Center on Cognition, Brain and
Language Severo Ochoa Excellence Accreditation SEV-
2015-0490. We thank the participants and their
children for their voluntary contribution to our study
Learning spectro-temporal representations of complex sounds with parameterized neural networks
Deep Learning models have become potential candidates for auditory
neuroscience research, thanks to their recent successes on a variety of
auditory tasks. Yet, these models often lack the interpretability needed to fully
understand the exact computations that have been performed. Here, we propose a
parameterized neural network layer that computes specific spectro-temporal
modulations based on Gabor kernels (Learnable STRFs) and that is fully
interpretable. We evaluated the predictive capabilities of this layer on Speech
Activity Detection, Speaker Verification, Urban Sound Classification and Zebra
Finch Call Type Classification. We found that models based on Learnable
STRFs perform on par with task-specific toplines on all tasks, and obtain the best
performance for Speech Activity Detection. As this layer is fully
interpretable, we used quantitative measures to describe the distribution of
the learned spectro-temporal modulations. The filters adapted to each task and
focused mostly on low temporal and spectral modulations. The analyses show that
the filters learned on human speech have similar spectro-temporal parameters as
the ones measured directly in the human auditory cortex. Finally, we observed
that the tasks were organized in a meaningful way: the human vocalization tasks
clustered close to each other, while the bird vocalization task lay far from both
the human vocalization and urban sound tasks
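A minimal fixed (non-learned) Gabor spectro-temporal kernel of the kind the layer parameterizes might look like this; all parameter names and values are illustrative, not the layer's actual implementation:

```python
import numpy as np

def gabor_strf(times, freqs, temporal_mod, spectral_mod, sigma_t, sigma_f):
    # 2D Gabor: a sinusoid tuned to (temporal_mod Hz, spectral_mod cycles/octave)
    # under a separable Gaussian envelope over time and log-frequency.
    t, f = np.meshgrid(times, freqs, indexing="ij")
    envelope = np.exp(-t**2 / (2 * sigma_t**2) - f**2 / (2 * sigma_f**2))
    carrier = np.cos(2 * np.pi * (temporal_mod * t + spectral_mod * f))
    return envelope * carrier

times = np.linspace(-0.1, 0.1, 41)  # seconds around the kernel centre
freqs = np.linspace(-2.0, 2.0, 33)  # octaves relative to the kernel centre
kernel = gabor_strf(times, freqs, temporal_mod=4.0, spectral_mod=0.5,
                    sigma_t=0.04, sigma_f=0.8)
# convolving a spectrogram with this kernel emphasises low spectro-temporal
# modulations, the region the learned filters mostly concentrated on
```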