3 research outputs found

    Automatic recognition of connected vowels only using speaker-invariant representation of speech dynamics

    No full text
    Speech acoustics vary with gender, age, microphone, room, transmission channel, and a variety of other factors. In speech recognition research, to deal with these inevitable non-linguistic variations, speech from thousands of speakers recorded under different acoustic conditions has been collected to train acoustic models of individual phonemes. Recently, a novel representation of speech dynamics was proposed [1, 2], in which the above non-linguistic factors are effectively removed from speech, much as pitch information is removed from a spectrum by smoothing it. This representation captures only speaker- and microphone-invariant speech dynamics; no absolute or static acoustic properties such as spectra are used, because any such property would inevitably retain speaker identity in the representation. In our previous study, the new representation was applied to recognizing a sequence of isolated vowels [3]. …
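    The abstract does not spell out the invariant representation of [1, 2]. As a minimal, illustrative sketch: one well-known way to obtain a speaker- and channel-invariant description of an utterance is to discard absolute feature values and keep only pairwise divergences between the distributions of its acoustic events, since a global distortion of the feature space (a common model of speaker and microphone differences) shifts all distributions together and largely cancels in such contrasts. All names below are hypothetical, not taken from the cited papers.

```python
import numpy as np

def bhattacharyya_gauss(mu1, var1, mu2, var2):
    """Bhattacharyya distance between two diagonal-covariance Gaussians."""
    avg = 0.5 * (var1 + var2)
    mean_term = 0.125 * np.sum((mu1 - mu2) ** 2 / avg)
    cov_term = 0.5 * np.sum(np.log(avg) - 0.5 * (np.log(var1) + np.log(var2)))
    return mean_term + cov_term

def structure_matrix(segments):
    """Represent an utterance only by pairwise distances between its
    segment distributions; absolute positions in feature space (and hence
    static speaker/channel offsets) never enter the representation.

    segments: list of (n_frames, n_dims) cepstral feature arrays,
              one array per acoustic event (e.g. one per vowel).
    """
    stats = [(seg.mean(axis=0), seg.var(axis=0) + 1e-6) for seg in segments]
    n = len(stats)
    D = np.zeros((n, n))
    for i in range(n):
        for j in range(i + 1, n):
            D[i, j] = D[j, i] = bhattacharyya_gauss(*stats[i], *stats[j])
    return D
```

    Two utterances can then be matched by comparing their distance matrices rather than their raw spectra, so recognition depends only on the relative structure of the vowel sequence.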

    Constructing Invariant Representation of Sound Using Optimal Features And Sound Statistics Adaptation

    Get PDF
    The ability to convey information using sound is critical for the survival of many vocal species, including humans. These communication sounds (vocalizations or calls) are often composed of complex spectrotemporal features that must be detected accurately to prevent mis-categorization. This task is made difficult by two factors: 1) the inherent variability in vocalization production, and 2) competing sounds from the environment. The auditory system must generalize across these variabilities while maintaining sufficient sensitivity to detect subtle differences in fine acoustic structure. While several studies have described vocalization-selective and noise-invariant neural responses in the auditory pathway at a phenomenological level, the algorithmic and mechanistic principles behind these observations remain speculative. In this thesis, we first adopted a theoretical approach to develop biologically plausible computational algorithms that categorize vocalizations while generalizing over sound production and environmental variability. From an initial set of randomly chosen vocalization features, we used a greedy search algorithm to select the most informative features, maximizing vocalization categorization performance while minimizing redundancy between features. High classification performance could be achieved using only 10–20 features per vocalization category. The optimal features tended to be of intermediate complexity, offering an optimal compromise between fine and tolerant feature tuning. Predictions of the tuning properties of putative feature-selective neurons matched some observed auditory cortical responses. While this algorithm performed well in quiet listening conditions, it failed in noisy conditions. To address this shortcoming, we implemented biologically plausible algorithms to improve model performance in noise. We explored two model elements to aid adaptation to sound statistics: 1) de-noising of noisy inputs by thresholding based on wide-band energy, and 2) adjusting feature detection parameters to offset noise-masking effects. These processes are consistent with physiological observations of gain-control mechanisms and principles of efficient encoding in the brain. With these additions, our model was able to achieve near-physiological levels of performance. Our results suggest that an invariant representation of sound can be achieved using task-dependent features combined with adaptation to input sound statistics.
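    The greedy search described above is not given in detail in the abstract; the sketch below shows a generic forward-selection loop under the stated objective (maximize categorization performance, minimize redundancy between chosen features). The redundancy penalty (mean absolute correlation) and all names are illustrative assumptions, not the thesis's exact procedure.

```python
import numpy as np

def greedy_select(X, y, score_fn, n_select=15, redundancy_weight=0.5):
    """Forward greedy feature selection.

    X: (n_samples, n_features) feature responses, e.g. template match scores
    y: (n_samples,) integer category labels
    score_fn: callable(X_subset, y) -> classification performance in [0, 1]
              (e.g. cross-validated accuracy of any classifier)
    """
    selected, remaining = [], list(range(X.shape[1]))
    for _ in range(n_select):
        base = score_fn(X[:, selected], y) if selected else 0.0
        best_gain, best_f = -np.inf, None
        for f in remaining:
            # Gain in categorization performance from adding feature f...
            gain = score_fn(X[:, selected + [f]], y) - base
            if selected:
                # ...minus a penalty for redundancy with features already chosen
                corr = np.corrcoef(X[:, selected + [f]].T)
                gain -= redundancy_weight * np.abs(corr[-1, :-1]).mean()
            if gain > best_gain:
                best_gain, best_f = gain, f
        selected.append(best_f)
        remaining.remove(best_f)
    return selected
```

    As written, the loop stops after a fixed feature budget (the abstract reports 10–20 features per category sufficing); a natural variant stops when the marginal gain falls below a threshold.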