Top-down effects on compensation for coarticulation are not replicable
Listeners use lexical knowledge to judge what speech sounds they heard. I investigated whether such lexical influences are truly top-down or merely reflect a merging of perceptual and lexical constraints, by testing whether the lexically determined identity of a phone exerts the appropriate context effects on surrounding phones. The current investigation focuses on compensation for coarticulation in vowel-fricative sequences, where the presence of a rounded vowel (/y/ rather than /i/) leads fricatives to be perceived as /s/ rather than /ʃ/. This result was consistently found in all three experiments. A vowel was also more likely to be perceived as rounded /y/ if that led listeners to perceive words rather than nonwords (Dutch: meny, English id. vs. meni, a nonword). This lexical influence on the perception of the vowel had, however, no consistent influence on the perception of the following fricative.
Pointing gestures do not influence the perception of lexical stress
We investigated whether seeing a pointing gesture influences perceived lexical stress. A pitch contour continuum between the Dutch words "CAnon" ('canon') and "kaNON" ('cannon') was presented along with a pointing gesture during the first or the second syllable. Pointing gestures following natural recordings, but not Gaussian functions, influenced stress perception (Experiments 1 and 2), especially when auditory context preceded (Experiment 2). This was not replicated in Experiment 3. Natural pointing gestures failed to affect the categorization of a pitch peak timing continuum (Experiment 4). There is thus no convincing evidence that seeing a pointing gesture influences lexical stress perception.
This research was supported in part by an Innovational Research Incentive Scheme Veni grant from the Netherlands Organization for Scientific Research (NWO) awarded to the first author. The authors thank Lies Cuijpers for her help with the experiments.
Perceptual learning of liquids
Previous research on lexically-guided perceptual learning has focussed on contrasts that differ primarily in local cues, such as plosive and fricative contrasts. The present research had two aims: to investigate whether perceptual learning occurs for a contrast with non-local cues, the /l/-/r/ contrast, and to establish whether STRAIGHT can be used to create ambiguous sounds on an /l/-/r/ continuum. Listening experiments showed lexically-guided learning about the /l/-/r/ contrast. Listeners can thus tune in to unusual speech sounds characterised by non-local cues. Moreover, STRAIGHT can be used to create stimuli for perceptual learning experiments, opening up new research possibilities.
Index Terms: perceptual learning, morphing, liquids, human word recognition, STRAIGHT.
The research by Odette Scharenborg was partly sponsored by the Max Planck International Research Network on Aging. We thank Denise Moerel, Laurence Bruggeman, Lies Cuijpers, Michael Wiechers, Willemijn van den Berg, and Zhou Fang for assistance in preparing and running these experiments and Marijt Witteman for recording the stimuli.
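STRAIGHT itself is a MATLAB analysis/resynthesis toolkit, so the sketch below only illustrates the continuum-morphing idea using the WORLD vocoder (via pyworld) as a stand-in; the endpoint file names, the 7-step continuum, and the frame-wise linear interpolation are assumptions for illustration, not the procedure used in the paper.

```python
# Minimal sketch: build an /l/-/r/ continuum by interpolating vocoder
# parameters between two time-aligned endpoint recordings.
# Assumes both endpoints share the same sampling rate and rough duration.
import numpy as np
import pyworld as pw
import soundfile as sf

def analyse(path):
    x, fs = sf.read(path)              # mono float64 expected by pyworld
    f0, t = pw.harvest(x, fs)          # fundamental frequency contour
    sp = pw.cheaptrick(x, f0, t, fs)   # smoothed spectral envelope
    ap = pw.d4c(x, f0, t, fs)          # aperiodicity
    return f0, sp, ap, fs

# Hypothetical endpoint recordings (file names are placeholders).
f0_l, sp_l, ap_l, fs = analyse("endpoint_l.wav")
f0_r, sp_r, ap_r, _ = analyse("endpoint_r.wav")

# Crude alignment: trim both analyses to the shorter one.
n = min(len(f0_l), len(f0_r))
f0_l, sp_l, ap_l = f0_l[:n], sp_l[:n], ap_l[:n]
f0_r, sp_r, ap_r = f0_r[:n], sp_r[:n], ap_r[:n]

for i, w in enumerate(np.linspace(0.0, 1.0, 7)):   # 7 steps (assumption)
    # Interpolate spectral envelopes in the log domain; F0/aperiodicity linearly.
    sp_mix = np.exp((1 - w) * np.log(sp_l) + w * np.log(sp_r))
    ap_mix = (1 - w) * ap_l + w * ap_r
    voiced = (f0_l > 0) & (f0_r > 0)
    f0_mix = np.where(voiced, (1 - w) * f0_l + w * f0_r, f0_l)
    y = pw.synthesize(f0_mix, sp_mix, ap_mix, fs)
    sf.write(f"step_{i}.wav", y, fs)
```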
Detecting Alzheimer's Disease using Interactional and Acoustic features from Spontaneous Speech
Alzheimer's Dementia Recognition Using Acoustic, Lexical, Disfluency and Speech Pause Features Robust to Noisy Inputs
INTERSPEECH 2021. arXiv admin note: substantial text overlap with arXiv:2106.09668. We present two multimodal fusion-based deep learning models that consume ASR-transcribed speech and acoustic data simultaneously to classify whether a speaker in a structured diagnostic task has Alzheimer's Disease and to what degree, evaluating on the ADReSSo challenge 2021 data. Our best model, a BiLSTM with highway layers using words, word probabilities, disfluency features, pause information, and a variety of acoustic features, achieves an accuracy of 84% and an RMSE of 4.26 when predicting MMSE cognitive scores. While predicting cognitive decline is more challenging, our models show improvement from the multimodal approach and from word probabilities, disfluency and pause information over word-only models. We show considerable gains for AD classification using multimodal fusion and gating, which can effectively deal with noisy inputs from acoustic features and ASR hypotheses.
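The fusion-and-gating idea can be sketched roughly in PyTorch as below; the feature dimensions (e.g. 88 acoustic descriptors), the single highway layer, and the sigmoid gating formulation are illustrative assumptions, not the authors' exact architecture.

```python
# Minimal sketch of gated multimodal fusion for AD classification:
# a lexical BiLSTM stream with a highway layer, an acoustic stream,
# and a learned gate that damps noisy acoustic inputs before the classifier.
import torch
import torch.nn as nn

class Highway(nn.Module):
    """y = t * H(x) + (1 - t) * x, a standard highway layer."""
    def __init__(self, dim):
        super().__init__()
        self.h = nn.Linear(dim, dim)
        self.t = nn.Linear(dim, dim)

    def forward(self, x):
        t = torch.sigmoid(self.t(x))
        return t * torch.relu(self.h(x)) + (1 - t) * x

class GatedFusionClassifier(nn.Module):
    def __init__(self, lex_dim=300, ac_dim=88, hidden=128, n_classes=2):
        super().__init__()
        self.bilstm = nn.LSTM(lex_dim, hidden, batch_first=True, bidirectional=True)
        self.highway = Highway(2 * hidden)
        self.ac_proj = nn.Linear(ac_dim, 2 * hidden)
        # Gate decides, per dimension, how much of the acoustic stream to let through.
        self.gate = nn.Linear(4 * hidden, 2 * hidden)
        self.out = nn.Linear(2 * hidden, n_classes)

    def forward(self, words, acoustics):
        # words: (batch, seq_len, lex_dim); acoustics: (batch, ac_dim)
        _, (h, _) = self.bilstm(words)
        lex = torch.cat([h[-2], h[-1]], dim=-1)   # final forward/backward states
        lex = self.highway(lex)
        ac = torch.relu(self.ac_proj(acoustics))
        g = torch.sigmoid(self.gate(torch.cat([lex, ac], dim=-1)))
        fused = lex + g * ac
        return self.out(fused)

model = GatedFusionClassifier()
logits = model(torch.randn(4, 50, 300), torch.randn(4, 88))  # toy batch
```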
Towards Deep End-of-Turn Prediction for Situated Spoken Dialogue Systems
This work was supported by the Cluster of Excellence Cognitive Interaction Technology 'CITEC' (EXC 277) at Bielefeld University, funded by the German Research Foundation (DFG), and the DFG-funded DUEL project (grant SCHL 845/5-1).
Development of a Speech Quality Database Under Uncontrolled Conditions
Objective audio quality assessment is preferred to avoid time-consuming and costly listening tests. The development of objective quality metrics depends on the availability of datasets appropriate to the application under study. Currently, a suitable human-annotated dataset for developing quality metrics in archive audio is missing. Given the online availability of archival recordings, we propose to develop a real-world audio quality dataset. We present a methodology used to curate a speech quality database using the archive recordings from the Apollo Space Program. The proposed procedure is based on two steps: a pilot listening test and an exploratory data analysis. The pilot listening test shows that audio clips can be selected by controlling speech-to-text performance metrics so as to prevent data repetition. Through unsupervised exploratory data analysis, we explore the characteristics of the degradations. We classify distinct degradations and study the spectral, intensity, tonality, and overall quality properties of the data through clustering techniques. These results provide the necessary foundation to support the subsequent development of large-scale crowdsourced datasets for audio quality.
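The exploratory clustering step might look roughly like the sketch below, which groups clips by clip-level spectral, tonality, and intensity descriptors; the librosa features, directory name, and cluster count are assumptions, not the features or methods reported in the paper.

```python
# Rough sketch of unsupervised exploration of archive clips:
# clip-level spectral, tonality, and intensity descriptors fed to k-means.
import glob
import numpy as np
import librosa
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans

def clip_features(path):
    y, sr = librosa.load(path, sr=None)
    centroid = librosa.feature.spectral_centroid(y=y, sr=sr).mean()
    flatness = librosa.feature.spectral_flatness(y=y).mean()  # tonality proxy
    rms = librosa.feature.rms(y=y).mean()                     # intensity proxy
    return [centroid, flatness, rms]

paths = sorted(glob.glob("archive_clips/*.wav"))  # placeholder clip directory
X = StandardScaler().fit_transform(np.array([clip_features(p) for p in paths]))
labels = KMeans(n_clusters=4, n_init=10, random_state=0).fit_predict(X)
print(dict(zip(paths, labels)))  # inspect which degradation cluster each clip falls into
```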
Memory Controlled Sequential Self Attention for Sound Recognition
In this paper we investigate the importance of the extent of memory in sequential self attention for sound recognition. We propose to use a memory controlled sequential self attention mechanism on top of a convolutional recurrent neural network (CRNN) model for polyphonic sound event detection (SED). Experiments on the URBAN-SED dataset demonstrate the impact of the extent of memory on sound recognition performance with the self attention induced SED model. We extend the proposed idea with a multi-head self attention mechanism where each attention head processes the audio embedding with explicit attention width values. The proposed use of memory controlled sequential self attention offers a way to induce relations among frames of sound event tokens. We show that our memory controlled self attention model achieves an event-based F-score of 33.92% on the URBAN-SED dataset, outperforming the F-score of 20.10% reported by the model without self attention. Index Terms: Memory controlled self attention, sound recognition, multi-head attention.
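One way to picture the memory control is as a banded attention mask that limits each frame to a fixed temporal width, as in the hedged PyTorch sketch below; the dimensions and the single shared width are assumptions (the paper's multi-head variant gives each head its own explicit width).

```python
# Minimal sketch of memory-controlled (width-limited) self attention over a
# sequence of frame embeddings, as one might place on top of a CRNN encoder.
import torch
import torch.nn as nn

class MemoryControlledSelfAttention(nn.Module):
    def __init__(self, dim=128, n_heads=4, width=20):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, n_heads, batch_first=True)
        self.width = width  # frames each position may attend to on either side

    def forward(self, x):
        # x: (batch, frames, dim), e.g. CRNN frame embeddings
        n = x.size(1)
        idx = torch.arange(n, device=x.device)
        # True entries are masked out: positions further than `width` frames away.
        mask = (idx[None, :] - idx[:, None]).abs() > self.width
        out, _ = self.attn(x, x, x, attn_mask=mask)
        return out

frames = torch.randn(2, 400, 128)                  # toy batch of frame embeddings
attended = MemoryControlledSelfAttention()(frames)  # feed to a frame-wise SED head
```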
Towards joint sound scene and polyphonic sound event recognition
Acoustic Scene Classification (ASC) and Sound Event Detection (SED) are two separate tasks in the field of computational sound scene analysis. In this work, we present a new dataset with both sound scene and sound event labels and use this to demonstrate a novel method for jointly classifying sound scenes and recognizing sound events. We show that by taking a joint approach, learning is more efficient, and whilst improvements are still needed for sound event detection, SED results are robust in a dataset where the sample distribution is skewed towards sound scenes.
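The joint approach can be pictured as a shared encoder with a clip-level scene head and a frame-level event head, as in the minimal sketch below; the GRU encoder, layer sizes, and label counts are assumptions rather than the paper's model.

```python
# Minimal sketch of joint sound scene classification (clip level) and
# polyphonic sound event detection (frame level) with a shared encoder.
import torch
import torch.nn as nn

class JointSceneEventModel(nn.Module):
    def __init__(self, n_mels=64, hidden=128, n_scenes=10, n_events=25):
        super().__init__()
        self.encoder = nn.GRU(n_mels, hidden, batch_first=True, bidirectional=True)
        self.event_head = nn.Linear(2 * hidden, n_events)  # frame-level SED logits
        self.scene_head = nn.Linear(2 * hidden, n_scenes)  # clip-level ASC logits

    def forward(self, mel):
        # mel: (batch, frames, n_mels)
        h, _ = self.encoder(mel)
        event_logits = self.event_head(h)              # (batch, frames, n_events)
        scene_logits = self.scene_head(h.mean(dim=1))  # pooled over time
        return scene_logits, event_logits

model = JointSceneEventModel()
scene_logits, event_logits = model(torch.randn(4, 500, 64))
# Joint training would sum a scene cross-entropy and a frame-wise
# binary cross-entropy over the two outputs.
```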
