
    Learning Representations for Nonspeech Audio Events through Their Similarities to Speech Patterns

    The human auditory system is very well matched to both human speech and environmental sounds. Therefore, the question arises whether human speech material may provide useful information for training systems that analyze nonspeech audio signals, for example in a classification task. To answer this question, we consider speech patterns as basic acoustic concepts which embody and represent the target nonspeech signal. To find out how similar the nonspeech signal is to speech, we classify it with a classifier trained on the speech patterns and use the classification posteriors to represent its closeness to the speech bases. The speech similarities are then employed as a descriptor for the target signal. We further show that a better descriptor can be obtained by learning to organize the speech categories hierarchically with a tree structure. These descriptors are also generic: once the speech classifier has been learned, it can be employed as a feature extractor for different datasets without re-training. Lastly, we propose an algorithm to select a subset of speech patterns that approximately preserves the representational capability of the entire available set. We conduct experiments on audio event analysis. Phone triplets from the TIMIT dataset were used as speech patterns to learn descriptors for audio events from three datasets of different complexity: UPC-TALP, Freiburg-106, and NAR. The experimental results on the event classification task show that good performance is obtained even with a simple linear classifier. Furthermore, fusing the learned descriptors as an additional source leads to state-of-the-art performance on all three target datasets.
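
    A minimal sketch of the descriptor construction described above, assuming scikit-learn and randomly generated placeholder features; the feature dimensionality, class counts, and classifier choices are illustrative assumptions, not details taken from the paper.

        import numpy as np
        from sklearn.linear_model import LogisticRegression
        from sklearn.svm import LinearSVC

        # 1) Train a speech-pattern classifier (placeholder stand-in for phone-triplet classes).
        speech_feats = np.random.randn(500, 39)        # placeholder TIMIT-like features
        speech_labels = np.random.randint(0, 50, 500)  # 50 hypothetical speech classes
        speech_clf = LogisticRegression(max_iter=1000).fit(speech_feats, speech_labels)

        # 2) Describe each nonspeech event by its posteriors over the speech classes.
        event_feats = np.random.randn(200, 39)         # placeholder audio event features
        event_labels = np.random.randint(0, 10, 200)   # 10 hypothetical event classes
        descriptors = speech_clf.predict_proba(event_feats)  # closeness to the speech bases

        # 3) A simple linear classifier on the descriptors performs the event task.
        event_clf = LinearSVC().fit(descriptors, event_labels)
        print(event_clf.score(descriptors, event_labels))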

    Mapping Acoustic and Semantic Dimensions of Auditory Perception

    Auditory categorisation is a function of sensory perception which allows humans to generalise across many different sounds present in the environment and classify them into behaviourally relevant categories. These categories cover not only the variance of acoustic properties of the signal but also a wide variety of sound sources. However, it is unclear to what extent the acoustic structure of sound is associated with, and conveys, different facets of semantic category information. It also remains unknown whether people use such information, and what drives their decisions, when both acoustic and semantic information about a sound is available. To answer these questions, we used existing methods broadly practised in linguistics, acoustics and cognitive science, and bridged these domains by delineating their shared space. Firstly, we took a model-free exploratory approach to examine the underlying structure and inherent patterns in our dataset. To this end, we ran principal components, clustering and multidimensional scaling analyses. At the same time, we drew the topography of the sound labels' semantic space based on corpus-based word embedding vectors. We then built an LDA model predicting class membership and compared the model-free approach and model predictions with the actual taxonomy. Finally, by conducting a series of web-based behavioural experiments, we investigated whether acoustic and semantic topographies relate to perceptual judgements. This analysis pipeline showed that natural sound categories could be successfully predicted from acoustic information alone and that perception of natural sound categories has some acoustic grounding. Results from our studies help to recognise the role of physical sound characteristics and their meaning in the process of sound perception and give invaluable insight into the mechanisms governing machine-based and human classifications.
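
    A minimal sketch of the model-free and model-based steps outlined above, assuming scikit-learn and random placeholder data; the dimensionalities and cluster counts are assumptions, the word-embedding step is omitted, and LDA is read here as Linear Discriminant Analysis.

        import numpy as np
        from sklearn.decomposition import PCA
        from sklearn.cluster import KMeans
        from sklearn.manifold import MDS
        from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
        from sklearn.model_selection import cross_val_score

        acoustic = np.random.randn(300, 40)        # placeholder acoustic descriptors
        categories = np.random.randint(0, 6, 300)  # placeholder semantic category labels

        # Model-free exploration: principal components, clustering, multidimensional scaling.
        pcs = PCA(n_components=10).fit_transform(acoustic)
        clusters = KMeans(n_clusters=6, n_init=10).fit_predict(pcs)
        embedding_2d = MDS(n_components=2).fit_transform(pcs)
        print(np.bincount(clusters), embedding_2d.shape)

        # Model-based step: can acoustic information alone predict category membership?
        lda = LinearDiscriminantAnalysis()
        print(cross_val_score(lda, acoustic, categories, cv=5).mean())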

    Representing Nonspeech Audio Signals through Speech Classification Models

    The human auditory system is very well matched to both human speech and environmental sounds. Therefore, the question arises whether human speech material may provide useful information for training systems that analyze nonspeech audio signals, such as in a recognition task. To find out how similar nonspeech signals are to speech, we measure the closeness between target nonspeech signals and different basis speech categories via a speech classification model. The speech similarities are then employed as a descriptor to represent the target signal. We further show that a better descriptor can be obtained by learning to organize the speech categories hierarchically with a tree structure. We conduct experiments on the audio event analysis application, using speech words from the TIMIT dataset to learn descriptors for the audio events of the Freiburg-106 dataset. Our results on the event recognition task outperform those achieved by the best existing system even though a simple linear classifier is used. Furthermore, integrating the learned descriptors as an additional source leads to improved performance.
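
    A minimal sketch of how the flat speech-posterior descriptor could be extended with a hierarchy over speech categories, assuming SciPy agglomerative clustering as a stand-in for the paper's learned tree structure; all quantities are placeholders, not values from the paper.

        import numpy as np
        from scipy.cluster.hierarchy import linkage, fcluster

        n_classes = 50
        posteriors = np.random.dirichlet(np.ones(n_classes), size=200)  # placeholder speech posteriors
        prototypes = np.random.randn(n_classes, 39)                     # placeholder class prototypes

        # Build a tree over the speech categories and cut it at several depths.
        tree = linkage(prototypes, method="average")
        extra = []
        for n_groups in (2, 5, 10, 25):
            groups = fcluster(tree, t=n_groups, criterion="maxclust")
            # Sum posterior mass within each group: coarse-to-fine speech similarities.
            pooled = np.stack([posteriors[:, groups == g].sum(axis=1)
                               for g in range(1, n_groups + 1)], axis=1)
            extra.append(pooled)

        descriptors = np.hstack([posteriors] + extra)  # flat + hierarchical descriptor
        print(descriptors.shape)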

    Can You Hear Me Now? Sensitive Comparisons of Human and Machine Perception

    The rise of machine-learning systems that process sensory input has brought with it a rise in comparisons between human and machine perception. But such comparisons face a challenge: whereas machine perception of some stimulus can often be probed through direct and explicit measures, much of human perceptual knowledge is latent, incomplete, or unavailable for explicit report. Here, we explore how this asymmetry can cause such comparisons to misestimate the overlap in human and machine perception. As a case study, we consider human perception of adversarial speech: synthetic audio commands that are recognized as valid messages by automated speech-recognition systems but that human listeners reportedly hear as meaningless noise. In five experiments, we adapt task designs from the human psychophysics literature to show that even when subjects cannot freely transcribe such speech commands (the previous benchmark for human understanding), they often can demonstrate other forms of understanding, including discriminating adversarial speech from closely matched non-speech (Experiments 1 and 2), finishing common phrases begun in adversarial speech (Experiments 3 and 4), and solving simple math problems posed in adversarial speech (Experiment 5), even for stimuli previously described as unintelligible to human listeners. We recommend the adoption of such "sensitive tests" when comparing human and machine perception, and we discuss the broader consequences of such approaches for assessing the overlap between systems.
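
    A minimal sketch of a signal-detection sensitivity measure (d') of the kind such "sensitive tests" can rely on for a discrimination task like Experiments 1 and 2, assuming SciPy; the response counts are hypothetical, not the paper's data.

        from scipy.stats import norm

        def dprime(hits, misses, false_alarms, correct_rejections):
            # Log-linear correction avoids infinite z-scores at 0% or 100% rates.
            hit_rate = (hits + 0.5) / (hits + misses + 1.0)
            fa_rate = (false_alarms + 0.5) / (false_alarms + correct_rejections + 1.0)
            return norm.ppf(hit_rate) - norm.ppf(fa_rate)

        # Hypothetical counts for one listener discriminating adversarial speech from non-speech.
        print(dprime(hits=42, misses=8, false_alarms=12, correct_rejections=38))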

    Cognitive performance in open-plan office acoustic simulations: Effects of room acoustics and semantics but not spatial separation of sound sources

    The irrelevant sound effect (ISE) characterizes the impairment of short-term memory performance during irrelevant sound relative to quiet. The irrelevant sounds presented in most laboratory-based ISE studies have been too limited to represent complex scenarios such as open-plan offices (OPOs), and few studies have considered serial recall of heard information. This paper investigates the ISE using an auditory-verbal serial recall task, in which performance was evaluated for factors relevant to simulating OPO acoustics: the irrelevant sounds (including the semanticity of speech), reproduction methods over headphones, and room acoustics. Results (Experiments 1 and 2) show that the ISE was exhibited in most conditions with anechoic (irrelevant) nonspeech sounds with or without speech, but the effect was substantially larger with meaningful speech than with foreign speech, suggesting a semantic effect. Performance differences between diotic and binaural reproductions were not statistically robust, suggesting a limited role for spatial separation of sources. In Experiment 3, a statistically robust ISE was exhibited for binaural room acoustic conditions with mid-frequency reverberation times T30 = 0.4, 0.8, and 1.1 s, suggesting cognitive impairment regardless of the amount of sound absorption representative of OPOs. Performance differences between the T30 = 0.4 s condition and the T30 = 0.8 and 1.1 s conditions were statistically robust, emphasizing the benefit of increased sound absorption for cognitive performance and reinforcing existing room acoustic design recommendations. Performance differences between T30 = 0.8 s and 1.1 s were not statistically robust. Collectively, these results suggest that certain findings from ISE studies with idiosyncratic acoustics may not translate well to complex OPO acoustic environments.

    Temporal processing in autism spectrum disorder and developmental dyslexia: a systematic review and meta-analysis

    Individuals with autism spectrum disorder (ASD) or developmental dyslexia (DD) are commonly reported to have deficits in temporal processing. These deficits can impact higher-order processes, such as social communication, reading and writing. In this thesis, quantitative meta-analyses are used to examine two temporal processing tasks, with the following objectives: 1) determine whether temporal processing deficits are a consistent feature of ASD and DD across specific task contexts such as multisensory and unisensory processing, modality and stimulus type; 2) investigate the relationship between symptom severity and temporal processing; and 3) examine the effect of age on temporal processing deficits. The results provide strong evidence for impaired temporal processing in both ASD and DD, as measured by judgments of temporal order and simultaneity. Multisensory temporal processing was impaired in both ASD and DD, and unisensory auditory, tactile and visual processing was impaired in DD. Greater reading and spelling skills in DD were associated with greater temporal precision. Temporal deficits did not change with age in either disorder. In addition to more clearly defining temporal impairments in ASD and DD, the results highlight common and distinct patterns of temporal processing between these disorders. Deficits are discussed in relation to existing theoretical models, and recommendations are made for future research and interventions.
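
    A minimal sketch of the inverse-variance pooling that underlies a quantitative meta-analysis of this kind, assuming NumPy; the per-study effect sizes and variances are illustrative numbers, not the thesis's data.

        import numpy as np

        effects = np.array([0.45, 0.62, 0.30, 0.80, 0.55])    # per-study standardized effects
        variances = np.array([0.04, 0.06, 0.05, 0.09, 0.03])  # per-study sampling variances

        # Fixed-effect pooled estimate (inverse-variance weights).
        w = 1.0 / variances
        fixed = (w * effects).sum() / w.sum()

        # DerSimonian-Laird between-study variance (tau^2) for a random-effects estimate.
        q = (w * (effects - fixed) ** 2).sum()
        c = w.sum() - (w ** 2).sum() / w.sum()
        tau2 = max(0.0, (q - (len(effects) - 1)) / c)
        w_re = 1.0 / (variances + tau2)
        random_effects = (w_re * effects).sum() / w_re.sum()
        print(fixed, random_effects)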

    Understanding concurrent earcons: applying auditory scene analysis principles to concurrent earcon recognition

    Two investigations into the identification of concurrently presented, structured sounds, called earcons, were carried out. The first experiment investigated how varying the number of concurrently presented earcons affected their identification. Varying the number had a significant effect on the proportion of earcons identified: reducing the number of concurrently presented earcons led to a general increase in the proportion successfully identified. The second experiment investigated how modifying the earcons and their presentation, using techniques informed by auditory scene analysis, affected earcon identification. Both presenting each earcon with a unique timbre and introducing a 300 ms onset-to-onset delay between earcons significantly increased identification. Guidelines were drawn from this work to assist future interface designers when incorporating concurrently presented earcons.

    The Role of Music-Specific Representations When Processing Speech: Using a Musical Illusion to Elucidate Domain-Specific and -General Processes

    When listening to music and language sounds, it is unclear whether adults recruit domain-specific or domain-general mechanisms to make sense of incoming sounds. Unique acoustic characteristics, such as a greater reliance on rapid temporal transitions in speech relative to song, may introduce misleading interpretations concerning shared and overlapping processes in the brain. By using a stimulus that is ecologically valid and can be perceived as speech or song depending on context, the contributions of low- and high-level mechanisms may be teased apart. The stimuli employed in all experiments are auditory illusions from speech to song reported by Deutsch et al. (2003, 2011) and Tierney et al. (2012). The current experiments found that 1) non-musicians also perceive the speech-to-song illusion and experience a similar disruption of the transformation as a result of pitch transpositions; 2) the contribution of rhythmic regularity to the perceptual transformation from speech to song is unclear across several different examples of the auditory illusion, and clear order effects occur because of the within-subjects design; and 3) when comparing pitch change sensitivity in a speech mode of listening and, after several repetitions, a song mode of listening, only the song mode indicated the recruitment of music-specific representations. Together these studies indicate the potential of the speech-to-song auditory illusion for future research. The final experiment also tentatively demonstrates a behavioral dissociation between mechanisms unique to musical knowledge and mechanisms unique to the processing of acoustic characteristics predominant in speech or song, because acoustic characteristics were held constant.

    Computational Models of Representation and Plasticity in the Central Auditory System

    The performance of automated speech processing tasks such as speech recognition and speech activity detection degrades rapidly in challenging acoustic conditions. It is therefore necessary to engineer systems that extract meaningful information from sound while exhibiting invariance to background noise, different speakers, and other disruptive channel conditions. In this thesis, we take a biomimetic approach to these problems and explore computational strategies used by the central auditory system that underlie neural information extraction from sound. In the first part of this thesis, we explore coding strategies employed by the central auditory system that yield neural responses with desirable noise robustness. We specifically demonstrate that a coding strategy based on sustained neural firings yields richly structured spectro-temporal receptive fields (STRFs) that reflect the structure and diversity of natural sounds. The emergent receptive fields are comparable to known physiological neuronal properties and can be employed as a signal processing strategy to improve noise invariance in a speech recognition task. Next, we extend the model of sound encoding based on spectro-temporal receptive fields to incorporate the cognitive effects of selective attention. We propose a framework for modeling attention-driven plasticity that induces changes to receptive fields driven by task demands. We define a discriminative cost function whose optimization and solution reflect a biologically plausible strategy for STRF adaptation that helps listeners better attend to target sounds. Importantly, the adaptation patterns predicted by the framework correspond closely to known neurophysiological data. We next generalize the framework to act on the spectro-temporal dynamics of task-relevant stimuli and make predictions for tasks that have yet to be experimentally measured. We argue that our generalization represents a form of object-based attention, which helps shed light on the current debate about auditory attentional mechanisms. Finally, we show how attention-modulated STRFs form a high-fidelity representation of the attended target, and we apply our results to obtain improvements in a speech activity detection task. Overall, the results of this thesis improve our general understanding of central auditory processing, and our computational frameworks can be used to guide further studies in animal models. Furthermore, our models inspire signal processing strategies that are useful for automated speech and sound processing tasks.
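
    A minimal sketch of an STRF treated as a linear spectro-temporal filter, together with a toy discriminative "attention" update that nudges the filter toward structure separating an attended target from a distractor, assuming NumPy; this stands in for the thesis's plasticity framework and cost function rather than reproducing it.

        import numpy as np

        n_freq, n_time = 32, 10
        rng = np.random.default_rng(0)
        strf = rng.standard_normal((n_freq, n_time)) * 0.01  # initial receptive field

        def response(strf, spectrogram):
            # Neural response: correlate the STRF with each spectro-temporal window.
            n_win = spectrogram.shape[1] - n_time + 1
            return np.array([(strf * spectrogram[:, t:t + n_time]).sum()
                             for t in range(n_win)])

        def mean_window(spectrogram):
            n_win = spectrogram.shape[1] - n_time + 1
            return np.stack([spectrogram[:, t:t + n_time]
                             for t in range(n_win)]).mean(axis=0)

        target = rng.standard_normal((n_freq, 200)) + 1.0   # placeholder attended sound
        distractor = rng.standard_normal((n_freq, 200))     # placeholder background

        # Gradient of (mean target response - mean distractor response) w.r.t. the STRF.
        strf_adapted = strf + 0.1 * (mean_window(target) - mean_window(distractor))

        print(response(strf_adapted, target).mean() - response(strf_adapted, distractor).mean())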