Prediction of speech intelligibility based on a correlation metric in the envelope power spectrum domain
A powerful tool to investigate speech perception is the use of speech intelligibility prediction models. Recently, a model was presented, termed the correlation-based speech-based envelope power spectrum model (sEPSMcorr) [1], based on the auditory processing of the multi-resolution speech-based Envelope Power Spectrum Model (mr-sEPSM) [2], combined with the correlation back-end of the Short-Time Objective Intelligibility measure (STOI) [3]. The sEPSMcorr accurately predicts normal-hearing (NH) data for a broad range of listening conditions, e.g., additive noise, phase jitter and ideal binary mask processing.
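As an illustration only, the sketch below shows the general shape of such a correlation back-end: short-time envelope segments of the clean and degraded signals are compared by normalised correlation and averaged. It is a single-band toy, not the published sEPSMcorr (which operates on a multi-resolution, multi-band envelope power representation), and the segment length is an arbitrary choice.

```python
import numpy as np
from scipy.signal import hilbert

def envelope(x):
    """Temporal envelope via the magnitude of the analytic signal."""
    return np.abs(hilbert(x))

def correlation_metric(clean, degraded, seg_len=512):
    """Toy correlation back-end: mean short-time normalised correlation
    between the envelopes of clean and degraded speech. Single-band
    simplification; seg_len is illustrative."""
    env_c, env_d = envelope(clean), envelope(degraded)
    rhos = []
    for i in range(len(env_c) // seg_len):
        c = env_c[i * seg_len:(i + 1) * seg_len]
        d = env_d[i * seg_len:(i + 1) * seg_len]
        c, d = c - c.mean(), d - d.mean()
        denom = np.linalg.norm(c) * np.linalg.norm(d)
        if denom > 0:
            rhos.append(np.dot(c, d) / denom)
    return float(np.mean(rhos))  # higher -> predicted higher intelligibility
```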
Deep Spiking Neural Network model for time-variant signals classification: a real-time speech recognition approach
Speech recognition has become an important task
to improve the human-machine interface. Taking into account
the limitations of current automatic speech recognition systems,
such as non-real-time cloud-based solutions or power demand,
recent interest in neural networks and bio-inspired systems has
motivated the implementation of new techniques.
Among them, a combination of spiking neural networks and
neuromorphic auditory sensors offers an alternative way to carry
out the human-like speech processing task. In this approach,
a spiking convolutional neural network model was implemented,
in which the connection weights were calculated by training
a convolutional neural network with specific activation functions,
using firing-rate-based static images built from the spiking information
obtained from a neuromorphic cochlea.
The system was trained and tested with a large dataset
that contains "left" and "right" speech commands, achieving
89.90% accuracy. A novel spiking neural network model has been
proposed to adapt the network that was trained with static
images to a non-static processing approach, making it possible
to classify audio signals and time series in real time.
Ministerio de Economía y Competitividad TEC2016-77785-
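To make the "firing-rate-based static images" concrete, here is a minimal sketch of how spike events from a neuromorphic cochlea could be binned into such an image; the channel count, bin count and duration are illustrative assumptions, not the paper's settings.

```python
import numpy as np

def spikes_to_rate_image(spike_times, spike_channels,
                         n_channels=64, n_bins=32, duration=1.0):
    """Bin cochlea spike events into a (channels x time-bins) firing-rate
    image, the kind of static input described for CNN training.
    All dimensions here are illustrative."""
    image = np.zeros((n_channels, n_bins))
    bin_edges = np.linspace(0.0, duration, n_bins + 1)
    for t, ch in zip(spike_times, spike_channels):
        b = np.searchsorted(bin_edges, t, side="right") - 1
        if 0 <= b < n_bins and 0 <= ch < n_channels:
            image[ch, b] += 1
    return image / (duration / n_bins)  # counts -> spikes per second
```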
A Neural-Network Framework for the Design of Individualised Hearing-Loss Compensation
Even though sound processing in the human auditory system is complex and
highly non-linear, hearing aids (HAs) still rely on simplified descriptions of
auditory processing or hearing loss to restore hearing. Standard HA
amplification strategies succeed in restoring audibility of faint sounds, but
fall short of providing targeted treatments for complex sensorineural
deficits. To address this challenge, biophysically realistic models of human
auditory processing can be adopted in the design of individualised HA
strategies, but these are typically non-differentiable and computationally
expensive. Therefore, this study proposes a differentiable DNN framework that
can be used to train DNN-based HA models based on biophysical
auditory-processing differences between normal-hearing and hearing-impaired
models. We investigate the restoration capabilities of our DNN-based
hearing-loss compensation for different loss functions, to optimally compensate
for a mixed outer-hair-cell (OHC) loss and cochlear-synaptopathy (CS)
impairment. After evaluating which trained DNN-HA model yields the best
restoration outcomes on simulated auditory responses and speech
intelligibility, we applied the same training procedure to two milder
hearing-loss profiles with OHC loss or CS alone. Our results show that
auditory-processing restoration was possible for all considered hearing-loss
cases, with OHC loss proving easier to compensate than CS. Several objective
metrics were considered to estimate the expected perceptual benefit after
processing, and these simulations hold promise for improved
speech-in-noise understanding in hearing-impaired listeners who use our
DNN-HA processing. Since our framework can be tuned to the hearing-loss
profiles of individual listeners, we enter an era where truly individualised,
DNN-based hearing-restoration strategies can be developed and tested
experimentally.
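The core training idea lends itself to a schematic sketch: the DNN hearing-aid model is optimised so that a hearing-impaired auditory model's response to the processed signal approaches the normal-hearing model's response to the original. The PyTorch loop below assumes placeholder differentiable models (nh_model, hi_model) and an MSE loss, which is only one of several loss choices such a study could compare; none of the names are the framework's actual API.

```python
import torch

def train_dnn_ha(ha_model, nh_model, hi_model, loader, epochs=10, lr=1e-4):
    """Schematic of the framework's core idea: train the DNN hearing aid
    (ha_model) so that the hearing-impaired auditory model's response to
    the processed signal matches the normal-hearing model's response to
    the unprocessed signal. nh_model/hi_model stand in for differentiable
    auditory models; MSE is an illustrative loss choice."""
    opt = torch.optim.Adam(ha_model.parameters(), lr=lr)
    for _ in range(epochs):
        for audio in loader:
            target = nh_model(audio).detach()     # normal-hearing reference
            response = hi_model(ha_model(audio))  # impaired response to processed audio
            loss = torch.mean((response - target) ** 2)
            opt.zero_grad()
            loss.backward()
            opt.step()
    return ha_model
```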
Discrimination of Speech From Non-Speech Based on Multiscale Spectro-Temporal Modulations
We describe a content-based audio classification algorithm based on novel multiscale spectro-temporal modulation features inspired by a model of auditory cortical processing. The task explored is to discriminate speech from non-speech consisting of animal vocalizations, music and environmental sounds. Although this is a relatively easy task for humans, it is still difficult to automate well, especially in noisy and reverberant environments. The auditory model captures basic processes occurring from the early cochlear stages to the central cortical areas. The model generates a multidimensional spectro-temporal representation of the sound, which is then analyzed by a multi-linear dimensionality reduction technique and classified by a Support Vector Machine (SVM). Generalization of the system to signals with high levels of additive noise and reverberation is evaluated and compared to two existing approaches [1], [2]. The results demonstrate the advantages of the auditory model over the other two systems, especially at low SNRs and high reverberation.
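A toy stand-in for this pipeline might look as follows: crude spectro-temporal modulation features (the 2-D FFT of a log-spectrogram) fed to an SVM. This replaces the cortical rate-scale model and the multi-linear dimensionality reduction with much simpler surrogates; n_keep and the spectrogram settings are arbitrary choices, not the paper's.

```python
import numpy as np
from scipy.signal import spectrogram
from sklearn.svm import SVC

def modulation_features(x, fs, n_keep=16):
    """Crude spectro-temporal modulation features: magnitude of the
    2-D FFT of a log-spectrogram, keeping only the lowest modulation
    frequencies. A simple surrogate for the cortical rate-scale
    analysis; n_keep is illustrative."""
    _, _, S = spectrogram(x, fs=fs, nperseg=256, noverlap=128)
    M = np.abs(np.fft.fft2(np.log(S + 1e-10)))
    return M[:n_keep, :n_keep].ravel()

# Usage sketch: extract features per labelled clip, then fit an SVM.
# X = np.array([modulation_features(clip, fs) for clip in clips])
# clf = SVC(kernel="rbf").fit(X, labels)  # labels: speech vs non-speech
```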
Spectral discontinuity in concatenative speech synthesis – perception, join costs and feature transformations
This thesis explores the problem of determining an objective measure to represent human perception of spectral discontinuity in concatenative speech synthesis. Such measures are used as join costs to quantify the compatibility of speech units for concatenation in unit selection synthesis. No previous study has reported a spectral measure that satisfactorily correlates with human perception of discontinuity. An analysis of the limitations of existing measures and our understanding of the human auditory system were used to guide the strategies adopted to advance a solution to this problem.
A listening experiment was conducted using a database of concatenated speech with results indicating the perceived continuity of each concatenation. The results of this experiment were used to correlate proposed measures of spectral continuity with the perceptual results. A number of standard speech parametrisations and distance measures were tested as measures of spectral continuity and analysed to identify their limitations. Time-frequency resolution was found to limit the performance of standard speech parametrisations.
As a solution to this problem, measures of continuity based on the wavelet transform were proposed and tested, as wavelets offer superior time-frequency resolution to standard spectral measures. A further limitation of standard speech parametrisations is that they are typically computed from the magnitude spectrum. However, the auditory system combines information relating to the magnitude spectrum, phase spectrum and spectral dynamics. The potential of phase and spectral dynamics as measures of spectral continuity was investigated. One widely adopted approach to detecting discontinuities is to compute the Euclidean distance between feature vectors about the join in concatenated speech. The detection of an auditory event, such as the detection of a discontinuity, involves processing high up the auditory pathway in the central auditory system. The basic Euclidean distance cannot model such behaviour. A study was conducted to investigate feature transformations with sufficient processing complexity to mimic high-level auditory processing. Neural networks and principal component analysis were investigated as feature transformations.
Wavelet-based measures were found to outperform all measures of continuity based on standard speech parametrisations. Phase- and spectral-dynamics-based measures were found to correlate with human perception of discontinuity in the test database, although neither measure was found to contribute a significant increase in performance when combined with standard measures of continuity. Neural network feature transformations were found to significantly outperform all other measures tested in this study, producing correlations with perceptual results in excess of 90%.
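The baseline the thesis improves upon can be stated in a few lines: a Euclidean distance between the feature vectors on either side of the join. The sketch below assumes precomputed frame-level features of any parametrisation (e.g. MFCCs); the thesis's contribution, schematically, is what to put in place of those raw features before this distance is taken.

```python
import numpy as np

def euclidean_join_cost(feats_left, feats_right):
    """Baseline join cost: Euclidean distance between the feature
    vector of the last frame before the join and the first frame
    after it. feats_* are (frames x dims) arrays of any standard
    speech parametrisation."""
    return float(np.linalg.norm(feats_left[-1] - feats_right[0]))

# Schematically, the thesis's best result comes from passing the raw
# features through a learned transform (e.g. a neural-network
# embedding) before computing this distance.
```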
Recognizing Speech in a Novel Accent: The Motor Theory of Speech Perception Reframed
The motor theory of speech perception holds that we perceive the speech of
another in terms of a motor representation of that speech. However, when we
have learned to recognize a foreign accent, it seems plausible that recognition
of a word rarely involves reconstruction of the speech gestures of the speaker
rather than the listener. To better assess the motor theory and this
observation, we proceed in three stages. Part 1 places the motor theory of
speech perception in a larger framework based on our earlier models of the
adaptive formation of mirror neurons for grasping, and for viewing extensions
of that mirror system as part of a larger system for neuro-linguistic
processing, augmented by the present consideration of recognizing speech in a
novel accent. Part 2 then offers a novel computational model of how a listener
comes to understand the speech of someone speaking the listener's native
language with a foreign accent. The core tenet of the model is that the
listener uses hypotheses about the word the speaker is currently uttering to
update probabilities linking the sound produced by the speaker to phonemes in
the native language repertoire of the listener. This, on average, improves the
recognition of later words. This model is neutral regarding the nature of the
representations it uses (motor vs. auditory). It serves as a reference point
for the discussion in Part 3, which proposes a dual-stream neuro-linguistic
architecture to revisit claims for and against the motor theory of speech
perception and the relevance of mirror neurons, and extracts some implications
for the reframing of the motor theory.
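The model's core tenet can be caricatured in a few lines of bookkeeping: word hypotheses supply phoneme labels for the sounds just heard, and accumulated counts approximate P(phoneme | sound). The one-to-one sound-phoneme alignment and the count-based posterior below are simplifying assumptions for illustration, not the paper's actual update rule.

```python
from collections import defaultdict

def update_accent_map(counts, heard_sounds, hypothesized_phonemes):
    """Toy version of the core tenet: given a hypothesis about the word
    being uttered, accumulate evidence linking each heard sound to the
    native phoneme it aligned with. Normalised counts approximate
    P(phoneme | sound), improving recognition of later words.
    One-to-one alignment is a simplification."""
    for sound, phoneme in zip(heard_sounds, hypothesized_phonemes):
        counts[sound][phoneme] += 1
    return counts

def phoneme_posterior(counts, sound):
    """Empirical P(phoneme | sound) from the accumulated counts."""
    total = sum(counts[sound].values())
    return {p: c / total for p, c in counts[sound].items()} if total else {}

# counts = defaultdict(lambda: defaultdict(int))
# update_accent_map(counts, ["d"], ["th"])  # accented /th/ heard as "d"
```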
A spiking neural network for real-time Spanish vowel phonemes recognition
This paper explores the capabilities of a neuromorphic approach applied to real-time speech processing. A spiking
recognition neural network composed of three types of neurons is proposed. These neurons are based on an
integrate-and-fire model and are capable of recognizing auditory frequency patterns, such as vowel phonemes;
words are recognized as sequences of vowel phonemes. To demonstrate real-time operation, a complete
spiking recognition neural network has been described in VHDL for detecting certain Spanish words, and it has
been tested on an FPGA platform. This is a stand-alone and fully hardware system, which allows it to be embedded in a
mobile system. To stimulate the network, a spiking digital-filter-based cochlea has been implemented in VHDL.
In the implementation, an Address Event Representation (AER) is used for transmitting information between
neurons.
Ministerio de Economía y Competitividad TEC2012-37868-C04-02/0
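For readers without VHDL, a software caricature of the integrate-and-fire unit such a network builds on might look like this; the weight, leak and threshold values are illustrative, not taken from the paper.

```python
import numpy as np

def integrate_and_fire(input_spikes, weight=0.3, leak=0.02, threshold=1.0):
    """Software sketch of an integrate-and-fire unit: accumulate weighted
    input spikes, leak the membrane potential each step, and fire (then
    reset) on crossing threshold. All constants are illustrative."""
    v = 0.0
    out = np.zeros(len(input_spikes))
    for t, s in enumerate(input_spikes):
        v = max(v + weight * s - leak, 0.0)
        if v >= threshold:
            out[t] = 1.0
            v = 0.0  # reset after firing
    return out
```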
Low-Level Information and High-Level Perception: The Case of Speech in Noise
Auditory information is processed in a fine-to-crude hierarchical scheme, from low-level acoustic information to high-level abstract representations, such as phonological labels. We now ask whether fine acoustic information, which is not retained at high levels, can still be used to extract speech from noise. Previous theories suggested either full availability of low-level information or availability that is limited by task difficulty. We propose a third alternative, based on the Reverse Hierarchy Theory (RHT), originally derived to describe the relations between the processing hierarchy and visual perception. RHT asserts that only the higher levels of the hierarchy are immediately available for perception. Direct access to low-level information requires specific conditions, and can be achieved only at the cost of concurrent comprehension. We tested the predictions of these three views in a series of experiments in which we measured the benefits of utilizing low-level binaural information for speech perception, and compared them to those predicted by a model of the early auditory system. Only auditory RHT could account for the full pattern of the results, suggesting that similar defaults and tradeoffs underlie the relations between hierarchical processing and perception in the visual and auditory modalities.
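The kind of low-level binaural information at stake can be illustrated with a toy N0Sπ example: when the masking noise is identical at the two ears and the signal is in antiphase, an equalization-cancellation-style subtraction removes the external noise and leaves the signal. The internal-noise level below is an arbitrary assumption standing in for the limits a real early-auditory model would impose.

```python
import numpy as np

rng = np.random.default_rng(0)
fs, n = 16000, 16000
t = np.arange(n) / fs
signal = 0.1 * np.sin(2 * np.pi * 500 * t)       # 500 Hz tone (toy "speech")
noise = rng.normal(0.0, 1.0, n)                  # interaurally identical masker (N0)
internal_l = rng.normal(0.0, 0.05, n)            # small uncorrelated internal noise
internal_r = rng.normal(0.0, 0.05, n)

# N0Spi: same external noise at both ears, signal in antiphase (Spi).
left = noise + signal + internal_l
right = noise - signal + internal_r

# EC-style cancellation: the correlated external noise drops out,
# leaving the signal plus residual internal noise.
residual_masker = (internal_l - internal_r) / 2

def snr_db(s, m):
    return 10 * np.log10(np.mean(s**2) / np.mean(m**2))

print(snr_db(signal, noise + internal_l))  # monaural SNR (poor)
print(snr_db(signal, residual_masker))     # SNR after binaural cancellation (large gain)
```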