33 research outputs found

    Predicting Speech Intelligibility

    Hearing impairment, and specifically sensorineural hearing loss, is an increasingly prevalent condition, especially amongst the ageing population. It occurs primarily as a result of damage to the hair cells that act as sound receptors in the inner ear, and it causes a variety of hearing perception problems, most notably a reduction in speech intelligibility. Accurate diagnosis of hearing impairments is a time-consuming process and is complicated by the reliance on indirect measurements based on patient feedback, owing to the inaccessible nature of the inner ear. The challenges of designing hearing aids to counteract sensorineural hearing losses are further compounded by the wide range of severities and symptoms experienced by hearing impaired listeners. Computer models of the auditory periphery have been developed, based on phenomenological measurements from auditory-nerve fibres using a range of test sounds and varied conditions. It has been demonstrated that auditory-nerve representations of vowels in normal and noise-damaged ears can be ranked by subjective visual inspection of how the impaired representations differ from the normal ones. This thesis seeks to expand on this procedure by using full word tests rather than single vowels, and by replacing manual inspection with an automated approach based on a quantitative measure. It presents a measure that can predict speech intelligibility in a consistent and reproducible manner. This new approach has practical applications, as it could allow speech-processing algorithms for hearing aids to be objectively tested in early-stage development without having to resort to extensive human trials. Simulated hearing tests were carried out by substituting real listeners with the auditory model. A range of signal processing techniques were used to measure the model's auditory-nerve outputs by presenting them spectro-temporally as neurograms. A neurogram similarity index measure (NSIM) was developed that allowed the impaired outputs to be compared to a reference output from a normal-hearing listener simulation. A simulated listener test was developed, using standard listener test material, and was validated for predicting normal-hearing speech intelligibility in quiet and noisy conditions. Two types of neurograms were assessed: temporal fine structure (TFS), which retained spike timing information, and average discharge rate or temporal envelope (ENV). Tests were carried out to simulate a wide range of sensorineural hearing losses, and the results were compared to real listeners' unaided and aided performance. Simulations to predict the speech intelligibility performance of the NAL-RP and DSL 4.0 hearing aid fitting algorithms were undertaken. The NAL-RP hearing aid fitting algorithm was adapted using a chimaera sound algorithm that aimed to improve the TFS speech cues available to aided hearing impaired listeners. NSIM was shown to quantitatively rank neurograms with better performance than relative mean squared error and other similar metrics. Simulated performance intensity functions predicted speech intelligibility for normal and hearing impaired listeners. The simulated listener tests demonstrated that NAL-RP and DSL 4.0 achieved similar levels of speech intelligibility restoration. Using NSIM and a computational model of the auditory periphery, speech intelligibility can be predicted for both normal and hearing impaired listeners, and novel hearing aids can be rapidly prototyped and evaluated prior to real listener tests.
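    As a concrete illustration of the kind of measure involved, the sketch below computes an NSIM-style similarity between two neurograms (time-frequency matrices of simulated auditory-nerve activity) from windowed luminance and structure terms, in the spirit of SSIM. The window size and stability constants here are illustrative assumptions, not the tuned values from the thesis.

```python
import numpy as np
from scipy.ndimage import uniform_filter

def nsim(reference, degraded, window=3, c1=0.01, c2=0.03):
    """NSIM-style similarity between two neurograms: the product of a
    local luminance term and a local structure term, averaged over the
    image (SSIM without the contrast component). c1/c2 are illustrative
    stabilising constants."""
    r = reference.astype(float)
    d = degraded.astype(float)
    # Local means over a sliding window
    mu_r = uniform_filter(r, window)
    mu_d = uniform_filter(d, window)
    # Local variances and covariance (clipped to guard against
    # small negative values from floating-point error)
    var_r = np.clip(uniform_filter(r * r, window) - mu_r ** 2, 0, None)
    var_d = np.clip(uniform_filter(d * d, window) - mu_d ** 2, 0, None)
    cov = uniform_filter(r * d, window) - mu_r * mu_d
    luminance = (2 * mu_r * mu_d + c1) / (mu_r ** 2 + mu_d ** 2 + c1)
    structure = (cov + c2) / (np.sqrt(var_r * var_d) + c2)
    return float(np.mean(luminance * structure))
```

    In use, the reference neurogram would come from the normal-hearing model simulation and the degraded one from the impaired (or aided) simulation, with scores near 1 indicating close agreement.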

    Speech Intelligibility from Image Processing

    Hearing loss research has traditionally been based on perceptual criteria, speech intelligibility and threshold levels. The development of computational models of the auditory periphery has allowed experimentation via simulation to provide quantitative, repeatable results at a more granular level than would be practical with clinical research on human subjects.

    Optimizing Stimulation Strategies in Cochlear Implants for Music Listening

    Most cochlear implant (CI) strategies are optimized for speech characteristics, while music enjoyment remains significantly below normal hearing performance. In this thesis, electrical stimulation strategies in CIs are analyzed for music input. A simulation chain consisting of two parallel paths, simulating normal hearing conditions and electrical hearing respectively, is utilized. One thesis objective is to configure and develop the sound processor of the CI chain to analyze different compression and channel selection strategies to optimally capture the characteristics of music signals. A new set of knee points (KPs) for the compression function is investigated together with clustering of frequency bands. The N-of-M electrode selection strategy models the effect of a psychoacoustic masking threshold. In order to evaluate the performance of the CI model, the normal hearing model is considered the true reference. Similarity among the resulting neurograms of the respective models is measured using the image analysis method Neurogram Similarity Index Measure (NSIM). The validation and resolution of NSIM is another objective of the thesis. Results indicate that NSIM is sensitive to no-activity regions in the neurograms and has difficulties capturing small CI changes, i.e. compression settings. Further verification of the model setup is suggested, together with investigating an alternative optimal electric hearing reference and/or objective similarity measure.
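    For orientation, the sketch below shows the two processing steps varied in such a sound processor: an N-of-M channel selection that stimulates only the strongest bands in each frame, and a compression map whose knee point governs where logarithmic compression engages. The parameter values (rho, the knee placement) are conventional or assumed, not the configurations studied in the thesis.

```python
import numpy as np

def n_of_m_select(envelopes, n):
    """Keep only the n largest of m band envelopes in this frame;
    unselected channels receive no stimulation."""
    frame = np.zeros_like(envelopes)
    top = np.argsort(envelopes)[-n:]
    frame[top] = envelopes[top]
    return frame

def loudness_growth(x, knee=0.1, rho=416.0):
    """Map a normalised acoustic envelope onto the electric dynamic
    range: linear below the knee point, logarithmic above it. The knee
    value and rho here are illustrative, not the thesis settings."""
    x = np.clip(x, 0.0, 1.0)
    above = np.clip(x - knee, 0.0, None) / (1.0 - knee)
    compressed = knee + (1.0 - knee) * np.log1p(rho * above) / np.log1p(rho)
    return np.where(x <= knee, x, compressed)

# Example frame: 22 band envelopes, 8 channels selected, then compressed
envs = np.abs(np.random.default_rng(1).normal(size=22))
stimulus = loudness_growth(n_of_m_select(envs / envs.max(), n=8))
```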

    Towards improving ViSQOL (Virtual Speech Quality Objective Listener) Using Machine Learning Techniques

    Vast amounts of sound data are transmitted every second over digital networks. VoIP services and cellular networks transmit speech data in increasingly greater volumes. Objective sound quality models provide an essential function to measure the quality of this data in real time. However, these models can suffer from a lack of accuracy with various degradations over networks. This research uses machine learning techniques to create one support vector regression and three neural network mapping models for use with ViSQOLAudio. Each of the mapping models (including ViSQOL and ViSQOLAudio) is tested against two separate speech datasets in order to comparatively study accuracy results. Despite a slight cost in positive linear correlation and a slight increase in error rate, the study finds that a neural network mapping model with ViSQOLAudio provides the highest levels of accuracy in objective speech quality measurement. In some cases, the accuracy levels can be over double those of ViSQOL. The research demonstrates that ViSQOLAudio can be altered to provide an objective speech quality metric with accuracy greater than that of ViSQOL.
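    A minimal sketch of the mapping idea, assuming toy similarity/MOS pairs: a support vector regression and a small neural network each learn to map a ViSQOLAudio-style similarity score onto a subjective MOS scale. The data points and network size below are placeholders, not the study's trained models.

```python
import numpy as np
from sklearn.svm import SVR
from sklearn.neural_network import MLPRegressor

# Hypothetical training data: similarity scores produced by an
# objective model, paired with subjective MOS labels from listeners.
sim_scores = np.array([[0.62], [0.71], [0.80], [0.88], [0.95]])
mos_labels = np.array([1.8, 2.6, 3.3, 4.1, 4.7])

# Two candidate mapping models: SVR and a small neural network.
svr = SVR(kernel="rbf").fit(sim_scores, mos_labels)
mlp = MLPRegressor(hidden_layer_sizes=(16,), max_iter=5000,
                   random_state=0).fit(sim_scores, mos_labels)

# Predict a MOS for an unseen similarity score.
print(svr.predict([[0.84]]), mlp.predict([[0.84]]))
```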

    A novel neural feature for a text-dependent speaker identification system

    A novel feature based on the simulated neural response of the auditory periphery was proposed in this study for a speaker identification system. A well-known computational model of the auditory-nerve (AN) fiber by Zilany and colleagues, which incorporates most of the stages and the relevant nonlinearities observed in the peripheral auditory system, was employed to simulate neural responses to speech signals from different speakers. Neurograms were constructed from responses of inner-hair-cell (IHC)-AN synapses with characteristic frequencies spanning the dynamic range of hearing. The synapse responses were subjected to an analytical function to incorporate the effects of absolute and relative refractory periods. The proposed IHC-AN neurogram feature was then used to train and test the text-dependent speaker identification system using standard classifiers. The performance of the proposed method was compared to the results from existing baseline methods for both quiet and noisy conditions. While the performance using the proposed feature was comparable to the results of existing methods in quiet environments, the neural feature exhibited substantially better classification accuracy in noisy conditions, especially with white Gaussian and street noise. Also, the performance of the proposed system was relatively independent of various types of distortions in the acoustic signals and classifiers. The proposed feature could also be employed to design a robust speech recognition system.
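    To make the pipeline concrete, the sketch below reduces a neurogram to a fixed-length feature vector and feeds it to a standard classifier. The synthetic neurograms and the SVM choice are stand-ins for the Zilany-model synapse responses and the classifiers used in the study.

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

def neurogram_feature(neurogram, n_frames=20):
    """Collapse a neurogram (CF bands x time bins) into a fixed-length
    feature by averaging the rate in each band over n_frames equal
    time segments."""
    bands, bins = neurogram.shape
    segments = np.array_split(np.arange(bins), n_frames)
    return np.concatenate([neurogram[:, s].mean(axis=1) for s in segments])

# Toy data: 5 speakers x 10 utterances of synthetic 32-band neurograms;
# a real system would generate these from the AN model's output.
X = np.stack([neurogram_feature(rng.random((32, 500))) for _ in range(50)])
y = np.repeat(np.arange(5), 10)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
clf = SVC(kernel="rbf").fit(X_tr, y_tr)
print("accuracy:", clf.score(X_te, y_te))
```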

    NOMAD: Unsupervised Learning of Perceptual Embeddings for Speech Enhancement and Non-matching Reference Audio Quality Assessment

    This paper presents NOMAD (Non-Matching Audio Distance), a differentiable perceptual similarity metric that measures the distance of a degraded signal against non-matching references. The proposed method is based on learning deep feature embeddings via a triplet loss guided by the Neurogram Similarity Index Measure (NSIM) to capture degradation intensity. During inference, the similarity score between any two audio samples is computed through the Euclidean distance of their embeddings. NOMAD is fully unsupervised and can be used in general perceptual audio analysis tasks, e.g. quality assessment, and in generative tasks such as speech enhancement and speech synthesis. The proposed method is evaluated on three tasks: ranking degradation intensity, predicting speech quality, and serving as a loss function for speech enhancement. Results indicate that NOMAD outperforms other non-matching reference approaches in both ranking degradation intensity and quality assessment, exhibiting competitive performance with full-reference audio metrics. NOMAD demonstrates a promising technique that mimics human capabilities in assessing audio quality with non-matching references, learning perceptual embeddings without the need for human-generated labels.
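    A minimal sketch of the training signal described above: a triplet objective in which, of two degraded samples, the one whose NSIM against the clean anchor indicates milder degradation is treated as the positive, so its embedding is pulled closer to the anchor's. The margin value is an assumption, and the inputs are assumed to be already-computed embedding vectors.

```python
import numpy as np

def triplet_loss(anchor, positive, negative, margin=1.0):
    """Hinge-style triplet objective: push the anchor-negative distance
    to exceed the anchor-positive distance by at least `margin`."""
    d_pos = np.linalg.norm(anchor - positive)
    d_neg = np.linalg.norm(anchor - negative)
    return max(0.0, d_pos - d_neg + margin)

def nomad_distance(emb_a, emb_b):
    """At inference, the non-matching distance between two signals is
    simply the Euclidean distance between their learned embeddings."""
    return float(np.linalg.norm(emb_a - emb_b))
```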

    Measuring and Monitoring Speech Quality for Voice over IP with POLQA, ViSQOL and P.563

    There are many types of degradation which can occur in Voice over IP (VoIP) calls. Of interest in this work are degradations which occur independently of the codec, hardware or network in use. Specifically, their effect on the subjective and objective quality of the speech is examined. Since no dataset suitable for this purpose exists, a new dataset (TCD-VoIP) has been created and made publicly available. The dataset contains speech clips suffering from a range of common call quality degradations, as well as a set of subjective opinion scores on the clips from 24 listeners. The performance of three objective quality metrics, POLQA, ViSQOL and P.563, has been evaluated using the dataset. The results show that full-reference metrics are capable of accurately predicting a variety of common VoIP degradations. They also highlight the outstanding need for a wideband, single-ended, no-reference metric to accurately monitor speech quality for degradations common in VoIP scenarios.
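    Evaluations of this kind typically compare a metric's per-clip predictions against the subjective opinion scores. The sketch below computes the usual agreement statistics (Pearson, Spearman, RMSE) on hypothetical values, not the actual TCD-VoIP results.

```python
import numpy as np
from scipy.stats import pearsonr, spearmanr

# Hypothetical per-clip scores: subjective MOS from a listening test
# and the corresponding objective metric predictions.
mos = np.array([1.9, 2.4, 3.1, 3.6, 4.2, 4.6])
metric = np.array([1.7, 2.6, 3.0, 3.9, 4.0, 4.5])

r, _ = pearsonr(mos, metric)     # linear agreement
rho, _ = spearmanr(mos, metric)  # rank (monotonic) agreement
rmse = np.sqrt(np.mean((mos - metric) ** 2))
print(f"Pearson r={r:.3f}  Spearman rho={rho:.3f}  RMSE={rmse:.3f}")
```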

    Detailed comparative analysis of PESQ and VISQOL behaviour in the context of playout delay adjustments introduced by VOIP jitter buffer algorithms

    The default best-effort Internet presents significant challenges for delay-sensitive applications such as VoIP. To cope with this non-determinism, receiver playout strategies that adapt to network conditions are utilised in VoIP applications. Such strategies can be divided into two groups: per-talkspurt and per-packet. The former make use of silence periods within natural speech and adapt these silences to track network conditions, thus preserving the integrity of active speech talkspurts. Examples of this approach are described in [1, 2]. Per-packet strategies differ in that adjustments are made both during silence periods and during talkspurts by time-scaling of packets, a technique also known in the literature as time-warping. This approach is more effective in coping with short network delay changes, because a per-talkspurt approach can only adapt during recognized silences even though the duration of many delay spikes may be less than that of a talkspurt. It does, however, introduce potential degradation caused by the scaling of speech packets. Examples of this approach are described in [3, 4], and such techniques are frequently deployed in popular VoIP applications such as GoogleTalk and Skype. In this research, we focus on applications that deploy per-talkspurt strategies, which are commonly found in current telecommunication networks.
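    A minimal sketch of a classic per-talkspurt scheduler in the spirit of the autoregressive estimators cited as [1]: delay statistics are updated on every packet, but the playout deadline only changes at talkspurt boundaries, by stretching or shrinking the intervening silence. The alpha and 4x variation safety factor below are conventional textbook choices, not values from this work.

```python
class TalkspurtPlayout:
    """Per-talkspurt playout scheduling: smooth estimates of network
    delay and its variation drive a playout delay that is adjusted
    only at the start of each talkspurt."""

    def __init__(self, alpha=0.998002, beta=4.0):
        self.alpha = alpha            # smoothing factor
        self.beta = beta              # safety factor on delay variation
        self.d_hat = 0.0              # smoothed network delay estimate
        self.v_hat = 0.0              # smoothed delay variation estimate
        self.playout_delay = 0.0

    def on_packet(self, network_delay, starts_talkspurt):
        # Update running estimates from the observed one-way delay.
        self.d_hat = self.alpha * self.d_hat + (1 - self.alpha) * network_delay
        self.v_hat = (self.alpha * self.v_hat +
                      (1 - self.alpha) * abs(network_delay - self.d_hat))
        if starts_talkspurt:
            # Adjust only in silence, preserving talkspurt integrity.
            self.playout_delay = self.d_hat + self.beta * self.v_hat
        return self.playout_delay
```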