Composition of Deep and Spiking Neural Networks for Very Low Bit Rate Speech Coding
Most current very low bit rate (VLBR) speech coding systems use hidden Markov
model (HMM) based speech recognition/synthesis techniques. This allows
transmission of information (such as phonemes) segment by segment that
decreases the bit rate. However, the encoder based on a phoneme speech
recognition may create bursts of segmental errors. Segmental errors are further
propagated to optional suprasegmental (such as syllable) information coding.
Together with the errors of voicing detection in pitch parametrization,
HMM-based speech coding creates speech discontinuities and unnatural speech
sound artefacts.
In this paper, we propose a novel VLBR speech coding framework based on
neural networks (NNs) for end-to-end speech analysis and synthesis without
HMMs. The speech coding framework relies on phonological (sub-phonetic)
representation of speech, and it is designed as a composition of deep and
spiking NNs: a bank of phonological analysers at the transmitter, and a
phonological synthesizer at the receiver, both realised as deep NNs, and a
spiking NN as an incremental and robust encoder of syllable boundaries for
coding of continuous fundamental frequency (F0). A combination of phonological
features defines many more sound patterns than the phonetic features used by
HMM-based speech coders, and the finer analysis/synthesis code contributes to
smoother encoded speech. Listeners significantly prefer the NN-based approach
due to fewer discontinuities and speech artefacts of the encoded speech. A
single forward pass is required during the speech encoding and decoding. The
proposed VLBR speech coding operates at a bit rate of approximately 360 bits/s.
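The ~360 bits/s figure can be made concrete with a back-of-the-envelope bit-rate budget. The numbers below (segment rate, bits per phonological code, F0 stream rate) are illustrative assumptions for the sketch, not the paper's actual coder parameters:

```python
# Illustrative bit-rate budget for a segment-based VLBR coder.
# All concrete numbers are assumptions, not the paper's parameters.

def bitrate(symbols_per_second, bits_per_symbol):
    """Bits per second for a stream of discrete coded symbols."""
    return symbols_per_second * bits_per_symbol

# Hypothetical phonological stream: 20 coded segments per second,
# each carrying 13 binary phonological-class decisions.
phonological_bps = bitrate(20, 13)   # 260 b/s

# Hypothetical F0 stream: syllable-boundary events plus quantized
# continuous-F0 parameters, budgeted at ~100 b/s.
f0_bps = 100

total_bps = phonological_bps + f0_bps
print(total_bps)  # 360
```

Under these assumed figures the two streams together land at roughly the 360 bits/s reported above; the point is that segment-level coding of compact phonological codes, rather than frame-level waveform coding, is what keeps the rate this low.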
Employing Emotion Cues to Verify Speakers in Emotional Talking Environments
Usually, people talk neutrally in environments free of abnormal talking
conditions such as stress and emotion. Other emotional conditions, such as
happiness, anger, and sadness, can also affect a person's talking tone, and
such emotions are directly affected by the patient's health status. In neutral
talking environments, speakers can be verified easily; in emotional talking
environments, they cannot be verified as easily as in neutral ones.
Consequently, speaker verification systems do not perform well in emotional
talking environments as they do in neutral talking environments. In this work,
a two-stage approach has been employed and evaluated to improve speaker
verification performance in emotional talking environments. This approach
employs speaker emotion cues (text-independent and emotion-dependent speaker
verification problem) based on both Hidden Markov Models (HMMs) and
Suprasegmental Hidden Markov Models (SPHMMs) as classifiers. The approach
comprises two cascaded stages that combine an emotion recognizer and a
speaker recognizer into one system. The architecture has
been tested on two different and separate emotional speech databases: our
collected database and Emotional Prosody Speech and Transcripts database. The
results of this work show that the proposed approach gives promising results
with a significant improvement over previous studies and other approaches such
as emotion-independent speaker verification approach and emotion-dependent
speaker verification approach based completely on HMMs.
Comment: Journal of Intelligent Systems, Special Issue on Intelligent Healthcare Systems, De Gruyter, 201
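The two cascaded stages described above can be sketched as a small routing pipeline: stage one picks the most likely emotion, and stage two runs an emotion-dependent speaker verifier. The HMM/SPHMM scoring internals are stubbed as plain callables, and every name here is hypothetical, not the paper's API:

```python
# Sketch of a two-stage, emotion-dependent speaker verification cascade.
# Models are stand-in callables mapping features -> score; in the paper
# these would be HMM/SPHMM likelihood computations.

def recognize_emotion(features, emotion_models):
    """Stage 1: pick the emotion whose model scores the utterance highest."""
    return max(emotion_models, key=lambda e: emotion_models[e](features))

def verify_speaker(features, claimed_id, speaker_models, threshold=0.5):
    """Stage 2: accept the identity claim if the emotion-matched speaker
    model scores at or above a decision threshold."""
    return speaker_models[claimed_id](features) >= threshold

def two_stage_verify(features, claimed_id, emotion_models, speaker_models_by_emotion):
    """Cascade: recognized emotion selects which speaker models to use."""
    emotion = recognize_emotion(features, emotion_models)
    return verify_speaker(features, claimed_id, speaker_models_by_emotion[emotion])
```

The design choice this illustrates is why the cascade helps: by conditioning the speaker models on the recognized emotion, stage two compares the utterance against models trained under matched talking conditions instead of a single neutral-speech model.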
Using a low-bit rate speech enhancement variable post-filter as a speech recognition system pre-filter to improve robustness to GSM speech
Performance of speech recognition systems degrades when they are used to recognize speech that has been transmitted through GSM (Global System for Mobile Communications) voice communication channels (GSM speech). This degradation is mainly due to GSM speech coding and GSM channel noise on speech signals transmitted through the network. This poor recognition of GSM channel speech limits the use of speech recognition applications over GSM networks; if speech recognition technology is to be used without restriction over GSM networks, recognition accuracy on GSM channel speech has to be improved. Different channel normalization techniques have been developed in an attempt to improve recognition accuracy on voice-channel-modified speech in general (not specifically for GSM channel speech). These techniques fall into three broad categories: model modification, signal pre-processing, and feature processing. In this work, as a contribution toward improving the robustness of speech recognition systems to GSM speech, the use of a low-bit-rate speech enhancement post-filter as a speech recognition system pre-filter is proposed. This filter is to be used in recognition systems in combination with channel normalization techniques.
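The proposed pipeline (enhancement filtering before feature extraction, followed by a feature-domain channel normalization) can be sketched minimally. The enhancement stage below is a trivial pre-emphasis placeholder, not the codec's actual post-filter, and the normalization shown is per-utterance cepstral mean subtraction, one standard feature-processing technique from the category named above:

```python
import numpy as np

# Sketch: GSM-decoded speech -> enhancement pre-filter -> features ->
# cepstral mean normalization. The filter is a placeholder stand-in
# for the low-bit-rate codec's enhancement post-filter.

def enhancement_prefilter(signal, alpha=0.97):
    """Placeholder enhancement stage: simple pre-emphasis, applied to
    the time-domain GSM speech before feature extraction."""
    return np.append(signal[0], signal[1:] - alpha * signal[:-1])

def cepstral_mean_normalize(features):
    """Per-utterance cepstral mean subtraction (frames x coefficients):
    removes the stationary convolutional channel component."""
    return features - features.mean(axis=0, keepdims=True)
```

Cepstral mean subtraction removes any constant additive offset in the cepstral domain, which is exactly how a fixed (slowly varying) channel transfer function shows up; the pre-filter attacks the codec distortion itself, so the two operate at complementary points in the pipeline.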