664 research outputs found
Source and Filter Estimation for Throat-Microphone Speech Enhancement
In this paper, we propose a new statistical enhancement system for throat microphone recordings through source and filter separation. Throat microphones (TM) are skin-attached piezoelectric sensors that can capture speech sound signals in the form of tissue vibrations. Due to their limited bandwidth, TM recorded speech suffers from intelligibility and naturalness. In this paper, we investigate learning phone-dependent Gaussian mixture model (GMM)-based statistical mappings using parallel recordings of acoustic microphone (AM) and TM for enhancement of the spectral envelope and excitation signals of the TM speech. The proposed mappings address the phone-dependent variability of tissue conduction with TM recordings. While the spectral envelope mapping estimates the line spectral frequency (LSF) representation of AM from TM recordings, the excitation mapping is constructed based on the spectral energy difference (SED) of AM and TM excitation signals. The excitation enhancement is modeled as an estimation of the SED features from the TM signal. The proposed enhancement system is evaluated using both objective and subjective tests. Objective evaluations are performed with the log-spectral distortion (LSD), the wideband perceptual evaluation of speech quality (PESQ) and mean-squared error (MSE) metrics. Subjective evaluations are performed with an A/B comparison test. Experimental results indicate that the proposed phone-dependent mappings exhibit enhancements over phone-independent mappings. Furthermore enhancement of the TM excitation through statistical mappings of the SED features introduces significant objective and subjective performance improvements to the enhancement of TM recordings. ©2015 IEEE
Silent-speech enhancement using body-conducted vocal-tract resonance signals
The physical characteristics of weak body-conducted vocal-tract resonance signals called non-audible murmur (NAM) and the acoustic characteristics of three sensors developed for detecting these signals have been investigated. NAM signals attenuate 50 dB at 1 kHz; this attenuation consists of 30-dB full-range attenuation due to air-to-body transmission loss and 10 dB/octave spectral decay due to a sound propagation loss within the body. These characteristics agree with the spectral characteristics of measured NAM signals. The sensors have a sensitivity of between 41 and 58 dB [V/Pa] at I kHz, and the mean signal-to-noise ratio of the detected signals was 15 dB. On the basis of these investigations, three types of silent-speech enhancement systems were developed: (1) simple, direct amplification of weak vocal-tract resonance signals using a wired urethane-elastomer NAM microphone, (2) simple, direct amplification using a wireless urethane-elastomer-duplex NAM microphone, and (3) transformation of the weak vocal-tract resonance signals sensed by a soft-silicone NAM microphone into whispered speech using statistical conversion. Field testing of the systems showed that they enable voice impaired people to communicate verbally using body-conducted vocal-tract resonance signals. Listening tests demonstrated that weak body-conducted vocal-tract resonance sounds can be transformed into intelligible whispered speech sounds. Using these systems, people with voice impairments can re-acquire speech communication with less effort. (C) 2009 Elsevier B.V. All rights reserved.ArticleSPEECH COMMUNICATION. 52(4):301-313 (2010)journal articl
A Novel Radar Sensor for the Non-Contact Detection of Speech Signals
Different speech detection sensors have been developed over the years but they are limited by the loss of high frequency speech energy, and have restricted non-contact detection due to the lack of penetrability. This paper proposes a novel millimeter microwave radar sensor to detect speech signals. The utilization of a high operating frequency and a superheterodyne receiver contributes to the high sensitivity of the radar sensor for small sound vibrations. In addition, the penetrability of microwaves allows the novel sensor to detect speech signals through nonmetal barriers. Results show that the novel sensor can detect high frequency speech energies and that the speech quality is comparable to traditional microphone speech. Moreover, the novel sensor can detect speech signals through a nonmetal material of a certain thickness between the sensor and the subject. Thus, the novel speech sensor expands traditional speech detection techniques and provides an exciting alternative for broader application prospects
Configurable EBEN: Extreme Bandwidth Extension Network to enhance body-conducted speech capture
This paper presents a configurable version of Extreme Bandwidth Extension
Network (EBEN), a Generative Adversarial Network (GAN) designed to improve
audio captured with body-conduction microphones. We show that although these
microphones significantly reduce environmental noise, this insensitivity to
ambient noise happens at the expense of the bandwidth of the speech signal
acquired by the wearer of the devices. The obtained captured signals therefore
require the use of signal enhancement techniques to recover the full-bandwidth
speech. EBEN leverages a configurable multiband decomposition of the raw
captured signal. This decomposition allows the data time domain dimensions to
be reduced and the full band signal to be better controlled. The multiband
representation of the captured signal is processed through a U-Net-like model,
which combines feature and adversarial losses to generate an enhanced speech
signal. We also benefit from this original representation in the proposed
configurable discriminators architecture. The configurable EBEN approach can
achieve state-of-the-art enhancement results on synthetic data with a
lightweight generator that allows real-time processing.Comment: Accepted in IEEE/ACM Transactions on Audio, Speech and Language
Processing on 14/08/202
Glottal-synchronous speech processing
Glottal-synchronous speech processing is a field of speech science where the pseudoperiodicity
of voiced speech is exploited. Traditionally, speech processing involves segmenting
and processing short speech frames of predefined length; this may fail to exploit the inherent
periodic structure of voiced speech which glottal-synchronous speech frames have
the potential to harness. Glottal-synchronous frames are often derived from the glottal
closure instants (GCIs) and glottal opening instants (GOIs).
The SIGMA algorithm was developed for the detection of GCIs and GOIs from
the Electroglottograph signal with a measured accuracy of up to 99.59%. For GCI and
GOI detection from speech signals, the YAGA algorithm provides a measured accuracy
of up to 99.84%. Multichannel speech-based approaches are shown to be more robust to
reverberation than single-channel algorithms.
The GCIs are applied to real-world applications including speech dereverberation,
where SNR is improved by up to 5 dB, and to prosodic manipulation where the importance
of voicing detection in glottal-synchronous algorithms is demonstrated by subjective
testing. The GCIs are further exploited in a new area of data-driven speech modelling,
providing new insights into speech production and a set of tools to aid deployment into
real-world applications. The technique is shown to be applicable in areas of speech coding,
identification and artificial bandwidth extension of telephone speec
- …