
    Reconstruction-based speech enhancement from robust acoustic features

    This paper proposes a method of speech enhancement in which a clean speech signal is reconstructed from a sinusoidal model of speech production and a set of acoustic speech features. The acoustic features are estimated from noisy speech and comprise, for each frame, a voicing classification (voiced, unvoiced or non-speech), a fundamental frequency (for voiced frames) and a spectral envelope. Rather than using different algorithms to estimate each parameter, a single statistical model is developed. This comprises a set of acoustic models and is similar to the acoustic modelling used in speech recognition, which allows noise and speaker adaptation to be applied to acoustic feature estimation to improve robustness. Objective and subjective tests compare reconstruction-based enhancement with other enhancement methods and show the proposed method to be highly effective at removing noise.
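    The reconstruction stage can be pictured as harmonic sinusoidal synthesis driven by the estimated features. Below is a minimal sketch of that idea in Python, not the authors' implementation: a voiced frame is built from harmonics of the estimated fundamental frequency, with amplitudes read from the spectral envelope. The function name and parameter values are illustrative; unvoiced frames would instead use envelope-shaped noise.

```python
import numpy as np

def synthesize_voiced_frame(f0, envelope, fs=16000, frame_len=400):
    """Illustrative harmonic synthesis of one voiced frame.

    f0       : estimated fundamental frequency in Hz
    envelope : callable returning the spectral-envelope magnitude at a given Hz
    """
    t = np.arange(frame_len) / fs
    frame = np.zeros(frame_len)
    n_harmonics = int((fs / 2) // f0)            # harmonics up to Nyquist
    for k in range(1, n_harmonics + 1):
        amp = envelope(k * f0)                   # amplitude from the envelope
        frame += amp * np.cos(2 * np.pi * k * f0 * t)
    return frame

# Toy usage: 120 Hz pitch with a gently decaying envelope.
frame = synthesize_voiced_frame(120.0, lambda f: 1.0 / (1.0 + f / 4000.0))
```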

    Robust Audio and WiFi Sensing via Domain Adaptation and Knowledge Sharing From External Domains

    Recent advancements in machine learning have initiated a revolution in embedded sensing and inference systems. Acoustic and WiFi-based sensing and inference systems have enabled a wide variety of applications, ranging from home activity detection to health vitals monitoring. While many existing solutions paved the way for acoustic event recognition and WiFi-based activity detection, the diverse characteristics of the sensors, systems, and environments used for data capture cause a shift in the distribution of data and thus sub-optimal classification performance when sensor and environment discrepancies arise between the training and inference stages. Moreover, large-scale acoustic and WiFi data collection is non-trivial and cumbersome. Therefore, current acoustic and WiFi-based sensing systems suffer when there is a lack of labeled samples, as they rely only on the provided training data. In this thesis, we aim to address the performance loss of machine learning-based classifiers for acoustic and WiFi-based sensing systems due to sensor and environment heterogeneity and the lack of labeled examples. We show that discovering latent domains (sensor type, environment, etc.) and removing domain bias from machine learning classifiers makes acoustic and WiFi-based sensing robust and generalized. We also propose a few-shot domain adaptation method that requires only one labeled sample for a new domain, relieving users and developers from the painstaking task of data collection in each new domain. Furthermore, to address the lack of labeled examples, we propose to exploit information or learned knowledge from sources where data already exists in volume, such as textual descriptions and the visual domain. We implemented our algorithms on mobile and embedded platforms and collected data from participants to evaluate the proposed algorithms and frameworks extensively.
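    The one-labeled-sample adaptation idea can be illustrated with a simple nearest-class-mean scheme: class prototypes learned on source domains are shifted by the offset observed on the single labeled target-domain example. This is only a sketch of the general principle, not the method developed in the thesis; all names and values are illustrative.

```python
import numpy as np

def adapt_prototypes(prototypes, x_target, y_target):
    """One-shot adaptation sketch: shift source-domain class prototypes.

    prototypes : dict mapping class label -> mean feature vector (source domains)
    x_target   : feature vector of the single labeled target-domain sample
    y_target   : its class label
    """
    offset = x_target - prototypes[y_target]     # crude estimate of domain shift
    return {c: p + offset for c, p in prototypes.items()}

def classify(x, prototypes):
    """Nearest-prototype classification."""
    return min(prototypes, key=lambda c: np.linalg.norm(x - prototypes[c]))

# Toy usage with two classes in a 2-D feature space.
protos = {"cough": np.array([0.0, 0.0]), "speech": np.array([1.0, 1.0])}
adapted = adapt_prototypes(protos, np.array([0.3, 0.2]), "cough")
print(classify(np.array([1.2, 1.3]), adapted))
```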

    Speaker Independent Acoustic-to-Articulatory Inversion

    Acoustic-to-articulatory inversion, the determination of articulatory parameters from acoustic signals, is a difficult but important problem for many speech processing applications, such as automatic speech recognition (ASR) and computer-aided pronunciation training (CAPT). In recent years, several approaches have been successfully implemented for speaker-dependent models with parallel acoustic and kinematic training data. However, in many practical applications inversion is needed for new speakers for whom no articulatory data is available. To address this problem, this dissertation introduces a novel speaker adaptation approach called Parallel Reference Speaker Weighting (PRSW), based on parallel acoustic and articulatory Hidden Markov Models (HMMs). This approach uses a robust normalized articulatory space and palate-referenced articulatory features, combined with speaker-weighted adaptation, to form an inversion mapping for new speakers that can accurately estimate articulatory trajectories. The proposed PRSW method is evaluated on the newly collected Marquette electromagnetic articulography - Mandarin Accented English (EMA-MAE) corpus using 20 native English speakers. Cross-speaker inversion results show that, given a good selection of reference speakers with consistent acoustic and articulatory patterns, the PRSW approach gives good speaker-independent inversion performance even without kinematic training data.
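    The core of reference speaker weighting can be reduced to the following sketch (an illustration of the weighting idea only, not the dissertation's HMM-based PRSW): weights over reference speakers are chosen so their combined acoustic statistics fit the new speaker, and the same weights are then applied to the reference speakers' articulatory parameters.

```python
import numpy as np

def reference_speaker_weights(ref_acoustic, target_acoustic):
    """Fit weights so a combination of reference speakers matches the target.

    ref_acoustic    : (n_refs, dim) acoustic statistics of the reference speakers
    target_acoustic : (dim,) acoustic statistics estimated for the new speaker
    Returns non-negative weights that sum to one (a crude convexity constraint).
    """
    w, *_ = np.linalg.lstsq(ref_acoustic.T, target_acoustic, rcond=None)
    w = np.clip(w, 0.0, None)
    return w / w.sum()

def adapted_articulatory_model(ref_articulatory, weights):
    """Apply the acoustic weights to the reference articulatory parameters."""
    return weights @ ref_articulatory

# Toy usage: 3 reference speakers, 4-dimensional acoustic statistics.
refs = np.random.default_rng(0).normal(size=(3, 4))
w = reference_speaker_weights(refs, refs.mean(axis=0))
```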

    A Novel Adaptive Spectrum Noise Cancellation Approach for Enhancing Heartbeat Rate Monitoring in a Wearable Device

    This paper presents a novel approach, Adaptive Spectrum Noise Cancellation (ASNC), for removing motion artifacts from photoplethysmography (PPG) signals measured by an optical biosensor, in order to obtain clean PPG waveforms for heartbeat rate calculation. One challenge faced by this optical sensing method is the inevitable noise induced by movement when the user is in motion, especially when the motion frequency is very close to the target heartbeat rate. The proposed ASNC utilizes the onboard accelerometer and gyroscope sensors to detect and remove the artifacts adaptively, thus obtaining accurate heartbeat rate measurements while in motion. The ASNC algorithm uses a commonly accepted spectrum analysis approach in medical digital signal processing, the discrete cosine transform, to carry out frequency-domain analysis. Results obtained by the proposed ASNC have been compared to two classic algorithms, adaptive threshold peak detection and adaptive noise cancellation. The mean (standard deviation) absolute error and mean relative error of the heartbeat rate calculated by ASNC are 0.33 (0.57) beats·min⁻¹ and 0.65%; by the adaptive threshold peak detection algorithm, 2.29 (2.21) beats·min⁻¹ and 8.38%; and by the adaptive noise cancellation algorithm, 1.70 (1.50) beats·min⁻¹ and 2.02%. While all algorithms performed well with both simulated PPG data and clean PPG data collected from our Verity device in situations free of motion artifacts, ASNC provided better accuracy as motion artifacts increased, especially when the motion frequency was very close to the heartbeat rate.
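    The central idea, suppressing PPG spectral components that coincide with motion measured by the inertial sensors before picking the heart-rate peak, can be sketched as follows. This is a simplified illustration using SciPy's DCT; the threshold, guard band and heart-rate search range are assumptions, not values from the paper.

```python
import numpy as np
from scipy.fft import dct

def asnc_heart_rate(ppg, accel, fs, guard_bins=2):
    """Simplified sketch of adaptive spectrum noise cancellation.

    ppg, accel : equal-length 1-D windows of PPG and accelerometer magnitude
    fs         : sampling rate in Hz
    Zeroes PPG DCT bins near dominant motion bins, then reads the heart rate
    from the strongest remaining bin in a plausible heart-rate band.
    """
    ppg_spec = dct(ppg - ppg.mean(), norm="ortho")
    acc_spec = np.abs(dct(accel - accel.mean(), norm="ortho"))
    motion_bins = np.where(acc_spec > 3 * acc_spec.mean())[0]   # crude threshold
    for b in motion_bins:
        ppg_spec[max(0, b - guard_bins):b + guard_bins + 1] = 0.0
    n = len(ppg)
    freqs = np.arange(n) * fs / (2 * n)          # DCT-II bin k -> k*fs/(2N) Hz
    band = (freqs >= 0.7) & (freqs <= 3.5)       # roughly 42-210 beats per minute
    peak = np.argmax(np.abs(ppg_spec) * band)
    return 60.0 * freqs[peak]

# Toy usage: 8 s of synthetic data at 50 Hz with a 1.5 Hz "pulse" and no motion.
t = np.arange(0, 8, 1 / 50)
bpm = asnc_heart_rate(np.sin(2 * np.pi * 1.5 * t), np.zeros_like(t), fs=50)
```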

    Methods of Optimizing Speech Enhancement for Hearing Applications

    Speech intelligibility in hearing applications suffers from background noise. One of the most effective solutions is to develop speech enhancement algorithms based on the biological traits of the auditory system. In humans, the medial olivocochlear (MOC) reflex, an auditory neural feedback loop, increases signal-in-noise detection by suppressing the cochlear response to noise. The time constant is one of the key attributes of the MOC reflex, as it regulates how the suppression varies over time. Different time constants have been measured in nonhuman mammalian and human auditory systems. Physiological studies report that the time constant of the nonhuman mammalian MOC reflex varies with changes in the properties (e.g. frequency, bandwidth) of the stimulation, and a human-based study suggests that the time constant could vary when the bandwidth of the noise is changed. Previous works have developed MOC reflex models and successfully demonstrated the benefits of simulating the MOC reflex for speech-in-noise recognition, but they often used fixed time constants, and the effect of different time constants on speech perception remains unclear. The main objectives of the present study are (1) to study the effect of the MOC reflex time constant on speech perception in different noise conditions, and (2) to develop a speech enhancement algorithm with dynamic time constant optimization that adapts to varying noise conditions to improve speech intelligibility.

    The first part of this thesis studies the effect of the MOC reflex time constants on speech-in-noise perception. Conventional studies do not consider the relationship between the time constants and speech perception, as it is difficult to measure the speech intelligibility changes due to varying time constants in human subjects. We use a model to investigate the relationship by incorporating Meddis' peripheral auditory model (which includes a MOC reflex) with an automatic speech recognition (ASR) system. The effect of the MOC reflex time constant is studied by adjusting the time constant parameter of the model and testing the speech recognition accuracy of the ASR. Different time constants derived from human data are evaluated in both speech-like and non-speech-like noise at SNR levels from -10 dB to 20 dB, as well as in a clean speech condition. The results show that long time constants (≥ 1000 ms) provide a greater improvement in speech recognition accuracy at SNR levels ≤ 10 dB. A maximum accuracy improvement of 40% (compared to the no-MOC condition) is shown in pink noise at an SNR of 10 dB. Short time constants (< 1000 ms) show recognition accuracy over 5% higher than the longer ones at SNR levels ≥ 15 dB.

    The second part of the thesis develops a novel speech enhancement algorithm based on the MOC reflex with a time constant that is dynamically optimized, according to a lookup table, for varying SNRs. The main contributions of this part are as follows. (1) Existing SNR estimation methods are challenged by low SNR, nonstationary noise, and computational complexity; high computational complexity increases processing delay, which degrades intelligibility. A variance of spectral entropy (VSE) based SNR estimation method is therefore developed, as entropy-based features have been shown to be more robust at low SNR and in nonstationary noise. The SNR is estimated from the measured VSE of noisy speech using pre-fitted VSE-SNR relationship functions. Our proposed method is about 5 dB more accurate than other methods, especially in babble noise with few talkers (2 talkers) and at low SNR levels (< 0 dB), with an average processing time of only about 30% of that of the noise-power-estimation-based method. The proposed SNR estimation method is further improved by implementing a nonlinear filter-bank; the compression of the nonlinear filter-bank is shown to increase the stability of the relationship functions, improving accuracy by up to 2 dB in all types of tested noise. (2) A modification of Meddis' MOC reflex model with a time constant dynamically optimized against varying SNRs is developed. The model includes a simulated inner hair cell response to reduce model complexity, and incorporates the SNR estimation method. Previous MOC reflex models often have fixed time constants that do not adapt to varying noise conditions, whereas our modified MOC reflex model has a time constant dynamically optimized according to the estimated SNRs. The results show a speech recognition accuracy 8% higher than that of a model using a fixed time constant of 2000 ms in different types of noise. (3) A speech enhancement algorithm is developed based on the modified MOC reflex model and implemented in an existing hearing aid system. The performance is evaluated by measuring an objective speech intelligibility metric on processed noisy speech. In different types of noise, the proposed algorithm increases intelligibility by at least 20% in comparison to unprocessed noisy speech at SNRs between 0 dB and 20 dB, and by over 15% in comparison to noisy speech processed with the original MOC-based algorithm in the hearing aid.
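    As an illustration of the VSE-based estimator and the lookup-table-driven time constant described above, the sketch below computes the spectral entropy of each frame of noisy speech, takes its variance, and maps the result to an SNR and then to a time constant. The frame sizes, the mapping function and the table values are placeholders, not the relationship functions or optimized constants from the thesis.

```python
import numpy as np

def spectral_entropy(magnitude_spectrum):
    """Entropy of a normalized magnitude spectrum (one frame)."""
    p = magnitude_spectrum / (magnitude_spectrum.sum() + 1e-12)
    return -np.sum(p * np.log2(p + 1e-12))

def variance_of_spectral_entropy(noisy, frame_len=512, hop=256):
    """Variance of per-frame spectral entropy (VSE) of a noisy signal."""
    entropies = [
        spectral_entropy(np.abs(np.fft.rfft(noisy[i:i + frame_len])))
        for i in range(0, len(noisy) - frame_len, hop)
    ]
    return np.var(entropies)

def estimate_snr(noisy, vse_to_snr):
    """Map measured VSE to SNR via a pre-fitted relationship function."""
    return vse_to_snr(variance_of_spectral_entropy(noisy))

def pick_time_constant(snr_db):
    """Placeholder lookup table: SNR band -> MOC time constant in ms."""
    if snr_db <= 10:
        return 2000            # long time constants help at low SNR
    if snr_db < 15:
        return 1000
    return 300                 # shorter time constants at high SNR
```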

    Communication breakdown: Limits of spectro-temporal resolution for the perception of bat communication calls

    During vocal communication, the spectro-temporal structure of vocalizations conveys important contextual information. Bats excel in the use of sounds for echolocation by meticulous encoding of signals in the temporal domain. We therefore hypothesized that for social communication as well, bats would excel at detecting minute distortions in the spectro-temporal structure of calls. To test this hypothesis, we systematically introduced spectro-temporal distortion to communication calls of Phyllostomus discolor bats. We broke down each call into windows of the same length and randomized the phase spectrum inside each window. The overall degree of spectro-temporal distortion in communication calls increased with window length. Modelling the bat auditory periphery revealed that cochlear mechanisms allow discrimination of fast spectro-temporal envelopes. We evaluated model predictions with experimental psychophysical and neurophysiological data. We first assessed bats' performance in discriminating original versions of calls from increasingly distorted versions of the same calls. We further examined cortical responses to determine additional specializations for call discrimination at the cortical level. Psychophysical and cortical responses concurred with model predictions, revealing discrimination thresholds in the range of 8–15 ms randomization-window length. Our data suggest that specialized cortical areas are not necessary to impart psychophysical resilience to temporal distortion in communication calls.
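    The distortion procedure, cutting a call into fixed-length windows and randomizing the phase spectrum inside each window while keeping the magnitudes, can be sketched as follows; the sampling rate and window length in the usage lines are illustrative, not the experimental values.

```python
import numpy as np

def phase_randomize(call, fs, window_ms):
    """Randomize the phase spectrum within consecutive windows of a call.

    Longer windows scramble the spectro-temporal envelope more strongly,
    which is the manipulation used to probe discrimination limits.
    """
    rng = np.random.default_rng()
    win = int(round(fs * window_ms / 1000.0))
    out = np.array(call, dtype=float)
    for start in range(0, len(out) - win + 1, win):
        spec = np.fft.rfft(out[start:start + win])
        phases = np.exp(1j * rng.uniform(0.0, 2.0 * np.pi, size=spec.shape))
        out[start:start + win] = np.fft.irfft(np.abs(spec) * phases, n=win)
    return out

# Toy usage: distort a synthetic 20 ms tone burst with 8 ms windows.
fs = 192000
tone = np.sin(2 * np.pi * 30000 * np.arange(int(0.02 * fs)) / fs)
distorted = phase_randomize(tone, fs, window_ms=8)
```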

    Methods for speaking style conversion from normal speech to high vocal effort speech

    This thesis deals with vocal-effort-focused speaking style conversion (SSC). Specifically, we studied two topics concerning the conversion of normal speech to high vocal effort speech. The first topic involves the conversion of normal speech to shouted speech. We employed this conversion in a speaker recognition system with a vocal effort mismatch between test and enrollment utterances (shouted speech vs. normal speech). The mismatch degrades the system's speaker identification performance. As a solution, we proposed an SSC system that includes a novel spectral mapping, used alongside a statistical mapping technique, to transform the mel-frequency spectral energies of normal speech enrollment utterances towards their counterparts in shouted speech. We evaluated the proposed solution by comparing speaker identification rates for a state-of-the-art i-vector-based speaker recognition system with and without applying SSC to the enrollment utterances. Our results showed that applying the proposed SSC pre-processing to the enrollment data considerably improves the speaker identification rates. The second topic involves normal-to-Lombard speech conversion. We proposed a vocoder-based parametric SSC system to perform the conversion. This system first extracts speech features using the vocoder; next, a mapping technique that is robust to data scarcity maps the features; finally, the vocoder synthesizes the mapped features into speech. For comparison, we used two vocoders in the conversion system: a glottal vocoder and the widely used STRAIGHT. We assessed the converted speech from the two vocoder cases with two subjective listening tests that measured similarity to Lombard speech and naturalness. The similarity test showed that, for both vocoder cases, our proposed SSC system was able to convert normal speech to Lombard speech. The naturalness test showed that the converted samples using the glottal vocoder were clearly more natural than those obtained with STRAIGHT.
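    The spectral mapping stage of the first topic, moving mel-frequency spectral energies of normal speech towards their shouted-speech counterparts, can be illustrated with a simple affine mapping learned from paired frames. This is a sketch under that simplification; the thesis's actual combination of a novel spectral mapping with a statistical mapping technique is more elaborate, and the function names are illustrative.

```python
import numpy as np

def fit_spectral_mapping(normal_frames, shouted_frames):
    """Least-squares affine map from normal to shouted mel spectral energies.

    normal_frames, shouted_frames : (n_frames, n_mel) paired training frames.
    """
    X = np.hstack([normal_frames, np.ones((len(normal_frames), 1))])  # add bias
    W, *_ = np.linalg.lstsq(X, shouted_frames, rcond=None)
    return W

def apply_spectral_mapping(frames, W):
    """Map normal-speech frames towards the shouted-speech space."""
    X = np.hstack([frames, np.ones((len(frames), 1))])
    return X @ W

# Toy usage: 100 paired frames of 20 mel energies.
rng = np.random.default_rng(0)
normal = rng.random((100, 20))
shouted = 1.5 * normal + 0.1          # pretend shouting scales and offsets energy
W = fit_spectral_mapping(normal, shouted)
mapped = apply_spectral_mapping(normal, W)
```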