126 research outputs found
Psychoacoustic improvement of Wiener filtering: some recent approaches and a new method
Musical noise, distortion, Wiener filter, psychoacoustics, speech signal
Noise-Robust Voice Conversion
A persistent challenge in speech processing is the presence of noise that reduces the quality of speech signals. Whether natural speech is used as input or speech is the desired output to be synthesized, noise degrades the performance of these systems and makes the output speech sound unnatural. Speech enhancement addresses this problem, typically by improving the input speech or by post-processing the (re)synthesized speech. An intriguing complement to post-processing speech signals is voice conversion, in which speech by one person (source speaker) is made to sound as if spoken by a different person (target speaker). Traditionally, the majority of speech enhancement and voice conversion methods rely on parametric modeling of speech. A promising complement to parametric models is an inventory-based approach, which is the focus of this work. In inventory-based speech systems, one records an inventory of clean speech signals as a reference. Noisy speech (in the case of enhancement) or target speech (in the case of conversion) can then be replaced by the best-matching clean speech in the inventory, which is found via a correlation search method. Such an approach has the potential to alleviate the intelligibility and unnaturalness issues often encountered by parametric speech processing systems. This work investigates and compares inventory-based speech enhancement methods with conventional ones. In addition, the inventory search method is applied to estimate source speaker characteristics for voice conversion in noisy environments. Two noisy-environment voice conversion systems were constructed for a comparative study: a direct voice conversion system and an inventory-based voice conversion system, both with limited noise filtering at the front end. Results from this work suggest that the inventory method offers encouraging improvements over the direct conversion method.
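The correlation search described in this abstract can be illustrated with a minimal sketch. It assumes frames are short, equal-length feature vectors and that "best-matching" means highest normalized correlation; the abstract does not specify the features or the exact correlation measure, so the function names (`normalized_correlation`, `best_inventory_match`) and the toy data are hypothetical.

```python
import math

def normalized_correlation(a, b):
    """Mean-removed, normalized correlation between two equal-length frames."""
    n = len(a)
    mean_a = sum(a) / n
    mean_b = sum(b) / n
    num = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    den = math.sqrt(
        sum((x - mean_a) ** 2 for x in a) * sum((y - mean_b) ** 2 for y in b)
    )
    return num / den if den > 0 else 0.0

def best_inventory_match(noisy_frame, inventory):
    """Return (index, score) of the clean frame most correlated with noisy_frame."""
    scores = [normalized_correlation(noisy_frame, clean) for clean in inventory]
    best = max(range(len(inventory)), key=scores.__getitem__)
    return best, scores[best]

# Hypothetical inventory of clean frames and one noisy observation.
inventory = [
    [0.0, 1.0, 0.0, -1.0],
    [1.0, 1.0, -1.0, -1.0],
    [1.0, -1.0, 1.0, -1.0],
]
noisy = [0.9, 1.2, -0.8, -1.1]
index, score = best_inventory_match(noisy, inventory)
```

In a full system the selected clean frame would replace the noisy one before resynthesis. This sketch covers only the search step.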
Effective post-processing for single-channel frequency-domain speech enhancement
Conventional frequency-domain speech enhancement filters improve signal-to-noise ratio (SNR), but also produce speech distortions. This paper describes a novel post-processing algorithm devised to improve the quality of speech processed by a conventional filter. In the proposed algorithm, the speech distortion is first compensated by adding the original noisy speech, and then the noise is reduced by a post-filter. Experimental results on speech quality show the effectiveness of the proposed algorithm in lowering speech distortion. In our isolated word recognition experiments conducted in 15 real car environments, a relative word error rate (WER) reduction of 10.5% is obtained compared to the conventional filter.
Evaluating the predictions of objective intelligibility metrics for modified and synthetic speech
Several modification algorithms that alter natural or synthetic speech with the goal of improving intelligibility in noise have been proposed recently. A key requirement of many modification techniques is the ability to predict intelligibility, both offline during algorithm development, and online, in order to determine the optimal modification for the current noise context. While existing objective intelligibility metrics (OIMs) have good predictive power for unmodified natural speech in stationary and fluctuating noise, little is known about their effectiveness for other forms of speech. The current study evaluated how well seven OIMs predict listener responses in three large datasets of modified and synthetic speech which together represent 396 combinations of speech modification, masker type and signal-to-noise ratio. The chief finding is a clear reduction in predictive power for most OIMs when faced with modified and synthetic speech. Modifications introducing durational changes are particularly harmful to intelligibility predictors. OIMs that measure masked audibility tend to over-estimate intelligibility in the presence of fluctuating maskers relative to stationary maskers, while OIMs that estimate the distortion caused by the masker to a clean speech prototype exhibit the reverse pattern.
Unsupervised Uncertainty Measures of Automatic Speech Recognition for Non-intrusive Speech Intelligibility Prediction
Non-intrusive intelligibility prediction is important for its application in
realistic scenarios, where a clean reference signal is difficult to access. The
construction of many non-intrusive predictors requires either ground truth
intelligibility labels or clean reference signals for supervised learning. In
this work, we leverage an unsupervised uncertainty estimation method for
predicting speech intelligibility, which does not require intelligibility
labels or reference signals to train the predictor. Our experiments demonstrate
that the uncertainty from state-of-the-art end-to-end automatic speech
recognition (ASR) models is highly correlated with speech intelligibility. The
proposed method is evaluated on two databases and the results show that the
unsupervised uncertainty measures of ASR models are more correlated with speech
intelligibility from listening results than the predictions made by widely used
intrusive methods. Comment: Submitted to INTERSPEECH202
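As a rough illustration of how an uncertainty score can be derived from an ASR model without labels or a reference signal, the sketch below computes the mean entropy of per-step token posteriors. This is an assumption made for illustration: the abstract does not state which uncertainty measure is used, and the names `token_entropy` and `utterance_uncertainty` and the toy posteriors are hypothetical.

```python
import math

def token_entropy(probs):
    """Shannon entropy (in nats) of one posterior distribution over tokens."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)

def utterance_uncertainty(posteriors):
    """Mean per-step entropy over all decoding steps of one utterance."""
    return sum(token_entropy(p) for p in posteriors) / len(posteriors)

# Hypothetical posteriors: a clearly recognised utterance vs. a degraded one.
confident = [[0.97, 0.01, 0.01, 0.01], [0.94, 0.02, 0.02, 0.02]]
uncertain = [[0.25, 0.25, 0.25, 0.25], [0.40, 0.30, 0.20, 0.10]]

u_conf = utterance_uncertainty(confident)
u_unc = utterance_uncertainty(uncertain)
```

Under this assumption, a higher score would be read as lower predicted intelligibility, and the score could then be correlated with listening-test results.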
Play as You Like: Timbre-enhanced Multi-modal Music Style Transfer
Style transfer of polyphonic music recordings is a challenging task when
considering the modeling of diverse, imaginative, and reasonable music pieces
in a style different from their original one. To achieve this, learning
stable multi-modal representations for both domain-variant (i.e., style) and
domain-invariant (i.e., content) information of music in an unsupervised manner
is critical. In this paper, we propose an unsupervised music style transfer
method without the need for parallel data. Besides, to characterize the
multi-modal distribution of music pieces, we employ the Multi-modal
Unsupervised Image-to-Image Translation (MUNIT) framework in the proposed
system. This allows one to generate diverse outputs from the learned latent
distributions representing contents and styles. Moreover, to better capture the
granularity of sound, such as the perceptual dimensions of timbre and the
nuance in instrument-specific performance, cognitively plausible features
including mel-frequency cepstral coefficients (MFCC), spectral difference, and
spectral envelope, are combined with the widely-used mel-spectrogram into a
timbre-enhanced multi-channel input representation. A Relativistic average
Generative Adversarial Network (RaGAN) is also used to achieve fast
convergence and high stability. We conduct experiments on bilateral style
transfer tasks among three different genres, namely piano solo, guitar solo,
and string quartet. Results demonstrate the advantages of the proposed method
in music style transfer with improved sound quality and in allowing users to
manipulate the output.
A non-intrusive method for estimating binaural speech intelligibility from noise-corrupted signals captured by a pair of microphones
A non-intrusive method is introduced to predict binaural speech intelligibility in noise directly from signals captured using a pair of microphones. The approach combines blind source separation and localisation techniques with an intrusive objective intelligibility measure (OIM). Unlike classic intrusive OIMs, this method therefore requires neither a clean reference speech signal nor knowledge of the source locations to operate. The proposed approach is able to estimate intelligibility in stationary and fluctuating noises, when the noise masker is presented as a point or diffuse source and is spatially separated from the target speech source on a horizontal plane. The performance of the proposed method was evaluated in two rooms. When predicting subjective intelligibility measured as word recognition rate, this method showed reasonable predictive accuracy with correlation coefficients above 0.82, which is comparable to that of a reference intrusive OIM in most of the conditions. The proposed approach offers a solution for fast binaural intelligibility prediction, and therefore has practical potential to be deployed in situations where on-site speech intelligibility is a concern.
- …