126 research outputs found

Psychoacoustic improvement of Wiener filtering: some recent approaches and a new method

Author: Amehraye Asmaa
Pastor Dominique
Tamtaoui Ahmed
Publication venue: HAL CCSD
Publication date: 01/01/2007
Field of study

Musical noise, distortion, Wiener filter, psychoacoustics, speech signal

Noise-Robust Voice Conversion

Author: Tran Trang Thi Minh
Publication venue: Bucknell Digital Commons
Publication date: 04/05/2014
Field of study

A persistent challenge in speech processing is the presence of noise that reduces the quality of speech signals. Whether natural speech is used as input or speech is the desired output to be synthesized, noise degrades the performance of these systems and causes output speech to sound unnatural. Speech enhancement deals with this problem, typically seeking to improve the input speech or to post-process the (re)synthesized speech. An intriguing complement to post-processing speech signals is voice conversion, in which speech by one person (source speaker) is made to sound as if spoken by a different person (target speaker). Traditionally, the majority of speech enhancement and voice conversion methods rely on parametric modeling of speech. A promising complement to parametric models is an inventory-based approach, which is the focus of this work. In inventory-based speech systems, one records an inventory of clean speech signals as a reference. Noisy speech (in the case of enhancement) or target speech (in the case of conversion) can then be replaced by the best-matching clean speech in the inventory, which is found via a correlation search method. Such an approach has the potential to alleviate the intelligibility and unnaturalness issues often encountered by speech processing systems based on parametric modeling. This work investigates and compares inventory-based speech enhancement methods with conventional ones. In addition, the inventory search method is applied to estimate source speaker characteristics for voice conversion in noisy environments. Two noisy-environment voice conversion systems were constructed for a comparative study: a direct voice conversion system and an inventory-based voice conversion system, both with limited noise filtering at the front end. Results from this work suggest that the inventory method offers encouraging improvements over the direct conversion method.
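
The correlation search mentioned in this abstract can be illustrated with a minimal sketch. It assumes that frames are short fixed-length lists of samples or features and that the match score is a mean-removed normalized correlation; the abstract does not specify either, and the names `normalized_correlation` and `best_inventory_match` are hypothetical.

```python
import math


def normalized_correlation(a, b):
    """Mean-removed normalized correlation between two equal-length frames."""
    if len(a) != len(b) or not a:
        raise ValueError("frames must be non-empty and of equal length")
    mean_a = sum(a) / len(a)
    mean_b = sum(b) / len(b)
    da = [x - mean_a for x in a]
    db = [y - mean_b for y in b]
    denom = math.sqrt(sum(x * x for x in da) * sum(y * y for y in db))
    if denom == 0.0:
        return 0.0  # a constant frame carries no shape information
    return sum(x * y for x, y in zip(da, db)) / denom


def best_inventory_match(frame, inventory):
    """Return (index, score) of the clean inventory frame that best matches `frame`."""
    scores = [normalized_correlation(frame, clean) for clean in inventory]
    best = max(range(len(scores)), key=scores.__getitem__)
    return best, scores[best]


# Toy data (assumption): three clean frames and one noisy observation.
inventory = [
    [0.0, 1.0, 0.0, -1.0],
    [1.0, 1.0, -1.0, -1.0],
    [1.0, -1.0, 1.0, -1.0],
]
noisy_frame = [0.1, 0.9, 0.2, -1.1]
index, score = best_inventory_match(noisy_frame, inventory)
```

In an enhancement setting, the noisy frame would then be replaced by `inventory[index]`; a practical system would search over many more frames and likely over spectral features rather than raw samples.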

Bucknell University

Effective post-processing for single-channel frequency-domain speech enhancement

Author: Li Weifeng
Publication venue: IDIAP
Publication date: 11/02/2010
Field of study

Conventional frequency-domain speech enhancement filters improve signal-to-noise ratio (SNR), but also produce speech distortions. This paper describes a novel post-processing algorithm devised to improve the quality of speech processed by a conventional filter. In the proposed algorithm, the speech distortion is first compensated by adding the original noisy speech, and then the noise is reduced by a post-filter. Experimental results on speech quality show the effectiveness of the proposed algorithm in lowering speech distortion. Based on our isolated word recognition experiments conducted in 15 real car environments, a relative word error rate (WER) reduction of 10.5% is obtained compared to the conventional filter.
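
For readers unfamiliar with the term, a relative WER reduction is the drop in WER divided by the baseline WER. The sketch below shows the arithmetic; the two WER values are hypothetical and chosen only to reproduce a 10.5% relative reduction, since the abstract does not report absolute figures.

```python
def relative_wer_reduction(baseline_wer, new_wer):
    """Fraction by which `new_wer` improves on `baseline_wer`."""
    if baseline_wer <= 0:
        raise ValueError("baseline WER must be positive")
    return (baseline_wer - new_wer) / baseline_wer


# Hypothetical values (not from the paper): 20.0% WER reduced to 17.9% WER.
reduction = relative_wer_reduction(20.0, 17.9)
print(f"{reduction:.1%}")
```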

Infoscience - École polytechnique fédérale de Lausanne

Evaluating the predictions of objective intelligibility metrics for modified and synthetic speech

Author: Cooke Martin
Tang Yan
Valentini-Botinhao Cassia
Publication venue: Elsevier BV
Publication date: 20/06/2015
Field of study

Several modification algorithms that alter natural or synthetic speech with the goal of improving intelligibility in noise have been proposed recently. A key requirement of many modification techniques is the ability to predict intelligibility, both offline during algorithm development, and online, in order to determine the optimal modification for the current noise context. While existing objective intelligibility metrics (OIMs) have good predictive power for unmodified natural speech in stationary and fluctuating noise, little is known about their effectiveness for other forms of speech. The current study evaluated how well seven OIMs predict listener responses in three large datasets of modified and synthetic speech which together represent 396 combinations of speech modification, masker type and signal-to-noise ratio. The chief finding is a clear reduction in predictive power for most OIMs when faced with modified and synthetic speech. Modifications introducing durational changes are particularly harmful to intelligibility predictors. OIMs that measure masked audibility tend to over-estimate intelligibility in the presence of fluctuating maskers relative to stationary maskers, while OIMs that estimate the distortion caused by the masker to a clean speech prototype exhibit the reverse pattern

University of Salford Institutional Repository

Edinburgh Research Explorer

Unsupervised Uncertainty Measures of Automatic Speech Recognition for Non-intrusive Speech Intelligibility Prediction

Author: Barker Jon
Ma Ning
Tu Zehai
Publication venue
Publication date: 08/04/2022
Field of study

Non-intrusive intelligibility prediction is important for its application in realistic scenarios, where a clean reference signal is difficult to access. The construction of many non-intrusive predictors requires either ground truth intelligibility labels or clean reference signals for supervised learning. In this work, we leverage an unsupervised uncertainty estimation method for predicting speech intelligibility, which does not require intelligibility labels or reference signals to train the predictor. Our experiments demonstrate that the uncertainty from state-of-the-art end-to-end automatic speech recognition (ASR) models is highly correlated with speech intelligibility. The proposed method is evaluated on two databases and the results show that the unsupervised uncertainty measures of ASR models are more correlated with speech intelligibility from listening results than the predictions made by widely used intrusive methods. Comment: Submitted to INTERSPEECH202
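
One common way to quantify the uncertainty of a recognizer is the average entropy of its output distributions. The sketch below assumes that measure and assumes the posteriors are already available as lists of probabilities; the abstract does not state which uncertainty measure the authors use, and `mean_entropy` is a hypothetical name.

```python
import math


def mean_entropy(posteriors):
    """Average Shannon entropy (in nats) over a sequence of probability distributions."""
    if not posteriors:
        raise ValueError("need at least one distribution")
    total = 0.0
    for dist in posteriors:
        # Terms with p == 0 contribute nothing to the entropy.
        total += -sum(p * math.log(p) for p in dist if p > 0.0)
    return total / len(posteriors)


# Toy posteriors (assumption): a confident decoder and an uncertain one.
confident = [[0.97, 0.01, 0.01, 0.01], [0.01, 0.97, 0.01, 0.01]]
uncertain = [[0.25, 0.25, 0.25, 0.25], [0.4, 0.3, 0.2, 0.1]]

low = mean_entropy(confident)
high = mean_entropy(uncertain)
```

Under this assumption, higher entropy would be read as lower predicted intelligibility, with no intelligibility labels or clean reference needed to compute it.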

arXiv.org e-Print Archive

Predicting speech intelligibility based on a correlation metric in the envelope power spectrum domain

Author: Christoph Scheidiger
Helia Relaño-Iborra
Johannes Zaar
Tobias May
Torsten Dau
Publication venue: Acoustical Society of America (ASA)
Publication date: 01/01/2016
Field of study

Crossref

Online Research Database In Technology

Play as You Like: Timbre-enhanced Multi-modal Music Style Transfer

Author: Chang Chia-Che
Lee Che-Rung
Lu Chien-Yu
Su Li
Xue Min-Xin
Publication venue
Publication date: 28/11/2018
Field of study

Style transfer of polyphonic music recordings is a challenging task when considering the modeling of diverse, imaginative, and reasonable music pieces in a style different from their original one. To achieve this, learning stable multi-modal representations for both domain-variant (i.e., style) and domain-invariant (i.e., content) information of music in an unsupervised manner is critical. In this paper, we propose an unsupervised music style transfer method without the need for parallel data. Besides, to characterize the multi-modal distribution of music pieces, we employ the Multi-modal Unsupervised Image-to-Image Translation (MUNIT) framework in the proposed system. This allows one to generate diverse outputs from the learned latent distributions representing contents and styles. Moreover, to better capture the granularity of sound, such as the perceptual dimensions of timbre and the nuance in instrument-specific performance, cognitively plausible features including mel-frequency cepstral coefficients (MFCC), spectral difference, and spectral envelope are combined with the widely used mel-spectrogram into a timbre-enhanced multi-channel input representation. The Relativistic average Generative Adversarial Network (RaGAN) is also utilized to achieve fast convergence and high stability. We conduct experiments on bilateral style transfer tasks among three different genres, namely piano solo, guitar solo, and string quartet. Results demonstrate the advantages of the proposed method in music style transfer with improved sound quality and in allowing users to manipulate the output.

arXiv.org e-Print Archive

Association for the Advancement of Artificial Intelligence: AAAI Publications

N-HANS: a neural network-based toolkit for in-the-wild audio enhancement

Author: Keren Gil
Liu Shuo
Parada-Cabaleiro Emilia
Schuller Björn
Publication venue: Springer Science and Business Media LLC
Publication date: 01/01/2021
Field of study

OPUS Augsburg

A non-intrusive method for estimating binaural speech intelligibility from noise-corrupted signals captured by a pair of microphones

Author: Cox TJ
Liu Q
Tang Y
Wang W
Publication venue: Elsevier BV
Publication date: 01/02/2018
Field of study

A non-intrusive method is introduced to predict binaural speech intelligibility in noise directly from signals captured using a pair of microphones. The approach combines signal processing techniques in blind source separation and localisation with an intrusive objective intelligibility measure (OIM). Therefore, unlike classic intrusive OIMs, this method requires neither a clean reference speech signal nor knowledge of the source locations to operate. The proposed approach is able to estimate intelligibility in stationary and fluctuating noises, when the noise masker is presented as a point or diffused source, and is spatially separated from the target speech source on a horizontal plane. The performance of the proposed method was evaluated in two rooms. When predicting subjective intelligibility measured as word recognition rate, this method showed reasonable predictive accuracy with correlation coefficients above 0.82, which is comparable to that of a reference intrusive OIM in most of the conditions. The proposed approach offers a solution for fast binaural intelligibility prediction, and therefore has practical potential to be deployed in situations where on-site speech intelligibility is a concern.

University of Salford Institutional Repository

University of Surrey

Surrey Research Insight

A computational model of human auditory signal processing and perception

Author: Dau Torsten
Ewert Stephan D.
Jepsen Morten Løve
Publication venue: Acoustical Society of America (ASA)
Publication date: 01/01/2008
Field of study

Online Research Database In Technology