7 research outputs found

    Speech Recognition

    Get PDF
    Chapters in the first part of the book cover all the essential speech processing techniques for building robust automatic speech recognition systems: the representation of speech signals and methods for speech feature extraction, acoustic and language modeling, efficient algorithms for searching the hypothesis space, and multimodal approaches to speech recognition. The last part of the book is devoted to other speech processing applications that can use the information from automatic speech recognition for speaker identification and tracking, for prosody modeling in emotion-detection systems, and in other speech processing applications that operate in real-world environments, such as mobile communication services and smart homes.

    Robust Speech Recognition with Two-dimensional Frame-and-feature Weighting and Modulation Spectrum Normalization

    No full text
    This thesis first gives an overview of common approaches to robust speech recognition, such as cepstral mean subtraction (CMS), cepstral mean and variance normalization (CMVN), histogram equalization (HEQ), and higher-order cepstral moment normalization (HOCMN). It also introduces Aurora 2 and Aurora 4, the internationally adopted benchmark corpora in this field, and reports baseline experimental results on both.
    Entropy-based feature weighting and confusion-based feature weighting take the discriminative power of the individual feature parameters into account: during recognition, different parameters are given different weights so that the more discriminative ones are emphasized, and a confusion matrix is further used to model the errors that are likely to occur between phonemes. These methods can be applied directly to Mel-frequency cepstral coefficients and can also be combined with many existing robustness techniques.
    SVM-based frame weighting uses a Support Vector Machine classifier, driven by the energy distribution and harmonicity estimation of each frame, to divide the test data into reliable and unreliable frames. Because recognition should rely more heavily on the reliable frames, the SVM scores are used to give larger weights to reliable frames and smaller weights to unreliable ones.
    Finally, confusion-based feature weighting and SVM-based frame weighting are combined into two-dimensional frame-and-feature weighted Viterbi decoding, which assigns different weights to different feature parameters and different weights to different frames. This combination inherits the advantages of both methods and yields further improvements.
    In this paper we propose a new approach of two-dimensional frame-and-feature weighted Viterbi decoding performed at the recognizer back-end for robust speech recognition. The frame weighting is based on a Support Vector Machine (SVM) classifier considering the energy distribution and cross-correlation spectrum of the frame; the basic idea is that voiced frames with higher harmonicity are in general more reliable than other frames in noisy speech and should therefore be weighted higher. The feature weighting is based on an entropy measure considering the confusion between phoneme classes; the basic idea is that scores obtained with more discriminating features, which cause less confusion between phonemes, should be weighted higher. These two weighting schemes over the two dimensions, frames and features, are then properly integrated in Viterbi decoding. Very significant improvements were achieved in extensive experiments performed with the Aurora 4 testing environment for all noise types and all SNR values.
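    As an illustration of how the two weighting schemes can enter the decoder, below is a minimal NumPy sketch of frame-and-feature weighted Viterbi decoding for an HMM with diagonal-Gaussian emissions. This is not the authors' implementation: the function name, the emission model, and the way the weights are supplied (per-frame weights, e.g. from an SVM reliability score, and per-dimension feature weights, e.g. from an entropy or confusion measure) are assumptions made for the example.

        import numpy as np

        def weighted_viterbi(obs, log_A, log_pi, means, variances,
                             frame_weights, feature_weights):
            """Viterbi decoding with per-frame and per-feature weights (sketch).

            obs              : (T, D) observation vectors (e.g. MFCCs)
            log_A            : (S, S) log transition probabilities
            log_pi           : (S,)   log initial state probabilities
            means, variances : (S, D) diagonal-Gaussian emission parameters
            frame_weights    : (T,)   weight for each frame (hypothetical SVM-based score)
            feature_weights  : (D,)   weight for each feature dimension (hypothetical entropy-based)
            """
            T, D = obs.shape
            S = log_A.shape[0]

            def emission(t):
                # Per-dimension Gaussian log-likelihoods, weighted per feature,
                # summed over dimensions, then scaled by the frame weight.
                diff2 = (obs[t] - means) ** 2 / variances                 # (S, D)
                ll_d = -0.5 * (np.log(2 * np.pi * variances) + diff2)     # (S, D)
                return frame_weights[t] * (ll_d * feature_weights).sum(axis=1)

            delta = log_pi + emission(0)
            back = np.zeros((T, S), dtype=int)
            for t in range(1, T):
                scores = delta[:, None] + log_A      # rows: previous state, cols: current state
                back[t] = scores.argmax(axis=0)
                delta = scores.max(axis=0) + emission(t)

            # Backtrace the best state sequence.
            path = np.empty(T, dtype=int)
            path[-1] = delta.argmax()
            for t in range(T - 1, 0, -1):
                path[t - 1] = back[t, path[t]]
            return path

    In this formulation a frame weight near zero lets an unreliable frame contribute little to the accumulated path score, while a large feature weight lets a discriminative cepstral dimension dominate the emission likelihood, which matches the intuition described in the abstract.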

    Modulation Spectrum Equalization for Improved Robust Speech Recognition

    No full text
    Temporal filtering is a common and highly effective technique in robust speech recognition. Well-established examples include the RASTA (RelAtive SpecTrA) filter and the data-driven temporal filters designed with principal component analysis (PCA) or linear discriminant analysis (LDA). These techniques design filters over the time trajectories, or equivalently the modulation spectrum, of the speech feature parameters so that noise in the speech signal can be effectively suppressed. Their drawback, however, is that they cannot adapt to the external noise environment, and therefore rarely perform well under all noise conditions.
    The modulation spectrum equalization approaches proposed in this thesis can instead be viewed as adaptive temporal filters: a different filter frequency response is obtained for each utterance recorded in a different noise environment, so recognition can be effectively improved across a wide range of noise conditions. In all of these approaches, the time trajectories of the feature parameters are first transformed into the modulation spectrum with the Fourier transform, and the proposed techniques operate directly on the distribution of the signal over the modulation spectrum. In spectral histogram equalization (SHE), a reference distribution is first estimated from the modulation spectra of clean training data, and the modulation spectrum distribution of each test utterance is then equalized to this reference. In two-band spectral histogram equalization (2B-SHE), we exploit the fact that the lower and higher modulation frequencies usually carry different speech information, and equalize the low- and high-frequency parts of each test utterance to separate reference distributions, which yields better recognition results than SHE. In magnitude ratio equalization (MRE), the magnitude ratio of each test utterance on the modulation spectrum is equalized to a reference value computed from clean training data.
    Experiments on the English connected-digit corpus (Aurora 2) and the English large-vocabulary corpus (Aurora 4) show that the proposed techniques clearly improve recognition accuracy over conventional temporal filters, and that they can also be effectively combined with well-known cepstral normalization methods for further gains. Beyond the recognition results, we analyze the reasons for these improvements from several viewpoints, including the filter responses derived by these techniques, the behavior of noise on the modulation spectrum, the accuracy for different phoneme types, and distances on the modulation spectrum.
    We propose novel approaches for equalizing the modulation spectrum for robust feature extraction in speech recognition. In these approaches, the temporal trajectories of the feature parameters are first transformed into the magnitude modulation spectrum. In spectral histogram equalization (SHE) and two-band spectral histogram equalization (2B-SHE), we equalize the histogram of the modulation spectrum for each utterance to a reference histogram obtained from clean training data, or perform this equalization separately over two sub-bands of the modulation spectrum. In magnitude ratio equalization (MRE), we define the magnitude ratio of the lower to the higher modulation frequency components for each utterance and equalize it to a reference value obtained from clean training data. These approaches can be viewed as temporal filters that are adapted to each testing utterance. Experiments performed on the Aurora 2 and Aurora 4 corpora for small- and large-vocabulary tasks indicate that significant performance improvements are achievable for all noise conditions (additive or convolutional, different noise types, and different SNR values). We also show that additional improvements are obtainable when these approaches are integrated with cepstral mean and variance normalization (CMVN), histogram equalization (HEQ), or higher-order cepstral moment normalization (HOCMN). We analyze and discuss the reasons why such improvements are achievable from different viewpoints and with different sets of data, including adaptive temporal filtering, noise behavior on the modulation spectrum, phoneme types, and modulation spectrum distance.
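    To make the idea behind magnitude ratio equalization more concrete, the following is a minimal sketch for a single feature trajectory. The 2 Hz band split, the frame rate, and the choice to rescale only the high band are assumptions made for the example; the actual definition of the bands and of the ratio in this work may differ.

        import numpy as np

        def magnitude_ratio_equalize(trajectory, ref_ratio, split_hz=2.0, frame_rate=100.0):
            """Equalize the low/high magnitude ratio of one feature trajectory (sketch).

            trajectory : 1-D array, time trajectory of one cepstral coefficient
            ref_ratio  : reference low/high magnitude ratio from clean training data
            split_hz   : assumed modulation frequency separating the two bands
            frame_rate : feature frames per second (e.g. 100 for a 10 ms frame shift)
            """
            spectrum = np.fft.rfft(trajectory)
            magnitude = np.abs(spectrum)
            phase = np.angle(spectrum)

            freqs = np.fft.rfftfreq(len(trajectory), d=1.0 / frame_rate)
            low = freqs <= split_hz
            high = ~low

            # Current ratio of low- to high-band magnitude for this utterance.
            ratio = magnitude[low].sum() / (magnitude[high].sum() + 1e-12)

            # Rescale the high band so the utterance matches the reference ratio,
            # then rebuild the trajectory with the original phase.
            magnitude[high] *= ratio / ref_ratio
            equalized = magnitude * np.exp(1j * phase)
            return np.fft.irfft(equalized, n=len(trajectory))

    The reference ratio would be computed once from clean training trajectories with the same band split. Because only the magnitudes are rescaled while the phase is kept, the operation amounts to applying an utterance-specific temporal filter, which corresponds to the adaptive-filter interpretation given in the abstract.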