2 research outputs found
Automatic robust classification of speech using analytical feature techniques
This document reports on research carried out in the field of automatic speech classification during a stay at the Sony CSL laboratory for the final-year degree project. The work explores the capabilities of the EDS system, developed at Sony CSL, for recognising a small number of isolated words, independently of the speaker and in the presence of background noise. EDS automatically constructs features for audio classification problems. It does so through the (functional) composition of mathematical and signal-processing operators. These features are therefore called analytical features, and the system builds them specifically for each audio classification problem, which is presented in the form of a training and test database.
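To make the idea of building a feature by functional composition of operators concrete, here is a minimal Python sketch. It is not EDS itself: the operator names (`hann_window`, `magnitude_spectrum`, `log_compress`) and the `compose` helper are hypothetical, chosen only for illustration, and the abstract does not specify EDS's actual operator set or how it searches over compositions.

```python
from functools import reduce

import numpy as np


def compose(*ops):
    """Right-to-left composition: compose(f, g)(x) == f(g(x))."""
    return lambda signal: reduce(lambda value, op: op(value), reversed(ops), signal)


# Hypothetical elementary operators (assumptions, not EDS's real operator set).
def hann_window(x):
    return x * np.hanning(len(x))


def magnitude_spectrum(x):
    return np.abs(np.fft.rfft(x))


def log_compress(x):
    return np.log1p(x)


# Two candidate features, each a composition of the operators above.
peak_bin_feature = compose(np.argmax, magnitude_spectrum, hann_window)
mean_log_spectrum_feature = compose(np.mean, log_compress, magnitude_spectrum, hann_window)

# A 50 Hz tone sampled at 1 kHz for one second (1 Hz per FFT bin).
sample_rate = 1000
t = np.arange(sample_rate) / sample_rate
tone = np.sin(2 * np.pi * 50 * t)

print(int(peak_bin_feature(tone)))
print(float(mean_log_spectrum_feature(tone)))
```

Each feature maps a raw signal to a single number, so a system of this kind could in principle score many such compositions against a labelled training set and keep the most discriminative ones.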
Speaker-Invariant Features for Automatic Speech Recognition
In this paper, we consider the generation of features for automatic speech recognition (ASR) that are robust to speaker variations. One of the major causes of performance degradation in ASR systems is inter-speaker variation. These variations are commonly modeled by a pure scaling relation between the spectra of speakers enunciating the same sound. Current state-of-the-art ASR systems therefore address speaker variability with a brute-force search for the optimal scaling parameter. This procedure, known as vocal-tract length normalization (VTLN), is computationally intensive. We have recently used the Scale Transform (a variant of the Mellin transform) to generate features that are robust to speaker variations without the need to search for the scaling parameter. However, these features perform worse because phase information is lost. In this paper, we propose to use the magnitude of the Scale Transform together with a pre-computed “phase” vector for each phoneme to generate speaker-invariant features. We compare the performance of the proposed features with conventional VTLN on a phoneme recognition task.
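The scale-invariance property that the abstract relies on can be checked numerically. The sketch below is an illustration under stated assumptions, not the paper's feature pipeline: the two-peak `toy_spectrum`, the warping factor `alpha = 1.2`, the frequency range, and the grid size are all hypothetical. It approximates the Scale Transform by sampling the spectrum on a logarithmic frequency axis, weighting by the square root of frequency, and taking an FFT, so that a pure scaling of the frequency axis becomes a shift that the FFT magnitude ignores.

```python
import numpy as np


def toy_spectrum(freq_hz):
    """Synthetic two-peak magnitude spectrum (hypothetical, formant-like)."""
    return (np.exp(-0.5 * ((freq_hz - 500.0) / 60.0) ** 2)
            + 0.6 * np.exp(-0.5 * ((freq_hz - 1500.0) / 150.0) ** 2))


def scale_transform_magnitude(spectrum_fn, f_min=50.0, f_max=8000.0, n=1024):
    """Magnitude of a discrete approximation of the Scale Transform."""
    # Uniform grid in x = log(frequency); scaling in frequency is a shift in x.
    x = np.linspace(np.log(f_min), np.log(f_max), n, endpoint=False)
    # The exp(x / 2) weight corresponds to the t^(-1/2) kernel factor.
    h = spectrum_fn(np.exp(x)) * np.exp(x / 2.0)
    return np.abs(np.fft.rfft(h))


alpha = 1.2  # assumed speaker-dependent scaling factor


def warped_spectrum(freq_hz):
    """Energy-normalised, frequency-scaled copy of the toy spectrum."""
    return np.sqrt(alpha) * toy_spectrum(alpha * freq_hz)


mag_ref = scale_transform_magnitude(toy_spectrum)
mag_warped = scale_transform_magnitude(warped_spectrum)
max_diff = float(np.max(np.abs(mag_ref - mag_warped)))

print("largest magnitude:", float(np.max(mag_ref)))
print("largest difference after warping:", max_diff)
```

In this sketch the scaling factor only changes the phase of the transform, which is why discarding the phase removes the dependence on `alpha` but also discards information about the spectrum itself.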