2 research outputs found

    Automatic robust classification of speech using analytical feature techniques

    This document is the report of research carried out in the field of automatic speech classification during a stay at the Sony CSL laboratory, as a final-year degree project. The work explores the ability of the EDS system, developed at Sony CSL, to solve the problem of recognizing a small vocabulary of isolated words, independently of the speaker and in the presence of background noise. EDS automatically builds features for audio classification problems. It does so by (functionally) composing mathematical and signal-processing operators; the resulting features are therefore called analytical features, and the system constructs them specifically for each audio classification problem, which is presented to it in the form of a training and test database.
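
    The sketch below illustrates the idea of an analytical feature as a functional composition of signal-processing operators. It is not the EDS implementation; the operators (frames, spectrum, centroid), their parameters, and the final composition are illustrative choices.

    import numpy as np

    def frames(signal, size=512, hop=256):
        """Split a 1-D signal into overlapping frames."""
        n = 1 + max(0, (len(signal) - size) // hop)
        return np.stack([signal[i * hop : i * hop + size] for i in range(n)])

    def spectrum(x):
        """Magnitude spectrum of each frame."""
        return np.abs(np.fft.rfft(x, axis=-1))

    def centroid(spec):
        """Per-frame spectral centroid (in bins)."""
        bins = np.arange(spec.shape[-1])
        return (spec * bins).sum(axis=-1) / (spec.sum(axis=-1) + 1e-12)

    def compose(*ops):
        """Functional composition: compose(f, g)(x) computes f(g(x))."""
        def feature(x):
            for op in reversed(ops):
                x = op(x)
            return x
        return feature

    # One candidate analytical feature: Mean(Centroid(Spectrum(Frames(signal)))).
    analytical_feature = compose(np.mean, centroid, spectrum, frames)

    signal = np.random.randn(16000)          # stand-in for a 1 s utterance
    print(analytical_feature(signal))        # a single scalar feature value

    In EDS such compositions are generated and evaluated automatically against the training data for each classification problem; here one composition is simply fixed by hand to show the structure.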

    Speaker-Invariant Features for Automatic Speech Recognition

    In this paper, we consider the generation of features for automatic speech recognition (ASR) that are robust to speaker variations. Inter-speaker variation is one of the major causes of degradation in the performance of ASR systems. It is commonly modeled as a pure scaling relation between the spectra of different speakers enunciating the same sound, so current state-of-the-art ASR systems handle speaker variability with a brute-force search for the optimal scaling parameter. This procedure, known as vocal-tract length normalization (VTLN), is computationally intensive. We have recently used the Scale Transform (a variant of the Mellin transform) to generate features that are robust to speaker variations without the need to search for the scaling parameter. However, these features perform worse because phase information is lost. In this paper, we propose to use the magnitude of the Scale Transform together with a pre-computed "phase" vector for each phoneme to generate speaker-invariant features. We compare the performance of the proposed features with conventional VTLN on a phoneme recognition task.
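
    The following is a minimal sketch of why the scale-transform magnitude is insensitive to spectral scaling, assuming the common log-frequency-warping implementation of the Scale Transform. The function name, grid resolution, and unit-norm normalization are illustrative assumptions, not the paper's method; in particular, the paper's per-phoneme "phase" vector is not modeled here.

    import numpy as np

    def scale_transform_magnitude(spec, freqs, n_out=64):
        """Unit-norm |scale transform| of a magnitude spectrum.

        spec:  magnitude spectrum sampled at `freqs` (Hz, all > 0, increasing).
        Returns the first `n_out` scale-magnitude coefficients.
        """
        # Resample onto a uniform log-frequency grid, so a multiplicative
        # frequency scaling X(a*f) becomes an additive shift along the axis.
        log_f = np.linspace(np.log(freqs[0]), np.log(freqs[-1]), len(freqs))
        warped = np.interp(log_f, np.log(freqs), spec)
        # The Mellin/scale kernel t^{-jc-1/2} contributes a sqrt(f) weighting,
        # which becomes an exp(u/2) envelope after the substitution f = e^u.
        warped = warped * np.exp(0.5 * log_f)
        # Fourier transform along log-frequency: shifts (i.e. speaker scaling)
        # move into the phase, so the magnitude is approximately invariant.
        mag = np.abs(np.fft.rfft(warped))[:n_out]
        # Normalize away the residual a^(-1/2) amplitude factor left by scaling.
        return mag / (np.linalg.norm(mag) + 1e-12)

    # Demo: two spectra related by a 20% frequency scaling yield nearly the
    # same feature (equality is approximate: interpolation, band-edge effects).
    freqs = np.linspace(50.0, 8000.0, 512)
    spec_a = np.exp(-((freqs - 1000.0) / 300.0) ** 2)    # a formant-like peak
    spec_b = np.interp(1.2 * freqs, freqs, spec_a)       # same shape, rescaled
    fa = scale_transform_magnitude(spec_a, freqs)
    fb = scale_transform_magnitude(spec_b, freqs)
    print(np.linalg.norm(fa - fb) / np.linalg.norm(fa))  # small relative error

    Discarding the phase is exactly what removes the shift (speaker) information, which is why the abstract reports a performance loss for magnitude-only features and motivates reintroducing a pre-computed per-phoneme "phase" vector.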