8,901 research outputs found
Frame-level features conveying phonetic information for language and speaker recognition
150 p.This Thesis, developed in the Software Technologies Working Group of the Departmentof Electricity and Electronics of the University of the Basque Country, focuseson the research eld of spoken language and speaker recognition technologies.More specically, the research carried out studies the design of a set of featuresconveying spectral acoustic and phonotactic information, searches for the optimalfeature extraction parameters, and analyses the integration and usage of the featuresin language recognition systems, and the complementarity of these approacheswith regard to state-of-the-art systems. The study reveals that systems trained onthe proposed set of features, denoted as Phone Log-Likelihood Ratios (PLLRs), arehighly competitive, outperforming in several benchmarks other state-of-the-art systems.Moreover, PLLR-based systems also provide complementary information withregard to other phonotactic and acoustic approaches, which makes them suitable infusions to improve the overall performance of spoken language recognition systems.The usage of this features is also studied in speaker recognition tasks. In this context,the results attained by the approaches based on PLLR features are not as remarkableas the ones of systems based on standard acoustic features, but they still providecomplementary information that can be used to enhance the overall performance ofthe speaker recognition systems
Two-pass Decoding and Cross-adaptation Based System Combination of End-to-end Conformer and Hybrid TDNN ASR Systems
Fundamental modelling differences between hybrid and end-to-end (E2E)
automatic speech recognition (ASR) systems create large diversity and
complementarity among them. This paper investigates multi-pass rescoring and
cross adaptation based system combination approaches for hybrid TDNN and
Conformer E2E ASR systems. In multi-pass rescoring, state-of-the-art hybrid
LF-MMI trained CNN-TDNN system featuring speed perturbation, SpecAugment and
Bayesian learning hidden unit contributions (LHUC) speaker adaptation was used
to produce initial N-best outputs before being rescored by the speaker adapted
Conformer system using a 2-way cross system score interpolation. In cross
adaptation, the hybrid CNN-TDNN system was adapted to the 1-best output of the
Conformer system or vice versa. Experiments on the 300-hour Switchboard corpus
suggest that the combined systems derived using either of the two system
combination approaches outperformed the individual systems. The best combined
system obtained using multi-pass rescoring produced statistically significant
word error rate (WER) reductions of 2.5% to 3.9% absolute (22.5% to 28.9%
relative) over the stand alone Conformer system on the NIST Hub5'00, Rt03 and
Rt02 evaluation data.Comment: It' s accepted to ISCA 202
- …