One of the main problems faced by automatic speech recognition is the variability of
the testing conditions. This is due both to the acoustic conditions (different transmission
channels, recording devices, noise, etc.) and to the variability of speech
across different speakers (e.g. different accents, coarticulation of phonemes
and different vocal tract characteristics). Vocal tract length normalisation (VTLN)
aims to normalise the acoustic signal, making it independent of the vocal tract
length. This is done by a speaker-specific warping of the frequency axis, parameterised
by a warping factor. In this thesis the application of VTLN to multiparty
conversational speech was investigated, focusing on the meeting domain. This
is a challenging task, exhibiting great variability of the speech acoustics both across
different speakers and, for a given speaker, across time. VTL, the distance between
the lips and the glottis, varies over time. We observed that the warping factors estimated
using maximum likelihood appear to be context dependent: they seem to be
influenced by the current conversational partner and are correlated with the behaviour
of formant positions and pitch. This is because VTL also influences the
frequency of vibration of the vocal cords, and thus the pitch. In this thesis we also
investigated pitch-adaptive acoustic features with the goal of further improving the
speaker normalisation provided by VTLN.
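The speaker-specific frequency warping can be sketched as a piecewise-linear map, a common parameterisation of VTLN. The following is a minimal illustration only, not the exact implementation used in this work; the cut-off fraction, default sample range and function names are assumptions:

```python
import numpy as np

def vtln_warp(freqs, alpha, f_max=8000.0, cut=0.85):
    """Piecewise-linear VTLN warping of the frequency axis.

    alpha is the speaker-specific warping factor: alpha > 1
    compresses the spectrum (as for a longer vocal tract),
    alpha < 1 stretches it. Below a cut-off the warp is linear
    with slope 1/alpha; above it, a second linear segment keeps
    the Nyquist frequency f_max fixed.
    """
    freqs = np.asarray(freqs, dtype=float)
    knee = cut * f_max  # boundary between the two linear segments
    warped_knee = knee / alpha
    return np.where(
        freqs <= knee,
        freqs / alpha,
        # map [knee, f_max] linearly onto [knee / alpha, f_max]
        warped_knee + (freqs - knee) * (f_max - warped_knee) / (f_max - knee),
    )
```

An unwarped speaker corresponds to alpha = 1 (the identity map); in practice the factor is chosen per speaker by maximising the likelihood of that speaker's data under the acoustic model, typically via a grid search over alpha.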
We explored the use of acoustic features obtained using a pitch-adaptive analysis
in combination with conventional features such as Mel-frequency cepstral coefficients.
These spectral representations were combined both at the acoustic feature
level, using heteroscedastic linear discriminant analysis (HLDA), and at the system
level, using ROVER. We evaluated this approach on a challenging large-vocabulary
speech recognition task: multiparty meeting transcription. We found that VTLN
benefits the most from pitch-adaptive features. Our experiments also suggested that
combining conventional and pitch-adaptive acoustic features using HLDA results in
a consistent, significant decrease in the word error rate across all the tasks. Combining
at the system level using ROVER resulted in a further significant improvement.
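Feature-level combination of the two representations can be sketched as concatenation followed by a class-discriminative, dimensionality-reducing linear projection. The sketch below uses plain LDA as a simplified stand-in for HLDA (which additionally models class-dependent covariances and is estimated by maximum likelihood); the dimensionalities, toy data and function names are illustrative assumptions:

```python
import numpy as np

def lda_projection(feats, labels, out_dim):
    """Most-discriminative linear projection from the eigenvectors
    of Sw^{-1} Sb (within- and between-class scatter matrices).
    Plain LDA, used here as a simplified stand-in for HLDA."""
    mu = feats.mean(axis=0)
    d = feats.shape[1]
    Sw, Sb = np.zeros((d, d)), np.zeros((d, d))
    for c in np.unique(labels):
        fc = feats[labels == c]
        mc = fc.mean(axis=0)
        Sw += (fc - mc).T @ (fc - mc)
        diff = (mc - mu)[:, None]
        Sb += len(fc) * (diff @ diff.T)
    vals, vecs = np.linalg.eig(np.linalg.pinv(Sw) @ Sb)
    order = np.argsort(-vals.real)
    return vecs[:, order[:out_dim]].real

np.random.seed(0)
mfcc = np.random.randn(200, 13)         # conventional features (toy data)
pitch_adap = np.random.randn(200, 13)   # pitch-adaptive features (toy data)
labels = np.random.randint(0, 4, 200)   # e.g. phone-class labels (toy)
combined = np.hstack([mfcc, pitch_adap])          # 26-dimensional
W = lda_projection(combined, labels, out_dim=13)  # 26 x 13 projection
projected = combined @ W                          # back to 13 dimensions
```

Concatenating the two feature streams before projection lets the transform keep whichever directions of the joint space best separate the classes, rather than committing in advance to either representation.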
Further experiments compared the use of a pitch-adaptive spectral representation with
the adoption of a smoothed spectrogram for the extraction of cepstral coefficients.
It was found that pitch-adaptive spectral analysis, by providing a representation that
is less affected by pitch artefacts (especially for high-pitched speakers), delivers features with improved speaker independence. Furthermore, this was also shown to
be advantageous when HLDA is applied. The combination of a pitch-adaptive spectral
representation and VTLN-based speaker normalisation in the context of LVCSR
for multiparty conversational speech led to more speaker-independent acoustic models,
improving the overall recognition performance.
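The system-level combination can be illustrated by ROVER's voting stage. Real ROVER first aligns the hypotheses of the component systems into a word transition network via dynamic programming; the sketch below assumes that alignment is already given (with "@" marking null arcs) and shows only a simple majority vote, with illustrative names and data:

```python
from collections import Counter

def rover_vote(aligned_hyps):
    """Word-level majority vote over pre-aligned hypotheses.

    A simplified sketch of ROVER's voting stage: each position
    ("slot") of the alignment is resolved to its most frequent
    word, and null arcs ("@") chosen by the vote are dropped.
    """
    output = []
    for slot in zip(*aligned_hyps):
        word, _ = Counter(slot).most_common(1)[0]
        if word != "@":  # a null arc won the vote: emit nothing
            output.append(word)
    return output

# Three toy system outputs, already aligned slot by slot.
hyps = [
    ["the", "meeting", "starts",  "@",  "now"],
    ["the", "meeting", "started", "@",  "now"],
    ["a",   "meeting", "starts",  "at", "noon"],
]
print(rover_vote(hyps))  # ['the', 'meeting', 'starts', 'now']
```

Because the component systems make partly uncorrelated errors, the vote can recover words that any single system got wrong, which is why system-level combination yields a further improvement over the individual systems.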