A practical two-stage training strategy for multi-stream end-to-end speech recognition
The multi-stream paradigm of audio processing, in which several sources are
simultaneously considered, has been an active research area for information
fusion. Our previous study offered a promising direction within end-to-end
automatic speech recognition, where parallel encoders aim to capture diverse
information followed by a stream-level fusion based on attention mechanisms to
combine the different views. However, with an increasing number of streams
resulting in an increasing number of encoders, the previous approach could
require substantial memory and massive amounts of parallel data for joint
training. In this work, we propose a practical two-stage training scheme.
Stage-1 is to train a Universal Feature Extractor (UFE), where encoder outputs
are produced from a single-stream model trained with all data. Stage-2
formulates a multi-stream scheme intending to solely train the attention fusion
module using the UFE features and pretrained components from Stage-1.
Experiments have been conducted on two datasets, DIRHA and AMI, as a
multi-stream scenario. Compared with our previous method, this strategy
achieves relative word error rate reductions of 8.2–32.4%, while consistently
outperforming several conventional combination methods.
Comment: submitted to ICASSP 201
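As a rough illustration of the two-stage idea, the sketch below (PyTorch-style; the stand-in UFE encoder, the StreamAttentionFusion module, and all shapes are illustrative assumptions rather than the authors' code) freezes a single-stream encoder trained in Stage-1 and trains only a stream-level attention fusion module in Stage-2.

```python
import torch
import torch.nn as nn

class StreamAttentionFusion(nn.Module):
    """Stage-2 module: attention weights over parallel stream encodings."""
    def __init__(self, d_model):
        super().__init__()
        self.score = nn.Linear(d_model, 1)

    def forward(self, streams):                   # streams: (n_streams, T, d_model)
        w = torch.softmax(self.score(streams).squeeze(-1), dim=0)   # (n_streams, T)
        return (w.unsqueeze(-1) * streams).sum(dim=0)               # (T, d_model)

# Stage 1: a single-stream encoder (the "UFE") is trained on all data, then frozen.
ufe = nn.GRU(input_size=80, hidden_size=256, batch_first=True)      # stand-in encoder
for p in ufe.parameters():
    p.requires_grad = False

# Stage 2: only the fusion module (and any downstream parts) receives gradients.
fusion = StreamAttentionFusion(d_model=256)
optimizer = torch.optim.Adam(fusion.parameters(), lr=1e-3)

streams = [torch.randn(1, 120, 80) for _ in range(2)]               # two parallel inputs
encoded = torch.stack([ufe(x)[0].squeeze(0) for x in streams])      # (2, 120, 256)
fused = fusion(encoded)                            # passed on to the decoder / CTC
```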
Exploration on HuBERT with Multiple Resolutions
Hidden-unit BERT (HuBERT) is a widely-used self-supervised learning (SSL)
model in speech processing. However, we argue that its fixed 20ms resolution
for hidden representations would not be optimal for various speech-processing
tasks since their attributes (e.g., speaker characteristics and semantics) are
based on different time scales. To address this limitation, we propose
utilizing HuBERT representations at multiple resolutions for downstream tasks.
We explore two approaches, namely the parallel and hierarchical approaches, for
integrating HuBERT features with different resolutions. Through experiments, we
demonstrate that HuBERT with multiple resolutions outperforms the original
model. This highlights the potential of utilizing multiple resolutions in SSL
models like HuBERT to capture diverse information from speech signals.
Comment: Accepted to Interspeech202
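As a rough sketch of the parallel approach, the snippet below (purely illustrative tensor operations, not the paper's implementation) assumes hypothetical feature tensors from HuBERT models running at 20 ms and 40 ms frame shifts, upsamples the coarser stream back to the 20 ms rate, and concatenates the two before a projection for the downstream task.

```python
import torch
import torch.nn as nn

T20, D = 200, 768                       # 200 frames at 20 ms, 768-dim features
feats_20ms = torch.randn(1, T20, D)     # hypothetical 20 ms HuBERT features
feats_40ms = torch.randn(1, T20 // 2, D)  # hypothetical 40 ms HuBERT features

# Bring the 40 ms stream back to the 20 ms rate by repeating each frame twice,
# then concatenate along the feature dimension and project for the downstream task.
up_40ms = feats_40ms.repeat_interleave(2, dim=1)       # (1, 200, 768)
parallel = torch.cat([feats_20ms, up_40ms], dim=-1)    # (1, 200, 1536)
proj = nn.Linear(2 * D, D)
downstream_input = proj(parallel)                      # (1, 200, 768)
```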
An Efficient and Robust Multi-Stream Framework for End-to-End Speech Recognition
In voice-enabled domestic or meeting environments, distributed microphone arrays aim to transcribe distant-speech interaction into text with high accuracy.
However, with dynamic corruption from noise, reverberation, or human movement, there is no guarantee that any single microphone array (stream) is constantly informative. In these cases, an appropriate strategy to dynamically fuse the streams is necessary.
The multi-stream paradigm in Automatic Speech Recognition (ASR) considers scenarios where parallel streams carry diverse or complementary task-related knowledge. Such streams could be microphone arrays, frequency bands, different modalities, and so on. Hence, robust stream fusion is crucial to emphasize informative streams over corrupted ones, especially under unseen conditions. This thesis focuses on improving the performance and robustness of speech recognition in multi-stream scenarios.
With the increasing use of Deep Neural Networks (DNNs) in ASR, End-to-End (E2E) approaches, which directly transcribe human speech into text, have received greater attention. In this thesis, a multi-stream framework is presented based on the joint Connectionist Temporal Classification/Attention (CTC/ATT) E2E model, where parallel streams are represented by separate encoders. On top of the regular attention networks, a secondary stream-fusion network is introduced to steer the decoder toward the most informative streams.
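A minimal sketch of such hierarchical fusion is given below, assuming a PyTorch-style implementation; the module and attribute names are illustrative assumptions, not the thesis code. Each stream first receives frame-level attention conditioned on the decoder state, and a secondary attention over the resulting per-stream contexts yields the fused context passed to the decoder.

```python
import torch
import torch.nn as nn

class HierarchicalStreamAttention(nn.Module):
    def __init__(self, d_enc, d_dec):
        super().__init__()
        self.frame_score = nn.Linear(d_enc + d_dec, 1)   # frame-level attention per stream
        self.stream_score = nn.Linear(d_enc + d_dec, 1)  # secondary stream-level attention

    def forward(self, enc_outs, dec_state):
        # enc_outs: list of (T_i, d_enc), one per stream; dec_state: (d_dec,)
        contexts = []
        for h in enc_outs:
            q = dec_state.expand(h.size(0), -1)
            a = torch.softmax(self.frame_score(torch.cat([h, q], dim=-1)).squeeze(-1), dim=0)
            contexts.append((a.unsqueeze(-1) * h).sum(dim=0))          # (d_enc,)
        ctx = torch.stack(contexts)                                     # (n_streams, d_enc)
        q = dec_state.expand(ctx.size(0), -1)
        b = torch.softmax(self.stream_score(torch.cat([ctx, q], dim=-1)).squeeze(-1), dim=0)
        return (b.unsqueeze(-1) * ctx).sum(dim=0)                       # fused context (d_enc,)

attn = HierarchicalStreamAttention(d_enc=256, d_dec=128)
fused = attn([torch.randn(120, 256), torch.randn(95, 256)], torch.randn(128))
```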
The MEM-Array model aims at improving far-field ASR robustness using microphone arrays, each handled by a separate encoder. Since an increasing number of streams (encoders) requires substantial memory and massive amounts of parallel data, a practical two-stage training strategy is designed to address these issues. Furthermore, a two-stage augmentation scheme is presented to improve the robustness of the multi-stream model. In MEM-Res, two heterogeneous encoders with different architectures, temporal resolutions, and separate CTC networks work in parallel to extract complementary information from the same acoustics. Compared with the best single-stream performance, both models achieve substantial improvements, outperforming alternative fusion strategies.
While the proposed framework optimizes the use of information in multi-stream scenarios, this thesis also studies Performance Monitoring (PM) measures that predict whether the recognition results of an E2E model are reliable without ground-truth knowledge. Four PM techniques are investigated, suggesting that PM measures based on attention distributions and decoder posteriors are well correlated with true performance.
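One PM measure of the kind mentioned above can be sketched as the average entropy of the decoder posteriors along a hypothesis, with lower entropy taken as a rough proxy for a reliable result; the snippet below is purely illustrative and uses synthetic posteriors rather than a real decoder.

```python
import torch

def mean_posterior_entropy(posteriors):
    """posteriors: (num_decoding_steps, vocab_size), each row sums to 1."""
    ent = -(posteriors * torch.log(posteriors.clamp_min(1e-10))).sum(dim=-1)
    return ent.mean().item()

# Fake decoder posteriors for a 40-step hypothesis over a 500-token vocabulary.
probs = torch.softmax(torch.randn(40, 500), dim=-1)
print(mean_posterior_entropy(probs))   # lower values would suggest a more confident decode
```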
Audio-Visual Automatic Speech Recognition Using PZM, MFCC and Statistical Analysis
Audio-Visual Automatic Speech Recognition (AV-ASR) has become a promising research area for situations in which the audio signal is corrupted by noise. The main objective of this paper is to select important and discriminative audio and visual speech features for recognizing audio-visual speech. This paper proposes Pseudo Zernike Moments (PZM) together with a feature selection method for audio-visual speech recognition. Visual information is captured from the lip contour, and moments are computed for lip reading. We extract 19th-order Mel Frequency Cepstral Coefficients (MFCC) as speech features from the audio. Since the 19 speech features are not all equally important, feature selection algorithms are used to select the most effective ones. Statistical tests such as Analysis of Variance (ANOVA), the Kruskal-Wallis test, and the Friedman test are employed to analyze the significance of the features, together with an Incremental Feature Selection (IFS) technique: the statistical analysis assesses the significance of the speech features, after which IFS selects the speech feature subset. Furthermore, multiclass Support Vector Machine (SVM), Artificial Neural Network (ANN), and Naive Bayes (NB) classifiers are used to recognize speech from the audio and visual modalities. Based on the recognition rates, a combined decision is derived from the two individual recognition systems. This paper compares the results achieved by the proposed model and existing models for both audio and visual speech recognition. Zernike Moments (ZM) are compared with PZM, showing that the proposed model using PZM extracts more discriminative features for visual speech recognition. This study also shows that audio feature selection using statistical analysis outperforms approaches without any feature selection technique.
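The statistical feature-selection step can be sketched as follows, assuming scikit-learn and synthetic data in place of real MFCC features: features are ranked by an ANOVA F-test and the subset is grown incrementally, keeping the size that maximizes cross-validated SVM accuracy. This is a hedged approximation of the described pipeline, not the authors' code.

```python
import numpy as np
from sklearn.feature_selection import f_classif
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 19))          # 19 MFCC features per utterance (synthetic)
y = rng.integers(0, 5, size=200)        # 5 hypothetical word classes

f_scores, _ = f_classif(X, y)           # ANOVA significance of each feature
order = np.argsort(f_scores)[::-1]      # most discriminative first

best_acc, best_subset = 0.0, order[:1]
for k in range(1, len(order) + 1):      # incremental feature selection (IFS)
    subset = order[:k]
    acc = cross_val_score(SVC(), X[:, subset], y, cv=5).mean()
    if acc > best_acc:
        best_acc, best_subset = acc, subset
print(best_acc, best_subset)
```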
Automatic recognition of multiparty human interactions using dynamic Bayesian networks
Relating statistical machine learning approaches to the automatic analysis of multiparty
communicative events, such as meetings, is an ambitious research area. We
have investigated automatic meeting segmentation both in terms of “Meeting Actions”
and “Dialogue Acts”. Dialogue acts model the discourse structure at a fine-grained
level, highlighting individual speaker intentions. Group meeting actions describe
the same process at a coarse level, highlighting interactions between different
meeting participants and showing overall group intentions.
A framework based on probabilistic graphical models such as dynamic Bayesian
networks (DBNs) has been investigated for both tasks. Our first set of experiments
is concerned with the segmentation and structuring of meetings (recorded using
multiple cameras and microphones) into sequences of group meeting actions such
as monologue, discussion and presentation. We outline four families of multimodal
features based on speaker turns, lexical transcription, prosody, and visual motion
that are extracted from the raw audio and video recordings. We relate these low-level
multimodal features to complex group behaviours, proposing a multistream modelling
framework based on dynamic Bayesian networks. Later experiments are
concerned with the automatic recognition of Dialogue Acts (DAs) in multiparty
conversational speech. We present a joint generative approach based on a switching
DBN for DA recognition in which segmentation and classification of DAs are
carried out in parallel. This approach models a set of features, related to lexical
content and prosody, and incorporates a weighted interpolated factored language
model. In conjunction with this joint generative model, we have also investigated
the use of a discriminative approach, based on conditional random fields, to perform
a reclassification of the segmented DAs.
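As a heavily simplified illustration of the decoding step underlying such generative models, the sketch below runs a Viterbi search for the most likely sequence of three meeting actions given synthetic per-frame log-likelihoods and a sticky transition matrix; the real DBNs factor the state space far more finely and operate on multimodal features.

```python
import numpy as np

def viterbi(log_lik, log_trans, log_prior):
    """Most likely state sequence for per-frame log-likelihoods (T, S)."""
    T, S = log_lik.shape
    delta = np.full((T, S), -np.inf)
    back = np.zeros((T, S), dtype=int)
    delta[0] = log_prior + log_lik[0]
    for t in range(1, T):
        scores = delta[t - 1][:, None] + log_trans   # (S, S): previous -> current
        back[t] = scores.argmax(axis=0)
        delta[t] = scores.max(axis=0) + log_lik[t]
    path = [int(delta[-1].argmax())]
    for t in range(T - 1, 0, -1):
        path.append(int(back[t, path[-1]]))
    return path[::-1]

actions = ["monologue", "discussion", "presentation"]
rng = np.random.default_rng(1)
log_lik = np.log(rng.dirichlet(np.ones(3), size=50))   # fake per-frame likelihoods
log_trans = np.log(np.array([[0.9, 0.05, 0.05],
                             [0.05, 0.9, 0.05],
                             [0.05, 0.05, 0.9]]))       # sticky transitions
states = viterbi(log_lik, log_trans, np.log(np.ones(3) / 3))
print([actions[s] for s in states[:10]])
```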
The DBN-based approach yielded significant improvements when applied to both
the meeting action and the dialogue act recognition tasks. On both tasks, the DBN
framework provided an effective factorisation of the state-space and a flexible infrastructure
able to integrate a heterogeneous set of resources such as continuous
and discrete multimodal features, and statistical language models. Although our
experiments have principally targeted multiparty meetings, the features, models,
and methodologies developed in this thesis can be employed for a wide range
of applications. Moreover, both group meeting actions and DAs offer valuable insights
into the current conversational context, providing cues and features for several
related research areas such as speaker addressing, focus-of-attention modelling,
automatic speech recognition and understanding, and topic and decision detection.