A practical two-stage training strategy for multi-stream end-to-end speech recognition
The multi-stream paradigm of audio processing, in which several sources are
simultaneously considered, has been an active research area for information
fusion. Our previous study offered a promising direction within end-to-end
automatic speech recognition, where parallel encoders aim to capture diverse
information followed by a stream-level fusion based on attention mechanisms to
combine the different views. However, with an increasing number of streams
resulting in an increasing number of encoders, the previous approach could
require substantial memory and massive amounts of parallel data for joint
training. In this work, we propose a practical two-stage training scheme.
Stage-1 is to train a Universal Feature Extractor (UFE), where encoder outputs
are produced from a single-stream model trained with all data. Stage-2
formulates a multi-stream scheme intending to solely train the attention fusion
module using the UFE features and pretrained components from Stage-1.
Experiments have been conducted on two datasets, DIRHA and AMI, as a
multi-stream scenario. Compared with our previous method, this strategy
achieves relative word error rate reductions of 8.2--32.4%, while consistently
outperforming several conventional combination methods.
Comment: submitted to ICASSP 201
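The stream-level attention fusion mentioned in the abstract above can be illustrated with a minimal sketch. It is not the authors' implementation: the dot-product scoring, the function names `softmax` and `fuse_streams`, and the toy vectors are all assumptions chosen for brevity. It only shows the general idea of weighting one context vector per stream and summing them.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def fuse_streams(query, stream_contexts):
    """Fuse one context vector per stream into a single vector.

    query           -- decoder state as a list of floats (hypothetical name)
    stream_contexts -- one context vector per stream, same length as query
    Scoring uses a plain dot product, which is an assumption of this sketch.
    """
    # One relevance score per stream.
    scores = [sum(q * c for q, c in zip(query, ctx)) for ctx in stream_contexts]
    # Normalize the scores into stream-level attention weights.
    weights = softmax(scores)
    # Weighted sum of the per-stream context vectors.
    fused = [
        sum(w * ctx[i] for w, ctx in zip(weights, stream_contexts))
        for i in range(len(query))
    ]
    return fused, weights


# Toy example with two streams and two-dimensional vectors (made-up values).
query = [1.0, 0.0]
contexts = [[2.0, 1.0], [0.0, 3.0]]
fused, weights = fuse_streams(query, contexts)
```

In a two-stage setup like the one described, only the parameters that produce these weights would be trained in the second stage, while the per-stream encoder outputs come from the pretrained single-stream model.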
Neural network-based method for visual recognition of driver’s voice commands using attention mechanism
Visual speech recognition, or automated lip-reading, is actively used for speech-to-text conversion. Video data
proves useful in multimodal speech recognition systems, particularly when acoustic data is difficult to use or
not available at all. The main purpose of this study is to improve driver command recognition by analyzing visual
information, reducing touch interaction with various vehicle systems (multimedia and navigation systems, phone calls,
etc.) while driving. We propose a method for automated lip-reading of the driver’s speech while driving, based on a deep
neural network with the 3DResNet18 architecture. Adding a bi-directional LSTM model and an
attention mechanism yields higher recognition accuracy at the cost of a slight decrease in processing speed. Two
variants of neural network architecture for visual speech recognition are proposed and investigated. The
first architecture achieved a driver command recognition accuracy of 77.68 %, which is 5.78 percentage points lower
than the 83.46 % achieved by the second. System speed, measured by the real-time factor (RTF),
is 0.076 for the first architecture and 0.183 for the second,
which is more than twice as high. The proposed method was tested on the multimodal RUSAVIC corpus,
recorded in a car. The results can be used in audio-visual speech recognition systems, which
are recommended in high-noise conditions, for example when driving a vehicle. In addition, the analysis
allows us to choose the optimal neural network model for visual speech recognition for subsequent incorporation into
an assistive system based on a mobile device.
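The trade-off reported in the abstract above can be checked with a few lines of arithmetic. The sketch below assumes the usual definition of the real-time factor (processing time divided by input duration). The helper name `real_time_factor` and the 7.6 s / 100 s example are hypothetical, while the accuracy and RTF figures are taken from the abstract.

```python
def real_time_factor(processing_seconds, input_seconds):
    """RTF = processing time / input duration; below 1.0 is faster than real time."""
    return processing_seconds / input_seconds


# Figures reported in the abstract.
acc_first, acc_second = 77.68, 83.46   # recognition accuracy, percent
rtf_first, rtf_second = 0.076, 0.183   # real-time factor

# Accuracy gain of the second architecture, in percentage points.
accuracy_gap = round(acc_second - acc_first, 2)

# How many times slower the second architecture is.
rtf_ratio = rtf_second / rtf_first

# Hypothetical illustration: 7.6 s of processing for 100 s of video.
example_rtf = real_time_factor(7.6, 100.0)

print(f"accuracy gap: {accuracy_gap} points, RTF ratio: {rtf_ratio:.2f}")
```

Both architectures stay well below an RTF of 1.0, so the choice between them comes down to whether the extra accuracy is worth the extra compute on the target device.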
Integration of Language Models in Sequence to Sequence Optical Music Recognition Systems
This project studies the potential of integrating a language model, through various techniques, into a Sequence to Sequence-based Optical Music Recognition (OMR) system. The goal is to improve the performance of the model on old handwritten music scores, whose interpretation is particularly error-prone due to their high degree of variability and the distortions they usually contain.