7 research outputs found
End-to-End Speech Recognition From the Raw Waveform
State-of-the-art speech recognition systems rely on fixed, hand-crafted
features such as mel-filterbanks to preprocess the waveform before the training
pipeline. In this paper, we study end-to-end systems trained directly from the
raw waveform, building on two alternatives for trainable replacements of
mel-filterbanks that use a convolutional architecture. The first one is
inspired by gammatone filterbanks (Hoshen et al., 2015; Sainath et al., 2015),
and the second one by the scattering transform (Zeghidour et al., 2017). We
propose two modifications to these architectures and systematically compare
them to mel-filterbanks on the Wall Street Journal dataset. The first
modification is the addition of an instance normalization layer, which greatly
improves the gammatone-based trainable filterbanks and speeds up the
training of the scattering-based filterbanks. The second concerns the
low-pass filter used in these approaches. These modifications consistently
improve performance for both approaches and remove the need for careful
initialization of scattering-based trainable filterbanks. In particular, we
show a consistent improvement in word error rate of the trainable filterbanks
relative to comparable mel-filterbanks. This is the first time that end-to-end
models trained from the raw signal significantly outperform mel-filterbanks on
a large-vocabulary task under clean recording conditions.
Comment: Accepted for presentation at Interspeech 201
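To make the trainable front-end concrete, the following is a minimal sketch, assuming PyTorch; the class name, filter count, filter length, and stride are illustrative choices rather than the paper's values, and a plain average pooling stands in for the low-pass filter the paper discusses.

    import torch
    import torch.nn as nn

    class TrainableFilterbank(nn.Module):
        """Learnable convolutional replacement for mel-filterbanks (sketch)."""
        def __init__(self, n_filters=40, filter_len=400, stride=160):
            super().__init__()
            # Learnable band-pass filters over the raw waveform (the paper
            # studies gammatone- and scattering-based variants).
            self.filters = nn.Conv1d(1, n_filters, filter_len,
                                     padding=filter_len // 2, bias=False)
            # Low-pass filtering and decimation, here a simple boxcar average.
            self.lowpass = nn.AvgPool1d(kernel_size=2 * stride, stride=stride)
            # The instance normalization layer the paper adds.
            self.norm = nn.InstanceNorm1d(n_filters)

        def forward(self, waveform):      # waveform: (batch, 1, samples)
            x = self.filters(waveform)    # band-pass filtering
            x = x ** 2                    # energy (squared modulus)
            x = self.lowpass(x)           # low-pass filter + decimation
            x = torch.log1p(x)            # log compression
            return self.norm(x)           # per-utterance normalization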
Multi-scale Alignment and Contextual History for Attention Mechanism in Sequence-to-sequence Model
A sequence-to-sequence model is a neural network module that maps between two
sequences of different lengths. The sequence-to-sequence model has three core
modules: encoder, decoder, and attention. Attention is the bridge that connects
the encoder and decoder modules and improves model performance in many tasks.
In this paper, we propose two ideas to improve sequence-to-sequence model
performance by enhancing the attention module. First, we maintain the history
of the location and the expected context from several previous time-steps.
Second, we apply multiscale convolution from several previous attention vectors
to the current decoder state. We apply our proposed framework to
sequence-to-sequence speech recognition and text-to-speech systems. The results
reveal that our proposed extensions significantly improve performance
compared to a standard attention baseline.
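A minimal sketch of the second idea, assuming PyTorch: a location-aware attention module that convolves a short history of previous alignment vectors at several kernel sizes before scoring. The class name, dimensions, history length, and kernel sizes are hypothetical, and the sketch stands in for, rather than reproduces, the authors' exact formulation.

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class MultiScaleLocationAttention(nn.Module):
        """Scores attention from the decoder state, encoder outputs, and a
        multi-scale convolution over the last few alignment vectors."""
        def __init__(self, enc_dim, dec_dim, attn_dim,
                     history=3, kernel_sizes=(3, 7, 15)):
            super().__init__()
            self.query = nn.Linear(dec_dim, attn_dim, bias=False)
            self.key = nn.Linear(enc_dim, attn_dim, bias=False)
            # One convolution per scale over the stacked alignment history.
            self.location_convs = nn.ModuleList(
                nn.Conv1d(history, attn_dim, k, padding=k // 2, bias=False)
                for k in kernel_sizes)
            self.score = nn.Linear(attn_dim, 1, bias=False)

        def forward(self, dec_state, enc_out, align_history):
            # dec_state: (B, dec_dim); enc_out: (B, T, enc_dim)
            # align_history: (B, history, T), previous attention vectors
            loc = sum(conv(align_history) for conv in self.location_convs)
            e = self.score(torch.tanh(
                self.query(dec_state).unsqueeze(1)   # (B, 1, attn_dim)
                + self.key(enc_out)                  # (B, T, attn_dim)
                + loc.transpose(1, 2)))              # (B, T, attn_dim)
            align = F.softmax(e.squeeze(-1), dim=-1) # weights over T
            context = torch.bmm(align.unsqueeze(1), enc_out).squeeze(1)
            return context, align                    # context: (B, enc_dim)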
Аналитический обзор интегральных систем распознавания речи (An Analytical Survey of End-to-End Speech Recognition Systems)
This article presents an analytical survey of various end-to-end speech recognition systems, as well as methods for their construction, training, and optimization. We consider models based on connectionist temporal classification (CTC) as the loss function for the neural network, models based on an encoder-decoder architecture with an attention mechanism, and neural networks built with conditional random fields (CRF); the latter generalize hidden Markov models and thereby remedy many shortcomings of standard hybrid speech recognition systems, such as the assumption that the elements of the input sequences of speech sounds are independent random variables. We also describe possibilities for integration with language models at the decoding stage, which significantly reduces recognition error rates for end-to-end models, and various modifications and improvements of standard end-to-end architectures, such as the generalization of connectionist temporal classification and the use of regularization in attention-based models. A survey of research works in this subject area reveals that end-to-end systems achieve results close to those of state-of-the-art hybrid systems based on hidden Markov models, while using a simpler configuration and demonstrating high speed at both training and decoding. In addition, we consider the most popular and actively developed frameworks and toolkits for building end-to-end speech recognition systems, such as TensorFlow, Eesen, Kaldi, and others, and compare them by the simplicity and accessibility of their use for implementing such systems.
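Since CTC-based models are central to the survey, a minimal usage sketch of the CTC criterion may help. It relies on PyTorch's built-in nn.CTCLoss; PyTorch is an assumption here (the survey names TensorFlow, Eesen, and Kaldi), and all shapes and sizes are illustrative.

    import torch
    import torch.nn as nn

    T, B, C = 50, 4, 29                    # time steps, batch, labels incl. blank
    logits = torch.randn(T, B, C, requires_grad=True)  # stand-in model output
    log_probs = logits.log_softmax(-1)     # CTC expects log-probabilities
    targets = torch.randint(1, C, (B, 12)) # reference label sequences
    input_lengths = torch.full((B,), T, dtype=torch.long)
    target_lengths = torch.full((B,), 12, dtype=torch.long)

    ctc = nn.CTCLoss(blank=0)              # index 0 reserved for the CTC blank
    loss = ctc(log_probs, targets, input_lengths, target_lengths)
    loss.backward()                        # gradients for end-to-end training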
Attention-Inspired Artificial Neural Networks for Speech Processing: A Systematic Review
Artificial Neural Networks (ANNs), inspired by the neural networks of the human brain, have been widely applied in speech processing. Application areas of ANNs include speech recognition, speech emotion recognition, language identification, speech enhancement, and speech separation, among others. Likewise, given that speech processing performed by humans involves complex cognitive processes known as auditory attention, a growing number of papers have proposed ANNs, supported by deep learning algorithms, in conjunction with some mechanism intended to mirror the human attention process. However, while these ANN approaches include attention, there is no categorization of how attention is integrated into the deep learning algorithms, nor of its relation to human auditory attention. Therefore, we consider a review of the different attention-inspired ANN approaches necessary to show both academic and industry experts the available models for a wide variety of applications. Based on the PRISMA methodology, we present a systematic review of the literature published since 2000 in which deep learning algorithms are applied to diverse problems related to speech processing. In this paper, 133 research works are selected and the following aspects are described: (i) most relevant features, (ii) ways in which attention has been implemented, (iii) their hypothetical relationship with human attention, and (iv) the evaluation metrics used. Additionally, the four publications most closely related to human attention were analyzed, and their strengths and weaknesses were determined.