9 research outputs found
Multi-scale Alignment and Contextual History for Attention Mechanism in Sequence-to-sequence Model
A sequence-to-sequence model is a neural network that maps between two
sequences of different lengths. It has three core modules: encoder, decoder,
and attention. Attention is the bridge that connects the encoder and decoder
modules and improves model performance in many tasks.
In this paper, we propose two ideas to improve sequence-to-sequence model
performance by enhancing the attention module. First, we maintain a history
of the attended locations and the expected contexts from several previous
time-steps. Second, we apply multi-scale convolution over several previous
attention vectors and feed the result to the current decoder state. We apply
the proposed framework to sequence-to-sequence speech recognition and
text-to-speech systems. The results reveal that our proposed extensions
significantly improve performance over a standard attention baseline.
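The second idea can be pictured as summarizing the attention history at several temporal scales before conditioning the decoder. The sketch below is a crude stand-in, not the authors' architecture: it keeps the last K attention context vectors and pools them over windows of several sizes (a uniform-kernel convolution evaluated at the last position); the dimensions and scale set are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical dimensions (not from the paper).
K, d = 4, 8          # history length, context-vector size

# History of attention context vectors from the K previous time-steps.
history = rng.standard_normal((K, d))

def multiscale_summary(history, scales=(1, 2, 4)):
    """Pool the attention history over windows of several scales and
    concatenate the results -- a crude stand-in for the paper's
    multi-scale convolution over previous attention vectors."""
    parts = []
    for s in scales:
        # Mean over the most recent `s` context vectors.
        parts.append(history[-s:].mean(axis=0))
    return np.concatenate(parts)

summary = multiscale_summary(history)
# In a real model, `summary` would be combined with the current
# decoder state before computing the next attention scores.
print(summary.shape)   # (24,) = 3 scales * 8 dims
```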
Make More of Your Data: Minimal Effort Data Augmentation for Automatic Speech Recognition and Translation
Data augmentation is a technique to generate new training data based on
existing data. We evaluate the simple and cost-effective method of
concatenating the original data examples to build new training instances.
Continued training with such augmented data is able to improve off-the-shelf
Transformer and Conformer models that were optimized on the original data only.
We demonstrate considerable improvements on the LibriSpeech-960h test sets (WER
2.83 and 6.87 for test-clean and test-other), which carry over to models
combined with shallow fusion (WER 2.55 and 6.27). Our method of continued
training also leads to improvements of up to 0.9 WER on the ASR part of
CoVoST-2 for four non-English languages, and we observe that the gains are
highly dependent on the size of the original training data. We compare
different concatenation strategies and find that our method does not need
speaker information to achieve its improvements. Finally, we demonstrate on
two datasets that our method also works for speech translation tasks.
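The augmentation itself is deliberately minimal: two existing (audio, transcript) examples are joined end-to-end to form a new training instance. A toy sketch under assumed data shapes (the random-pairing strategy here is an assumption, not necessarily the paper's):

```python
import random

# Toy stand-ins for (audio-feature, transcript) training pairs.
data = [([0.1, 0.2], "hello world"),
        ([0.3], "good morning"),
        ([0.5, 0.6, 0.7], "see you")]

def concat_augment(data, n_new, seed=0):
    """Build new training instances by concatenating two existing
    examples end-to-end -- the minimal-effort augmentation the
    abstract describes (the pairing strategy is an assumption here)."""
    rng = random.Random(seed)
    augmented = []
    for _ in range(n_new):
        (a1, t1), (a2, t2) = rng.sample(data, 2)
        # Concatenate both the audio features and the transcripts.
        augmented.append((a1 + a2, t1 + " " + t2))
    return augmented

new_examples = concat_augment(data, n_new=2)
print(len(new_examples))  # 2
```

Continued training then mixes these longer synthetic utterances with the original data.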
Albayzin 2018 Evaluation: The IberSpeech-RTVE Challenge on Speech Technologies for Spanish Broadcast Media
The IberSpeech-RTVE Challenge presented at IberSpeech 2018 is a new Albayzin evaluation series supported by the Spanish Thematic Network on Speech Technologies (Red Temática en Tecnologías del Habla, RTTH). The series focused on speech-to-text transcription, speaker diarization, and multimodal diarization of television programs. For this purpose, the Corporación Radio Televisión Española (RTVE), the main public service broadcaster in Spain, and the RTVE Chair at the University of Zaragoza made more than 500 h of broadcast content and subtitles available to researchers. The dataset included about 20 programs of different kinds and topics produced and broadcast by RTVE between 2015 and 2018. The programs presented different challenges from the point of view of speech technologies, such as the diversity of Spanish accents, overlapping speech, spontaneous speech, acoustic variability, background noise, and specific vocabulary. This paper describes the database and the evaluation process and summarizes the results obtained.
R-BI: Regularized Batched Inputs enhance Incremental Decoding Framework for Low-Latency Simultaneous Speech Translation
Incremental Decoding is an effective framework that enables the use of an
offline model in a simultaneous setting without modifying the original model,
making it suitable for Low-Latency Simultaneous Speech Translation. However,
this framework may introduce errors when the system produces output from
incomplete input. To reduce these output errors, several strategies such as
Hold-n, LA-n, and SP-n can be employed, but the hyper-parameter n must be
carefully selected for optimal performance. Moreover, these strategies are more
suitable for end-to-end systems than cascade systems. In our paper, we propose
a new adaptable and efficient policy named "Regularized Batched Inputs". Our
method stands out by enhancing input diversity to mitigate output errors. We
suggest particular regularization techniques for both end-to-end and cascade
systems. We conducted experiments on IWSLT Simultaneous Speech Translation
(SimulST) tasks, which demonstrate that our approach achieves low latency
while losing no more than 2 BLEU points relative to offline systems.
Furthermore, our SimulST systems attained several new state-of-the-art results
in various language directions.
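To make the commit-policy family concrete, here is a sketch of the Hold-n idea the abstract contrasts against: commit all but the last n tokens of each partial hypothesis, since the tail is most likely to be revised once more input arrives. This illustrates the baseline strategy, not the paper's R-BI method (which instead regularizes the batched inputs); the token sequences are invented.

```python
def hold_n(partial_hypothesis, committed, n=2):
    """Hold-n policy sketch: treat all but the last n tokens of the
    current partial hypothesis as stable, and emit only the stable
    tokens that were not already committed."""
    stable = partial_hypothesis[:len(partial_hypothesis) - n]
    # Only emit tokens beyond what was already committed.
    new_tokens = stable[len(committed):]
    return committed + new_tokens

# Simulate three successive partial hypotheses as audio streams in.
committed = []
for hyp in [["I"], ["I", "saw", "a"], ["I", "saw", "a", "cat", "today"]]:
    committed = hold_n(hyp, committed, n=2)
print(committed)  # ['I', 'saw', 'a']
```

Choosing n trades latency against the risk of committing tokens the model would later revise, which is the hyper-parameter sensitivity the abstract points out.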
Automatic Curriculum Learning With Over-repetition Penalty for Dialogue Policy Learning
Dialogue policy learning based on reinforcement learning is difficult to
apply to real users when training dialogue agents from scratch because of the
high cost. User simulators, which choose random user goals for the dialogue
agent to train on, have been considered an affordable substitute for real users.
However, this random sampling method ignores the law of human learning, making
the learned dialogue policy inefficient and unstable. We propose a novel
framework, Automatic Curriculum Learning-based Deep Q-Network (ACL-DQN), which
replaces the traditional random sampling method with a teacher policy model to
realize automatic curriculum learning for the dialogue policy. The teacher
model arranges a meaningful ordered curriculum and automatically adjusts it by
monitoring the learning progress of the dialogue agent and the over-repetition
penalty without any requirement of prior knowledge. The learning progress of
the dialogue agent reflects the relationship between the dialogue agent's
ability and the sampled goals' difficulty for sample efficiency. The
over-repetition penalty guarantees the diversity of sampled goals. Experiments show that
the ACL-DQN improves the effectiveness and stability of dialogue policy
learning by a statistically significant margin. Furthermore, the framework
can be further improved by equipping it with different curriculum schedules,
which demonstrates that the framework has strong generalizability.
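The teacher's two signals can be combined in a simple scoring rule: prefer goals where the agent is currently making progress, discounted by how often each goal has already been sampled. The sketch below is illustrative only; the scoring function, penalty weight, and goal names are assumptions, not the ACL-DQN formulation.

```python
import math
import random

def sample_goal(goals, progress, counts, beta=0.5, seed=0):
    """Toy teacher policy: score each goal by recent learning progress
    minus an over-repetition penalty proportional to how often it was
    already sampled, then sample from a softmax over the scores."""
    scores = [progress[g] - beta * counts[g] for g in goals]
    # Softmax over scores -> sampling probabilities.
    exps = [math.exp(s) for s in scores]
    total = sum(exps)
    probs = [e / total for e in exps]
    return random.Random(seed).choices(goals, weights=probs)[0]

# Hypothetical user goals with made-up progress and sampling counts.
goals = ["book_flight", "find_restaurant", "play_music"]
progress = {"book_flight": 0.8, "find_restaurant": 0.2, "play_music": 0.5}
counts = {"book_flight": 5, "find_restaurant": 1, "play_music": 2}
print(sample_goal(goals, progress, counts))
```

The penalty term is what keeps the curriculum from collapsing onto a few high-progress goals, matching the diversity guarantee the abstract describes.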
A Quaternion Gated Recurrent Unit Neural Network for Sensor Fusion
Recurrent Neural Networks (RNNs) are known for their ability to learn relationships within temporal sequences. Gated Recurrent Unit (GRU) networks have found use in challenging time-dependent applications such as Natural Language Processing (NLP), financial analysis, and sensor fusion due to their ability to cope with the vanishing gradient problem. GRUs are also more computationally efficient than their variant, the Long Short-Term Memory neural network (LSTM), due to their less complex structure, and as such are more suitable for applications requiring efficient management of computational resources. Many such applications require a stronger mapping of their features to further enhance prediction accuracy. A novel Quaternion Gated Recurrent Unit (QGRU) is proposed in this paper, which leverages the internal and external dependencies within the quaternion algebra to map correlations within and across multidimensional features. Unlike the GRU, which only captures dependencies within the sequence, the QGRU can efficiently capture both the inter- and intra-dependencies within multidimensional features. The performance of the proposed method is evaluated on a sensor fusion problem involving navigation in Global Navigation Satellite System (GNSS)-deprived environments, as well as on a human activity recognition problem. The results show that the QGRU produces competitive results with almost 3.7 times fewer parameters than the GRU.
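The coupling across feature components, and the parameter saving, both come from replacing real multiplications with the quaternion Hamilton product, which mixes all four components of a feature at once. A minimal sketch of that core operation (the surrounding QGRU gating is omitted):

```python
import numpy as np

def hamilton(p, q):
    """Hamilton product of two quaternions (w, x, y, z) -- the core
    operation a quaternion layer uses in place of real multiplication
    so that the four components of each feature stay coupled."""
    w1, x1, y1, z1 = p
    w2, x2, y2, z2 = q
    return np.array([
        w1*w2 - x1*x2 - y1*y2 - z1*z2,
        w1*x2 + x1*w2 + y1*z2 - z1*y2,
        w1*y2 - x1*z2 + y1*w2 + z1*x2,
        w1*z2 + x1*y2 - y1*x2 + z1*w2,
    ])

# A quaternion weight acts on all four components of its input at
# once, which is how cross-component correlations are captured with
# roughly a quarter of the parameters of an equivalent real layer.
i = np.array([0.0, 1.0, 0.0, 0.0])
j = np.array([0.0, 0.0, 1.0, 0.0])
print(hamilton(i, j))   # i*j = k -> [0. 0. 0. 1.]
```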