Low-Latency Sequence-to-Sequence Speech Recognition and Translation by Partial Hypothesis Selection
Encoder-decoder models provide a generic architecture for
sequence-to-sequence tasks such as speech recognition and translation. While
offline systems are often evaluated on quality metrics like word error rates
(WER) and BLEU, latency is also a crucial factor in many practical use cases.
We propose three latency reduction techniques for chunk-based incremental
inference and evaluate their efficiency in terms of accuracy-latency trade-off.
On the 300-hour How2 dataset, we reduce latency by 83% to 0.8 seconds at the
cost of 1% WER (6% rel.) compared to offline transcription. Although our
experiments use the Transformer, the hypothesis selection strategies are
applicable to other encoder-decoder models. To avoid expensive re-computation,
we use a unidirectionally-attending encoder. After an adaptation procedure to
partial sequences, the unidirectional model performs on-par with the original
model. We further show that our approach is also applicable to low-latency
speech translation. On How2 English-Portuguese speech translation, we reduce
latency to 0.7 seconds (-84% rel.) while incurring a loss of 2.4 BLEU points
(5% rel.) compared to the offline system.
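The chunk-based incremental inference described above can be sketched as follows. The local-agreement rule (commit only the prefix on which two consecutive chunk hypotheses agree), the `decode` callback, and all names here are illustrative assumptions, not the paper's exact hypothesis-selection strategies:

```python
def shared_prefix(a, b):
    """Longest common prefix of two token sequences."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return a[:n]

def incremental_decode(chunks, decode):
    """Chunk-based incremental inference with a simple hypothesis-selection
    rule: after each new audio chunk, re-decode all audio received so far
    and commit (emit) only the prefix shared with the previous hypothesis,
    so later audio can never retract already-emitted output.
    `decode` stands in for a full encoder-decoder pass."""
    committed, prev_hyp, received = [], [], []
    for chunk in chunks:
        received.extend(chunk)
        hyp = decode(received)
        stable = shared_prefix(prev_hyp, hyp)
        # emit only tokens beyond what was already committed
        committed.extend(stable[len(committed):])
        prev_hyp = hyp
    # after the final chunk, commit the remaining hypothesis
    committed.extend(prev_hyp[len(committed):])
    return committed
```

Because output is committed monotonically, latency is bounded by how quickly consecutive hypotheses stabilize rather than by the full utterance length.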
Working Memory in Writing: Empirical Evidence From the Dual-Task Technique
The dual-task paradigm has recently played a major role in understanding the role of working memory in writing. Reviewing recent findings in this field of research, this article highlights how the dual-task technique has allowed researchers to study the processing and short-term storage functions of working memory involved in writing. With respect to the processing functions of working memory (namely, attentional and executive functions), studies have investigated resource allocation, step-by-step management, and parallel coordination of the writing processes. With respect to short-term storage in working memory, experiments have mainly attempted to test Kellogg's (1996) proposals on the relationship between the writing processes and the slave systems of working memory. It is concluded that the dual-task technique has proven fruitful in understanding the relationship between writing and working memory.
Yeah, Right, Uh-Huh: A Deep Learning Backchannel Predictor
Using supporting backchannel (BC) cues can make human-computer interaction
more social. BCs provide feedback from the listener to the speaker, indicating
that the speaker is still being listened to. BCs can be expressed in different
ways depending on the modality of the interaction, for example as gestures or
acoustic cues. In this work, we considered only acoustic cues. We propose an
approach to detecting BC opportunities based on acoustic input features such
as power and pitch. While other works in the field rely on hand-written rule
sets or specialized features, we used artificial neural networks, which are
capable of deriving higher-order features from the input features themselves.
In our setup, we first used a fully connected feed-forward network to
establish an updated baseline in comparison to our previously proposed setup.
We then extended this setup with Long Short-Term Memory (LSTM) networks, which
have been shown to outperform feed-forward setups on various tasks. Our best
system achieved an F1-score of 0.37 using power and pitch features. Adding
linguistic information using word2vec increased the score to 0.39.
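As an illustration of the acoustic power features mentioned above, here is a minimal frame-level log-power extractor in NumPy. The frame and hop sizes (25 ms / 10 ms at 16 kHz) and the function name are assumptions for the sketch, not the paper's exact feature pipeline; such per-frame features would then be fed to the feed-forward or LSTM classifier:

```python
import numpy as np

def frame_power(signal, frame_len=400, hop=160):
    """Log power per frame, a typical acoustic feature for BC prediction.
    Defaults correspond to 25 ms frames with a 10 ms hop at 16 kHz
    (illustrative choices). Returns one value per frame."""
    feats = []
    for start in range(0, len(signal) - frame_len + 1, hop):
        frame = signal[start:start + frame_len]
        # small epsilon avoids log(0) on silent frames
        feats.append(np.log(np.mean(frame ** 2) + 1e-10))
    return np.array(feats)
```

A sequence model like an LSTM would consume such frame features in order, predicting at each step whether a BC opportunity is present.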
End-to-End Evaluation for Low-Latency Simultaneous Speech Translation
The challenge of low-latency speech translation has recently drawn significant
interest in the research community, as shown by several publications and
shared tasks. It is therefore essential to evaluate these different approaches
in realistic scenarios. Currently, however, only specific aspects of the
systems are evaluated, and it is often not possible to compare different
approaches.
In this work, we propose the first framework to perform and evaluate the
various aspects of low-latency speech translation under realistic conditions.
The evaluation is carried out in an end-to-end fashion, accounting for the
segmentation of the audio as well as the run-time of the different components.
Secondly, we compare different approaches to low-latency speech translation
using this framework. We evaluate models that can revise their output as well
as methods with fixed output. Furthermore, we directly compare
state-of-the-art cascaded and end-to-end systems. Finally, the framework
automatically evaluates translation quality as well as latency, and also
provides a web interface to show the low-latency model outputs to the user.
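One simple per-word latency notion such a framework might report can be sketched as follows; the function, its inputs, and this particular definition of lag are hypothetical illustrations, not the metric actually implemented by the framework:

```python
def mean_word_latency(word_audio_times, word_emit_times):
    """Average lag, in seconds, between the time a word is fully spoken
    in the source audio and the time the system emits its translation.
    Both lists are aligned per word; this is one simple latency notion
    among several used in the literature."""
    assert len(word_audio_times) == len(word_emit_times)
    lags = [emit - audio
            for audio, emit in zip(word_audio_times, word_emit_times)]
    return sum(lags) / len(lags)
```

An end-to-end evaluation would feed real wall-clock emission timestamps into such a metric, so that segmentation and component run-times are reflected in the reported latency.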
Visualization: the missing factor in Simultaneous Speech Translation
Simultaneous speech translation (SimulST) is the task in which output
generation has to be performed on partial, incremental speech input. In recent
years, SimulST has become popular due to the spread of cross-lingual
application scenarios, like international live conferences and streaming
lectures, in which on-the-fly speech translation can facilitate users' access
to audio-visual content. In this paper, we analyze the characteristics of the
SimulST systems developed so far, discussing their strengths and weaknesses. We
then concentrate on the evaluation framework required to properly assess
systems' effectiveness. To this end, we raise the need for a broader
performance analysis, also including the user experience standpoint. SimulST
systems, indeed, should be evaluated not only in terms of quality/latency
measures, but also via task-oriented metrics accounting, for instance, for the
visualization strategy adopted. In light of this, we highlight the goals
achieved by the community and what is still missing.
Comment: Accepted at CLIC-it 202