A syntactic language model based on incremental CCG parsing
Syntactically-enriched language models (parsers) constitute a promising component in applications such as machine translation and speech recognition. To maintain a useful level of accuracy, existing parsers are non-incremental and must span a combinatorially growing space of possible structures as every input word is processed. This prohibits their incorporation into standard linear-time decoders. In this paper, we present an incremental, linear-time dependency parser based on Combinatory Categorial Grammar (CCG) and classification techniques. We devise a deterministic transform of CCGbank canonical derivations into incremental ones, and train our parser on this data. We discover that a cascaded, incremental version provides an appealing balance between efficiency and accuracy.
All Politics is Local: The Renminbi's Prospects as a Future Global Currency
In this article we describe methods for improving the RWTH German speech recognizer used within the VERBMOBIL project. In particular, we present acceleration methods for the search based on both within-word and across-word phoneme models. We also study incremental methods to reduce the response time of the online speech recognizer. Finally, we present experimental off-line results for the three VERBMOBIL scenarios. We report on word error rates and real-time factors for both speaker-independent and speaker-dependent recognition.
1 Introduction
The goal of the VERBMOBIL project is to develop a speech-to-speech translation system that performs close to real-time. In this system, speech recognition is followed by subsequent VERBMOBIL modules (like syntactic analysis and translation) which depend on the recognition result. Therefore, in this application it is particularly important to keep the recognition time as short as possible. There are VERBMOBIL modules which are capable to work ..
Low-Latency Sequence-to-Sequence Speech Recognition and Translation by Partial Hypothesis Selection
Encoder-decoder models provide a generic architecture for
sequence-to-sequence tasks such as speech recognition and translation. While
offline systems are often evaluated on quality metrics like word error rates
(WER) and BLEU, latency is also a crucial factor in many practical use-cases.
We propose three latency reduction techniques for chunk-based incremental
inference and evaluate their efficiency in terms of accuracy-latency trade-off.
On the 300-hour How2 dataset, we reduce latency by 83% to 0.8 second by
sacrificing 1% WER (6% rel.) compared to offline transcription. Although our
experiments use the Transformer, the hypothesis selection strategies are
applicable to other encoder-decoder models. To avoid expensive re-computation,
we use a unidirectionally-attending encoder. After an adaptation procedure to
partial sequences, the unidirectional model performs on-par with the original
model. We further show that our approach is also applicable to low-latency
speech translation. On How2 English-Portuguese speech translation, we reduce
latency to 0.7 second (-84% rel.) while incurring a loss of 2.4 BLEU points (5%
rel.) compared to the offline system
KIT's Multilingual Speech Translation System for IWSLT 2023
Many existing speech translation benchmarks focus on native-English speech in
high-quality recording conditions, which often do not match the conditions in
real-life use-cases. In this paper, we describe our speech translation system
for the multilingual track of IWSLT 2023, which focuses on the translation of
scientific conference talks. The test condition features accented input speech
and terminology-dense contents. The task requires translation into 10
languages of varying amounts of resources. In the absence of training data from the
target domain, we use a retrieval-based approach (kNN-MT) for effective
adaptation (+0.8 BLEU for speech translation). We also use adapters to easily
integrate incremental training data from data augmentation, and show that it
matches the performance of re-training. We observe that cascaded systems are
more easily adaptable towards specific target domains, due to their separate
modules. Our cascaded speech system substantially outperforms its end-to-end
counterpart on scientific talk translation, although their performance remains
similar on TED talks.
Comment: IWSLT 202
Incremental Blockwise Beam Search for Simultaneous Speech Translation with Controllable Quality-Latency Tradeoff
Blockwise self-attentional encoder models have recently emerged as one
promising end-to-end approach to simultaneous speech translation. These models
employ a blockwise beam search with hypothesis reliability scoring to determine
when to wait for more input speech before translating further. However, this
method maintains multiple hypotheses until the entire speech input is consumed,
so it cannot directly show a single incremental translation
to users. Further, this method lacks mechanisms for controlling the
quality vs. latency tradeoff. We propose a modified incremental blockwise beam
search incorporating local agreement or hold-n policies for quality-latency
control. We apply our framework to models trained for online or offline
translation and demonstrate that both types can be effectively used in online
mode.
Experimental results on MuST-C show 0.6-3.6 BLEU improvement without changing
latency or 0.8-1.4 s latency improvement without changing quality.
Comment: Accepted at INTERSPEECH 202
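The two policy names in the abstract above can be illustrated with a minimal sketch. It follows the common reading of these policies in the streaming-decoding literature, not this paper's implementation: hold-n commits the current best hypothesis minus its last n tokens, and local agreement commits the longest prefix shared by the hypotheses of two consecutive input chunks. The function names `hold_n` and `local_agreement` are hypothetical, and the sketch works on plain token lists, leaving out beam search and hypothesis reliability scoring.

```python
from typing import List, Sequence


def hold_n(hypothesis: Sequence[str], n: int) -> List[str]:
    """Commit the best hypothesis except its last n tokens (assumed hold-n reading)."""
    if n < 0:
        raise ValueError("n must be non-negative")
    keep = max(len(hypothesis) - n, 0)
    return list(hypothesis[:keep])


def local_agreement(previous: Sequence[str], current: Sequence[str]) -> List[str]:
    """Commit the longest common prefix of two consecutive chunk hypotheses."""
    agreed: List[str] = []
    for prev_tok, cur_tok in zip(previous, current):
        if prev_tok != cur_tok:
            break
        agreed.append(prev_tok)
    return agreed


# Hypothetical best hypotheses after two consecutive speech chunks.
after_chunk_1 = ["we", "propose", "a"]
after_chunk_2 = ["we", "propose", "an", "incremental"]

stable_by_agreement = local_agreement(after_chunk_1, after_chunk_2)
stable_by_hold = hold_n(after_chunk_2, 2)
```

In this toy case both policies commit `["we", "propose"]`. A larger n, or requiring agreement, delays output in exchange for fewer revisions, which is the quality-latency tradeoff the abstract refers to.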
Visualization: the missing factor in Simultaneous Speech Translation
Simultaneous speech translation (SimulST) is the task in which output
generation has to be performed on partial, incremental speech input. In recent
years, SimulST has become popular due to the spread of cross-lingual
application scenarios, like international live conferences and streaming
lectures, in which on-the-fly speech translation can facilitate users' access
to audio-visual content. In this paper, we analyze the characteristics of the
SimulST systems developed so far, discussing their strengths and weaknesses. We
then concentrate on the evaluation framework required to properly assess
systems' effectiveness. To this end, we raise the need for a broader
performance analysis, also including the user experience standpoint. SimulST
systems, indeed, should be evaluated not only in terms of quality/latency
measures, but also via task-oriented metrics accounting, for instance, for the
visualization strategy adopted. In light of this, we highlight the
goals the community has achieved and what is still missing.
Comment: Accepted at CLIC-it 202
Learning Fault-tolerant Speech Parsing with SCREEN
This paper describes a new approach and a system SCREEN for fault-tolerant
speech parsing. SCREEN stands for Symbolic Connectionist Robust EnterprisE for
Natural language. Speech parsing describes the syntactic and semantic analysis
of spontaneous spoken language. The general approach is based on incremental
immediate flat analysis, learning of syntactic and semantic speech parsing,
parallel integration of current hypotheses, and the consideration of various
forms of speech related errors. The goal for this approach is to explore the
parallel interactions between various knowledge sources for learning
incremental fault-tolerant speech parsing. This approach is examined in a
system SCREEN using various hybrid connectionist techniques. Hybrid
connectionist techniques are examined because of their promising properties of
inherent fault tolerance, learning, gradedness and parallel constraint
integration. The input for SCREEN is hypotheses about recognized words of a
spoken utterance potentially analyzed by a speech system; the output is
hypotheses about the flat syntactic and semantic analysis of the utterance. In
this paper we focus on the general approach, the overall architecture, and
examples for learning flat syntactic speech parsing. Unlike most other
speech language architectures, SCREEN emphasizes an interactive rather than an
autonomous position, learning rather than encoding, flat analysis rather than
in-depth analysis, and fault-tolerant processing of phonetic, syntactic and
semantic knowledge.
Comment: 6 pages, postscript, compressed, uuencoded to appear in Proceedings
of AAAI 9