A Unified Multilingual Handwriting Recognition System using multigrams sub-lexical units
We address the design of a unified multilingual system for handwriting
recognition. Most multilingual systems rest on specialized models, each
trained on a single language, one of which is selected at test time.
While some recognition systems are based on a unified optical model, dealing
with a unified language model remains a major issue, as traditional language
models are generally trained on corpora composed of large word lexicons per
language. Here, we bring a solution by considering language models based on
sub-lexical units, called multigrams. Dealing with multigrams strongly reduces
the lexicon size and thus decreases the language model complexity. This makes
possible the design of an end-to-end unified multilingual recognition system
where both a single optical model and a single language model are trained on
all the languages. We discuss the impact of the language unification on each
model and show that our system reaches the performance of state-of-the-art
methods with a strong reduction in complexity.

Comment: preprint
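The lexicon-reduction idea can be made concrete with a toy sketch. The paper's multigrams are learned probabilistically; the greedy longest-match segmenter below is only an illustrative assumption, showing how a small shared inventory of sub-lexical units can cover word lexicons from several languages:

```python
def greedy_segment(word, units, max_len=4):
    """Greedily split `word` into the longest multigram found in `units`,
    falling back to single characters so coverage is guaranteed."""
    out, i = [], 0
    while i < len(word):
        for k in range(min(max_len, len(word) - i), 0, -1):
            piece = word[i:i + k]
            if k == 1 or piece in units:
                out.append(piece)
                i += k
                break
    return out

# A tiny shared inventory covering fragments of English and German words
# (an illustrative toy, not an inventory learned as in the paper):
units = {"rec", "ogni", "tion", "spra", "che", "ken", "nung"}
print(greedy_segment("recognition", units))      # ['rec', 'ogni', 'tion']
print(greedy_segment("spracherkennung", units))  # ['spra', 'che', 'r', 'ken', 'nung']
```

A language model over such units sees a vocabulary of a few thousand multigrams instead of one large word lexicon per language, which is what makes a single shared language model tractable.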
Language-Routing Mixture of Experts for Multilingual and Code-Switching Speech Recognition
Multilingual speech recognition for both monolingual and code-switching
speech is a challenging task. Recently, based on the Mixture of Experts (MoE),
many works have made good progress in multilingual and code-switching ASR, but
present huge computational complexity with the increase of supported languages.
In this work, we propose a computation-efficient network named Language-Routing
Mixture of Experts (LR-MoE) for multilingual and code-switching ASR. LR-MoE
extracts language-specific representations through the Mixture of Language
Experts (MLE), which is guided to learn by a frame-wise language routing
mechanism. The weight-shared frame-level language identification (LID) network
is jointly trained as the shared pre-router of each MoE layer. Experiments show
that the proposed method significantly improves multilingual and code-switching
speech recognition performance over the baseline with comparable computational
efficiency.

Comment: To appear in Proc. INTERSPEECH 2023, August 20-24, 2023, Dublin,
Ireland
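The frame-wise routing idea can be sketched in a few lines. This is a simplified assumption of the mechanism, not the LR-MoE implementation: a shared LID head scores each frame, and only the argmax language expert runs on that frame, which is where the computational saving over a dense mixture comes from.

```python
import numpy as np

rng = np.random.default_rng(0)
DIM, FRAMES, LANGS = 8, 5, 2

router_w = rng.normal(size=(DIM, LANGS))                       # shared frame-level LID head
experts = [rng.normal(size=(DIM, DIM)) for _ in range(LANGS)]  # one expert per language

def lr_moe_layer(frames):
    """Route each frame to its argmax-LID expert and transform it."""
    lid_logits = frames @ router_w                 # (FRAMES, LANGS)
    routes = lid_logits.argmax(axis=1)             # hard, frame-wise routing
    out = np.stack([frames[t] @ experts[routes[t]] for t in range(len(frames))])
    return out, routes

x = rng.normal(size=(FRAMES, DIM))
y, routes = lr_moe_layer(x)
```

In the paper the LID head is additionally trained with a frame-level LID loss and shared across MoE layers; for code-switching speech, frames of different languages inside one utterance are routed to different experts.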
GIRNet: Interleaved Multi-Task Recurrent State Sequence Models
In several natural language tasks, labeled sequences are available in
separate domains (say, languages), but the goal is to label sequences with
mixed domain (such as code-switched text). Or, we may have available models for
labeling whole passages (say, with sentiments), which we would like to exploit
toward better position-specific label inference (say, target-dependent
sentiment annotation). A key characteristic shared across such tasks is that
different positions in a primary instance can benefit from different `experts'
trained from auxiliary data, but labeled primary instances are scarce, and
labeling the best expert for each position entails unacceptable cognitive
burden. We propose GIRNet, a unified position-sensitive multi-task recurrent
neural network (RNN) architecture for such applications. Auxiliary and primary
tasks need not share training instances. Auxiliary RNNs are trained over
auxiliary instances. A primary instance is also submitted to each auxiliary
RNN, but their state sequences are gated and merged into a novel composite
state sequence tailored to the primary inference task. Our approach is in sharp
contrast to recent multi-task networks like the cross-stitch and sluice
network, which do not control state transfer at such fine granularity. We
demonstrate the superiority of GIRNet using three applications: sentiment
classification of code-switched passages, part-of-speech tagging of
code-switched text, and target position-sensitive annotation of sentiment in
monolingual passages. In all cases, we establish new state-of-the-art
performance beyond recent competitive baselines.

Comment: Accepted at AAAI 201
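The position-wise gating can be illustrated with a minimal sketch, simplified from the description above (the gate scores would be learned; here they are given as inputs, and the names are illustrative assumptions):

```python
import numpy as np

def gated_merge(aux_states, gate_logits):
    """Merge auxiliary state sequences position by position.

    aux_states:  (K, T, D) state sequences from K auxiliary RNNs run on
                 the same primary instance.
    gate_logits: (T, K) per-position expert scores.
    Returns a composite (T, D) state sequence for the primary task.
    """
    g = np.exp(gate_logits - gate_logits.max(axis=1, keepdims=True))
    g = g / g.sum(axis=1, keepdims=True)       # softmax over experts, per position
    return np.einsum("tk,ktd->td", g, aux_states)

rng = np.random.default_rng(1)
K, T, D = 2, 4, 6
states = rng.normal(size=(K, T, D))
logits = rng.normal(size=(T, K))
composite = gated_merge(states, logits)
```

Because the softmax is taken per position, each position of the primary instance can lean on a different auxiliary expert, in contrast to cross-stitch or sluice networks, which share at coarser granularity.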
LAMASSU: Streaming Language-Agnostic Multilingual Speech Recognition and Translation Using Neural Transducers
End-to-end formulation of automatic speech recognition (ASR) and speech
translation (ST) makes it easy to use a single model for both multilingual ASR
and many-to-many ST. In this paper, we propose streaming language-agnostic
multilingual speech recognition and translation using neural transducers
(LAMASSU). To enable multilingual text generation in LAMASSU, we conduct a
systematic comparison between specified and unified prediction and joint
networks. We leverage a language-agnostic multilingual encoder that
substantially outperforms shared encoders. To enhance LAMASSU, we propose to
feed target LID to encoders. We also apply connectionist temporal
classification regularization to transducer training. Experimental results show
that LAMASSU not only drastically reduces the model size but also outperforms
monolingual ASR and bilingual ST models.

Comment: Submitted to ICASSP 202
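One ingredient above, feeding the target LID to the encoder, can be sketched by prepending a learned language embedding to the frame sequence. The names and shapes here are illustrative assumptions, not the paper's code:

```python
import numpy as np

rng = np.random.default_rng(3)
DIM, N_LANGS = 8, 4
lid_emb = rng.normal(size=(N_LANGS, DIM))   # one learned embedding per target language

def with_target_lid(frames, lid):
    """Prepend the target-language embedding to (T, DIM) frames as a pseudo-frame,
    so the encoder can condition its output on the desired target language."""
    return np.concatenate([lid_emb[lid][None, :], frames], axis=0)

x = rng.normal(size=(10, DIM))
x_cond = with_target_lid(x, lid=2)
```

Conditioning the encoder (rather than only the prediction network) on the target language is what lets one model cover both ASR (target language = source language) and many-to-many ST.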
Building High-accuracy Multilingual ASR with Gated Language Experts and Curriculum Training
We propose gated language experts and curriculum training to enhance
multilingual transformer transducer models without requiring language
identification (LID) input from users during inference. Our method incorporates
a gating mechanism and LID loss, enabling transformer experts to learn
language-specific information. By combining gated transformer experts with
shared transformer layers, we construct multilingual transformer blocks and
utilize linear experts to effectively regularize the joint network. The
curriculum training scheme leverages LID to guide the gated experts in
improving their respective language performance. Experimental results on a
bilingual task involving English and Spanish demonstrate significant
improvements, with average relative word error reductions of 12.5% and 7.3%
compared to the baseline bilingual and monolingual models, respectively.
Notably, our method achieves performance comparable to the upper-bound model
trained and inferred with oracle LID. Extending our approach to trilingual,
quadrilingual, and pentalingual models reveals similar advantages to those
observed in the bilingual models, highlighting its ease of extension to
multiple languages.
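In contrast to hard frame-wise routing, the gating described above can be read as a soft mixture: every expert runs, a learned LID gate weights their outputs, and the gated sum joins a shared path, so no user-supplied LID is needed at inference. A rough sketch under those assumptions (not the paper's code):

```python
import numpy as np

rng = np.random.default_rng(2)
DIM, LANGS = 8, 3
gate_w = rng.normal(size=(DIM, LANGS))        # internal LID gate (trained with LID loss)
experts = rng.normal(size=(LANGS, DIM, DIM))  # one linear expert per language
shared = rng.normal(size=(DIM, DIM))          # shared transformer path (linearized here)

def gated_block(x):
    """x: (T, DIM) frames. Soft LID gates weight the language experts."""
    gates = np.exp(x @ gate_w)
    gates /= gates.sum(axis=1, keepdims=True)            # per-frame LID posterior
    expert_out = np.einsum("td,ldk->tlk", x, experts)    # run all experts
    mixed = np.einsum("tl,tlk->tk", gates, expert_out)   # gate and sum
    return x @ shared + mixed

x = rng.normal(size=(5, DIM))
y = gated_block(x)
```

The curriculum in the paper first guides these gates with oracle LID, then relaxes them, which is why the final model approaches the oracle-LID upper bound without LID input from users.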
KIT’s IWSLT 2021 Offline Speech Translation System
This paper describes KIT’s submission to the IWSLT 2021 Offline Speech Translation Task. We describe systems for both the cascaded condition and the end-to-end condition. In the cascaded condition, we investigated different end-to-end architectures for the speech recognition module. For the text segmentation module, we trained a small transformer-based model on high-quality monolingual data. For the translation module, our neural machine translation model from last year was reused. In the end-to-end condition, we improved our Speech Relative Transformer architecture to reach or even surpass the result of the cascaded system.