Dealing with training and test segmentation mismatch: FBK@IWSLT2021
This paper describes FBK's system submission to the IWSLT 2021 Offline Speech
Translation task. We participated with a direct model, which is a
Transformer-based architecture trained to translate English speech audio data
into German texts. The training pipeline is characterized by knowledge
distillation and a two-step fine-tuning procedure. Both knowledge distillation
and the first fine-tuning step are carried out on manually segmented real and
synthetic data, the latter being generated with an MT system trained on the
available corpora. In contrast, the second fine-tuning step is carried out on a
random segmentation of the MuST-C v2 En-De dataset. Its main goal is to reduce
the performance drops occurring when a speech translation model trained on
manually segmented data (i.e. an ideal, sentence-like segmentation) is
evaluated on automatically segmented audio (i.e. actual, more realistic testing
conditions). For the same purpose, a custom hybrid segmentation procedure that
accounts both for audio content (pauses) and for the length of the produced
segments is applied to the test data before passing them to the system. At
inference time, we compared this procedure with a baseline segmentation method
based on Voice Activity Detection (VAD). Our results indicate the effectiveness
of the proposed hybrid approach, shown by a reduction of the gap with manual
segmentation from 8.3 to 1.4 BLEU points.
Comment: Accepted at IWSLT 2021
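The hybrid idea described above (cutting at pauses while keeping segments under a length budget) can be sketched roughly as follows. The pause representation, threshold, and length cap are illustrative assumptions, not the authors' actual procedure or values:

```python
def hybrid_segment(pauses, total_dur, max_len=20.0):
    """Split an audio stream at detected pauses while capping segment length.

    pauses: sorted list of detected pause midpoints (seconds), e.g. from VAD.
    total_dur: total audio duration in seconds.
    Returns a list of (start, end) segments, each at most max_len seconds.
    """
    segments, start = [], 0.0
    while total_dur - start > max_len:
        # candidate pauses that keep the current segment under max_len
        window = [p for p in pauses if start < p <= start + max_len]
        # prefer the latest usable pause; otherwise force a cut at max_len
        cut = window[-1] if window else start + max_len
        segments.append((start, cut))
        start = cut
    segments.append((start, total_dur))
    return segments
```

A pure pause-based (VAD) splitter would cut at every pause regardless of length; the length cap is what makes the segmentation "hybrid" in the sense the abstract describes.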
Whisper-MCE: Whisper Model Finetuned for Better Performance with Mixed Languages
Whisper has recently approached human-level robustness and accuracy in
English automatic speech recognition (ASR), but for minority-language and
mixed-language speech recognition there remains a compelling need for further
improvement. In this work, we present the impressive results of Whisper-MCE,
our finetuned Whisper model, which was trained using our self-collected
dataset, the Mixed Cantonese and English (MCE) audio dataset. Because word
error rate (WER) poses challenges for evaluating effectiveness in
minority-language and mixed-language contexts, we also present a
novel rating mechanism. By comparing our model to the baseline whisper-large-v2
model, we demonstrate its superior ability to accurately capture the content of
the original audio, achieve higher recognition accuracy, and exhibit faster
recognition speed. Notably, our model outperforms other existing models in the
specific task of recognizing mixed languages.
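Since the abstract argues that WER is a poor fit for mixed-language output, it may help to recall how WER is computed. This is the standard edit-distance formulation, not the authors' proposed rating mechanism:

```python
def wer(ref, hyp):
    """Word error rate: word-level edit distance over reference length."""
    r, h = ref.split(), hyp.split()
    # dp[i][j] = edits to turn the first i ref words into the first j hyp words
    dp = [[0] * (len(h) + 1) for _ in range(len(r) + 1)]
    for i in range(len(r) + 1):
        dp[i][0] = i                      # i deletions
    for j in range(len(h) + 1):
        dp[0][j] = j                      # j insertions
    for i in range(1, len(r) + 1):
        for j in range(1, len(h) + 1):
            sub = dp[i - 1][j - 1] + (r[i - 1] != h[j - 1])
            dp[i][j] = min(sub, dp[i - 1][j] + 1, dp[i][j - 1] + 1)
    return dp[len(r)][len(h)] / len(r)
```

Because this metric compares whitespace-separated tokens against a single reference spelling, it penalizes legitimate variation in code-switched transcripts (e.g. script or romanization choices), which is the weakness the abstract's rating mechanism targets.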
Examining the Combination of Multi-Band Processing and Channel Dropout for Robust Speech Recognition
Adapting an ASR Foundation Model for Spoken Language Assessment
A crucial part of an accurate and reliable spoken language assessment system
is the underlying ASR model. Recently, large-scale pre-trained ASR foundation
models such as Whisper have been made available. As the output of these models
is designed to be human readable, punctuation is added, numbers are presented
in Arabic numerals, and abbreviations are included. Additionally, these
models have a tendency to skip disfluencies and hesitations in the output.
Though useful for readability, these attributes are not helpful for assessing
the ability of a candidate and providing feedback. Here a precise transcription
of what a candidate said is needed. In this paper, we give a detailed analysis
of Whisper outputs and propose two solutions: fine-tuning and soft prompt
tuning. Experiments are conducted on both public speech corpora and an English
learner dataset. Results show that we can effectively alter the decoding
behaviour of Whisper to generate the exact words spoken in the response.
Comment: Proceedings of SLaTE
Adapting an Unadaptable ASR System
As speech recognition model sizes and training data requirements grow, it is
increasingly common for systems to only be available via APIs from online
service providers rather than having direct access to models themselves. In
this scenario it is challenging to adapt systems to a specific target domain.
To address this problem we consider the recently released OpenAI Whisper ASR as
an example of a large-scale ASR system to assess adaptation methods. An error
correction based approach is adopted, as this does not require access to the
model, but can be trained from either 1-best or N-best outputs that are
normally available via the ASR API. LibriSpeech is used as the primary target
domain for adaptation. The generalization ability of the system is then
evaluated along two distinct dimensions: first, whether the correction model is
portable to other speech recognition domains, and secondly whether it can be
used for ASR models with a different architecture.
Comment: Submitted to INTERSPEECH
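One way to realize the API-only setup described above is to build sequence-to-sequence training pairs directly from the N-best lists the ASR API returns. A minimal data-preparation sketch, where the field names and separator token are illustrative assumptions:

```python
def make_correction_pairs(api_results, sep=" <sep> "):
    """Turn ASR API output into (input, target) pairs for a seq2seq corrector.

    api_results: list of dicts, each with an N-best 'hypotheses' list and a
    'reference' transcript (illustrative field names, not a real API schema).
    The corrector sees all hypotheses joined by a separator token and learns
    to emit the reference; slicing hypotheses[:1] gives the 1-best variant.
    """
    pairs = []
    for item in api_results:
        src = sep.join(item["hypotheses"])
        pairs.append((src, item["reference"]))
    return pairs
```

Training a corrector on such pairs needs no access to the underlying model's weights or gradients, which is the point of the approach: only decoded outputs cross the API boundary.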
The Many-to-Many Mapping Between the Concordance Correlation Coefficient and the Mean Square Error
We derive the mapping between two of the most pervasive utility functions,
the mean square error (MSE) and the concordance correlation coefficient (CCC).
Despite its drawbacks, MSE is one of the most popular performance metrics (and
loss functions), with CCC lately joining it in many sequence prediction
challenges. Despite their ever-growing simultaneous usage, e.g., in
inter-rater agreement and assay validation, a mapping between the two metrics
has been missing to date. While minimisation of the norm of the errors, or of
its positive powers (e.g., MSE), is aimed at CCC maximisation, we explain the
often-witnessed ineffectiveness of this popular loss function with graphical
illustrations. The discovered formula uncovers not only the counterintuitive
revelation that a lower MSE does not imply a higher CCC, but also provides the
precise range of the CCC metric for a given MSE. We discover the conditions
for CCC optimisation for a given MSE and, as a logical next step, for a given
set of errors. We generalise and discover the conditions for any given p-norm,
for an even p. We present newly discovered, albeit apparent, mathematical
paradoxes. The study inspires and anticipates a growing use of CCC-inspired
loss functions, e.g., 1 - CCC, replacing traditional p-norm loss functions in
multivariate regressions.
Comment: Why this discovery, or the mapping formulation, is important: a lower
MSE does not imply a higher CCC. In other words, MSE minimisation does not
necessarily guarantee CCC maximisation.
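The flavour of such a mapping can be seen from two standard identities (with population moments): MSE = s_x^2 + s_y^2 + (x̄ - ȳ)^2 - 2*s_xy, and CCC = 2*s_xy / (s_x^2 + s_y^2 + (x̄ - ȳ)^2), which together give CCC = 1 - MSE / (s_x^2 + s_y^2 + (x̄ - ȳ)^2). A quick numerical check of this textbook relation (not necessarily the paper's exact formulation):

```python
import numpy as np

def mse(x, y):
    return float(np.mean((x - y) ** 2))

def ccc(x, y):
    # population (biased) moments, as in Lin's CCC definition
    sxy = np.mean((x - x.mean()) * (y - y.mean()))
    denom = x.var() + y.var() + (x.mean() - y.mean()) ** 2
    return float(2 * sxy / denom)

x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([1.1, 1.9, 3.2, 4.4])
denom = x.var() + y.var() + (x.mean() - y.mean()) ** 2
# the mapping: CCC = 1 - MSE / (s_x^2 + s_y^2 + (xbar - ybar)^2)
assert abs(ccc(x, y) - (1 - mse(x, y) / denom)) < 1e-12
```

The denominator depends on the predictions themselves, so two prediction sets with equal MSE can have different CCC, which is one way to see why MSE minimisation need not maximise CCC.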
BLSP: Bootstrapping Language-Speech Pre-training via Behavior Alignment of Continuation Writing
The emergence of large language models (LLMs) has sparked significant
interest in extending their remarkable language capabilities to speech.
However, modality alignment between speech and text remains an open
problem. Current solutions can be categorized into two strategies. One is a
cascaded approach where outputs (tokens or states) of a separately trained
speech recognition system are used as inputs for LLMs, which limits their
potential in modeling alignment between speech and text. The other is an
end-to-end approach that relies on speech instruction data, which is very
difficult to collect in large quantities. In this paper, we address these
issues and propose the BLSP approach that Bootstraps Language-Speech
Pre-training via behavior alignment of continuation writing. We achieve this by
learning a lightweight modality adapter between a frozen speech encoder and an
LLM, ensuring that the LLM exhibits the same generation behavior regardless of
the modality of input: a speech segment or its transcript. The training process
can be divided into two steps. The first step prompts an LLM to generate texts
with speech transcripts as prefixes, obtaining text continuations. In the
second step, these continuations are used as supervised signals to train the
modality adapter in an end-to-end manner. We demonstrate that this
straightforward process can extend the capabilities of LLMs to speech, enabling
speech recognition, speech translation, spoken language understanding, and
speech conversation, even in zero-shot cross-lingual scenarios.
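The recipe above amounts to training only a small projection that maps the frozen speech encoder's hidden states into the LLM's embedding space. A minimal numerical sketch of such a modality adapter, where the MLP shape and all dimensions are illustrative assumptions, not the BLSP architecture:

```python
import numpy as np

def modality_adapter(speech_states, w1, w2):
    """Tiny MLP mapping frozen speech-encoder states (frames, speech_dim)
    into the LLM embedding space (frames, llm_dim). In the training scheme
    described above, only w1/w2 would receive gradient updates; the speech
    encoder and the LLM both stay frozen."""
    h = np.maximum(speech_states @ w1, 0.0)   # ReLU hidden layer
    return h @ w2

# illustrative sizes, not those of any specific encoder or LLM
speech_dim, hidden, llm_dim = 768, 256, 1024
rng = np.random.default_rng(0)
w1 = rng.standard_normal((speech_dim, hidden)) * 0.02
w2 = rng.standard_normal((hidden, llm_dim)) * 0.02
states = rng.standard_normal((50, speech_dim))  # 50 encoder frames
out = modality_adapter(states, w1, w2)
assert out.shape == (50, llm_dim)  # ready to be consumed as LLM input embeddings
```

The supervision signal is the text continuation the LLM itself produced from the transcript, so the adapter is trained to make speech inputs elicit the same generation behavior as their transcripts.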
Can Generative Large Language Models Perform ASR Error Correction?
ASR error correction continues to serve as an important part of
post-processing for speech recognition systems. Traditionally, these models are
trained with supervised training using the decoding results of the underlying
ASR system and the reference text. This approach is computationally intensive
and the model needs to be re-trained when switching the underlying ASR model.
Recent years have seen the development of large language models and their
ability to perform natural language processing tasks in a zero-shot manner. In
this paper, we take ChatGPT as an example to examine its ability to perform ASR
error correction in the zero-shot or 1-shot settings. We use the ASR N-best
list as model input and propose unconstrained error correction and N-best
constrained error correction methods. Results on a Conformer-Transducer model
and the pre-trained Whisper model show that we can largely improve the ASR
system performance with error correction using the powerful ChatGPT model.
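A zero-shot prompt of the kind described, built from an N-best list, might look like the following. The wording and structure are illustrative assumptions, not the paper's exact prompts:

```python
def build_correction_prompt(nbest, constrained=False):
    """Build an LLM prompt for ASR error correction from an N-best list.

    constrained=True restricts the model to picking one hypothesis
    (N-best constrained correction); otherwise the model may generate a
    transcription that differs from every hypothesis (unconstrained).
    """
    hyps = "\n".join(f"{i + 1}. {h}" for i, h in enumerate(nbest))
    if constrained:
        task = "Pick the hypothesis closest to the true transcription."
    else:
        task = ("Using the hypotheses as evidence, write the most likely "
                "true transcription (it may differ from all of them).")
    return f"ASR N-best hypotheses:\n{hyps}\n\n{task}"
```

Disagreements among the N-best entries tell the LLM where the recognizer was uncertain, which is what makes the list more informative than the 1-best output alone.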
Emphatic intonation of Spanish spoken by Russian speakers (Entonación enfática del español hablado por rusohablantes)
In this paper, the first results of an analysis of the emphatic intonation of Spanish spoken by Russians are provided. For the study, 15 hours of spontaneous conversations were recorded with ten native Russian speakers who had an intermediate or advanced level of Spanish. From these recordings, we selected 70 emphatic utterances with a strong emotional charge. For the analysis, we used the Melodic Analysis of Speech (MAS) method, and the contours obtained from this interlanguage were compared with the emphatic patterns of peninsular Spanish. The results provide a characterization of the emphatic intonation of Russian speakers when they speak Spanish.
HyPoradise: An Open Baseline for Generative Speech Recognition with Large Language Models
Advancements in deep neural networks have allowed automatic speech
recognition (ASR) systems to attain human parity on several publicly available
clean speech datasets. However, even state-of-the-art ASR systems experience
performance degradation when confronted with adverse conditions, as a
well-trained acoustic model is sensitive to variations in the speech domain,
e.g., background noise. Intuitively, humans address this issue by relying on
their linguistic knowledge: the meaning of ambiguous spoken terms is usually
inferred from contextual cues thereby reducing the dependency on the auditory
system. Inspired by this observation, we introduce the first open-source
benchmark to utilize external large language models (LLMs) for ASR error
correction, where N-best decoding hypotheses provide informative elements for
true transcription prediction. This approach is a paradigm shift from the
traditional language model rescoring strategy that can only select one
candidate hypothesis as the output transcription. The proposed benchmark
contains a novel dataset, HyPoradise (HP), encompassing more than 334,000 pairs
of N-best hypotheses and corresponding accurate transcriptions across prevalent
speech domains. Given this dataset, we examine three types of error correction
techniques based on LLMs with varying amounts of labeled
hypothesis-transcription pairs, which yield significant word error rate (WER)
reductions. Experimental evidence demonstrates that the proposed technique
achieves a breakthrough by surpassing the upper bound of traditional
re-ranking based methods. More surprisingly, an LLM with a reasonable prompt
can, thanks to its generative capability, even correct tokens that are missing
from the N-best list. We
make our results publicly accessible for reproducible pipelines with released
pre-trained models, thus providing a new evaluation paradigm for ASR error
correction with LLMs.
Comment: Accepted to NeurIPS 2023 (Datasets and Benchmarks Track), 24 pages.
Added the first Mandarin and code-switching (zh-cn and en-us) results from
the LLM-based generative ASR error correction to Table 8 on Page 2