
    Dealing with training and test segmentation mismatch: FBK@IWSLT2021

    This paper describes FBK's system submission to the IWSLT 2021 Offline Speech Translation task. We participated with a direct model, a Transformer-based architecture trained to translate English speech audio into German text. The training pipeline is characterized by knowledge distillation and a two-step fine-tuning procedure. Both knowledge distillation and the first fine-tuning step are carried out on manually segmented real and synthetic data, the latter generated with an MT system trained on the available corpora. In contrast, the second fine-tuning step is carried out on a random segmentation of the MuST-C v2 En-De dataset. Its main goal is to reduce the performance drop that occurs when a speech translation model trained on manually segmented data (i.e. an ideal, sentence-like segmentation) is evaluated on automatically segmented audio (i.e. actual, more realistic testing conditions). For the same purpose, a custom hybrid segmentation procedure that accounts both for audio content (pauses) and for the length of the produced segments is applied to the test data before passing them to the system; a minimal sketch of this idea is given below. At inference time, we compared this procedure with a baseline segmentation method based on Voice Activity Detection (VAD). Our results indicate the effectiveness of the proposed hybrid approach, shown by a reduction of the gap with manual segmentation from 8.3 to 1.4 BLEU points. Comment: Accepted at IWSLT2021
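
    The hybrid segmentation procedure is only described at a high level in the abstract; a minimal sketch of the general idea (cut at pauses, then cap segment length) might look like the following. The span list, thresholds, and function name are illustrative assumptions, not FBK's implementation.

```python
# Hypothetical sketch of hybrid segmentation: cut at pauses first, then enforce
# a maximum segment length so no segment exceeds what the model saw in training.
from typing import List, Tuple

def hybrid_segment(
    speech_spans: List[Tuple[float, float]],  # (start, end) of voiced regions, in seconds
    max_len: float = 20.0,                     # assumed upper bound on segment duration
    min_pause: float = 0.3,                    # pauses shorter than this are bridged
) -> List[Tuple[float, float]]:
    """Merge voiced spans across short pauses, then split overly long segments."""
    segments: List[Tuple[float, float]] = []
    for start, end in speech_spans:
        if segments and start - segments[-1][1] < min_pause:
            # Pause too short to be a reliable boundary: extend the previous segment.
            prev_start, _ = segments.pop()
            start = prev_start
        segments.append((start, end))

    # Length-based pass: hard-split any segment longer than max_len.
    bounded: List[Tuple[float, float]] = []
    for start, end in segments:
        while end - start > max_len:
            bounded.append((start, start + max_len))
            start += max_len
        bounded.append((start, end))
    return bounded

# Example: two voiced regions separated by a 0.2 s pause are merged, then split at 20 s.
print(hybrid_segment([(0.0, 12.0), (12.2, 35.0)], max_len=20.0))
```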

    Whisper-MCE: Whisper Model Finetuned for Better Performance with Mixed Languages

    Recently, Whisper has approached human-level robustness and accuracy in English automatic speech recognition (ASR), while in minority-language and mixed-language speech recognition there remains a compelling need for further improvement. In this work, we present the results of Whisper-MCE, our fine-tuned Whisper model, trained on our self-collected Mixed Cantonese and English (MCE) audio dataset. Since word error rate (WER) poses challenges for evaluating effectiveness in minority-language and mixed-language contexts, we also present a novel rating mechanism. Comparing our model to the baseline whisper-large-v2 model, we demonstrate its superior ability to accurately capture the content of the original audio, its higher recognition accuracy, and its faster recognition speed. Notably, our model outperforms other existing models on the specific task of recognizing mixed-language speech.
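
    As a rough illustration of the fine-tuning setup described above, a minimal single-example training step with Hugging Face Transformers might look as follows. The MCE data loading, learning rate, and optimiser choice are assumptions, since the abstract does not specify them.

```python
# Hedged sketch of one Whisper fine-tuning step (not the authors' training code).
import torch
from transformers import WhisperProcessor, WhisperForConditionalGeneration

processor = WhisperProcessor.from_pretrained("openai/whisper-large-v2")
model = WhisperForConditionalGeneration.from_pretrained("openai/whisper-large-v2")
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)

def train_step(audio_array, sampling_rate, transcript):
    # Convert raw audio to log-Mel features and the transcript to label ids.
    inputs = processor(audio_array, sampling_rate=sampling_rate, return_tensors="pt")
    labels = processor.tokenizer(transcript, return_tensors="pt").input_ids
    # Real fine-tuning needs label padding/masking and care with Whisper's
    # special prefix tokens; this single-example step omits that for brevity.
    outputs = model(input_features=inputs.input_features, labels=labels)
    outputs.loss.backward()
    optimizer.step()
    optimizer.zero_grad()
    return outputs.loss.item()
```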

    Adapting an ASR Foundation Model for Spoken Language Assessment

    A crucial part of an accurate and reliable spoken language assessment system is the underlying ASR model. Recently, large-scale pre-trained ASR foundation models such as Whisper have been made available. As the output of these models is designed to be human readable, punctuation is added, numbers are presented in Arabic numeric form and abbreviations are included. Additionally, these models have a tendency to skip disfluencies and hesitations in the output. Though useful for readability, these attributes are not helpful for assessing the ability of a candidate and providing feedback; here, a precise transcription of what a candidate said is needed. In this paper, we give a detailed analysis of Whisper outputs and propose two solutions: fine-tuning and soft prompt tuning. Experiments are conducted on both public speech corpora and an English learner dataset. Results show that we can effectively alter the decoding behaviour of Whisper to generate the exact words spoken in the response. Comment: Proceedings of SLaTE
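
    The soft prompt tuning solution can be sketched schematically: a small set of learnable prompt vectors is prepended to the decoder's token embeddings while the foundation model itself stays frozen. The module below is an illustrative assumption of that wiring, not the paper's code; the 1280 width simply matches whisper-large's model dimension.

```python
# Schematic soft prompt tuning: only the prompt parameters receive gradients.
import torch
import torch.nn as nn

class SoftPrompt(nn.Module):
    """Learnable prompt vectors prepended to the decoder's token embeddings."""
    def __init__(self, n_prompt_tokens: int, d_model: int):
        super().__init__()
        self.prompt = nn.Parameter(torch.randn(n_prompt_tokens, d_model) * 0.02)

    def forward(self, token_embeds: torch.Tensor) -> torch.Tensor:
        # token_embeds: (batch, seq_len, d_model) from the frozen ASR decoder
        batch = token_embeds.size(0)
        prompt = self.prompt.unsqueeze(0).expand(batch, -1, -1)
        return torch.cat([prompt, token_embeds], dim=1)

soft_prompt = SoftPrompt(n_prompt_tokens=16, d_model=1280)
dummy_embeds = torch.randn(2, 10, 1280)         # stand-in for decoder embeddings
print(soft_prompt(dummy_embeds).shape)          # torch.Size([2, 26, 1280])
```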

    Adapting an Unadaptable ASR System

    As speech recognition model sizes and training data requirements grow, it is increasingly common for systems to only be available via APIs from online service providers rather than having direct access to the models themselves. In this scenario it is challenging to adapt systems to a specific target domain. To address this problem we consider the recently released OpenAI Whisper ASR as an example of a large-scale ASR system to assess adaptation methods. An error-correction-based approach is adopted, as this does not require access to the model, but can be trained from either 1-best or N-best outputs that are normally available via the ASR API. LibriSpeech is used as the primary target domain for adaptation. The generalization ability of the system is then evaluated in two distinct dimensions: first, whether the form of the correction model is portable to other speech recognition domains, and second, whether it can be used for ASR models with a different architecture. Comment: submitted to INTERSPEECH
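
    Since the correction model is trained purely from ASR API outputs, data preparation can be as simple as pairing the returned hypotheses with reference transcripts from the target domain. The snippet below is a hypothetical sketch of that pairing; the field names and separator token are assumptions, not the paper's format.

```python
# Hypothetical data preparation for an error-correction model trained on API outputs.
from typing import List, Dict

def build_correction_example(nbest: List[str], reference: str, sep: str = " <hyp> ") -> Dict[str, str]:
    """Concatenate N-best hypotheses into one source string for a seq2seq corrector."""
    return {
        "source": sep.join(nbest),   # for the 1-best variant, pass nbest[:1]
        "target": reference,         # manual transcript from the adaptation domain
    }

example = build_correction_example(
    nbest=["he is apt at math", "he is apt at mass", "he is adept at math"],
    reference="he is adept at math",
)
print(example["source"])
```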

    The Many-to-Many Mapping Between the Concordance Correlation Coefficient and the Mean Square Error

    We derive the mapping between two of the most pervasive utility functions, the mean square error ($MSE$) and the concordance correlation coefficient (CCC, $\rho_c$). Despite its drawbacks, $MSE$ is one of the most popular performance metrics (and loss functions), joined lately by $\rho_c$ in many sequence prediction challenges. Despite their ever-growing simultaneous usage, e.g., in inter-rater agreement and assay validation, a mapping between the two metrics has been missing to date. While minimisation of the $L_p$ norm of the errors, or of its positive powers (e.g., $MSE$), is aimed at $\rho_c$ maximisation, we explain the often-witnessed ineffectiveness of this popular loss function with graphical illustrations. The discovered formula not only uncovers the counterintuitive revelation that $MSE_1 < MSE_2$ does not imply $\rho_{c_1} > \rho_{c_2}$, but also provides the precise range of the $\rho_c$ metric for a given $MSE$. We discover the conditions for $\rho_c$ optimisation for a given $MSE$ and, as a logical next step, for a given set of errors. We generalise and discover the conditions for any given $L_p$ norm, for even $p$. We present newly discovered, albeit apparent, mathematical paradoxes. The study inspires and anticipates a growing use of $\rho_c$-inspired loss functions, e.g., $\left|\frac{MSE}{\sigma_{XY}}\right|$, replacing the traditional $L_p$-norm loss functions in multivariate regressions. Comment: Why this discovery, or the mapping formulation, is important: $MSE_1 < MSE_2$ does not imply $\rho_{c_1} > \rho_{c_2}$. In other words, MSE minimisation does not necessarily guarantee CCC maximisation.
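
    A sketch of why such a mapping exists, using only the standard definitions of the two metrics for paired variables $X$ and $Y$ (the paper's full treatment of ranges, signs, and paradoxes is not reproduced here):

```latex
% Standard definitions and the resulting MSE <-> CCC relation (sketch only).
\begin{align}
  MSE    &= \mathbb{E}\big[(X - Y)^2\big]
          = \sigma_X^2 + \sigma_Y^2 - 2\sigma_{XY} + (\mu_X - \mu_Y)^2, \\
  \rho_c &= \frac{2\sigma_{XY}}{\sigma_X^2 + \sigma_Y^2 + (\mu_X - \mu_Y)^2}
          = \frac{2\sigma_{XY}}{MSE + 2\sigma_{XY}}
          = \frac{2}{\frac{MSE}{\sigma_{XY}} + 2}.
\end{align}
% Hence, for a fixed sign of the covariance, maximising rho_c corresponds to
% minimising MSE / sigma_XY rather than MSE alone, which is consistent with the
% |MSE / sigma_XY|-style loss mentioned in the abstract.
```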

    BLSP: Bootstrapping Language-Speech Pre-training via Behavior Alignment of Continuation Writing

    The emergence of large language models (LLMs) has sparked significant interest in extending their remarkable language capabilities to speech. However, modality alignment between speech and text remains an open problem. Current solutions can be categorized into two strategies. One is a cascaded approach, where outputs (tokens or states) of a separately trained speech recognition system are used as inputs for LLMs, which limits their potential in modeling alignment between speech and text. The other is an end-to-end approach that relies on speech instruction data, which is very difficult to collect in large quantities. In this paper, we address these issues and propose the BLSP approach that Bootstraps Language-Speech Pre-training via behavior alignment of continuation writing. We achieve this by learning a lightweight modality adapter between a frozen speech encoder and an LLM, ensuring that the LLM exhibits the same generation behavior regardless of the modality of input: a speech segment or its transcript. The training process can be divided into two steps. The first step prompts an LLM to generate texts with speech transcripts as prefixes, obtaining text continuations. In the second step, these continuations are used as supervised signals to train the modality adapter in an end-to-end manner. We demonstrate that this straightforward process can extend the capabilities of LLMs to speech, enabling speech recognition, speech translation, spoken language understanding, and speech conversation, even in zero-shot cross-lingual scenarios.
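
    The "lightweight modality adapter" can be pictured as a small trainable module that downsamples and projects frozen speech-encoder states into the LLM embedding space. The sketch below is an illustrative assumption of such a module; the dimensions, stride, and layer choices are not taken from the BLSP paper.

```python
# Schematic modality adapter: speech encoder states -> LLM embedding space.
import torch
import torch.nn as nn

class ModalityAdapter(nn.Module):
    """Downsample and project speech encoder states to LLM input embeddings."""
    def __init__(self, speech_dim: int, llm_dim: int, stride: int = 4):
        super().__init__()
        self.downsample = nn.Conv1d(speech_dim, llm_dim, kernel_size=stride, stride=stride)
        self.proj = nn.Sequential(nn.GELU(), nn.Linear(llm_dim, llm_dim))

    def forward(self, speech_states: torch.Tensor) -> torch.Tensor:
        # speech_states: (batch, time, speech_dim) from the frozen speech encoder
        x = self.downsample(speech_states.transpose(1, 2)).transpose(1, 2)
        return self.proj(x)  # (batch, time // stride, llm_dim), fed to the frozen LLM

adapter = ModalityAdapter(speech_dim=1024, llm_dim=4096)
print(adapter(torch.randn(2, 400, 1024)).shape)  # torch.Size([2, 100, 4096])
```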

    Can Generative Large Language Models Perform ASR Error Correction?

    ASR error correction continues to serve as an important part of post-processing for speech recognition systems. Traditionally, these models are trained in a supervised manner on the decoding results of the underlying ASR system and the reference text. This approach is computationally intensive, and the model needs to be re-trained when switching the underlying ASR model. Recent years have seen the development of large language models and their ability to perform natural language processing tasks in a zero-shot manner. In this paper, we take ChatGPT as an example to examine its ability to perform ASR error correction in zero-shot or 1-shot settings. We use the ASR N-best list as model input and propose unconstrained error correction and N-best constrained error correction methods. Results on a Conformer-Transducer model and the pre-trained Whisper model show that we can largely improve ASR system performance with error correction using the powerful ChatGPT model.
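
    The two proposed settings can be illustrated with prompt construction alone: an unconstrained prompt asks the LLM to produce a corrected transcription, while the N-best constrained prompt asks it to pick one hypothesis verbatim. The templates below are hypothetical wordings for illustration, not the paper's exact prompts.

```python
# Hypothetical zero-shot prompts built from an ASR N-best list.
from typing import List

def unconstrained_prompt(nbest: List[str]) -> str:
    hyps = "\n".join(f"{i + 1}. {h}" for i, h in enumerate(nbest))
    return (
        "The following are N-best hypotheses from a speech recogniser.\n"
        f"{hyps}\n"
        "Report the most likely correct transcription, fixing any recognition errors."
    )

def constrained_prompt(nbest: List[str]) -> str:
    hyps = "\n".join(f"{i + 1}. {h}" for i, h in enumerate(nbest))
    return (
        "The following are N-best hypotheses from a speech recogniser.\n"
        f"{hyps}\n"
        "Select the single hypothesis most likely to be correct and return it verbatim."
    )

print(unconstrained_prompt(["i scream for ice cream", "eye scream for ice cream"]))
```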

    Entonación enfática del español hablado por rusohablantes (Emphatic intonation of Spanish spoken by Russian speakers)

    This article presents the first results of an analysis of the emphatic intonation of Spanish spoken by Russian speakers. For the study, we recorded 15 hours of spontaneous conversation with ten native Russian speakers who had an intermediate or advanced level of Spanish. From these recordings, we selected 70 emphatic utterances with a strong emotional charge. For the analysis, we used the Melodic Analysis of Speech (MAS) method, and the contours obtained from this interlanguage were compared with the emphatic patterns of peninsular Spanish. The results characterize the emphatic intonation of Russian speakers when they speak Spanish.

    HyPoradise: An Open Baseline for Generative Speech Recognition with Large Language Models

    Advancements in deep neural networks have allowed automatic speech recognition (ASR) systems to attain human parity on several publicly available clean speech datasets. However, even state-of-the-art ASR systems experience performance degradation when confronted with adverse conditions, as a well-trained acoustic model is sensitive to variations in the speech domain, e.g., background noise. Intuitively, humans address this issue by relying on their linguistic knowledge: the meaning of ambiguous spoken terms is usually inferred from contextual cues, thereby reducing the dependency on the auditory system. Inspired by this observation, we introduce the first open-source benchmark to utilize external large language models (LLMs) for ASR error correction, where N-best decoding hypotheses provide informative elements for true transcription prediction. This approach is a paradigm shift from the traditional language model rescoring strategy, which can only select one candidate hypothesis as the output transcription. The proposed benchmark contains a novel dataset, HyPoradise (HP), encompassing more than 334,000 pairs of N-best hypotheses and corresponding accurate transcriptions across prevalent speech domains. Given this dataset, we examine three types of LLM-based error correction techniques with varying amounts of labeled hypothesis-transcription pairs, which yield significant word error rate (WER) reductions. Experimental evidence demonstrates that the proposed technique achieves a breakthrough by surpassing the upper bound of traditional re-ranking based methods. More surprisingly, an LLM with a reasonable prompt and its generative capability can even correct tokens that are missing from the N-best list. We make our results publicly accessible for reproducible pipelines with released pre-trained models, thus providing a new evaluation paradigm for ASR error correction with LLMs. Comment: Accepted to NeurIPS 2023, 24 pages. Datasets and Benchmarks Track. Added the first Mandarin and code-switching (zh-cn and en-us) results from the LLM-based generative ASR error correction to Table 8 on Page 2
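
    To make the "re-ranking upper bound" concrete: the oracle WER of an N-best list is the error rate obtained if a re-ranker could always select the best hypothesis, and generative correction can in principle go below it. A small sketch of that comparison is shown below, using the jiwer package and made-up data for illustration only.

```python
# Compare 1-best WER, N-best oracle WER (re-ranking upper bound), and the WER
# of a hypothetical LLM-corrected output. Data here is invented for illustration.
import jiwer

def nbest_oracle_wer(references, nbest_lists):
    """Best achievable WER if a re-ranker could always pick the best hypothesis."""
    picked = [min(nbest, key=lambda h: jiwer.wer(ref, h))
              for ref, nbest in zip(references, nbest_lists)]
    return jiwer.wer(references, picked)

refs = ["turn the lights off in the kitchen"]
nbests = [["turn the light off in the kitchen", "turn the lights of in the kitchen"]]
corrected = ["turn the lights off in the kitchen"]  # hypothetical LLM correction

print("1-best WER :", jiwer.wer(refs, [n[0] for n in nbests]))
print("oracle WER :", nbest_oracle_wer(refs, nbests))
print("LLM WER    :", jiwer.wer(refs, corrected))
```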