226 research outputs found

    Can Generative Large Language Models Perform ASR Error Correction?

    ASR error correction continues to serve as an important part of post-processing for speech recognition systems. Traditionally, these models are trained with supervised training using the decoding results of the underlying ASR system and the reference text. This approach is computationally intensive and the model needs to be re-trained when switching the underlying ASR model. Recent years have seen the development of large language models and their ability to perform natural language processing tasks in a zero-shot manner. In this paper, we take ChatGPT as an example to examine its ability to perform ASR error correction in the zero-shot or 1-shot settings. We use the ASR N-best list as model input and propose unconstrained error correction and N-best constrained error correction methods. Results on a Conformer-Transducer model and the pre-trained Whisper model show that we can largely improve the ASR system performance with error correction using the powerful ChatGPT model.

    Adapting an Unadaptable ASR System

    As speech recognition model sizes and training data requirements grow, it is increasingly common for systems to only be available via APIs from online service providers rather than having direct access to models themselves. In this scenario it is challenging to adapt systems to a specific target domain. To address this problem we consider the recently released OpenAI Whisper ASR as an example of a large-scale ASR system to assess adaptation methods. An error correction based approach is adopted, as this does not require access to the model, but can be trained from either 1-best or N-best outputs that are normally available via the ASR API. LibriSpeech is used as the primary target domain for adaptation. The generalization ability of the system in two distinct dimensions is then evaluated: first, whether the form of correction model is portable to other speech recognition domains, and second, whether it can be used for ASR models having a different architecture. Comment: submitted to INTERSPEEC

    Adapting an ASR Foundation Model for Spoken Language Assessment

    A crucial part of an accurate and reliable spoken language assessment system is the underlying ASR model. Recently, large-scale pre-trained ASR foundation models such as Whisper have been made available. As the output of these models is designed to be human readable, punctuation is added, numbers are presented in Arabic numeral form and abbreviations are included. Additionally, these models have a tendency to skip disfluencies and hesitations in the output. Though useful for readability, these attributes are not helpful for assessing the ability of a candidate and providing feedback. Here, a precise transcription of what a candidate said is needed. In this paper, we give a detailed analysis of Whisper outputs and propose two solutions: fine-tuning and soft prompt tuning. Experiments are conducted on both public speech corpora and an English learner dataset. Results show that we can effectively alter the decoding behaviour of Whisper to generate the exact words spoken in the response. Comment: Proceedings of SLaT

    N-best T5: Robust ASR Error Correction using Multiple Input Hypotheses and Constrained Decoding Space

    Error correction models form an important part of Automatic Speech Recognition (ASR) post-processing to improve the readability and quality of transcriptions. Most prior works use the 1-best ASR hypothesis as input and therefore can only perform correction by leveraging the context within one sentence. In this work, we propose a novel N-best T5 model for this task, which is fine-tuned from a T5 model and utilizes ASR N-best lists as model input. By transferring knowledge from the pre-trained language model and obtaining richer information from the ASR decoding space, the proposed approach outperforms a strong Conformer-Transducer baseline. Another issue with standard error correction is that the generation process is not well-guided. To address this, a constrained decoding process, based on either the N-best list or an ASR lattice, is used, which allows additional information to be propagated. Comment: submitted to INTERSPEEC

    Studying seabird diet through genetic analysis of faeces: a case study on Macaroni Penguins (Eudyptes chrysolophus)

    Determination of seabird diet usually relies on the analysis of stomach-content remains obtained through stomach flushing; this technique is both invasive and logistically difficult. We evaluate the usefulness of DNA-based faecal analysis in a dietary study on chick-rearing macaroni penguins (Eudyptes chrysolophus) at Heard Island. Conventional stomach-content data were also collected, allowing comparison of the approaches. Methodology/Principal Findings. Prey-specific PCR tests were used to detect dietary DNA in faecal samples and amplified prey DNA was cloned and sequenced. Of the 88 faecal samples collected, 39 contained detectable DNA from one or more of the prey groups targeted with PCR tests. Euphausiid DNA was most commonly detected in the early (guard) stage of chick-rearing, and detection of DNA from the myctophid fish Krefftichthys anderssoni and amphipods became more common in samples collected in the later (crèche) stage. These trends followed those observed in the penguins’ stomach contents. In euphausiid-specific clone libraries the proportion of sequences from the two dominant euphausiid prey species (Euphausia vallentini and Thysanoessa macrura) changed over the sampling period; again, this reflected the trend in the stomach-content data. Analysis of prey sequences in universal clone libraries revealed a higher diversity of fish prey than identified in the stomachs, but non-fish prey were not well represented. Conclusions/Significance. The present study is one of the first to examine the full breadth of a predator’s diet using DNA-based faecal analysis. We discuss methodological difficulties encountered and suggest possible refinements. Overall, the ability of the DNA-based approach to detect temporal variation in the diet of macaroni penguins indicates this non-invasive method will be generally useful for monitoring population-level dietary trends in seabirds.

    Zero-shot Audio Topic Reranking using Large Language Models

    The Multimodal Video Search by Examples (MVSE) project investigates using video clips as the query term for information retrieval, rather than the more traditional text query. This enables far richer search modalities such as images, speaker, content, topic, and emotion. A key element for this process is highly rapid, flexible search to support large archives, which in MVSE is facilitated by representing video attributes by embeddings. This work aims to mitigate any performance loss from this rapid archive search by examining reranking approaches. In particular, zero-shot reranking methods using large language models are investigated as these are applicable to any video archive audio content. Performance is evaluated for topic-based retrieval on a publicly available video archive, the BBC Rewind corpus. Results demonstrate that reranking can achieve improved retrieval ranking without the need for any task-specific training data.

    Identifying Ligand Binding Conformations of the β2-Adrenergic Receptor by Using Its Agonists as Computational Probes

    Recently available G-protein coupled receptor (GPCR) structures and biophysical studies suggest that the difference between the effects of various agonists and antagonists cannot be explained by single structures alone, but rather that the conformational ensembles of the proteins need to be considered. Here we use an elastic network model-guided molecular dynamics simulation protocol to generate an ensemble of conformers of a prototypical GPCR, the β2-adrenergic receptor (β2AR). The resulting conformers are clustered into groups based on the conformations of the ligand binding site, and distinct conformers from each group are assessed for their binding to known agonists of β2AR. We show that the selected ligands bind preferentially to different predicted conformers of β2AR, and identify a role for the β2AR extracellular region as an allosteric binding site for larger drugs such as salmeterol. Thus, drugs and ligands can be used as "computational probes" to systematically identify protein conformers with likely biological significance. © 2012 Isin et al.

    Efficacy and safety of tigecycline monotherapy vs. imipenem/cilastatin in Chinese patients with complicated intra-abdominal infections: a randomized controlled trial

    Background. Tigecycline, a first-in-class broad-spectrum glycylcycline antibiotic, has broad-spectrum in vitro activity against bacteria commonly encountered in complicated intra-abdominal infections (cIAIs), including aerobic and facultative Gram-positive and Gram-negative bacteria and anaerobic bacteria. In the current trial, tigecycline was evaluated for safety and efficacy vs. imipenem/cilastatin in hospitalized Chinese patients with cIAIs. Methods. In this phase 3, multicenter, open-label study, patients were randomly assigned to receive IV tigecycline or imipenem/cilastatin for ≤2 weeks. The primary efficacy endpoints were clinical response at the test-of-cure visit (12-37 days after therapy) for the microbiologic modified intent-to-treat and microbiologically evaluable populations. Because the study was not powered to demonstrate non-inferiority between tigecycline and imipenem/cilastatin, no formal statistical analysis was performed. Two-sided 95% confidence intervals (CIs) were calculated for the response rates in each treatment group and for differences between treatment groups for descriptive purposes. Results. One hundred ninety-nine patients received ≥1 dose of study drug and comprised the modified intent-to-treat population. In the microbiologically evaluable population, 86.5% (45 of 52) of tigecycline- and 97.9% (47 of 48) of imipenem/cilastatin-treated patients were cured at the test-of-cure assessment (12-37 days after therapy); in the microbiologic modified intent-to-treat population, cure rates were 81.7% (49 of 60) and 90.9% (50 of 55), respectively. The overall incidence of treatment-emergent adverse events was 80.4% for tigecycline vs. 53.9% after imipenem/cilastatin therapy (P < 0.001), primarily due to gastrointestinal-related events, especially nausea (21.6% vs. 3.9%; P < 0.001) and vomiting (12.4% vs. 2.0%; P = 0.005). Conclusions. Clinical cure rates for tigecycline were consistent with those found in global cIAI studies. The overall safety profile was also consistent with that observed in global studies of tigecycline for treatment of cIAI, as well as that observed in analyses of Chinese patients in those studies; no novel trends were observed. Trial Registration. ClinicalTrials.gov NCT00136201

    A cross-lingual adaptation approach for rapid development of speech recognizers for learning disabled users

    Building a voice-operated system for learning disabled users is a difficult task that requires a considerable amount of time and effort. Due to the wide spectrum of disabilities and their different related phonopathies, most approaches available are targeted to a specific pathology. This may improve their accuracy for some users, but makes them unsuitable for others. In this paper, we present a cross-lingual approach to adapt a general-purpose modular speech recognizer for learning disabled people. The main advantage of this approach is that it allows rapid and cost-effective development by taking the already built speech recognition engine and its modules, and utilizing existing resources for standard speech in different languages for the recognition of the users’ atypical voices. Although the recognizers built with the proposed technique obtain lower accuracy rates than those trained for specific pathologies, they can be used by a wide population and developed more rapidly, which makes it possible to design various types of speech-based applications accessible to learning disabled users. This research was supported by the project ‘Favoreciendo la vida autónoma de discapacitados intelectuales con problemas de comunicación oral mediante interfaces personalizados de reconocimiento automático del habla’, financed by the Centre of Initiatives for Development Cooperation (Centro de Iniciativas de Cooperación al Desarrollo, CICODE), University of Granada, Spain. This research was supported by the Student Grant Scheme 2014 (SGS) at the Technical University of Liberec.