
    Anchored Speech Recognition with Neural Transducers

    Neural transducers have achieved human-level performance on standard speech recognition benchmarks. However, their performance degrades significantly in the presence of cross-talk, especially when the primary speaker has a low signal-to-noise ratio (SNR). Anchored speech recognition refers to a class of methods that use information from an anchor segment (e.g., wake words) to recognize device-directed speech while ignoring interfering background speech. In this paper, we investigate anchored speech recognition to make neural transducers robust to background speech. We extract context information from the anchor segment with a tiny auxiliary network, and use encoder biasing and joiner gating to guide the transducer towards the target speech. Moreover, to improve the robustness of context embedding extraction, we propose auxiliary training objectives to disentangle lexical content from speaking style. We evaluate our methods on synthetic LibriSpeech-based mixtures comprising several SNR and overlap conditions; they reduce word error rate by 19.6% relative over a strong baseline, averaged over all conditions. Comment: To appear at IEEE ICASSP 202
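
    The mechanics described above (a context embedding from the anchor segment, used for encoder biasing and joiner gating) can be sketched as follows. This is a minimal illustration with invented shapes and random weights, not the paper's actual architecture:

```python
import numpy as np

rng = np.random.default_rng(0)

D = 8   # model dimension (illustrative)
T = 20  # number of encoder frames

# Hypothetical tiny auxiliary network: mean-pool the anchor (wake-word)
# frames and project them to a fixed context embedding.
anchor_frames = rng.normal(size=(5, D))
W_ctx = rng.normal(size=(D, D)) * 0.1
context = np.tanh(anchor_frames.mean(axis=0) @ W_ctx)  # (D,)

# Encoder biasing: add the projected context to every encoder frame, so
# downstream layers are steered toward the anchor speaker.
encoder_out = rng.normal(size=(T, D))
W_bias = rng.normal(size=(D, D)) * 0.1
biased = encoder_out + context @ W_bias                # (T, D)

# Joiner gating: a scalar gate in (0, 1) derived from the context scales
# the joiner input, letting the model suppress non-target frames.
v_gate = rng.normal(size=(D,)) * 0.1
gate = 1.0 / (1.0 + np.exp(-(context @ v_gate)))       # scalar in (0, 1)
joiner_in = gate * biased

print(joiner_in.shape)  # (20, 8)
```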

    Towards General-Purpose Speech Abilities for Large Language Models Using Unpaired Data

    In this work, we extend the instruction-tuned Llama-2 model with end-to-end, general-purpose speech processing and reasoning abilities while maintaining the wide range of original LLM capabilities, without using any carefully curated paired data. The proposed model can use audio prompts as a replacement for text and sustain a conversation. It also has extended cross-modal capabilities, such as speech question answering, speech translation, and audio summarization, among many other closed- and open-domain tasks. This is unlike prior approaches in speech, in which LLMs are extended to handle audio for a limited number of pre-designated tasks. Experiments show that our end-to-end approach is on par with or outperforms a cascaded system (speech recognizer + LLM) in modeling the response to a prompt. Furthermore, unlike a cascade, our approach can interchange text and audio modalities and use the prior context of a conversation to provide better results.

    Dynamic ASR Pathways: An Adaptive Masking Approach Towards Efficient Pruning of A Multilingual ASR Model

    Neural network pruning offers an effective method for compressing a multilingual automatic speech recognition (ASR) model with minimal performance loss. However, it entails several rounds of pruning and re-training for each language. In this work, we propose an adaptive masking approach for efficiently pruning a multilingual ASR model in two scenarios, yielding either sparse monolingual models or a sparse multilingual model (named Dynamic ASR Pathways). Our approach dynamically adapts the sub-network, avoiding premature decisions about a fixed sub-network structure. We show that our approach outperforms existing pruning methods when targeting sparse monolingual models. Further, we show that Dynamic ASR Pathways jointly discovers and trains better sub-networks (pathways) of a single multilingual model by adapting from different sub-network initializations, thereby reducing the need for language-specific pruning.
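
    The contrast between a fixed pruning mask and an adaptive one can be sketched as below. This is a generic magnitude-pruning illustration under assumed shapes and a stand-in update rule, not the paper's training procedure: the key point is that the mask is re-derived from the evolving weights at every step, so the surviving sub-network can change during training:

```python
import numpy as np

def adaptive_mask(weights, sparsity):
    """Recompute a binary magnitude mask from the current weights."""
    k = int(weights.size * sparsity)
    if k == 0:
        return np.ones_like(weights)
    # Threshold at the k-th smallest magnitude; keep strictly larger entries.
    thresh = np.partition(np.abs(weights).ravel(), k - 1)[k - 1]
    return (np.abs(weights) > thresh).astype(weights.dtype)

rng = np.random.default_rng(0)
w = rng.normal(size=(16, 16))

# Illustrative loop: unlike fixed-mask pruning, the mask is refreshed each
# step instead of being frozen after the first pruning round.
for step in range(3):
    mask = adaptive_mask(w, sparsity=0.5)
    w -= 0.01 * rng.normal(size=w.shape) * mask  # stand-in for a masked update

print(mask.mean())  # roughly half the weights remain active
```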

    Prompting Large Language Models with Speech Recognition Abilities

    Large language models (LLMs) have proven highly flexible, able to solve a wide range of generative tasks such as abstractive summarization and open-ended question answering. In this paper we extend the capabilities of LLMs by directly attaching a small audio encoder, allowing them to perform speech recognition. By prepending a sequence of audio embeddings to the text token embeddings, the LLM can be converted into an automatic speech recognition (ASR) system and used in exactly the same manner as its textual counterpart. Experiments on Multilingual LibriSpeech (MLS) show that incorporating a Conformer encoder into the open-source LLaMA-7B allows it to outperform monolingual baselines by 18% and to perform multilingual speech recognition, despite LLaMA being trained overwhelmingly on English text. Furthermore, we perform ablation studies investigating whether the LLM can be completely frozen during training to maintain its original capabilities, the effect of scaling up the audio encoder, and the effect of increasing the audio encoder stride to generate fewer embeddings. These studies show that multilingual ASR is possible even when the LLM is frozen or when strides of almost 1 second are used in the audio encoder, opening up the possibility of LLMs operating on long-form audio.
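
    The prepending step described above can be sketched in a few lines. All shapes, the strided-subsampling stand-in for the encoder, and the random features are invented for illustration; the real system uses a trained Conformer encoder and LLaMA's embedding table:

```python
import numpy as np

rng = np.random.default_rng(0)

D = 16              # LLM embedding dimension (illustrative)
n_text = 6          # text prompt tokens
n_audio_frames = 40 # encoder output frames before striding

# Hypothetical audio encoder output. A stride of s keeps every s-th frame,
# so larger strides yield fewer embeddings for the LLM to attend over.
audio_feats = rng.normal(size=(n_audio_frames, D))
stride = 8
audio_embeds = audio_feats[::stride]  # (5, D)

# Text token embeddings, as would come from the LLM's embedding table.
text_embeds = rng.normal(size=(n_text, D))

# The ASR-enabled LLM consumes the concatenated sequence
# [audio embeddings ; text embeddings] exactly like a text-only prompt.
llm_input = np.concatenate([audio_embeds, text_embeds], axis=0)
print(llm_input.shape)  # (11, 16)
```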

    How many cases are required to achieve early proficiency in purely off-clamp robot-assisted partial nephrectomy?

    Background and purpose: Off-clamp robot-assisted partial nephrectomy (Offc-RAPN) is a technically challenging procedure that can effectively avoid renal ischemia owing to the absence of hilar vessel preparation and clamping. However, data on the learning curve (LC) for this technique are limited. The purpose of this study was to assess the LC of Offc-RAPN and compare perioperative outcomes between learning phases. Methods: This retrospective study included 50 consecutive patients who underwent purely Offc-RAPN between January 2022 and April 2023. A multidimensional cumulative sum (CUSUM) analysis was used to assess the LC. Spearman's correlation and LOWESS analyses were performed on the continuous perioperative outcome variables. Baseline characteristics and perioperative outcomes were compared using the χ2-test, t-test, and U-test. Results: CUSUM analysis identified two LC phases: phase I (the first 24 cases) and phase II (the subsequent 26 cases). Phase II showed significant reductions in mean operative time (133.5 vs. 115.31 min; p = 0.04), mean console time (103.21 vs. 81.27 min; p = 0.01), and mean postoperative length of stay (5.33 vs. 4.30 days; p = 0.04) compared with phase I. No significant differences were observed in other perioperative outcomes or baseline characteristics between the two LC phases. Conclusions: Offc-RAPN performed by a surgeon experienced in laparoscopic and robotic surgery achieved early proficiency within 24 cases. Moreover, Offc-RAPN is safe and feasible even in the initial phase of the LC for an experienced surgeon.
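
    The CUSUM learning-curve idea used above can be sketched numerically. The operative times below are invented for illustration (the study used multidimensional CUSUM over 50 real cases); the curve peaks at the case where performance shifts from above-average to below-average times, which is read as the proficiency point:

```python
# Illustrative operative times (minutes) for 10 hypothetical cases.
op_times = [150, 145, 160, 140, 138, 120, 118, 115, 112, 110]
mean_time = sum(op_times) / len(op_times)

# CUSUM_i = sum over j <= i of (x_j - mean). Early cases (slower than
# average) push the curve up; later, faster cases pull it back down.
cusum, running = [], 0.0
for x in op_times:
    running += x - mean_time
    cusum.append(running)

# The peak of the CUSUM curve marks the turning point of the learning curve.
turning_point = cusum.index(max(cusum)) + 1  # 1-indexed case number
print(turning_point)  # 5
```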

    Modified hood technique for single-port robot-assisted radical prostatectomy contributes to early recovery of continence

    Background and purpose: Urinary incontinence is one of the common side effects of robot-assisted radical prostatectomy (RARP). Here, we describe the modified hood technique for single-port RARP (sp-RARP) and assess the value of this technique for early continence recovery. Methods: We retrospectively reviewed 24 patients who underwent sp-RARP with the modified hood technique from June 2021 to December 2021. Pre- and intraoperative variables and postoperative functional and oncological outcomes were collected and analyzed. Continence rates were estimated at 0 days, 1 week, 4 weeks, 3 months, and 12 months after catheter removal. Continence was defined as wearing no pad over a 24 h period. Results: Mean operative time and estimated blood loss were 183 min and 170 ml, respectively. The postoperative continence rates at 0 days, 1 week, 4 weeks, 3 months, and 12 months after catheter removal were 41.7%, 54.2%, 75.0%, 91.7%, and 95.8%, respectively. Two patients had positive surgical margins, and no patient experienced complications requiring further treatment. Conclusion: The modified hood technique is a safe and feasible method that provides better outcomes in terms of early return of continence, without increasing estimated blood loss or compromising oncologic outcomes.

    TODM: Train Once Deploy Many Efficient Supernet-Based RNN-T Compression For On-device ASR Models

    Automatic Speech Recognition (ASR) models need to be optimized for specific hardware before they can be deployed on devices. This can be done by tuning the model's hyperparameters or exploring variations in its architecture, but re-training and re-validating models after such changes is resource-intensive. This paper presents TODM (Train Once Deploy Many), a new approach to efficiently train many sizes of hardware-friendly on-device ASR models with GPU-hours comparable to a single training job. TODM leverages insights from prior work on Supernets, in which Recurrent Neural Network Transducer (RNN-T) models share weights within a Supernet. It reduces layer sizes and widths of the Supernet to obtain subnetworks, making them smaller models suitable for all hardware types. We introduce a novel combination of three techniques to improve the outcomes of the TODM Supernet: adaptive dropouts, in-place Alpha-divergence knowledge distillation, and the ScaledAdam optimizer. We validate our approach by comparing Supernet-trained and individually tuned Multi-Head State Space Model (MH-SSM) RNN-T models on LibriSpeech. Results show that our TODM Supernet matches or surpasses the performance of manually tuned models, by up to 3% relative in word error rate (WER), while keeping the cost of training many models at a small constant. Comment: Meta AI; Submitted to ICASSP 202
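
    The supernet-to-subnetwork idea above can be sketched as configuration sampling. The layer counts and widths below are invented for the sketch; the point is that every sampled subnetwork is a truncation of one shared supernet, so a single training run covers all deployment sizes:

```python
import random

# Hypothetical supernet dimensions and the discrete choices each
# subnetwork may take (all invented for illustration).
SUPERNET = {"num_layers": 20, "width": 512}
SUBNET_CHOICES = {
    "num_layers": [12, 16, 20],
    "width": [256, 384, 512],
}

def sample_subnet(rng):
    """Pick one subnetwork configuration. In a TODM-style setup, all such
    configurations share the supernet's weights rather than being trained
    separately."""
    return {key: rng.choice(values) for key, values in SUBNET_CHOICES.items()}

rng = random.Random(0)
subnets = [sample_subnet(rng) for _ in range(4)]

# Every sampled subnetwork fits inside the supernet by construction.
for cfg in subnets:
    assert cfg["num_layers"] <= SUPERNET["num_layers"]
    assert cfg["width"] <= SUPERNET["width"]
print(len(subnets))  # 4
```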

    Role of Optical Coherence Tomography in Diagnosis and Treatment of Patients with Acute Coronary Syndrome

    Acute coronary syndrome (ACS) is the main cause of death worldwide and the leading cause of disease burden in high-income countries. ACS refers to a constellation of clinical symptoms compatible with acute myocardial ischemia and describes a spectrum of clinical manifestations that result from a common pathophysiological process. The most common cause of ACS is rupture of an atherosclerotic lesion containing a large necrotic core and a thin fibrous cap, followed by acute luminal thrombosis. A high-resolution imaging modality would therefore be ideal for detecting high-risk plaques before their disruption and the formation of an occlusive thrombus. Optical coherence tomography has proven to be an invaluable tool in the early detection of high-risk plaques and, in particular, in the understanding of ACS. This review focuses on the current evidence for the role of optical coherence tomography in the diagnosis and treatment of patients with ACS.

    Recombinant proteins A29L, M1R, A35R, and B6R vaccination protects mice from mpox virus challenge

    Since May 2022, mutant strains of mpox (formerly monkeypox) virus (MPXV) have been spreading rapidly among individuals who have not traveled to endemic areas, in multiple locations including Europe and the United States. Both the intracellular and extracellular forms of the virus carry multiple outer membrane proteins that can stimulate an immune response. Here, we investigated the immunogenicity of the MPXV structural proteins A29L, M1R, A35R, and B6R as a combination vaccine, and evaluated the protective effect against the 2022 mpox mutant strain in BALB/c mice. After being mixed with 15 μg of QS-21 adjuvant, all four structural proteins were administered subcutaneously to mice. Antibody titers in mouse sera rose sharply after the first boost, along with an increased capacity of immune cells to produce IFN-γ and an elevated level of Th1-mediated cellular immunity. The vaccine-induced neutralizing antibodies significantly inhibited the replication of MPXV in mice and reduced pathological organ damage. This study demonstrates the feasibility of a multi-component recombinant vaccine against MPXV variant strains.