Anchored Speech Recognition with Neural Transducers
Neural transducers have achieved human-level performance on standard speech
recognition benchmarks. However, their performance significantly degrades in
the presence of cross-talk, especially when the primary speaker has a low
signal-to-noise ratio. Anchored speech recognition refers to a class of methods
that use information from an anchor segment (e.g., wake-words) to recognize
device-directed speech while ignoring interfering background speech. In this
paper, we investigate anchored speech recognition to make neural transducers
robust to background speech. We extract context information from the anchor
segment with a tiny auxiliary network, and use encoder biasing and joiner
gating to guide the transducer towards the target speech. Moreover, to improve
the robustness of context embedding extraction, we propose auxiliary training
objectives to disentangle lexical content from speaking style. We evaluate our
methods on synthetic LibriSpeech-based mixtures comprising several SNR and
overlap conditions; they improve relative word error rates by 19.6% over a
strong baseline, when averaged over all conditions.
Comment: To appear at IEEE ICASSP 202
Towards General-Purpose Speech Abilities for Large Language Models Using Unpaired Data
In this work, we extend the instruction-tuned Llama-2 model with end-to-end
general-purpose speech processing and reasoning abilities while maintaining the
wide range of LLM capabilities, without using any carefully curated paired
data. The proposed model can utilize audio prompts as a replacement for text
and sustain a conversation. Such a model also has extended cross-modal
capabilities such as being able to perform speech question answering, speech
translation, and audio summarization amongst many other closed and open-domain
tasks. This is unlike prior approaches in speech, in which LLMs are extended to
handle audio for a limited number of pre-designated tasks. Experiments show
that our end-to-end approach is on par with or outperforms a cascaded system
(speech recognizer + LLM) in terms of modeling the response to a prompt.
Furthermore, unlike a cascade, our approach shows the ability to interchange
text and audio modalities and utilize the prior context in a conversation to
provide better results.
Dynamic ASR Pathways: An Adaptive Masking Approach Towards Efficient Pruning of A Multilingual ASR Model
Neural network pruning offers an effective method for compressing a
multilingual automatic speech recognition (ASR) model with minimal performance
loss. However, it requires several rounds of pruning and re-training to be run
for each language. In this work, we propose an adaptive masking approach for
pruning a multilingual ASR model efficiently in two scenarios, resulting in
either sparse monolingual models or a sparse multilingual model (named
Dynamic ASR Pathways). Our approach dynamically
adapts the sub-network, avoiding premature decisions about a fixed sub-network
structure. We show that our approach outperforms existing pruning methods when
targeting sparse monolingual models. Further, we illustrate that Dynamic ASR
Pathways jointly discovers and trains better sub-networks (pathways) of a
single multilingual model by adapting from different sub-network
initializations, thereby reducing the need for language-specific pruning.
Synthesis and Electrochemical and Photophysical Studies of Tetrathiafulvalene-Annulated Phthalocyanines
Prompting Large Language Models with Speech Recognition Abilities
Large language models have proven themselves highly flexible, able to solve a
wide range of generative tasks, such as abstractive summarization and
open-ended question answering. In this paper we extend the capabilities of LLMs
by directly attaching a small audio encoder, allowing them to perform speech
recognition. By directly prepending a sequence of audial embeddings to the text
token embeddings, the LLM can be converted to an automatic speech recognition
(ASR) system, and be used in the exact same manner as its textual counterpart.
Experiments on Multilingual LibriSpeech (MLS) show that incorporating a
conformer encoder into the open sourced LLaMA-7B allows it to outperform
monolingual baselines by 18% and perform multilingual speech recognition
despite LLaMA being trained overwhelmingly on English text. Furthermore, we
perform ablation studies to investigate whether the LLM can be completely
frozen during training to maintain its original capabilities, scaling up the
audio encoder, and increasing the audio encoder striding to generate fewer
embeddings. The results from these studies show that multilingual ASR is
possible even when the LLM is frozen or when strides of almost 1 second are
used in the audio encoder, opening up the possibility for LLMs to operate on
long-form audio.
How many cases are required to achieving early proficiency in purely off-clamp robot-assisted partial nephrectomy?
Background and purpose: Off-clamp robot-assisted partial nephrectomy (Offc-RAPN) is a technically challenging procedure that can effectively avoid renal ischemia owing to the absence of hilar vessel preparation and clamping. However, data on the learning curve (LC) for this technique are limited. The purpose of this study was to assess the LC of Offc-RAPN and compare the perioperative outcomes between different learning phases.
Methods: This retrospective study included 50 consecutive patients who underwent purely Offc-RAPN between January 2022 and April 2023. The multidimensional cumulative sum (CUSUM) analysis method was used to assess the LC. Spearman's correlation and LOWESS analysis were performed to analyze the continuous variables of perioperative outcomes. Baseline characteristics and perioperative outcomes were compared using the χ² test, t-test and U-test.
Results: CUSUM analysis identified two LC phases: phase I (the first 24 cases) and phase II (the subsequent 26 cases). Phase II showed significant reductions in mean operative time (133.5 vs. 115.31 min; p = 0.04), mean console time (103.21 vs. 81.27 min; p = 0.01), and mean postoperative length of stay (5.33 vs. 4.30 days; p = 0.04) compared to phase I. However, no significant differences were observed in other perioperative outcomes or baseline characteristics between the two LC phases.
Conclusions: Offc-RAPN performed by a surgeon with experience in laparoscopic and robotic surgeries achieved early proficiency in 24 cases. Moreover, Offc-RAPN alone is safe and feasible even in the initial phase of the LC for an experienced surgeon.
Modified hood technique for single-port robot-assisted radical prostatectomy contributes to early recovery of continence
Background and purpose: Urinary incontinence is one of the common side effects of robot-assisted radical prostatectomy (RARP). Here, we describe the modified Hood technique for single-port RARP (sp-RARP) and assess its value for early continence recovery.
Methods: We retrospectively reviewed 24 patients who underwent sp-RARP with the modified hood technique from June 2021 to December 2021. The pre- and intraoperative variables and the postoperative functional and oncological outcomes of patients were collected and analyzed. The continence rates were estimated at 0 days, 1 week, 4 weeks, 3 months and 12 months after catheter removal. Continence was defined as wearing no pad over a 24 h period.
Results: Mean time of operation and estimated blood loss were 183 min and 170 ml, respectively. The postoperative continence rates at 0 days, 1 week, 4 weeks, 3 months and 12 months after catheter removal were 41.7%, 54.2%, 75.0%, 91.7% and 95.8%, respectively. Positive surgical margins were detected in two patients, and no patient had complications requiring further treatment.
Conclusion: The modified hood technique is a safe and feasible method that provides better outcomes in terms of early return of continence, without increasing estimated blood loss or compromising oncologic outcomes.
TODM: Train Once Deploy Many Efficient Supernet-Based RNN-T Compression For On-device ASR Models
Automatic Speech Recognition (ASR) models need to be optimized for specific
hardware before they can be deployed on devices. This can be done by tuning the
model's hyperparameters or exploring variations in its architecture.
Re-training and re-validating models after making these changes can be a
resource-intensive task. This paper presents TODM (Train Once Deploy Many), a
new approach to efficiently train many sizes of hardware-friendly on-device ASR
models with comparable GPU-hours to that of a single training job. TODM
leverages insights from prior work on Supernet, where Recurrent Neural Network
Transducer (RNN-T) models share weights within a Supernet. It reduces layer
sizes and widths of the Supernet to obtain subnetworks, making them smaller
models suitable for all hardware types. We introduce a novel combination of
three techniques to improve the outcomes of the TODM Supernet: adaptive
dropouts, an in-place Alpha-divergence knowledge distillation, and the use of
the ScaledAdam optimizer. We validate our approach by comparing Supernet-trained
versus individually tuned Multi-Head State Space Model (MH-SSM) RNN-T using
LibriSpeech. Results demonstrate that our TODM Supernet either matches or
surpasses the performance of manually tuned models, by up to 3% relative
word error rate (WER), while keeping the cost of training many models at a
small constant.
Comment: Meta AI; Submitted to ICASSP 202
Role of Optical Coherence Tomography in Diagnosis and Treatment of Patients with Acute Coronary Syndrome
Acute coronary syndrome (ACS) is the main cause of death worldwide and the leading cause of disease burden in high-income countries. ACS refers to a constellation of clinical symptoms that are compatible with acute myocardial ischemia. It describes a spectrum of clinical manifestations that result from a common pathophysiological process. The most common cause of ACS is rupture of an atherosclerotic lesion containing a large necrotic core and a thin fibrous cap, followed by acute luminal thrombosis. It was thought that a high-resolution imaging modality would be ideal to detect high-risk plaques before their disruption and the formation of an occlusive thrombus. Optical coherence tomography has proven to be an invaluable tool in early detection of high-risk plaques and particularly in the understanding of ACS. This review focuses on the current evidence for the role of optical coherence tomography in the diagnosis and treatment of patients with ACS.
Recombinant proteins A29L, M1R, A35R, and B6R vaccination protects mice from mpox virus challenge
Since May 2022, mutant strains of mpox (formerly monkeypox) virus (MPXV) have been rapidly spreading among individuals who have not traveled to endemic areas in multiple locations, including Europe and the United States. Both intracellular and extracellular forms of mpox virus have multiple outer membrane proteins that can stimulate an immune response. Here, we investigated the immunogenicity of the MPXV structural proteins A29L, M1R, A35R, and B6R as a combination vaccine and evaluated its protective effect against the 2022 mpox mutant strain in BALB/c mice. After mixing with 15 μg of QS-21 adjuvant, all four virus structural proteins were administered subcutaneously to mice. Antibody titers in mouse sera rose sharply after the initial boost, accompanied by an increased capacity of immune cells to produce IFN-γ and an elevated level of cellular immunity mediated by Th1 cells. The vaccine-induced neutralizing antibodies significantly inhibited the replication of MPXV in mice and reduced the pathological damage of organs. This study demonstrates the feasibility of a multi-component recombinant vaccine for MPXV variant strains.