115 research outputs found

    Exploring Effective Distillation of Self-Supervised Speech Models for Automatic Speech Recognition

    Recent years have witnessed great strides in self-supervised learning (SSL) for speech processing. An SSL model is normally pre-trained on a great variety of unlabelled data, and a large model size is preferred to increase the modeling capacity. However, this might limit its potential applications because of the expensive computation and memory costs introduced by the oversized model. Miniaturization of SSL models has become an important research direction of practical value. To this end, we explore the effective distillation of HuBERT-based SSL models for automatic speech recognition (ASR). First, in order to establish a strong baseline, a comprehensive study of different student model structures is conducted. On top of this, as a supplement to the regression loss widely adopted in previous works, a discriminative loss is introduced for HuBERT to enhance the distillation performance, especially in low-resource scenarios. In addition, we design a simple and effective algorithm to distill the front-end input from waveform to Fbank feature, resulting in a 17% parameter reduction and a doubling of inference speed, at marginal performance degradation. Comment: Submitted to ICASSP 202

    Proximity effect at superconducting Sn-Bi2Se3 interface

    We have investigated the conductance spectra of Sn-Bi2Se3 interface junctions down to 250 mK and in different magnetic fields. A number of conductance anomalies were observed below the superconducting transition temperature of Sn, including a small gap different from that of Sn, and a zero-bias conductance peak that grows at lower temperatures. We discuss the possible origins of the smaller gap and the zero-bias conductance peak. These phenomena support the view that a proximity-effect-induced chiral superconducting phase is formed at the interface between the superconducting Sn and the strong spin-orbit coupling material Bi2Se3. Comment: 7 pages, 8 figures

    Fine-grained Audio-Visual Joint Representations for Multimodal Large Language Models

    Audio-visual large language models (LLMs) have drawn significant attention, yet the fine-grained combination of both input streams is rather under-explored, which is challenging but necessary for LLMs to understand general video inputs. To this end, a fine-grained audio-visual joint representation (FAVOR) learning framework for multimodal LLMs is proposed in this paper, which extends a text-based LLM to simultaneously perceive speech and audio events in the audio input stream and images or videos in the visual input stream, at the frame level. To fuse the audio and visual feature streams into joint representations and to align the joint space with the LLM input embedding space, we propose a causal Q-Former structure with a causal attention module to enhance the capture of causal relations of the audio-visual frames across time. An audio-visual evaluation benchmark (AVEB) is also proposed, which comprises six representative single-modal tasks and five cross-modal tasks reflecting audio-visual co-reasoning abilities. While achieving competitive single-modal performance on audio, speech and image tasks in AVEB, FAVOR achieved over 20% accuracy improvements on the video question-answering task when fine-grained information or temporal causal reasoning is required. In addition, FAVOR demonstrated remarkable video comprehension and reasoning abilities on tasks that are unprecedented for other multimodal LLMs. An interactive demo of FAVOR is available at https://github.com/BriansIDP/AudioVisualLLM.git, and the training code and model checkpoints will be released soon.

    Connecting Speech Encoder and Large Language Model for ASR

    The impressive capability and versatility of large language models (LLMs) have attracted increasing attention in automatic speech recognition (ASR), with several pioneering studies attempting to build integrated ASR models by connecting a speech encoder with an LLM. This paper presents a comparative study of three commonly used structures as connectors: fully connected layers, multi-head cross-attention, and Q-Former. Speech encoders from the Whisper model series as well as LLMs from the Vicuna model series with different model sizes were studied. Experiments were performed on the commonly used LibriSpeech, Common Voice, and GigaSpeech datasets, where the LLMs with Q-Formers demonstrated consistent and considerable word error rate (WER) reductions over LLMs with other connector structures. Q-Former-based LLMs can generalise well to out-of-domain datasets, where 12% relative WER reductions over the Whisper baseline ASR model were achieved on the Eval2000 test set without using any in-domain training data from Switchboard. Moreover, a novel segment-level Q-Former is proposed to enable LLMs to recognise speech segments with a duration exceeding the limitation of the encoders, which results in 17% relative WER reductions over other connector structures on 90-second-long speech data.

    A case report of adrenocorticotropic hormone to treat recurrent focal segmental glomerular sclerosis post-transplantation and biomarker monitoring

    Background: Recurrent focal segmental glomerular sclerosis (rFSGS) in renal transplant recipients (RTR) is difficult to predict and treat. Early rFSGS is likely from circulating factors and preformed antibodies. Methods: We present the case of a 23-year-old white man who presented with rFSGS and acute renal failure requiring dialysis 9 months after a 1-haplotype matched living-related transplant. We retrospectively analyzed serum samples from various clinical stages for rFSGS biomarkers: serum glomerular albumin permeability (Palb), soluble urokinase-type plasminogen activator receptor (suPAR) serum level with suPAR-β3 integrin signaling on human podocytes, and angiotensin II type I receptor-antibody (AT1R-Ab) titer. Results: All biomarkers were abnormal at 1 year pre-transplant prior to initiation of dialysis and at the time of transplant. After initiation of hemodialysis, β3 integrin activity on human podocytes, in response to patient serum, as well as AT1R-Ab were further elevated. At the time of biopsy-proven recurrence, all biomarkers were abnormally high. One week after therapy with aborted plasmapheresis (secondary to intolerance) and high-dose steroids, the Palb and suPAR-β3 integrin activity remained significantly positive. After 12 weeks of treatment with high-dose steroids, rituximab, and galactose, the patient remained hemodialysis-dependent. Three months after his initial presentation we commenced adrenocorticotropic hormone (ACTH, Acthar® Gel), 80 units subcutaneously twice weekly. Four weeks later he was able to discontinue dialysis. After 8 months of maintenance ACTH therapy, his serum creatinine stabilized at 1.79 mg/dL with less than 1 gram of proteinuria. Conclusion: ACTH therapy was associated with improvement in renal function within 4 weeks. The use of rFSGS biomarkers may aid in predicting development of rFSGS.

    Podocyte-Specific Overexpression of Wild Type or Mutant Trpc6 in Mice Is Sufficient to Cause Glomerular Disease

    Mutations in the TRPC6 calcium channel (Transient receptor potential channel 6) gene have been associated with familial forms of Focal and Segmental Glomerulosclerosis (FSGS) affecting children and adults. In addition, acquired glomerular diseases are associated with increased expression levels of TRPC6. However, the exact role of TRPC6 in the pathogenesis of FSGS remains to be elucidated. In this work we describe the generation and phenotypic characterization of three different transgenic mouse lines with podocyte-specific overexpression of the wild type or either of two mutant forms of Trpc6 (P111Q and E896K) previously related to FSGS. Consistent with the human phenotype, a non-nephrotic range of albuminuria was detectable in almost all transgenic lines. The histological analysis demonstrated that the transgenic mice developed a kidney disease similar to human FSGS. Differences of 2- to 3-fold in the presence of glomerular lesions were found between the non-transgenic mice and the transgenic mice expressing Trpc6 in its wild type or mutant forms specifically in podocytes. Electron microscopy of glomeruli from transgenic mice showed extensive podocyte foot process effacement. We conclude that overexpression of Trpc6 (wild type or mutated) in podocytes is sufficient to cause a kidney disease consistent with FSGS. Our results reinforce the central role of podocytes in the etiology of FSGS. These mice constitute an important new model in which to study future therapies and outcomes of this complex disease.