
    SpeechUT: Bridging Speech and Text with Hidden-Unit for Encoder-Decoder Based Speech-Text Pre-training

    The rapid development of single-modal pre-training has prompted researchers to pay more attention to cross-modal pre-training methods. In this paper, we propose a unified-modal speech-unit-text pre-training model, SpeechUT, to connect the representations of a speech encoder and a text decoder with a shared unit encoder. Leveraging hidden units as an interface to align speech and text, we can decompose the speech-to-text model into a speech-to-unit model and a unit-to-text model, which can be jointly pre-trained with unpaired speech and text data, respectively. The proposed SpeechUT is fine-tuned and evaluated on automatic speech recognition (ASR) and speech translation (ST) tasks. Experimental results show that SpeechUT achieves substantial improvements over strong baselines and state-of-the-art performance on both the LibriSpeech ASR and MuST-C ST tasks. To better understand SpeechUT, detailed analyses are conducted. The code and pre-trained models are available at https://aka.ms/SpeechUT.
    Comment: 14 pages, accepted by EMNLP 202
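    A minimal sketch of the architecture described above, assuming a PyTorch implementation: a speech encoder and a text decoder connected through a shared unit encoder, with discrete hidden units as the interface. This is not the released SpeechUT code (see https://aka.ms/SpeechUT); the module sizes, vocabularies, and layer counts are illustrative assumptions.

```python
import torch.nn as nn

class SpeechUTSketch(nn.Module):
    """Hypothetical layout: speech encoder -> shared unit encoder -> text decoder."""

    def __init__(self, n_units=500, n_text_tokens=10000, d_model=768):
        super().__init__()
        # speech-to-unit branch: acoustic features (e.g. 80-dim filterbanks) -> unit space
        self.speech_encoder = nn.Sequential(nn.Linear(80, d_model), nn.GELU(),
                                            nn.Linear(d_model, d_model))
        self.unit_embedding = nn.Embedding(n_units, d_model)  # discrete hidden units
        enc_layer = nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True)
        self.unit_encoder = nn.TransformerEncoder(enc_layer, num_layers=6)  # shared
        dec_layer = nn.TransformerDecoderLayer(d_model, nhead=8, batch_first=True)
        self.text_decoder = nn.TransformerDecoder(dec_layer, num_layers=6)
        self.text_embedding = nn.Embedding(n_text_tokens, d_model)
        self.unit_head = nn.Linear(d_model, n_units)         # speech-to-unit targets
        self.text_head = nn.Linear(d_model, n_text_tokens)   # unit-to-text targets

    def speech_to_unit(self, fbank):
        # pre-trained on unpaired speech: predict the speech's own hidden-unit labels
        h = self.unit_encoder(self.speech_encoder(fbank))
        return self.unit_head(h)

    def unit_to_text(self, units, text_in):
        # pre-trained on unpaired text: units derived from the text are decoded back
        # into the original token sequence
        memory = self.unit_encoder(self.unit_embedding(units))
        out = self.text_decoder(self.text_embedding(text_in), memory)
        return self.text_head(out)
```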

    VATLM: Visual-Audio-Text Pre-Training with Unified Masked Prediction for Speech Representation Learning

    Although speech is a simple and effective way for humans to communicate with the outside world, a more realistic speech interaction contains multimodal information, e.g., vision and text. How to design a unified framework that integrates different modalities and leverages different resources (e.g., visual-audio pairs, audio-text pairs, unlabeled speech, and unlabeled text) to facilitate speech representation learning has not been well explored. In this paper, we propose a unified cross-modal representation learning framework, VATLM (Visual-Audio-Text Language Model). The proposed VATLM employs a unified backbone network to model the modality-independent information and utilizes three simple modality-dependent modules to preprocess visual, speech, and text inputs. To integrate these three modalities into one shared semantic space, VATLM is optimized with a masked prediction task over unified tokens, given by our proposed unified tokenizer. We evaluate the pre-trained VATLM on audio-visual downstream tasks, including audio-visual speech recognition (AVSR) and visual speech recognition (VSR). Results show that the proposed VATLM outperforms previous state-of-the-art models, such as the audio-visual pre-trained AV-HuBERT model, and analysis also demonstrates that VATLM is capable of aligning different modalities into the same space. To facilitate future research, we release the code and pre-trained models at https://aka.ms/vatlm.
    Comment: 10 pages
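    A rough sketch of the structure just described, assuming a PyTorch implementation: three lightweight modality-dependent front-ends feed one shared backbone, which is trained by masked prediction of unified tokens. This is not the released VATLM code (see https://aka.ms/vatlm); the front-end feature sizes, token vocabulary, and masking details are assumptions for illustration.

```python
import torch.nn as nn
import torch.nn.functional as F

class VATLMSketch(nn.Module):
    def __init__(self, n_unified_tokens=1000, d_model=768):
        super().__init__()
        # modality-dependent preprocessing modules (shapes are illustrative)
        self.visual_frontend = nn.Linear(512, d_model)   # e.g. lip-ROI features
        self.speech_frontend = nn.Linear(80, d_model)    # e.g. log-Mel filterbanks
        self.text_frontend = nn.Embedding(30000, d_model)
        # modality-independent shared backbone
        layer = nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True)
        self.backbone = nn.TransformerEncoder(layer, num_layers=12)
        # predict the unified-token id at every position
        self.unified_head = nn.Linear(d_model, n_unified_tokens)

    def forward(self, x, modality):
        frontend = {"visual": self.visual_frontend,
                    "speech": self.speech_frontend,
                    "text": self.text_frontend}[modality]
        return self.unified_head(self.backbone(frontend(x)))

def masked_prediction_loss(logits, unified_targets, mask):
    # cross-entropy only at masked positions, the usual masked-prediction objective
    return F.cross_entropy(logits[mask], unified_targets[mask])
```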

    Probing the limits of optical cycling in a predissociative diatomic molecule

    Molecular predissociation is the spontaneous, nonradiative bond-breaking process that can occur upon excitation. In the context of laser cooling, predissociation is an unwanted consequence of molecular structure that limits the ability to scatter the large number of photons required to reach the ultracold regime. Unlike rovibrational branching, predissociation is irreversible, since the fragments fly apart with high kinetic energy. Of particular interest is the simple diatomic molecule CaH, for which the two lowest electronically excited states used in laser cooling lie above the dissociation threshold of the ground potential. In this work, we present measurements and calculations that quantify the predissociation probabilities affecting the cooling cycle. The results allow us to design a laser cooling scheme that will enable the creation of an ultracold and optically trapped cloud of CaH molecules. In addition, we use the results to propose a two-photon pathway to controlled dissociation of the molecules, in order to gain access to their ultracold fragments, including hydrogen.
    Comment: 16 pages, 4 figures
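    A back-of-the-envelope illustration of why an irreversible loss channel caps the photon budget: if each excitation predissociates with probability p, the molecule survives N scattering events with probability (1 - p)^N, and the mean number of photons scattered before loss is roughly 1/p. The probability used below is purely illustrative, not the measured CaH value reported in the paper.

```python
def survival_probability(p_predissociation: float, n_photons: int) -> float:
    # probability of surviving n_photons excitation/emission cycles without predissociating
    return (1.0 - p_predissociation) ** n_photons

def expected_photons(p_predissociation: float) -> float:
    # mean number of photons scattered before the molecule is lost
    return 1.0 / p_predissociation

if __name__ == "__main__":
    p = 1e-5  # illustrative per-excitation predissociation probability (not the CaH value)
    for n in (1_000, 10_000, 100_000):
        print(f"N = {n:>7}: survival probability = {survival_probability(p, n):.3f}")
    print(f"expected photon budget ~ {expected_photons(p):.0f}")
```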

    Experimental realization of a highly secure chaos communication under strong channel noise

    A one-way coupled spatiotemporally chaotic map lattice is used to construct a cryptosystem. By combining chaotic computations with conventional algebraic operations, our system achieves cryptographic properties much better than those obtained by applying known chaotic or conventional methods separately. We have carried out experiments on duplex secure voice communication over a realistic wired Public Switched Telephone Network, using our chaotic system and the Advanced Encryption Standard (AES), respectively, for cryptography. Our system works stably under strong channel noise where AES fails to work.
    Comment: 15 pages, 5 figures
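    To make the idea of a one-way coupled chaotic map lattice concrete, here is a toy keystream generator combined with a conventional algebraic operation (XOR). This is emphatically not the authors' cryptosystem: the abstract does not specify the maps, coupling, key schedule, or algebraic layer, and the toy below has none of the claimed security or noise-robustness properties.

```python
def keystream(key: float, length: int, sites: int = 8, coupling: float = 0.9):
    # one-way coupled lattice of logistic maps: site j is driven only by site j-1
    x = [(key + 0.1 * i) % 1.0 for i in range(sites)]        # lattice initial state
    out = []
    for _ in range(length):
        f = [4.0 * v * (1.0 - v) for v in x]                 # local chaotic (logistic) map
        x = [f[0]] + [(1 - coupling) * f[j] + coupling * f[j - 1] for j in range(1, sites)]
        out.append(int(x[-1] * 256) % 256)                   # coarse-grain the last site to a byte
    return out

def xor_cipher(data: bytes, key: float) -> bytes:
    # conventional algebraic layer: XOR the data with the chaotic keystream
    return bytes(b ^ k for b, k in zip(data, keystream(key, len(data))))

msg = b"duplex voice frame"
enc = xor_cipher(msg, key=0.314159)
assert xor_cipher(enc, key=0.314159) == msg                  # same key decrypts
```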

    SpeechLM: Enhanced Speech Pre-Training with Unpaired Textual Data

    How to boost speech pre-training with textual data remains an unsolved problem, because speech and text are very different modalities with distinct characteristics. In this paper, we propose a cross-modal Speech and Language Model (SpeechLM) to explicitly align speech and text pre-training with a pre-defined unified discrete representation. Specifically, we introduce two alternative discrete tokenizers to bridge the speech and text modalities, a phoneme-unit tokenizer and a hidden-unit tokenizer, both of which can be trained using a small amount of paired speech-text data. Based on the trained tokenizers, we convert the unlabeled speech and text data into phoneme-unit or hidden-unit tokens. The pre-training objective is designed to unify the speech and the text into the same discrete semantic space with a unified Transformer network. Leveraging only 10K text sentences, our SpeechLM achieves a 16% relative WER reduction over the best base model performance (from 6.8 to 5.7) on the public LibriSpeech ASR benchmark. Moreover, SpeechLM with fewer parameters even outperforms previous SOTA models on the CoVoST-2 speech translation tasks. We also evaluate SpeechLM on various spoken language processing tasks under the universal representation evaluation framework SUPERB, demonstrating significant improvements on content-related tasks. Our code and models are available at https://aka.ms/SpeechLM.
    Comment: 14 pages
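    A data-flow sketch of the tokenizer idea described above, assuming a Python implementation: a tokenizer trained on a small paired corpus maps both unlabeled speech and unlabeled text into one discrete unit vocabulary, so a single Transformer can be pre-trained on the merged token streams. This is not the released SpeechLM code (see https://aka.ms/SpeechLM); the interface and names are hypothetical.

```python
from typing import List, Protocol

class UnitTokenizer(Protocol):
    # trained on a small amount of paired speech-text data; the paper describes
    # two variants, a phoneme-unit tokenizer and a hidden-unit tokenizer
    def speech_to_units(self, features) -> List[int]: ...
    def text_to_units(self, text: str) -> List[int]: ...

def build_pretraining_corpus(tokenizer: UnitTokenizer,
                             unlabeled_speech, unlabeled_text) -> List[List[int]]:
    corpus = []
    for utterance in unlabeled_speech:      # speech side: acoustic features -> units
        corpus.append(tokenizer.speech_to_units(utterance))
    for sentence in unlabeled_text:         # text side: sentences -> units
        corpus.append(tokenizer.text_to_units(sentence))
    return corpus                           # both modalities now share one discrete vocabulary
```

    As a quick consistency check on the reported numbers, the quoted 16% relative WER reduction matches the stated absolute figures: (6.8 - 5.7) / 6.8 is about 16.2%.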

    Ras-induced Epigenetic Inactivation of the RRAD (Ras-related Associated with Diabetes) Gene Promotes Glucose Uptake in a Human Ovarian Cancer Model

    Background: Increased glucose uptake is essential for carcinogenesis. Results: Ras(V12)-induced epigenetic inactivation of RRAD promotes glucose uptake and tumor formation. Conclusion: RRAD might act as a functional tumor suppressor by inhibiting glucose uptake. Significance: Down-regulation of RRAD in tumor tissues might be associated with the Warburg effect. RRAD (Ras-related associated with diabetes) is a small Ras-related GTPase that is frequently inactivated by DNA methylation of the CpG island in its promoter region in cancer tissues. However, the role of methylation-induced RRAD inactivation in tumorigenesis remains unclear. In this study, the Ras-regulated transcriptome and epigenome were profiled by comparing T29H (a Ras(V12)-transformed human ovarian epithelial cell line) with T29 (an immortalized but non-transformed cell line) through reduced representation bisulfite sequencing and digital gene expression profiling. We found that Ras(V12)-mediated oncogenic transformation was accompanied by RRAD promoter hypermethylation and a concomitant loss of RRAD expression. In addition, we found that the RRAD promoter was hypermethylated and its transcription was reduced in ovarian cancer versus normal ovarian tissues. Treatment with the DNA methyltransferase inhibitor 5-aza-2'-deoxycytidine demethylated the RRAD promoter and restored RRAD expression in T29H cells. Additionally, treatment with the farnesyltransferase inhibitor FTI277 restored RRAD expression and inhibited DNA methyltransferase expression and activity in T29H cells. By employing knockdown and overexpression in T29 and T29H cells, respectively, we found that RRAD inhibited glucose uptake and lactate production by repressing the expression of glucose transporters. Finally, RRAD overexpression in T29H cells inhibited tumor formation in nude mice, suggesting that RRAD is a tumor suppressor gene. Our results indicate that Ras(V12)-mediated oncogenic transformation induces RRAD epigenetic inactivation, which in turn promotes glucose uptake and may contribute to ovarian cancer tumorigenesis.

    Effect of megarectum on postoperative defecation of female patients with congenital rectovestibular fistula or rectoperineal fistula

    Background: To assess the effect of megarectum on postoperative defecation in female patients with congenital rectovestibular fistula or rectoperineal fistula.
    Methods: From March 2013 to February 2021, 74 female patients with congenital rectovestibular fistula or rectoperineal fistula were treated. The patients' ages ranged from 3 months to 1 year. Barium enema and spinal cord MRI were performed in all children. Four patients were excluded from the study because of spinal cord and sacral agenesis. The remaining 70 patients underwent one-stage anterior sagittal anorectoplasty (ASARP). Anal endoscopy and anorectal manometry were performed 1 year after surgery. Patients were divided into two groups according to the presence of megarectum, (+) or (−), and observed for constipation and anal sphincter function.
    Results: 16 patients (4 months to 1 year) had megarectum and 54 patients (3 months to 9 months) did not. Incision infection was seen in 3 patients. All patients were followed up for 1 to 5 years. Fecal soiling was seen in 2 patients and constipation in 14 patients. Among the 16 patients with megarectum, soiling was seen in 1 patient and constipation in 12 patients. Among the 54 patients without megarectum, soiling was seen in 1 patient and constipation in 2 patients. There was a significant difference in the incidence of postoperative constipation between the two groups (megarectum (+) 75% vs. megarectum (−) 3.7%, P < 0.05). However, there was no significant difference in the anal sphincter score between the two groups (P > 0.05), nor in anal resting pressure (P = 0.49) or length of the anal high-pressure zone (P = 0.76). 7 patients with constipation and megarectum acquired normal anal function after the dilated rectum was resected.
    Conclusion: Megarectum increases the likelihood of difficult postoperative defecation in patients with congenital rectovestibular fistula or rectoperineal fistula. However, constipation was not associated with the postoperative effect of ASARP on sphincter function. Resection of the megarectum helps improve constipation.
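    As a quick sanity check on the reported comparison (constipation in 12 of 16 patients with megarectum vs. 2 of 54 without), a Fisher's exact test on the 2x2 table gives a p-value far below 0.05. The abstract does not state which test the authors used, so this is only an illustrative recomputation.

```python
from scipy.stats import fisher_exact

table = [[12, 16 - 12],   # megarectum (+): constipated / not constipated
         [2, 54 - 2]]     # megarectum (−): constipated / not constipated
odds_ratio, p_value = fisher_exact(table)
print(f"odds ratio = {odds_ratio:.1f}, p = {p_value:.2e}")   # p is far below 0.05
```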

    Pre-Training Transformer Decoder for End-to-End ASR Model with Unpaired Speech Data

    This paper studies a novel pre-training technique with unpaired speech data, Speech2C, for encoder-decoder based automatic speech recognition (ASR). Within a multi-task learning framework, we introduce two pre-training tasks for the encoder-decoder network using acoustic units, i.e., pseudo codes, derived from an offline clustering model. One task predicts the pseudo codes via masked language modeling on the encoder output, as in the HuBERT model, while the other lets the decoder learn to reconstruct the pseudo codes autoregressively instead of generating text transcripts. In this way, the decoder learns to reconstruct original speech information with codes before learning to generate correct text. Comprehensive experiments on the LibriSpeech corpus show that the proposed Speech2C reduces the word error rate (WER) by 19.2% relative over the method without decoder pre-training, and also significantly outperforms the state-of-the-art wav2vec 2.0 and HuBERT models on the 10h and 100h fine-tuning subsets.
    Comment: Submitted to INTERSPEECH 202
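    A compact sketch of the multi-task objective just described, assuming a PyTorch implementation: a HuBERT-style masked prediction loss on the encoder output plus an autoregressive pseudo-code reconstruction loss on the decoder. This is an illustration of the idea, not the authors' implementation; the tensor shapes and loss weighting are assumptions.

```python
import torch.nn.functional as F

def speech2c_pretraining_loss(encoder_logits, decoder_logits, pseudo_codes, mask, alpha=1.0):
    """encoder_logits, decoder_logits: (B, T, n_codes); pseudo_codes: (B, T); mask: (B, T) bool."""
    # Task 1: HuBERT-style masked prediction of pseudo codes from the encoder output,
    # computed only at masked positions.
    masked_lm = F.cross_entropy(encoder_logits[mask], pseudo_codes[mask])
    # Task 2: the decoder reconstructs the pseudo-code sequence autoregressively
    # (teacher-forced), instead of generating text during pre-training.
    reconstruction = F.cross_entropy(decoder_logits.transpose(1, 2), pseudo_codes)
    return masked_lm + alpha * reconstruction
```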