25 research outputs found

    Whisper-AT: Noise-Robust Automatic Speech Recognizers are Also Strong General Audio Event Taggers

    Full text link
    In this paper, we focus on Whisper, a recent automatic speech recognition model trained on a massive 680k-hour labeled speech corpus recorded in diverse conditions. We first show an interesting finding: while Whisper is very robust against real-world background sounds (e.g., music), its audio representation is actually not noise-invariant, but is instead highly correlated with non-speech sounds, indicating that Whisper recognizes speech conditioned on the noise type. Building on this finding, we construct a unified audio tagging and speech recognition model, Whisper-AT, by freezing the backbone of Whisper and training a lightweight audio tagging model on top of it. With <1% extra computational cost, Whisper-AT can recognize audio events, in addition to spoken text, in a single forward pass. Comment: Accepted at Interspeech 2023. Code at https://github.com/yuangongnd/whisper-at
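    The recipe this abstract describes, freezing a pretrained ASR backbone and training only a lightweight tagging head on its hidden states, can be sketched in a few lines of PyTorch. This is a minimal illustration, not the authors' code: the encoder is a stand-in placeholder, and the feature dimension and the 527-class AudioSet-style label space are assumptions.

```python
# Minimal sketch: freeze a pretrained speech encoder, train only a small
# audio tagging head on its hidden states. The encoder is a stand-in.
import torch
import torch.nn as nn

class AudioTaggingHead(nn.Module):
    """Lightweight tagger trained on top of frozen backbone features."""
    def __init__(self, feat_dim: int = 512, n_events: int = 527):
        super().__init__()
        self.classifier = nn.Linear(feat_dim, n_events)

    def forward(self, hidden_states: torch.Tensor) -> torch.Tensor:
        # hidden_states: (batch, time, feat_dim) from the frozen encoder
        pooled = hidden_states.mean(dim=1)      # temporal mean pooling
        return self.classifier(pooled)          # (batch, n_events) logits

# for p in encoder.parameters():               # hypothetical encoder object
#     p.requires_grad = False                  # freeze the ASR backbone

head = AudioTaggingHead()
feats = torch.randn(2, 1500, 512)              # stand-in for encoder output
event_logits = head(feats)                     # audio tags at ~no extra cost
```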

    Direct Text to Speech Translation System using Acoustic Units

    Full text link
    This paper proposes a direct text to speech translation system using discrete acoustic units. The framework takes text in different source languages as input and generates speech in the target language without the need for text transcriptions in that language. Motivated by the success of acoustic units in previous work on direct speech to speech translation systems, we use the same pipeline to extract the acoustic units, using a speech encoder combined with a clustering algorithm. Once the units are obtained, an encoder-decoder architecture is trained to predict them, and a vocoder then generates speech from the units. Our approach to direct text to speech translation was tested on the new CVSS corpus, with two different mBART text models employed for initialisation. The systems presented report competitive performance for most of the language pairs evaluated. Moreover, results show a remarkable improvement when our proposed architecture is initialised with a model pre-trained on more languages. Comment: 5 pages, 4 figures
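    The unit-extraction step described here, a speech encoder followed by a clustering algorithm, is commonly implemented with k-means over frame-level features. Below is a self-contained sketch under that assumption; random arrays stand in for real encoder outputs, and the codebook size of 100 is illustrative.

```python
# Sketch: derive discrete "acoustic units" by k-means clustering of frame-level
# speech-encoder features. Random arrays stand in for real encoder outputs.
import numpy as np
from sklearn.cluster import KMeans

n_units = 100                                    # codebook size (assumption)
train_frames = np.random.randn(5000, 768)        # (n_frames, feat_dim) stand-in
kmeans = KMeans(n_clusters=n_units, n_init=10, random_state=0).fit(train_frames)

def speech_to_units(encoder_feats: np.ndarray) -> np.ndarray:
    """Map frame features to unit IDs, the targets for the encoder-decoder."""
    return kmeans.predict(encoder_feats)

units = speech_to_units(np.random.randn(200, 768))
# A text-to-unit encoder-decoder is trained to predict such sequences,
# and a vocoder then synthesizes the waveform from the predicted units.
```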

    Improved Cross-Lingual Transfer Learning For Automatic Speech Translation

    Full text link
    Research in multilingual speech-to-text translation is topical. Having a single model that supports multiple translation tasks is desirable. The goal of this work is to improve cross-lingual transfer learning in multilingual speech-to-text translation via semantic knowledge distillation. We show that by initializing the encoder of the encoder-decoder sequence-to-sequence translation model with SAMU-XLS-R, a multilingual speech transformer encoder trained using multi-modal (speech-text) semantic knowledge distillation, we achieve significantly better cross-lingual task knowledge transfer than the baseline XLS-R, a multilingual speech transformer encoder trained via self-supervised learning. We demonstrate the effectiveness of our approach on two popular datasets, namely, CoVoST-2 and Europarl. On the 21 translation tasks of the CoVoST-2 benchmark, we achieve an average improvement of 12.8 BLEU points over the baselines. In the zero-shot translation scenario, we achieve average gains of 18.8 and 11.9 BLEU points on unseen medium and low-resource languages. We make similar observations on the Europarl speech translation benchmark.
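    The core idea, initialising the translation model's encoder from a pretrained multilingual speech encoder before fine-tuning, looks roughly like the following PyTorch sketch. The tiny stand-in networks and checkpoint file are hypothetical; the paper's encoders are SAMU-XLS-R (or XLS-R) transformer stacks.

```python
# Sketch: initialise a speech-translation encoder from a pretrained speech
# encoder checkpoint before fine-tuning. Architectures here are toy stand-ins.
import torch
import torch.nn as nn

encoder = nn.Sequential(nn.Linear(80, 256), nn.ReLU(), nn.Linear(256, 256))
decoder = nn.Linear(256, 32000)                  # vocab projection, illustrative

# Pretend this is the pretrained multilingual encoder (SAMU-XLS-R-style);
# here we just save an identically shaped network as a dummy checkpoint.
pretrained = nn.Sequential(nn.Linear(80, 256), nn.ReLU(), nn.Linear(256, 256))
torch.save(pretrained.state_dict(), "pretrained_encoder.pt")

# The initialisation step: copy pretrained weights into the ST encoder,
# then fine-tune the full encoder-decoder on translation data.
state = torch.load("pretrained_encoder.pt", map_location="cpu")
encoder.load_state_dict(state)
```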

    Scenario-Aware Audio-Visual TF-GridNet for Target Speech Extraction

    Full text link
    Target speech extraction aims to extract, based on a given conditioning cue, a target speech signal that is corrupted by interfering sources, such as noise or competing speakers. Building upon the achievements of the state-of-the-art (SOTA) time-frequency speaker separation model TF-GridNet, we propose AV-GridNet, a visually grounded variant that incorporates the face recording of a target speaker as a conditioning factor during the extraction process. Recognizing the inherent dissimilarities between speech and noise signals as interfering sources, we also propose SAV-GridNet, a scenario-aware model that first identifies the type of interfering scenario and then applies a dedicated expert model trained specifically for that scenario. Our proposed model achieves SOTA results on the second COG-MHEAR Audio-Visual Speech Enhancement Challenge, outperforming other models by a significant margin, both objectively and in a listening test. We also perform an extensive analysis of the results under the two scenarios. Comment: Accepted by ASRU 2023
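    The scenario-aware design can be pictured as a two-stage dispatch: classify the interference type first, then route the mixture to the expert trained for that scenario. A toy sketch with untrained placeholder models follows; the shapes and class names are assumptions, not the paper's architecture.

```python
# Sketch of scenario-aware dispatch: a classifier predicts the interference
# type, then a per-scenario expert performs extraction. Models are untrained
# placeholders; real experts would be TF-GridNet-style separation networks.
import torch
import torch.nn as nn

SCENARIOS = ["noise", "competing_speaker"]
scenario_classifier = nn.Sequential(nn.Linear(257, 64), nn.ReLU(),
                                    nn.Linear(64, len(SCENARIOS)))
experts = {s: nn.Linear(257, 257) for s in SCENARIOS}   # one expert per type

def extract(mixture_spec: torch.Tensor) -> torch.Tensor:
    # mixture_spec: (time, freq) spectrogram; the visual cue is omitted here
    logits = scenario_classifier(mixture_spec.mean(dim=0))
    scenario = SCENARIOS[int(logits.argmax())]           # pick the scenario
    return experts[scenario](mixture_spec)               # run its expert

enhanced = extract(torch.randn(100, 257))
```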

    Metastatic Signet-Ring Cell Gastric Carcinoma Masquerading as Breast Primary

    Get PDF
    Metastasis to the breast from an extra-mammary primary is a rare phenomenon; metastasis from gastric carcinoma to the breast is extremely rare. We report a case that initially presented as a mucin-secreting, signet-ring cell tumor of the ovary and, after an interval of 8 months, with metastatic nodules of the breast and chest wall. The covert gastric primary eluded the oncologists at both presentations.

    Extraocular retinoblastoma in Indian children: clinical, imaging and histopathological features

    Get PDF
    AIM: To study eyes with extraocular dissemination of retinoblastoma (EORB), with the following aims: first, to establish the mean lag period and to understand the various reasons for delayed presentation; second, to study their imaging profiles; and third, to analyze the histopathological features of eyes enucleated after neoadjuvant chemotherapy. METHODS: Prospective study of the clinical and imaging features of EORBs (stage III and IV, International Retinoblastoma Staging System) presenting to a tertiary eye care centre. Histopathological features of eyes enucleated after receiving neoadjuvant chemotherapy were analyzed. A pictorial illustration of the varied imaging profile of EORB is also presented. RESULTS: Over a period of one year, 97 eyes were diagnosed with retinoblastoma; 32 children (36 eyes, 37.1%) had EORB. Mean age was 3.6±1.9 years; 71.9% were male, 71.9% were unilateral, 3.1% had a positive family history and 40.6% had metastasis. On imaging, there was extrascleral involvement in 22.2%, involvement of the orbital part of the optic nerve in 33.3%, involvement of the central nervous system in 27.8% and orbital wall involvement in 2.9% of eyes. On histopathological analysis of eyes enucleated after neoadjuvant chemotherapy, 25.0% had no residual viable tumour tissue and all the remaining tumours were poorly differentiated. CONCLUSION: There are very few human malignancies in which definitive treatment is started without a confirmed histopathological diagnosis, and imaging plays an important role in diagnosis and appropriate staging of the disease. Chemotherapy has a variable effect on EORB: 75.0% of eyes with EORB had residual viable tumour tissue when enucleated after receiving neoadjuvant chemotherapy.

    Automatic Dialect Detection in Arabic Broadcast Speech

    Get PDF
    We investigate different approaches for dialect identification in Arabic broadcast speech, using phonetic and lexical features obtained from a speech recognition system, and acoustic features using the i-vector framework. We studied both generative and discriminative classifiers, and we combined these features using a multi-class Support Vector Machine (SVM). We validated our results on an Arabic/English language identification task, with an accuracy of 100%. We used these features in a binary classifier to discriminate between Modern Standard Arabic (MSA) and Dialectal Arabic, with an accuracy of 100%. We further report results using the proposed method to discriminate between the five most widely used dialects of Arabic, namely Egyptian, Gulf, Levantine, North African, and MSA, with an accuracy of 52%. We discuss dialect identification errors in the context of dialect code-switching between Dialectal Arabic and MSA, and compare the error patterns between manually labeled data and the output from our classifier. We also release the train and test data as a standard corpus for dialect identification.
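    The fusion-plus-SVM setup described above can be sketched as follows: concatenate per-utterance feature vectors and fit a multi-class SVM over the five dialect classes. Feature dimensions are illustrative, and random arrays stand in for the real features.

```python
# Sketch: fuse feature streams (e.g. lexical + i-vector) by concatenation and
# train a multi-class SVM over five dialect classes. Data is a random stand-in.
import numpy as np
from sklearn.svm import SVC

DIALECTS = ["Egyptian", "Gulf", "Levantine", "NorthAfrican", "MSA"]
rng = np.random.default_rng(0)

lexical  = rng.normal(size=(500, 100))      # stand-in lexical features
ivectors = rng.normal(size=(500, 400))      # stand-in i-vector features
X = np.hstack([lexical, ivectors])          # simple feature-level fusion
y = rng.integers(0, len(DIALECTS), size=500)

clf = SVC(kernel="linear", decision_function_shape="ovr").fit(X, y)
print(DIALECTS[int(clf.predict(X[:1])[0])])  # predicted dialect for one sample
```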

    The SUMMA Platform Prototype

    Get PDF
    We present the first prototype of the SUMMA Platform: an integrated platform for multilingual media monitoring. The platform contains a rich suite of low-level and high-level natural language processing technologies: automatic speech recognition of broadcast media, machine translation, automated tagging and classification of named entities, semantic parsing to detect relationships between entities, and automatic construction/augmentation of factual knowledge bases. Implemented on the Docker platform, it can easily be deployed, customised, and scaled to large volumes of incoming media streams.
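    Architecturally, the platform chains independent NLP services over each incoming media item. A toy Python sketch of that pipeline idea follows; the stage functions are trivial placeholders, whereas the real platform runs each stage as a Docker service.

```python
# Toy sketch of the pipeline idea: each media item flows through a chain of
# independent NLP stages. Real stages are Docker services, not local functions.
from typing import Callable, Dict, List

Stage = Callable[[Dict], Dict]

def asr(item: Dict) -> Dict:
    item["transcript"] = f"<asr of {item['audio']}>"       # placeholder output
    return item

def translate(item: Dict) -> Dict:
    item["translation"] = f"<mt of {item['transcript']}>"  # placeholder output
    return item

def tag_entities(item: Dict) -> Dict:
    item["entities"] = ["<named entity>"]                  # placeholder output
    return item

PIPELINE: List[Stage] = [asr, translate, tag_entities]

def process(item: Dict) -> Dict:
    for stage in PIPELINE:
        item = stage(item)
    return item

result = process({"audio": "broadcast_001.wav"})
```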