25 research outputs found
Whisper-AT: Noise-Robust Automatic Speech Recognizers are Also Strong General Audio Event Taggers
In this paper, we focus on Whisper, a recent automatic speech recognition
model trained on a massive 680k-hour labeled speech corpus recorded under
diverse conditions. We first present an interesting finding: while Whisper is
very robust against real-world background sounds (e.g., music), its audio
representation is actually not noise-invariant, but is instead highly
correlated with non-speech sounds, indicating that Whisper recognizes speech
conditioned on the noise type. Based on this finding, we build a unified audio
tagging and speech recognition model, Whisper-AT, by freezing the backbone of
Whisper and training a lightweight audio tagging model on top of it. With <1%
extra computational cost, Whisper-AT can recognize audio events, in addition to
spoken text, in a single forward pass.
Comment: Accepted at Interspeech 2023. Code at
https://github.com/yuangongnd/whisper-a
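The recipe described above (freeze the ASR backbone, train only a small tagging head on its representations) can be sketched as follows. Everything here is a toy stand-in, not the actual Whisper-AT code: the "backbone" is a fixed feature function, the head is a simple perceptron-style linear classifier, and the two classes and training data are invented.

```python
import math
import random

random.seed(0)

# Stand-in for a frozen ASR backbone: its parameters are never updated;
# it just maps an input "audio clip" (a scalar here) to a feature vector.
def frozen_backbone(clip):
    # Hypothetical 4-dim representation; real Whisper features are far larger.
    return [math.sin(clip), math.cos(clip), math.sin(2 * clip), 1.0]

# Lightweight tagging head: a single linear layer trained with a
# perceptron-style update. Only these weights are learned.
class TaggingHead:
    def __init__(self, dim, n_classes):
        self.w = [[0.0] * dim for _ in range(n_classes)]

    def scores(self, feats):
        return [sum(wi * f for wi, f in zip(row, feats)) for row in self.w]

    def predict(self, feats):
        s = self.scores(feats)
        return s.index(max(s))

    def train_step(self, feats, label):
        pred = self.predict(feats)
        if pred != label:  # perceptron update, only on mistakes
            for i, f in enumerate(feats):
                self.w[label][i] += f
                self.w[pred][i] -= f

# Toy data: clips near 0 are class 0 ("speech"), near pi/2 are class 1 ("music").
data = [(random.gauss(0.0, 0.2), 0) for _ in range(50)] + \
       [(random.gauss(1.57, 0.2), 1) for _ in range(50)]

head = TaggingHead(dim=4, n_classes=2)
for _ in range(5):  # a few epochs suffice for this toy task
    for clip, label in data:
        head.train_step(frozen_backbone(clip), label)

accuracy = sum(head.predict(frozen_backbone(c)) == l for c, l in data) / len(data)
print(f"toy tagging accuracy: {accuracy:.2f}")
```

Because the backbone is frozen, the extra training and inference cost is just this small head, which mirrors the paper's <1% overhead argument.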
Direct Text to Speech Translation System using Acoustic Units
This paper proposes a direct text-to-speech translation system using discrete
acoustic units. The framework takes text in different source languages as
input and generates speech in the target language without the need for text
transcriptions in that language. Motivated by the success of acoustic units in
previous work on direct speech-to-speech translation systems, we use the same
pipeline to extract the acoustic units, combining a speech encoder with a
clustering algorithm. Once the units are obtained, an encoder-decoder
architecture is trained to predict them, and a vocoder then generates speech
from the units. Our approach to direct text-to-speech translation was tested on
the new CVSS corpus, with two different mBART text models employed as
initialisation. The systems presented report competitive performance for most
of the language pairs evaluated. Moreover, the results show a remarkable
improvement when our proposed architecture is initialised with a model
pre-trained on more languages.
Comment: 5 pages, 4 figures
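The unit-extraction step described above (speech encoder features quantised by a clustering algorithm into discrete acoustic units) can be illustrated with a minimal k-means sketch. The two-dimensional "encoder features", the cluster count, and the data are invented for the example; real systems cluster high-dimensional self-supervised representations.

```python
import random

random.seed(0)

def kmeans(points, k, iters=20):
    """Plain k-means; returns centroids and each frame's discrete unit id."""
    centroids = random.sample(points, k)
    assign = [0] * len(points)
    for _ in range(iters):
        # Assignment step: each frame gets the id of its nearest centroid.
        for i, p in enumerate(points):
            assign[i] = min(
                range(k),
                key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centroids[c])),
            )
        # Update step: move each centroid to the mean of its assigned frames.
        for c in range(k):
            members = [points[i] for i in range(len(points)) if assign[i] == c]
            if members:
                centroids[c] = [sum(dim) / len(members) for dim in zip(*members)]
    return centroids, assign

# Stand-in "encoder features": three loose groups of 2-D frame vectors.
frames = ([[random.gauss(0, .3), random.gauss(0, .3)] for _ in range(30)]
          + [[random.gauss(5, .3), random.gauss(0, .3)] for _ in range(30)]
          + [[random.gauss(0, .3), random.gauss(5, .3)] for _ in range(30)])

centroids, units = kmeans(frames, k=3)
# This discrete unit sequence is what the encoder-decoder learns to predict,
# and what a vocoder would later convert back into a waveform.
print("first ten units:", units[:10])
```

The key property is that continuous frames become a sequence of small integers, so the translation model can treat speech generation as an ordinary discrete sequence prediction task.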
Improved Cross-Lingual Transfer Learning For Automatic Speech Translation
Research in multilingual speech-to-text translation is topical. Having a
single model that supports multiple translation tasks is desirable. The goal of
this work is to improve cross-lingual transfer learning in multilingual
speech-to-text translation via semantic knowledge distillation. We show that by
initializing the encoder of the encoder-decoder sequence-to-sequence
translation model with SAMU-XLS-R, a multilingual speech transformer encoder
trained using multi-modal (speech-text) semantic knowledge distillation, we
achieve significantly better cross-lingual task knowledge transfer than with
the baseline XLS-R, a multilingual speech transformer encoder trained via
self-supervised learning. We demonstrate the effectiveness of our approach on
two popular datasets, namely CoVoST-2 and Europarl. On the 21 translation
tasks of the CoVoST-2 benchmark, we achieve an average improvement of 12.8 BLEU
points over the baselines. In the zero-shot translation scenario, we achieve
average gains of 18.8 and 11.9 BLEU points on unseen medium- and low-resource
languages, respectively. We make similar observations on the Europarl speech
translation benchmark.
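The core idea of semantic knowledge distillation mentioned above can be shown with a toy gradient step: a student speech embedding is nudged toward a frozen semantic (text) teacher embedding of the same utterance. The vectors, loss, and learning rate here are invented for illustration and are not the SAMU-XLS-R training recipe.

```python
# Toy semantic distillation: minimise the MSE between a trainable "speech"
# embedding (student) and a frozen "text" embedding (teacher).

def mse(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)

# Frozen semantic teacher embedding of an utterance (e.g., from a
# multilingual text encoder) and the student's current speech embedding.
teacher = [0.9, -0.2, 0.4]
student = [0.1, 0.5, -0.3]

lr = 0.5
losses = []
for _ in range(20):
    losses.append(mse(student, teacher))
    # Gradient of MSE w.r.t. the student embedding: 2 * (s - t) / n.
    grad = [2 * (s - t) / len(student) for s, t in zip(student, teacher)]
    student = [s - lr * g for s, g in zip(student, grad)]

final_loss = mse(student, teacher)
print(f"distillation loss: {losses[0]:.3f} -> {final_loss:.6f}")
```

After distillation, speech embeddings from different languages land in a shared semantic space, which is what makes the cross-lingual and zero-shot transfer reported above plausible.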
Scenario-Aware Audio-Visual TF-GridNet for Target Speech Extraction
Target speech extraction aims to extract, based on a given conditioning cue,
a target speech signal that is corrupted by interfering sources such as noise
or competing speakers. Building upon the achievements of the state-of-the-art
(SOTA) time-frequency speaker separation model TF-GridNet, we propose
AV-GridNet, a visually grounded variant that incorporates the face recording
of a target speaker as a conditioning factor during the extraction process.
Recognizing the inherent dissimilarities between speech and noise signals as
interfering sources, we also propose SAV-GridNet, a scenario-aware model that
first identifies the type of interfering scenario and then applies a dedicated
expert model trained specifically for that scenario. Our proposed model
achieves SOTA results on the second COG-MHEAR Audio-Visual Speech Enhancement
Challenge, outperforming other models by a significant margin, both
objectively and in a listening test. We also perform an extensive analysis of
the results under the two scenarios.
Comment: Accepted by ASRU 202
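The scenario-aware routing idea above (classify the interference type first, then dispatch to a dedicated expert) can be sketched as a simple two-stage pipeline. The classifier rule and the "expert models" below are placeholders, not the networks from the paper.

```python
# Two-stage, scenario-aware extraction: classify, then route to an expert.

def classify_scenario(mixture):
    # Hypothetical rule: tonal/stationary interference -> "noise" scenario,
    # otherwise assume a competing-speaker scenario.
    return "noise" if mixture.get("interferer_is_tonal") else "speech"

def noise_expert(mixture):
    return f"extracted target from noisy mixture {mixture['id']}"

def speech_expert(mixture):
    return f"extracted target from competing-speaker mixture {mixture['id']}"

EXPERTS = {"noise": noise_expert, "speech": speech_expert}

def extract(mixture):
    scenario = classify_scenario(mixture)
    return scenario, EXPERTS[scenario](mixture)

scenario, output = extract({"id": 1, "interferer_is_tonal": True})
print(scenario, "->", output)
```

The design choice this illustrates: because speech and noise interferers have very different statistics, a specialist trained per scenario can outperform a single model forced to handle both.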
Metastatic Signet-Ring Cell Gastric Carcinoma Masquerading as Breast Primary
Metastasis to the breast from an extra-mammary primary is a rare phenomenon; metastasis from gastric carcinoma to the breast is rarer still. We report a patient who initially presented with a mucin-secreting, signet-ring cell tumor of the ovary and, after an interval of 8 months, with metastatic nodules of the breast and chest wall. The covert gastric primary eluded the oncologists at both presentations.
Extraocular retinoblastoma in Indian children: clinical, imaging and histopathological features
AIM: To study eyes with extraocular dissemination of retinoblastoma (EORB), with the following aims: first, to establish the mean lag period and to understand various reasons for delayed presentation; second, to study their imaging profiles; and third, to analyze the histopathological features of eyes enucleated after neoadjuvant chemotherapy.
METHODS: Prospective study of the clinical and imaging features of EORBs (stages III and IV, International Retinoblastoma Staging System) presenting to a tertiary eye care centre. Histopathological features of eyes enucleated after receiving neoadjuvant chemotherapy were analyzed. A pictorial illustration of the varied imaging profile of EORB was also presented.
RESULTS: Over a period of one year, 97 eyes were diagnosed with retinoblastoma; 32 children (36 eyes, 37.1%) had EORB. Mean age was 3.6±1.9 years; 71.9% of children were male, 71.9% of cases were unilateral, 3.1% had a positive family history and 40.6% had metastasis. On imaging, there was extrascleral involvement in 22.2% of eyes, involvement of the orbital part of the optic nerve in 33.3%, involvement of the central nervous system in 27.8% and orbital wall involvement in 2.9%. On histopathological analysis of eyes enucleated after neoadjuvant chemotherapy, 25.0% had no residual viable tumour tissue and all the remaining tumours were poorly differentiated.
CONCLUSION: There are very few human malignancies where definitive treatment is started without a confirmed histopathological diagnosis, and imaging plays an important role in diagnosis and appropriate staging of the disease. Chemotherapy has a variable effect on EORB: 75.0% of eyes with EORB had residual viable tumour tissue when enucleated after receiving neoadjuvant chemotherapy.
Automatic Dialect Detection in Arabic Broadcast Speech
We investigate different approaches to dialect identification in Arabic
broadcast speech, using phonetic and lexical features obtained from a speech
recognition system, and acoustic features using the i-vector framework. We
studied both generative and discriminative classifiers, and we combined these
features using a multi-class Support Vector Machine (SVM). We validated our
results on an Arabic/English language identification task, achieving an
accuracy of 100%. We used these features in a binary classifier to
discriminate between Modern Standard Arabic (MSA) and Dialectal Arabic, with
an accuracy of 100%. We further report results using the proposed method to
discriminate between the five most widely used dialects of Arabic, namely
Egyptian, Gulf, Levantine, North African, and MSA, with an accuracy of 52%. We
discuss dialect identification errors in the context of dialect code-switching
between Dialectal Arabic and MSA, and compare the error patterns between
manually labeled data and the output of our classifier. We also release the
train and test data as a standard corpus for dialect identification.
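The feature-fusion step above (combining phonetic, lexical and i-vector features in one multi-class classifier) can be sketched as follows. A nearest-centroid classifier stands in for the multi-class SVM, and all feature values and the two dialect labels are invented for the example.

```python
import random

random.seed(0)

def fuse(phonetic, lexical, acoustic):
    # Feature-level fusion: concatenate the per-system feature vectors.
    return phonetic + lexical + acoustic

class NearestCentroid:
    """Toy stand-in for the multi-class SVM used in the paper."""
    def fit(self, X, y):
        self.centroids = {}
        for label in set(y):
            rows = [x for x, l in zip(X, y) if l == label]
            self.centroids[label] = [sum(col) / len(rows) for col in zip(*rows)]
        return self

    def predict(self, x):
        return min(
            self.centroids,
            key=lambda l: sum((a - b) ** 2 for a, b in zip(x, self.centroids[l])),
        )

# Toy training data: two "dialects" whose fused features have different means.
def sample(mean):
    return fuse([random.gauss(mean, .2)],
                [random.gauss(mean, .2)],
                [random.gauss(mean, .2)])

X = [sample(0.0) for _ in range(20)] + [sample(1.0) for _ in range(20)]
y = ["MSA"] * 20 + ["Egyptian"] * 20

clf = NearestCentroid().fit(X, y)
print(clf.predict(fuse([0.05], [0.0], [-0.1])))
```

Concatenation is the simplest fusion strategy; it lets one classifier weigh evidence from all three feature streams jointly, which is the role the SVM plays in the paper.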
The SUMMA Platform Prototype
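The platform described above chains several NLP services (speech recognition, translation, entity tagging, and so on) over incoming media streams. A minimal sketch of such a staged pipeline is below; each stage is a trivial placeholder, not the platform's actual components or API.

```python
# Toy SUMMA-style pipeline: a media item flows through an ordered list of
# processing stages, each of which enriches the item with new annotations.

def asr(item):
    # Placeholder for automatic speech recognition of broadcast media.
    item["transcript"] = f"transcript of {item['stream']}"
    return item

def translate(item):
    # Placeholder for machine translation of the transcript.
    item["translation"] = item["transcript"].upper()
    return item

def tag_entities(item):
    # Placeholder entity tagger: capitalised tokens in the transcript.
    item["entities"] = [w for w in item["transcript"].split() if w[:1].isupper()]
    return item

PIPELINE = [asr, translate, tag_entities]

def process(item):
    for stage in PIPELINE:
        item = stage(item)
    return item

result = process({"stream": "Broadcast-001"})
print(sorted(result))
```

In the real platform each stage would be a separate containerised service; modelling the flow as an ordered list of independent stages is what makes that kind of deployment easy to scale and customise.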
We present the first prototype of the SUMMA Platform: an integrated platform for multilingual media monitoring. The platform contains a rich suite of low-level and high-level natural language processing technologies: automatic speech recognition of broadcast media, machine translation, automated tagging and classification of named entities, semantic parsing to detect relationships between entities, and automatic construction / augmentation of factual knowledge bases. Implemented on the Docker platform, it can easily be deployed, customised, and scaled to large volumes of incoming media streams