
    CommonAccent: Exploring Large Acoustic Pretrained Models for Accent Classification Based on Common Voice

    Despite recent advancements in Automatic Speech Recognition (ASR), the recognition of accented speech remains a dominant problem. In order to create more inclusive ASR systems, research has shown that integrating accent information into a larger ASR framework can mitigate accented speech errors. We address multilingual accent classification through the ECAPA-TDNN and Wav2Vec 2.0/XLSR architectures, which have been proven to perform well on a variety of speech-related downstream tasks. We introduce a simple-to-follow recipe, aligned with the SpeechBrain toolkit, for accent classification based on Common Voice 7.0 (English) and Common Voice 11.0 (Italian, German, and Spanish). Furthermore, we establish a new state of the art for English accent classification, with accuracy as high as 95%. We also study the internal categorization of the Wav2Vec 2.0 embeddings through t-SNE, noting that there is a level of clustering based on phonological similarity. (Our recipe is open-source in the SpeechBrain toolkit, see: https://github.com/speechbrain/speechbrain/tree/develop/recipes)
    Comment: To appear in Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH 2023
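    For readers who want to try a pretrained accent classifier like this one, a minimal sketch using SpeechBrain's generic EncoderClassifier interface follows. The model identifier below is a placeholder assumption, not a repository name confirmed by the abstract:

```python
# Minimal sketch: running a SpeechBrain accent classifier on one file.
# The `source` below is a PLACEHOLDER; substitute the actual CommonAccent
# checkpoint name published on the SpeechBrain/HuggingFace hub.
from speechbrain.pretrained import EncoderClassifier

classifier = EncoderClassifier.from_hparams(
    source="speechbrain/accent-id-commonaccent",  # hypothetical repo id
    savedir="pretrained_models/accent-id",
)

# classify_file returns the posterior scores, the best score, its index,
# and the predicted text label (here, an accent such as "us" or "india").
out_prob, score, index, label = classifier.classify_file("sample.wav")
print(f"Predicted accent: {label[0]} (score {score.item():.3f})")
```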

    HyperConformer: Multi-head HyperMixer for Efficient Speech Recognition

    State-of-the-art ASR systems have achieved promising results by modeling local and global interactions separately. While the former can be computed efficiently, global interactions are usually modeled via attention mechanisms, which are expensive for long input sequences. Here, we address this by extending HyperMixer, an efficient alternative to attention exhibiting linear complexity, to the Conformer architecture for speech recognition, leading to HyperConformer. In particular, multi-head HyperConformer achieves comparable or higher recognition performance while being more efficient than Conformer in terms of inference speed, memory, parameter count, and available training data. HyperConformer achieves a word error rate of 2.9% on LibriSpeech test-clean with fewer than 8M neural parameters and a peak training memory of 5.7 GB, hence it is trainable on accessible hardware. The encoder is between 38% (mid-length speech) and 56% (long speech) faster than an equivalent Conformer. (The HyperConformer recipe is publicly available at: https://github.com/speechbrain/speechbrain/tree/develop/recipes/LibriSpeech/ASR/transformer/)
    Comment: Florian Mai and Juan Zuluaga-Gomez contributed equally. To appear in Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH 2023
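    To make the linear-complexity claim concrete, here is a minimal sketch of HyperMixer-style token mixing, assuming the formulation TM(X) = W2 σ(W1ᵀX) with input-dependent weights generated position-wise. The published model adds position embeddings and further refinements, so treat this as an illustration rather than the authors' exact layer:

```python
import torch
import torch.nn as nn

class HyperMixerTokenMixing(nn.Module):
    """Simplified HyperMixer-style token mixing: O(N) cost in the
    sequence length N, versus the O(N^2) cost of self-attention."""

    def __init__(self, d_model: int, d_hidden: int):
        super().__init__()
        # Hypernetworks: generate the token-mixing weights from the input.
        self.w1_gen = nn.Linear(d_model, d_hidden)
        self.w2_gen = nn.Linear(d_model, d_hidden)
        self.act = nn.GELU()

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, N, d_model)
        w1 = self.w1_gen(x)                                # (batch, N, d_hidden)
        w2 = self.w2_gen(x)                                # (batch, N, d_hidden)
        h = self.act(torch.einsum("bnh,bnd->bhd", w1, x))  # sigma(W1^T X)
        return torch.einsum("bnh,bhd->bnd", w2, h)         # W2 sigma(W1^T X)

# Every einsum above is linear in N, so doubling the sequence length
# roughly doubles the cost rather than quadrupling it.
x = torch.randn(2, 100, 256)
print(HyperMixerTokenMixing(256, 512)(x).shape)  # torch.Size([2, 100, 256])
```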

    Implementing contextual biasing in GPU decoder for online ASR

    GPU decoding significantly accelerates the output of ASR predictions. While GPUs are already being used for online ASR decoding, post-processing and rescoring on GPUs have not yet been properly investigated. Rescoring with available contextual information can considerably improve ASR predictions. Previous studies have proven the viability of lattice rescoring in decoding and of biasing language model (LM) weights in offline and online CPU scenarios. In real-time GPU decoding, partial recognition hypotheses are produced without lattice generation, which makes the implementation of biasing more complex. This paper proposes and describes an approach to integrate contextual biasing into real-time GPU decoding while exploiting the standard Kaldi GPU decoder. Besides biasing partial ASR predictions, our approach also permits dynamic context switching, allowing flexible rescoring for each speech segment directly on the GPU. The code is publicly released and tested with open-source test sets.
    Comment: Accepted to Interspeech 2023
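    As a rough illustration of the biasing idea (a toy sketch, not the paper's Kaldi GPU decoder integration; function and weight names are invented for this example), partial hypotheses can be rescored by adding a bonus for each matched phrase from the active context list:

```python
# Illustrative contextual biasing of partial hypotheses (toy example,
# not the Kaldi GPU decoder implementation described in the paper).

def bias_partial_hypotheses(hypotheses, context_phrases, bonus=2.0):
    """hypotheses: list of (text, score); higher score = better.
    Adds `bonus` per matched context phrase, then re-sorts."""
    rescored = []
    for text, score in hypotheses:
        matches = sum(phrase in text for phrase in context_phrases)
        rescored.append((text, score + bonus * matches))
    return sorted(rescored, key=lambda ts: ts[1], reverse=True)

# Dynamic context switching: simply swap the phrase list between segments.
atc_context = ["lufthansa three alpha", "runway two seven"]
partial = [("lufthansa three alpha descend", -12.1),
           ("left hand the three alpha descend", -11.8)]
print(bias_partial_hypotheses(partial, atc_context)[0][0])
# -> "lufthansa three alpha descend"
```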

    Behavior of Belite cement blended with Calcium sulfoaluminate cement: an Ecocement

    Belite Portland Cements (BPC) and calcium sulfoaluminate cements (CSA) are considered environmentally friendly cements due to their lower CO2 emissions. These ecocements (BPC and CSA) emit about 0.03 and 0.18 fewer tons of carbon dioxide from raw materials, respectively, than Portland Cement (PC). However, BPCs have a technological disadvantage: the slow hydration kinetics of belite (their main phase) cause low mechanical strengths at early ages. CSA cements, on the other hand, are more expensive due to their high alumina content, but they develop high mechanical strengths from early ages. These are the main reasons why it is essential to develop strategies that could reduce their cost while keeping competitive mechanical strengths. A CSA clinker (with ye’elimite as the main phase) and a BPC (with belite as the main phase) have been mixed with the objective of producing a cheaper ecocement, labelled B#, that releases less CO2 than PC and offers competitive mechanical strengths. Cements with 83 wt%, 75 wt%, and 65 wt% of BPC blended with CSA have been prepared. Moreover, anhydrite has been added as a set regulator. Pastes with a water/cement ratio of 0.4 have been prepared. The hydration of these pastes has been characterized by laboratory X-ray powder diffraction, using the Rietveld methodology, and by thermogravimetric analysis, to obtain the mineralogical phase assemblage as a function of time over one year, including amorphous content and free water. The mineralogical phase assemblage has been correlated with the compressive strengths, porosity, and dimensional stability of mortars.
    BIA-82391-R. Universidad de Málaga, Campus de Excelencia Internacional Andalucía Tech.

    Lessons Learned in ATCO2: 5000 hours of Air Traffic Control Communications for Robust Automatic Speech Recognition and Understanding

    Voice communication between air traffic controllers (ATCos) and pilots is critical for ensuring safe and efficient air traffic control (ATC). The task requires high levels of awareness from ATCos and can be tedious and error-prone. Recent attempts have been made to integrate artificial intelligence (AI) into ATC in order to reduce the workload of ATCos. However, the development of data-driven AI systems for ATC demands large-scale annotated datasets, which are currently lacking in the field. This paper explores the lessons learned from the ATCO2 project, which aimed to develop a unique platform for collecting and preprocessing large amounts of ATC data from airspace in real time. Audio and surveillance data were collected from publicly accessible radio frequency channels with VHF receivers owned by a community of volunteers and later uploaded to OpenSky Network servers, which can be considered an "unlimited source" of data. In addition, this paper reviews previous work from ATCO2 partners, including (i) robust automatic speech recognition, (ii) natural language processing, (iii) English language identification of ATC communications, and (iv) the integration of surveillance data such as ADS-B. We believe that the pipeline developed during the ATCO2 project, along with the open-sourcing of its data, will encourage research in the ATC field. A sample of the ATCO2 corpus is available at https://www.atco2.org/data, while the full corpus can be purchased through ELDA at http://catalog.elra.info/en-us/repository/browse/ELRA-S0484. We demonstrated that ATCO2 is an appropriate dataset for developing ASR engines when little to no in-domain ATC data is available. For instance, with a CNN-TDNNf Kaldi model, we reached WERs as low as 17.9% and 24.9% on public ATC datasets, which is 6.6% and 7.6% better than an "out-of-domain" but supervised CNN-TDNNf model.
    Comment: Manuscript under review
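    For context on the WER figures above, word error rate can be checked quickly with the jiwer package; the reference/hypothesis pair below is invented for illustration and is not from the ATCO2 corpus:

```python
# Quick WER computation with the jiwer package (pip install jiwer).
# The transcripts below are made up for illustration.
import jiwer

reference = "lufthansa one two three descend to flight level eight zero"
hypothesis = "lufthansa one two three descend to level eight zero"

# WER = (substitutions + deletions + insertions) / reference word count;
# here one deletion ("flight") over ten reference words.
print(f"WER: {jiwer.wer(reference, hypothesis):.1%}")  # WER: 10.0%
```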

    Grammar Based Speaker Role Identification for Air Traffic Control Speech Recognition

    Automatic Speech Recognition (ASR) for air traffic control is generally trained by pooling Air Traffic Controller (ATCO) and pilot data. In practice, this is motivated by the proportion of annotated pilot data being smaller than that of ATCOs. However, due to the data imbalance between ATCO and pilot speech and their varying acoustic conditions, ASR performance is usually significantly better on ATCO speech than on pilot speech. Obtaining the speaker roles requires manual effort when the voice recordings are collected using Very High Frequency (VHF) receivers, since the data is noisy and single-channel, without the push-to-talk (PTT) signal. In this paper, we propose to (1) split the ATCO and pilot data using an intuitive approach exploiting ASR transcripts and (2) treat ATCO and pilot ASR as two separate tasks for Acoustic Model (AM) training. The paper focuses on applying this approach to noisy data collected using VHF receivers, as this data is helpful for training despite its noisy nature. We also developed a simple yet efficient knowledge-based system for speaker role classification based on grammar defined by the International Civil Aviation Organization (ICAO). Our system accepts text as input, i.e., either gold annotations or transcripts generated by an ABSR system. This approach achieves an average speaker role identification accuracy of 83%. Finally, we show that training AMs separately for each task, or using a multitask approach, is better suited to the noisy data than the traditional ASR system, where all data is pooled together for AM training.
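    As a hedged illustration of what such a grammar-driven role classifier might look like, the sketch below uses the ICAO phraseology convention that ATCO transmissions typically open with a callsign followed by a command, while pilot read-backs often carry acknowledgement words. The word lists and rules are simplified assumptions, not the paper's full grammar:

```python
# Toy speaker-role classifier inspired by ICAO phraseology conventions.
# The command and acknowledgement lists are simplified assumptions for
# illustration, not the grammar actually used in the paper.
COMMANDS = {"descend", "climb", "turn", "contact", "cleared", "taxi"}
ACKNOWLEDGEMENTS = {"wilco", "roger", "copied"}

def classify_role(transcript: str) -> str:
    words = transcript.lower().split()
    if ACKNOWLEDGEMENTS.intersection(words):
        return "pilot"
    # A command right after a short leading callsign suggests an ATCO.
    if COMMANDS.intersection(words[:6]):
        return "atco"
    return "pilot"  # fall back: plain read-backs usually come from pilots

print(classify_role("swiss four seven one descend flight level eight zero"))   # atco
print(classify_role("descending level eight zero wilco swiss four seven one")) # pilot
```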