35 research outputs found
CommonAccent: Exploring Large Acoustic Pretrained Models for Accent Classification Based on Common Voice
Despite the recent advancements in Automatic Speech Recognition (ASR), the
recognition of accented speech still remains a dominant problem. In order to
create more inclusive ASR systems, research has shown that the integration of
accent information, as part of a larger ASR framework, can lead to the
mitigation of accented speech errors. We address multilingual accent
classification through the ECAPA-TDNN and Wav2Vec 2.0/XLSR architectures which
have been proven to perform well on a variety of speech-related downstream
tasks. We introduce a simple-to-follow recipe aligned to the SpeechBrain
toolkit for accent classification based on Common Voice 7.0 (English) and
Common Voice 11.0 (Italian, German, and Spanish). Furthermore, we establish a new
state of the art for English accent classification, with accuracy as high as
95%. We also study the internal categorization of the Wav2Vec 2.0
embeddings through t-SNE, noting that there is a level of clustering based on
phonological similarity. (Our recipe is open-source in the SpeechBrain toolkit,
see: https://github.com/speechbrain/speechbrain/tree/develop/recipes)
Comment: To appear in Proceedings of the Annual Conference of the
International Speech Communication Association, INTERSPEECH 202
HyperConformer: Multi-head HyperMixer for Efficient Speech Recognition
State-of-the-art ASR systems have achieved promising results by modeling
local and global interactions separately. While the former can be computed
efficiently, global interactions are usually modeled via attention mechanisms,
which are expensive for long input sequences. Here, we address this by
extending HyperMixer, an efficient alternative to attention exhibiting linear
complexity, to the Conformer architecture for speech recognition, leading to
HyperConformer. In particular, multi-head HyperConformer achieves comparable or
higher recognition performance while being more efficient than Conformer in
terms of inference speed, memory, parameter count, and available training data.
HyperConformer achieves a word error rate of 2.9% on Librispeech test-clean
with less than 8M neural parameters and a peak memory during training of 5.7GB,
hence trainable with accessible hardware. The encoder is between 38% (on
mid-length speech) and 56% (on long speech) faster than an equivalent Conformer.
(The HyperConformer recipe is publicly available in:
https://github.com/speechbrain/speechbrain/tree/develop/recipes/LibriSpeech/ASR/transformer/)
Comment: Florian Mai and Juan Zuluaga-Gomez contributed equally. To appear in
Proceedings of the Annual Conference of the International Speech
Communication Association, INTERSPEECH 202
Implementing contextual biasing in GPU decoder for online ASR
GPU decoding significantly accelerates the output of ASR predictions. While
GPUs are already being used for online ASR decoding, post-processing and
rescoring on GPUs have not been properly investigated yet. Rescoring with
available contextual information can considerably improve ASR predictions.
Previous studies have proven the viability of lattice rescoring in decoding and
biasing language model (LM) weights in offline and online CPU scenarios. In
real-time GPU decoding, partial recognition hypotheses are produced without
lattice generation, which makes the implementation of biasing more complex. The
paper proposes and describes an approach to integrate contextual biasing in
real-time GPU decoding while exploiting the standard Kaldi GPU decoder. Besides
the biasing of partial ASR predictions, our approach also permits dynamic
context switching, allowing flexible rescoring for each speech segment
directly on the GPU. The code is publicly released and tested with open-sourced
test sets.
Comment: Accepted to Interspeech 202
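To make the idea of contextual biasing concrete, here is a minimal sketch. It is not the paper's method: the paper biases partial hypotheses inside the Kaldi GPU decoder, whereas this sketch simply re-ranks a finished n-best list. The function names, the additive `bonus` value, and the example hypotheses are assumptions made for illustration only.

```python
from typing import List, Tuple


def count_phrase(tokens: List[str], phrase_tokens: List[str]) -> int:
    """Count occurrences of a word sequence inside a token list."""
    n = len(phrase_tokens)
    if n == 0:
        return 0
    return sum(tokens[i:i + n] == phrase_tokens
               for i in range(len(tokens) - n + 1))


def rescore_with_context(hypotheses: List[Tuple[str, float]],
                         context_phrases: List[str],
                         bonus: float = 2.0) -> List[Tuple[str, float]]:
    """Re-rank (text, log_score) pairs, higher score meaning better.

    Each occurrence of a context phrase adds `bonus` to the score.
    The additive bonus is a hypothetical choice, not taken from the paper.
    """
    rescored = []
    for text, score in hypotheses:
        tokens = text.split()
        hits = sum(count_phrase(tokens, phrase.split())
                   for phrase in context_phrases)
        rescored.append((text, score + bonus * hits))
    return sorted(rescored, key=lambda item: item[1], reverse=True)


# Illustrative n-best list: the acoustically preferred hypothesis is wrong.
nbest = [("look tanza three two alpha", -4.0),
         ("lufthansa three two alpha", -5.0)]
# Context for this segment, e.g. a callsign expected in the airspace.
context = ["lufthansa three two alpha"]
best_text, best_score = rescore_with_context(nbest, context)[0]
```

Because the context list is passed per call, a different list can be supplied for each speech segment, which loosely mirrors the dynamic context switching described in the abstract.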
Behavior of Belite cement blended with Calcium sulfoaluminate cement: an Ecocement
Belite Portland Cements (BPC) and calcium sulfoaluminate cements (CSA) are considered environmentally friendly cements due to their lower CO2 emissions. These ecocements (BPC and CSA) emit about 0.03 and 0.18 fewer tons of carbon dioxide from raw materials, respectively, than Portland Cement (PC). However, BPCs have a technological disadvantage: the slow hydration kinetics of belite (their main phase) cause low mechanical strengths at early ages. CSA cements, on the other hand, are more expensive due to their high alumina content, but they develop high mechanical strengths from early ages. For these reasons, it is essential to develop strategies that reduce cost while retaining competitive mechanical strengths.
A CSA clinker (ye’elimite as main phase) and a BPC (belite as main phase) have been mixed with the objective of producing a cheaper ecocement, labelled B#, that releases less CO2 than PC and has competitive mechanical strengths. Blends of BPC with CSA containing 83 wt%, 75 wt% and 65 wt% of BPC have been prepared. Moreover, anhydrite has been added as a set regulator. Pastes with a water/cement ratio of 0.4 have been prepared. The hydration of these pastes has been characterized by laboratory X-ray powder diffraction, using Rietveld methodology, and by thermogravimetric analysis, to obtain the mineralogical phase assemblage as a function of time over a year, including amorphous content and free water. The mineralogical phase assemblage has been correlated with the compressive strength, porosity and dimensional stability of mortars.
BIA-82391-R
Universidad de Málaga. Campus de Excelencia Internacional Andalucía Tec
Lessons Learned in ATCO2: 5000 hours of Air Traffic Control Communications for Robust Automatic Speech Recognition and Understanding
Voice communication between air traffic controllers (ATCos) and pilots is
critical for ensuring safe and efficient air traffic control (ATC). This task
requires high levels of awareness from ATCos and can be tedious and
error-prone. Recent attempts have been made to integrate artificial
intelligence (AI) into ATC in order to reduce the workload of ATCos. However,
the development of data-driven AI systems for ATC demands large-scale annotated
datasets, which are currently lacking in the field. This paper explores the
lessons learned from the ATCO2 project, a project that aimed to develop a
unique platform to collect and preprocess large amounts of ATC data from
airspace in real time. Audio and surveillance data were collected from publicly
accessible radio frequency channels with VHF receivers owned by a community of
volunteers and later uploaded to Opensky Network servers, which can be
considered an "unlimited source" of data. In addition, this paper reviews
previous work from ATCO2 partners, including (i) robust automatic speech
recognition, (ii) natural language processing, (iii) English language
identification of ATC communications, and (iv) the integration of surveillance
data such as ADS-B. We believe that the pipeline developed during the ATCO2
project, along with the open-sourcing of its data, will encourage research in
the ATC field. A sample of the ATCO2 corpus is available on the following
website: https://www.atco2.org/data, while the full corpus can be purchased
through ELDA at http://catalog.elra.info/en-us/repository/browse/ELRA-S0484. We
demonstrated that ATCO2 is an appropriate dataset to develop ASR engines when
little to no in-domain ATC data is available. For instance, with the
CNN-TDNNf Kaldi model, we reached WERs as low as 17.9% and 24.9%
on public ATC datasets, which is 6.6/7.6% better than an "out-of-domain" but
supervised CNN-TDNNf model.
Comment: Manuscript under revie
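For readers unfamiliar with the metric, word error rate (WER) is conventionally the word-level edit distance between the reference transcript and the ASR hypothesis, divided by the number of reference words. The following is a minimal sketch of that standard definition, not the scoring tool used in the paper; the function name and the example utterance are illustrative assumptions.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between the reference prefix
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1       # reference word missing in hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[-1] / len(ref)


# Illustrative example: one of six reference words is dropped.
example_wer = word_error_rate("climb flight level three two zero",
                              "climb level three two zero")
```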
Grammar Based Speaker Role Identification for Air Traffic Control Speech Recognition
Automatic Speech Recognition (ASR) for air traffic
control is generally trained by pooling Air Traffic Controller
(ATCO) and pilot data. In practice, this is motivated by the
proportion of annotated data from pilots being smaller than that from ATCOs.
However, due to the data imbalance between ATCOs and pilots and
their varying acoustic conditions, the ASR performance is usually
significantly better for ATCO speech than for pilot speech. Obtaining the
speaker roles requires manual effort when the voice recordings
are collected using Very High Frequency (VHF) receivers and
the data is noisy and in a single channel without the push-to-talk (PTT) signal. In this paper, we propose to (1) split the
ATCO and pilot data using an intuitive approach exploiting
ASR transcripts and (2) consider ATCO and pilot ASR as two
separate tasks for Acoustic Model (AM) training. The paper
focuses on applying this approach to noisy data collected using
VHF receivers, as this data is helpful for training despite its
noisy nature. We also developed a simple yet efficient knowledge-based system for speaker role classification based on grammar
defined by the International Civil Aviation Organization (ICAO).
Our system accepts text as input: either gold annotations
or transcripts generated by an ASR system. This approach
provides an average accuracy in speaker role identification of
83%. Finally, we show that training AMs separately for each
task, or using a multitask approach, is well suited for the noisy
data compared to the traditional ASR system, where all data is
pooled together for AM training.
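A text-based speaker role classifier of the kind described above can be sketched as a small rule system. The cue words below are hypothetical placeholders chosen for illustration; they are not the ICAO grammar and not the rules used in the paper, which the abstract does not list.

```python
# Hypothetical cue words, for illustration only (not the paper's grammar).
ATCO_CUES = {"cleared", "contact", "identified", "squawk"}
PILOT_CUES = {"request", "requesting", "wilco"}


def classify_role(transcript: str) -> str:
    """Return 'atco', 'pilot', or 'unknown' for a transcribed utterance."""
    tokens = transcript.lower().split()
    atco_votes = sum(token in ATCO_CUES for token in tokens)
    pilot_votes = sum(token in PILOT_CUES for token in tokens)
    if atco_votes > pilot_votes:
        return "atco"
    if pilot_votes > atco_votes:
        return "pilot"
    return "unknown"  # no cue matched, or the cues were tied


role_a = classify_role(
    "lufthansa three two alpha contact radar one two seven decimal five")
role_b = classify_role("requesting descent lufthansa three two alpha")
```

Since the input is plain text, the same function applies to gold annotations and to ASR transcripts alike, and the predicted roles could then be used to split the data before training separate acoustic models.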