1,706 research outputs found
Review of Automatic Speech Recognition Methodologies
DTFACT-14-D-00004,692M152240001
This report highlights the crucial role of Automatic Speech Recognition (ASR) techniques in enhancing safety for air traffic control (ATC) in terminal environments. ASR techniques facilitate efficient and accurate transcription of verbal communications, reducing the likelihood of errors. The report also details the evolution of ASR technologies, from Hidden Markov Models (HMMs) through Deep Neural Networks (DNNs) to End-to-End machine learning models. Finally, the report details the latest advancements in ASR techniques, focusing on transformer-based models that have outperformed traditional ASR approaches and achieved state-of-the-art results on ASR benchmarks.
How Does Pre-trained Wav2Vec2.0 Perform on Domain Shifted ASR? An Extensive Benchmark on Air Traffic Control Communications
Recent work on self-supervised pre-training focuses on leveraging large-scale
unlabeled speech data to build robust end-to-end (E2E) acoustic models (AM)
that can later be fine-tuned on downstream tasks, e.g., automatic speech
recognition (ASR). Yet, few works have investigated the impact on performance when
the data substantially differs between the pre-training and downstream
fine-tuning phases (i.e., domain shift). We target this scenario by analyzing
the robustness of Wav2Vec2.0 and XLS-R models on downstream ASR for a
completely unseen domain, i.e., air traffic control (ATC) communications. We
benchmark the proposed models on four challenging ATC test sets
(signal-to-noise ratios vary between 5 and 20 dB). Relative word error rate
(WER) reductions of 20% to 40% are obtained in comparison to hybrid-based
state-of-the-art ASR baselines by fine-tuning E2E acoustic models with a small
fraction of labeled data. We also study the impact of fine-tuning data size on
WERs, going from 5 minutes (few-shot) to 15 hours.
Comment: This paper has been submitted to Interspeech 202
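The WER figures reported above can be made concrete with a small sketch. This is an illustration of the standard metric, not the paper's evaluation code: word error rate is the word-level edit distance between a reference and a hypothesis, normalized by reference length, and a relative reduction compares a baseline WER to an improved one.

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level Levenshtein distance / reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = edit distance between ref[:i] and hyp[:j]
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i
    for j in range(len(hyp) + 1):
        dp[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = dp[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            dp[i][j] = min(sub, dp[i - 1][j] + 1, dp[i][j - 1] + 1)
    return dp[len(ref)][len(hyp)] / len(ref)

def relative_wer_reduction(baseline: float, new: float) -> float:
    """Relative reduction, e.g. 0.30 means a 30% relative improvement."""
    return (baseline - new) / baseline

print(wer("cleared to land runway two seven", "cleared to land runway two seven"))  # 0.0
```

For instance, going from a 25% to a 20% WER is a 20% relative reduction, which is how ranges such as "20% to 40%" above should be read.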
Attention-Inspired Artificial Neural Networks for Speech Processing: A Systematic Review
Artificial Neural Networks (ANNs) were inspired by the neural networks in the human brain and have been widely applied in speech processing. The application areas of ANNs include speech recognition, speech emotion recognition, language identification, speech enhancement, and speech separation, amongst others. Likewise, given that speech processing performed by humans involves complex cognitive processes known as auditory attention, there has been a growing number of papers proposing ANNs supported by deep learning algorithms in conjunction with some mechanism to achieve symmetry with the human attention process. However, while these ANN approaches include attention, there is no categorization of the attention integrated into the deep learning algorithms and of its relation with human auditory attention. Therefore, we consider it necessary to review the different attention-inspired ANN approaches to show both academic and industry experts the available models for a wide variety of applications. Based on the PRISMA methodology, we present a systematic review of the literature published since 2000 in which deep learning algorithms are applied to diverse problems related to speech processing. In this paper, 133 research works are selected and the following aspects are described: (i) the most relevant features, (ii) the ways in which attention has been implemented, (iii) their hypothetical relationship with human attention, and (iv) the evaluation metrics used. Additionally, the four publications most closely related to human attention were analyzed and their strengths and weaknesses determined.
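The attention mechanisms the review categorizes share a common computational core. A minimal NumPy sketch of scaled dot-product attention follows; shapes and values are illustrative and not taken from any reviewed paper:

```python
import numpy as np

def scaled_dot_product_attention(q, k, v):
    """Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V."""
    d_k = q.shape[-1]
    scores = q @ k.T / np.sqrt(d_k)                      # (n_q, n_k) similarities
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)       # softmax over the keys
    return weights @ v, weights                          # weighted sum of values

# Toy example: 2 queries attending over 3 key/value pairs of dimension 4.
rng = np.random.default_rng(0)
q, k, v = rng.normal(size=(2, 4)), rng.normal(size=(3, 4)), rng.normal(size=(3, 4))
out, w = scaled_dot_product_attention(q, k, v)
```

Each row of `w` is a distribution over the inputs, which is the loose analogue of "selective listening" that many of the surveyed papers draw on.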
Lessons Learned in ATCO2: 5000 hours of Air Traffic Control Communications for Robust Automatic Speech Recognition and Understanding
Voice communication between air traffic controllers (ATCos) and pilots is
critical for ensuring safe and efficient air traffic control (ATC). This task
requires high levels of awareness from ATCos and can be tedious and
error-prone. Recent attempts have been made to integrate artificial
intelligence (AI) into ATC in order to reduce the workload of ATCos. However,
the development of data-driven AI systems for ATC demands large-scale annotated
datasets, which are currently lacking in the field. This paper explores the
lessons learned from the ATCO2 project, a project that aimed to develop a
unique platform to collect and preprocess large amounts of ATC data from
airspace in real time. Audio and surveillance data were collected from publicly
accessible radio frequency channels with VHF receivers owned by a community of
volunteers and later uploaded to Opensky Network servers, which can be
considered an "unlimited source" of data. In addition, this paper reviews
previous work from ATCO2 partners, including (i) robust automatic speech
recognition, (ii) natural language processing, (iii) English language
identification of ATC communications, and (iv) the integration of surveillance
data such as ADS-B. We believe that the pipeline developed during the ATCO2
project, along with the open-sourcing of its data, will encourage research in
the ATC field. A sample of the ATCO2 corpus is available on the following
website: https://www.atco2.org/data, while the full corpus can be purchased
through ELDA at http://catalog.elra.info/en-us/repository/browse/ELRA-S0484. We
demonstrated that ATCO2 is an appropriate dataset for developing ASR engines when
little or no ATC in-domain data is available. For instance, with the
CNN-TDNNf Kaldi model, we reached WERs as low as 17.9% and 24.9%
on public ATC datasets, which is 6.6%/7.6% better than an "out-of-domain" but
supervised CNN-TDNNf model.
Comment: Manuscript under review
A MACHINE LEARNING FRAMEWORK FOR AUTOMATIC SPEECH RECOGNITION IN AIR TRAFFIC CONTROL USING WORD LEVEL BINARY CLASSIFICATION AND TRANSCRIPTION
Advances in Artificial Intelligence and Machine Learning have enabled a variety of new technologies. One such technology is Automatic Speech Recognition (ASR), where a machine is given audio and transcribes the words that were spoken. ASR can be applied in a variety of domains to improve general usability and safety. One such domain is Air Traffic Control (ATC), where ASR promises to improve safety in a mission-critical environment. ASR models have historically required a large amount of clean training data, but ATC environments are noisy and acquiring labeled data is a difficult, expertise-dependent task. This thesis attempts to solve these problems by presenting a machine learning framework which uses word-by-word audio samples to transcribe ATC speech. Instead of transcribing an entire speech sample, this framework transcribes every word individually; the overall transcription is then pieced together based on the word sequence. Each stage of the framework is trained and tested independently, and the overall performance is gauged. The framework was judged to be a feasible approach to ASR in ATC.
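The "piece the transcription together" step described above can be sketched as follows. The segment structure, confidence threshold, and helper names are our own illustration, not the thesis's implementation:

```python
from dataclasses import dataclass

@dataclass
class WordSegment:
    start: float       # segment start time (seconds)
    end: float         # segment end time (seconds)
    label: str         # word predicted by the per-word classifier
    confidence: float  # classifier confidence in [0, 1]

def assemble_transcript(segments, min_confidence=0.5):
    """Order per-word predictions by time and keep only confident ones."""
    kept = [s for s in segments if s.confidence >= min_confidence]
    kept.sort(key=lambda s: s.start)
    return " ".join(s.label for s in kept)

segs = [
    WordSegment(1.2, 1.6, "runway", 0.91),
    WordSegment(0.0, 0.5, "cleared", 0.88),
    WordSegment(0.6, 1.1, "takeoff", 0.35),  # low confidence, dropped
]
print(assemble_transcript(segs))  # cleared runway
```

The point of the sketch is that, once each word is classified in isolation, the overall transcription reduces to sorting and filtering the per-word outputs.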
Deep Transfer Learning for Automatic Speech Recognition: Towards Better Generalization
Automatic speech recognition (ASR) has recently become an important challenge
for deep learning (DL): it requires large-scale training datasets and
high computational and storage resources. Moreover, DL techniques, and machine
learning (ML) approaches in general, assume that training and testing data
come from the same domain, with the same input feature space and data
distribution characteristics. This assumption, however, does not hold in
some real-world artificial intelligence (AI) applications. There are also
situations where gathering real data is challenging, expensive, or rare,
so that the data requirements of DL models cannot be met. Deep transfer
learning (DTL) has been introduced to overcome these issues; it helps
develop high-performing models using real datasets that are small or slightly
different from, but related to, the training data. This paper presents a comprehensive
survey of DTL-based ASR frameworks to shed light on the latest developments and
helps academics and professionals understand current challenges. Specifically,
after presenting the DTL background, a well-designed taxonomy is adopted to
organize the state of the art. A critical analysis is then conducted to identify
the limitations and advantages of each framework. Finally, a comparative
study is introduced to highlight the current challenges before deriving
opportunities for future research
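The core DTL idea surveyed above, reusing a pre-trained representation and training only a small task head on limited in-domain data, can be illustrated with a toy NumPy example. The frozen "backbone" here is just a fixed random projection and the dataset is synthetic, purely for illustration:

```python
import numpy as np

rng = np.random.default_rng(1)

# "Pre-trained" feature extractor: these weights stay frozen while fine-tuning.
W_frozen = rng.normal(size=(8, 4))

def features(x):
    return np.tanh(x @ W_frozen)  # frozen backbone forward pass

# Small labeled target-domain dataset (the DTL setting: little in-domain data).
X = rng.normal(size=(32, 8))
y = (X.sum(axis=1) > 0).astype(float)

# Only the task head (w, b) is trained, with plain logistic regression.
w, b = np.zeros(4), 0.0
lr = 0.5
for _ in range(200):
    z = features(X) @ w + b
    p = 1.0 / (1.0 + np.exp(-z))           # sigmoid
    grad = p - y                           # gradient of the logistic loss
    w -= lr * features(X).T @ grad / len(X)
    b -= lr * grad.mean()

loss = -(y * np.log(p) + (1 - y) * np.log(1 - p)).mean()
accuracy = ((p > 0.5) == y).mean()
```

Training only `(w, b)` drives the loss below the chance-level value of ln 2 ≈ 0.693; the same division of labor (frozen pre-trained layers, small trainable head) is what makes the few-shot ASR fine-tuning results above possible.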
Counter-Terrorism in New Europe
In recent years the nature of terrorism has changed dramatically and has taken on a new combination of characteristics. The fight against this terrorism has become a global concern and a central issue of international government policies. Counter-terrorism policies have been transformed all around the world, and the importance states place on certain aspects of their counter-terrorist measures varies considerably. There is no agreement on how best to fight terrorism. Within the European Union (EU) this disagreement is most visible, with some countries supporting the United States in their military fight against terrorism, while others strongly oppose it. This paper will focus on five of the ten new EU members that joined in 2004 (Estonia, Poland, the Czech Republic, Slovenia and Malta) and review some of their existing counter-terrorism measures. In doing so, the paper will examine the strengths and weaknesses of each individual state's policy and highlight some of the general trends and patterns among them.
Automatic speech recognition for European Portuguese
Master's dissertation in Informatics Engineering
The process of Automatic Speech Recognition (ASR) opens doors to a vast amount of possible
improvements in customer experience. The use of this type of technology has increased
significantly in recent years, this change being the result of the recent evolution in ASR
systems. The opportunities to use ASR are vast, covering several areas, such as medical,
industrial, and business, among others. We must emphasize the use of these voice recognition
systems in telecommunications companies, namely in the automation of customer assistance
operators, allowing a call to be routed automatically to specialized operators
by detecting the matters to be dealt with through recognition of the spoken utterances. In
recent years, we have seen big technological breakthroughs in ASR, achieving unprecedented
accuracy results that are comparable to human performance. We are also seeing a move from what
is known as the Traditional approach of ASR systems, based on Hidden Markov Models
(HMM), to the newer End-to-End ASR systems that obtain benefits from the use of deep
neural networks (DNNs), large amounts of data and process parallelization.
The literature review showed us that the focus of this previous work was almost exclusively
on the English and Chinese languages, with little effort being made in the development of
other languages, as is the case with Portuguese. In the research carried out, we did not
find a model for the European Portuguese (EP) dialect that is freely available for general
use. Focused on this problem, this work describes the development of an End-to-End ASR
system for EP. To achieve this goal, a set of procedures was followed that allowed us to
present the concepts, characteristics and all the steps inherent to the construction of these
types of systems. Furthermore, since the transcribed speech needed to accomplish our goal
is very limited for EP, we also describe the process of collecting and formatting data from a
variety of different sources, most of them freely available to the public. To further try and
improve our results, a variety of different data augmentation techniques were implemented
and tested. The obtained models are based on a PyTorch implementation of the Deep Speech
2 model.
Our best model achieved a Word Error Rate (WER) of 40.5% on our main test corpus,
achieving slightly better results than those obtained by commercial systems on the same data.
Around 150 hours of transcribed EP speech were collected, so that they can be used to train other ASR
systems or models in different areas of investigation. We gathered a series of interesting
results on the use of different batch size values as well as the improvements provided by
the use of a large variety of data augmentation techniques. Nevertheless, the ASR theme is vast and there is still a variety of different methods and interesting concepts that we could
research in order to seek an improvement of the achieved results.
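Two of the common audio augmentation techniques the dissertation's abstract alludes to (additive noise at a target SNR, and speed perturbation) can be sketched in NumPy. The test signal and parameters are illustrative; this is not the dissertation's implementation:

```python
import numpy as np

def add_noise(signal, snr_db, rng):
    """Mix in white noise at a target signal-to-noise ratio (in dB)."""
    signal_power = np.mean(signal ** 2)
    noise_power = signal_power / (10 ** (snr_db / 10))
    noise = rng.normal(scale=np.sqrt(noise_power), size=signal.shape)
    return signal + noise

def speed_perturb(signal, factor):
    """Resample by linear interpolation to change speed (and pitch)."""
    n_out = int(len(signal) / factor)
    idx = np.linspace(0, len(signal) - 1, n_out)
    return np.interp(idx, np.arange(len(signal)), signal)

rng = np.random.default_rng(0)
clean = np.sin(2 * np.pi * 440 * np.arange(16000) / 16000)  # 1 s of a 440 Hz tone
noisy = add_noise(clean, snr_db=10, rng=rng)
fast = speed_perturb(clean, factor=1.1)   # ~10% faster, hence ~10% shorter

# Empirical SNR of the augmented signal, for sanity-checking the target.
measured_snr = 10 * np.log10(np.mean(clean ** 2) / np.mean((noisy - clean) ** 2))
```

Applied to scarce transcribed speech, transforms like these multiply the effective training data without requiring any new annotation, which is why the dissertation reports gains from a large variety of them.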
Smart Home Personal Assistants: A Security and Privacy Review
Smart Home Personal Assistants (SPA) are an emerging innovation that is
changing the way in which home users interact with the technology. However,
there are a number of elements that expose these systems to various risks: i)
the open nature of the voice channel they use, ii) the complexity of their
architecture, iii) the AI features they rely on, and iv) their use of a
wide-range of underlying technologies. This paper presents an in-depth review
of the security and privacy issues in SPA, categorizing the most important
attack vectors and their countermeasures. Based on this, we discuss open
research challenges that can help steer the community to tackle and address
current security and privacy issues in SPA. One of our key findings is that
even though the attack surface of SPA is conspicuously broad and there has been
a significant amount of recent research efforts in this area, research has so
far focused on a small part of the attack surface, particularly on issues
related to the interaction between the user and the SPA devices. We also point
out that further research is needed to tackle issues related to authorization,
speech recognition or profiling, to name a few. To the best of our knowledge,
this is the first article to conduct such a comprehensive review and
characterization of the security and privacy issues and countermeasures of SPA.Comment: Accepted for publication in ACM Computing Survey