1,706 research outputs found

    Review of Automatic Speech Recognition Methodologies

    This report highlights the crucial role of Automatic Speech Recognition (ASR) techniques in enhancing safety for air traffic control (ATC) in terminal environments. ASR techniques facilitate efficient and accurate transcription of verbal communications, reducing the likelihood of errors. The report also traces the evolution of ASR technologies from Hidden Markov Models (HMMs) through Deep Neural Networks (DNNs) to end-to-end machine learning approaches. Finally, the report details the latest advancements in ASR, focusing on transformer-based models that have outperformed traditional ASR approaches and achieved state-of-the-art results on ASR benchmarks.

    How Does Pre-trained Wav2Vec2.0 Perform on Domain Shifted ASR? An Extensive Benchmark on Air Traffic Control Communications

    Recent work on self-supervised pre-training focuses on leveraging large-scale unlabeled speech data to build robust end-to-end (E2E) acoustic models (AM) that can later be fine-tuned on downstream tasks, e.g., automatic speech recognition (ASR). Yet, few works have investigated the impact on performance when the data differs substantially between the pre-training and downstream fine-tuning phases (i.e., domain shift). We target this scenario by analyzing the robustness of Wav2Vec2.0 and XLS-R models on downstream ASR for a completely unseen domain, i.e., air traffic control (ATC) communications. We benchmark the proposed models on four challenging ATC test sets (signal-to-noise ratios vary between 5 and 20 dB). Relative word error rate (WER) reductions of 20% to 40% are obtained in comparison to hybrid-based state-of-the-art ASR baselines by fine-tuning E2E acoustic models with a small fraction of labeled data. We also study the impact of fine-tuning data size on WERs, going from 5 minutes (few-shot) to 15 hours. Comment: This paper has been submitted to Interspeech 202
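
    To make the fine-tuning setup concrete, the following is a minimal sketch of adapting a pre-trained Wav2Vec2.0 checkpoint to a small labeled set with the HuggingFace transformers library. It is not the paper's exact recipe: the checkpoint name, optimizer, learning rate, and data handling are illustrative assumptions.

        # Hypothetical sketch: fine-tune a pre-trained Wav2Vec2.0 model on a small
        # labeled corpus, then transcribe held-out audio. Checkpoint and
        # hyperparameters are illustrative, not the paper's exact setup.
        import torch
        from transformers import Wav2Vec2Processor, Wav2Vec2ForCTC

        processor = Wav2Vec2Processor.from_pretrained("facebook/wav2vec2-base-960h")
        model = Wav2Vec2ForCTC.from_pretrained("facebook/wav2vec2-base-960h")

        def fine_tune_step(batch_audio, batch_text, optimizer):
            # batch_audio: list of 16 kHz float waveforms
            # batch_text: list of transcripts (upper-cased to match this checkpoint's vocabulary)
            inputs = processor(batch_audio, sampling_rate=16_000,
                               return_tensors="pt", padding=True)
            labels = processor.tokenizer(batch_text, return_tensors="pt",
                                         padding=True).input_ids
            labels[labels == processor.tokenizer.pad_token_id] = -100  # ignore padding in the CTC loss
            outputs = model(inputs.input_values, labels=labels)
            outputs.loss.backward()
            optimizer.step()
            optimizer.zero_grad()
            return outputs.loss.item()

        @torch.no_grad()
        def transcribe(waveform):
            inputs = processor(waveform, sampling_rate=16_000, return_tensors="pt")
            logits = model(inputs.input_values).logits
            ids = torch.argmax(logits, dim=-1)          # greedy CTC decoding
            return processor.batch_decode(ids)[0]

        # Usage (assumed data loader):
        # optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)
        # loss = fine_tune_step(audio_batch, text_batch, optimizer)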

    Attention-Inspired Artificial Neural Networks for Speech Processing: A Systematic Review

    Artificial Neural Networks (ANNs) were created inspired by the neural networks in the human brain and have been widely applied in speech processing. The application areas of ANNs include speech recognition, speech emotion recognition, language identification, speech enhancement, and speech separation, amongst others. Likewise, given that speech processing performed by humans involves complex cognitive processes known as auditory attention, there has been a growing number of papers proposing ANNs supported by deep learning algorithms in conjunction with some mechanism to achieve symmetry with the human attention process. However, while these ANN approaches include attention, there is no categorization of the attention mechanisms integrated into the deep learning algorithms and of their relation to human auditory attention. Therefore, we consider it necessary to review the different attention-inspired ANN approaches to show both academic and industry experts the available models for a wide variety of applications. Based on the PRISMA methodology, we present a systematic review of the literature published since 2000 in which deep learning algorithms are applied to diverse problems related to speech processing. In this paper, 133 research works are selected and the following aspects are described: (i) most relevant features, (ii) ways in which attention has been implemented, (iii) their hypothetical relationship with human attention, and (iv) the evaluation metrics used. Additionally, the four publications most related to human attention were analyzed and their strengths and weaknesses were determined.
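
    For readers unfamiliar with the mechanism the reviewed works build on, here is a generic textbook sketch of scaled dot-product attention in PyTorch; it is not a model taken from any of the 133 reviewed papers, and shapes and names are illustrative.

        # Generic scaled dot-product attention over a sequence of speech frames.
        import math
        import torch

        def scaled_dot_product_attention(query, key, value, mask=None):
            # query, key, value: (batch, time, dim) tensors, e.g. frames of a feature sequence
            scores = torch.matmul(query, key.transpose(-2, -1)) / math.sqrt(query.size(-1))
            if mask is not None:
                scores = scores.masked_fill(mask == 0, float("-inf"))
            weights = torch.softmax(scores, dim=-1)   # attention distribution over time steps
            return torch.matmul(weights, value), weights

        # Example: self-attention over 100 frames of 80-dimensional features
        feats = torch.randn(1, 100, 80)
        context, attn = scaled_dot_product_attention(feats, feats, feats)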

    Lessons Learned in ATCO2: 5000 hours of Air Traffic Control Communications for Robust Automatic Speech Recognition and Understanding

    Voice communication between air traffic controllers (ATCos) and pilots is critical for ensuring safe and efficient air traffic control (ATC). This task requires high levels of awareness from ATCos and can be tedious and error-prone. Recent attempts have been made to integrate artificial intelligence (AI) into ATC in order to reduce the workload of ATCos. However, the development of data-driven AI systems for ATC demands large-scale annotated datasets, which are currently lacking in the field. This paper explores the lessons learned from the ATCO2 project, which aimed to develop a unique platform to collect and preprocess large amounts of ATC data from airspace in real time. Audio and surveillance data were collected from publicly accessible radio frequency channels with VHF receivers owned by a community of volunteers and later uploaded to Opensky Network servers, which can be considered an "unlimited source" of data. In addition, this paper reviews previous work from ATCO2 partners, including (i) robust automatic speech recognition, (ii) natural language processing, (iii) English language identification of ATC communications, and (iv) the integration of surveillance data such as ADS-B. We believe that the pipeline developed during the ATCO2 project, along with the open-sourcing of its data, will encourage research in the ATC field. A sample of the ATCO2 corpus is available at https://www.atco2.org/data, while the full corpus can be purchased through ELDA at http://catalog.elra.info/en-us/repository/browse/ELRA-S0484. We demonstrated that ATCO2 is an appropriate dataset for developing ASR engines when little or nearly no ATC in-domain data is available. For instance, with the CNN-TDNNf Kaldi model, we reached WERs as low as 17.9% and 24.9% on public ATC datasets, which is 6.6%/7.6% better than an "out-of-domain" but supervised CNN-TDNNf model. Comment: Manuscript under review
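
    For reference, the WER figures quoted above are computed from the word-level edit distance between a reference transcript and the recognizer's hypothesis. A minimal self-contained sketch of the metric (a plain Levenshtein implementation, not the scoring tool used in the paper):

        # Word error rate: (substitutions + insertions + deletions) / reference length.
        def wer(reference: str, hypothesis: str) -> float:
            ref, hyp = reference.split(), hypothesis.split()
            d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
            for i in range(len(ref) + 1):
                d[i][0] = i
            for j in range(len(hyp) + 1):
                d[0][j] = j
            for i in range(1, len(ref) + 1):
                for j in range(1, len(hyp) + 1):
                    cost = 0 if ref[i - 1] == hyp[j - 1] else 1
                    d[i][j] = min(d[i - 1][j] + 1,        # deletion
                                  d[i][j - 1] + 1,        # insertion
                                  d[i - 1][j - 1] + cost) # substitution
            return d[len(ref)][len(hyp)] / max(len(ref), 1)

        # e.g. wer("cleared to land runway two seven", "cleared land runway two seven")
        # == 1/6, one deletion out of six reference words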

    A MACHINE LEARNING FRAMEWORK FOR AUTOMATIC SPEECH RECOGNITION IN AIR TRAFFIC CONTROL USING WORD LEVEL BINARY CLASSIFICATION AND TRANSCRIPTION

    Advances in artificial intelligence and machine learning have enabled a variety of new technologies. One such technology is Automatic Speech Recognition (ASR), where a machine is given audio and transcribes the words that were spoken. ASR can be applied in a variety of domains to improve general usability and safety. One such domain is Air Traffic Control (ATC), where ASR promises to improve safety in a mission-critical environment. ASR models have historically required a large amount of clean training data, but ATC environments are noisy and acquiring labeled data is a difficult, expertise-dependent task. This thesis attempts to solve these problems by presenting a machine learning framework that uses word-by-word audio samples to transcribe ATC speech, as sketched below. Instead of transcribing an entire speech sample, the framework transcribes every word individually; the overall transcription is then pieced together based on the word sequence. Each stage of the framework is trained and tested independently of the others, and the overall performance is gauged. The overall framework was judged to be a feasible approach to ASR in ATC.
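
    A hypothetical sketch of the word-level pipeline described above: segment an utterance into word-sized audio chunks, transcribe each chunk on its own, then piece the transcript together in sequence. The segmenter and per-word recognizer below are placeholders, not the thesis's actual models.

        from typing import Callable, List, Sequence

        def transcribe_utterance(
            audio: Sequence[float],
            segment_words: Callable[[Sequence[float]], List[Sequence[float]]],
            recognize_word: Callable[[Sequence[float]], str],
        ) -> str:
            word_chunks = segment_words(audio)                        # stage 1: word-level segmentation
            words = [recognize_word(chunk) for chunk in word_chunks]  # stage 2: per-word recognition
            return " ".join(words)                                    # stage 3: reassemble the transcript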

    Deep Transfer Learning for Automatic Speech Recognition: Towards Better Generalization

    Automatic speech recognition (ASR) has recently become an important challenge when using deep learning (DL). It requires large-scale training datasets and high computational and storage resources. Moreover, DL techniques and machine learning (ML) approaches in general hypothesize that training and testing data come from the same domain, with the same input feature space and data distribution characteristics. This assumption, however, does not hold in some real-world artificial intelligence (AI) applications. Furthermore, there are situations where gathering real data is challenging, expensive, or the events of interest occur rarely, so the data requirements of DL models cannot be met. Deep transfer learning (DTL) has been introduced to overcome these issues; it helps develop high-performing models using real datasets that are small or slightly different from, but related to, the training data. This paper presents a comprehensive survey of DTL-based ASR frameworks to shed light on the latest developments and to help academics and professionals understand current challenges. Specifically, after presenting the DTL background, a well-designed taxonomy is adopted to organize the state of the art. A critical analysis is then conducted to identify the limitations and advantages of each framework. Moving on, a comparative study is introduced to highlight the current challenges before deriving opportunities for future research.
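
    To illustrate the basic transfer-learning pattern the survey covers, here is a minimal PyTorch sketch: reuse an acoustic model pre-trained on a large source-domain corpus, freeze its lower feature-extraction layers, and fine-tune only the upper layers on a small target-domain dataset. The architecture, layer names, and checkpoint path are generic assumptions, not taken from the survey.

        import torch
        import torch.nn as nn

        class SmallASRModel(nn.Module):
            def __init__(self, n_mels=80, hidden=256, vocab=32):
                super().__init__()
                self.encoder = nn.LSTM(n_mels, hidden, num_layers=2, batch_first=True)
                self.head = nn.Linear(hidden, vocab)   # per-frame character logits (CTC-style)

            def forward(self, feats):
                out, _ = self.encoder(feats)
                return self.head(out)

        model = SmallASRModel()
        # model.load_state_dict(torch.load("source_domain_checkpoint.pt"))  # pre-trained weights (assumed path)

        for p in model.encoder.parameters():   # freeze the transferred feature extractor
            p.requires_grad = False

        optimizer = torch.optim.Adam(
            (p for p in model.parameters() if p.requires_grad), lr=1e-4
        )  # only the head is updated on the small target-domain data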

    Counter-Terrorism in New Europe

    In recent years the nature of terrorism has changed dramatically and has taken on a new combination of characteristics. The fight against this terrorism has become a global concern and a central issue of international government policies. Counter-terrorism policies have transformed all around the world, and the importance states place on certain aspects of their counter-terrorist measures varies considerably. There is no agreement on how best to fight terrorism. Within the European Union (EU) this disagreement is most visible, with some countries supporting the United States in their military fight against terrorism, while others strongly oppose it. This paper will focus on five of the ten new EU members that joined in 2004 (Estonia, Poland, the Czech Republic, Slovenia and Malta) and review some of their existing counter-terrorism measures. In doing so, the paper will examine the strengths and weaknesses of each individual state's policy and highlight some of the general trends and patterns among them.

    Automatic speech recognition for European Portuguese

    Master's dissertation in Informatics Engineering. The process of Automatic Speech Recognition (ASR) opens doors to a vast number of possible improvements in customer experience. The use of this type of technology has increased significantly in recent years, a change driven by the recent evolution of ASR systems. The opportunities to use ASR are vast, covering several areas, such as medicine, industry, and business, among others. We must emphasize the use of these voice recognition systems in telecommunications companies, namely in the automation of customer-assistance operators, allowing a call to be routed automatically to specialized operators by detecting the matter to be dealt with through recognition of the spoken utterances. In recent years we have seen major technological breakthroughs in ASR, achieving unprecedented accuracy results that are comparable to humans. We are also seeing a move from what is known as the traditional approach to ASR systems, based on Hidden Markov Models (HMMs), to the newer end-to-end ASR systems that benefit from the use of deep neural networks (DNNs), large amounts of data, and process parallelization. The literature review showed us that the focus of previous work has been almost exclusively on the English and Chinese languages, with little effort being made in the development of other languages, as is the case with Portuguese. In the research carried out, we did not find a model for the European Portuguese (EP) dialect that is freely available for general use. Focused on this problem, this work describes the development of an end-to-end ASR system for EP. To achieve this goal, a set of procedures was followed that allowed us to present the concepts, characteristics and all the steps inherent to the construction of these types of systems. Furthermore, since the transcribed speech needed to accomplish our goal is very limited for EP, we also describe the process of collecting and formatting data from a variety of different sources, most of them freely available to the public. To further improve our results, a variety of different data augmentation techniques were implemented and tested (an example of this kind of technique is sketched below). The obtained models are based on a PyTorch implementation of the Deep Speech 2 model. Our best model achieved a Word Error Rate (WER) of 40.5% on our main test corpus, slightly better than the results obtained by commercial systems on the same data. Around 150 hours of transcribed EP speech were collected, so that they can be used to train other ASR systems or models in different areas of investigation. We gathered a series of interesting results on the use of different batch size values, as well as on the improvements provided by a large variety of data augmentation techniques. Nevertheless, the ASR theme is vast and there is still a variety of different methods and interesting concepts that we could research in order to improve the achieved results.
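
    As a concrete example of the kind of data augmentation mentioned above, here is a sketch of SpecAugment-style frequency and time masking on a log-mel spectrogram. The mask sizes are illustrative assumptions, not the dissertation's exact configuration.

        import torch

        def spec_augment(spectrogram, freq_mask=15, time_mask=30):
            # spectrogram: (n_mels, n_frames) tensor; masking is applied to a copy
            spec = spectrogram.clone()
            n_mels, n_frames = spec.shape

            f = int(torch.randint(0, freq_mask + 1, (1,)))
            f0 = int(torch.randint(0, max(n_mels - f, 1), (1,)))
            spec[f0:f0 + f, :] = 0.0                 # mask a band of frequency channels

            t = int(torch.randint(0, time_mask + 1, (1,)))
            t0 = int(torch.randint(0, max(n_frames - t, 1), (1,)))
            spec[:, t0:t0 + t] = 0.0                 # mask a span of time frames
            return spec

        # augmented = spec_augment(torch.randn(80, 400))  # 80 mel bands, 400 frames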

    Smart Home Personal Assistants: A Security and Privacy Review

    Smart Home Personal Assistants (SPA) are an emerging innovation that is changing the way in which home users interact with technology. However, there are a number of elements that expose these systems to various risks: i) the open nature of the voice channel they use, ii) the complexity of their architecture, iii) the AI features they rely on, and iv) their use of a wide range of underlying technologies. This paper presents an in-depth review of the security and privacy issues in SPA, categorizing the most important attack vectors and their countermeasures. Based on this, we discuss open research challenges that can help steer the community to tackle and address current security and privacy issues in SPA. One of our key findings is that even though the attack surface of SPA is conspicuously broad and there has been a significant amount of recent research effort in this area, research has so far focused on a small part of the attack surface, particularly on issues related to the interaction between the user and the SPA devices. We also point out that further research is needed to tackle issues related to authorization, speech recognition, or profiling, to name a few. To the best of our knowledge, this is the first article to conduct such a comprehensive review and characterization of the security and privacy issues and countermeasures of SPA. Comment: Accepted for publication in ACM Computing Surveys
    • 
