A MACHINE LEARNING FRAMEWORK FOR AUTOMATIC SPEECH RECOGNITION IN AIR TRAFFIC CONTROL USING WORD LEVEL BINARY CLASSIFICATION AND TRANSCRIPTION
Advances in Artificial Intelligence and Machine Learning have enabled a variety of new technologies. One such technology is Automatic Speech Recognition (ASR), where a machine is given audio and transcribes the words that were spoken. ASR can be applied in a variety of domains to improve general usability and safety. One such domain is Air Traffic Control (ATC). ASR in ATC promises to improve safety in a mission-critical environment. ASR models have historically required a large amount of clean training data. ATC environments are noisy, and acquiring labeled data is a difficult, expertise-dependent task. This thesis attempts to solve these problems by presenting a machine learning framework which uses word-by-word audio samples to transcribe ATC speech. Instead of transcribing an entire speech sample at once, this framework transcribes every word individually. The overall transcription is then pieced together based on the word sequence. The stages of the framework are trained and tested independently of one another, and the overall performance is then gauged. The overall framework was found to be a feasible approach to ASR in ATC.
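The word-by-word idea can be illustrated with a minimal sketch. It assumes the utterance has already been cut into word-length clips and that two stand-in callables are available; the names `is_speech_word` and `transcribe_word`, the clip representation, and the role given to the binary classifier are all hypothetical and are not taken from the thesis.

```python
from typing import Callable, List, Sequence

# Hypothetical representation: one word-length audio clip as a list of samples.
Segment = Sequence[float]


def transcribe_utterance(
    segments: List[Segment],
    is_speech_word: Callable[[Segment], bool],
    transcribe_word: Callable[[Segment], str],
) -> str:
    """Transcribe each clip on its own, then join the words in sequence order."""
    words = []
    for segment in segments:
        # Assumed role of the binary stage: drop clips that contain no word.
        if not is_speech_word(segment):
            continue
        words.append(transcribe_word(segment))
    return " ".join(words)


# Toy stand-ins so the sketch runs end to end; real stages would be trained models.
toy_vocab = {0.5: "climb", 0.9: "maintain"}
toy_segments = [[0.5, 0.4], [0.0, 0.0], [0.9, 0.8]]
toy_result = transcribe_utterance(
    toy_segments,
    is_speech_word=lambda seg: max(abs(x) for x in seg) > 0.1,
    transcribe_word=lambda seg: toy_vocab[seg[0]],
)
```

Because each stage is passed in as a callable, the stages can be trained and tested separately and swapped without touching the assembly step.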
Deep Learning for Distant Speech Recognition
Deep learning is an emerging technology that is considered one of the most
promising directions for reaching higher levels of artificial intelligence.
Among the other achievements, building computers that understand speech
represents a crucial leap towards intelligent machines. Despite the great
efforts of the past decades, however, a natural and robust human-machine speech
interaction still appears to be out of reach, especially when users interact
with a distant microphone in noisy and reverberant environments. The latter
disturbances severely hamper the intelligibility of a speech signal, making
Distant Speech Recognition (DSR) one of the major open challenges in the field.
This thesis addresses the latter scenario and proposes some novel techniques,
architectures, and algorithms to improve the robustness of distant-talking
acoustic models. We first elaborate on methodologies for realistic data
contamination, with a particular emphasis on DNN training with simulated data.
We then investigate approaches for better exploiting speech contexts,
proposing some original methodologies for both feed-forward and recurrent
neural networks. Lastly, inspired by the idea that cooperation across different
DNNs could be the key for counteracting the harmful effects of noise and
reverberation, we propose a novel deep learning paradigm called network of deep
neural networks. The analysis of the original concepts was based on extensive
experimental validations conducted on both real and simulated data, considering
different corpora, microphone configurations, environments, noisy conditions,
and ASR tasks. Comment: PhD Thesis Unitn, 201
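As a rough illustration of data contamination with simulated data, a common recipe is to convolve clean speech with a room impulse response and then add noise. The sketch below assumes that recipe; the abstract does not state the thesis's exact procedure, and the function names and the `noise_gain` parameter are hypothetical.

```python
from typing import List


def convolve(signal: List[float], impulse_response: List[float]) -> List[float]:
    """Full discrete convolution, written out directly for clarity."""
    out = [0.0] * (len(signal) + len(impulse_response) - 1)
    for i, s in enumerate(signal):
        for j, h in enumerate(impulse_response):
            out[i + j] += s * h
    return out


def contaminate(
    clean: List[float],
    impulse_response: List[float],
    noise: List[float],
    noise_gain: float,
) -> List[float]:
    """Simulate a distant microphone: add reverberation, then additive noise."""
    if not noise:
        raise ValueError("noise must not be empty")
    # Truncate to the original length so training labels stay aligned.
    reverberant = convolve(clean, impulse_response)[: len(clean)]
    # The noise is repeated cyclically if it is shorter than the speech.
    return [
        r + noise_gain * noise[i % len(noise)]
        for i, r in enumerate(reverberant)
    ]
```

A single echo at half amplitude, `impulse_response = [1.0, 0.5]`, is enough to see the effect on a unit impulse; measured or simulated room responses would be much longer.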
Towards Scalable, Private and Practical Deep Learning
Deep Learning (DL) models have drastically improved performance on Artificial Intelligence (AI) tasks such as image recognition, word prediction, and translation, among many others, on which traditional Machine Learning (ML) models fall short. However, DL models are costly to design, train, and deploy due to their computing and memory demands. Designing DL models usually requires extensive expertise and significant manual tuning effort. Even with the latest accelerators such as Graphics Processing Units (GPUs) and Tensor Processing Units (TPUs), training DL models can take a prohibitively long time; training large DL models in a distributed manner is therefore the norm. Massive amounts of data are available thanks to the prevalence of mobile and internet-of-things (IoT) devices. However, regulations such as HIPAA and GDPR limit the access and transmission of personal data to protect security and privacy. Therefore, enabling DL model training in a decentralized but private fashion is urgent and critical. Deploying trained DL models in a real-world environment usually requires meeting Quality of Service (QoS) standards, which makes the adaptability of DL models an important yet challenging matter. In this dissertation, we aim to address the above challenges and take a step towards scalable, private, and practical deep learning. To simplify DL model design, we propose Efficient Progressive Neural-Architecture Search (EPNAS) and FedCust to automatically design model architectures and tune hyperparameters, respectively. To provide efficient and robust distributed training while preserving privacy, we design LEASGD, TiFL, and HDFL. We further study the security of distributed learning, focusing on how data heterogeneity affects backdoor attacks and how to mitigate such threats. Finally, we use super resolution (SR) as an example application to explore model adaptability for cross-platform deployment and dynamic runtime environments. Specifically, we propose the DySR and AdaSR frameworks, which enable SR models to meet QoS by dynamically adapting to available resources instantly and seamlessly, without excessive memory overhead.
Deepfake detection and low-resource language speech recognition using deep learning
While deep learning algorithms have made significant progress in automatic speech recognition and natural language processing, they require a significant amount of labelled training data to perform effectively. As such, these applications have not been extended to languages that have only a limited amount of data available, such as extinct or endangered languages. Another problem caused by the rise of deep learning is that individuals with malicious intent have been able to leverage these algorithms to create fake content that can pose serious harm to security and public safety. In this work, we explore solutions to both of these problems. First, we investigate different data augmentation methods and acoustic architecture designs to improve automatic speech recognition performance on low-resource languages. Data augmentation for audio often involves changing the characteristics of the audio without modifying the ground truth. For example, different background noise can be added to an utterance while maintaining the content of the speech. We also explore how different acoustic model paradigms and levels of complexity affect performance on low-resource languages. These methods are evaluated on Seneca, an endangered language spoken by a Native American tribe, and Iban, a low-resource language spoken in Malaysia and Brunei. Second, we explore methods for speaker identification and audio spoofing detection. A spoofing attack involves using either a text-to-speech or a voice conversion application to generate audio that mimics the identity of a target speaker. These methods are evaluated on the ASVSpoof 2019 Logical Access dataset, which contains audio generated using various methods of voice conversion and text-to-speech synthesis.
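The noise-based augmentation mentioned above can be sketched as mixing background noise into an utterance at a chosen signal-to-noise ratio (SNR), leaving the transcript unchanged. This is a minimal sketch under that assumption; the function name `mix_at_snr` and the choice of SNR as the control parameter are illustrative and not taken from the thesis.

```python
import math
import random
from typing import List


def mix_at_snr(speech: List[float], noise: List[float], snr_db: float) -> List[float]:
    """Add noise to speech so that the mixture has the requested SNR in dB."""
    if not speech or not noise:
        raise ValueError("speech and noise must not be empty")
    # Repeat the noise cyclically so it covers the whole utterance.
    tiled = [noise[i % len(noise)] for i in range(len(speech))]
    speech_power = sum(x * x for x in speech) / len(speech)
    noise_power = sum(x * x for x in tiled) / len(tiled)
    if noise_power == 0.0:
        return list(speech)  # silent noise: nothing to mix in
    # Choose the gain so that speech_power / (gain**2 * noise_power) == 10**(snr_db / 10).
    gain = math.sqrt(speech_power / (noise_power * 10 ** (snr_db / 10)))
    return [s + gain * n for s, n in zip(speech, tiled)]


# Toy signals: a sine wave as "speech" and uniform random samples as "noise".
rng = random.Random(0)
toy_speech = [math.sin(2 * math.pi * 5 * t / 200) for t in range(200)]
toy_noise = [rng.uniform(-1.0, 1.0) for _ in range(50)]
toy_mixed = mix_at_snr(toy_speech, toy_noise, snr_db=10.0)
```

Drawing `snr_db` at random for each training example is one simple way to produce several noisy variants of the same labelled utterance.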
Data-Driven Policy Optimisation for Multi-Domain Task-Oriented Dialogue
Recent developments in machine learning, along with a general shift in the public attitude towards digital personal assistants, have opened new frontiers for conversational systems. Nevertheless, building data-driven multi-domain conversational agents that act optimally given a dialogue context is an open challenge. The first step towards that goal is developing an efficient way of learning a dialogue policy in new domains. Secondly, it is important to have the ability to collect and utilise human-human conversational data to bootstrap an agent's knowledge. The work presented in this thesis demonstrates how a neural dialogue manager fine-tuned with reinforcement learning presents a viable approach for learning a dialogue policy efficiently and across many domains.
The thesis starts by introducing a dialogue management module that learns through interactions to act optimally given the current context of a conversation. The current shift towards neural, parameter-rich systems does not fully address the problem of error noise coming from speech recognition or natural language understanding components. A Bayesian approach is therefore proposed to learn more robust and effective policy management in direct interactions without any prior data. By putting a distribution over model weights, the learning agent is less prone to overfit to particular dialogue realizations, and a more efficient exploration policy can therefore be employed. The results show that deep reinforcement learning performs on par with non-parametric models even in a low data regime while significantly reducing the computational complexity compared with the previous state-of-the-art.
The deployment of a dialogue manager without any pre-training on human conversations is not a viable option from an industry perspective. However, the progress in building statistical systems, particularly dialogue managers, is hindered by the scale of data available. To address this fundamental obstacle, a novel data-collection pipeline entirely based on crowdsourcing without the need for hiring professional annotators is introduced. The validation of the approach results in the collection of the Multi-Domain Wizard-of-Oz dataset (MultiWOZ), a fully labeled collection of human-human written conversations spanning over multiple domains and topics. The proposed dataset creates a set of new benchmarks (belief tracking, policy optimisation, and response generation) significantly raising the complexity of analysed dialogues.
The collected dataset serves as a foundation for a novel reinforcement learning (RL)-based approach for training a multi-domain dialogue manager. A Multi-Action and Slot Dialogue Agent (MASDA) is proposed to combat some limitations: 1) handling complex multi-domain dialogues with multiple concurrent actions present in a single turn; and 2) lack of interpretability, which consequently impedes the use of intermediate signals (e.g., dialogue turn annotations) if such signals are available. MASDA explicitly models system acts and slots using intermediate signals, resulting in an improved task-based end-to-end framework. The model can also select concurrent actions in a single turn, thus enriching the representation of the generated responses. The proposed framework allows for RL training of dialogue task completion metrics when dealing with concurrent actions. The results demonstrate the advantages of both 1) handling concurrent actions and 2) exploiting intermediate signals: MASDA outperforms previous end-to-end frameworks while also offering improved scalability. EPSR
Adversarial inference and manipulation of machine learning models
Machine learning (ML) has established itself as a core component of various critical applications. However, with the increasing adoption of ML models, multiple attacks have emerged targeting different stages of the ML pipeline. Abstractly, the ML pipeline is divided into three phases: training, updating, and inference. In this thesis, we evaluate the privacy, security, and accountability risks of these three phases. Firstly, we explore the inference phase, where the adversary can only access the target model after deployment. In this setting, we explore one of the most severe attacks against ML models, namely the membership inference attack (MIA). We relax all of the MIA's key assumptions, thereby showing that such attacks are broadly applicable at low cost and pose a more severe risk than previously thought. Secondly, we study the updating phase. To that end, we propose a new attack surface against ML models, i.e., the change in the output of an ML model before and after being updated. We then introduce four attacks, including data reconstruction ones, against this setting. Thirdly, we explore the training phase, where the adversary interferes with the target model's training. In this setting, we propose the model hijacking attack, in which the adversary can hijack the target model to perform their own illegal task. Finally, we propose different defense mechanisms to mitigate the identified risks.
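For orientation, a membership inference attack can be illustrated with a very simple confidence-threshold baseline: a sample is flagged as a suspected training member when the target model's top predicted probability is high. This is only a sketch of that generic baseline, not the attacks proposed in the thesis; the function name and the default threshold are hypothetical.

```python
from typing import List, Sequence


def infer_membership(
    posteriors: Sequence[Sequence[float]], threshold: float = 0.9
) -> List[bool]:
    """Flag a sample as a suspected training member when the top confidence is high.

    Each entry of `posteriors` is the target model's probability vector for one sample.
    """
    return [max(p) >= threshold for p in posteriors]


# Toy probability vectors: one very confident prediction, one uncertain prediction.
toy_posteriors = [[0.98, 0.01, 0.01], [0.40, 0.35, 0.25]]
toy_guesses = infer_membership(toy_posteriors)
```

The intuition behind this baseline is that models tend to be more confident on samples they were trained on, so the attacker needs only the model's output probabilities.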