11 research outputs found

    Machine Learning Mitigants for Speech Based Cyber Risk

    Get PDF
    Statistical analysis of speech is an emerging area of machine learning. In this paper, we tackle the biometric challenge in Automatic Speaker Verification (ASV) of differentiating between samples generated by two distinct populations of utterances: those of an authentic human voice and those generated by a synthetic one. Approaching this problem from a statistical perspective requires the definition of a decision rule function and a learning procedure to identify the optimal classifier. Classical state-of-the-art countermeasures rely on strong assumptions, such as stationarity or local stationarity of speech, that may not hold in practice. In this regard, we explore a robust non-linear and non-stationary signal decomposition method known as the Empirical Mode Decomposition, combined in a novel fashion with Mel-Frequency Cepstral Coefficients and a refined classifier technique known as the multi-kernel Support Vector Machine. We undertake significant real-data case studies covering multiple ASV systems using different datasets, including the ASVspoof 2019 challenge database. The obtained results overwhelmingly demonstrate the significance of our feature extraction and classifier approach versus existing conventional methods in reducing the threat of cyber-attacks perpetrated by synthetic voice replication seeking unauthorised access.
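    A minimal Python sketch of the kind of pipeline described above (not the paper's exact implementation): the waveform is decomposed with the EMD, each Intrinsic Mode Function is summarised by MFCCs, and the features are classified by an SVM whose kernel is a weighted combination of several RBF kernels. The EMD-signal (PyEMD), librosa, and scikit-learn packages are assumed; the number of IMFs, kernel bandwidths, and fixed weights are illustrative placeholders (a full multiple-kernel learning method would learn the weights).

    # Illustrative sketch only: EMD + MFCC features fed to a multi-kernel SVM
    # built as a weighted sum of precomputed RBF kernels.
    import numpy as np
    import librosa
    from PyEMD import EMD                      # assumed: EMD-signal package
    from sklearn.metrics.pairwise import rbf_kernel
    from sklearn.svm import SVC

    def emd_mfcc_features(signal, sr, n_imfs=3, n_mfcc=13):
        """Decompose a waveform into IMFs, then summarise each IMF by mean MFCCs."""
        imfs = EMD().emd(signal)[:n_imfs]                       # (n_imfs, T)
        feats = [librosa.feature.mfcc(y=imf, sr=sr, n_mfcc=n_mfcc).mean(axis=1)
                 for imf in imfs]
        return np.concatenate(feats)                            # (n_imfs * n_mfcc,)

    def multi_kernel(X, Y, gammas=(0.01, 0.1, 1.0), weights=(0.5, 0.3, 0.2)):
        """Convex combination of RBF kernels at several bandwidths."""
        return sum(w * rbf_kernel(X, Y, gamma=g) for g, w in zip(gammas, weights))

    # X_train: stacked feature vectors, y_train: 1 = bona fide, 0 = spoofed
    # clf = SVC(kernel="precomputed").fit(multi_kernel(X_train, X_train), y_train)
    # scores = clf.decision_function(multi_kernel(X_test, X_train))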

    A Statistical Perspective of the Empirical Mode Decomposition

    Get PDF
    This research focuses on non-stationary basis decomposition methods in time-frequency analysis. Classical methodologies in this field, such as Fourier analysis and wavelet transforms, rely on strong assumptions about the underlying moment generating process, which may not be valid in real data scenarios or modern applications of machine learning. The literature on non-stationary methods is still in its infancy, and the research contained in this thesis aims to address challenges arising in this area. Among several alternatives, this work is based on the method known as the Empirical Mode Decomposition (EMD). The EMD is a non-parametric time-series decomposition technique that produces a set of time-series functions denoted as Intrinsic Mode Functions (IMFs), which carry specific statistical properties. The main focus is on providing a general and flexible family of basis extraction methods with minimal requirements compared to those of Fourier or wavelet techniques. This is highly important for two main reasons: first, more universal applications can be taken into account; secondly, the EMD requires very little a priori knowledge of the process to be decomposed, and as such it can have greater generalisation properties in statistical applications across a wide array of domains and data types. The contributions of this work deal with several aspects of the decomposition. The first set regards the construction of an IMF from several perspectives: (1) achieving a semi-parametric representation of each basis; (2) extracting such semi-parametric functional forms in a computationally efficient and statistically robust framework. The EMD belongs to the class of path-based decompositions and is therefore often not treated as a stochastic representation. (3) A major contribution involves embedding the deterministic pathwise decomposition framework into a formal stochastic process setting. One of the assumptions inherent to the EMD construction is the requirement of a continuous function on which to apply the decomposition; in general, this may not be the case in many applications. (4) Various multi-kernel Gaussian process formulations of the EMD are proposed through the introduced stochastic embedding. In particular, two different models are proposed: one modelling the temporal modes of oscillation of the EMD and the other capturing the location of instantaneous frequencies within specific frequency regions or bandwidths. (5) The construction of the second stochastic embedding is achieved with an optimisation method called the cross-entropy method, for which two formulations are provided and explored. Applications to speech time series are explored to study these methodological extensions, given that speech signals are non-stationary.
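    Since the EMD is central to the thesis, a bare-bones sketch of the sifting step that produces an IMF candidate may help fix ideas. This is an illustration in Python using numpy/scipy cubic-spline envelopes only; boundary handling, stopping criteria, and the outer loop over IMFs are deliberately omitted, and none of this reflects the thesis's own implementation choices.

    # Sketch of one EMD sifting iteration (illustrative only): an IMF candidate
    # is the signal minus the mean of its cubic-spline extrema envelopes.
    import numpy as np
    from scipy.interpolate import CubicSpline
    from scipy.signal import argrelextrema

    def sift_once(t, x):
        """Subtract the mean of the upper/lower extrema envelopes from x."""
        maxima = argrelextrema(x, np.greater)[0]
        minima = argrelextrema(x, np.less)[0]
        if len(maxima) < 4 or len(minima) < 4:
            return x, True                            # too few extrema: treat as residue
        upper = CubicSpline(t[maxima], x[maxima])(t)  # upper envelope
        lower = CubicSpline(t[minima], x[minima])(t)  # lower envelope
        return x - 0.5 * (upper + lower), False       # IMF candidate, not yet a residue

    # Repeating sift_once until the candidate has (approximately) zero local mean
    # yields the first IMF; subtracting it from x and iterating yields the rest.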

    Speech Recognition

    Get PDF
    Chapters in the first part of the book cover all the essential speech processing techniques for building robust automatic speech recognition systems: the representation of speech signals and methods for speech-feature extraction, acoustic and language modeling, efficient algorithms for searching the hypothesis space, and multimodal approaches to speech recognition. The last part of the book is devoted to other speech processing applications that can use the information from automatic speech recognition for speaker identification and tracking, for prosody modeling in emotion-detection systems, and in other applications able to operate in real-world environments, such as mobile communication services and smart homes.

    Energy-Efficient Recurrent Neural Network Accelerators for Real-Time Inference

    Full text link
    Over the past decade, Deep Learning (DL) and Deep Neural Networks (DNNs) have gone through rapid development. They are now widely applied and have profoundly changed the lives of human beings. As an essential element of DNNs, Recurrent Neural Networks (RNNs) are helpful in processing time-sequential data and are widely used in applications such as speech recognition and machine translation. RNNs are difficult to compute because of their massive arithmetic operations and large memory footprint. RNN inference workloads used to be executed on conventional general-purpose processors, including Central Processing Units (CPUs) and Graphics Processing Units (GPUs); however, these contain hardware blocks unnecessary for RNN computation, such as branch predictors and caching systems, making them suboptimal for RNN processing. To accelerate RNN computations and outperform conventional processors, previous work focused on optimization methods in both software and hardware. On the software side, previous works mainly used model compression to reduce the memory footprint and the arithmetic operations of RNNs. On the hardware side, previous works designed domain-specific hardware accelerators based on Field Programmable Gate Arrays (FPGAs) or Application Specific Integrated Circuits (ASICs) with customized hardware pipelines optimized for efficient processing of RNNs. By following this software-hardware co-design strategy, previous works achieved at least 10X speedup over conventional processors. Many previous works focused on achieving high throughput with a large batch of input streams. However, in real-time applications such as gaming Artificial Intelligence (AI) and dynamical system control, low latency is more critical. Moreover, there is a trend of offloading neural network workloads to edge devices to provide a better user experience and privacy protection. Edge devices, such as mobile phones and wearable devices, are usually resource-constrained with a tight power budget. They require RNN hardware that is more energy-efficient to realize both low-latency inference and long battery life. Brain neurons exhibit sparsity in both the spatial domain and the time domain. Inspired by this, previous work mainly explored model compression to induce spatial sparsity in RNNs. The delta network algorithm instead induces temporal sparsity in RNNs and has been shown in previous works to save over 10X arithmetic operations. In this work, we have proposed customized hardware accelerators that exploit temporal sparsity in Gated Recurrent Unit (GRU)-RNNs and Long Short-Term Memory (LSTM)-RNNs to achieve energy-efficient real-time RNN inference. First, we have proposed DeltaRNN, the first-ever RNN accelerator to exploit temporal sparsity in GRU-RNNs. DeltaRNN has achieved 1.2 TOp/s effective throughput with a batch size of 1, which is 15X higher than related works. Second, we have designed EdgeDRNN to accelerate GRU-RNN edge inference. Compared to DeltaRNN, EdgeDRNN does not rely on on-chip memory to store RNN weights and focuses on reducing off-chip Dynamic Random Access Memory (DRAM) data traffic using a more scalable architecture. EdgeDRNN has realized real-time inference of large GRU-RNNs with sub-millisecond latency and only 2.3 W wall-plug power consumption, achieving 4X higher energy efficiency than commercial edge AI platforms like the NVIDIA Jetson Nano. Third, we have used DeltaRNN to realize the first-ever continuous speech recognition system with the Dynamic Audio Sensor (DAS) as the front-end. The DAS is a neuromorphic event-driven sensor that produces a stream of asynchronous events instead of audio data sampled at a fixed sample rate. We have also showcased how an RNN accelerator can be integrated with an event-driven sensor on the same chip to realize ultra-low-power Keyword Spotting (KWS) on the extreme edge. Fourth, we have used EdgeDRNN to control a powered robotic prosthesis, using an RNN controller to replace a conventional proportional-derivative (PD) controller. EdgeDRNN has achieved 21 μs latency when running the RNN controller and could maintain stable control of the prosthesis. These applications of DeltaRNN and EdgeDRNN demonstrate their value in solving real-world problems. Finally, we have applied the delta network algorithm to LSTM-RNNs and combined it with a customized structured pruning method, called Column-Balanced Targeted Dropout (CBTD), to induce spatio-temporal sparsity in LSTM-RNNs. We have then proposed another FPGA-based accelerator called Spartus, the first RNN accelerator that exploits spatio-temporal sparsity. Spartus achieved 9.4 TOp/s effective throughput with a batch size of 1, the highest among present FPGA-based RNN accelerators with a power budget of around 10 W. Spartus can complete the inference of an LSTM layer with 5 million parameters within 1 μs.
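    A compact numpy sketch of the temporal-sparsity idea these accelerators exploit: in the delta network algorithm, an input (or hidden-state) component is propagated through the matrix-vector product only when it has changed by more than a threshold since its last propagated value, so most weight columns can be skipped. The threshold and dimensions below are placeholders, and the code illustrates only the arithmetic, not the hardware designs.

    # Sketch of the delta-network matrix-vector update: only components whose
    # change exceeds a threshold are propagated, so most weight columns are
    # skipped at inference time. Threshold and sizes are illustrative.
    import numpy as np

    class DeltaMatVec:
        def __init__(self, W, theta=0.1):
            self.W = W                          # (out_dim, in_dim) weight matrix
            self.theta = theta                  # delta threshold
            self.x_prev = np.zeros(W.shape[1])  # last propagated input values
            self.acc = np.zeros(W.shape[0])     # running pre-activation W @ x_prev

        def __call__(self, x):
            delta = x - self.x_prev
            active = np.abs(delta) > self.theta            # temporally "changed" components
            self.acc += self.W[:, active] @ delta[active]  # update with active columns only
            self.x_prev[active] = x[active]                # remember propagated values
            return self.acc                                # approximates W @ x

    # In a GRU/LSTM the same trick is applied to both the input x_t and the hidden
    # state h_{t-1}; with slowly varying speech features most components are skipped,
    # which is the source of the >10X reduction in arithmetic operations cited above.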

    Multi-modal Video Content Understanding

    Get PDF
    Video is an important format of information. Humans use videos for a variety of purposes such as entertainment, education, communication, information sharing, and capturing memories. To this date, humankind has accumulated a colossal amount of freely available video material online. Manual processing at this scale is simply impossible. To this end, many research efforts have been dedicated to the automatic processing of video content. At the same time, human perception of the world is multi-modal. A human uses multiple senses to understand the environment, objects, and their interactions. When watching a video, we perceive the content via both audio and visual modalities, and removing one of these modalities results in a less immersive experience. Similarly, if information in the two modalities does not correspond, it may create a sense of dissonance. Therefore, joint modelling of multiple modalities (such as audio, visual, and text) within one model is an active research area. In the last decade, the fields of automatic video understanding and multi-modal modelling have seen exceptional progress due to the ubiquitous success of deep learning models and, more recently, transformer-based architectures in particular. Our work draws on these advances and pushes the state-of-the-art of multi-modal video understanding forward. Applications of automatic multi-modal video processing are broad and exciting! For instance, the content-based textual description of a video (video captioning) may allow a visually- or auditory-impaired person to understand the content and, thus, engage in brighter social interactions. However, prior work in video content description relies on the visual input alone, missing vital information only available in the audio stream. To this end, we proposed two novel multi-modal transformer models that encode audio and visual interactions simultaneously. More specifically, first, we introduced a late-fusion multi-modal transformer that is highly modular and allows the processing of an arbitrary set of modalities. Second, an efficient bi-modal transformer was presented that encodes audio-visual cues starting from the lower network layers, allowing richer audio-visual features and, as a result, stronger performance. Another application is automatic visually-guided sound generation, which might help professional sound (foley) designers who spend hours searching a database for audio relevant to a movie scene. Previous approaches for automatic conditional audio generation support only one class (e.g. “dog barking”), while real-life applications may require generation for hundreds of data classes, and one would need to train one model for every data class, which can be infeasible. To bridge this gap, we introduced a novel two-stage model that first efficiently encodes audio as a set of codebook vectors (i.e. trains to make “building blocks”) and then learns to sample these audio vectors given visual inputs to produce an audio track relevant to that visual input. Moreover, we studied the automatic evaluation of the conditional audio generation model and proposed metrics that measure both the quality and the relevance of the generated samples. Finally, as video editing is becoming more common among non-professionals due to the increased popularity of services such as YouTube, automatic assistance during video editing grows in demand, e.g. off-sync detection between audio and visual tracks. Prior work in audio-visual synchronization was devoted to solving the task on lip-syncing datasets with “dense” signals, such as interviews and presentations. In such videos, synchronization cues occur “densely” across time, and it is enough to process just a few tenths of a second to synchronize the tracks. In contrast, open-domain videos mostly have only “sparse” cues that occur just once in a seconds-long video clip (e.g. “chopping wood”). To address this, we: a) proposed a novel dataset with “sparse” sounds; b) designed a model which can efficiently encode seconds-long audio-visual tracks into a small set of “learnable selectors” that is then used for synchronization. In addition, we explored the temporal artefacts that common audio and video compression algorithms leave in data streams. To prevent a model from learning to rely on these artefacts, we introduced a list of recommendations on how to mitigate them. This thesis provides the details of the proposed methodologies as well as a comprehensive overview of advances in relevant fields of multi-modal video understanding. In addition, we provide a discussion of potential research directions that can bring significant contributions to the field.
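    As a rough illustration of the late-fusion idea mentioned above, the following PyTorch sketch encodes each modality with its own transformer encoder and fuses the pooled embeddings afterwards. The layer sizes, mean pooling, and linear fusion head are arbitrary placeholders and do not reproduce the thesis architectures.

    # Illustrative late-fusion multi-modal encoder: separate per-modality
    # transformer encoders, pooled embeddings concatenated and projected.
    import torch
    import torch.nn as nn

    class LateFusion(nn.Module):
        def __init__(self, d_model=256, n_heads=4, n_layers=2, n_outputs=512):
            super().__init__()
            make_enc = lambda: nn.TransformerEncoder(
                nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True),
                num_layers=n_layers)
            self.audio_enc, self.visual_enc = make_enc(), make_enc()
            self.fuse = nn.Linear(2 * d_model, n_outputs)    # simple fusion head

        def forward(self, audio_tokens, visual_tokens):
            # audio_tokens: (B, Ta, d_model), visual_tokens: (B, Tv, d_model)
            a = self.audio_enc(audio_tokens).mean(dim=1)     # pooled audio embedding
            v = self.visual_enc(visual_tokens).mean(dim=1)   # pooled visual embedding
            return self.fuse(torch.cat([a, v], dim=-1))      # joint audio-visual feature

    # fused = LateFusion()(torch.randn(2, 50, 256), torch.randn(2, 16, 256))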

    Deep Scattering and End-to-End Speech Models towards Low Resource Speech Recognition

    Get PDF
    Automatic Speech Recognition (ASR) has made major leaps in its advancement largely due to two different machine learning models: Hidden Markov Models (HMMs) and Deep Neural Networks (DNNs). State-of-the-art results have been achieved by combining these two disparate methods to form a hybrid system. This also requires that various components of the speech recognizer be trained independently based on a probabilistic noisy channel model. Although this hybrid HMM-DNN ASR method has been successful in recent studies, the independent development of the individual components used in hybrid HMM-DNN models makes ASR development fragile and expensive in terms of the time needed to develop the various components and their associated sub-systems. The resulting trade-off is that ASR systems are difficult to develop and use, especially for new applications and languages. The alternative approach, known as the end-to-end paradigm, uses a single deep neural network architecture to encapsulate as many subcomponents of speech recognition as possible in a single process. In this so-called end-to-end paradigm, latent variables of sub-components are subsumed by the neural network sub-architectures and their associated parameters. The gains of a simplified ASR development process are in turn traded for higher internal model complexity and the computational resources needed to train the end-to-end models. This research focuses on taking advantage of the development gains of end-to-end ASR models for new and low-resource languages. Using a specialised, lightweight, convolution-like neural network called the deep scattering network (DSN) to replace the input layer of the end-to-end model, our objective was to measure the performance of the end-to-end model using these augmented speech features, while checking whether the lightweight, wavelet-based architecture brought about any improvements for low-resource speech recognition in particular. The results showed that it is possible to use this compact strategy for speech pattern recognition by deploying deep scattering network features, whose vectors have higher dimensionality than traditional speech features. With Word Error Rates of 26.8% and 76.7% for the SVCSR and LVCSR tasks respectively, the ASR system fell a few WER points short of the respective baselines. In addition, training times tended to be longer than those of the respective baselines, and therefore there was no significant improvement for low-resource speech recognition training.
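    For context on the front-end replacement described above, here is a short Python sketch of extracting deep scattering features from a waveform, assuming the kymatio package is available; the scale J, the number Q of wavelets per octave, and the log compression are illustrative choices rather than the thesis configuration.

    # Sketch of deep-scattering feature extraction for an ASR front-end,
    # assuming the kymatio package; J and Q are illustrative settings.
    import numpy as np
    from kymatio.numpy import Scattering1D

    def scattering_features(waveform, J=6, Q=8):
        """Scattering coefficients for a 1-D waveform (replaces MFCC/filterbank input)."""
        T = waveform.shape[-1]
        scattering = Scattering1D(J=J, shape=T, Q=Q)
        Sx = scattering(waveform)          # (n_coeffs, T / 2**J): time-frequency-like map
        return np.log1p(Sx)                # log compression, as with filterbank features

    # These higher-dimensional vectors feed the end-to-end model in place of
    # conventional speech features; the rest of the network is unchanged.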

    Integration of MKL-Based and I-Vector-Based Speaker Verification by Short Utterances

    No full text