151 research outputs found

    An Overview of Deep-Learning-Based Audio-Visual Speech Enhancement and Separation

    Get PDF
    Speech enhancement and speech separation are two related tasks, whose purpose is to extract either one or more target speech signals, respectively, from a mixture of sounds generated by several sources. Traditionally, these tasks have been tackled using signal processing and machine learning techniques applied to the available acoustic signals. Since the visual aspect of speech is essentially unaffected by the acoustic environment, visual information from the target speakers, such as lip movements and facial expressions, has also been used for speech enhancement and speech separation systems. In order to efficiently fuse acoustic and visual information, researchers have exploited the flexibility of data-driven approaches, specifically deep learning, achieving strong performance. The ceaseless proposal of a large number of techniques to extract features and fuse multimodal information has highlighted the need for an overview that comprehensively describes and discusses audio-visual speech enhancement and separation based on deep learning. In this paper, we provide a systematic survey of this research topic, focusing on the main elements that characterise the systems in the literature: acoustic features; visual features; deep learning methods; fusion techniques; training targets and objective functions. In addition, we review deep-learning-based methods for speech reconstruction from silent videos and audio-visual sound source separation for non-speech signals, since these methods can be more or less directly applied to audio-visual speech enhancement and separation. Finally, we survey commonly employed audio-visual speech datasets, given their central role in the development of data-driven approaches, and evaluation methods, because they are generally used to compare different systems and determine their performance

    Proceedings of the 8th Workshop on Detection and Classification of Acoustic Scenes and Events (DCASE 2023)

    Get PDF
    This volume gathers the papers presented at the Detection and Classification of Acoustic Scenes and Events 2023 Workshop (DCASE2023), Tampere, Finland, during 21–22 September 2023

    Energy-Efficient Recurrent Neural Network Accelerators for Real-Time Inference

    Full text link
    Over the past decade, Deep Learning (DL) and Deep Neural Network (DNN) have gone through a rapid development. They are now vastly applied to various applications and have profoundly changed the life of hu- man beings. As an essential element of DNN, Recurrent Neural Networks (RNN) are helpful in processing time-sequential data and are widely used in applications such as speech recognition and machine translation. RNNs are difficult to compute because of their massive arithmetic operations and large memory footprint. RNN inference workloads used to be executed on conventional general-purpose processors including Central Processing Units (CPU) and Graphics Processing Units (GPU); however, they have un- necessary hardware blocks for RNN computation such as branch predictor, caching system, making them not optimal for RNN processing. To accelerate RNN computations and outperform the performance of conventional processors, previous work focused on optimization methods on both software and hardware. On the software side, previous works mainly used model compression to reduce the memory footprint and the arithmetic operations of RNNs. On the hardware side, previous works also designed domain-specific hardware accelerators based on Field Pro- grammable Gate Arrays (FPGA) or Application Specific Integrated Circuits (ASIC) with customized hardware pipelines optimized for efficient pro- cessing of RNNs. By following this software-hardware co-design strategy, previous works achieved at least 10X speedup over conventional processors. Many previous works focused on achieving high throughput with a large batch of input streams. However, in real-time applications, such as gaming Artificial Intellegence (AI), dynamical system control, low latency is more critical. Moreover, there is a trend of offloading neural network workloads to edge devices to provide a better user experience and privacy protection. Edge devices, such as mobile phones and wearable devices, are usually resource-constrained with a tight power budget. They require RNN hard- ware that is more energy-efficient to realize both low-latency inference and long battery life. Brain neurons have sparsity in both the spatial domain and time domain. Inspired by this human nature, previous work mainly explored model compression to induce spatial sparsity in RNNs. The delta network algorithm alternatively induces temporal sparsity in RNNs and can save over 10X arithmetic operations in RNNs proven by previous works. In this work, we have proposed customized hardware accelerators to exploit temporal sparsity in Gated Recurrent Unit (GRU)-RNNs and Long Short-Term Memory (LSTM)-RNNs to achieve energy-efficient real-time RNN inference. First, we have proposed DeltaRNN, the first-ever RNN accelerator to exploit temporal sparsity in GRU-RNNs. DeltaRNN has achieved 1.2 TOp/s effective throughput with a batch size of 1, which is 15X higher than its related works. Second, we have designed EdgeDRNN to accelerate GRU-RNN edge inference. Compared to DeltaRNN, EdgeDRNN does not rely on on-chip memory to store RNN weights and focuses on reducing off-chip Dynamic Random Access Memory (DRAM) data traffic using a more scalable architecture. EdgeDRNN have realized real-time inference of large GRU-RNNs with submillisecond latency and only 2.3 W wall plug power consumption, achieving 4X higher energy efficiency than commercial edge AI platforms like NVIDIA Jetson Nano. Third, we have used DeltaRNN to realize the first-ever continuous speech recognition sys- tem with the Dynamic Audio Sensor (DAS) as the front-end. The DAS is a neuromorphic event-driven sensor that produces a stream of asyn- chronous events instead of audio data sampled at a fixed sample rate. We have also showcased how an RNN accelerator can be integrated with an event-driven sensor on the same chip to realize ultra-low-power Keyword Spotting (KWS) on the extreme edge. Fourth, we have used EdgeDRNN to control a powered robotic prosthesis using an RNN controller to replace a conventional proportional–derivative (PD) controller. EdgeDRNN has achieved 21 μs latency of running the RNN controller and could maintain stable control of the prosthesis. We have used DeltaRNN and EdgeDRNN to solve these problems to prove their value in solving real-world problems. Finally, we have applied the delta network algorithm on LSTM-RNNs and have combined it with a customized structured pruning method, called Column-Balanced Targeted Dropout (CBTD), to induce spatio-temporal sparsity in LSTM-RNNs. Then, we have proposed another FPGA-based accelerator called Spartus, the first RNN accelerator that exploits spatio- temporal sparsity. Spartus achieved 9.4 TOp/s effective throughput with a batch size of 1, the highest among present FPGA-based RNN accelerators with a power budget around 10 W. Spartus can complete the inference of an LSTM layer having 5 million parameters within 1 μs

    A Review of Deep Learning Techniques for Speech Processing

    Full text link
    The field of speech processing has undergone a transformative shift with the advent of deep learning. The use of multiple processing layers has enabled the creation of models capable of extracting intricate features from speech data. This development has paved the way for unparalleled advancements in speech recognition, text-to-speech synthesis, automatic speech recognition, and emotion recognition, propelling the performance of these tasks to unprecedented heights. The power of deep learning techniques has opened up new avenues for research and innovation in the field of speech processing, with far-reaching implications for a range of industries and applications. This review paper provides a comprehensive overview of the key deep learning models and their applications in speech-processing tasks. We begin by tracing the evolution of speech processing research, from early approaches, such as MFCC and HMM, to more recent advances in deep learning architectures, such as CNNs, RNNs, transformers, conformers, and diffusion models. We categorize the approaches and compare their strengths and weaknesses for solving speech-processing tasks. Furthermore, we extensively cover various speech-processing tasks, datasets, and benchmarks used in the literature and describe how different deep-learning networks have been utilized to tackle these tasks. Additionally, we discuss the challenges and future directions of deep learning in speech processing, including the need for more parameter-efficient, interpretable models and the potential of deep learning for multimodal speech processing. By examining the field's evolution, comparing and contrasting different approaches, and highlighting future directions and challenges, we hope to inspire further research in this exciting and rapidly advancing field

    On Improving Generalization of CNN-Based Image Classification with Delineation Maps Using the CORF Push-Pull Inhibition Operator

    Get PDF
    Deployed image classification pipelines are typically dependent on the images captured in real-world environments. This means that images might be affected by different sources of perturbations (e.g. sensor noise in low-light environments). The main challenge arises by the fact that image quality directly impacts the reliability and consistency of classification tasks. This challenge has, hence, attracted wide interest within the computer vision communities. We propose a transformation step that attempts to enhance the generalization ability of CNN models in the presence of unseen noise in the test set. Concretely, the delineation maps of given images are determined using the CORF push-pull inhibition operator. Such an operation transforms an input image into a space that is more robust to noise before being processed by a CNN. We evaluated our approach on the Fashion MNIST data set with an AlexNet model. It turned out that the proposed CORF-augmented pipeline achieved comparable results on noise-free images to those of a conventional AlexNet classification model without CORF delineation maps, but it consistently achieved significantly superior performance on test images perturbed with different levels of Gaussian and uniform noise

    The assessment and development of methods in (spatial) sound ecology

    Get PDF
    As vital ecosystems across the globe enter unchartered pressure from climate change industrial land use, understanding the processes driving ecosystem viability has never been more critical. Nuanced ecosystem understanding comes from well-collected field data and a wealth of associated interpretations. In recent years the most popular methods of ecosystem monitoring have revolutionised from often damaging and labour-intensive manual data collection to automated methods of data collection and analysis. Sound ecology describes the school of research that uses information transmitted through sound to infer properties about an area's species, biodiversity, and health. In this thesis, we explore and develop state-of-the-art automated monitoring with sound, specifically relating to data storage practice and spatial acoustic recording and data analysis. In the first chapter, we explore the necessity and methods of ecosystem monitoring, focusing on acoustic monitoring, later exploring how and why sound is recorded and the current state-of-the-art in acoustic monitoring. Chapter one concludes with us setting out the aims and overall content of the following chapters. We begin the second chapter by exploring methods used to mitigate data storage expense, a widespread issue as automated methods quickly amass vast amounts of data which can be expensive and impractical to manage. Importantly I explain how these data management practices are often used without known consequence, something I then address. Specifically, I present evidence that the most used data reduction methods (namely compression and temporal subsetting) have a surprisingly small impact on the information content of recorded sound compared to the method of analysis. This work also adds to the increasing evidence that deep learning-based methods of environmental sound quantification are more powerful and robust to experimental variation than more traditional acoustic indices. In the latter chapters, I focus on using multichannel acoustic recording for sound-source localisation. Knowing where a sound originated has a range of ecological uses, including counting individuals, locating threats, and monitoring habitat use. While an exciting application of acoustic technology, spatial acoustics has had minimal uptake owing to the expense, impracticality and inaccessibility of equipment. In my third chapter, I introduce MAARU (Multichannel Acoustic Autonomous Recording Unit), a low-cost, easy-to-use and accessible solution to this problem. I explain the software and hardware necessary for spatial recording and show how MAARU can be used to localise the direction of a sound to within ±10˚ accurately. In the fourth chapter, I explore how MAARU devices deployed in the field can be used for enhanced ecosystem monitoring by spatially clustering individuals by calling directions for more accurate abundance approximations and crude species-specific habitat usage monitoring. Most literature on spatial acoustics cites the need for many accurately synced recording devices over an area. This chapter provides the first evidence of advances made with just one recorder. Finally, I conclude this thesis by restating my aims and discussing my success in achieving them. Specifically, in the thesis’ conclusion, I reiterate the contributions made to the field as a direct result of this work and outline some possible development avenues.Open Acces

    Egocentric vision-based passive dietary intake monitoring

    Get PDF
    Egocentric (first-person) perception captures and reveals how people perceive their surroundings. This unique perceptual view enables passive and objective monitoring of human-centric activities and behaviours. In capturing egocentric visual data, wearable cameras are used. Recent advances in wearable technologies have enabled wearable cameras to be lightweight, accurate, and with long battery life, making long-term passive monitoring a promising solution for healthcare and human behaviour understanding. In addition, recent progress in deep learning has provided an opportunity to accelerate the development of passive methods to enable pervasive and accurate monitoring, as well as comprehensive modelling of human-centric behaviours. This thesis investigates and proposes innovative egocentric technologies for passive dietary intake monitoring and human behaviour analysis. Compared to conventional dietary assessment methods in nutritional epidemiology, such as 24-hour dietary recall (24HR) and food frequency questionnaires (FFQs), which heavily rely on subjects’ memory to recall the dietary intake, and trained dietitians to collect, interpret, and analyse the dietary data, passive dietary intake monitoring can ease such burden and provide more accurate and objective assessment of dietary intake. Egocentric vision-based passive monitoring uses wearable cameras to continuously record human-centric activities with a close-up view. This passive way of monitoring does not require active participation from the subject, and records rich spatiotemporal details for fine-grained analysis. Based on egocentric vision and passive dietary intake monitoring, this thesis proposes: 1) a novel network structure called PAR-Net to achieve accurate food recognition by mining discriminative food regions. PAR-Net has been evaluated with food intake images captured by wearable cameras as well as those non-egocentric food images to validate its effectiveness for food recognition; 2) a deep learning-based solution for recognising consumed food items as well as counting the number of bites taken by the subjects from egocentric videos in an end-to-end manner; 3) in light of privacy concerns in egocentric data, this thesis also proposes a privacy-preserved solution for passive dietary intake monitoring, which uses image captioning techniques to summarise the image content and subsequently combines image captioning with 3D container reconstruction to report the actual food volume consumed. Furthermore, a novel framework that integrates food recognition, hand tracking and face recognition has also been developed to tackle the challenge of assessing individual dietary intake in food sharing scenarios with the use of a panoramic camera. Extensive experiments have been conducted. Tested with both laboratory (captured in London) and field study data (captured in Africa), the above proposed solutions have proven the feasibility and accuracy of using the egocentric camera technologies with deep learning methods for individual dietary assessment and human behaviour analysis.Open Acces
    • …
    corecore