58 research outputs found

    Deep Learning for Audio Signal Processing

    Given the recent surge in developments of deep learning, this article provides a review of the state-of-the-art deep learning techniques for audio signal processing. Speech, music, and environmental sound processing are considered side-by-side, in order to point out similarities and differences between the domains, highlighting general methods, problems, key references, and potential for cross-fertilization between areas. The dominant feature representations (in particular, log-mel spectra and raw waveform) and deep learning models are reviewed, including convolutional neural networks, variants of the long short-term memory architecture, as well as more audio-specific neural network models. Subsequently, prominent deep learning application areas are covered, i.e. audio recognition (automatic speech recognition, music information retrieval, environmental sound detection, localization and tracking) and synthesis and transformation (source separation, audio enhancement, generative models for speech, sound, and music synthesis). Finally, key issues and future questions regarding deep learning applied to audio signal processing are identified.
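    As a concrete illustration of the log-mel front end that the review identifies as a dominant feature representation, the following minimal sketch computes a log-mel spectrogram with librosa. The waveform, sample rate, and filter-bank parameters below are illustrative placeholders, not values taken from the article.

```python
# Minimal sketch of a log-mel spectrogram front end; parameters are illustrative.
import numpy as np
import librosa

sr = 16000                                   # sample rate in Hz (assumed)
t = np.linspace(0, 1.0, sr, endpoint=False)  # 1 second of audio
y = 0.5 * np.sin(2 * np.pi * 440.0 * t)      # stand-in waveform (440 Hz tone)

# Mel-scaled power spectrogram followed by log (dB) compression.
mel = librosa.feature.melspectrogram(
    y=y, sr=sr, n_fft=1024, hop_length=256, n_mels=64
)
log_mel = librosa.power_to_db(mel, ref=np.max)  # shape: (n_mels, n_frames)
print(log_mel.shape)
```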

    Final Research Report on Auto-Tagging of Music

    The deliverable D4.7 concerns the work achieved by IRCAM up to M36 on the “auto-tagging of music”. The deliverable is a research report. The software libraries resulting from the research have been integrated into the Fincons/HearDis! Music Library Manager or are used by TU Berlin. The final software libraries are described in D4.5. The research work on auto-tagging has concentrated on four aspects: 1) Further improving IRCAM’s machine-learning system ircamclass. This has been done by developing the new MASSS audio features and by incorporating audio augmentation and audio segmentation into ircamclass. The system has then been applied to train HearDis! “soft” features (Vocals-1, Vocals-2, Pop-Appeal, Intensity, Instrumentation, Timbre, Genre, Style). This is described in Part 3. 2) Developing two sets of “hard” features (i.e. features related to musical or musicological concepts) as specified by HearDis! (for integration into the Fincons/HearDis! Music Library Manager) and TU Berlin (as input for the prediction model of the GMBI attributes). Such features are either derived from previously estimated higher-level concepts (such as structure, key or succession of chords) or obtained by developing new signal processing algorithms (such as HPSS or main melody estimation). This is described in Part 4. 3) Developing audio features to characterize the audio quality of a music track. The goal is to describe the quality of the audio independently of its apparent encoding. This is then used to estimate audio degradation or music decade, so as to ensure that playlists contain tracks with similar audio quality. This is described in Part 5. 4) Developing innovative algorithms to extract specific audio features to improve music mixes. So far, innovative techniques (based on various Blind Audio Source Separation algorithms and Convolutional Neural Networks) have been developed for singing voice separation, singing voice segmentation, music structure boundary estimation, and DJ cue-region estimation. This is described in Part 6. (Grant: EC/H2020/688122/EU/Artist-to-Business-to-Business-to-Consumer Audio Branding System/ABC D)
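    The report names HPSS among the newly developed signal-processing components. The sketch below shows generic harmonic-percussive source separation using librosa's median-filtering implementation; it stands in for, and is not, IRCAM's own ircamclass/MASSS code, and the synthetic test signal is purely illustrative.

```python
# Generic HPSS sketch (librosa), standing in for the component named in the report.
import numpy as np
import librosa

sr = 22050
t = np.linspace(0, 2.0, 2 * sr, endpoint=False)
y = np.sin(2 * np.pi * 220.0 * t)      # sustained "harmonic" content
y[::sr // 4] += 0.8                    # crude clicks acting as "percussive" content

# Median-filtering HPSS on the STFT, returning two time-domain signals.
harmonic, percussive = librosa.effects.hpss(y)
print(harmonic.shape, percussive.shape)
```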

    Relative Positional Encoding for Transformers with Linear Complexity

    Recent advances in Transformer models allow for unprecedented sequence lengths, due to linear space and time complexity. In the meantime, relative positional encoding (RPE) was proposed as beneficial for classical Transformers and consists in exploiting lags instead of absolute positions for inference. Still, RPE is not available for the recent linear variants of the Transformer, because it requires the explicit computation of the attention matrix, which is precisely what is avoided by such methods. In this paper, we bridge this gap and present Stochastic Positional Encoding as a way to generate PE that can be used as a replacement for the classical additive (sinusoidal) PE and provably behaves like RPE. The main theoretical contribution is to make a connection between positional encoding and cross-covariance structures of correlated Gaussian processes. We illustrate the performance of our approach on the Long-Range Arena benchmark and on music generation.
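    For reference, the classical additive sinusoidal PE that the paper proposes to replace can be sketched as follows; the Stochastic Positional Encoding construction itself (via cross-covariances of correlated Gaussian processes) is not reproduced here, and the sequence length and model dimension are arbitrary.

```python
# Classical additive sinusoidal positional encoding (the baseline named in the abstract).
import numpy as np

def sinusoidal_pe(seq_len: int, d_model: int) -> np.ndarray:
    """Return the (seq_len, d_model) table of sin/cos positional encodings."""
    positions = np.arange(seq_len)[:, None]                  # (seq_len, 1)
    dims = np.arange(0, d_model, 2)[None, :]                 # even dimensions
    angles = positions / np.power(10000.0, dims / d_model)   # (seq_len, d_model // 2)
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles)
    pe[:, 1::2] = np.cos(angles)
    return pe

x = np.random.randn(128, 64)      # toy token embeddings: (seq_len, d_model)
x = x + sinusoidal_pe(128, 64)    # "additive" PE: simply added to the embeddings
```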

    Short-term motion prediction of autonomous vehicles in complex environments: A Deep Learning approach

    It is of critical importance that the safety systems embedded within autonomous vehicles (AVs) operating in complex environments are able to accurately anticipate the short-term future motion of agents in close proximity. This problem can be understood as generating a sequence of coordinates describing the plausible future motion of the tracked agent. A number of recently proposed techniques with satisfactory performance exploit the learning capabilities of novel deep learning (DL) architectures to tackle this task. Nonetheless, there still exists a vast number of challenging issues that must be resolved to further advance the capabilities of motion prediction models. This thesis explores novel deep learning techniques within the area of short-term motion prediction of on-road participants, specifically of other vehicles from the point of view of an autonomous vehicle. First and foremost, various approaches in the literature demonstrate significant benefits of using a rasterised top-down image of the road to encode the context of the tracked vehicle’s surroundings, which generally encapsulates a large, global portion of the environment. This work, on the other hand, explores the use of local regions of the rasterised map to focus the encoding more explicitly on the tracked vehicle’s state. The proposed technique demonstrates plausible results against several baseline models and, in addition, outperforms the same model using global maps. Next, the typical method for extracting features from rasterised maps involves employing one of the popular vision models (e.g. ResNet-50) that has previously been pre-trained on a distinct task such as image classification. Recently, however, it has been demonstrated that this approach can be sub-optimal for tasks that strongly rely on precise localisation of features, and that it can be more advantageous to train the model from scratch directly on the task at hand. In contrast, the subsequent part of this thesis investigates an alternative method for processing and encoding spatial data based on capsule networks, in order to address several issues that standard vision models exhibit. Through several experiments it is established that the novel capsule-based motion predictor, trained from scratch, achieves competitive results against numerous popular vision models. Finally, the proposed model is further extended with a generative framework to account for the fact that the space of possible movements of the tracked vehicle is not limited to a single trajectory. More specifically, to account for the multi-modality of the problem, a conditional variational auto-encoder (CVAE) is employed, which enables sampling an arbitrary number of diverse trajectories. The final model is examined against methods from the literature on a publicly available dataset and, as presented, it significantly outperforms other models whilst drastically reducing the number of trainable parameters.
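    The multi-modal sampling idea can be illustrated with a hypothetical CVAE-style decoder: a context encoding is concatenated with a latent sample and mapped to one candidate trajectory, so drawing several latents yields several plausible futures. Layer sizes, dimensions, and the prediction horizon below are placeholders, not those of the thesis model.

```python
# Hypothetical CVAE-style trajectory decoder: one latent sample -> one future trajectory.
import torch
import torch.nn as nn

class TrajectoryDecoder(nn.Module):
    def __init__(self, ctx_dim=64, latent_dim=16, horizon=12):
        super().__init__()
        self.horizon = horizon
        self.net = nn.Sequential(
            nn.Linear(ctx_dim + latent_dim, 128),
            nn.ReLU(),
            nn.Linear(128, horizon * 2),   # (x, y) coordinate per future time step
        )

    def forward(self, ctx, z):
        out = self.net(torch.cat([ctx, z], dim=-1))
        return out.view(-1, self.horizon, 2)

decoder = TrajectoryDecoder()
ctx = torch.randn(1, 64)                       # encoding of the tracked vehicle and local map
futures = [decoder(ctx, torch.randn(1, 16)) for _ in range(10)]  # 10 diverse samples
print(futures[0].shape)                        # torch.Size([1, 12, 2])
```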

    Review: Deep learning in electron microscopy

    Deep learning is transforming most areas of science and technology, including electron microscopy. This review paper offers a practical perspective aimed at developers with limited familiarity. For context, we review popular applications of deep learning in electron microscopy. Following this, we discuss the hardware and software needed to get started with deep learning and to interface with electron microscopes. We then review neural network components, popular architectures, and their optimization. Finally, we discuss future directions of deep learning in electron microscopy.

    Neural Networks for Analysing Music and Environmental Audio

    In this thesis, we consider the analysis of music and environmental audio recordings with neural networks. Recently, neural networks have been shown to be an effective family of models for speech recognition, computer vision, natural language processing and a number of other statistical modelling problems. The composite layer-wise structure of neural networks allows for flexible model design, where prior knowledge about the domain of application can be used to inform the design and architecture of the neural network models. Additionally, it has been shown that when trained on sufficient quantities of data, neural networks can be directly applied to low-level features to learn mappings to high-level concepts like phonemes in speech and object classes in computer vision. In this thesis we investigate whether neural network models can be usefully applied to processing music and environmental audio. With regards to music signal analysis, we investigate two different problems. The first problem, automatic music transcription, aims to identify the score or the sequence of musical notes that comprise an audio recording. We also consider the problem of automatic chord transcription, where the aim is to identify the sequence of chords in a given audio recording. For both problems, we design neural network acoustic models which are applied to low-level time-frequency features in order to detect the presence of notes or chords. Our results demonstrate that the neural network acoustic models perform similarly to state-of-the-art acoustic models, without the need for any feature engineering. The networks are able to learn complex transformations from time-frequency features to the desired outputs, given sufficient amounts of training data. Additionally, we use recurrent neural networks to model the temporal structure of sequences of notes or chords, similar to language modelling in speech. Our results demonstrate that the combination of the acoustic and language model predictions yields improved performance over the acoustic models alone. We also observe that convolutional neural networks yield better performance compared to other neural network architectures for acoustic modelling. For the analysis of environmental audio recordings, we consider the problem of acoustic event detection. Acoustic event detection has a similar structure to automatic music and chord transcription, where the system is required to output the correct sequence of semantic labels along with onset and offset times. We compare the performance of neural network architectures against Gaussian mixture models and support vector machines. In order to account for the fact that such systems are typically deployed on embedded devices, we compare performance as a function of the computational cost of each model. We evaluate the models on two large datasets of real-world recordings of baby cries and smoke alarms. Our results demonstrate that the neural networks clearly outperform the other models, and they are able to do so without incurring a heavy computational cost.
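    A minimal sketch, not the thesis architecture, of the kind of convolutional acoustic model described above: a small CNN maps a time-frequency input to framewise, multi-label note activations through a sigmoid output. Layer sizes, the number of frequency bins, and the 88-note output are illustrative assumptions.

```python
# Toy framewise CNN acoustic model for note detection over a spectrogram.
import torch
import torch.nn as nn

class NoteAcousticModel(nn.Module):
    def __init__(self, n_bins=229, n_notes=88):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(1, 16, kernel_size=3, padding=1), nn.ReLU(),
            nn.Conv2d(16, 16, kernel_size=3, padding=1), nn.ReLU(),
        )
        self.fc = nn.Linear(16 * n_bins, n_notes)

    def forward(self, spec):                   # spec: (batch, 1, frames, n_bins)
        h = self.conv(spec)                    # (batch, 16, frames, n_bins)
        h = h.permute(0, 2, 1, 3).flatten(2)   # (batch, frames, 16 * n_bins)
        return torch.sigmoid(self.fc(h))       # per-frame note probabilities

model = NoteAcousticModel()
spec = torch.randn(1, 1, 100, 229)             # 100 frames of a toy spectrogram
print(model(spec).shape)                       # torch.Size([1, 100, 88])
```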

    Audio speech enhancement using masks derived from visual speech

    The aim of the work in this thesis is to explore how visual speech can be used within monaural masking-based speech enhancement to remove interfering noise, with a focus on improving intelligibility. Visual speech has the advantage of not being corrupted by interfering noise and can therefore provide additional information within a speech enhancement framework. More specifically, this work considers audio-only, visual-only and audio-visual methods of mask estimation within deep learning architectures, with application to both seen and unseen noise types. To estimate masks from audio and visual speech information, models are developed using deep neural networks, specifically feed-forward (DNN) and recurrent (RNN) neural networks for temporal modelling and convolutional neural networks (CNN) for visual feature extraction. It was found that the proposed layer-normalised bi-directional feed-forward hybrid network using gated recurrent units (LNBiGRUDNN) provided the best performance across all objective measures for temporal modelling. Also, extracting visual features using both pre-trained and end-to-end trained CNNs outperforms traditional active appearance model (AAM) feature extraction across all noise types and SNRs tested. End-to-end CNNs trained on images focused on mouth-only regions of interest provided the best performance for both audio-visual and visual-only models. The best-performing audio-visual masking method outperformed both audio-only and visual-only masking methods in both matched and unseen noise type and SNR-dependent conditions. For example, in unseen cafeteria babble noise at -10 dB, audio-visual masking had an ESTOI of 46.8, while audio-only and visual-only masking scored 15.0 and 42.4, and the unprocessed audio scored 9.3. Formal tests show that visual information is critical for improving intelligibility at low SNRs and for generalisation to unseen noise conditions. Experiments on large unconstrained-vocabulary speech confirm that the model architectures and approaches developed can generalise to unconstrained speech across noise-independent conditions and can be considered for monaural speaker-dependent real-world applications.
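    The masking step itself can be sketched as follows, with a placeholder mask standing in for the output of the DNN/RNN estimators described in the thesis: the estimated mask scales each time-frequency bin of the noisy STFT before resynthesis. Signal, FFT size and hop length are assumptions for illustration only.

```python
# Mask-based enhancement sketch: apply a time-frequency mask to the noisy STFT.
import numpy as np
import librosa

sr = 16000
noisy = np.random.randn(sr)                  # stand-in for one second of noisy speech

stft = librosa.stft(noisy, n_fft=512, hop_length=128)
mask = np.random.rand(*stft.shape)           # placeholder for the estimated mask in [0, 1]
enhanced_stft = mask * stft                  # attenuate noise-dominated T-F bins
enhanced = librosa.istft(enhanced_stft, hop_length=128, length=len(noisy))
print(enhanced.shape)
```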

    Machine Learning and Signal Processing Design for Edge Acoustic Applications

