58 research outputs found
Deep Learning for Audio Signal Processing
Given the recent surge in developments of deep learning, this article
provides a review of the state-of-the-art deep learning techniques for audio
signal processing. Speech, music, and environmental sound processing are
considered side-by-side, in order to point out similarities and differences
between the domains, highlighting general methods, problems, key references,
and potential for cross-fertilization between areas. The dominant feature
representations (in particular, log-mel spectra and raw waveform) and deep
learning models are reviewed, including convolutional neural networks, variants
of the long short-term memory architecture, as well as more audio-specific
neural network models. Subsequently, prominent deep learning application areas
are covered, i.e. audio recognition (automatic speech recognition, music
information retrieval, environmental sound detection, localization and
tracking) and synthesis and transformation (source separation, audio
enhancement, generative models for speech, sound, and music synthesis).
Finally, key issues and future questions regarding deep learning applied to
audio signal processing are identified.
Comment: 15 pages, 2 pdf figure
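The log-mel representation named in this abstract can be illustrated with a minimal NumPy sketch. It is not taken from the article: the frame length, hop size, number of mel bands, the 2595/700 mel formula, and the function names are all assumptions chosen for illustration.

```python
import numpy as np

def hz_to_mel(f):
    # One common mel-scale formula (an assumption; other variants exist).
    return 2595.0 * np.log10(1.0 + np.asarray(f, dtype=float) / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (np.asarray(m, dtype=float) / 2595.0) - 1.0)

def mel_filterbank(n_mels, n_fft, sample_rate):
    """Triangular filters whose edges are equally spaced on the mel scale."""
    n_bins = n_fft // 2 + 1
    bin_freqs = np.linspace(0.0, sample_rate / 2.0, n_bins)
    edges_hz = mel_to_hz(np.linspace(0.0, hz_to_mel(sample_rate / 2.0), n_mels + 2))
    bank = np.zeros((n_mels, n_bins))
    for i in range(n_mels):
        lo, centre, hi = edges_hz[i], edges_hz[i + 1], edges_hz[i + 2]
        rising = (bin_freqs - lo) / (centre - lo)
        falling = (hi - bin_freqs) / (hi - centre)
        bank[i] = np.maximum(0.0, np.minimum(rising, falling))
    return bank

def log_mel_spectrogram(signal, sample_rate, n_fft=512, hop=256, n_mels=40, eps=1e-10):
    """Frame the signal, take the power spectrum, pool into mel bands, take the log."""
    window = np.hanning(n_fft)
    n_frames = 1 + (len(signal) - n_fft) // hop
    frames = np.stack([signal[i * hop : i * hop + n_fft] * window
                       for i in range(n_frames)])
    power = np.abs(np.fft.rfft(frames, axis=1)) ** 2        # (n_frames, n_fft // 2 + 1)
    mel_power = power @ mel_filterbank(n_mels, n_fft, sample_rate).T
    return np.log(mel_power + eps)                          # (n_frames, n_mels)

# Usage: one second of a 440 Hz tone sampled at 16 kHz.
sample_rate = 16000
t = np.arange(sample_rate) / sample_rate
tone = np.sin(2 * np.pi * 440 * t)
features = log_mel_spectrogram(tone, sample_rate)
print(features.shape)
```

The resulting array has one row per frame and one column per mel band, which is the kind of two-dimensional input that the convolutional and recurrent models mentioned in the abstract typically consume.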
Final Research Report on Auto-Tagging of Music
The deliverable D4.7 concerns the work achieved by IRCAM until M36 for the “auto-tagging of music”. The deliverable is a research report. The software libraries resulting from the research have been integrated into Fincons/HearDis! Music Library Manager or are used by TU Berlin. The final software libraries are described in D4.5.
The research work on auto-tagging has concentrated on four aspects:
1) Further improving IRCAM’s machine-learning system ircamclass. This has been done by developing the new MASSS audio features and by integrating audio augmentation and audio segmentation into ircamclass. The system has then been applied to train HearDis! “soft” features (Vocals-1, Vocals-2, Pop-Appeal, Intensity, Instrumentation, Timbre, Genre, Style). This is described in Part 3.
2) Developing two sets of “hard” features (i.e. related to musical or musicological concepts) as specified by HearDis! (for integration into Fincons/HearDis! Music Library Manager) and TU Berlin (as input for the prediction model of the GMBI attributes). Such features are derived either from previously estimated higher-level concepts (such as structure, key or succession of chords) or from new signal processing algorithms (such as HPSS) or main melody estimation. This is described in Part 4.
3) Developing audio features to characterize the audio quality of a music track. The goal is to describe the quality of the audio independently of its apparent encoding. This is then used to estimate audio degradation or music decade. This is to be used to ensure that playlists contain tracks with similar audio quality. This is described in Part 5.
4) Developing innovative algorithms to extract specific audio features to improve music mixes. So far, innovative techniques (based on various Blind Audio Source Separation algorithms and Convolutional Neural Networks) have been developed for singing voice separation, singing voice segmentation, music structure boundaries estimation, and DJ cue-region estimation. This is described in Part 6.
EC/H2020/688122/EU/Artist-to-Business-to-Business-to-Consumer Audio Branding System/ABC D
Relative Positional Encoding for Transformers with Linear Complexity
Recent advances in Transformer models allow for unprecedented sequence
lengths, due to linear space and time complexity. In the meantime, relative
positional encoding (RPE) was proposed as beneficial for classical Transformers
and consists in exploiting lags instead of absolute positions for inference.
Still, RPE is not available for the recent linear-variants of the Transformer,
because it requires the explicit computation of the attention matrix, which is
precisely what is avoided by such methods. In this paper, we bridge this gap
and present Stochastic Positional Encoding as a way to generate PE that can be
used as a replacement to the classical additive (sinusoidal) PE and provably
behaves like RPE. The main theoretical contribution is to make a connection
between positional encoding and cross-covariance structures of correlated
Gaussian processes. We illustrate the performance of our approach on the
Long-Range Arena benchmark and on music generation.
Comment: ICML 2021 (long talk) camera-ready. 24 page
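To make "exploiting lags instead of absolute positions" concrete, the sketch below builds the classical sinusoidal PE mentioned in the abstract and checks that the raw inner product of two encodings depends only on the lag between their positions. This is only an illustration of the lag-only property; it is not the paper's Stochastic Positional Encoding, and the function name, dimension, and base value are assumptions.

```python
import numpy as np

def sinusoidal_pe(n_positions, dim, base=10000.0):
    """Classical sinusoidal positional encoding with interleaved sin/cos pairs."""
    assert dim % 2 == 0, "dim must be even so that sin/cos come in pairs"
    positions = np.arange(n_positions)[:, None]        # (n_positions, 1)
    freqs = base ** (-np.arange(0, dim, 2) / dim)      # (dim // 2,)
    angles = positions * freqs                         # (n_positions, dim // 2)
    pe = np.empty((n_positions, dim))
    pe[:, 0::2] = np.sin(angles)
    pe[:, 1::2] = np.cos(angles)
    return pe

pe = sinusoidal_pe(n_positions=64, dim=32)

# gram[m, n] = sum_k cos(w_k * (m - n)), because
# sin(a) sin(b) + cos(a) cos(b) = cos(a - b) for each frequency w_k.
gram = pe @ pe.T

# Positions (5, 2) and (40, 37) share the lag 3, so their inner products agree.
print(np.isclose(gram[5, 2], gram[40, 37]))
```

In a real attention layer the learned query and key projections sit between the encodings and the dot product, so this exact structure is not preserved there; the sketch only shows what a lag-dependent kernel looks like.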
Short-term motion prediction of autonomous vehicles in complex environments: A Deep Learning approach
Complex environments manifest a high level of complexity and it is of critical importance that the safety systems embedded within autonomous vehicles (AVs) are able to accurately anticipate the short-term future motion of agents in close proximity. This problem can be further understood as generating a sequence of coordinates describing the plausible future motion of the tracked agent. A number of recently proposed techniques that present satisfactory performance exploit the learning capabilities of novel deep learning (DL) architectures to tackle the discussed task. Nonetheless, there still exists a vast number of challenging issues that must be resolved to further advance the capabilities of motion prediction models. This thesis explores novel deep learning techniques within the area of short-term motion prediction of on-road participants, specifically other vehicles from the point of view of autonomous vehicles. First and foremost, various approaches in the literature demonstrate significant benefits of using a rasterised top-down image of the road to encode the context of the tracked vehicle’s surroundings, which generally encapsulates a large, global portion of the environment. This work, on the other hand, explores the use of local regions of the rasterised map to focus more explicitly on encoding the tracked vehicle’s state. The proposed technique demonstrates plausible results against several baseline models and, in addition, outperforms the same model when it instead uses global maps. Next, the typical method for extracting features from rasterised maps involves employing one of the popular vision models (e.g. ResNet-50) that has previously been pre-trained on a distinct task such as image classification. Recently, however, it has been demonstrated that this approach can be sub-optimal for tasks that strongly rely on precise localisation of features, and that it can be more advantageous to train the model from scratch directly on the task at hand.
In contrast, the subsequent part of this thesis investigates an alternative method for processing and encoding spatial data, based on capsule networks, in order to eradicate several issues that standard vision models exhibit. Through several experiments it is established that the novel capsule-based motion predictor, trained from scratch, is able to achieve competitive results against numerous popular vision models. Finally, the proposed model is further extended with a generative framework to account for the fact that the space of possible movements of the tracked vehicle is not strictly limited to a single trajectory. More specifically, to account for the multi-modality of the problem, a conditional variational auto-encoder (CVAE) is employed, which makes it possible to sample an arbitrary number of diverse trajectories. The final model is examined against methods from the literature on a publicly available dataset and, as presented, it significantly outperforms other models whilst drastically reducing the number of trainable parameters
Review: Deep learning in electron microscopy
Deep learning is transforming most areas of science and technology, including electron microscopy. This review paper offers a practical perspective aimed at developers with limited familiarity. For context, we review popular applications of deep learning in electron microscopy. Following this, we discuss the hardware and software needed to get started with deep learning and to interface with electron microscopes. We then review neural network components, popular architectures, and their optimization. Finally, we discuss future directions of deep learning in electron microscopy
Neural Networks for Analysing Music and Environmental Audio
PhD
In this thesis, we consider the analysis of music and environmental audio
recordings with neural networks. Recently, neural networks have been
shown to be an effective family of models for speech recognition, computer
vision, natural language processing and a number of other statistical modelling
problems. The composite layer-wise structure of neural networks
allows for flexible model design, where prior knowledge about the domain
of application can be used to inform the design and architecture of the
neural network models. Additionally, it has been shown that when trained
on sufficient quantities of data, neural networks can be directly applied to
low-level features to learn mappings to high level concepts like phonemes
in speech and object classes in computer vision. In this thesis we investigate
whether neural network models can be usefully applied to processing
music and environmental audio.
With regards to music signal analysis, we investigate 2 different problems.
The first problem, automatic music transcription, aims to identify the
score or the sequence of musical notes that comprise an audio recording.
We also consider the problem of automatic chord transcription, where the
aim is to identify the sequence of chords in a given audio recording. For
both problems, we design neural network acoustic models which are applied
to low-level time-frequency features in order to detect the presence of
notes or chords. Our results demonstrate that the neural network acoustic
models perform similarly to state-of-the-art acoustic models, without the
need for any feature engineering. The networks are able to learn complex
transformations from time-frequency features to the desired outputs, given
sufficient amounts of training data. Additionally, we use recurrent neural
networks to model the temporal structure of sequences of notes or chords,
similar to language modelling in speech. Our results demonstrate that
the combination of the acoustic and language model predictions yields
improved performance over the acoustic models alone. We also observe
that convolutional neural networks yield better performance compared to
other neural network architectures for acoustic modelling.
For the analysis of environmental audio recordings, we consider the problem
of acoustic event detection. Acoustic event detection has a similar
structure to automatic music and chord transcription, where the system
is required to output the correct sequence of semantic labels along with
onset and offset times. We compare the performance of neural network
architectures against Gaussian mixture models and support vector machines.
In order to account for the fact that such systems are typically
deployed on embedded devices, we compare performance as a function of
the computational cost of each model. We evaluate the models on 2 large
datasets of real-world recordings of baby cries and smoke alarms. Our results
demonstrate that the neural networks clearly outperform the other
models and they are able to do so without incurring a heavy computation
cost
Audio speech enhancement using masks derived from visual speech
The aim of the work in this thesis is to explore how visual speech can be used within monaural masking based speech enhancement to remove interfering noise, with a focus on improving intelligibility. Visual speech has the advantage of not being corrupted by interfering noise and can therefore provide additional information within a speech enhancement framework. More specifically, this work considers audio-only, visual-only and audio-visual methods of mask estimation within deep learning architectures with application to both seen and unseen noise types.
To estimate masks from audio and visual speech information, models are developed using deep neural networks, specifically feed-forward (DNN) and recurrent (RNN) neural networks for temporal modelling and convolutional neural networks (CNN) for visual feature extraction. It was found that the proposed layer normalised bi-directional feed-forward hybrid network using gated recurrent units (LNBiGRUDNN) provided the best performance across all objective measures for temporal modelling. Also, extracting visual features using both pre-trained and end-to-end trained CNNs outperforms traditional active appearance model (AAM) feature extraction across all noise types and SNRs tested. End-to-end CNNs trained on images focused on mouth-only regions-of-interest provided the best performance for both audio-visual and visual-only models.
The best performing audio-visual masking method outperformed both audio-only and visual-only masking methods in both matched and unseen noise type and SNR dependent conditions. For example, in unseen cafeteria babble noise at -10 dB, audio-visual masking had an ESTOI of 46.8, while audio-only and visual-only masking scored 15.0 and 42.4, and the unprocessed audio scored 9.3. Formal tests show that visual information is critical for improving intelligibility at low SNRs and for generalisation to unseen noise conditions. Experiments in large unconstrained vocabulary speech confirm that the model architectures and approaches developed can generalise to unconstrained speech across noise independent conditions and can be considered for monaural speaker dependent real-world applications
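Masking-based enhancement, as discussed in this abstract, multiplies a time-frequency representation of the noisy mixture by a mask with values between 0 and 1. The sketch below shows the idea with an oracle ratio mask computed from separately known speech and noise, which is one common mask definition and is assumed here purely for illustration. It is not the thesis's method, where the mask is estimated by a network from audio and visual features, and the sine-wave "speech", the noise level, and the function names are hypothetical.

```python
import numpy as np

def magnitude_frames(signal, n_fft=256, hop=128):
    """Magnitude spectra of Hann-windowed frames, shape (n_frames, n_fft // 2 + 1)."""
    window = np.hanning(n_fft)
    n_frames = 1 + (len(signal) - n_fft) // hop
    frames = np.stack([signal[i * hop : i * hop + n_fft] * window
                       for i in range(n_frames)])
    return np.abs(np.fft.rfft(frames, axis=1))

def ideal_ratio_mask(speech_mag, noise_mag, eps=1e-10):
    """Oracle mask: fraction of each time-frequency cell's power that is speech."""
    return speech_mag ** 2 / (speech_mag ** 2 + noise_mag ** 2 + eps)

rng = np.random.default_rng(0)
sample_rate = 8000
t = np.arange(sample_rate) / sample_rate
speech = np.sin(2 * np.pi * 300 * t)            # hypothetical stand-in for clean speech
noise = 0.5 * rng.standard_normal(sample_rate)  # hypothetical interfering noise
mixture = speech + noise

speech_mag = magnitude_frames(speech)
mixture_mag = magnitude_frames(mixture)
mask = ideal_ratio_mask(speech_mag, magnitude_frames(noise))
enhanced_mag = mask * mixture_mag               # masked mixture magnitude

# Mean squared error to the clean magnitude, before and after masking.
error_before = np.mean((mixture_mag - speech_mag) ** 2)
error_after = np.mean((enhanced_mag - speech_mag) ** 2)
print(error_after < error_before)
```

In practice the clean speech and noise are not available separately, which is why the mask has to be predicted; intelligibility measures such as ESTOI are then computed on the resynthesised waveform, a step this sketch leaves out.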