213 research outputs found
Automatic Piano Transcription with Hierarchical Frequency-Time Transformer
Taking long-term spectral and temporal dependencies into account is essential
for automatic piano transcription. This is especially helpful when determining
the precise onset and offset for each note in the polyphonic piano content. In
this case, we may rely on the capability of self-attention mechanism in
Transformers to capture these long-term dependencies in the frequency and time
axes. In this work, we propose hFT-Transformer, which is an automatic music
transcription method that uses a two-level hierarchical frequency-time
Transformer architecture. The first hierarchy includes a convolutional block in
the time axis, a Transformer encoder in the frequency axis, and a Transformer
decoder that converts the dimension in the frequency axis. The output is then
fed into the second hierarchy which consists of another Transformer encoder in
the time axis. We evaluated our method with the widely used MAPS and MAESTRO
v3.0.0 datasets, and it demonstrated state-of-the-art performance on all the
F1-scores of the metrics among Frame, Note, Note with Offset, and Note with
Offset and Velocity estimations.Comment: 8 pages, 6 figures, to be published in ISMIR202
Audio self-supervised learning: a survey
Inspired by the humans' cognitive ability to generalise knowledge and skills,
Self-Supervised Learning (SSL) targets at discovering general representations
from large-scale data without requiring human annotations, which is an
expensive and time consuming task. Its success in the fields of computer vision
and natural language processing have prompted its recent adoption into the
field of audio and speech processing. Comprehensive reviews summarising the
knowledge in audio SSL are currently missing. To fill this gap, in the present
work, we provide an overview of the SSL methods used for audio and speech
processing applications. Herein, we also summarise the empirical works that
exploit the audio modality in multi-modal SSL frameworks, and the existing
suitable benchmarks to evaluate the power of SSL in the computer audition
domain. Finally, we discuss some open problems and point out the future
directions on the development of audio SSL
Applied Deep Learning: Case Studies in Computer Vision and Natural Language Processing
Deep learning has proved to be successful for many computer vision and natural language processing applications. In this dissertation, three studies have been conducted to show the efficacy of deep learning models for computer vision and natural language processing. In the first study, an efficient deep learning model was proposed for seagrass scar detection in multispectral images which produced robust, accurate scars mappings. In the second study, an arithmetic deep learning model was developed to fuse multi-spectral images collected at different times with different resolutions to generate high-resolution images for downstream tasks including change detection, object detection, and land cover classification. In addition, a super-resolution deep model was implemented to further enhance remote sensing images. In the third study, a deep learning-based framework was proposed for fact-checking on social media to spot fake scientific news. The framework leveraged deep learning, information retrieval, and natural language processing techniques to retrieve pertinent scholarly papers for given scientific news and evaluate the credibility of the news
AI-based soundscape analysis: Jointly identifying sound sources and predicting annoyancea)
Soundscape studies typically attempt to capture the perception and understanding of sonic environments by surveying users. However, for long-term monitoring or assessing interventions, sound-signal-based approaches are required. To this end, most previous research focused on psycho-acoustic quantities or automatic sound recognition. Few attempts were made to include appraisal (e.g., in circumplex frameworks). This paper proposes an artificial intelligence (AI)-based dual-branch convolutional neural network with cross-attention-based fusion (DCNN-CaF) to analyze automatic soundscape characterization, including sound recognition and appraisal. Using the DeLTA dataset containing human-annotated sound source labels and perceived annoyance, the DCNN-CaF is proposed to perform sound source classification (SSC) and human-perceived annoyance rating prediction (ARP). Experimental findings indicate that (1) the proposed DCNN-CaF using loudness and Mel features outperforms the DCNN-CaF using only one of them. (2) The proposed DCNN-CaF with cross-attention fusion outperforms other typical AI-based models and soundscape-related traditional machine learning methods on the SSC and ARP tasks. (3) Correlation analysis reveals that the relationship between sound sources and annoyance is similar for humans and the proposed AI-based DCNN-CaF model. (4) Generalization tests show that the proposed model's ARP in the presence of model-unknown sound sources is consistent with expert expectations and can explain previous findings from the literature on soundscape augmentation
AI-based soundscape analysis: Jointly identifying sound sources and predicting annoyance
Soundscape studies typically attempt to capture the perception and
understanding of sonic environments by surveying users. However, for long-term
monitoring or assessing interventions, sound-signal-based approaches are
required. To this end, most previous research focused on psycho-acoustic
quantities or automatic sound recognition. Few attempts were made to include
appraisal (e.g., in circumplex frameworks). This paper proposes an artificial
intelligence (AI)-based dual-branch convolutional neural network with
cross-attention-based fusion (DCNN-CaF) to analyze automatic soundscape
characterization, including sound recognition and appraisal. Using the DeLTA
dataset containing human-annotated sound source labels and perceived annoyance,
the DCNN-CaF is proposed to perform sound source classification (SSC) and
human-perceived annoyance rating prediction (ARP). Experimental findings
indicate that (1) the proposed DCNN-CaF using loudness and Mel features
outperforms the DCNN-CaF using only one of them. (2) The proposed DCNN-CaF with
cross-attention fusion outperforms other typical AI-based models and
soundscape-related traditional machine learning methods on the SSC and ARP
tasks. (3) Correlation analysis reveals that the relationship between sound
sources and annoyance is similar for humans and the proposed AI-based DCNN-CaF
model. (4) Generalization tests show that the proposed model's ARP in the
presence of model-unknown sound sources is consistent with expert expectations
and can explain previous findings from the literature on sound-scape
augmentation.Comment: The Journal of the Acoustical Society of America, 154 (5), 314
Sound Event Localization, Detection, and Tracking by Deep Neural Networks
In this thesis, we present novel sound representations and classification methods for the task of sound event localization, detection, and tracking (SELDT). The human auditory system has evolved to localize multiple sound events, recognize and further track their motion individually in an acoustic environment. This ability of humans makes them context-aware and enables them to interact with their surroundings naturally. Developing similar methods for machines will provide an automatic description of social and human activities around them and enable machines to be context-aware similar to humans. Such methods can be employed to assist the hearing impaired to visualize sounds, for robot navigation, and to monitor biodiversity, the home, and cities.
A real-life acoustic scene is complex in nature, with multiple sound events that are temporally and spatially overlapping, including stationary and moving events with varying angular velocities. Additionally, each individual sound event class, for example, a car horn can have a lot of variabilities, i.e., different cars have different horns, and within the same model of the car, the duration and the temporal structure of the horn sound is driver dependent. Performing SELDT in such overlapping and dynamic sound scenes while being robust is challenging for machines. Hence we propose to investigate the SELDT task in this thesis and use a data-driven approach using deep neural networks (DNNs).
The sound event detection (SED) task requires the detection of onset and offset time for individual sound events and their corresponding labels. In this regard, we propose to use spatial and perceptual features extracted from multichannel audio for SED using two different DNNs, recurrent neural networks (RNNs) and convolutional recurrent neural networks (CRNNs). We show that using multichannel audio features improves the SED performance for overlapping sound events in comparison to traditional single-channel audio features. The proposed novel features and methods produced state-of-the-art performance for the real-life SED task and won the IEEE AASP DCASE challenge consecutively in 2016 and 2017.
Sound event localization is the task of spatially locating the position of individual sound events. Traditionally, this has been approached using parametric methods. In this thesis, we propose a CRNN for detecting the azimuth and elevation angles of multiple temporally overlapping sound events. This is the first DNN-based method performing localization in complete azimuth and elevation space. In comparison to parametric methods which require the information of the number of active sources, the proposed method learns this information directly from the input data and estimates their respective spatial locations. Further, the proposed CRNN is shown to be more robust than parametric methods in reverberant scenarios.
Finally, the detection and localization tasks are performed jointly using a CRNN. This method additionally tracks the spatial location with time, thus producing the SELDT results. This is the first DNN-based SELDT method and is shown to perform equally with stand-alone baselines for SED, localization, and tracking. The proposed SELDT method is evaluated on nine datasets that represent anechoic and reverberant sound scenes, stationary and moving sources with varying velocities, a different number of overlapping sound events and different microphone array formats. The results show that the SELDT method can track multiple overlapping sound events that are both spatially stationary and moving
What You Hear Is What You See: Audio Quality Metrics From Image Quality Metrics
In this study, we investigate the feasibility of utilizing state-of-the-art
image perceptual metrics for evaluating audio signals by representing them as
spectrograms. The encouraging outcome of the proposed approach is based on the
similarity between the neural mechanisms in the auditory and visual pathways.
Furthermore, we customise one of the metrics which has a psychoacoustically
plausible architecture to account for the peculiarities of sound signals. We
evaluate the effectiveness of our proposed metric and several baseline metrics
using a music dataset, with promising results in terms of the correlation
between the metrics and the perceived quality of audio as rated by human
evaluators
- …