53 research outputs found
An Evaluation of Audio Feature Extraction Toolboxes
Audio feature extraction underpins a massive proportion of audio processing, music information retrieval, audio effect design and audio synthesis. Design, analysis, synthesis and evaluation often rely on audio features, but a large and diverse range of feature extraction tools is available to the community. An evaluation of existing audio feature extraction libraries was undertaken. Ten libraries and toolboxes were evaluated with the Cranfield Model for evaluation of information retrieval systems, reviewing the coverage, effort, presentation and time lag of a system. The tools are compared, and example use cases are presented to show when each toolbox is most suitable. This paper allows a software engineer or researcher to quickly and easily select a suitable audio feature extraction toolbox.
TarsosDSP, a real-time audio processing framework in Java
This paper presents TarsosDSP, a framework for real-time audio analysis and processing. Most libraries and frameworks offer either audio analysis and feature extraction or audio synthesis and processing. TarsosDSP is one of only a few frameworks that offer analysis, processing and feature extraction in real time, a unique feature in the Java ecosystem. The framework contains practical audio processing algorithms, can be extended easily, and has no external dependencies. Each algorithm is implemented as simply as possible thanks to a straightforward processing pipeline. TarsosDSP's features include a resampling algorithm, onset detectors, a number of pitch estimation algorithms, a time stretch algorithm, a pitch shifting algorithm, and an algorithm to calculate the Constant-Q transform. The framework also allows simple audio synthesis, some audio effects, and several filters. The open-source framework is a valuable contribution to the MIR community and an ideal fit for interactive MIR applications on Android.
Listening to the World Improves Speech Command Recognition
We study transfer learning in convolutional network architectures applied to
the task of recognizing audio, such as environmental sound events and speech
commands. Our key finding is that not only is it possible to transfer
representations from an unrelated task like environmental sound classification
to a voice-focused task like speech command recognition, but also that doing so
improves accuracies significantly. We also investigate the effect of increased
model capacity for transfer learning audio, by first validating known results
from the field of Computer Vision of achieving better accuracies with
increasingly deeper networks on two audio datasets: UrbanSound8k and the newly
released Google Speech Commands dataset. Then we propose a simple multiscale
input representation using dilated convolutions and show that it is able to
aggregate larger contexts and increase classification performance. Further, the
models trained using a combination of transfer learning and multiscale input
representations need only 40% of the training data to achieve similar
accuracies as a freshly trained model with 100% of the training data. Finally,
we demonstrate a positive interaction effect for the multiscale input and
transfer learning, making a case for the joint application of the two
techniques. Comment: 8 pages
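The multiscale idea in the abstract above can be illustrated with a small sketch. This is not the paper's architecture: it only assumes that the same small kernel is applied at several dilation rates so that each output sees a progressively larger context. The function names (`dilated_conv1d`, `receptive_field`, `multiscale`) and the dilation rates 1, 2 and 4 are hypothetical choices made for illustration.

```python
def dilated_conv1d(signal, kernel, dilation=1):
    """Valid-mode 1-D cross-correlation with `dilation - 1` skipped samples between taps."""
    span = (len(kernel) - 1) * dilation + 1  # input samples covered by one output
    out = []
    for start in range(len(signal) - span + 1):
        out.append(sum(w * signal[start + i * dilation]
                       for i, w in enumerate(kernel)))
    return out


def receptive_field(kernel_size, dilation):
    """Number of input samples spanned by one output of a dilated kernel."""
    return (kernel_size - 1) * dilation + 1


def multiscale(signal, kernel, dilations=(1, 2, 4)):
    """Apply the same kernel at several dilation rates (assumed rates, not the paper's)."""
    return {d: dilated_conv1d(signal, kernel, d) for d in dilations}


if __name__ == "__main__":
    ramp = [1, 2, 3, 4, 5, 6, 7, 8, 9]
    difference_kernel = [1, 0, -1]
    for d, response in multiscale(ramp, difference_kernel).items():
        print(d, receptive_field(len(difference_kernel), d), response)
```

With a three-tap kernel, a dilation of 4 already spans nine input samples without adding any weights, which is the sense in which dilated convolutions aggregate larger contexts.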
Listening to features
This work explores nonparametric methods which aim at synthesizing audio from
low-dimensional acoustic features typically used in MIR frameworks. Several
issues prevent this task from being achieved straightforwardly. Such features are
designed for analysis and not for synthesis, thus favoring high-level
description over easily inverted acoustic representation. Whereas some previous
studies already considered the problem of synthesizing audio from features such
as Mel-Frequency Cepstral Coefficients, they mainly relied on the explicit
formula used to compute those features in order to invert them. Here, we
instead adopt a simple blind approach, where arbitrary sets of features can be
used during synthesis and where reconstruction is exemplar-based. After testing
the approach on the problem of synthesizing speech from well-known features, we apply
it to the more complex task of inverting songs from the Million Song Dataset.
Two things make this task harder. First, the features are irregularly
spaced in the temporal domain according to an onset-based segmentation. Second,
the exact method used to compute these features is unknown, although the
features for new audio can be computed using their API as a black-box. In this
paper, we detail these difficulties and present a framework that nonetheless
attempts such synthesis by concatenating audio samples from a training
dataset, whose features have been computed beforehand. Samples are selected at
the segment level, in the feature space with a simple nearest neighbor search.
Additional constraints can then be defined to make the synthesis more
pertinent. Preliminary experiments are presented using RWC and GTZAN audio
datasets to synthesize tracks from the Million Song Dataset. Comment: Technical Report
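The segment-level selection described in the abstract above can be sketched as follows. This is a minimal illustration, not the authors' framework: it assumes each training segment is stored as a (feature vector, audio samples) pair, uses plain Euclidean distance for the nearest neighbor search, and omits the additional constraints the abstract mentions. The names `nearest_exemplar` and `concatenate_by_features` are hypothetical.

```python
import math


def nearest_exemplar(target, bank):
    """Index of the bank entry whose feature vector is closest to `target` (Euclidean)."""
    return min(range(len(bank)), key=lambda i: math.dist(target, bank[i][0]))


def concatenate_by_features(target_features, bank):
    """Build an output signal by concatenating the nearest training segment per target."""
    output = []
    for features in target_features:
        _, samples = bank[nearest_exemplar(features, bank)]
        output.extend(samples)
    return output


if __name__ == "__main__":
    # Toy bank: (feature vector, audio samples) pairs computed beforehand.
    bank = [
        ((0.0, 0.0), [0.1, 0.2]),
        ((1.0, 1.0), [0.9]),
        ((5.0, 5.0), [-0.5, -0.5]),
    ]
    targets = [(0.9, 1.2), (4.0, 6.0), (0.1, -0.1)]
    print(concatenate_by_features(targets, bank))
```

Because the search only compares feature vectors, it works even when the feature computation is a black box, as long as the same features can be obtained for the training audio.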
Concatenative Synthesis for Novel Timbral Creation
Modern day musicians rely on a variety of instruments for musical expression. Tones produced from electronic instruments have become almost as commonplace as those produced by traditional ones as evidenced by the plethora of artists who can be found composing and performing with nothing more than a personal computer. This desire to embrace technical innovation as a means to augment performance art has created a budding field in computer science that explores the creation and manipulation of sound for artistic purposes. One facet of this new frontier concerns timbral creation, or the development of new sounds with unique characteristics that can be wielded by the musician as a virtual instrument.
This thesis presents Timcat, a software system that can be used to create novel timbres from prerecorded audio. Various techniques for timbral feature extraction from short audio clips, or grains, are evaluated for use in timbral feature spaces. Clustering is performed on feature vectors in these spaces and groupings are recombined using concatenative synthesis techniques in order to form new instrument patches.
The results reveal that interesting timbres can be created using features extracted by both newly developed and existing signal analysis techniques, many of which are common in other fields though not often applied to music audio signals. Several of the features employed also show high accuracy for instrument separation in randomly mixed tracks. Survey results demonstrate positive feedback concerning the timbres created by Timcat from electronic music composers, musicians, and music lovers alike.
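The cluster-then-recombine step described in the abstract above can be sketched as follows. The abstract does not name the clustering algorithm Timcat uses, so k-means is an assumption made purely for illustration, as are the function names `kmeans` and `patch_from_cluster`, the deterministic initialization, and the toy two-dimensional features.

```python
import math


def kmeans(points, k, iterations=20):
    """Plain k-means (assumed stand-in); centroids start at the first k points."""
    centroids = [list(p) for p in points[:k]]
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assign each feature vector to its nearest centroid.
        labels = [min(range(k), key=lambda c: math.dist(p, centroids[c]))
                  for p in points]
        # Move each centroid to the mean of its members (skip empty clusters).
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centroids[c] = [sum(dim) / len(members) for dim in zip(*members)]
    return labels, centroids


def patch_from_cluster(grains, labels, cluster):
    """Concatenate, in original order, every grain assigned to `cluster`."""
    out = []
    for grain, lab in zip(grains, labels):
        if lab == cluster:
            out.extend(grain)
    return out


if __name__ == "__main__":
    features = [(0.0, 0.1), (5.0, 5.1), (0.2, 0.0), (5.2, 4.9)]  # one vector per grain
    grains = [[1], [2], [3], [4]]                                # toy audio samples
    labels, _ = kmeans(features, k=2)
    print(labels, patch_from_cluster(grains, labels, 0))
```

Each cluster groups grains that are close in the chosen timbral feature space, so concatenating one cluster's grains yields material with a broadly consistent timbre.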
AudioPairBank: Towards A Large-Scale Tag-Pair-Based Audio Content Analysis
Recently, sound recognition has been used to identify sounds such as "car" and
"river". However, sounds have nuances that may be better described by
adjective-noun pairs such as "slow car" and verb-noun pairs such as "flying
insects", which are underexplored. Therefore, in this work we investigate the
relation between audio content and both adjective-noun pairs and verb-noun
pairs. Due to the lack of datasets with these kinds of annotations, we
collected and processed the AudioPairBank corpus consisting of a combined total
of 1,123 pairs and over 33,000 audio files. One contribution is the previously
unavailable documentation of the challenges and implications of collecting
audio recordings with these types of labels. A second contribution is to show
the degree of correlation between the audio content and the labels through
sound recognition experiments, which yielded results of 70% accuracy, hence
also providing a performance benchmark. The results and study in this paper
encourage further exploration of the nuances in audio and are meant to
complement similar research performed on images and text in multimedia
analysis. Comment: This paper is a revised version of "AudioSentibank: Large-scale
Semantic Ontology of Acoustic Concepts for Audio Content Analysis"