11,954 research outputs found

    Filosax: A Dataset of Annotated Jazz Saxophone Recordings

    Get PDF
    The Filosax dataset is a large collection of specially commissioned recordings of jazz saxophonists playing with commercially available backing tracks. Five participants each recorded themselves playing the melody, interpreting a transcribed solo and improvising on 48 tracks, giving a total of around 24 hours of audio data. The solos are annotated both as individual note events with physical timing, and as sheet music with a metrical interpretation of the timing. In this paper, we outline the criteria used for choosing and sourcing the repertoire, the recording process and the semi-automatic transcription pipeline. We demonstrate the use of the dataset to analyse musical phenomena such as swing timing and dynamics of typical musical figures, as well as for training a source activity detection system and predicting expressive characteristics. Other potential applications include the modelling of jazz improvisation, performer identification, automatic music transcription, source separation and music generation

    Acoustically Inspired Probabilistic Time-domain Music Transcription and Source Separation.

    Get PDF
    PhD ThesisAutomatic music transcription (AMT) and source separation are important computational tasks, which can help to understand, analyse and process music recordings. The main purpose of AMT is to estimate, from an observed audio recording, a latent symbolic representation of a piece of music (piano-roll). In this sense, in AMT the duration and location of every note played is reconstructed from a mixture recording. The related task of source separation aims to estimate the latent functions or source signals that were mixed together in an audio recording. This task requires not only the duration and location of every event present in the mixture, but also the reconstruction of the waveform of all the individual sounds. Most methods for AMT and source separation rely on the magnitude of time-frequency representations of the analysed recording, i.e., spectrograms, and often arbitrarily discard phase information. On one hand, this decreases the time resolution in AMT. On the other hand, discarding phase information corrupts the reconstruction in source separation, because the phase of each source-spectrogram must be approximated. There is thus a need for models that circumvent phase approximation, while operating at sample-rate resolution. This thesis intends to solve AMT and source separation together from an unified perspective. For this purpose, Bayesian non-parametric signal processing, covariance kernels designed for audio, and scalable variational inference are integrated to form efficient and acoustically-inspired probabilistic models. To circumvent phase approximation while keeping sample-rate resolution, AMT and source separation are addressed from a Bayesian time-domain viewpoint. That is, the posterior distribution over the waveform of each sound event in the mixture is computed directly from the observed data. For this purpose, Gaussian processes (GPs) are used to define priors over the sources/pitches. GPs are probability distributions over functions, and its kernel or covariance determines the properties of the functions sampled from a GP. Finally, the GP priors and the available data (mixture recording) are combined using Bayes' theorem in order to compute the posterior distributions over the sources/pitches. Although the proposed paradigm is elegant, it introduces two main challenges. First, as mentioned before, the kernel of the GP priors determines the properties of each source/pitch function, that is, its smoothness, stationariness, and more importantly its spectrum. Consequently, the proposed model requires the design of flexible kernels, able to learn the rich frequency content and intricate properties of audio sources. To this end, spectral mixture (SM) kernels are studied, and the Mat ern spectral mixture (MSM) kernel is introduced, i.e. a modified version of the SM covariance function. The MSM kernel introduces less strong smoothness, thus it is more suitable for modelling physical processes. Second, the computational complexity of GP inference scales cubically with the number of audio samples. Therefore, the application of GP models to large audio signals becomes intractable. To overcome this limitation, variational inference is used to make the proposed model scalable and suitable for signals in the order of hundreds of thousands of data points. The integration of GP priors, kernels intended for audio, and variational inference could enable AMT and source separation time-domain methods to reconstruct sources and transcribe music in an efficient and informed manner. In addition, AMT and source separation are current challenges, because the spectra of the sources/pitches overlap with each other in intricate ways. Thus, the development of probabilistic models capable of differentiating sources/pitches in the time domain, despite the high similarity between their spectra, opens the possibility to take a step towards solving source separation and automatic music transcription. We demonstrate the utility of our methods using real and synthesized music audio datasets for various types of musical instruments

    Identifying Cover Songs Using Information-Theoretic Measures of Similarity

    Get PDF
    This work is licensed under a Creative Commons Attribution 3.0 License. For more information, see http://creativecommons.org/licenses/by/3.0/This paper investigates methods for quantifying similarity between audio signals, specifically for the task of cover song detection. We consider an information-theoretic approach, where we compute pairwise measures of predictability between time series. We compare discrete-valued approaches operating on quantized audio features, to continuous-valued approaches. In the discrete case, we propose a method for computing the normalized compression distance, where we account for correlation between time series. In the continuous case, we propose to compute information-based measures of similarity as statistics of the prediction error between time series. We evaluate our methods on two cover song identification tasks using a data set comprised of 300 Jazz standards and using the Million Song Dataset. For both datasets, we observe that continuous-valued approaches outperform discrete-valued approaches. We consider approaches to estimating the normalized compression distance (NCD) based on string compression and prediction, where we observe that our proposed normalized compression distance with alignment (NCDA) improves average performance over NCD, for sequential compression algorithms. Finally, we demonstrate that continuous-valued distances may be combined to improve performance with respect to baseline approaches. Using a large-scale filter-and-refine approach, we demonstrate state-of-the-art performance for cover song identification using the Million Song Dataset.The work of P. Foster was supported by an Engineering and Physical Sciences Research Council Doctoral Training Account studentship

    Multimodal music information processing and retrieval: survey and future challenges

    Full text link
    Towards improving the performance in various music information processing tasks, recent studies exploit different modalities able to capture diverse aspects of music. Such modalities include audio recordings, symbolic music scores, mid-level representations, motion, and gestural data, video recordings, editorial or cultural tags, lyrics and album cover arts. This paper critically reviews the various approaches adopted in Music Information Processing and Retrieval and highlights how multimodal algorithms can help Music Computing applications. First, we categorize the related literature based on the application they address. Subsequently, we analyze existing information fusion approaches, and we conclude with the set of challenges that Music Information Retrieval and Sound and Music Computing research communities should focus in the next years

    Multi-label Ferns for Efficient Recognition of Musical Instruments in Recordings

    Full text link
    In this paper we introduce multi-label ferns, and apply this technique for automatic classification of musical instruments in audio recordings. We compare the performance of our proposed method to a set of binary random ferns, using jazz recordings as input data. Our main result is obtaining much faster classification and higher F-score. We also achieve substantial reduction of the model size
    • …
    corecore