
    A Deep Representation for Invariance And Music Classification

    Representations in the auditory cortex might be based on mechanisms similar to the visual ventral stream: modules for building invariance to transformations and multiple layers for compositionality and selectivity. In this paper we propose the use of such computational modules for extracting invariant and discriminative audio representations. Building on a theory of invariance in hierarchical architectures, we propose a novel, mid-level representation for acoustical signals, using the empirical distributions of projections on a set of templates and their transformations. Under the assumption that, by construction, this dictionary of templates is composed from similar classes, and samples the orbit of variance-inducing signal transformations (such as shift and scale), the resulting signature is theoretically guaranteed to be unique, invariant to transformations and stable to deformations. Modules of projection and pooling can then constitute layers of deep networks, for learning composite representations. We present the main theoretical and computational aspects of a framework for unsupervised learning of invariant audio representations, empirically evaluated on music genre classification. Comment: 5 pages, CBMM Memo No. 002, (to appear) IEEE 2014 International Conference on Acoustics, Speech, and Signal Processing (ICASSP 2014)
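
The signature described in this abstract can be illustrated with a toy sketch. It assumes circular shifts as the only transformation, a signal and templates of equal length, and a fixed-bin histogram standing in for the empirical distribution. The names `circular_shifts` and `signature` and the bin count are hypothetical and are not taken from the paper.

```python
from math import sqrt


def circular_shifts(template):
    """Return every circular shift of a template (its orbit under shifts)."""
    return [template[k:] + template[:k] for k in range(len(template))]


def signature(signal, templates, n_bins=8):
    """Concatenate, per template, a histogram of normalized projections
    of the signal onto all shifted copies of that template."""
    norm_x = sqrt(sum(v * v for v in signal)) or 1.0
    out = []
    for t in templates:
        norm_t = sqrt(sum(v * v for v in t)) or 1.0
        hist = [0] * n_bins
        for shifted in circular_shifts(t):
            # Normalized dot product lies in [-1, 1] up to rounding.
            p = sum(a * b for a, b in zip(signal, shifted)) / (norm_x * norm_t)
            # Map [-1, 1] onto bin indices, clamping the edges.
            b = int((p + 1.0) / 2.0 * n_bins)
            hist[min(max(b, 0), n_bins - 1)] += 1
        # Pooling step: the normalized histogram forgets which shift
        # produced which projection.
        out.extend(h / len(t) for h in hist)
    return out


# Hypothetical toy data: integer-valued so that dot products are exact.
x = [3, 1, 4, 1, 5, 9, 2, 6]
templates = [[1, 0, 0, 0, 0, 0, 0, -1], [1, 2, 3, 4, 4, 3, 2, 1]]
x_shifted = x[3:] + x[:3]
same = signature(x, templates) == signature(x_shifted, templates)
```

Shifting the signal only permutes the set of projections onto the shifted templates, so the pooled histogram is unchanged. With integer-valued inputs the comparison is exact; with floating-point audio frames it would be safer to compare signatures within a tolerance, since a projection close to a bin edge may fall into a different bin after rounding.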

    Final Research Report on Auto-Tagging of Music

    Deliverable D4.7 covers the work carried out by IRCAM up to M36 on the “auto-tagging of music”. The deliverable is a research report. The software libraries resulting from the research have been integrated into the Fincons/HearDis! Music Library Manager or are used by TU Berlin. The final software libraries are described in D4.5. The research work on auto-tagging has concentrated on four aspects:
    1) Further improving IRCAM’s machine-learning system ircamclass. This has been done by developing the new MASSS audio features and by adding audio augmentation and audio segmentation to ircamclass. The system has then been applied to train the HearDis! “soft” features (Vocals-1, Vocals-2, Pop-Appeal, Intensity, Instrumentation, Timbre, Genre, Style). This is described in Part 3.
    2) Developing two sets of “hard” features (i.e. related to musical or musicological concepts) as specified by HearDis! (for integration into the Fincons/HearDis! Music Library Manager) and TU Berlin (as input for the prediction model of the GMBI attributes). Such features are either derived from previously estimated higher-level concepts (such as structure, key or succession of chords) or obtained from newly developed signal processing algorithms (such as HPSS) or main melody estimation. This is described in Part 4.
    3) Developing audio features to characterize the audio quality of a music track. The goal is to describe the quality of the audio independently of its apparent encoding. This is then used to estimate audio degradation or music decade, so that playlists can be built from tracks with similar audio quality. This is described in Part 5.
    4) Developing innovative algorithms to extract specific audio features to improve music mixes. So far, innovative techniques (based on various Blind Audio Source Separation algorithms and Convolutional Neural Networks) have been developed for singing voice separation, singing voice segmentation, music structure boundaries estimation, and DJ cue-region estimation. This is described in Part 6.
    EC/H2020/688122/EU/Artist-to-Business-to-Business-to-Consumer Audio Branding System/ABC D
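
The report names HPSS (harmonic/percussive source separation) without saying how it is computed. As an illustration only, the sketch below uses the widely known median-filtering approach on a toy magnitude spectrogram. This is an assumed technique, not necessarily the one used by IRCAM, and the names `median_filter` and `hpss` and the window width are hypothetical.

```python
from statistics import median


def median_filter(values, width):
    """Sliding median with a window truncated at the edges."""
    half = width // 2
    n = len(values)
    return [median(values[max(0, i - half):min(n, i + half + 1)])
            for i in range(n)]


def hpss(spec, width=3):
    """Split a magnitude spectrogram spec[freq][time] into harmonic and
    percussive parts using binary masks from median-filtered copies."""
    n_freq, n_time = len(spec), len(spec[0])
    # Harmonic estimate: smooth each frequency bin along time.
    harm = [median_filter(row, width) for row in spec]
    # Percussive estimate: smooth each time frame along frequency.
    cols = [median_filter([spec[f][t] for f in range(n_freq)], width)
            for t in range(n_time)]
    # A bin goes to whichever estimate is strictly larger; ties go to neither.
    harmonic = [[spec[f][t] if harm[f][t] > cols[t][f] else 0.0
                 for t in range(n_time)] for f in range(n_freq)]
    percussive = [[spec[f][t] if cols[t][f] > harm[f][t] else 0.0
                   for t in range(n_time)] for f in range(n_freq)]
    return harmonic, percussive


# Toy spectrogram: one sustained tone (row 2) and one click (column 3).
spec = [[1.0 if f == 2 or t == 3 else 0.0 for t in range(7)] for f in range(5)]
harmonic, percussive = hpss(spec)
```

In this toy case the sustained tone ends up in `harmonic` and the click in `percussive`. The single bin where they cross is a tie and is assigned to neither, which a practical system would more likely handle with soft masks.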

    Automatic music genre classification

    A dissertation submitted to the Faculty of Science, University of the Witwatersrand, in fulfillment of the requirements for the degree of Master of Science. 2014. No abstract provided.

    The GTZAN dataset: Its contents, its faults, their effects on evaluation, and its future use

    The GTZAN dataset appears in at least 100 published works, and is the most-used public dataset for evaluation in machine listening research for music genre recognition (MGR). Our recent work, however, shows GTZAN has several faults (repetitions, mislabelings, and distortions), which challenge the interpretability of any result derived using it. In this article, we disprove the claims that all MGR systems are affected in the same ways by these faults, and that the performances of MGR systems in GTZAN are still meaningfully comparable since they all face the same faults. We identify and analyze the contents of GTZAN, and provide a catalog of its faults. We review how GTZAN has been used in MGR research, and find few indications that its faults have been known and considered. Finally, we rigorously study the effects of its faults on evaluating five different MGR systems. The lesson is not to banish GTZAN, but to use it with consideration of its contents. Comment: 29 pages, 7 figures, 6 tables, 128 references

    Statistical distribution of common audio features : encounters in a heavy-tailed universe

    In the last few years some Music Information Retrieval (MIR) researchers have spotted important drawbacks in applying standard algorithms that are successful on monophonic audio to polyphonic music classification and similarity assessment. Notably, these so-called “Bag-of-Frames” (BoF) algorithms share a common set of assumptions. These assumptions rest on the belief that the numerical descriptions extracted from short-time audio excerpts (or frames) are enough to capture relevant information for the task at hand, that these frame-based audio descriptors are time independent, and that descriptor frames are well described by Gaussian statistics. Thus, if we want to improve current BoF algorithms we could: i) improve current audio descriptors, ii) include temporal information within algorithms working with polyphonic music, and iii) study and characterize the real statistical properties of these frame-based audio descriptors. From a literature review, we have found that many works focus on the first two improvements but, surprisingly, there is a lack of research on the third one. Therefore, in this thesis we analyze and characterize the statistical distribution of common audio descriptors of timbre, tonal and loudness information. Contrary to what is usually assumed, our work shows that the studied descriptors follow heavy-tailed distributions and thus do not belong to a Gaussian universe. This new knowledge led us to propose new algorithms that show improvements over the BoF approach in current MIR tasks such as genre classification, instrument detection, and automatic tagging of music. Furthermore, we also address new MIR tasks such as measuring the temporal evolution of Western popular music. Finally, we highlight some promising paths for future audio-content MIR research that will inhabit a heavy-tailed universe.
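
One simple way to probe whether a descriptor departs from Gaussian statistics is its excess kurtosis, which is close to zero for Gaussian data and positive when the tails are heavier. The sketch below is only an illustration on synthetic samples. The thesis does not state that it uses this statistic, and the function name `excess_kurtosis`, the sample sizes and the Laplace stand-in are assumptions.

```python
import random


def excess_kurtosis(xs):
    """Fourth standardized moment minus 3 (zero for a Gaussian)."""
    n = len(xs)
    mean = sum(xs) / n
    m2 = sum((x - mean) ** 2 for x in xs) / n
    m4 = sum((x - mean) ** 4 for x in xs) / n
    return m4 / (m2 * m2) - 3.0


rng = random.Random(0)  # fixed seed so the run is repeatable
gaussian = [rng.gauss(0.0, 1.0) for _ in range(20000)]
# Laplace samples, built as the difference of two exponential draws,
# stand in for a descriptor with heavier-than-Gaussian tails.
laplace = [rng.expovariate(1.0) - rng.expovariate(1.0) for _ in range(20000)]

k_gauss = excess_kurtosis(gaussian)
k_laplace = excess_kurtosis(laplace)
```

A clearly positive value, as expected for `k_laplace`, signals tails heavier than Gaussian. It does not by itself identify the distribution family, so a fuller analysis would also fit and compare candidate heavy-tailed models.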