
    Automatic cymbal classification

    Dissertation presented to the Faculdade de Ciências e Tecnologia, Universidade Nova de Lisboa, for the degree of Master in Computer Engineering (Mestre em Engenharia Informática). Most of the research on automatic music transcription is focused on the transcription of pitched instruments, such as the guitar and the piano. Little attention has been given to unpitched instruments, such as those found in the drum kit, which is itself a collection of unpitched instruments. Yet, over the last few years, these instruments have started to garner more attention, perhaps due to the increasing popularity of the drum kit in Western music. There has been work on automatic music transcription of the drum kit, especially the snare drum, bass drum, and hi-hat. Still, much work has to be done in order to achieve automatic music transcription of all unpitched instruments. One type of unpitched instrument that has very particular acoustic characteristics and has received almost no attention from the research community is the drum kit cymbal. A drum kit contains several cymbals, and usually these are either treated as a single instrument or disregarded entirely by automatic classifiers of unpitched instruments. We propose to fill this gap: the goal of this dissertation is the automatic classification of drum kit cymbal events and the identification of the class of cymbals to which they belong. As stated, the majority of work in this area deals with percussive instruments that are very different from one another, such as the snare drum, bass drum, and hi-hat. Cymbals, on the other hand, are very similar to one another, as their geometry, alloys, and spectral and sonic traits show. Thus, the contribution of this work is not only correctly classifying the different cymbals, but doing so for instruments whose similarity makes the task considerably harder.

    Automatic Drum Transcription and Source Separation

    While research has been carried out on automated polyphonic music transcription, to date the problem of automated polyphonic percussion transcription has not received the same degree of attention. A related problem is that of sound source separation, which attempts to separate a mixture signal into its constituent sources. This thesis focuses on the task of polyphonic percussion transcription and sound source separation of a limited set of drum instruments, namely the drums found in the standard rock/pop drum kit. As there was little previous research on polyphonic percussion transcription, a broad review of music information retrieval methods, including previous polyphonic percussion systems, was also carried out to determine whether any of these methods were of potential use in the area of polyphonic drum transcription. Following on from this, a review was conducted of general source separation and redundancy reduction techniques, such as Independent Component Analysis and Independent Subspace Analysis, as these techniques have shown potential in separating mixtures of sources. Upon completion of the review it was decided that a combination of the blind separation approach, Independent Subspace Analysis (ISA), with the use of prior knowledge as used in music information retrieval methods, was the best approach to tackling the problem of polyphonic percussion transcription as well as that of sound source separation. A number of new algorithms which combine the use of prior knowledge with the source separation abilities of techniques such as ISA are presented. These include sub-band ISA, Prior Subspace Analysis (PSA), and an automatic modelling and grouping technique which is used in conjunction with PSA to perform polyphonic percussion transcription. These approaches are demonstrated to be effective in the task of polyphonic percussion transcription, and PSA is also demonstrated to be capable of transcribing drums in the presence of pitched instruments.
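
    The core step in ISA-style approaches can be illustrated with a short sketch (not the thesis implementation): reduce a magnitude spectrogram with PCA, apply ICA to obtain statistically independent amplitude envelopes, and recover a spectral profile and onset times for each subspace. The library choices (librosa, scikit-learn, SciPy), the number of components, and the input filename are all illustrative assumptions.

```python
import numpy as np
import librosa
from scipy.signal import find_peaks
from sklearn.decomposition import PCA, FastICA

# Illustrative input: any short drum loop; 'loop.wav' is a placeholder filename.
y, sr = librosa.load("loop.wav", sr=None, mono=True)
S = np.abs(librosa.stft(y, n_fft=2048, hop_length=512))   # magnitude spectrogram (freq x frames)
X = S.T                                                    # frames x frequency bins

n_components = 5                                           # assumed number of subspaces
pca = PCA(n_components=n_components)                       # dimensionality-reduction step of ISA
reduced = pca.fit_transform(X)                             # frames x components

ica = FastICA(n_components=n_components, random_state=0)
envelopes = np.abs(ica.fit_transform(reduced))             # independent amplitude envelopes
bases = np.linalg.pinv(envelopes) @ X                      # least-squares spectral profile per component

# Crude onset detection on each envelope (a stand-in for the later grouping/labelling stages).
for k, env in enumerate(envelopes.T):
    peaks, _ = find_peaks(env, height=0.5 * env.max(), distance=4)
    times = librosa.frames_to_time(peaks, sr=sr, hop_length=512)
    print(f"component {k}: onsets at {np.round(times, 2)} s")
```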

    Quantifying Dynamic Pitch Adjustment Decision Structures in String Quartet Performance

    What does it mean to have a unique group sound? Is such a thing quantifiable? If so, are there noticeable differences between groups, and any correlations with the time each group spends together? One caveat should be noted at the outset: music is generally understood to be created by and listened to by humans, and thus any attempt at quantifiable answers to the above questions will be, at best, orthogonal to its main purpose. It is also clear from anecdotes and interviews with professional musicians that qualitatively distinguishable characteristics of group sound and interpretation absolutely do exist and are noticeable to the listener. Paul Katz, cellist of the Cleveland Quartet, describes the multiple layers of such a group identity: “When one spends that many hours per day and years together, there is a meshing of taste, an unspoken unification of musical values, an intuitive understanding of each other's timings and shapings, and even a merging of how one produces sounds, makes a bow change, or varies vibrato, that is deeper than words or conscious decision making.” This dissertation concerns itself with the general question of whether it is possible to detect and define, in a quantifiable sense, the patterns and elements of a unique group sound identity, specifically in the intonation domain. To begin to answer this question, original research was carried out in which four string quartets were recorded with high-quality equipment under controlled conditions.
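
    One simple, commonly used building block for quantifying intonation is the deviation of a measured fundamental frequency from its nearest equal-tempered pitch, expressed in cents. The sketch below shows only that calculation; the frequency value and A4 reference are illustrative assumptions rather than data from the recorded quartets.

```python
import numpy as np

def cents_from_equal_temperament(f0_hz, a4_hz=440.0):
    """Signed deviation in cents from the nearest 12-tone equal-tempered pitch."""
    midi = 69 + 12 * np.log2(f0_hz / a4_hz)      # fractional MIDI note number
    return 100.0 * (midi - np.round(midi))       # 100 cents per semitone

print(cents_from_equal_temperament(442.0))       # ~ +7.85 cents sharp of A4
```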

    Deep sleep: deep learning methods for the acoustic analysis of sleep-disordered breathing

    Sleep-disordered breathing (SDB) is a serious and prevalent condition that results from the collapse of the upper airway during sleep, which leads to oxygen desaturations, unphysiological variations in intrathoracic pressure, and sleep fragmentation. Its most common form is obstructive sleep apnoea (OSA). The condition has a significant impact on quality of life and is associated with cardiovascular morbidity. Polysomnography, the gold standard for diagnosing SDB, is obtrusive, time-consuming and expensive. Alternative diagnostic approaches have been proposed to overcome its limitations. In particular, acoustic analysis of sleep breathing sounds offers an unobtrusive and inexpensive means to screen for SDB, since its symptoms have distinctive acoustic characteristics, including snoring, loud gasps, chokes, and absence of breathing. This thesis investigates deep learning methods, which have revolutionised speech and audio technology, to robustly screen for SDB in typical sleep conditions using acoustics. To begin with, the desirable characteristics of an acoustic corpus of SDB and the acoustic definition of snoring are considered in order to create corpora for this study. Then three approaches are developed to tackle increasingly complex scenarios. Firstly, with the aim of leveraging a large amount of unlabelled SDB data, unsupervised learning is applied to learn novel feature representations with deep neural networks for the classification of SDB events such as snoring. The incorporation of contextual information to assist the classifier in producing realistic event durations is investigated. Secondly, the temporal pattern of sleep breathing sounds is exploited using convolutional neural networks to screen participants sleeping by themselves for OSA. The integration of acoustic features with physiological data for screening is examined. Thirdly, for the purpose of achieving robustness to bed partner breathing sounds, recurrent neural networks are used to screen a subject and their bed partner for SDB in the same session. Experiments conducted on the constructed corpora show that the developed systems accurately classify SDB events, screen for OSA with high sensitivity and specificity, and screen a subject and their bed partner for SDB with encouraging performance. In conclusion, this thesis makes promising progress in improving access to SDB diagnosis through low-cost and non-invasive methods.
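
    As a rough illustration of the second approach (and not the architecture developed in the thesis), the sketch below defines a small convolutional classifier over log-mel spectrogram segments of sleep audio; all layer sizes, input shapes, and class counts are illustrative assumptions.

```python
import torch
import torch.nn as nn

class SleepAudioCNN(nn.Module):
    """Toy convolutional classifier for spectrogram segments (e.g. snore vs. non-snore)."""
    def __init__(self, n_classes=2):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 16, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(16, 32, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
        )
        self.classifier = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),   # pool over frequency and time
            nn.Flatten(),
            nn.Linear(32, n_classes),
        )

    def forward(self, x):              # x: (batch, 1, n_mels, n_frames)
        return self.classifier(self.features(x))

model = SleepAudioCNN()
logits = model(torch.randn(8, 1, 40, 128))   # 8 segments of 40 mel bands x 128 frames
print(logits.shape)                          # torch.Size([8, 2])
```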

    Audio source separation for music in low-latency and high-latency scenarios

    This thesis proposes specific methods to address the limitations of current music source separation methods in low-latency and high-latency scenarios. First, we focus on methods with low computational cost and low latency. We propose the use of Tikhonov regularization as a method for spectrum decomposition in the low-latency context. We compare it to existing techniques in pitch estimation and tracking tasks, crucial steps in many separation methods. We then use the proposed spectrum decomposition method in low-latency separation tasks targeting singing voice, bass and drums. Second, we propose several high-latency methods that improve the separation of singing voice by modeling components that are often not accounted for, such as breathiness and consonants. Finally, we explore using temporal correlations and human annotations to enhance the separation of drums and complex polyphonic music signals.
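
    The low-latency idea can be sketched as follows: because Tikhonov (ridge) regularisation has a closed-form solution, a magnitude spectrum can be decomposed over a fixed set of basis spectra with a single linear solve per frame, rather than an iterative optimisation. The code below is a minimal illustration with synthetic bases and activations, not the method as implemented in the thesis.

```python
import numpy as np

def tikhonov_decompose(x, B, lam=0.1):
    """Solve min_g ||x - B g||^2 + lam ||g||^2, i.e. g = (B^T B + lam I)^{-1} B^T x."""
    k = B.shape[1]
    return np.linalg.solve(B.T @ B + lam * np.eye(k), B.T @ x)

rng = np.random.default_rng(0)
B = np.abs(rng.normal(size=(1025, 12)))          # 12 assumed basis spectra (e.g. pitch templates)
true_g = np.array([0, 0, 1.0, 0, 0, 0.5, 0, 0, 0, 0, 0, 0])
x = B @ true_g + 0.01 * np.abs(rng.normal(size=1025))   # synthetic observed spectrum

g = tikhonov_decompose(x, B, lam=0.1)
print(np.round(g, 2))                            # activations should concentrate near bases 2 and 5
```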

    Making music through real-time voice timbre analysis: machine learning and timbral control

    People can achieve rich musical expression through vocal sound: see, for example, human beatboxing, which achieves a wide timbral variety through a range of extended techniques. Yet the vocal modality is under-exploited as a controller for music systems. If we can analyse a vocal performance suitably in real time, then this information could be used to create voice-based interfaces with the potential for intuitive and fulfilling levels of expressive control. Conversely, many modern techniques for music synthesis do not imply any particular interface. Should a given parameter be controlled via a MIDI keyboard, or a slider/fader, or a rotary dial? Automatic vocal analysis could provide a fruitful basis for expressive interfaces to such electronic musical instruments. The principal questions in applying vocal-based control are how to extract musically meaningful information from the voice signal in real time, and how to convert that information suitably into control data. In this thesis we address these questions, with a focus on timbral control, and in particular we develop approaches that can be used with a wide variety of musical instruments by applying machine learning techniques to automatically derive the mappings between expressive audio input and control output. The vocal audio signal is construed to include a broad range of expression, in particular encompassing the extended techniques used in human beatboxing. The central contribution of this work is the application of supervised and unsupervised machine learning techniques to automatically map vocal timbre to synthesiser timbre and controls. Component contributions include a delayed decision-making strategy for low-latency sound classification, a regression-tree method to learn associations between regions of two unlabelled datasets, a fast estimator of multidimensional differential entropy and a qualitative method for evaluating musical interfaces based on discourse analysis.
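
    A minimal sketch of the supervised flavour of this idea is given below: a regression tree learns a mapping from vocal timbre features (here MFCCs) to a few synthesiser control parameters. The feature dimensionality, parameter count, and randomly generated training data are illustrative assumptions standing in for features extracted from real vocal audio.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
mfcc_features = rng.normal(size=(500, 13))       # 500 frames x 13 MFCCs (stand-in data)
synth_params = rng.uniform(size=(500, 3))        # e.g. cutoff, resonance, modulation depth

mapper = DecisionTreeRegressor(max_depth=8)
mapper.fit(mfcc_features, synth_params)          # supervised timbre-to-control mapping

new_frame = rng.normal(size=(1, 13))             # features from an incoming vocal frame
print(mapper.predict(new_frame))                 # predicted control values in [0, 1]
```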

    Sparse and Nonnegative Factorizations For Music Understanding

    In this dissertation, we propose methods for sparse and nonnegative factorization that are specifically suited for analyzing musical signals. First, we discuss two constraints that aid factorization of musical signals: harmonic and co-occurrence constraints. We propose a novel dictionary learning method that imposes harmonic constraints upon the atoms of the learned dictionary while allowing the dictionary size to grow appropriately during the learning procedure. When there is significant spectral-temporal overlap among the musical sources, our method outperforms popular existing matrix factorization methods as measured by the recall and precision of learned dictionary atoms. We also propose co-occurrence constraints -- three simple and convenient multiplicative update rules for nonnegative matrix factorization (NMF) that enforce dependence among atoms. Using examples in music transcription, we demonstrate the ability of these updates to represent each musical note with multiple atoms and cluster the atoms for source separation purposes. Second, we study how spectral and temporal information extracted by nonnegative factorizations can improve musical instrument recognition. Musical instrument recognition in melodic signals is difficult, especially for classification systems that rely entirely upon spectral information instead of temporal information. Here, we propose a simple and effective method of combining spectral and temporal information for instrument recognition. While existing classification methods use traditional features such as statistical moments, we extract novel features from spectral and temporal atoms generated by NMF using a biologically motivated multiresolution gamma filterbank. Unlike other methods that require thresholds, safeguards, and hierarchies, the proposed spectral-temporal method requires only simple filtering and a flat classifier. Finally, we study how to perform sparse factorization when a large dictionary of musical atoms is already known. Sparse coding methods such as matching pursuit (MP) have been applied to problems in music information retrieval such as transcription and source separation with moderate success. However, when the set of dictionary atoms is large, identification of the best match in the dictionary with the residual is slow -- linear in the size of the dictionary. Here, we propose a variant called approximate matching pursuit (AMP) that is faster than MP while maintaining scalability and accuracy. Unlike MP, AMP uses an approximate nearest-neighbor (ANN) algorithm to find the closest match in a dictionary in sublinear time. One such ANN algorithm, locality-sensitive hashing (LSH), is a probabilistic hash algorithm that places similar, yet not identical, observations into the same bin. While the accuracy of AMP is comparable to similar MP methods, the computational complexity is reduced. Also, by using LSH, this method scales easily; the dictionary can be expanded without reorganizing any data structures.
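
    For reference, the unconstrained baseline that the proposed co-occurrence updates build on is the standard pair of multiplicative NMF update rules (Lee and Seung, Euclidean cost), sketched below on a synthetic magnitude spectrogram. The rank and iteration count are illustrative assumptions, and the constrained rules described above are not reproduced here.

```python
import numpy as np

def nmf(V, k, n_iter=200, eps=1e-9):
    """Factorise a nonnegative matrix V ~= W H with multiplicative updates (Euclidean cost)."""
    rng = np.random.default_rng(0)
    F, N = V.shape
    W = rng.uniform(size=(F, k))                 # spectral atoms
    H = rng.uniform(size=(k, N))                 # per-frame activations
    for _ in range(n_iter):
        H *= (W.T @ V) / (W.T @ W @ H + eps)     # multiplicative update for activations
        W *= (V @ H.T) / (W @ H @ H.T + eps)     # multiplicative update for atoms
    return W, H

V = np.abs(np.random.default_rng(1).normal(size=(257, 100)))   # synthetic magnitude spectrogram
W, H = nmf(V, k=8)
print(np.linalg.norm(V - W @ H) / np.linalg.norm(V))           # relative reconstruction error
```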

    Computational Models of Auditory Scene Analysis: A Review

    Auditory scene analysis (ASA) refers to the process(es) of parsing the complex acoustic input into auditory perceptual objects representing either physical sources or temporal sound patterns, such as melodies, which contributed to the sound waves reaching the ears. A number of new computational models accounting for some of the perceptual phenomena of ASA have been published recently. Here we provide a theoretically motivated review of these computational models, aiming to relate their guiding principles to the central issues of the theoretical framework of ASA. Specifically, we ask how they achieve the grouping and separation of sound elements and whether they implement some form of competition between alternative interpretations of the sound input. We consider the extent to which they include predictive processes, as important current theories suggest that perception is inherently predictive, and also how they have been evaluated. We conclude that current computational models of ASA are fragmentary in the sense that, rather than providing general competing interpretations of ASA, they focus on assessing the utility of specific processes (or algorithms) for finding the causes of the complex acoustic signal. This leaves open the possibility for integrating complementary aspects of the models into a more comprehensive theory of ASA.

    Computer Models for Musical Instrument Identification

    A particular aspect of the perception of sound is concerned with what is commonly termed texture or timbre. From a perceptual perspective, timbre is what allows us to distinguish sounds that have similar pitch and loudness. Indeed, most people are able to discern a piano tone from a violin tone or to distinguish different voices or singers. This thesis deals with timbre modelling. Specifically, the formant theory of timbre is the main theme throughout. This theory states that acoustic musical instrument sounds can be characterised by their formant structures. Following this principle, the central point of our approach is to propose a computer implementation for building musical instrument identification and classification systems. Although the main thrust of this thesis is to propose a coherent and unified approach to the musical instrument identification problem, it is oriented towards the development of algorithms that can be used in Music Information Retrieval (MIR) frameworks. Drawing on research in speech processing, a complete supervised system taking into account both physical and perceptual aspects of timbre is described. The approach is composed of three distinct processing layers. Parametric models that allow us to represent signals through mid-level physical and perceptual representations are considered. Next, the use of the Line Spectrum Frequencies as spectral envelope and formant descriptors is emphasised. Finally, the use of generative and discriminative techniques for building instrument and database models is investigated. Our system is evaluated under realistic recording conditions using databases of isolated notes and melodic phrases.
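
    As an illustration of the Line Spectrum Frequency (LSF) representation emphasised here, the sketch below converts the LPC coefficients of a single analysis frame into LSFs by finding the unit-circle roots of the sum and difference polynomials. The use of librosa for the LPC step, the analysis order, and the sample rate are illustrative assumptions, and the frame itself is synthetic.

```python
import numpy as np
import librosa

def lpc_to_lsf(a, tol=1e-4):
    """Line Spectral Frequencies (radians, ascending) from LPC coefficients a = [1, a1, ..., ap]."""
    a = np.concatenate([a, [0.0]])
    P = a + a[::-1]                                 # symmetric (sum) polynomial
    Q = a - a[::-1]                                 # antisymmetric (difference) polynomial
    ang = np.angle(np.concatenate([np.roots(P), np.roots(Q)]))
    return np.sort(ang[(ang > tol) & (ang < np.pi - tol)])   # drop trivial roots at 0 and pi

sr = 22050                                          # assumed sample rate
frame = np.random.default_rng(0).normal(size=2048)  # synthetic frame; in practice, a windowed audio frame
a = librosa.lpc(frame, order=12)                    # LPC spectral-envelope coefficients
lsf_hz = lpc_to_lsf(a) * sr / (2 * np.pi)           # convert radian frequencies to Hz
print(np.round(lsf_hz, 1))                          # twelve LSFs describing the spectral envelope
```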