
    A Cross-Cultural Analysis of Music Structure

    PhD thesis. Music signal analysis is a research field concerned with the extraction of meaningful information from musical audio signals. This thesis analyses music signals from the note level to the song level in a bottom-up manner and situates the research in two music information retrieval (MIR) problems: audio onset detection (AOD) and music structural segmentation (MSS). Most MIR tools are developed for, and evaluated on, Western music, with specific musical knowledge encoded. This thesis approaches the investigated tasks from a cross-cultural perspective by developing audio features and algorithms applicable to both Western and non-Western genres. Two Chinese Jingju databases are collected to support the AOD and MSS tasks, respectively. New features and algorithms for AOD are presented that rely on fusion techniques, and we show that fusion can significantly improve the performance of the constituent baseline AOD algorithms. A large-scale parameter analysis is carried out to identify the relations between system configurations and the musical properties of different music types. Novel audio features are developed to summarise music timbre, harmony and rhythm for structural description. The new features serve as effective alternatives to commonly used ones, showing comparable performance on existing datasets and surpassing them on the Jingju dataset. A new segmentation algorithm is presented which effectively captures the structural characteristics of Jingju. By evaluating the presented audio features and different segmentation algorithms incorporating different structural principles for the investigated music types, this thesis also identifies the underlying relations between audio features, segmentation methods and music genres in the scenario of music structural analysis.
    Funding: China Scholarship Council; EPSRC C4DM Travel Funding; EPSRC Fusing Semantic and Audio Technologies for Intelligent Music Production and Consumption (EP/L019981/1); EPSRC Platform Grant on Digital Music (EP/K009559/1); European Research Council project CompMusic; International Society for Music Information Retrieval Student Grant; QMUL Postgraduate Research Fund; QMUL-BUPT Joint Programme Funding; Women in Music Information Retrieval Grant.
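    The score-level fusion described above can be illustrated with a short, hedged sketch: two baseline onset detection functions are normalised, averaged and peak-picked. The library calls, weights and thresholds below are illustrative assumptions, not the thesis's actual fusion algorithms.

```python
# Minimal sketch: late (score-level) fusion of two onset detection functions (ODFs).
# Illustrative only; weights, thresholds and ODF choices are assumptions.
import numpy as np
import librosa
from scipy.signal import find_peaks

y, sr = librosa.load(librosa.ex("trumpet"))  # example clip bundled with librosa

# Two baseline ODFs: default spectral flux and a SuperFlux-style variant.
odf_a = librosa.onset.onset_strength(y=y, sr=sr)
odf_b = librosa.onset.onset_strength(y=y, sr=sr, n_mels=138, fmax=16000, lag=2, max_size=3)

# Normalise each ODF to [0, 1] and average them (simple late fusion).
norm = lambda f: (f - f.min()) / (np.ptp(f) + 1e-9)
fused = 0.5 * norm(odf_a) + 0.5 * norm(odf_b)

# Peak-pick the fused curve and convert frame indices to onset times in seconds.
peaks, _ = find_peaks(fused, height=0.3, distance=3)
print(librosa.frames_to_time(peaks, sr=sr))
```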

    Constrained structure of ancient Chinese poetry facilitates speech content grouping

    Ancient Chinese poetry consists of structured language that deviates from ordinary language usage [1, 2]; its poetic genres impose unique combinatory constraints on linguistic elements [3]. How does the constrained poetic structure facilitate speech segmentation when common linguistic [4, 5, 6, 7, 8] and statistical cues [5, 9] are unreliable for listeners in poems? We generated artificial Jueju, which arguably has the most constrained structure in ancient Chinese poetry, and presented each poem twice as an isochronous sequence of syllables to native Mandarin speakers while recording magnetoencephalography (MEG). We found that listeners deployed their prior knowledge of Jueju to build the line structure and to establish the conceptual flow of Jueju. We also found a previously unreported phase precession phenomenon indicating predictive processes in speech segmentation: the neural phase advanced faster after listeners acquired knowledge of the incoming speech. The statistical co-occurrence of monosyllabic words in Jueju correlated negatively with speech segmentation, which provides an alternative perspective on how statistical cues facilitate speech segmentation. Our findings suggest that constrained poetic structures serve as a temporal map for listeners to group speech contents and to predict incoming speech signals. Listeners can parse speech streams by using not only grammatical and statistical cues but also their prior knowledge of the form of language.

    Algorithms and representations for supporting online music creation with large-scale audio databases

    The rapid adoption of Internet and web technologies has created an opportunity for making music collaboratively by sharing information online. However, current applications for online music making do not take advantage of the potential of shared information. The goal of this dissertation is to provide and evaluate algorithms and representations for interacting with large audio databases that facilitate music creation by online communities. This work has been developed in the context of Freesound, a large-scale, community-driven database of audio recordings shared under Creative Commons (CC) licenses. The diversity of sounds available through this kind of platform is unprecedented. At the same time, the unstructured nature of community-driven processes poses new challenges for indexing and retrieving information to support musical creativity. In this dissertation we propose and evaluate algorithms and representations for dealing with the main elements required by online music making applications based on large-scale audio databases: sound files, including time-varying and aggregate representations; taxonomies for retrieving sounds; music representations; and community models. As a generic low-level representation for audio signals, we analyze the framework of cepstral coefficients, evaluating their performance on example classification tasks. We found that switching to more recent auditory filters, such as gammatone filters, improves at large scales on traditional representations based on the mel scale. We then consider common types of sounds for obtaining aggregated representations. We show that several time-series analysis features computed from the cepstral coefficients complement traditional statistics for improved performance. For interacting with large databases of sounds, we propose a novel unsupervised algorithm that automatically generates taxonomical organizations based on the low-level signal representations. Based on user studies, we show that our approach can be used in place of traditional supervised classification approaches for providing a lexicon of acoustic categories suitable for creative applications. Next, a computational representation is described for music based on audio samples. We demonstrate through a user experiment that it facilitates collaborative creation and supports computational analysis using the lexicons generated by sound taxonomies. Finally, we deal with the representation and analysis of user communities. We propose a method for measuring collective creativity in audio sharing. By analyzing the activity of the Freesound community over a period of more than five years, we show that the proposed creativity measures can be significantly related to social structure characterized by network analysis.
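    The unsupervised taxonomy generation summarised above can be sketched as follows; the choice of MFCC statistics as the aggregate representation and Ward-linkage hierarchical clustering are illustrative assumptions, not the dissertation's actual algorithm, and the file names are hypothetical.

```python
# Minimal sketch: group a sound collection into a taxonomy-like set of acoustic
# categories from low-level cepstral features. Features and clustering are assumptions.
import numpy as np
import librosa
from scipy.cluster.hierarchy import linkage, fcluster

def aggregate_features(path, n_mfcc=13):
    """Summarise one audio file as the mean and std of its cepstral coefficients."""
    y, sr = librosa.load(path, mono=True)
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc)
    return np.concatenate([mfcc.mean(axis=1), mfcc.std(axis=1)])

def build_lexicon(paths, n_categories=8):
    """Cluster the collection hierarchically and return one category label per sound."""
    X = np.vstack([aggregate_features(p) for p in paths])
    Z = linkage(X, method="ward")                 # hierarchical (taxonomy-like) structure
    return fcluster(Z, t=n_categories, criterion="maxclust")

# labels = build_lexicon(["kick.wav", "rain.wav", "violin.wav"])  # hypothetical files
```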

    Music-listening systems

    Thesis (Ph.D.)--Massachusetts Institute of Technology, Dept. of Architecture, 2000. Includes bibliographical references (p. [235]-248). When human listeners are confronted with musical sounds, they rapidly and automatically orient themselves in the music. Even musically untrained listeners have an exceptional ability to make rapid judgments about music from very short examples, such as determining the music's style, performer, beat, complexity, and emotional impact. However, there are presently no theories of music perception that can explain this behavior, and it has proven very difficult to build computer music-analysis tools with similar capabilities. This dissertation examines the psychoacoustic origins of the early stages of music listening in humans, using both experimental and computer-modeling approaches. The results of this research enable the construction of automatic machine-listening systems that can make human-like judgments about short musical stimuli. New models are presented that explain the perception of musical tempo, the perceived segmentation of sound scenes into multiple auditory images, and the extraction of musical features from complex musical sounds. These models are implemented as signal-processing and pattern-recognition computer programs, using the principle of understanding without separation. Two experiments with human listeners study the rapid assignment of high-level judgments to musical stimuli, and it is demonstrated that many of the experimental results can be explained with a multiple-regression model on the extracted musical features. From a theoretical standpoint, the thesis shows how theories of music perception can be grounded in a principled way upon psychoacoustic models in a computational-auditory-scene-analysis framework. Further, the perceptual theory presented is more relevant to everyday listeners and situations than are previous cognitive-structuralist approaches to music perception and cognition. From a practical standpoint, the various models form a set of computer signal-processing and pattern-recognition tools that can mimic human perceptual abilities on a variety of musical tasks, such as tapping along with the beat, parsing music into sections, making semantic judgments about musical examples, and estimating the similarity of two pieces of music. Eric D. Scheirer. Ph.D.
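    The multiple-regression account mentioned above, in which high-level listener judgments are predicted from extracted musical features, can be sketched as follows; the feature names, data and weights are synthetic placeholders, not the thesis's features or results.

```python
# Minimal sketch: explain high-level listener judgments as a linear combination of
# extracted musical features. All data here are synthetic placeholders.
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(60, 4))          # e.g. tempo, loudness, spectral centroid, onset rate
true_w = np.array([0.5, 0.3, -0.2, 0.1])
y = X @ true_w + rng.normal(scale=0.1, size=60)   # synthetic judgment ratings

model = LinearRegression().fit(X, y)
print("variance explained (R^2):", model.score(X, y))
print("feature weights:", model.coef_)
```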

    Feature extraction of musical content for automatic music transcription

    The purpose of this thesis is to develop new methods for automatic transcription of the melody and harmonic parts of real-life music signals. Music transcription is here defined as the act of analysing a piece of music and writing down parameter representations which indicate the pitch, onset time and duration of each note, as well as the loudness and the instruments used in the analysed signal. The proposed algorithms and methods aim at resolving two key sub-problems in automatic music transcription: music onset detection and polyphonic pitch estimation. There are three original contributions in this thesis. The first is an original frequency-dependent time-frequency analysis tool called the Resonator Time-Frequency Image (RTFI). By simply defining a parameterized function mapping frequency to the exponential decay factor of the complex resonator filter bank, the RTFI can easily and flexibly implement time-frequency analysis with different resolutions, such as ear-like (similar to the human auditory frequency analyser), constant-Q or uniform (evenly spaced) time-frequency resolutions. A corresponding multi-resolution fast implementation of the RTFI has also been developed. The second original contribution consists of two new music onset detection algorithms: an energy-based detection algorithm and a pitch-based detection algorithm. The energy-based detection algorithm performs well on the detection of hard onsets. The pitch-based detection algorithm is the first to successfully exploit pitch-change cues for onset detection in real polyphonic music, and achieves much better performance than other existing detection algorithms on soft onsets. The third contribution is the development of two new polyphonic pitch estimation methods, both based on the RTFI analysis. The first proposed estimation method mainly exploits harmonic relations and the spectral-smoothness principle, and consequently achieves excellent performance on real polyphonic music signals. The second proposed polyphonic pitch estimation method is based on the combination of signal processing and machine learning. The basic idea behind this method is to recast polyphonic pitch estimation as a pattern recognition problem. The proposed estimation method is mainly composed of a signal processing block followed by a learning machine: multi-resolution fast RTFI analysis is used as the signal processing component, and a support vector machine (SVM) is selected as the learning machine. The experimental results of the first approach show a clear improvement over other state-of-the-art methods.
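    A bank of complex one-pole resonators in the spirit of the RTFI can be sketched as below; the mapping from frequency to decay factor, the Q value and the filter normalisation are illustrative assumptions rather than the thesis's formulation.

```python
# Minimal sketch: time-frequency analysis with a bank of complex one-pole resonators,
# loosely in the spirit of the RTFI. The frequency-to-decay mapping is an assumption.
import numpy as np
from scipy.signal import lfilter

def resonator_bank(x, sr, freqs, q=30.0):
    """Return the magnitude output of one complex resonator per centre frequency."""
    out = np.empty((len(freqs), len(x)))
    for i, f in enumerate(freqs):
        r = np.exp(-np.pi * f / (q * sr))            # frequency-dependent pole radius
        pole = r * np.exp(2j * np.pi * f / sr)       # complex pole at the centre frequency
        y = lfilter([1.0 - r], [1.0, -pole], x.astype(complex))
        out[i] = np.abs(y)                           # energy envelope in this band
    return out

# Example: log-spaced (constant-Q style) bands analysing a synthetic 440 Hz tone.
sr = 16000
t = np.arange(sr) / sr
tfr = resonator_bank(np.sin(2 * np.pi * 440 * t), sr, freqs=np.geomspace(55, 3520, 48))
```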

    Deep Learning for Audio Effects Modeling

    PhD thesis. Audio effects modeling is the process of emulating an audio effect unit and seeks to recreate the sound, behaviour and main perceptual features of an analog reference device. Audio effect units are analog or digital signal processing systems that transform certain characteristics of the sound source. These transformations can be linear or nonlinear, time-invariant or time-varying, and with short-term or long-term memory. Most typical audio effect transformations are based on dynamics, such as compression; tone, such as distortion; frequency, such as equalization; and time, such as artificial reverberation or modulation-based audio effects. The digital simulation of these audio processors is normally done by designing mathematical models of these systems. This is often difficult because it seeks to accurately model all components within the effect unit, which usually contains mechanical elements together with nonlinear and time-varying analog electronics. Most existing methods for audio effects modeling are either simplified or optimized for a very specific circuit or type of audio effect and cannot be efficiently translated to other types of audio effects. This thesis aims to explore deep learning architectures for music signal processing in the context of audio effects modeling. We investigate deep neural networks as black-box modeling strategies to solve this task, i.e. by using only input-output measurements. We propose different DSP-informed deep learning models to emulate each type of audio effect transformation. Through objective perceptual-based metrics and subjective listening tests we explore the performance of these models when modeling various analog audio effects. We also analyze how the given tasks are accomplished and what the models are actually learning. We show virtual analog models of nonlinear effects, such as a tube preamplifier; nonlinear effects with memory, such as a transistor-based limiter; and electromechanical nonlinear time-varying effects, such as a Leslie speaker cabinet and plate and spring reverberators. We report that the proposed deep learning architectures represent an improvement over the state of the art in black-box modeling of audio effects, and directions for future work are given.
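    Black-box modeling of this kind learns a waveform-to-waveform mapping from paired input and output recordings of the reference device; the tiny dilated convolutional network below is a hedged illustration of that setup in PyTorch, not one of the thesis's DSP-informed architectures, and the training data are placeholders.

```python
# Minimal sketch: black-box audio effect modeling as supervised waveform-to-waveform
# regression from input/output measurements. Architecture and sizes are illustrative.
import torch
import torch.nn as nn

class TinyEffectModel(nn.Module):
    def __init__(self, channels=16, kernel=9, layers=4):
        super().__init__()
        blocks, in_ch = [], 1
        for i in range(layers):
            dilation = 2 ** i                        # growing receptive field (memory)
            blocks += [nn.Conv1d(in_ch, channels, kernel, dilation=dilation,
                                 padding=dilation * (kernel - 1) // 2),
                       nn.Tanh()]                    # static nonlinearity between layers
            in_ch = channels
        blocks.append(nn.Conv1d(channels, 1, 1))     # project back to one output channel
        self.net = nn.Sequential(*blocks)

    def forward(self, x):                            # x: (batch, 1, samples)
        return self.net(x)

# Training step sketch: minimise the error between the model output and the recorded
# output of the reference device for the same dry input (placeholder data here).
model = TinyEffectModel()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
x = torch.randn(4, 1, 4096)                          # placeholder dry input
y = torch.tanh(2.0 * x)                              # placeholder "device" output
loss = nn.functional.mse_loss(model(x), y)
loss.backward(); opt.step()
```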

    Flamenco music information retrieval.

    Flamenco is a rich performance-oriented art music genre from Southern Spain, which attracts a growing community of aficionados around the globe. The constantly increasing number of digitally available flamenco recordings in music archives, video sharing platforms and online music services calls for the development of genre-specific description and analysis methods, capable of automatically indexing and examining these collections in a content-driven manner. Music Information Retrieval is a multi-disciplinary research area dedicated to the automatic extraction of musical information from audio recordings and scores.
However, most existing approaches were developed in the context of popular or classical music and often do not generalise well to non-Western music traditions, in particular when the underlying music-theoretical assumptions do not hold for these genres. The specific characteristics and concepts of a music tradition can furthermore imply new computational challenges for which no suitable methods exist. This thesis addresses these current shortcomings of Music Information Retrieval by tackling several computational challenges which arise in the context of flamenco music. To this end, a number of contributions to the field are made in the form of novel algorithms, comparative evaluations and data-driven studies, directed at various musical dimensions and encompassing several sub-areas of computer science, computational mathematics, statistics, optimisation and computational musicology. A particularity of flamenco, which immensely shapes the work presented in this thesis, is the absence of written scores. Consequently, computational approaches can rely only on the direct analysis of raw audio recordings or automatically extracted transcriptions, and this restriction generates a set of new computational challenges. A key aspect of flamenco is the presence of recurring melodic templates, which are subject to heavy variation during performance. From a computational perspective, we identify three tasks related to this characteristic, namely melody classification, melody retrieval and melodic template extraction, which are addressed in this thesis. We furthermore approach the task of detecting repeated sung phrases in an unsupervised manner and explore the use of deep learning methods for image-based singer identification in flamenco videos and structural segmentation of flamenco recordings. Finally, we demonstrate in a data-driven corpus study how automatic annotations can be mined to discover interesting correlations and gain insights into a largely undocumented genre.
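    Melody retrieval over automatically extracted transcriptions, as described above, can be sketched with a simple alignment-based stand-in; the f0 extractor, dynamic time warping similarity and file names below are assumptions for illustration, not the methods developed in the thesis.

```python
# Minimal sketch: rank recordings by melodic similarity to a query contour using
# dynamic time warping (DTW) over f0 sequences. Inputs and parameters are assumptions.
import numpy as np
import librosa

def f0_contour(path):
    """Extract a coarse fundamental-frequency contour (in semitones) from a recording."""
    y, sr = librosa.load(path, mono=True)
    f0, voiced, _ = librosa.pyin(y, fmin=80, fmax=600, sr=sr)
    return 12 * np.log2(f0[voiced] / 440.0)          # voiced frames, Hz -> semitones re A4

def dtw_cost(a, b):
    """DTW alignment cost between two contours, normalised by the path length."""
    D, wp = librosa.sequence.dtw(X=a.reshape(1, -1), Y=b.reshape(1, -1), metric="euclidean")
    return D[-1, -1] / len(wp)

# query = f0_contour("query_phrase.wav")                       # hypothetical file names
# ranked = sorted(collection, key=lambda p: dtw_cost(query, f0_contour(p)))
```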

    DNN-Assisted Speech Enhancement Approaches Incorporating Phase Information

    Speech enhancement is a widely adopted technique that removes interference from corrupted speech to improve speech quality and intelligibility. Speech enhancement methods can be implemented in either the time domain or the time-frequency (T-F) domain. Among the various proposed methods, the time-frequency domain methods, which synthesize the enhanced speech from the estimated magnitude spectrogram and the noisy phase spectrogram, have gained the most popularity in the past few decades. However, these techniques tend to ignore the importance of phase processing. To overcome this problem, this thesis aims to jointly enhance the magnitude and phase spectra by means of recent deep neural networks (DNNs). More specifically, three major contributions are presented in this thesis. First, we present new schemes based on the basic Kalman filter (KF) to remove the background noise from noisy speech in the time domain, where the KF acts as a joint estimator of both the magnitude and phase spectra of speech. A DNN-augmented basic KF is first proposed, where a DNN is applied to estimate key parameters of the KF, namely the linear prediction coefficients (LPCs). By training the DNN with a large database and making use of its powerful learning ability, the proposed algorithm is able to estimate LPCs from noisy speech more accurately and robustly, leading to improved performance compared with traditional KF-based approaches to speech enhancement. We further present a high-frequency (HF) component restoration algorithm to mitigate the degradation in the HF regions of the Kalman-filtered speech, in which DNN-based bandwidth extension is applied to estimate the magnitude of the HF component from its low-frequency (LF) counterpart. By incorporating the restoration algorithm, the enhanced speech suffers less distortion in the HF component. Moreover, we propose a hybrid speech enhancement system that exploits a DNN for speech reconstruction and Kalman filtering for further denoising. Two separate networks are adopted to estimate the magnitude spectrogram and the LPCs of the clean speech, respectively. The estimated clean magnitude spectrogram is combined with the phase of the noisy speech to reconstruct the estimated clean speech. A KF with the estimated parameters is then utilized to remove the residual noise in the reconstructed speech. The proposed hybrid system takes advantage of both DNN-based reconstruction and traditional Kalman filtering, and can work reliably in either matched or unmatched acoustic environments. Next, we incorporate the DNN-based parameter estimation scheme in two advanced KFs: the subband KF and the colored-noise KF. The DNN-augmented subband KF method decomposes the noisy speech into several subbands and applies Kalman filtering to each subband signal, where the parameters of the KF are estimated by the trained DNN. The final enhanced speech is then obtained by synthesizing the enhanced subband signals. In the DNN-augmented colored-noise KF system, both clean speech and noise are modelled as autoregressive (AR) processes, whose parameters comprise the LPCs and the driving noise variances. The LPCs are obtained by training a multi-objective DNN, while the driving noise variances are obtained by solving an optimization problem that minimizes the difference between the modelled and observed AR spectra of the noisy speech. The colored-noise Kalman filter with DNN-estimated parameters is then applied to the noisy speech for denoising.
    A post-subtraction technique is adopted to further remove the residual noise in the Kalman-filtered speech. Extensive computer simulations show that the two proposed advanced KF systems achieve significant performance gains over conventional Kalman-filter-based algorithms as well as recent DNN-based methods, under both seen and unseen noise conditions. Finally, we focus on T-F domain speech enhancement with masking techniques, which aim to retain the speech-dominant components and suppress the noise-dominant parts of the noisy speech. We first derive a new type of mask, namely the constrained ratio mask (CRM), to better control the trade-off between speech distortion and residual noise in the enhanced speech. The CRM is estimated with a trained DNN based on the input noisy feature set and is applied to the noisy magnitude spectrogram for denoising. We further extend the CRM to complex spectrogram estimation, where the enhanced magnitude spectrogram is obtained with the CRM, while the estimated phase spectrogram is reconstructed from the noisy phase spectrogram and the phase derivatives. Performance evaluation reveals that our proposed CRM outperforms several traditional masks in terms of objective metrics. Moreover, the enhanced speech resulting from CRM-based complex spectrogram estimation has better speech quality than that obtained without phase reconstruction.
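    The masking scheme described above multiplies the noisy magnitude spectrogram by an estimated mask and resynthesises the signal with a phase estimate. The sketch below uses an oracle constrained ratio mask computed from known clean and noise signals purely for illustration; the clipping bounds and mask definition are assumptions, and the thesis's DNN estimator is not reproduced.

```python
# Minimal sketch: T-F masking with a constrained (clipped) ratio mask, oracle version.
# Mask definition, bounds and synthesis with the noisy phase are illustrative assumptions.
import numpy as np
import librosa

def enhance_with_crm(clean, noise, n_fft=512, hop=128, floor=0.05, ceil=1.0):
    noisy = clean + noise
    S = librosa.stft(clean, n_fft=n_fft, hop_length=hop)
    N = librosa.stft(noise, n_fft=n_fft, hop_length=hop)
    Y = librosa.stft(noisy, n_fft=n_fft, hop_length=hop)

    ratio = np.abs(S) / (np.abs(S) + np.abs(N) + 1e-8)   # ideal ratio mask
    crm = np.clip(ratio, floor, ceil)                    # constrain the mask range

    # Apply the mask to the noisy magnitude and reuse the noisy phase for synthesis.
    enhanced = crm * np.abs(Y) * np.exp(1j * np.angle(Y))
    return librosa.istft(enhanced, hop_length=hop, length=len(noisy))

# Example with synthetic placeholder signals standing in for speech and noise.
sr = 16000
t = np.arange(sr) / sr
speech_like = np.sin(2 * np.pi * 220 * t) * (0.5 + 0.5 * np.sin(2 * np.pi * 3 * t))
noise = 0.3 * np.random.default_rng(0).standard_normal(sr)
denoised = enhance_with_crm(speech_like, noise)
```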