418 research outputs found

    Comparison for Improvements of Singing Voice Detection System Based on Vocal Separation

    Full text link
    Singing voice detection is the task to identify the frames which contain the singer vocal or not. It has been one of the main components in music information retrieval (MIR), which can be applicable to melody extraction, artist recognition, and music discovery in popular music. Although there are several methods which have been proposed, a more robust and more complete system is desired to improve the detection performance. In this paper, our motivation is to provide an extensive comparison in different stages of singing voice detection. Based on the analysis a novel method was proposed to build a more efficiently singing voice detection system. In the proposed system, there are main three parts. The first is a pre-process of singing voice separation to extract the vocal without the music. The improvements of several singing voice separation methods were compared to decide the best one which is integrated to singing voice detection system. And the second is a deep neural network based classifier to identify the given frames. Different deep models for classification were also compared. The last one is a post-process to filter out the anomaly frame on the prediction result of the classifier. The median filter and Hidden Markov Model (HMM) based filter as the post process were compared. Through the step by step module extension, the different methods were compared and analyzed. Finally, classification performance on two public datasets indicates that the proposed approach which based on the Long-term Recurrent Convolutional Networks (LRCN) model is a promising alternative.Comment: 15 page

    Deep Learning for Audio Signal Processing

    Full text link
    Given the recent surge in developments of deep learning, this article provides a review of the state-of-the-art deep learning techniques for audio signal processing. Speech, music, and environmental sound processing are considered side-by-side, in order to point out similarities and differences between the domains, highlighting general methods, problems, key references, and potential for cross-fertilization between areas. The dominant feature representations (in particular, log-mel spectra and raw waveform) and deep learning models are reviewed, including convolutional neural networks, variants of the long short-term memory architecture, as well as more audio-specific neural network models. Subsequently, prominent deep learning application areas are covered, i.e. audio recognition (automatic speech recognition, music information retrieval, environmental sound detection, localization and tracking) and synthesis and transformation (source separation, audio enhancement, generative models for speech, sound, and music synthesis). Finally, key issues and future questions regarding deep learning applied to audio signal processing are identified.Comment: 15 pages, 2 pdf figure

    From heuristics-based to data-driven audio melody extraction

    Get PDF
    The identification of the melody from a music recording is a relatively easy task for humans, but very challenging for computational systems. This task is known as "audio melody extraction", more formally defined as the automatic estimation of the pitch sequence of the melody directly from the audio signal of a polyphonic music recording. This thesis investigates the benefits of exploiting knowledge automatically derived from data for audio melody extraction, by combining digital signal processing and machine learning methods. We extend the scope of melody extraction research by working with a varied dataset and multiple definitions of melody. We first present an overview of the state of the art, and perform an evaluation focused on a novel symphonic music dataset. We then propose melody extraction methods based on a source-filter model and pitch contour characterisation and evaluate them on a wide range of music genres. Finally, we explore novel timbre, tonal and spatial features for contour characterisation, and propose a method for estimating multiple melodic lines. The combination of supervised and unsupervised approaches leads to advancements on melody extraction and shows a promising path for future research and applications

    Oktoechos Classification and Generation of Liturgical Music using Deep Learning Frameworks

    Get PDF
    An important feature of the music repertoire of the Syrian tradition is the system of classifying melodies into eight tunes, called ’oktoechos’. In the oktoechos tradition, liturgical hymns are sung in eight modes or eight colours (known as eight ’niram’ in Indian tradition). In this paper, recurrent neural network (RNN) models are used for oktoechos genre classification with the help of musical texture features (MTF) and i-vectors. The performance of the proposed approaches is evaluated using a newly created corpus of liturgical music in the South Indian language, Malayalam. Long short-term memory (LSTM)-based and gated recurrent unit(GRU)-based experiments report the average classification accuracy of 83.76% and 77.77%, respectively, with a significant margin over the i-vector-DNN framework. The experiments demonstrate the potential of RNN models in learning temporal information through MTF in recognizing eight modes of oktoechos system. Furthermore, since the Greek liturgy and Gregorian chant also share similar musical traits with Syrian tradition, the musicological insights observed can potentially be applied to those traditions. Generation of oktoechos genre music style has also been discussed using an encoder-decoder framework. The quality of the generated files is evaluated using a perception test

    Proceedings of the 6th International Workshop on Folk Music Analysis, 15-17 June, 2016

    Get PDF
    The Folk Music Analysis Workshop brings together computational music analysis and ethnomusicology. Both symbolic and audio representations of music are considered, with a broad range of scientific approaches being applied (signal processing, graph theory, deep learning). The workshop features a range of interesting talks from international researchers in areas such as Indian classical music, Iranian singing, Ottoman-Turkish Makam music scores, Flamenco singing, Irish traditional music, Georgian traditional music and Dutch folk songs. Invited guest speakers were Anja Volk, Utrecht University and Peter Browne, Technological University Dublin

    Trennung und SchĂ€tzung der Anzahl von Audiosignalquellen mit Zeit- und FrequenzĂŒberlappung

    Get PDF
    Everyday audio recordings involve mixture signals: music contains a mixture of instruments; in a meeting or conference, there is a mixture of human voices. For these mixtures, automatically separating or estimating the number of sources is a challenging task. A common assumption when processing mixtures in the time-frequency domain is that sources are not fully overlapped. However, in this work we consider some cases where the overlap is severe — for instance, when instruments play the same note (unison) or when many people speak concurrently ("cocktail party") — highlighting the need for new representations and more powerful models. To address the problems of source separation and count estimation, we use conventional signal processing techniques as well as deep neural networks (DNN). We ïŹrst address the source separation problem for unison instrument mixtures, studying the distinct spectro-temporal modulations caused by vibrato. To exploit these modulations, we developed a method based on time warping, informed by an estimate of the fundamental frequency. For cases where such estimates are not available, we present an unsupervised model, inspired by the way humans group time-varying sources (common fate). This contribution comes with a novel representation that improves separation for overlapped and modulated sources on unison mixtures but also improves vocal and accompaniment separation when used as an input for a DNN model. Then, we focus on estimating the number of sources in a mixture, which is important for real-world scenarios. Our work on count estimation was motivated by a study on how humans can address this task, which lead us to conduct listening experiments, conïŹrming that humans are only able to estimate the number of up to four sources correctly. To answer the question of whether machines can perform similarly, we present a DNN architecture, trained to estimate the number of concurrent speakers. Our results show improvements compared to other methods, and the model even outperformed humans on the same task. In both the source separation and source count estimation tasks, the key contribution of this thesis is the concept of “modulation”, which is important to computationally mimic human performance. Our proposed Common Fate Transform is an adequate representation to disentangle overlapping signals for separation, and an inspection of our DNN count estimation model revealed that it proceeds to ïŹnd modulation-like intermediate features.Im Alltag sind wir von gemischten Signalen umgeben: Musik besteht aus einer Mischung von Instrumenten; in einem Meeting oder auf einer Konferenz sind wir einer Mischung menschlicher Stimmen ausgesetzt. FĂŒr diese Mischungen ist die automatische Quellentrennung oder die Bestimmung der Anzahl an Quellen eine anspruchsvolle Aufgabe. Eine hĂ€uïŹge Annahme bei der Verarbeitung von gemischten Signalen im Zeit-Frequenzbereich ist, dass die Quellen sich nicht vollstĂ€ndig ĂŒberlappen. In dieser Arbeit betrachten wir jedoch einige FĂ€lle, in denen die Überlappung immens ist zum Beispiel, wenn Instrumente den gleichen Ton spielen (unisono) oder wenn viele Menschen gleichzeitig sprechen (Cocktailparty) —, so dass neue Signal-ReprĂ€sentationen und leistungsfĂ€higere Modelle notwendig sind. Um die zwei genannten Probleme zu bewĂ€ltigen, verwenden wir sowohl konventionelle Signalverbeitungsmethoden als auch tiefgehende neuronale Netze (DNN). Wir gehen zunĂ€chst auf das Problem der Quellentrennung fĂŒr Unisono-Instrumentenmischungen ein und untersuchen die speziellen, durch Vibrato ausgelösten, zeitlich-spektralen Modulationen. Um diese Modulationen auszunutzen entwickelten wir eine Methode, die auf Zeitverzerrung basiert und eine SchĂ€tzung der Grundfrequenz als zusĂ€tzliche Information nutzt. FĂŒr FĂ€lle, in denen diese SchĂ€tzungen nicht verfĂŒgbar sind, stellen wir ein unĂŒberwachtes Modell vor, das inspiriert ist von der Art und Weise, wie Menschen zeitverĂ€nderliche Quellen gruppieren (Common Fate). Dieser Beitrag enthĂ€lt eine neuartige ReprĂ€sentation, die die Separierbarkeit fĂŒr ĂŒberlappte und modulierte Quellen in Unisono-Mischungen erhöht, aber auch die Trennung in Gesang und Begleitung verbessert, wenn sie in einem DNN-Modell verwendet wird. Im Weiteren beschĂ€ftigen wir uns mit der SchĂ€tzung der Anzahl von Quellen in einer Mischung, was fĂŒr reale Szenarien wichtig ist. Unsere Arbeit an der SchĂ€tzung der Anzahl war motiviert durch eine Studie, die zeigt, wie wir Menschen diese Aufgabe angehen. Dies hat uns dazu veranlasst, eigene Hörexperimente durchzufĂŒhren, die bestĂ€tigten, dass Menschen nur in der Lage sind, die Anzahl von bis zu vier Quellen korrekt abzuschĂ€tzen. Um nun die Frage zu beantworten, ob Maschinen dies Ă€hnlich gut können, stellen wir eine DNN-Architektur vor, die erlernt hat, die Anzahl der gleichzeitig sprechenden Sprecher zu ermitteln. Die Ergebnisse zeigen Verbesserungen im Vergleich zu anderen Methoden, aber vor allem auch im Vergleich zu menschlichen Hörern. Sowohl bei der Quellentrennung als auch bei der SchĂ€tzung der Anzahl an Quellen ist ein Kernbeitrag dieser Arbeit das Konzept der “Modulation”, welches wichtig ist, um die Strategien von Menschen mittels Computern nachzuahmen. Unsere vorgeschlagene Common Fate Transformation ist eine adĂ€quate Darstellung, um die Überlappung von Signalen fĂŒr die Trennung zugĂ€nglich zu machen und eine Inspektion unseres DNN-ZĂ€hlmodells ergab schließlich, dass sich auch hier modulationsĂ€hnliche Merkmale ïŹnden lassen

    Models and analysis of vocal emissions for biomedical applications

    Get PDF
    This book of Proceedings collects the papers presented at the 3rd International Workshop on Models and Analysis of Vocal Emissions for Biomedical Applications, MAVEBA 2003, held 10-12 December 2003, Firenze, Italy. The workshop is organised every two years, and aims to stimulate contacts between specialists active in research and industrial developments, in the area of voice analysis for biomedical applications. The scope of the Workshop includes all aspects of voice modelling and analysis, ranging from fundamental research to all kinds of biomedical applications and related established and advanced technologies

    Controllable music performance synthesis via hierarchical modelling

    Full text link
    L’expression musicale requiert le contrĂŽle sur quelles notes sont jouĂ©es ainsi que comment elles se jouent. Les synthĂ©tiseurs audios conventionnels offrent des contrĂŽles expressifs dĂ©taillĂ©s, cependant au dĂ©triment du rĂ©alisme. La synthĂšse neuronale en boĂźte noire des audios et les Ă©chantillonneurs concatĂ©natifs sont capables de produire un son rĂ©aliste, pourtant, nous avons peu de mĂ©canismes de contrĂŽle. Dans ce travail, nous introduisons MIDI-DDSP, un modĂšle hiĂ©rarchique des instruments musicaux qui permet tant la synthĂšse neuronale rĂ©aliste des audios que le contrĂŽle sophistiquĂ© de la part des utilisateurs. À partir des paramĂštres interprĂ©tables de synthĂšse provenant du traitement diffĂ©rentiable des signaux numĂ©riques (Differentiable Digital Signal Processing, DDSP), nous infĂ©rons les notes musicales et la propriĂ©tĂ© de haut niveau de leur performance expressive (telles que le timbre, le vibrato, l’intensitĂ© et l’articulation). Ceci donne naissance Ă  une hiĂ©rarchie de trois niveaux (notes, performance, synthĂšse) qui laisse aux individus la possibilitĂ© d’intervenir Ă  chaque niveau, ou d’utiliser la distribution prĂ©alable entraĂźnĂ©e (notes Ă©tant donnĂ© performance, synthĂšse Ă©tant donnĂ© performance) pour une assistance crĂ©ative. À l’aide des expĂ©riences quantitatives et des tests d’écoute, nous dĂ©montrons que cette hiĂ©rarchie permet de reconstruire des audios de haute fidĂ©litĂ©, de prĂ©dire avec prĂ©cision les attributs de performance d’une sĂ©quence de notes, mais aussi de manipuler indĂ©pendamment les attributs Ă©tant donnĂ© la performance. Comme il s’agit d’un systĂšme complet, la hiĂ©rarchie peut aussi gĂ©nĂ©rer des audios rĂ©alistes Ă  partir d’une nouvelle sĂ©quence de notes. En utilisant une hiĂ©rarchie interprĂ©table avec de multiples niveaux de granularitĂ©, MIDI-DDSP ouvre la porte aux outils auxiliaires qui renforce la capacitĂ© des individus Ă  travers une grande variĂ©tĂ© d’expĂ©rience musicale.Musical expression requires control of both what notes are played, and how they are performed. Conventional audio synthesizers provide detailed expressive controls, but at the cost of realism. Black-box neural audio synthesis and concatenative samplers can produce realistic audio, but have few mechanisms for control. In this work, we introduce MIDI-DDSP a hierarchical model of musical instruments that enables both realistic neural audio synthesis and detailed user control. Starting from interpretable Differentiable Digital Signal Processing (DDSP) synthesis parameters, we infer musical notes and high-level properties of their expressive performance (such as timbre, vibrato, dynamics, and articulation). This creates a 3-level hierarchy (notes, performance, synthesis) that affords individuals the option to intervene at each level, or utilize trained priors (performance given notes, synthesis given performance) for creative assistance. Through quantitative experiments and listening tests, we demonstrate that this hierarchy can reconstruct high-fidelity audio, accurately predict performance attributes for a note sequence, independently manipulate the attributes of a given performance, and as a complete system, generate realistic audio from a novel note sequence. By utilizing an interpretable hierarchy, with multiple levels of granularity, MIDI-DDSP opens the door to assistive tools to empower individuals across a diverse range of musical experience

    Computational analysis of world music corpora

    Get PDF
    PhDThe comparison of world music cultures has been considered in musicological research since the end of the 19th century. Traditional methods from the field of comparative musicology typically involve the process of manual music annotation. While this provides expert knowledge, the manual input is timeconsuming and limits the potential for large-scale research. This thesis considers computational methods for the analysis and comparison of world music cultures. In particular, Music Information Retrieval (MIR) tools are developed for processing sound recordings, and data mining methods are considered to study similarity relationships in world music corpora. MIR tools have been widely used for the study of (mainly) Western music. The first part of this thesis focuses on assessing the suitability of audio descriptors for the study of similarity in world music corpora. An evaluation strategy is designed to capture challenges in the automatic processing of world music recordings and different state-of-the-art descriptors are assessed. Following this evaluation, three approaches to audio feature extraction are considered, each addressing a different research question. First, a study of singing style similarity is presented. Singing is one of the most common forms of musical expression and it has played an important role in the oral transmission of world music. Hand-designed pitch descriptors are used to model aspects of the singing voice and clustering methods reveal singing style similarities in world music. Second, a study on music dissimilarity is performed. While musical exchange is evident in the history of world music it might be possible that some music cultures have resisted external musical influence. Low-level audio features are combined with machine learning methods to find music examples that stand out in a world music corpus, and geographical patterns are examined. The last study models music similarity using descriptors learned automatically with deep neural networks. It focuses on identifying music examples that appear to be similar in their audio content but share no (obvious) geographical or cultural links in their metadata. Unexpected similarities modelled in this way uncover possible hidden links between world music cultures. This research investigates whether automatic computational analysis can uncover meaningful similarities between recordings of world music. Applications derive musicological insights from one of the largest world music corpora studied so far. Computational analysis as proposed in this thesis advances the state-of-the-art in the study of world music and expands the knowledge and understanding of musical exchange in the world.Queen Mary Principal’s research studentship
    • 

    corecore