
    Deep Learning for Audio Signal Processing

    Given the recent surge in developments of deep learning, this article provides a review of state-of-the-art deep learning techniques for audio signal processing. Speech, music, and environmental sound processing are considered side by side in order to point out similarities and differences between the domains, highlighting general methods, problems, key references, and the potential for cross-fertilization between areas. The dominant feature representations (in particular, log-mel spectra and raw waveforms) and deep learning models are reviewed, including convolutional neural networks, variants of the long short-term memory architecture, and more audio-specific neural network models. Subsequently, prominent deep learning application areas are covered, i.e., audio recognition (automatic speech recognition, music information retrieval, environmental sound detection, localization and tracking) and synthesis and transformation (source separation, audio enhancement, and generative models for speech, sound, and music synthesis). Finally, key issues and future questions regarding deep learning applied to audio signal processing are identified.
    Comment: 15 pages, 2 PDF figures
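The log-mel feature representation mentioned in this review can be sketched with plain NumPy. The sample rate, FFT size, hop, and band count below are illustrative choices for the example, not values prescribed by the article.

```python
import numpy as np

def log_mel_spectrogram(x, sr=16000, n_fft=512, hop=160, n_mels=40):
    """Minimal log-mel spectrogram: Hann-windowed frames, power FFT,
    triangular mel filter bank, then a log compression."""
    window = np.hanning(n_fft)
    n_frames = 1 + (len(x) - n_fft) // hop
    frames = np.stack([x[i * hop:i * hop + n_fft] * window
                       for i in range(n_frames)])
    spec = np.abs(np.fft.rfft(frames, axis=1)) ** 2  # power spectrum

    # Triangular filters spaced evenly on the mel scale.
    def hz_to_mel(f):
        return 2595.0 * np.log10(1.0 + f / 700.0)
    def mel_to_hz(m):
        return 700.0 * (10.0 ** (m / 2595.0) - 1.0)
    mel_pts = mel_to_hz(np.linspace(hz_to_mel(0.0), hz_to_mel(sr / 2),
                                    n_mels + 2))
    bins = np.floor((n_fft + 1) * mel_pts / sr).astype(int)
    fb = np.zeros((n_mels, n_fft // 2 + 1))
    for m in range(1, n_mels + 1):
        l, c, r = bins[m - 1], bins[m], bins[m + 1]
        fb[m - 1, l:c] = (np.arange(l, c) - l) / max(c - l, 1)
        fb[m - 1, c:r] = (r - np.arange(c, r)) / max(r - c, 1)

    return np.log(spec @ fb.T + 1e-10)  # (n_frames, n_mels)

# One second of a 440 Hz tone yields 97 frames of 40 mel bands.
sr = 16000
t = np.arange(sr) / sr
mel = log_mel_spectrogram(np.sin(2 * np.pi * 440 * t), sr=sr)
print(mel.shape)  # → (97, 40)
```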

    A New Method for Tracking Modulations in Tonal Music in Audio Data Format

    Cq-profiles are 12-dimensional vectors, each component referring to a pitch class. They can be employed to represent keys. Cq-profiles are calculated with the constant Q filter bank [4]. They have the following advantages: (i) they correspond to probe tone ratings; (ii) calculation is possible in real time; (iii) stability is obtained with respect to sound quality; (iv) they are transposable. By using the cq-profile technique as a simple auditory model in combination with the SOM [11], an arrangement of keys emerges that resembles results from psychological experiments [13] and from music theory [1]. Cq-profiles are reliably applied to modulation tracking by introducing a special distance measure.
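The profile-matching idea can be illustrated by comparing a 12-dimensional pitch-class profile against the twelve transpositions of the Krumhansl–Kessler major-key probe-tone ratings. The Euclidean distance on normalized vectors and the toy input profile below are assumptions for this sketch, not the paper's particular distance measure.

```python
import numpy as np

# Krumhansl–Kessler major-key probe-tone ratings (tonic C at index 0).
MAJOR = np.array([6.35, 2.23, 3.48, 2.33, 4.38, 4.09,
                  2.52, 5.19, 2.39, 3.66, 2.29, 2.88])

def key_distances(profile, template=MAJOR):
    """Distance from a 12-dim pitch-class profile to each of the 12
    transpositions of a key template (smaller = better key match)."""
    p = profile / np.linalg.norm(profile)
    dists = []
    for k in range(12):
        t = np.roll(template, k)          # tonic moved to pitch class k
        t = t / np.linalg.norm(t)
        dists.append(np.linalg.norm(p - t))
    return np.array(dists)

# A profile emphasising the C major scale degrees picks key 0 (C major);
# tracking the argmin frame by frame would follow modulations over time.
c_major_profile = np.array([1, 0, .5, 0, .5, .5, 0, 1, 0, .5, 0, .5])
print(int(np.argmin(key_distances(c_major_profile))))  # → 0
```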

    Instantaneous Harmonic Analysis and its Applications in Automatic Music Transcription

    This thesis presents a novel short-time frequency analysis algorithm, namely Instantaneous Harmonic Analysis (IHA), using a decomposition scheme based on sinusoids. An estimate of the instantaneous amplitude and phase elements of the constituent components of real-valued signals, with respect to a set of reference frequencies, is provided. In the context of musical audio analysis, the instantaneous amplitude is interpreted as the presence of a pitch in time. The thesis examines the potential of improving the automated music analysis process by utilizing the proposed algorithm, targeting two areas: Multiple Fundamental Frequency Estimation (MFFE) and note onset/offset detection. The IHA algorithm uses constant-Q filtering by employing Windowed Sinc Filters (WSFs) and a novel phasor construct. An implementation of WSFs in the continuous model is used. A new relation between the Constant-Q Transform (CQT) and WSFs is presented: it is demonstrated that the CQT can alternatively be implemented by applying a series of logarithmically scaled WSFs while its window function is adjusted accordingly. The relation between the window functions is provided as well. A comparison of the proposed IHA algorithm with WSFs and the CQT demonstrates that the IHA phasor construct delivers better estimates of the instantaneous amplitude and phase lags of the signal components. The thesis also extends the IHA algorithm by employing a generalized kernel function, which by nature yields a non-orthonormal basis. The kernel function represents the timbral information and is used in the MFFE process. An effective algorithm is proposed to overcome the non-orthonormality of the decomposition scheme. To examine the performance improvement of the note onset/offset detection process, the proposed algorithm is used in the context of Automatic Music Transcription (AMT). A prototype of an audio-to-MIDI system is developed and applied to synthetic and real music signals, and the results of these experiments are reported. Additionally, a multi-dimensional generalization of the IHA algorithm is presented: the IHA phasor construct is extended into the hyper-complex space in order to deliver the instantaneous amplitude and multiple phase elements for each dimension.

    Exploiting prior knowledge during automatic key and chord estimation from musical audio

    Chords and keys are two ways of describing music. They are exemplary of a general class of symbolic notations that musicians use to exchange information about a music piece. This information can range from simple tempo indications such as “allegro” to precise instructions for a performer of the music. Concretely, both keys and chords are timed labels that describe the harmony during certain time intervals, where harmony refers to the way music notes sound together. Chords describe the local harmony, whereas keys offer a more global overview and consequently cover a sequence of multiple chords. Common to all music notations is that certain characteristics of the music are described while others are ignored. The adopted level of detail depends on the purpose of the intended information exchange. A simple description such as “menuet”, for example, only serves to roughly describe the character of a music piece. Sheet music, on the other hand, contains precise information about the pitch, discretised information pertaining to timing and limited information about the timbre. Its goal is to permit a performer to recreate the music piece. Even so, the information about timing and timbre still leaves some space for interpretation by the performer. The opposite of a symbolic notation is a music recording. It stores the music in a way that allows for a perfect reproduction. The disadvantage of a music recording is that it does not allow one to manipulate a single aspect of a music piece in isolation, or at least not without degrading the quality of the reproduction. For instance, it is not possible to change the instrumentation in a music recording, even though this would only require the simple change of a few symbols in a symbolic notation. Despite the fundamental differences between a music recording and a symbolic notation, the two are of course intertwined.
    Trained musicians can listen to a music recording (or live music) and write down a symbolic notation of the played piece. This skill allows one, in theory, to create a symbolic notation for each recording in a music collection. In practice, however, this would be too labour-intensive for the large collections that are available these days through online stores or streaming services. Automating the notation process is therefore a necessity, and this is exactly the subject of this thesis. More specifically, this thesis deals with the extraction of keys and chords from a music recording. A database with keys and chords opens up applications that are not possible with a database of music recordings alone. On one hand, chords can be used on their own as a compact representation of a music piece, for example to learn how to play an accompaniment for singing. On the other hand, keys and chords can also be used indirectly to accomplish another goal, such as finding similar pieces. Because music theory has been studied for centuries, a great body of knowledge about keys and chords is available. It is known that consecutive keys and chords form sequences that are anything but random. People have certain expectations that must be fulfilled in order to experience music as pleasant. Keys and chords are also strongly intertwined, as a given key implies that certain chords will likely occur, and a set of given chords implies an encompassing key in return. Consequently, a substantial part of this thesis is concerned with the question of whether musicological knowledge can be embedded in a technical framework in such a way that it helps to improve the automatic recognition of keys and chords. The technical framework adopted in this thesis is built around a hidden Markov model (HMM). This facilitates an easy separation of the different aspects involved in the automatic recognition of keys and chords.
    Most experiments reviewed in the thesis focus on taking into account musicological knowledge about the musical context and about the expected chord duration. Technically speaking, this involves a manipulation of the transition probabilities in the HMMs. To account for the interaction between keys and chords, every HMM state actually represents the combination of a key and a chord label. In the first part of the thesis, a number of alternatives for modelling the context are proposed. In particular, separate key change and chord change models are defined such that they closely mirror the way musicians conceive harmony. Multiple variants are considered that differ in the size of the context that is accounted for and in the knowledge source from which they were compiled. Some models are derived from a music corpus with key and chord notations whereas others follow directly from music theory. In the second part of the thesis, the contextual models are embedded in a system for automatic key and chord estimation. The features used in that system are so-called chroma profiles, which represent the saliences of the pitch classes in the audio signal. These chroma profiles are acoustically modelled by means of templates (idealised profiles) and a distance measure. In addition to these acoustic models and the contextual models developed in the first part, durational models are also required. The latter ensure that the chord and key estimations attain specified mean durations. The resulting system is then used to conduct experiments that provide more insight into how each system component contributes to the ultimate key and chord output quality. During the experimental study, the system complexity is gradually increased, starting from a system containing only an acoustic model of the features, which is subsequently extended, first with duration models and afterwards with contextual models.
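The template-plus-distance acoustic model can be illustrated with binary triad templates and a cosine similarity. The templates and naming scheme below are a common textbook simplification, assumed for the example, not the idealised profiles used in the thesis.

```python
import numpy as np

# Binary chord templates over the 12 pitch classes: major and minor
# triads transposed to every root, each normalised to unit length.
NAMES, TEMPLATES = [], []
for root in range(12):
    for quality, intervals in [("maj", (0, 4, 7)), ("min", (0, 3, 7))]:
        t = np.zeros(12)
        t[[(root + i) % 12 for i in intervals]] = 1.0
        NAMES.append(f"{root}:{quality}")
        TEMPLATES.append(t / np.linalg.norm(t))
TEMPLATES = np.array(TEMPLATES)

def nearest_chord(chroma):
    """Label a chroma profile with the template of smallest cosine
    distance (i.e. greatest cosine similarity)."""
    c = chroma / np.linalg.norm(chroma)
    return NAMES[int(np.argmax(TEMPLATES @ c))]

# A chroma frame dominated by pitch classes C, E and G (0, 4, 7)
# is labelled as a C major triad.
frame = np.array([1, 0, 0, 0, .9, 0, 0, .8, 0, 0, 0, .1])
print(nearest_chord(frame))  # → 0:maj
```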
    The experiments show that taking into account the mean key and mean chord duration is essential to arrive at acceptable results for both key and chord estimation. The effect of using contextual information, however, is highly variable. On one hand, the chord change model has only a limited positive impact on the chord estimation accuracy (two to three percentage points), but this impact is fairly stable across different model variants. On the other hand, the chord change model has a much larger potential to improve the key output quality (up to seventeen percentage points), but only on the condition that the variant of the model is well adapted to the tested music material. Lastly, the key change model has only a negligible influence on the system performance. In the final part of this thesis, a couple of extensions to the previously presented system are proposed and assessed. First, the global mean chord duration is replaced by key-chord specific values, which has a positive effect on the key estimation performance. Next, the HMM system is modified such that the prior chord duration distribution is no longer a geometric distribution but one that better approximates the observed durations in an appropriate data set. This modification leads to a small improvement of the chord estimation performance, but of course, it requires the availability of a suitable data set with chord notations from which to retrieve a target durational distribution. A final experiment demonstrates that increasing the scope of the contextual model only leads to statistically insignificant improvements. On top of that, the required computational load increases greatly.
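The link between an HMM state's self-transition probability and its geometric duration prior (mean duration = 1/(1 − p_stay)) can be sketched with a minimal Viterbi decoder. The two-state space, toy emission scores, and uniform switch distribution below are assumptions for illustration, not the thesis's key-chord model.

```python
import numpy as np

def viterbi(log_emit, mean_dur_frames):
    """Viterbi decoding over T frames and K states. Each state's
    self-transition probability p is chosen so the implied geometric
    duration prior has the target mean: mean = 1/(1-p) => p = 1 - 1/mean."""
    T, K = log_emit.shape
    p_stay = 1.0 - 1.0 / mean_dur_frames
    log_trans = np.full((K, K), np.log((1 - p_stay) / (K - 1)))
    np.fill_diagonal(log_trans, np.log(p_stay))
    delta = log_emit[0].copy()
    back = np.zeros((T, K), dtype=int)
    for t in range(1, T):
        scores = delta[:, None] + log_trans          # (from, to)
        back[t] = np.argmax(scores, axis=0)
        delta = scores[back[t], np.arange(K)] + log_emit[t]
    path = [int(np.argmax(delta))]
    for t in range(T - 1, 0, -1):
        path.append(int(back[t, path[-1]]))
    return path[::-1]

# Per-frame evidence favours state 0 for 6 frames then state 1 for 6,
# with one outlier at frame 3 that the duration prior smooths away.
log_emit = np.log(np.full((12, 2), 0.2))
log_emit[:6, 0] = log_emit[6:, 1] = np.log(0.8)
log_emit[3] = np.log([0.2, 0.8])
print(viterbi(log_emit, mean_dur_frames=5))
# → [0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1]
```

Two extra transitions cost more than the one-frame emission gain, so the decoder keeps state 0 through the outlier; a shorter target mean duration would let it follow the outlier instead.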

    Analyse de structures répétitives dans les séquences musicales (Analysis of Repetitive Structures in Musical Sequences)

    The work presented in this thesis deals with repetitive structure inference from audio signals using string matching techniques. It aims at proposing and evaluating inference algorithms based on a formal study of the notions of similarity and repetition in music. We first present a method for representing audio signals by symbolic strings. We introduce alignment tools enabling similarity estimation between such musical strings, and evaluate the application of these tools to automatic cover song identification. We further adapt a bioinformatics indexing technique to allow efficient assessment of music similarity in large-scale datasets. We then introduce several specific repetitive structures and use the alignment tools to analyse these repetitions. A first structure, namely the repetition of a chosen segment, is retrieved and evaluated in the context of automatic reconstruction of missing audio data. A second structure, namely the major repetition, is defined, retrieved and evaluated against expert annotations, and then as an alternative indexing method for cover song identification. We finally present the problem of repetitive structure inference as addressed in the literature, and propose our own problem statement. We further describe our model and propose an algorithm enabling the identification of a hierarchical music structure. We emphasize the relevance of our method through several examples and by comparing it to the state of the art.
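The alignment-based similarity estimation can be illustrated with a plain Smith–Waterman local alignment over symbolic strings. The scoring parameters and toy sequences below are assumptions for the example, not the thesis's actual alignment scheme.

```python
def local_alignment_score(a, b, match=2, mismatch=-1, gap=-1):
    """Smith-Waterman local alignment score between two symbolic
    strings: the best-scoring pair of substrings, with scores
    clamped at zero so alignments can restart anywhere."""
    H = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    best = 0
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            s = match if a[i - 1] == b[j - 1] else mismatch
            H[i][j] = max(0,
                          H[i - 1][j - 1] + s,   # (mis)match
                          H[i - 1][j] + gap,     # gap in b
                          H[i][j - 1] + gap)     # gap in a
            best = max(best, H[i][j])
    return best

# Sequences sharing the motif "CDEG" score higher than unrelated ones,
# which is the basis for retrieving covers and repeated segments.
print(local_alignment_score("ACDEGB", "FCDEGA"))  # → 8
print(local_alignment_score("ACDEGB", "FFFFFF"))  # → 0
```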