
    Automatic transcription of polyphonic music exploiting temporal evolution

    Automatic music transcription is the process of converting an audio recording into a symbolic representation using musical notation. It has numerous applications in music information retrieval, computational musicology, and the creation of interactive systems. Even for expert musicians, transcribing polyphonic pieces of music is not a trivial task, and while the problem of automatic pitch estimation for monophonic signals is considered to be solved, the creation of an automated system able to transcribe polyphonic music without restrictions on the degree of polyphony or the instrument type remains open. In this thesis, research on automatic transcription is performed by explicitly incorporating information on the temporal evolution of sounds. First efforts address the problem by focusing on signal processing techniques and by proposing audio features utilising temporal characteristics. Techniques for note onset and offset detection are also utilised to improve transcription performance. Subsequent approaches propose transcription models based on shift-invariant probabilistic latent component analysis (SI-PLCA), modelling the temporal evolution of notes in a multiple-instrument setting and supporting frequency modulations in produced notes. Datasets and annotations for transcription research have also been created during this work. The proposed systems have been evaluated both privately and publicly within the Music Information Retrieval Evaluation eXchange (MIREX) framework, and have been shown to outperform several state-of-the-art transcription approaches. The developed techniques have also been employed for other tasks related to music technology, such as key modulation detection, temperament estimation, and automatic piano tutoring. Finally, the proposed music transcription models have also been utilised in a wider context, namely for modelling acoustic scenes.
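    The shift-invariant PLCA models mentioned above extend plain probabilistic latent component analysis, which factorises a magnitude spectrogram into spectral templates, per-component activations and component priors via EM. The sketch below illustrates that plain PLCA factorisation only; it is not the thesis's shift-invariant, multiple-instrument model, and the function name, component count and iteration count are placeholder choices.

```python
# Minimal sketch of plain (non-shift-invariant) PLCA, the family of
# spectrogram factorisation models that SI-PLCA extends. Illustrative only;
# the number of components and iterations are arbitrary.
import numpy as np

def plca(V, n_components=24, n_iter=50, eps=1e-12):
    """Factorise a magnitude spectrogram V (freq x time) as
    P(f, t) ~= sum_z P(z) P(f|z) P(t|z) via EM."""
    F, T = V.shape
    P = V / (V.sum() + eps)                       # treat the spectrogram as a distribution
    rng = np.random.default_rng(0)
    Pf_z = rng.random((F, n_components)); Pf_z /= Pf_z.sum(0, keepdims=True)
    Pt_z = rng.random((T, n_components)); Pt_z /= Pt_z.sum(0, keepdims=True)
    Pz = np.full(n_components, 1.0 / n_components)

    for _ in range(n_iter):
        # E-step: posterior P(z | f, t) for every time-frequency bin
        joint = Pf_z[:, None, :] * Pt_z[None, :, :] * Pz[None, None, :]   # F x T x Z
        post = joint / (joint.sum(-1, keepdims=True) + eps)
        # M-step: reweight by the observed distribution and renormalise
        weighted = P[:, :, None] * post                                   # F x T x Z
        Pf_z = weighted.sum(1); Pf_z /= Pf_z.sum(0, keepdims=True) + eps
        Pt_z = weighted.sum(0); Pt_z /= Pt_z.sum(0, keepdims=True) + eps
        Pz = weighted.sum((0, 1)); Pz /= Pz.sum() + eps
    return Pf_z, Pt_z, Pz   # spectral templates, activations, component priors
```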

    Towards the automated analysis of simple polyphonic music : a knowledge-based approach

    Music understanding is a process closely related to the knowledge and experience of the listener. The amount of knowledge required is relative to the complexity of the task in hand. This dissertation is concerned with the problem of automatically decomposing musical signals into a score-like representation. It proposes that, as with humans, an automatic system requires knowledge about the signal and its expected behaviour to correctly analyse music. The proposed system uses the blackboard architecture to combine the use of knowledge with data provided by the bottom-up processing of the signal's information. Methods are proposed for the estimation of pitches, onset times and durations of notes in simple polyphonic music. A method for onset detection is presented. It provides an alternative to conventional energy-based algorithms by using phase information. Statistical analysis is used to create a detection function that evaluates the expected behaviour of the signal regarding onsets. Two methods for multi-pitch estimation are introduced. The first concentrates on the grouping of harmonic information in the frequency domain. Its performance and limitations emphasise the case for the use of high-level knowledge. This knowledge, in the form of the individual waveforms of a single instrument, is used in the second proposed approach. The method is based on a time-domain linear additive model and presents an alternative to common frequency-domain approaches. Results are presented and discussed for all methods, showing that, if reliably generated, the use of knowledge can significantly improve the quality of the analysis. Funding: Joint Information Systems Committee (JISC) in the UK; National Science Foundation (NSF) in the United States; Fundacion Gran Mariscal Ayacucho in Venezuela.
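    For context, the phase-based onset detection idea mentioned above can be illustrated with a detection function built from the second difference of the unwrapped STFT phase: steady partials keep a nearly constant phase increment, while onsets do not. The sketch below is a generic illustration of that family of methods, not the dissertation's exact algorithm; the frame size, hop size and peak-picking threshold are arbitrary.

```python
# Illustrative phase-deviation onset detection function (generic technique,
# not the dissertation's method). Frame/hop sizes and the threshold are
# arbitrary choices.
import numpy as np

def phase_deviation_odf(x, frame_len=1024, hop=512):
    """Onset detection function from the second difference of unwrapped STFT phase."""
    window = np.hanning(frame_len)
    frames = [x[i:i + frame_len] * window
              for i in range(0, len(x) - frame_len, hop)]
    spec = np.array([np.fft.rfft(f) for f in frames])     # time x freq
    phase = np.unwrap(np.angle(spec), axis=0)
    # Second difference of phase over time; large deviation suggests an onset.
    dev = phase[2:] - 2.0 * phase[1:-1] + phase[:-2]
    return np.abs(dev).mean(axis=1)                       # average over frequency bins

def pick_peaks(odf, k=1.5):
    """Simple statistical peak picking: flag local maxima above mean + k * std."""
    thr = odf.mean() + k * odf.std()
    return [i for i in range(1, len(odf) - 1)
            if odf[i] > thr and odf[i] >= odf[i - 1] and odf[i] >= odf[i + 1]]
```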

    Separation of musical sources and structure from single-channel polyphonic recordings

    EThOS - Electronic Theses Online Service, United Kingdom

    Deep Learning for Audio Signal Processing

    Given the recent surge in developments of deep learning, this article provides a review of the state-of-the-art deep learning techniques for audio signal processing. Speech, music, and environmental sound processing are considered side-by-side, in order to point out similarities and differences between the domains, highlighting general methods, problems, key references, and potential for cross-fertilization between areas. The dominant feature representations (in particular, log-mel spectra and raw waveform) and deep learning models are reviewed, including convolutional neural networks, variants of the long short-term memory architecture, as well as more audio-specific neural network models. Subsequently, prominent deep learning application areas are covered, i.e. audio recognition (automatic speech recognition, music information retrieval, environmental sound detection, localization and tracking) and synthesis and transformation (source separation, audio enhancement, generative models for speech, sound, and music synthesis). Finally, key issues and future questions regarding deep learning applied to audio signal processing are identified. (Comment: 15 pages, 2 PDF figures)
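    As a concrete illustration of the dominant feature representation named in the review, the sketch below computes a log-mel spectrogram with librosa; the sampling rate, FFT size, hop length and mel-band count are common choices assumed here, not values prescribed by the article.

```python
# Hedged sketch: computing a log-mel spectrogram, the feature representation
# highlighted in the review. Parameter values are common defaults, not ones
# prescribed by the article.
import librosa
import numpy as np

def log_mel_spectrogram(path, sr=16000, n_fft=1024, hop_length=512, n_mels=64):
    y, sr = librosa.load(path, sr=sr, mono=True)
    mel = librosa.feature.melspectrogram(y=y, sr=sr, n_fft=n_fft,
                                         hop_length=hop_length, n_mels=n_mels)
    log_mel = librosa.power_to_db(mel, ref=np.max)   # n_mels x n_frames, in dB
    return log_mel   # typically fed to a CNN or LSTM as a 2-D "image"
```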

    Automated page turner for musicians

    An increasing number of musicians are opting to use tablet devices instead of traditional print media for their sheet music, since the digital medium offers the benefit of storing a lot of music in a compact space. The limited screen size of tablet devices makes the music difficult to read, and musicians often opt to display part of the music page at a time. With fewer music lines on display, the musician then has to resort to scrolling through the music to read the entire score. This scrolling is annoying, since the musician needs to remove a hand from the instrument to interact with the tablet, causing a break in the music if this is not done quickly enough or if the tablet is not sufficiently responsive. In this paper, we describe an alternative page turning system that automates the page turning event for the musician. By actively monitoring the musician's on-screen point of regard, the system keeps the musician in the loop, so the page turns are attuned to the musician's position on the score. By analysing the way the musician's gaze switches between the score and the instrument, as well as the way musicians fixate on different parts of the score, we note that musicians often look away from the score and toward their hands, or elsewhere, when playing the instrument. As a result, the eye regions fall outside the field of view of the eye-gaze tracker, giving rise to erratic page turns. To counteract this problem, we create a gaze prediction model that uses Kalman filtering to predict where the musician would be looking on the score. We evaluate our hands-free page turning system using 15 different piano songs with different levels of difficulty, various repeats, and passages in different registers on the piano, thus evaluating the applicability of the page turner under different conditions. Performance of the page turner was quantified through the number of correct page turns, the number of delayed page turns, and the number of mistaken page turns. Of the 289 page turns involved in the experiment, 98.3% were successfully executed and 1.7% were delayed, while no mistaken page turns were observed.
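    The gaze prediction step described above rests on Kalman filtering; the sketch below is a generic constant-velocity Kalman filter over a 2-D on-screen point of regard, offered only as an illustration of that technique. It is not the authors' model, and the state layout and noise covariances are assumptions.

```python
# Generic constant-velocity Kalman filter for a 2-D point of regard, sketched
# to illustrate the kind of gaze prediction described above. Not the paper's
# model; the state layout and noise covariances are assumptions.
import numpy as np

class GazeKalman:
    def __init__(self, dt=1 / 60.0, process_var=50.0, meas_var=5.0):
        # State: [x, y, vx, vy]; measurements: [x, y] from the eye tracker.
        self.x = np.zeros(4)
        self.P = np.eye(4) * 1e3
        self.F = np.array([[1, 0, dt, 0],
                           [0, 1, 0, dt],
                           [0, 0, 1,  0],
                           [0, 0, 0,  1]], dtype=float)
        self.H = np.array([[1, 0, 0, 0],
                           [0, 1, 0, 0]], dtype=float)
        self.Q = np.eye(4) * process_var
        self.R = np.eye(2) * meas_var

    def predict(self):
        """Propagate the state; usable when the tracker temporarily loses the eyes."""
        self.x = self.F @ self.x
        self.P = self.F @ self.P @ self.F.T + self.Q
        return self.x[:2]

    def update(self, z):
        """Fuse a new [x, y] gaze measurement."""
        y = np.asarray(z, dtype=float) - self.H @ self.x
        S = self.H @ self.P @ self.H.T + self.R
        K = self.P @ self.H.T @ np.linalg.inv(S)
        self.x = self.x + K @ y
        self.P = (np.eye(4) - K @ self.H) @ self.P
        return self.x[:2]
```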

    On the automatic segmentation of transcribed words


    Proceedings of the 7th Sound and Music Computing Conference

    Proceedings of the SMC2010 - 7th Sound and Music Computing Conference, July 21st - July 24th 2010

    Real-time Sound Source Separation For Music Applications

    Sound source separation refers to the task of extracting individual sound sources from mixtures of those sources. In this thesis, a novel sound source separation algorithm for musical applications is presented. It leverages the fact that the vast majority of commercially recorded music since the 1950s has been mixed down for two-channel reproduction, more commonly known as stereo. The algorithm presented in Chapter 3 of this thesis requires no prior knowledge or learning and performs separation based purely on azimuth discrimination within the stereo field. The algorithm exploits the use of the pan pot as a means to achieve image localisation within stereophonic recordings. As such, only an interaural intensity difference exists between the left and right channels for a single source. We use gain scaling and phase cancellation techniques to expose frequency-dependent nulls across the azimuth domain, from which source separation and resynthesis are carried out. The algorithm is demonstrated not only to be state of the art in the field of sound source separation but also to be a useful pre-process for other tasks such as music segmentation and surround sound upmixing.
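    The gain scaling and phase cancellation idea described above can be illustrated with a highly simplified frequency-azimuth sketch: scale one channel of a stereo STFT frame by a range of candidate gains and look for the frequency-dependent nulls where the channels cancel. The code below illustrates that idea only and is not the published algorithm; the gain grid, frame length and tolerance are arbitrary.

```python
# Simplified azimuth-discrimination sketch on one stereo STFT frame:
# scale the right channel by candidate gains and look for frequency-dependent
# nulls where the scaled channels cancel. Illustrative only; the gain grid,
# frame length and tolerance are arbitrary.
import numpy as np

def azimuth_plane(left, right, n_gains=50, frame_len=4096):
    """Return (gains, plane) where plane is (n_gains x n_bins); small values
    mean the scaled right channel cancels the left channel, i.e. a source
    panned at that gain dominates that frequency bin."""
    window = np.hanning(frame_len)
    L = np.fft.rfft(left[:frame_len] * window)
    R = np.fft.rfft(right[:frame_len] * window)
    gains = np.linspace(0.0, 1.0, n_gains)            # candidate pan positions
    plane = np.abs(L[None, :] - gains[:, None] * R[None, :])
    return gains, plane

def dominant_bins(gains, plane, g_target, tol=0.05):
    """Pick the frequency bins whose deepest cancellation null lies near a chosen azimuth."""
    null_gain = gains[np.argmin(plane, axis=0)]       # gain of deepest null per bin
    return np.where(np.abs(null_gain - g_target) < tol)[0]
```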

    Group-wise automatic music transcription

    Background: Music transcription is the conversion of musical audio into notation such that a musician can recreate the piece. Automatic music transcription (AMT) is the automation of this process. Current AMT algorithms produce a less musically meaningful transcription than human transcribers; however, AMT performs better at predicting the notes present in a short time frame. Group-wise automatic music transcription (GWAMT) uses several renditions of a piece to give a single transcription. Aims: The main aim was to perform investigations into GWAMT. Secondary aims included comparing methods for GWAMT at the frame level and considering the impact of GWAMT on the broader field of AMT. Method(s)/Procedures: GWAMT transcription is split into three stages: transcription, alignment and combination. Transcription is performed by splitting the piece into frames and using a classifier to identify the notes present. Convolutional neural networks (CNNs) are used with a novel training methodology and architecture. Different renditions of the same piece have corresponding notes occurring at different times; in order to match corresponding frames, methods for the alignment of multiple renditions are used. Several methods were compared: pairwise alignment, progressive alignment and a new method, iterative alignment. The effect of when the aligned features are combined (early/late), and how (majority vote, linear opinion pool, logarithmic opinion pool, max, median), is investigated. Results: The developed method for frame-level transcription achieves state-of-the-art transcription accuracy on the MAPS database with an F1-score of 76.67%. Experiments on GWAMT show that the F1-score can be improved by between 0.005 and 0.01 using the majority vote and logarithmic pool combination methods. Conclusions/Implications: These experiments show that group-wise frame-level transcription can improve the transcription when there are different tempos, noise levels, dynamic ranges and reverbs between the clips. They also demonstrate a future application of GWAMT to individual pieces with repeated segments.
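    The late-combination rules compared above (majority vote, linear opinion pool, logarithmic opinion pool) can be illustrated on aligned frame-level note probabilities from several renditions. The sketch below assumes a (renditions x frames x notes) probability array, a 0.5 decision threshold and uniform weights, none of which are claimed to be the thesis's actual settings.

```python
# Hedged sketch of the late-combination rules mentioned above, applied to
# aligned frame-level note probabilities from several renditions.
# probs has shape (n_renditions, n_frames, n_notes); values in (0, 1).
# The 0.5 threshold and uniform weights are assumptions.
import numpy as np

def majority_vote(probs, thr=0.5):
    votes = (probs > thr).sum(axis=0)
    return votes > probs.shape[0] / 2.0              # note on if most renditions agree

def linear_opinion_pool(probs, thr=0.5):
    return probs.mean(axis=0) > thr                  # arithmetic mean of probabilities

def log_opinion_pool(probs, thr=0.5, eps=1e-9):
    # Geometric mean of the probabilities (equal weights), then threshold.
    geo = np.exp(np.log(np.clip(probs, eps, 1.0)).mean(axis=0))
    return geo > thr
```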

    Reverberation: models, estimation and application

    The use of reverberation models is required in many applications such as acoustic measurements, speech dereverberation and robust automatic speech recognition. The aim of this thesis is to investigate different models and propose a perceptually relevant reverberation model with suitable parameter estimation techniques for different applications. Reverberation can be modelled in both the time and frequency domains. The model parameters give direct information about both physical and perceptual characteristics. These characteristics create a multidimensional parameter space of reverberation, which can to a large extent be captured by a time-frequency domain model. In this thesis, the relationship between physical and perceptual model parameters will be discussed. In the first application, an intrusive technique is proposed to measure reverberance, the perception of reverberation, and colouration. The room decay rate parameter is of particular interest. In practical applications, a blind estimate of the decay rate of acoustic energy in a room is required. A statistical model for the distribution of the decay rate of the reverberant signal, named the eagleMax distribution, is proposed. The eagleMax distribution describes the reverberant speech decay rate as a random variable that is the maximum of the room decay rate and the anechoic speech decay rate. Three methods were developed to estimate the mean room decay rate from the eagleMax distributions alone. The estimated room decay rates form a reverberation model that will be discussed in the context of room acoustic measurements, speech dereverberation and robust automatic speech recognition individually.
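    The eagleMax construction above treats each observed reverberant-speech decay rate as the maximum of a room decay rate and an anechoic-speech decay rate. The short Monte Carlo sketch below only illustrates that max-of-two-random-variables idea, using arbitrary toy distributions rather than the thesis's fitted model.

```python
# Toy Monte Carlo illustration of the max-of-two-random-variables idea behind
# the eagleMax model: each observed decay rate is the maximum of a room decay
# rate and an anechoic speech decay rate. Distributions and parameters below
# are arbitrary toy choices, not the thesis's fitted model.
import numpy as np

rng = np.random.default_rng(0)
n = 100_000
room_decay = rng.normal(loc=-30.0, scale=3.0, size=n)      # dB/s, toy values
speech_decay = rng.normal(loc=-80.0, scale=25.0, size=n)   # dB/s, toy values
observed = np.maximum(room_decay, speech_decay)            # reverberant-speech decay rates

# When the anechoic speech decays faster than the room, the maximum equals the
# room decay rate, which is what makes the room rate recoverable from the
# observed distribution alone.
print("mean observed decay rate:", observed.mean())
print("fraction of samples dominated by the room:", (room_decay > speech_decay).mean())
```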