24,807 research outputs found

    Sound Event Detection by Exploring Audio Sequence Modelling

    Everyday sounds in real-world environments are a powerful source of information by which humans can interact with their environments. Humans can infer what is happening around them by listening to everyday sounds. At the same time, it is a challenging task for a computer algorithm in a smart device to automatically recognise, understand, and interpret everyday sounds. Sound event detection (SED) is the process of transcribing an audio recording into sound event tags with onset and offset time values. This involves classification and segmentation of sound events in the given audio recording. SED has numerous applications in everyday life, including security and surveillance, automation, healthcare monitoring, multimedia information retrieval, and assisted living technologies. SED is to everyday sounds what automatic speech recognition (ASR) is to speech and automatic music transcription (AMT) is to music. The fundamental questions in designing a sound recognition system are which portion of a sound event the system should analyse, and what proportion of a sound event the system should process in order to claim a confident detection of that particular sound event. While the classification of sound events has improved considerably in recent years, the temporal segmentation of sound events is considered not to have improved to the same extent. The aim of this thesis is to propose and develop methods to improve the segmentation and classification of everyday sound events in SED models. In particular, this thesis explores the segmentation of sound events by investigating audio sequence encoding-based and audio sequence modelling-based methods, in an effort to improve the overall sound event detection performance. In the first phase of this thesis, efforts are put towards improving sound event detection by explicitly conditioning the audio sequence representations of an SED model using sound activity detection (SAD) and onset detection.
To achieve this, we propose multi-task learning-based SED models in which SAD and onset detection are used as auxiliary tasks for the SED task. The next part of this thesis explores self-attention-based audio sequence modelling, which aggregates audio representations based on temporal relations within and between sound events, scored on the basis of the similarity of sound event portions in audio event sequences. We propose SED models that include memory-controlled, adaptive, dynamic, and source separation-induced self-attention variants, with the aim to improve overall sound recognition
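
As an illustration of what "sound event tags with onset and offset time values" means in practice, here is a minimal sketch that turns frame-wise activity probabilities for one event class into (onset, offset) pairs by thresholding. It is not the method proposed in the thesis; the function name `frames_to_events`, the threshold of 0.5, and the 20 ms hop are assumptions made for this example only.

```python
def frames_to_events(probs, threshold=0.5, hop_seconds=0.02):
    """Convert frame-wise probabilities for one class into (onset, offset) times.

    probs       : list of per-frame activity probabilities (hypothetical model output)
    threshold   : assumed decision threshold; a frame is "active" at or above it
    hop_seconds : assumed time step between consecutive frames
    """
    events = []
    onset = None  # index of the frame where the current event started
    for i, p in enumerate(probs):
        active = p >= threshold
        if active and onset is None:
            onset = i  # event begins at this frame
        elif not active and onset is not None:
            events.append((onset * hop_seconds, i * hop_seconds))
            onset = None  # event ended just before this frame
    if onset is not None:
        # Event still active at the end of the recording.
        events.append((onset * hop_seconds, len(probs) * hop_seconds))
    return events


if __name__ == "__main__":
    print(frames_to_events([0.1, 0.9, 0.8, 0.2, 0.7], hop_seconds=1.0))
```

A real SED system would typically add smoothing or minimum-duration constraints before reporting events; those are omitted here to keep the sketch short.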

    Mandarin Singing Voice Synthesis Based on Harmonic Plus Noise Model and Singing Expression Analysis

    The purpose of this study is to investigate how humans interpret musical scores expressively, and then design machines that sing like humans. We consider six factors that have a strong influence on the expression of human singing. The factors are related to the acoustic, phonetic, and musical features of a real singing signal. Given real singing voices recorded following the MIDI scores and lyrics, our analysis module can extract the expression parameters from the real singing signals semi-automatically. The expression parameters are used to control the singing voice synthesis (SVS) system for Mandarin Chinese, which is based on the harmonic plus noise model (HNM). The results of perceptual experiments show that integrating the expression factors into the SVS system yields a notable improvement in perceptual naturalness, clearness, and expressiveness. By one-to-one mapping of the real singing signal and expression controls to the synthesizer, our SVS system can simulate the interpretation of a real singer with the timbre of a speaker. Comment: 8 pages, technical report

    Age and schooling effects on early literacy and phoneme awareness

    Previous research on age and schooling effects is largely restricted to studies of children who begin formal schooling from the age of 6, and the measures of phoneme awareness used have typically lacked sensitivity for beginning readers. Our study addresses these issues by testing children aged 4-6 (first two years of formal schooling in the UK) on a sensitive dynamic measure of phoneme awareness and tests of early literacy. There were significant effects of both age and schooling on dynamic and static measures of phoneme awareness, word reading, spelling and letter-name knowledge but no significant age × time interactions. This indicates that older children within this age group generally outperform their younger classmates (although they do not make faster progress), and that this advantage is developed prior to the start of school

    A joint separation-classification model for sound event detection of weakly labelled data

    Source separation (SS) aims to separate individual sources from an audio recording. Sound event detection (SED) aims to detect sound events in an audio recording. We propose a joint separation-classification (JSC) model trained only on weakly labelled audio data, that is, only the tags of an audio recording are known while the times of the events are unknown. First, we propose a separation mapping from the time-frequency (T-F) representation of an audio recording to the T-F segmentation masks of the audio events. Second, a classification mapping is built from each T-F segmentation mask to the presence probability of each audio event. In the source separation stage, the sources of audio events and the times of sound events can be obtained from the T-F segmentation masks. The proposed method achieves an equal error rate (EER) of 0.14 in SED, outperforming a deep neural network baseline of 0.29. A source separation SDR of 8.08 dB is obtained by using global weighted rank pooling (GWRP) as the probability mapping, outperforming the global max pooling (GMP) based probability mapping, which gives an SDR of 0.03 dB. Source code of our work is published. Comment: Accepted by ICASSP 201
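
To make the contrast between the two probability mappings concrete, the following sketch implements global max pooling and global weighted rank pooling over a flat list of mask values, using the commonly cited GWRP definition: sort the values in descending order, weight the j-th largest by r^j, and normalise by the sum of the weights. The abstract does not state the formula or the value of r, so both are assumptions here, and the function names are hypothetical. This is not the authors' published code.

```python
def global_max_pooling(values):
    """GMP: the presence probability is the single largest mask value."""
    return max(values)


def global_weighted_rank_pooling(values, r=0.5):
    """GWRP: weighted average of the values ranked in descending order.

    r is an assumed hyperparameter in [0, 1]:
      r = 0 reduces to global max pooling,
      r = 1 reduces to global average pooling.
    """
    ranked = sorted(values, reverse=True)
    weights = [r ** j for j in range(len(ranked))]  # 1, r, r^2, ...
    return sum(w * v for w, v in zip(weights, ranked)) / sum(weights)


if __name__ == "__main__":
    mask = [0.2, 0.8, 0.4]  # hypothetical flattened T-F mask values
    print(global_max_pooling(mask))
    print(global_weighted_rank_pooling(mask, r=0.5))
```

Because GWRP draws on many T-F bins rather than only the maximum, gradients reach more of the segmentation mask during training, which is one plausible reading of why the abstract reports a much higher SDR for it than for GMP.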

    A new approach to onset detection: towards an empirical grounding of theoretical and speculative ideologies of musical performance

    This article assesses aspects of the current state of a project which aims, with the help of computers and computer software, to segment soundfiles of vocal melodies into their component notes, identifying precisely when the onset of each note occurs, and then tracking the pitch trajectory of each note, especially in melodies employing a variety of non-standard temperaments, in which musical intervals smaller than 100 cents are ubiquitous. From there, we may proceed further, to describe many other “micro-features” of each of the notes, but for now our focus is on the onset times and pitch trajectories
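
For readers unfamiliar with the unit, an interval in cents is 1200 times the base-2 logarithm of the frequency ratio, so an equal-tempered semitone is 100 cents. The sketch below shows the conversion; the function name and the example frequencies are assumptions for illustration and are not taken from the article.

```python
import math


def interval_in_cents(f_low_hz, f_high_hz):
    """Size of the interval between two frequencies, in cents."""
    return 1200.0 * math.log2(f_high_hz / f_low_hz)


if __name__ == "__main__":
    a4 = 440.0
    semitone_up = a4 * 2 ** (1 / 12)      # equal-tempered semitone above A4
    quarter_tone_up = a4 * 2 ** (1 / 24)  # an interval smaller than 100 cents
    print(interval_in_cents(a4, semitone_up))
    print(interval_in_cents(a4, quarter_tone_up))
```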

    Surfing the Waves: Live Audio Mosaicing of an Electric Bass Performance as a Corpus Browsing Interface

    In this paper, the authors describe how they use an electric bass as a subtle, expressive and intuitive interface to browse the rich sample bank available to most laptop owners. This is achieved by audio mosaicing of the live bass performance audio, through corpus-based concatenative synthesis (CBCS) techniques, allowing a mapping of the multi-dimensional expressivity of the performance onto foreign audio material, thus recycling the virtuosity acquired on the electric instrument with a trivial learning curve. This design hypothesis is contextualised and assessed within the Sandbox#n series of bass+laptop meta-instruments, and the authors describe technical means of the implementation through the use of the open-source CataRT CBCS system adapted for live mosaicing. They also discuss their encouraging early results and provide a list of further explorations to be made with that rich new interface