
    On Using Backpropagation for Speech Texture Generation and Voice Conversion

    Inspired by recent work on neural network image generation that relies on backpropagation towards the network inputs, we present a proof-of-concept system for speech texture synthesis and voice conversion based on two mechanisms: approximate inversion of the representation learned by a speech recognition neural network, and matching statistics of neuron activations between different source and target utterances. As in image texture synthesis and neural style transfer, the system works by optimizing a cost function with respect to the input waveform samples. To this end we use a differentiable mel-filterbank feature extraction pipeline and train a convolutional CTC speech recognition network. Our system is able to extract speaker characteristics from very limited amounts of target speaker data, as little as a few seconds, and can be used to generate realistic speech babble or reconstruct an utterance in a different voice. Comment: Accepted to ICASSP 201
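
    The core mechanism is generic enough to sketch: freeze a trained network, compute activation statistics (e.g. per-layer Gram matrices) for a target utterance, then run gradient descent on the input waveform itself so its activations match those statistics. The minimal PyTorch sketch below illustrates this under assumptions: `model` (returning a list of layer activations) and `frontend` (a differentiable mel-filterbank) are placeholders, not the paper's code.

```python
# Minimal sketch of input-space optimization for activation-statistic
# matching; `model` and `frontend` are assumed placeholders.
import torch

def gram(acts):
    # Channel-covariance "texture" statistic of a (time, channels) activation map.
    a = acts.reshape(-1, acts.shape[-1])
    return a.T @ a / a.shape[0]

def synthesize(target_wav, model, frontend, steps=500, lr=0.01):
    # Target statistics are fixed; only the input waveform is optimized.
    with torch.no_grad():
        target_stats = [gram(a) for a in model(frontend(target_wav))]
    wav = torch.randn_like(target_wav, requires_grad=True)  # start from noise
    opt = torch.optim.Adam([wav], lr=lr)
    for _ in range(steps):
        opt.zero_grad()
        acts = model(frontend(wav))  # frontend must be differentiable end to end
        loss = sum(((gram(a) - t) ** 2).sum()
                   for a, t in zip(acts, target_stats))
        loss.backward()              # gradients flow all the way to the samples
        opt.step()
    return wav.detach()
```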

    Contextual modulation of primary visual cortex by auditory signals

    Early visual cortex receives non-feedforward input from lateral and top-down connections (Muckli & Petro 2013 Curr. Opin. Neurobiol. 23, 195–201. (doi:10.1016/j.conb.2013.01.020)), including long-range projections from auditory areas. Early visual cortex can code for high-level auditory information, with neural patterns representing natural sound stimulation (Vetter et al. 2014 Curr. Biol. 24, 1256–1262. (doi:10.1016/j.cub.2014.04.020)). We discuss a number of questions arising from these findings. What is the adaptive function of bimodal representations in visual cortex? What type of information projects from auditory to visual cortex? What are the anatomical constraints of auditory information in V1, for example, periphery versus fovea, superficial versus deep cortical layers? Is there a putative neural mechanism we can infer from human neuroimaging data and recent theoretical accounts of cortex? We also present data showing that we can read out high-level auditory information from the activation patterns of early visual cortex even when visual cortex receives simple visual stimulation, suggesting independent channels for visual and auditory signals in V1. We speculate which cellular mechanisms allow V1 to be contextually modulated by auditory input to facilitate perception, cognition and behaviour. Beyond cortical feedback that facilitates perception, we argue that there is also feedback serving counterfactual processing during imagery, dreaming and mind wandering, which is not relevant for immediate perception but for behaviour and cognition over a longer time frame. This article is part of the themed issue 'Auditory and visual scene analysis'.
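
    "Reading out" auditory information from V1 patterns refers to multivariate pattern classification on voxel activity. The sketch below shows the generic form of such a decoding analysis, cross-validated classification of sound category from voxel patterns, using placeholder data; it is illustrative only and not the authors' pipeline.

```python
# Generic cross-validated decoding sketch; the data and labels here are
# synthetic placeholders, not the authors' fMRI dataset.
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import LinearSVC

rng = np.random.default_rng(0)
X = rng.standard_normal((60, 500))  # 60 trials x 500 V1 voxels (placeholder)
y = np.repeat([0, 1, 2], 20)        # three auditory scene categories

clf = make_pipeline(StandardScaler(), LinearSVC())
scores = cross_val_score(clf, X, y, cv=5)  # leave-runs-out CV in real studies
print(f"decoding accuracy: {scores.mean():.2f} (chance = 0.33)")
```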

    A Highly Robust Audio Monitoring System for Radio Broadcasting

    Proposing a novel approach for monitoring songs on radio broadcasting channels is very important for the interests of singers, writers and musicians in the music industry. Singers, writers and musicians hold intellectual property rights to their songs broadcast over radio channels, and under these rights they should be paid whenever their songs are broadcast. We therefore propose a real-time audio monitoring approach to this problem, built around our own audio recognition algorithm. It is easy to recognize a song when the original high-quality recording is provided as input, but such input cannot be expected from radio channels, since many transformations are possible before the signal reaches the listener: added environmental noise, commercials overlaid on the song as watermarks, songs played back-to-back without intervening silence, partial playback, playback at varied speeds, and so on. These transformations alter the signature of a particular song and make the problem even more difficult. The algorithm we propose is resistant to noise and distortion and can recognize a short segment of a song broadcast over a radio channel. At the end of processing, the system generates a descriptive report that includes, for every song played on every monitored radio channel in a given period, its title, singer, writer, composer, the number of times it was played, and when it was played. We evaluated the system against various types of real-time scenarios and achieved a high overall accuracy of 96%.
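
    The paper does not publish its recognition algorithm, but the standard way to obtain this kind of robustness to noise, distortion and partial playback is landmark fingerprinting: hash pairs of spectrogram peaks, then match a query by counting hash hits that agree on a time offset. A minimal sketch of that general technique follows; all parameters are illustrative, not the system's actual values.

```python
# Generic spectral-peak fingerprint sketch (Shazam-style constellation
# hashing); illustrative of the technique, not the paper's algorithm.
import numpy as np
from scipy.ndimage import maximum_filter
from scipy.signal import stft

def fingerprints(wav, sr, fan_out=5):
    # Spectrogram magnitude; peaks are local maxima above the mean level.
    f, t, Z = stft(wav, fs=sr, nperseg=1024)
    S = np.abs(Z)
    peaks = (S == maximum_filter(S, size=20)) & (S > S.mean())
    fi, ti = np.nonzero(peaks)
    order = np.argsort(ti)
    fi, ti = fi[order], ti[order]
    hashes = []
    for i in range(len(ti)):
        # Pair each anchor peak with a few peaks that follow it in time.
        for j in range(i + 1, min(i + 1 + fan_out, len(ti))):
            dt = int(ti[j] - ti[i])
            if dt > 0:
                hashes.append(((int(fi[i]), int(fi[j]), dt), int(ti[i])))
    # Match a query by counting hashes that agree on the anchor-time offset;
    # a large consistent count identifies the song even from a short segment.
    return hashes
```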

    Vocal Source Separation for Carnatic Music

    Carnatic music is a classical music form that originates in the south of India and differs markedly from Western genres. Music Information Retrieval (MIR) has predominantly been applied to Western musical genres and does not adapt readily to non-Western styles like Carnatic music, owing to fundamental differences in melody, rhythm, instrumentation, and the nature of compositions and improvisations. These conceptual differences have given rise to MIR tasks specific to Carnatic music, and researchers have combined domain knowledge with technology-driven ideas to tackle tasks like melodic analysis, rhythmic analysis and structural segmentation. Melodic analysis of Carnatic music has been a cornerstone of this research and relies heavily on the singing voice, because the singer carries the main melody. The problem is that the singing voice is not isolated: it is accompanied by melodic, percussion and drone instruments. Separating the singing voice from the accompanying instruments typically suffers from bleeding of the accompanying instruments and loss of melodic information, which in turn has an adverse effect on the melodic analysis. The datasets used for Carnatic MIR are concert recordings of different artistes with accompanying instruments, and clean isolated singing-voice tracks are lacking. Existing source separation models are trained extensively on multi-track audio from the rock and pop genres and do not generalize well to Carnatic music. How do we improve singing-voice source separation for Carnatic music given these constraints? This work contributes: 1) a dataset of isolated Carnatic music stems; 2) reuse of multi-track audio with bleeding from the Saraga dataset; 3) retraining and fine-tuning of existing state-of-the-art source separation models. We hope this effort to improve source separation for Carnatic music can help overcome the existing shortcomings, generalize well to the Carnatic music datasets in the literature, and in turn improve melodic analysis of this music culture.
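
    Contribution 3 amounts to continuing training of a pretrained separator on the new Carnatic stems. A generic PyTorch sketch of such a fine-tuning loop is below; `model` and `loader` are placeholders, and the thesis does not prescribe this exact recipe.

```python
# Generic fine-tuning sketch for a pretrained vocal separator on Carnatic
# stems; model, loader and loss choice are assumptions for illustration.
import torch
import torch.nn.functional as F

def fine_tune(model, loader, epochs=5, lr=1e-4, device="cuda"):
    model.to(device).train()
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    for epoch in range(epochs):
        for mixture, vocals in loader:   # tensors of (batch, channels, samples)
            mixture, vocals = mixture.to(device), vocals.to(device)
            est = model(mixture)         # estimated isolated vocal stem
            loss = F.l1_loss(est, vocals)  # waveform-domain L1, a common choice
            opt.zero_grad()
            loss.backward()
            opt.step()
        print(f"epoch {epoch}: L1 loss {loss.item():.4f}")
    return model
```

    A small learning rate matters here: the pretrained weights already encode general separation ability from rock/pop multi-tracks, and the goal is to adapt them to Carnatic timbres (drone, mridangam, violin) without erasing that prior.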

    Proceedings of the second "international Traveling Workshop on Interactions between Sparse models and Technology" (iTWIST'14)

    The implicit objective of the biennial "international Traveling Workshop on Interactions between Sparse models and Technology" (iTWIST) is to foster collaboration between international scientific teams by disseminating ideas through both specific oral/poster presentations and free discussions. For its second edition, the iTWIST workshop took place in the medieval and picturesque town of Namur in Belgium, from Wednesday August 27th to Friday August 29th, 2014. The workshop was conveniently located in "The Arsenal" building, within walking distance of both hotels and the town center. iTWIST'14 gathered about 70 international participants and featured 9 invited talks, 10 oral presentations, and 14 posters on the following themes, all related to the theory, application and generalization of the "sparsity paradigm": Sparsity-driven data sensing and processing; Union of low-dimensional subspaces; Beyond linear and convex inverse problems; Matrix/manifold/graph sensing/processing; Blind inverse problems and dictionary learning; Sparsity and computational neuroscience; Information theory, geometry and randomness; Complexity/accuracy tradeoffs in numerical methods; Sparsity? What's next?; Sparse machine learning and inference. Comment: 69 pages, 24 extended abstracts, iTWIST'14 website: http://sites.google.com/site/itwist1

    Predicting Audio Advertisement Quality

    Online audio advertising is a particular form of advertising used abundantly in online music streaming services. These platforms tend to host tens of thousands of unique audio advertisements (ads), and providing high-quality ads ensures a better user experience and results in longer user engagement. The automatic assessment of these ads is therefore an important step toward audio ad ranking and better audio ad creation. In this paper we propose one way to measure the quality of audio ads using a proxy metric called Long Click Rate (LCR), defined as the amount of time a user engages with the follow-up display ad (shown while the audio ad is playing) divided by the number of impressions. We then focus on predicting audio ad quality using only acoustic features, such as harmony, rhythm, and timbre, extracted from the raw waveform. We discuss how the characteristics of the sound can be connected to concepts such as the clarity of the audio ad's message, its trustworthiness, and so on. Finally, we propose a new deep learning model for audio ad quality prediction, which outperforms the other discussed models trained on hand-crafted features. To the best of our knowledge, this is the first large-scale audio ad quality prediction study. Comment: WSDM '18 Proceedings of the Eleventh ACM International Conference on Web Search and Data Mining, 9 pages
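
    As a concrete reading of the proxy metric, the sketch below aggregates engagement time and impressions per ad and takes their ratio, following the definition quoted in the abstract; the field names are assumptions, not the paper's data schema.

```python
# Sketch of the LCR proxy label as described in the abstract: engagement
# time with the follow-up display ad, normalized by impressions.
# The event field names ("ad_id", "impressions", "engagement_secs") are
# hypothetical placeholders.
from collections import defaultdict

def long_click_rate(events):
    """events: iterable of dicts like
    {"ad_id": ..., "impressions": int, "engagement_secs": float}."""
    engage = defaultdict(float)
    impressions = defaultdict(int)
    for e in events:
        engage[e["ad_id"]] += e["engagement_secs"]
        impressions[e["ad_id"]] += e["impressions"]
    return {ad: engage[ad] / impressions[ad]
            for ad in impressions if impressions[ad] > 0}
```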