2,177 research outputs found
Deep Learning Techniques for Music Generation -- A Survey
This paper is a survey and an analysis of different ways of using deep
learning (deep artificial neural networks) to generate musical content. We
propose a methodology based on five dimensions for our analysis:
Objective - What musical content is to be generated? Examples are: melody,
polyphony, accompaniment or counterpoint. - For what destination and for what
use? To be performed by a human(s) (in the case of a musical score), or by a
machine (in the case of an audio file).
Representation - What are the concepts to be manipulated? Examples are:
waveform, spectrogram, note, chord, meter and beat. - What format is to be
used? Examples are: MIDI, piano roll or text. - How will the representation be
encoded? Examples are: scalar, one-hot or many-hot.
Architecture - What type(s) of deep neural network is (are) to be used?
Examples are: feedforward network, recurrent network, autoencoder or generative
adversarial networks.
Challenge - What are the limitations and open challenges? Examples are:
variability, interactivity and creativity.
Strategy - How do we model and control the process of generation? Examples
are: single-step feedforward, iterative feedforward, sampling or input
manipulation.
For each dimension, we conduct a comparative analysis of various models and
techniques and we propose some tentative multidimensional typology. This
typology is bottom-up, based on the analysis of many existing deep-learning
based systems for music generation selected from the relevant literature. These
systems are described and are used to exemplify the various choices of
objective, representation, architecture, challenge and strategy. The last
section includes some discussion and some prospects.Comment: 209 pages. This paper is a simplified version of the book: J.-P.
Briot, G. Hadjeres and F.-D. Pachet, Deep Learning Techniques for Music
Generation, Computational Synthesis and Creative Systems, Springer, 201
English Conversational Telephone Speech Recognition by Humans and Machines
One of the most difficult speech recognition tasks is accurate recognition of
human to human communication. Advances in deep learning over the last few years
have produced major speech recognition improvements on the representative
Switchboard conversational corpus. Word error rates that just a few years ago
were 14% have dropped to 8.0%, then 6.6% and most recently 5.8%, and are now
believed to be within striking range of human performance. This then raises two
issues - what IS human performance, and how far down can we still drive speech
recognition error rates? A recent paper by Microsoft suggests that we have
already achieved human performance. In trying to verify this statement, we
performed an independent set of human performance measurements on two
conversational tasks and found that human performance may be considerably
better than what was earlier reported, giving the community a significantly
harder goal to achieve. We also report on our own efforts in this area,
presenting a set of acoustic and language modeling techniques that lowered the
word error rate of our own English conversational telephone LVCSR system to the
level of 5.5%/10.3% on the Switchboard/CallHome subsets of the Hub5 2000
evaluation, which - at least at the writing of this paper - is a new
performance milestone (albeit not at what we measure to be human performance!).
On the acoustic side, we use a score fusion of three models: one LSTM with
multiple feature inputs, a second LSTM trained with speaker-adversarial
multi-task learning and a third residual net (ResNet) with 25 convolutional
layers and time-dilated convolutions. On the language modeling side, we use
word and character LSTMs and convolutional WaveNet-style language models
On Using Backpropagation for Speech Texture Generation and Voice Conversion
Inspired by recent work on neural network image generation which rely on
backpropagation towards the network inputs, we present a proof-of-concept
system for speech texture synthesis and voice conversion based on two
mechanisms: approximate inversion of the representation learned by a speech
recognition neural network, and on matching statistics of neuron activations
between different source and target utterances. Similar to image texture
synthesis and neural style transfer, the system works by optimizing a cost
function with respect to the input waveform samples. To this end we use a
differentiable mel-filterbank feature extraction pipeline and train a
convolutional CTC speech recognition network. Our system is able to extract
speaker characteristics from very limited amounts of target speaker data, as
little as a few seconds, and can be used to generate realistic speech babble or
reconstruct an utterance in a different voice.Comment: Accepted to ICASSP 201
Generating Audio Using Recurrent Neural Networks
Long Short Term Memory cells are a type of recurrent neural network that perform well when predicting sequence data. This works presents four approaches to modeling audio data. The models are trained to predict either raw audio samples or magnitude spectrum windows based on prior input audio. In a process called sampling, the models can then be employed to generate new audio using what they learned about the data they were trained on.
Four methods for sampling are presented. The first has the model predict a vector for each vector in the input. The second has the model predict one magnitude spectrum window for each magnitude spectrum window in the input. The third has the model predict a single audio sample for each input vector. The fourth has the model predict a single sample for the entire input matrix.
The end result is a method by which novel audio can be generated. This method differs from other synthesis techniques in that the content of the generated audio is not easily foreseeable. The various approaches to sampling afford different levels of control over the output. The biggest drawback to creating audio using this method is that the process is often too slow to run in real time.
Finally two applications are presented that allow the user to easily create and launch jobs. Four musical compositions are presented that use the sounds created using predictions from the models. The process for generating the source materials for each piece and the manipulations that were made to them after they were generated are also described
Analyzing and Interpreting Neural Networks for NLP: A Report on the First BlackboxNLP Workshop
The EMNLP 2018 workshop BlackboxNLP was dedicated to resources and techniques
specifically developed for analyzing and understanding the inner-workings and
representations acquired by neural models of language. Approaches included:
systematic manipulation of input to neural networks and investigating the
impact on their performance, testing whether interpretable knowledge can be
decoded from intermediate representations acquired by neural networks,
proposing modifications to neural network architectures to make their knowledge
state or generated output more explainable, and examining the performance of
networks on simplified or formal languages. Here we review a number of
representative studies in each category
- …