Style Separation and Synthesis via Generative Adversarial Networks
Style synthesis has attracted great interest recently, while few works focus on
its dual problem, "style separation". In this paper, we propose the Style
Separation and Synthesis Generative Adversarial Network (S3-GAN) to
simultaneously implement style separation and style synthesis on object
photographs of specific categories. Based on the assumption that the object
photographs lie on a manifold, and the contents and styles are independent, we
employ S3-GAN to build mappings between the manifold and a latent vector space
for separating and synthesizing the contents and styles. The S3-GAN consists of
an encoder network, a generator network, and an adversarial network. The
encoder network performs style separation by mapping an object photograph to a
latent vector. Two halves of the latent vector represent the content and style,
respectively. The generator network performs style synthesis by taking a
concatenated vector as input. The concatenated vector contains the style half
vector of the style target image and the content half vector of the content
target image. Once the images are obtained from the generator network, an
adversarial network is imposed to make them more photo-realistic.
Experiments on the CelebA and UT Zappos 50K datasets demonstrate that the S3-GAN
can perform style separation and synthesis simultaneously, and can
capture various styles in a single model.
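The latent-vector recombination described in this abstract can be illustrated with a minimal sketch. It covers only the split-and-concatenate step, not the encoder, generator, or adversarial networks. The function names `split_latent` and `swap_style` are hypothetical, and placing the content half first is an assumption read from the word "respectively" in the abstract.

```python
from typing import List, Tuple

Vector = List[float]


def split_latent(z: Vector) -> Tuple[Vector, Vector]:
    """Split a latent vector into (content_half, style_half).

    Assumption: the first half holds content and the second half holds style.
    """
    if len(z) % 2 != 0:
        raise ValueError("latent vector must have even length")
    mid = len(z) // 2
    return z[:mid], z[mid:]


def swap_style(z_content_image: Vector, z_style_image: Vector) -> Vector:
    """Build the generator input: content half of one image's latent vector
    concatenated with the style half of another image's latent vector."""
    if len(z_content_image) != len(z_style_image):
        raise ValueError("latent vectors must have the same length")
    content, _ = split_latent(z_content_image)
    _, style = split_latent(z_style_image)
    return content + style


# Toy latent vectors standing in for encoder outputs (hypothetical values).
z_a = [1.0, 2.0, 3.0, 4.0]
z_b = [5.0, 6.0, 7.0, 8.0]
mixed = swap_style(z_a, z_b)
```

In a full system the vector `mixed` would be passed to the generator network; here it only shows which halves end up in the concatenated input.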
Deep Learning Techniques for Music Generation -- A Survey
This paper is a survey and an analysis of different ways of using deep
learning (deep artificial neural networks) to generate musical content. We
propose a methodology based on five dimensions for our analysis:
Objective
- What musical content is to be generated? Examples are: melody, polyphony,
  accompaniment or counterpoint.
- For what destination and for what use? To be performed by one or more humans
  (in the case of a musical score), or by a machine (in the case of an audio
  file).
Representation
- What are the concepts to be manipulated? Examples are: waveform,
  spectrogram, note, chord, meter and beat.
- What format is to be used? Examples are: MIDI, piano roll or text.
- How will the representation be encoded? Examples are: scalar, one-hot or
  many-hot.
Architecture
- What type(s) of deep neural network is (are) to be used? Examples are:
  feedforward network, recurrent network, autoencoder or generative
  adversarial networks.
Challenge
- What are the limitations and open challenges? Examples are: variability,
  interactivity and creativity.
Strategy
- How do we model and control the process of generation? Examples are:
  single-step feedforward, iterative feedforward, sampling or input
  manipulation.
For each dimension, we conduct a comparative analysis of various models and
techniques and we propose a tentative multidimensional typology. This
typology is bottom-up, based on the analysis of many existing deep-learning-based
systems for music generation selected from the relevant literature. These
systems are described and are used to exemplify the various choices of
objective, representation, architecture, challenge and strategy. The last
section includes some discussion and some prospects.
Comment: 209 pages. This paper is a simplified version of the book: J.-P.
Briot, G. Hadjeres and F.-D. Pachet, Deep Learning Techniques for Music
Generation, Computational Synthesis and Creative Systems, Springer, 201
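The one-hot and many-hot encodings named under the Representation dimension can be sketched as follows. This is an illustrative example, not code from the survey: the function names are hypothetical, and the vocabulary of 128 pitches assumes MIDI note numbers 0 to 127.

```python
from typing import Iterable, List

NUM_MIDI_PITCHES = 128  # assumption: MIDI note numbers 0..127


def one_hot(pitch: int, num_pitches: int = NUM_MIDI_PITCHES) -> List[int]:
    """Encode a single note: exactly one position is set to 1."""
    if not 0 <= pitch < num_pitches:
        raise ValueError("pitch out of range")
    vec = [0] * num_pitches
    vec[pitch] = 1
    return vec


def many_hot(pitches: Iterable[int], num_pitches: int = NUM_MIDI_PITCHES) -> List[int]:
    """Encode simultaneous notes (e.g. a chord): one position per sounding pitch."""
    vec = [0] * num_pitches
    for pitch in pitches:
        if not 0 <= pitch < num_pitches:
            raise ValueError("pitch out of range")
        vec[pitch] = 1
    return vec


middle_c = one_hot(60)               # MIDI note 60 is middle C
c_major = many_hot([60, 64, 67])     # C, E and G sounding together
```

A one-hot vector suits a monophonic melody with one note per time step, while a many-hot vector can represent polyphony at a single time step.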
Optimizing expected word error rate via sampling for speech recognition
State-level minimum Bayes risk (sMBR) training has become the de facto
standard for sequence-level training of speech recognition acoustic models. It
has an elegant formulation using the expectation semiring, and gives large
improvements in word error rate (WER) over models trained solely using
cross-entropy (CE) or connectionist temporal classification (CTC). sMBR
training optimizes the expected number of frames at which the reference and
hypothesized acoustic states differ. It may be preferable to optimize the
expected WER, but WER does not interact well with the expectation semiring, and
previous approaches based on computing expected WER exactly involve expanding
the lattices used during training. In this paper we show how to perform
optimization of the expected WER by sampling paths from the lattices used
during conventional sMBR training. The gradient of the expected WER is itself
an expectation, and so may be approximated using Monte Carlo sampling. We show
experimentally that optimizing WER during acoustic model training gives a 5%
relative improvement in WER over a well-tuned sMBR baseline on a 2-channel
query recognition task (Google Home).
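The observation that the gradient of an expected error is itself an expectation can be illustrated with a toy sketch. This is not the paper's method: instead of sampling paths from a lattice, it assumes a small explicit list of hypotheses whose probabilities come from a softmax over logits, and it uses the word-level edit distance as the error count. All names and example sentences are hypothetical. For softmax logits, the derivative of log p(y) with respect to logit k is 1[y = k] - p_k, so the gradient of the expected error is the expectation of error(y) * (1[y = k] - p_k).

```python
import math
import random
from typing import List


def word_errors(ref: str, hyp: str) -> int:
    """Word-level Levenshtein distance between reference and hypothesis."""
    r, h = ref.split(), hyp.split()
    prev = list(range(len(h) + 1))
    for i, rw in enumerate(r, 1):
        cur = [i]
        for j, hw in enumerate(h, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (rw != hw)))
        prev = cur
    return prev[-1]


def softmax(logits: List[float]) -> List[float]:
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def exact_gradient(logits: List[float], losses: List[float]) -> List[float]:
    """Closed form: d E[loss] / d logit_k = p_k * (loss_k - E[loss])."""
    p = softmax(logits)
    expected = sum(pi * li for pi, li in zip(p, losses))
    return [pi * (li - expected) for pi, li in zip(p, losses)]


def sampled_gradient(logits: List[float], losses: List[float],
                     num_samples: int, rng: random.Random) -> List[float]:
    """Monte Carlo estimate: average loss(y) * (1[y == k] - p_k) over samples y."""
    p = softmax(logits)
    grad = [0.0] * len(logits)
    samples = rng.choices(range(len(logits)), weights=p, k=num_samples)
    for y in samples:
        for k in range(len(logits)):
            grad[k] += losses[y] * ((1.0 if k == y else 0.0) - p[k])
    return [g / num_samples for g in grad]


reference = "turn on the light"
hypotheses = ["turn on the light", "turn on light", "turn off the lights"]
losses = [float(word_errors(reference, h)) for h in hypotheses]
logits = [0.0, 0.0, 0.0]

exact = exact_gradient(logits, losses)
estimate = sampled_gradient(logits, losses, 20000, random.Random(0))
```

With enough samples the estimate approaches the closed-form gradient. In the toy setting the exact value is cheap to compute, which makes it a convenient check; with a lattice the set of paths is too large to enumerate, which is where sampling becomes useful.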