Automated Audio Captioning with Recurrent Neural Networks
We present the first approach to automated audio captioning. We employ an
encoder-decoder scheme with an alignment model in between. The input to the
encoder is a sequence of log mel-band energies calculated from an audio file,
while the output is a sequence of words, i.e. a caption. The encoder is a
multi-layered, bi-directional gated recurrent unit (GRU) and the decoder a
multi-layered GRU with a classification layer connected to the last GRU of the
decoder. The classification layer and the alignment model are fully connected
layers with shared weights between timesteps. The proposed method is evaluated
using data drawn from a commercial sound effects library, ProSound Effects. The
resulting captions were evaluated with metrics used in the machine translation
and image captioning fields. The results show that the proposed method can
predict words appearing in the original caption, though not always in the
correct order.
Comment: Presented at the 11th IEEE Workshop on Applications of Signal
Processing to Audio and Acoustics (WASPAA), 2017
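As a rough illustration of the described architecture, the following PyTorch sketch wires a multi-layered bi-directional GRU encoder, a fully connected alignment layer shared across timesteps, and a GRU decoder with a classification layer. All layer sizes, the vocabulary size, and the mean-pooling summary of the aligned encoder states are illustrative assumptions, not the paper's actual configuration.

import torch
import torch.nn as nn

class CaptionModel(nn.Module):
    def __init__(self, n_mels=64, hidden=256, vocab=1000):
        super().__init__()
        # Multi-layered bi-directional GRU encoder over log mel-band energies.
        self.encoder = nn.GRU(n_mels, hidden, num_layers=3,
                              bidirectional=True, batch_first=True)
        # Alignment model: a fully connected layer shared across timesteps.
        self.align = nn.Linear(2 * hidden, hidden)
        # Multi-layered GRU decoder with a classification layer on top.
        self.decoder = nn.GRU(hidden, hidden, num_layers=2, batch_first=True)
        self.classify = nn.Linear(hidden, vocab)

    def forward(self, mel, max_words=10):
        enc, _ = self.encoder(mel)              # (B, T, 2 * hidden)
        ctx = torch.tanh(self.align(enc))       # align encoder states
        ctx = ctx.mean(dim=1, keepdim=True)     # summary vector (assumption)
        dec, _ = self.decoder(ctx.expand(-1, max_words, -1))
        return self.classify(dec)               # (B, max_words, vocab) logits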
Advertising On The Internet: Perspectives From Advertising Agencies And Advertisers
While a significant portion of companies have invested in Internet advertising, and digital media clearly continues to grow in popularity, a notable segment of corporations remains uncertain about the effectiveness of Internet advertising. The purpose of this study is to gain insight into the perceptions of both advertisers and advertising agencies of the Internet as an advertising medium, in order to identify the factors that inhibit or reinforce the integration of this new medium into their strategies and to anticipate future changes and trends. One hundred twenty-four managers participated in the study. Their responses indicate that Internet advertising is currently perceived as a questionable and ineffective marketing channel; however, marketers and media planners are willing to exploit its targeted nature in the short run.
A Recurrent Encoder-Decoder Approach with Skip-filtering Connections for Monaural Singing Voice Separation
The objective of deep learning methods based on encoder-decoder architectures
for music source separation is to approximate either ideal time-frequency masks
or spectral representations of the target music source(s). The spectral
representations are then used to derive time-frequency masks. In this work we
introduce a method to directly learn time-frequency masks from an observed
mixture magnitude spectrum. We employ recurrent neural networks and train them
using, as prior knowledge, only the magnitude spectrum of the target source. To
assess the performance of the proposed method, we focus on the task of singing
voice separation. The results of an objective evaluation show that our
proposed method yields results comparable to those of deep learning based
methods that operate on more complicated signal representations. Compared to
previous methods that approximate time-frequency masks, our method improves
the signal-to-distortion ratio by an average of 3.8 dB.
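A minimal PyTorch sketch of the skip-filtering idea as described: the network's sigmoid output is treated as a time-frequency mask and multiplied element-wise with the input mixture magnitude, so supervising only against the target source magnitude implicitly learns the mask. Layer sizes and the spectral dimension are illustrative assumptions.

import torch
import torch.nn as nn

class SkipFilteringSeparator(nn.Module):
    def __init__(self, n_freq=1025, hidden=512):
        super().__init__()
        self.rnn = nn.GRU(n_freq, hidden, num_layers=2,
                          bidirectional=True, batch_first=True)
        self.mask = nn.Linear(2 * hidden, n_freq)

    def forward(self, mix_mag):                 # (B, T, n_freq) magnitudes
        h, _ = self.rnn(mix_mag)
        m = torch.sigmoid(self.mask(h))         # implicit mask in [0, 1]
        return m * mix_mag                      # skip-filtering: mask * input

Training then needs only the target source magnitude, e.g. nn.functional.mse_loss(model(mix_mag), voice_mag), so no ideal mask has to be precomputed as a regression target.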
Temporal Sub-sampling of Audio Feature Sequences for Automated Audio Captioning
Audio captioning is the task of automatically creating a textual description
for the contents of a general audio signal. Typical audio captioning methods
rely on deep neural networks (DNNs), where the target of the DNN is to map the
input audio sequence to an output sequence of words, i.e. the caption. However,
the textual description is considerably shorter than the audio signal, for
example 10 words versus some thousands of audio feature vectors. This clearly
indicates that an output word corresponds to multiple input feature vectors. In
this work we present an approach that explicitly takes advantage of this
difference in sequence lengths by applying temporal sub-sampling to the input
audio sequence. We employ a
sequence-to-sequence method, which uses a fixed-length vector as an output from
the encoder, and we apply temporal sub-sampling between the RNNs of the
encoder. We evaluate the benefit of our approach on the freely available
Clotho dataset and examine the impact of different sub-sampling factors. Our
results show an improvement on all considered metrics.
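A minimal PyTorch sketch of sub-sampling between the encoder RNNs: after each recurrent layer, only every k-th timestep is kept, progressively shortening the audio sequence before the final fixed-length encoder output. The factor k, the number of layers, and the layer sizes are illustrative assumptions, not the paper's configuration.

import torch
import torch.nn as nn

class SubSamplingEncoder(nn.Module):
    def __init__(self, n_mels=64, hidden=256, k=2):
        super().__init__()
        self.k = k
        self.rnn1 = nn.GRU(n_mels, hidden, batch_first=True)
        self.rnn2 = nn.GRU(hidden, hidden, batch_first=True)
        self.rnn3 = nn.GRU(hidden, hidden, batch_first=True)

    def forward(self, mel):                     # (B, T, n_mels)
        h, _ = self.rnn1(mel)
        h = h[:, ::self.k]                      # keep every k-th timestep
        h, _ = self.rnn2(h)
        h = h[:, ::self.k]                      # sub-sample again
        _, last = self.rnn3(h)
        return last[-1]                         # fixed-length encoder output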
Stacked Convolutional and Recurrent Neural Networks for Music Emotion Recognition
This paper studies emotion recognition from musical tracks in the
2-dimensional valence-arousal (V-A) emotional space. We propose a method based
on convolutional (CNN) and recurrent neural networks (RNN), having
significantly fewer parameters compared with the state-of-the-art method for
the same task. We utilize one CNN layer followed by two branches of RNNs
trained separately for arousal and valence. The method was evaluated using the
'MediaEval2015 emotion in music' dataset. We achieved an RMSE of 0.202 for
arousal and 0.268 for valence, which is the best result reported on this
dataset.
Comment: Accepted for Sound and Music Computing (SMC 2017)
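A minimal PyTorch sketch of the stacked CNN-RNN layout as described: one convolutional layer over the input spectrogram, followed by two separate recurrent branches regressing arousal and valence. Kernel, filter, and layer sizes are illustrative assumptions; the paper trains the two branches separately, whereas this sketch only shows the shared forward pass.

import torch
import torch.nn as nn

class EmotionCRNN(nn.Module):
    def __init__(self, n_mels=40, n_filters=32, hidden=64):
        super().__init__()
        self.conv = nn.Conv2d(1, n_filters, kernel_size=3, padding=1)
        self.pool = nn.MaxPool2d((1, 2))        # pool along frequency only
        feat = n_filters * (n_mels // 2)
        # Two RNN branches: one for arousal, one for valence.
        self.arousal_rnn = nn.GRU(feat, hidden, batch_first=True)
        self.valence_rnn = nn.GRU(feat, hidden, batch_first=True)
        self.arousal_out = nn.Linear(hidden, 1)
        self.valence_out = nn.Linear(hidden, 1)

    def forward(self, spec):                    # (B, 1, T, n_mels)
        x = torch.relu(self.pool(self.conv(spec)))
        B, C, T, F = x.shape
        x = x.permute(0, 2, 1, 3).reshape(B, T, C * F)
        a, _ = self.arousal_rnn(x)
        v, _ = self.valence_rnn(x)
        # Per-frame predictions in the valence-arousal space.
        return self.arousal_out(a), self.valence_out(v)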