48 research outputs found

    Automated Audio Captioning with Recurrent Neural Networks

    Full text link
    We present the first approach to automated audio captioning. We employ an encoder-decoder scheme with an alignment model in between. The input to the encoder is a sequence of log mel-band energies calculated from an audio file, while the output is a sequence of words, i.e. a caption. The encoder is a multi-layered, bi-directional gated recurrent unit (GRU) and the decoder a multi-layered GRU with a classification layer connected to the last GRU of the decoder. The classification layer and the alignment model are fully connected layers with shared weights between timesteps. The proposed method is evaluated using data drawn from a commercial sound effects library, ProSound Effects. The resulting captions were rated through metrics utilized in machine translation and image captioning fields. Results from metrics show that the proposed method can predict words appearing in the original caption, but not always correctly ordered.Comment: Presented at the 11th IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA), 201

    Advertising On The Internet: Perspectives From Advertising Agencies And Advertisers

    Get PDF
    While a significant portion of companies have invested in Internet advertising and clearly digital media continues to grow in popularity, a noteworthy segment of corporations is still uncertain of Internet advertising efficiency. The purpose of this study is to gain insights into the perceptions of both advertisers and advertising agencies of the Internet as an advertising medium in order to record the factors that inhibit or reinforce the integration of new technology into their strategies and to spot future changes and trends. One hundred twenty-four managers participated in this study and their responses indicate that currently Internet advertising is a questionable and ineffective marketing channel; however marketers and media planners are willing to exploit the targeted nature of Internet advertising in the short run

    A Recurrent Encoder-Decoder Approach with Skip-filtering Connections for Monaural Singing Voice Separation

    Full text link
    The objective of deep learning methods based on encoder-decoder architectures for music source separation is to approximate either ideal time-frequency masks or spectral representations of the target music source(s). The spectral representations are then used to derive time-frequency masks. In this work we introduce a method to directly learn time-frequency masks from an observed mixture magnitude spectrum. We employ recurrent neural networks and train them using prior knowledge only for the magnitude spectrum of the target source. To assess the performance of the proposed method, we focus on the task of singing voice separation. The results from an objective evaluation show that our proposed method provides comparable results to deep learning based methods which operate over complicated signal representations. Compared to previous methods that approximate time-frequency masks, our method has increased performance of signal to distortion ratio by an average of 3.8 dB

    Temporal Sub-sampling of Audio Feature Sequences for Automated Audio Captioning

    Get PDF
    Audio captioning is the task of automatically creating a textual description for the contents of a general audio signal. Typical audio captioning methods rely on deep neural networks (DNNs), where the target of the DNN is to map the input audio sequence to an output sequence of words, i.e. the caption. Though, the length of the textual description is considerably less than the length of the audio signal, for example 10 words versus some thousands of audio feature vectors. This clearly indicates that an output word corresponds to multiple input feature vectors. In this work we present an approach that focuses on explicitly taking advantage of this difference of lengths between sequences, by applying a temporal sub-sampling to the audio input sequence. We employ a sequence-to-sequence method, which uses a fixed-length vector as an output from the encoder, and we apply temporal sub-sampling between the RNNs of the encoder. We evaluate the benefit of our approach by employing the freely available dataset Clotho and we evaluate the impact of different factors of temporal sub-sampling. Our results show an improvement to all considered metrics

    Stacked Convolutional and Recurrent Neural Networks for Music Emotion Recognition

    Get PDF
    This paper studies the emotion recognition from musical tracks in the 2-dimensional valence-arousal (V-A) emotional space. We propose a method based on convolutional (CNN) and recurrent neural networks (RNN), having significantly fewer parameters compared with the state-of-the-art method for the same task. We utilize one CNN layer followed by two branches of RNNs trained separately for arousal and valence. The method was evaluated using the 'MediaEval2015 emotion in music' dataset. We achieved an RMSE of 0.202 for arousal and 0.268 for valence, which is the best result reported on this dataset.Comment: Accepted for Sound and Music Computing (SMC 2017
    corecore