3 research outputs found

    Time-continuous Estimation of Emotion in Music with Recurrent Neural Networks

    Get PDF
    International audienceIn this paper, we describe the IRIT's approach used for the MediaEval 2015 "Emotion in Music" task. The goal was to predict two real-valued emotion dimensions, namely valence and arousal, in a time-continuous fashion. We chose to use recurrent neural networks (RNN) for their sequence modeling capabilities. Hyperparameter tuning was performed through a 10-fold cross-validation setup on the 431 songs of the development subset. With the baseline set of 260 acoustic features, our best system achieved averaged root mean squared errors of 0.250 and 0.238, and Pearson's correlation coefficients of 0.703 and 0.692, for valence and arousal, respectively. These results were obtained by first making predictions with an RNN comprised of only 10 hidden units, smoothed by a moving average filter, and used as input to a second RNN to generate the final predictions. This system gave our best results on the official test data subset for arousal (RMSE=0.247, r=0.588), but not for Valence. Valence predictions were much worse (RMSE=0.365, r=0.029). This may be explained by the fact that in the development subset, valence and arousal values were very correlated (r=0.626), and this was not the case with the test data. Finally, slight improvements over these figures were obtained by adding spectral atness and spectral valley features to the baseline set

    Sous-continents Estimation of Emotion in Music with Recurrent Neural Networks

    Get PDF
    In this paper, we describe the IRIT's approach used for the MediaEval 2015 "Emotion in Music" task. The goal was to predict two real-valued emotion dimensions, namely valence and arousal, in a time-continuous fashion. We chose to use recurrent neural networks (RNN) for their sequence modeling capabilities. Hyperparameter tuning was performed through a 10-fold cross-validation setup on the 431 songs of the development subset. With the baseline set of 260 acoustic features, our best system achieved averaged root mean squared errors of 0.250 and 0.238, and Pearson's correlation coefficients of 0.703 and 0.692, for valence and arousal, respectively. These results were obtained by first making predictions with an RNN comprised of only 10 hidden units, smoothed by a moving average filter, and used as input to a second RNN to generate the final predictions. This system gave our best results on the official test data subset for arousal (RMSE=0.247, r=0.588), but not for Valence. Valence predictions were much worse (RMSE=0.365, r=0.029). This may be explained by the fact that in the development subset, valence and arousal values were very correlated (r=0.626), and this was not the case with the test data. Finally, slight improvements over these figures were obtained by adding spectral atness and spectral valley features to the baseline set
    corecore