Music Artist Classification with WaveNet Classifier for Raw Waveform Audio Data
Models for music artist classification have usually operated in the frequency
domain, where the input audio samples are first processed by a spectral
transformation. In this paper, we propose an end-to-end architecture that works
in the time domain for this task. We introduce a WaveNet classifier, based on
the WaveNet architecture originally designed for speech and music generation,
which models features directly from a raw audio waveform: the network takes the
waveform as input, and several subsequent downsampling layers discriminate
which artist the input belongs to. In addition, the proposed method is applied
to singer identification. The best-performing model obtains an average F1 score
of 0.854 on the Artist20 benchmark dataset, a significant improvement over
related work. To show the effectiveness of the proposed method's feature
learning, the bottleneck layer of the model is visualized.
Comment: 12 pages
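The pipeline the abstract describes (raw waveform in, dilated causal convolutions, downsampling, artist probabilities out) can be sketched roughly as below. This is a minimal NumPy illustration, not the authors' implementation: the dilation schedule, stride-2 decimation, pooling, and random weights are all illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def causal_dilated_conv(x, w, dilation):
    """Kernel-size-2 causal dilated convolution, the WaveNet building block:
    y[t] = tanh(w0 * x[t - dilation] + w1 * x[t])."""
    pad = np.concatenate([np.zeros(dilation), x])
    return np.tanh(w[0] * pad[:-dilation] + w[1] * pad[dilation:])

def wavenet_classifier(x, n_artists=20):
    """Toy time-domain classifier: dilated conv stack with stride-2
    downsampling after each layer, pooled into a small bottleneck feature,
    then a linear softmax over artists (weights are random placeholders)."""
    for d in (1, 2, 4, 8):                 # assumed dilation schedule
        w = rng.standard_normal(2)
        x = causal_dilated_conv(x, w, d)
        x = x[::2]                         # downsampling layer
    feat = np.array([x.mean(), x.std()])   # crude global pooling -> bottleneck
    W = rng.standard_normal((n_artists, feat.size))
    logits = W @ feat
    p = np.exp(logits - logits.max())
    return p / p.sum()

# 1 second of fake 16 kHz audio -> probability over 20 artists (as in Artist20)
probs = wavenet_classifier(rng.standard_normal(16000))
```

With trained weights, the artist prediction would be `probs.argmax()`; the pooled `feat` vector plays the role of the bottleneck layer the authors visualize.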
A Comparative Analysis of Neural-Based Visual Recognisers for Speech Activity Detection
Recent advances in neural networks have offered great solutions for automating various detection tasks, including speech activity detection (SAD). However, the existing literature on SAD highlights different neural approaches but does not provide a comprehensive comparison of them. This matters because such neural approaches often require hardware-intensive resources.
As a result, this project provides a comparative analysis of three approaches: classification of still images (CNN), classification based on previous images (CRNN), and classification of a sequence of images (Seq2Seq). The project aims to find a modest approach, one that provides the highest accuracy yet does not require expensive computation while delivering the quickest output prediction times. Such an approach can then be adapted for real-time applications such as the activation of infotainment systems or interactive robots.
Results show that, within the problem domain (dataset, resources, etc.), the use of still images can achieve an accuracy of 97% for SAD. With the addition of an RNN, classification accuracy increases by a further 2%: both recurrent architectures (classification based on previous images and classification of a sequence of images) achieve 99% classification accuracy.
These results show that using history (previous images) improves accuracy compared to using still images alone. Furthermore, because of the RNN's memory, the network can be made smaller, which results in quicker training and prediction times. Experiments also showed that the CRNN is almost as accurate as the Seq2Seq architecture (99.1% vs. 99.6% classification accuracy, respectively) but faster to train (326 s vs. 761 s per epoch) and 28% faster at output prediction (3.7 s vs. 5.19 s per prediction). These results indicate that the CRNN can be a suitable choice for real-time applications such as activation of infotainment systems, based on its classification accuracy, training time, and prediction time.
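The CRNN idea the abstract compares (per-frame CNN features carried through a recurrent state so each prediction can use previous images) can be sketched as follows. This is a hedged toy in NumPy, not the project's model: the linear "CNN" stand-in, state size, and frame shape are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)

def frame_features(img, W):
    """Stand-in for a small CNN: one linear projection of the flattened frame."""
    return np.tanh(W @ img.ravel())

def crnn_sad(frames, n_feat=8):
    """Toy CRNN speech-activity detector: per-frame features are fed through
    a simple recurrent state h (memory of previous images), and each state is
    classified speech / non-speech with a sigmoid. Weights are random
    placeholders standing in for a trained model."""
    W_cnn = rng.standard_normal((n_feat, frames[0].size)) * 0.1
    W_h = rng.standard_normal((n_feat, n_feat)) * 0.1
    w_out = rng.standard_normal(n_feat)
    h = np.zeros(n_feat)
    preds = []
    for f in frames:
        h = np.tanh(frame_features(f, W_cnn) + W_h @ h)  # carry history forward
        p = 1.0 / (1.0 + np.exp(-(w_out @ h)))           # P(speech) for this frame
        preds.append(p)
    return np.array(preds)

# 10 fake 16x16 image frames -> one speech probability per frame
p = crnn_sad([rng.standard_normal((16, 16)) for _ in range(10)])
```

A still-image CNN would drop the `W_h @ h` term and classify each frame independently; the Seq2Seq variant would instead consume the whole frame sequence before emitting its outputs, which is why it trains and predicts more slowly.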