14,538 research outputs found

    Deep Learning for Audio Signal Processing

    Full text link
    Given the recent surge in developments of deep learning, this article provides a review of the state-of-the-art deep learning techniques for audio signal processing. Speech, music, and environmental sound processing are considered side-by-side, in order to point out similarities and differences between the domains, highlighting general methods, problems, key references, and potential for cross-fertilization between areas. The dominant feature representations (in particular, log-mel spectra and raw waveform) and deep learning models are reviewed, including convolutional neural networks, variants of the long short-term memory architecture, as well as more audio-specific neural network models. Subsequently, prominent deep learning application areas are covered, i.e. audio recognition (automatic speech recognition, music information retrieval, environmental sound detection, localization and tracking) and synthesis and transformation (source separation, audio enhancement, generative models for speech, sound, and music synthesis). Finally, key issues and future questions regarding deep learning applied to audio signal processing are identified.Comment: 15 pages, 2 pdf figure

    Large-Scale Music Genre Analysis and Classification Using Machine Learning with Apache Spark

    Get PDF
    The trend for listening to music online has greatly increased over the past decade due to the number of online musical tracks. The large music databases of music libraries that are provided by online music content distribution vendors make music streaming and downloading services more accessible to the end-user. It is essential to classify similar types of songs with an appropriate tag or index (genre) to present similar songs in a convenient way to the end-user. As the trend of online music listening continues to increase, developing multiple machine learning models to classify music genres has become a main area of research. In this research paper, a popular music dataset GTZAN which contains ten music genres is analysed to study various types of music features and audio signals. Multiple scalable machine learning algorithms supported by Apache Spark, including naĂŻve Bayes, decision tree, logistic regression, and random forest, are investigated for the classification of music genres. The performance of these classifiers is compared, and the random forest performs as the best classifier for the classification of music genres. Apache Spark is used in this paper to reduce the computation time for machine learning predictions with no computational cost, as it focuses on parallel computation. The present work also demonstrates that the perfect combination of Apache Spark and machine learning algorithms reduces the scalability problem of the computation of machine learning predictions. Moreover, different hyperparameters of the random forest classifier are optimized to increase the performance efficiency of the classifier in the domain of music genre classification. The experimental outcome shows that the developed random forest classifier can establish a high level of performance accuracy, especially for the mislabelled, distorted GTZAN dataset. This classifier has outperformed other machine learning classifiers supported by Apache Spark in the present work. The random forest classifier manages to achieve 90% accuracy for music genre classification compared to other work in the same domain

    Affective Music Information Retrieval

    Full text link
    Much of the appeal of music lies in its power to convey emotions/moods and to evoke them in listeners. In consequence, the past decade witnessed a growing interest in modeling emotions from musical signals in the music information retrieval (MIR) community. In this article, we present a novel generative approach to music emotion modeling, with a specific focus on the valence-arousal (VA) dimension model of emotion. The presented generative model, called \emph{acoustic emotion Gaussians} (AEG), better accounts for the subjectivity of emotion perception by the use of probability distributions. Specifically, it learns from the emotion annotations of multiple subjects a Gaussian mixture model in the VA space with prior constraints on the corresponding acoustic features of the training music pieces. Such a computational framework is technically sound, capable of learning in an online fashion, and thus applicable to a variety of applications, including user-independent (general) and user-dependent (personalized) emotion recognition and emotion-based music retrieval. We report evaluations of the aforementioned applications of AEG on a larger-scale emotion-annotated corpora, AMG1608, to demonstrate the effectiveness of AEG and to showcase how evaluations are conducted for research on emotion-based MIR. Directions of future work are also discussed.Comment: 40 pages, 18 figures, 5 tables, author versio

    Sentiment Analysis for Social Media

    Get PDF
    Sentiment analysis is a branch of natural language processing concerned with the study of the intensity of the emotions expressed in a piece of text. The automated analysis of the multitude of messages delivered through social media is one of the hottest research fields, both in academy and in industry, due to its extremely high potential applicability in many different domains. This Special Issue describes both technological contributions to the field, mostly based on deep learning techniques, and specific applications in areas like health insurance, gender classification, recommender systems, and cyber aggression detection

    Learning Fast and Slow: PROPEDEUTICA for Real-time Malware Detection

    Full text link
    In this paper, we introduce and evaluate PROPEDEUTICA, a novel methodology and framework for efficient and effective real-time malware detection, leveraging the best of conventional machine learning (ML) and deep learning (DL) algorithms. In PROPEDEUTICA, all software processes in the system start execution subjected to a conventional ML detector for fast classification. If a piece of software receives a borderline classification, it is subjected to further analysis via more performance expensive and more accurate DL methods, via our newly proposed DL algorithm DEEPMALWARE. Further, we introduce delays to the execution of software subjected to deep learning analysis as a way to "buy time" for DL analysis and to rate-limit the impact of possible malware in the system. We evaluated PROPEDEUTICA with a set of 9,115 malware samples and 877 commonly used benign software samples from various categories for the Windows OS. Our results show that the false positive rate for conventional ML methods can reach 20%, and for modern DL methods it is usually below 6%. However, the classification time for DL can be 100X longer than conventional ML methods. PROPEDEUTICA improved the detection F1-score from 77.54% (conventional ML method) to 90.25%, and reduced the detection time by 54.86%. Further, the percentage of software subjected to DL analysis was approximately 40% on average. Further, the application of delays in software subjected to ML reduced the detection time by approximately 10%. Finally, we found and discussed a discrepancy between the detection accuracy offline (analysis after all traces are collected) and on-the-fly (analysis in tandem with trace collection). Our insights show that conventional ML and modern DL-based malware detectors in isolation cannot meet the needs of efficient and effective malware detection: high accuracy, low false positive rate, and short classification time.Comment: 17 pages, 7 figure

    Productivity Measurement of Call Centre Agents using a Multimodal Classification Approach

    Get PDF
    Call centre channels play a cornerstone role in business communications and transactions, especially in challenging business situations. Operations’ efficiency, service quality, and resource productivity are core aspects of call centres’ competitive advantage in rapid market competition. Performance evaluation in call centres is challenging due to human subjective evaluation, manual assortment to massive calls, and inequality in evaluations because of different raters. These challenges impact these operations' efficiency and lead to frustrated customers. This study aims to automate performance evaluation in call centres using various deep learning approaches. Calls recorded in a call centre are modelled and classified into high- or low-performance evaluations categorised as productive or nonproductive calls. The proposed conceptual model considers a deep learning network approach to model the recorded calls as text and speech. It is based on the following: 1) focus on the technical part of agent performance, 2) objective evaluation of the corpus, 3) extension of features for both text and speech, and 4) combination of the best accuracy from text and speech data using a multimodal structure. Accordingly, the diarisation algorithm extracts that part of the call where the agent is talking from which the customer is doing so. Manual annotation is also necessary to divide the modelling corpus into productive and nonproductive (supervised training). Krippendorff’s alpha was applied to avoid subjectivity in the manual annotation. Arabic speech recognition is then developed to transcribe the speech into text. The text features are the words embedded using the embedding layer. The speech features make several attempts to use the Mel Frequency Cepstral Coefficient (MFCC) upgraded with Low-Level Descriptors (LLD) to improve classification accuracy. The data modelling architectures for speech and text are based on CNNs, BiLSTMs, and the attention layer. The multimodal approach follows the generated models to improve performance accuracy by concatenating the text and speech models using the joint representation methodology. The main contributions of this thesis are: • Developing an Arabic Speech recognition method for automatic transcription of speech into text. • Drawing several DNN architectures to improve performance evaluation using speech features based on MFCC and LLD. • Developing a Max Weight Similarity (MWS) function to outperform the SoftMax function used in the attention layer. • Proposing a multimodal approach for combining the text and speech models for best performance evaluation
    • …
    corecore