113,587 research outputs found

    Audio segmentation-by-classification approach based on factor analysis in broadcast news domain

    Get PDF
    This paper studies a novel audio segmentation-by-classification approach based on factor analysis. The proposed technique compensates the within-class variability by using class-dependent factor loading matrices and obtains the scores by computing the log-likelihood ratio for the class model to a non-class model over fixed-length windows. Afterwards, these scores are smoothed to yield longer contiguous segments of the same class by means of different back-end systems. Unlike previous solutions, our proposal does not make use of specific acoustic features and does not need a hierarchical structure. The proposed method is applied to segment and classify audios coming from TV shows into five different acoustic classes: speech, music, speech with music, speech with noise, and others. The technique is compared to a hierarchical system with specific acoustic features achieving a significant error reduction

    Deep Learning for Audio Signal Processing

    Full text link
    Given the recent surge in developments of deep learning, this article provides a review of the state-of-the-art deep learning techniques for audio signal processing. Speech, music, and environmental sound processing are considered side-by-side, in order to point out similarities and differences between the domains, highlighting general methods, problems, key references, and potential for cross-fertilization between areas. The dominant feature representations (in particular, log-mel spectra and raw waveform) and deep learning models are reviewed, including convolutional neural networks, variants of the long short-term memory architecture, as well as more audio-specific neural network models. Subsequently, prominent deep learning application areas are covered, i.e. audio recognition (automatic speech recognition, music information retrieval, environmental sound detection, localization and tracking) and synthesis and transformation (source separation, audio enhancement, generative models for speech, sound, and music synthesis). Finally, key issues and future questions regarding deep learning applied to audio signal processing are identified.Comment: 15 pages, 2 pdf figure

    Using the beat histogram for speech rhythm description and language identification

    Get PDF
    In this paper we present a novel approach for the description of speech rhythm and the extraction of rhythm-related features for automatic language identification (LID). Previous methods have extracted speech rhythm through the calculation of features based on salient elements of speech such as consonants, vowels and syllables. We present how an automatic rhythm extraction method borrowed from music information retrieval, the beat histogram, can be adapted for the analysis of speech rhythm by defining the most relevant novelty functions in the speech signal and extracting features describing their periodicities. We have evaluated those features in a rhythm-based LID task for two multilingual speech corpora using support vector machines, including feature selection methods to identify the most informative descriptors. Results suggest that the method is successful in describing speech rhythm and provides LID classification accuracy comparable to or better than that of other approaches, without the need for a preceding segmentation or annotation of the speech signal. Concerning rhythm typology, the rhythm class hypothesis in its original form seems to be only partly confirmed by our results

    Feature extraction for speech and music discrimination

    Get PDF
    Driven by the demand of information retrieval, video editing and human-computer interface, in this paper we propose a novel spectral feature for music and speech discrimination. This scheme attempts to simulate a biological model using the averaged cepstrum, where human perception tends to pick up the areas of large cepstral changes. The cepstrum data that is away from the mean value will be exponentially reduced in magnitude. We conduct experiments of music/speech discrimination by comparing the performance of the proposed feature with that of previously proposed features in classification. The dynamic time warping based classification verifies that the proposed feature has the best quality of music/speech classification in the test database

    Optimal feature selection and machine learning for high-level audio classification : a random forests approach

    Get PDF
    Content related information, metadata, and semantics can be extracted from soundtracks of multimedia files. Speech recognition, music information retrieval and environmental sound detection techniques have been developed into a fairly mature technology enabling a final text mining process to obtain semantics for the audio scene. An efficient speech, music and environmental sound classification system, which correctly identify these three types of audio signals and feed them into dedicated recognisers, is a critical pre-processing stage for such a content analysis system. The performance and computational efficiency of such a system is predominately dependent on the selected features. This thesis presents a detailed study to identify the suitable classification features and associate a suitable machine learning technique for the intended classification task. In particular, a systematic feature selection procedure is developed to employ the random forests classifier to rank the features according to their importance and reduces the dimensionality of the feature space accordingly. This new technique avoids the trial-and-error approach used by many authors researchers. The implemented feature selection produces results related to individual classification tasks instead of the commonly used statistical distance criteria based approaches that does not consider the intended classification task, which makes it more suitable for supervised learning with specific purposes. A final collective decision-making stage is employed to combine multiple class detectors patterns into one to produce a single classification result for each input frames. The performance of the proposed feature selection technique has been compared with the techniques proposed by MPEG-7 standard to extract the reduced feature space. The results show a significant improvement in the resulted classification accuracy, at the same time, the feature space is simplified and computational overhead reduced. The proposed feature selection and machine learning technique enable the use of only 30 out of the 47 features without degrading the classification accuracy while the classification accuracy lowered by 1.7% only while just 10 features were utilised. The validation shows good performance also and the last stage of collective decision making was able to improve the classification result even after selecting only a small number of classification features. The work represents a successful attempt to determine audio feature importance and classify the audio contents into speech, music and environmental sound using a selected feature subset. The result shows a high degree of accuracy by utilising the random forests for both feature importance ranking and audio content classification

    Exploring convolutional, recurrent, and hybrid deep neural networks for speech and music detection in a large audio dataset

    Full text link
    Audio signals represent a wide diversity of acoustic events, from background environmental noise to spoken communication. Machine learning models such as neural networks have already been proposed for audio signal modeling, where recurrent structures can take advantage of temporal dependencies. This work aims to study the implementation of several neural network-based systems for speech and music event detection over a collection of 77,937 10-second audio segments (216 h), selected from the Google AudioSet dataset. These segments belong to YouTube videos and have been represented as mel-spectrograms. We propose and compare two approaches. The first one is the training of two different neural networks, one for speech detection and another for music detection. The second approach consists on training a single neural network to tackle both tasks at the same time. The studied architectures include fully connected, convolutional and LSTM (long short-term memory) recurrent networks. Comparative results are provided in terms of classification performance and model complexity. We would like to highlight the performance of convolutional architectures, specially in combination with an LSTM stage. The hybrid convolutional-LSTM models achieve the best overall results (85% accuracy) in the three proposed tasks. Furthermore, a distractor analysis of the results has been carried out in order to identify which events in the ontology are the most harmful for the performance of the models, showing some difficult scenarios for the detection of music and speechThis work has been supported by project “DSSL: Redes Profundas y Modelos de Subespacios para Deteccion y Seguimiento de Locutor, Idioma y Enfermedades Degenerativas a partir de la Voz” (TEC2015-68172-C2-1-P), funded by the Ministry of Economy and Competitivity of Spain and FEDE

    Beat histogram features for rhythm-based musical genre classification using multiple novelty functions

    Get PDF
    In this paper we present beat histogram features for multiple level rhythm description and evaluate them in a musical genre classification task. Audio features pertaining to various musical content categories and their related novelty functions are extracted as a basis for the creation of beat histograms. The proposed features capture not only amplitude, but also tonal and general spectral changes in the signal, aiming to represent as much rhythmic information as possible. The most and least informative features are identified through feature selection methods and are then tested using Support Vector Machines on five genre datasets concerning classification accuracy against a baseline feature set. Results show that the presented features provide comparable classification accuracy with respect to other genre classification approaches using periodicity histograms and display a performance close to that of much more elaborate up-to-date approaches for rhythm description. The use of bar boundary annotations for the texture frames has provided an improvement for the dance-oriented Ballroom dataset. The comparably small number of descriptors and the possibility of evaluating the influence of specific signal components to the general rhythmic content encourage the further use of the method in rhythm description tasks
    corecore