9 research outputs found

    A generic audio classification and segmentation approach for multimedia indexing and retrieval

    Audio-Video Detection and Fusion of Broad Casting Information

    Over the last few decades of multimedia information systems, audio-video data has become a growing part of many digital computer applications. Audio-video classification has become a focus of research in audio-video processing and pattern recognition. Automatic audio-video classification is very useful for audio-video indexing, content-based audio-video retrieval, and online audio-video distribution such as online audio-video shopping, but extracting the most similar and salient themes from huge volumes of audio-video data remains a challenge. In this paper, we propose effective algorithms to automatically segment and classify audio-video clips into one of six classes: advertisement, cartoon, songs, serial, movie, and news. For these categories, a number of acoustic and visual features, including Mel Frequency Cepstral Coefficients and color histograms, are extracted to characterize the audio and video data. The autoassociative neural network (AANN) model is used to capture the distribution of the acoustic and visual feature vectors of each class, and the backpropagation learning algorithm is used to adjust the weights of the network to minimize the mean square error for each feature vector. Keywords: Audio and Video detection, Audio and Video fusion, Mel Frequency Cepstral Coefficient, Color Histogram, Autoassociative Neural Network Model (AANN)
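As an illustration of the visual feature named in this abstract, the following is a minimal sketch of a normalized joint RGB color histogram. The function name `color_histogram`, the bin count, and the pixel representation (a list of 8-bit RGB tuples) are assumptions made for this example; the paper's actual feature extraction settings are not given in the abstract.

```python
def color_histogram(pixels, bins_per_channel=4):
    """Return a normalized joint RGB histogram for 8-bit (r, g, b) pixels.

    The output always has bins_per_channel ** 3 entries, so frames of any
    size map to a feature vector of the same length.
    """
    b = bins_per_channel
    hist = [0.0] * (b ** 3)
    count = 0
    for r, g, bl in pixels:
        # v * b // 256 maps 0..255 onto bin indices 0..b-1.
        idx = (r * b // 256) * b * b + (g * b // 256) * b + (bl * b // 256)
        hist[idx] += 1.0
        count += 1
    if count == 0:
        return hist
    # Normalize so the histogram sums to 1 regardless of frame size.
    return [h / count for h in hist]
```

A vector of this kind, concatenated with acoustic features, could then be fed to a per-class model such as the AANN the abstract describes.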

    Automatic classification system for audiovisual content using MPEG-7 Low-Level Descriptors [Sistema de classificació automàtica per a continguts audiovisuals mitjançant Low-Level Descriptors d'MPEG-7]

    Several goals can be defined: - Build a discriminator between speech, music, silence, and noise. - Build a gender (gènere) discriminator. - Build a discriminator for different speakers. - Build an automatic segmenter for the different sound events. In addition, the academic and research objective is to study whether an efficient classifier can be built using only the tools provided by the MPEG-7 standard.

    Speech data analysis for semantic indexing of video of simulated medical crises.

    The Simulation for Pediatric Assessment, Resuscitation, and Communication (SPARC) group within the Department of Pediatrics at the University of Louisville was established to enhance the care of children by using simulation-based educational methodologies to improve patient safety and strengthen clinician-patient interactions. After each simulation session, the physician must manually review and annotate the recordings and then debrief the trainees. The physician responsible for the simulation has recorded hundreds of videos and is seeking solutions that can automate the process. This dissertation introduces our system for efficient segmentation and semantic indexing of videos of medical simulations using machine learning methods. It provides the physician with automated tools to review important sections of the simulation by identifying who spoke, when, and with what emotion. Only audio information is extracted and analyzed because the quality of the image recording is low and the visual environment is static for the most part. Our proposed system includes four main components: preprocessing, speaker segmentation, speaker identification, and emotion recognition. Preprocessing consists of first extracting the audio component from the video recording and then extracting various low-level audio features to detect and remove silence segments. We investigate and compare two approaches for this task: the first is threshold-based and the second is classification-based. The second main component of the proposed system detects speaker change points in order to segment the audio stream. We propose two fusion methods for this task. The speaker identification and emotion recognition components of our system are designed to let users browse the video and retrieve shots that identify “who spoke, when, and the speaker’s emotion” for further analysis.
For these components, we propose two feature representation methods that map audio segments of arbitrary length to a feature vector with fixed dimensions. The first is based on soft bag-of-words (BoW) feature representations. In particular, we define three types of BoW that are based on crisp, fuzzy, and possibilistic voting. The second feature representation is a generalization of the BoW and is based on the Fisher Vector (FV). FV uses the Fisher Kernel principle and combines the benefits of generative and discriminative approaches. The proposed feature representations are used within two learning frameworks. The first is supervised learning and assumes that a large collection of labeled training data is available. Within this framework, we use standard classifiers including K-nearest neighbor (K-NN), support vector machine (SVM), and Naive Bayes. The second framework is based on semi-supervised learning, where only a limited number of labeled training samples is available. We use an approach that is based on label propagation. Our proposed algorithms were evaluated using 15 medical simulation sessions. The results were analyzed and compared to those obtained using state-of-the-art algorithms. We show that our proposed speech segmentation fusion algorithms and feature mappings outperform existing methods. We also integrated all proposed algorithms and developed a GUI prototype system for subjective evaluation. This prototype processes medical simulation video and provides the user with a visual summary of the different speech segments. It also allows the user to browse videos and retrieve scenes that provide answers to semantic queries such as: who spoke and when? Who interrupted whom? What was the emotion of the speaker? The GUI prototype can also provide summary statistics for each simulation video. Examples include: for how long did each person speak? What is the longest uninterrupted speech segment?
Is there an unusually large number of pauses within the speech segment of a given speaker?
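The crisp and fuzzy bag-of-words mappings mentioned in this abstract can be illustrated with a minimal sketch. It assumes a small hypothetical codebook of "audio words", squared Euclidean distance, and the standard fuzzy c-means membership formula with fuzzifier `m = 2`; none of these choices is confirmed by the abstract, and the function names are invented for the example. The possibilistic variant is not shown.

```python
def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def crisp_bow(frames, codebook):
    """Crisp voting: each frame casts one full vote for its nearest word."""
    hist = [0.0] * len(codebook)
    for f in frames:
        dists = [sq_dist(f, w) for w in codebook]
        hist[dists.index(min(dists))] += 1.0
    return [h / len(frames) for h in hist]


def fuzzy_bow(frames, codebook, m=2.0, eps=1e-12):
    """Fuzzy voting: each frame splits one vote across all words.

    Memberships follow the fuzzy c-means rule (assumed here), so closer
    words receive a larger share and each frame's shares sum to 1.
    """
    hist = [0.0] * len(codebook)
    power = 1.0 / (m - 1.0)
    for f in frames:
        inv = [(1.0 / (sq_dist(f, w) + eps)) ** power for w in codebook]
        total = sum(inv)
        for k, v in enumerate(inv):
            hist[k] += v / total
    return [h / len(frames) for h in hist]
```

Both functions return a vector whose length equals the codebook size, however many frames the segment contains, which is what allows segments of arbitrary length to be passed to fixed-input classifiers such as K-NN or SVM.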

    A generic audio classification and segmentation approach for multimedia indexing and retrieval

    We focus on generic and automatic audio classification and segmentation for audio-based multimedia indexing and retrieval applications. In particular, we present a fuzzy approach to a hierarchic audio classification and global segmentation framework based on automatic audio analysis, providing robust, bi-modal, efficient, and parameter-invariant classification over global audio segments. The input audio is split into segments, which are classified as speech, music, fuzzy, or silent. The proposed method minimizes critical misclassification errors through fuzzy region modeling, thus increasing the efficiency of both pure and fuzzy classification. The experimental results show that the critical errors are minimized and that the proposed framework significantly increases the efficiency and accuracy of audio-based retrieval, especially in large multimedia databases.

    Spatial and Content-based Audio Processing using Stochastic Optimization Methods

    Stochastic optimization (SO) represents a category of numerical optimization approaches in which the search for the optimal solution involves randomness in a constructive manner. As this thesis also shows, stochastic optimization techniques and models have become an important and notable paradigm in a wide range of application areas, including transportation models, financial instruments, and network design. Stochastic optimization is developed especially for problems that are either too difficult or impossible to solve analytically with deterministic optimization approaches. In this thesis, the focus is on applying several stochastic optimization algorithms to two audio-specific application areas, namely sniper positioning and content-based audio classification and retrieval. In short, the first application belongs to the area of spatial audio, whereas the latter is a topic of machine learning and, more specifically, multimedia information retrieval. The SO algorithms considered in the thesis are particle filtering (PF), particle swarm optimization (PSO), and simulated annealing (SA), which are extended, combined, and applied to the specified problems in a novel manner. Because of their iterative and evolving nature, PSO algorithms in particular are often included in the category of evolutionary algorithms. For the sniper positioning application, the PF and SA algorithms are employed to optimize the parameters of a mathematical shock wave model based on observed firing event wavefronts. Such an inverse problem is suitable for a Bayesian approach, which is the main motivation for including the PF approach among the considered optimization methods. It is shown, also with SA, that by applying the stated shock wave model, the proposed stochastic parameter estimation approach provides statistically reliable and qualified results.
The content-based audio classification part of the thesis is based on a dedicated framework consisting of several individual binary classifiers. In this work, artificial neural networks (ANNs) are used within the framework, and their parameters and network structures are optimized based on the desired item outputs, i.e. the ground truth class labels. The optimization process is carried out using a multi-dimensional extension of the regular PSO algorithm (MD PSO). The audio retrieval experiments are performed in the context of feature generation (synthesis), which is an approach for generating new audio features/attributes based on conventional features originally extracted from a particular audio database. Here the MD PSO algorithm is applied to optimize the parameters of the feature generation process, wherein the dimensionality of the generated feature vector is also optimized. From both a practical perspective and the viewpoint of complexity theory, stochastic optimization techniques are often computationally demanding. Because of this, the practical implementations discussed in this thesis are designed to be directly applicable to parallel computing. This is an important and topical issue considering the continuous growth of computing grids and cloud services. Indeed, many of the results achieved in this thesis were computed using a grid of several computers. Furthermore, since personal computers and mobile handsets also include an increasing number of processor cores, such parallel implementations are not limited to grid servers.
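For readers unfamiliar with the regular PSO algorithm that the thesis extends, the following is a minimal sketch of a basic global-best PSO. It is not the multi-dimensional MD PSO used in the thesis. The function names, the inertia and acceleration constants (commonly cited constriction values), and the sphere test function are assumptions chosen for illustration.

```python
import random


def pso_minimize(f, dim, bounds, n_particles=20, n_iters=200,
                 w=0.7298, c1=1.49618, c2=1.49618, seed=0):
    """Minimize f over a box using basic global-best PSO."""
    rng = random.Random(seed)  # fixed seed keeps the sketch reproducible
    lo, hi = bounds
    pos = [[rng.uniform(lo, hi) for _ in range(dim)]
           for _ in range(n_particles)]
    vel = [[0.0] * dim for _ in range(n_particles)]
    pbest = [p[:] for p in pos]          # best position seen by each particle
    pbest_val = [f(p) for p in pos]
    g = min(range(n_particles), key=lambda i: pbest_val[i])
    gbest, gbest_val = pbest[g][:], pbest_val[g]  # best position in the swarm
    for _ in range(n_iters):
        for i in range(n_particles):
            for d in range(dim):
                r1, r2 = rng.random(), rng.random()
                # Inertia + pull toward personal best + pull toward global best.
                vel[i][d] = (w * vel[i][d]
                             + c1 * r1 * (pbest[i][d] - pos[i][d])
                             + c2 * r2 * (gbest[d] - pos[i][d]))
                # Clamp positions to the search box.
                pos[i][d] = min(hi, max(lo, pos[i][d] + vel[i][d]))
            val = f(pos[i])
            if val < pbest_val[i]:
                pbest[i], pbest_val[i] = pos[i][:], val
                if val < gbest_val:
                    gbest, gbest_val = pos[i][:], val
    return gbest, gbest_val


def sphere(x):
    """Simple convex test function with its minimum of 0 at the origin."""
    return sum(v * v for v in x)
```

Calling `pso_minimize(sphere, 2, (-5.0, 5.0))` returns the best position found and its function value. Because each particle's evaluation of `f` is independent within an iteration, the inner loop is the natural place to parallelize, in line with the thesis's remark about parallel computing.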