351 research outputs found

    Unsupervised mining of audiovisually consistent segments in videos with application to structure analysis

    In this paper, a multimodal event mining technique is proposed to discover repeating video segments exhibiting audio and visual consistency in a totally unsupervised manner. The mining strategy first exploits independent audio and visual cluster analysis to provide segments which are consistent in both their visual and audio modalities, and thus likely to correspond to a unique underlying event. A subsequent modeling stage using discriminative models enables accurate detection of the underlying event throughout the video. Event mining is applied to unsupervised video structure analysis, using simple heuristics on the occurrence patterns of the discovered events to select those relevant to the video structure. Results on TV programs ranging from news to talk shows and games show that structurally relevant events are discovered with precisions ranging from 87% to 98% and recalls from 59% to 94%.

    Hierarchical topic structuring: from dense segmentation to topically focused fragments via burst analysis

    Topic segmentation traditionally relies on lexical cohesion, measured through word re-occurrences, to output a dense segmentation, either linear or hierarchical. In this paper, a novel organization of the topical structure of textual content is proposed. Rather than searching for topic shifts to yield a dense segmentation, we propose an algorithm to extract topically focused fragments organized in a hierarchical manner. This is achieved by leveraging the temporal distribution of word re-occurrences, searching for bursts, to skirt the limits imposed by a global counting of lexical re-occurrences within segments. Comparison to a reference dense segmentation on varied datasets indicates that we can achieve a better topic focus while retrieving all of the important aspects of a text.

    Zero-resource audio-only spoken term detection based on a combination of template matching techniques

    Keywords: spoken term detection, template matching, unsupervised learning, posterior features.
    Spoken term detection is a well-known information retrieval task that seeks to extract contentful information from audio by locating occurrences of known query words of interest. This paper describes a zero-resource approach to this task based on pattern matching of spoken term queries at the acoustic level. The template matching module comprises the cascade of a segmental variant of dynamic time warping and a self-similarity matrix comparison to further improve robustness to speech variability. This solution notably differs from more traditional train-and-test methods that, while shown to be very accurate, rely upon the availability of large amounts of linguistic resources. We evaluate our framework on different parameterizations of the speech templates: raw MFCC features, Gaussian posteriorgrams, and French and English phonetic posteriorgrams output by two different state-of-the-art phoneme recognizers.
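
    As a rough illustration of template matching at the acoustic level, the sketch below locates a query inside a longer utterance with plain subsequence dynamic time warping. It is not the segmental variant or the self-similarity matrix comparison described in the abstract; the function name `subsequence_dtw`, the Euclidean frame distance, and the toy one-dimensional features are all assumptions made for the example.

```python
import math


def subsequence_dtw(query, target):
    """Align `query` against any contiguous region of `target`.

    Both arguments are sequences of equal-length feature vectors.
    Returns (best_cost, end), where `end` is the number of target
    frames consumed by the best-matching region (1-based end index).
    Hypothetical helper: a textbook subsequence DTW, not the
    segmental variant used in the paper.
    """
    n, m = len(query), len(target)
    inf = math.inf
    # cost[i][j]: best cost of aligning query[:i] with a region ending at target[j-1].
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    # The match may start anywhere in the target, so row 0 is free.
    for j in range(m + 1):
        cost[0][j] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = math.dist(query[i - 1], target[j - 1])
            cost[i][j] = d + min(cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1])
    # The match may also end anywhere in the target.
    end = min(range(1, m + 1), key=lambda j: cost[n][j])
    return cost[n][end], end


# Toy usage: the query occurs exactly in frames 2..4 of the target.
query = [[0.0], [1.0], [2.0]]
target = [[5.0], [0.0], [1.0], [2.0], [5.0]]
best_cost, end = subsequence_dtw(query, target)
```

    In practice the frames would be MFCC vectors or posteriorgrams rather than scalars, and a detection threshold on the returned cost would decide whether the query is present.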

    L'adaptation thématique d'un modèle de langue fait-elle apparaître des mots thématiques?

    Whereas topic-based adaptation of language models (LM) is claimed to increase the accuracy of topic-specific words within automatic speech recognition, this paper investigates why this expectation is not always met. After outlining the mechanisms of LM adaptation and automatic speech recognition, diagnostic elements are proposed along with solutions. In addition to better accuracy on topic-specific words, results show better graph error rates and word error rates on a set of spoken documents covering various topics.

    Audio Event Detection in Movies using Multiple Audio Words and Contextual Bayesian Networks

    This article investigates a novel use of the well-known audio word representation to detect specific audio events, namely gunshots and explosions, with the aim of gaining robustness to soundtrack variability in Hollywood movies. An audio stream is processed as a sequence of stationary segments. Each segment is described by one or several audio words obtained by applying product quantization to standard features. Such a representation using multiple audio words constructed via product quantization is one of the novelties described in this work. Based on this representation, Bayesian networks are used to exploit contextual information in order to detect audio events. Experiments are performed on a comprehensive set of 15 movies, made publicly available. Results are comparable to the state-of-the-art results obtained on the same dataset but show increased robustness to decision thresholds, although this limits the range of possible operating points in some conditions. Late fusion provides a solution to this issue.
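
    To make the idea of multiple audio words concrete, the sketch below encodes one feature vector with product quantization: the vector is split into sub-vectors and each sub-vector is replaced by the index of its nearest centroid. The function name `pq_encode` and the tiny hand-written codebooks are assumptions for illustration; in a real system the codebooks would be learned (for instance with k-means) and the paper's exact features and settings are not reproduced here.

```python
import math


def pq_encode(vector, codebooks):
    """Encode `vector` as one centroid index per sub-space.

    `codebooks` holds one list of centroids per sub-space; the vector
    length is assumed to divide evenly among the sub-spaces.
    Hypothetical helper, not the authors' implementation.
    """
    n_sub = len(codebooks)
    if len(vector) % n_sub != 0:
        raise ValueError("vector length must be a multiple of the number of sub-spaces")
    sub_dim = len(vector) // n_sub
    codes = []
    for s, codebook in enumerate(codebooks):
        sub = vector[s * sub_dim:(s + 1) * sub_dim]
        # Index of the nearest centroid in this sub-space (Euclidean distance).
        nearest = min(range(len(codebook)), key=lambda k: math.dist(sub, codebook[k]))
        codes.append(nearest)
    return tuple(codes)


# Toy codebooks: two sub-spaces of dimension 2, two centroids each.
codebooks = [
    [[0.0, 0.0], [1.0, 1.0]],
    [[0.0, 0.0], [5.0, 5.0]],
]
words = pq_encode([0.9, 1.2, 4.0, 6.0], codebooks)
```

    Each index in the resulting tuple plays the role of one audio word for the segment, so a single segment yields as many words as there are sub-spaces.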

    The Model of Reading: Modelling principles, Definitions, Schema, Alignments

    READ-IT Model of Reading, V2. Executive summary: This technical report introduces the data model developed to address the systematic collection and use of reading experiences in the READ-IT project. The model of reading presented in this document is meant to inform the development of the READ-IT database and tools. This document describes the methodological approach and design principles adopted in the development of the model of reading. Furthermore, this technical report describes the content of the first version of the data model of the reading experience, including a preliminary analysis of the alignments between the READ-IT model of reading and CIDOC-CRM, FRBRoo, FoaF and Schema.org.

    Irisa MediaEval 2011 Spoken Web Search System

    These working notes describe the main aspects of the IRISA submission to the Spoken Web Search task of the MediaEval 2011 campaign. We test a language-independent audio-only system based on a combination of template matching techniques. A brief overview of the main components of the architecture is followed by a report on the evaluation on the development and test data provided by the organizers.

    De la détection d'évènements sonores violents par SVM dans les films

    This article studies the behaviour of a state-of-the-art support vector machine approach to audio event detection, applied to violent event detection in movies. The events we are trying to detect are screams, gunshots and explosions. Contrary to other studies, we show that the state-of-the-art approach does not lead to good results on this task. A study of the distribution of samples into subsets in a cross-validation protocol helps explain those results and highlights a generalisation problem due to the polymorphism of the considered classes. This polymorphism is demonstrated by computing the divergence between the samples of the test database and those of the training database.

    Investigating domain-independent NLP techniques for precise target selection in video hyperlinking

    Automatic generation of hyperlinks in multimedia video data is a subject of growing interest, as demonstrated by recent work undertaken in the framework of the Search and Hyperlinking task within the MediaEval benchmark initiative. In this paper, we compare NLP-based strategies for precise target selection in video hyperlinking exploiting speech material, with the goal of providing hyperlinks from a specified anchor to help information retrieval. We experimentally compare two approaches for selecting short portions of videos which are relevant and possibly complementary with respect to the anchor. The first approach exploits a bipartite graph relating utterances and words to find the most relevant utterances. The second one uses explicit topic segmentation, whether hierarchical or not, to select the target segments. Experimental results are reported on the MediaEval 2013 Search and Hyperlinking dataset, which consists of BBC videos, demonstrating the interest of hierarchical topic segmentation for precise target selection.
