92 research outputs found

    A framework for dialogue detection in movies

    In this paper, we investigate a novel framework for dialogue detection based on indicator functions. An indicator function specifies whether a particular actor is present at each time instant. Two dialogue detection rules are developed and assessed. The first rule compares the value of the cross-correlation function at zero time lag against a threshold. The second rule compares the cross-power in a particular frequency band against a threshold. Experiments are carried out to validate the feasibility of the aforementioned dialogue detection rules using ground-truth indicator functions determined by human observers from six different movies. A total of 25 dialogue scenes and another 8 non-dialogue scenes are employed. The probabilities of false alarm and detection are estimated by cross-validation, where 70% of the available scenes are used to learn the thresholds employed in the dialogue detection rules and the remaining 30% are used for testing. Almost perfect dialogue detection is reported for every distinct threshold. © Springer-Verlag Berlin Heidelberg 2006
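The first rule described above amounts to thresholding the normalized cross-correlation of two actor indicator functions at zero time lag. A minimal sketch, assuming binary per-instant indicator functions; the function name, example data, and threshold value are illustrative, not taken from the paper (the paper learns its thresholds from training scenes):

```python
import numpy as np

def zero_lag_xcorr(x, y):
    """Normalized cross-correlation of two indicator functions at zero time lag."""
    x = np.asarray(x, dtype=float) - np.mean(x)
    y = np.asarray(y, dtype=float) - np.mean(y)
    denom = np.sqrt(np.sum(x * x) * np.sum(y * y))
    return 0.0 if denom == 0.0 else float(np.sum(x * y) / denom)

# Hypothetical indicator functions: 1 means the actor speaks at that instant.
# In a dialogue the two actors alternate speaking, so the zero-lag value is
# strongly negative for this pair.
a = np.array([1, 1, 1, 0, 0, 0, 1, 1, 0, 0])
b = np.array([0, 0, 0, 1, 1, 1, 0, 0, 1, 1])
r0 = zero_lag_xcorr(a, b)
is_dialogue = abs(r0) > 0.5   # illustrative threshold, not the learned one
```

For perfectly alternating speakers, as above, the mean-centred functions are exact negatives of each other, so the statistic reaches its extreme value.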

    A neural network approach to audio-assisted movie dialogue detection

    A novel framework for audio-assisted dialogue detection based on indicator functions and neural networks is investigated. An indicator function specifies whether an actor is present at a particular time instant. The cross-correlation function of a pair of indicator functions and the magnitude of the corresponding cross-power spectral density are fed as input to neural networks for dialogue detection. Several types of artificial neural networks, including multilayer perceptrons, voted perceptrons, radial basis function networks, support vector machines, and particle swarm optimization-based multilayer perceptrons, are tested. Experiments are carried out to validate the feasibility of the aforementioned approach using ground-truth indicator functions determined by human observers on 6 different movies. A total of 41 dialogue instances and another 20 non-dialogue instances are employed. The average detection accuracy achieved is high, ranging between 84.78%±5.499% and 91.43%±4.239%.
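The network input described above can be sketched as follows, assuming discrete-time binary indicator functions. The cross-power spectral density is estimated here directly from the FFT rather than by whatever estimator the paper used, and all names are illustrative:

```python
import numpy as np

def dialogue_features(x, y):
    """Network input for one pair of indicator functions: the full
    cross-correlation sequence concatenated with the magnitude of the
    cross-power spectral density (here estimated directly from the FFT)."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    xcorr = np.correlate(x - x.mean(), y - y.mean(), mode="full")
    cpsd_mag = np.abs(np.conj(np.fft.rfft(x)) * np.fft.rfft(y))
    return np.concatenate([xcorr, cpsd_mag])

# Hypothetical alternating-speaker indicator functions of length 10:
# the feature vector has 2*10-1 correlation lags plus 10//2+1 spectral bins.
x = np.array([1, 1, 1, 0, 0, 0, 1, 1, 0, 0])
feats = dialogue_features(x, 1 - x)
```

A vector of this form would then be the per-scene input to any of the classifiers listed in the abstract.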

    Audio-assisted movie dialogue detection

    An audio-assisted system is investigated that detects whether a movie scene is a dialogue or not. The system is based on actor indicator functions, that is, functions that specify whether an actor speaks at a certain time instant. In particular, the cross-correlation and the magnitude of the corresponding cross-power spectral density of a pair of indicator functions are input to various classifiers, such as voted perceptrons, radial basis function networks, random trees, and support vector machines, for dialogue/non-dialogue detection. To boost classifier efficiency, AdaBoost is also exploited. The aforementioned classifiers are trained using ground-truth indicator functions determined by human annotators for 41 dialogue and another 20 non-dialogue audio instances. For testing, actual indicator functions are derived by applying audio activity detection and actor clustering to audio recordings. 23 instances are randomly chosen among the aforementioned 41 dialogue instances, 17 of which correspond to dialogue scenes and 6 to non-dialogue ones. Accuracy ranging between 0.739 and 0.826 is reported. © 2008 IEEE
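AdaBoost, mentioned above as a way to boost classifier efficiency, re-weights the training instances after each round so that later weak learners concentrate on the examples earlier ones misclassified, and combines the weak decisions by weighted vote. A self-contained sketch with one-dimensional threshold stumps (a generic illustration, not the paper's actual configuration):

```python
import numpy as np

def adaboost_stumps(X, y, n_rounds=10):
    """Minimal AdaBoost with threshold stumps; y takes values in {-1, +1}."""
    n = len(y)
    w = np.full(n, 1.0 / n)          # sample weights, updated each round
    ensemble = []                    # (feature, threshold, polarity, alpha)
    for _ in range(n_rounds):
        best = None                  # weak learner with lowest weighted error
        for j in range(X.shape[1]):
            for t in np.unique(X[:, j]):
                for pol in (1, -1):
                    pred = np.where(pol * (X[:, j] - t) >= 0, 1, -1)
                    err = np.sum(w[pred != y])
                    if best is None or err < best[0]:
                        best = (err, j, t, pol, pred)
        err, j, t, pol, pred = best
        err = max(err, 1e-10)                      # avoid log(0)
        alpha = 0.5 * np.log((1 - err) / err)      # vote weight
        w *= np.exp(-alpha * y * pred)             # up-weight mistakes
        w /= w.sum()
        ensemble.append((j, t, pol, alpha))
    return ensemble

def adaboost_predict(ensemble, X):
    score = np.zeros(len(X))
    for j, t, pol, alpha in ensemble:
        score += alpha * np.where(pol * (X[:, j] - t) >= 0, 1, -1)
    return np.where(score >= 0, 1, -1)

# Toy separable data: one feature, labels -1/-1/+1/+1.
X = np.array([[0.1], [0.4], [2.0], [3.0]])
y = np.array([-1, -1, 1, 1])
ensemble = adaboost_stumps(X, y, n_rounds=3)
train_acc = float((adaboost_predict(ensemble, X) == y).mean())
```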

    Research and Practice on Fusion of Visual and Audio Perception

    With the rapid development of intelligent surveillance systems, monitoring data plays an increasingly important role in traffic, environmental, security, and other fields. Inspired by models of human perception, exploiting the complementary nature of audio and video data to perceive a scene has considerable research value. However, the resulting mass of surveillance data is increasingly difficult to search, which forces the development of more effective analysis methods that free people from repetitive labour. Audio-visual fusion perception therefore not only has significant theoretical research value but also broad application prospects. This thesis reviews the current state of the audio-visual fusion perception field and, building on a traditional video surveillance platform, designs an architecture for audio-visual fusion perception. Grounded in audio-visual content analysis, it studies a violence-scene analysis model based on audio-visual fusion perception. The main contributions are as follows: 1. Taking the audio-visual fusion surveillance platform as a starting point, designing... Degree: Master of Engineering. Department/Major: School of Information Science and Technology, Computer Science and Technology. Student number: 2302012115292

    Learning Multimodal Temporal Representation for Dubbing Detection in Broadcast Media

    Person discovery in the absence of prior identity knowledge requires accurate association of visual and auditory cues. In broadcast data, multimodal analysis faces additional challenges due to narrated voices over muted scenes or dubbing in different languages. To address these challenges, we define and analyze the problem of dubbing detection in broadcast data, which has not been explored before. We propose a method to represent the temporal relationship between the auditory and visual streams. This method consists of canonical correlation analysis to learn a joint multimodal space, and long short-term memory (LSTM) networks to model cross-modality temporal dependencies. Our contributions also include the introduction of a newly acquired dataset of face-speech segments from TV data, which we have made publicly available. The proposed method achieves promising performance on this real-world dataset as compared to several baselines.
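The canonical correlation analysis step described above learns linear projections of the auditory and visual features into a joint space where corresponding dimensions are maximally correlated. A minimal NumPy sketch of that step under the usual whitening-plus-SVD formulation (the LSTM temporal modelling is omitted, and the synthetic data with a shared latent component are purely illustrative):

```python
import numpy as np

def cca(X, Y, reg=1e-6):
    """CCA via whitening + SVD. Returns projection matrices for X and Y
    and the canonical correlations in descending order."""
    X = X - X.mean(axis=0)
    Y = Y - Y.mean(axis=0)
    n = X.shape[0]
    Cxx = X.T @ X / n + reg * np.eye(X.shape[1])   # regularised covariances
    Cyy = Y.T @ Y / n + reg * np.eye(Y.shape[1])
    Cxy = X.T @ Y / n

    def inv_sqrt(C):
        vals, vecs = np.linalg.eigh(C)
        return vecs @ np.diag(1.0 / np.sqrt(vals)) @ vecs.T

    Wx_white, Wy_white = inv_sqrt(Cxx), inv_sqrt(Cyy)
    U, s, Vt = np.linalg.svd(Wx_white @ Cxy @ Wy_white)
    return Wx_white @ U, Wy_white @ Vt.T, s

# Synthetic "audio" and "visual" features sharing one latent signal z:
rng = np.random.default_rng(0)
z = rng.standard_normal(500)
X = np.column_stack([z + 0.05 * rng.standard_normal(500),
                     rng.standard_normal(500)])
Y = np.column_stack([z + 0.05 * rng.standard_normal(500),
                     rng.standard_normal(500)])
Wx, Wy, corrs = cca(X, Y)   # corrs[0] close to 1: shared component recovered
```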

    Language Factors Modulate Audiovisual Speech Perception. A Developmental Perspective

    [eng] In most natural situations, adults look at the eyes of faces in search of social information (Yarbus, 1967). However, when the auditory information becomes unclear (e.g. speech-in- noise) they switch their attention towards the mouth of a talking face and rely on the audiovisual redundant cues to help them process the speech signal (Barenholtz, Mavica, & Lewkowicz, 2016; Buchan, Paré, & Munhall, 2007; Lansing & McConkie, 2003; Vatikiotis- Bateson, Eigsti, Yano, & Munhall, 1998). Likewise, young infants are sensitive to the correspondence between acoustic and visual speech (Bahrick & Lickliter, 2012), and they also rely on the talker’s mouth during the second half of the first year of life, putatively to help them acquire language by the time they start babbling (Lewkowicz & Hansen-Tift, 2012), and also to aid language differentiation in the case of bilingual infants (Pons, Bosch & Lewkowicz, 2015). The current set of studies provides a detailed examination of the audiovisual (AV) speech cues contribution to speech processing at different language development stages, through the analysis of selective attention patterns when processing speech from talking faces. To do so, I compared different linguistic experience factors (i.e. types of bilingualism – distance between bilinguals’ two languages –, language familiarity and language proficiency) that modulate audiovisual speech perception in first language acquisition during infancy (Studies 1 and 2), early childhood (Studies 3 and 4), and in second language (L2) learning during adulthood (Studies 5, 6 and 7).
The findings of the present work demonstrate that (1) perceiving speech audiovisually hampers close bilingual infants’ ability to discriminate their languages, that (2) 15-month-old and 5-year-old close language bilinguals rely more on the mouth cues of a talking face than do their distant bilingual peers, that (3) children’s attention to the mouth follows a clear temporal pattern: it is maximal at the beginning of the presentation and diminishes gradually as speech continues, and that (4) adults also rely more on the mouth speech cues when they perceive fluent non-native vs. native speech, regardless of their L2 expertise. All in all, these studies shed new light on the field of audiovisual speech perception and language processing by showing that selective attention to a talker’s eyes and mouth is a dynamic, information-seeking process, which is largely modulated by perceivers’ early linguistic experience and the task’s demands. These results suggest that selectively attending to the redundant speech cues of a talker’s mouth at the adequate moment enhances speech perception and is crucial for normal language development and speech processing, not only in infancy – during first language acquisition – but also in more advanced language stages in childhood, as well as in L2 learning during adulthood. Ultimately, they confirm that mouth reliance is greater in close bilingual environments, where the presence of two related languages increases the necessity for disambiguation and for keeping the language systems separate.
[cat] Selectively attending to a talker’s mouth helps us benefit from audiovisual information and better process the speech signal when the auditory signal becomes unclear. Likewise, infants also attend to the mouth during the second half of the first year of life, which helps them acquire their language(s).
This thesis examines the contribution of the audiovisual signal to speech processing through analyses of selective attention to a talking face. It compares different linguistic factors (types of bilingualism, language familiarity, and language proficiency) that modulate audiovisual speech perception during first language acquisition in early infancy (Studies 1 and 2), in school-age children (Studies 3 and 4), and during second language learning in adulthood (Studies 5, 6 and 7). The results show that (1) audiovisual speech perception hampers bilingual infants’ ability to discriminate their close languages, that (2) 15-month-old and 5-year-old close-language bilinguals attend more to the audiovisual cues of the mouth than distant-language bilinguals, that (3) children’s attention to the talker’s mouth is maximal at the beginning and decreases gradually as speech continues, and that (4) adults also rely more on the audiovisual cues of the mouth when perceiving a non-native language (L2), regardless of their proficiency in it. These studies show that selective attention to a talker’s face is a dynamic, information-seeking process modulated by early linguistic experience and the demands of the communicative situation. The results suggest that attending to the audiovisual cues of the mouth at the right moments is crucial for normal language development, both in early infancy and at more advanced language stages, as well as in second language learning. Finally, these results confirm that the strategy of relying on audiovisual cues is used to a greater extent in close bilingual environments, where the presence of two related languages increases the need for disambiguation.

    Depression and anxiety in the postnatal period : an examination of mother–infant interactions and infants’ language development

    Infancy is a time period associated with significant and rapid social-emotional and cognitive development. Environmental influences, particularly the quality of the mother–infant interaction, assist in shaping these early capacities. Maternal factors such as depression and anxiety can have a negative impact on a mother’s sensitivity towards her infant and indirectly compromise child developmental outcomes. However, little is known about the impact of depression and anxiety on communicative interactions and language outcomes in young infants. This thesis reports a longitudinal study whose primary objective was to examine the mechanisms through which maternal depression and anxiety influence infant language development via the quantity and quality of mother–infant interactions. The second objective was to evaluate the effectiveness of a video feedback intervention aimed at promoting maternal responsiveness, a construct that captures the quality of early mother–infant interactions. To address these objectives, this longitudinal study followed a sample of mother–infant dyads in which the mothers were or were not affected by anxiety and depression symptoms, between the infants’ ages of 6 to 18 months. The study included four components that measured the quantity and quality of the mother–infant interactions and infant developmental outcomes between groups and across time. The first component of the longitudinal study involved home recordings examining the quantity of maternal speech input to the infants at 6 and 12 months of age. The second component involved the assessment of infants’ lexical abilities at 18 months of age. The third component consisted of assessments of the quality of mother–infant interactions at 9 and 12 months. The final component involved the evaluation of a short intervention aimed at promoting maternal responsiveness within mother–infant interactions.
Findings demonstrated that maternal depression and anxiety affect infants’ early lexical abilities via both the quantity and quality of mother–infant interactions. These results suggest that variability in mothers’ emotional health influences infants’ home language experience, the concurrent frequency of vocalisations, and their later vocabulary size and lexical processing efficiency at 18 months. Maternal responsiveness, a measure of the quality of mother–infant interactions, emerged as the strongest predictor of infant vocabulary size.

    MUSICAL COMPOSITION FOCUSING ON THE QUALITY OF PRESENCE IN PERFORMANCE

    This practice-based research into the quality of presence in performance explores a compositional approach that originates from the question of what might lead a person to seek musical or sounding utterance. It aims at opening the awareness-space towards a listening not only to the musical-acoustic event, but to the performer as a whole. Consequently, different forms of notation and processes of rehearsing that address the psycho-physical constitution of a performer are investigated; a strong focus lies on the sensorimotor aspect of playing an instrument. The portfolio comprises fourteen pieces (for soloists, chamber ensembles and orchestra) as well as four collaborative projects with performance artists. Most of the pieces have been performed live: documentation on CD and DVD is included. The written part of the thesis provides a commentary on the process of bringing these pieces into being. In particular, issues of notation and rehearsal are addressed here, which are of special concern as to the transmission of conceptions regarding presence, embodiment and kinaesthetic sensitivities. I explain how the body of compositions deals with various notions of listening: receptive listening and - in the chapter on the orchestral piece spun yarn - listening as a sense of touch as well as listening in wonder. Illustrated by several performance projects, I outline the concept of the audience as witness rather than as observer. Additionally, I describe how I use imagery to inscribe possible stimuli for musical or sounding utterance into my compositions. To demonstrate how this research contributes to new knowledge in the field of musical composition, I compare it with similar yet different positions exemplified by Mauricio Kagel's "instrumental theatre" as well as Helmut Lachenmann's "musique concrète instrumentale" and place it against more recent trends and developments.
These evaluations will show that there is no other approach to the quality of presence within musical composition coinciding exactly with mine.