92 research outputs found

    A framework for dialogue detection in movies

    In this paper, we investigate a novel framework for dialogue detection based on indicator functions. An indicator function specifies whether a particular actor is present at each time instant. Two dialogue detection rules are developed and assessed. The first rule compares the value of the cross-correlation function at zero time lag against a threshold. The second rule compares the cross-power in a particular frequency band against a threshold. Experiments are carried out to validate the feasibility of the aforementioned dialogue detection rules using ground-truth indicator functions determined by human observers from six different movies. A total of 25 dialogue scenes and another 8 non-dialogue scenes are employed. The probabilities of false alarm and detection are estimated by cross-validation, where 70% of the available scenes are used to learn the thresholds employed in the dialogue detection rules and the remaining 30% are used for testing. Almost perfect dialogue detection is reported for every distinct threshold. © Springer-Verlag Berlin Heidelberg 2006
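The first rule described above amounts to thresholding the normalized cross-correlation of two actor indicator functions at zero time lag. A minimal sketch, assuming binary per-instant indicator functions; the function name, example data, and threshold value are illustrative, not taken from the paper (the paper learns its thresholds from training scenes):

```python
import numpy as np

def zero_lag_xcorr(x, y):
    """Normalized cross-correlation of two indicator functions at zero time lag."""
    x = np.asarray(x, dtype=float) - np.mean(x)
    y = np.asarray(y, dtype=float) - np.mean(y)
    denom = np.sqrt(np.sum(x * x) * np.sum(y * y))
    return 0.0 if denom == 0.0 else float(np.sum(x * y) / denom)

# Hypothetical indicator functions: 1 means the actor speaks at that instant.
# In a dialogue the two actors alternate speaking, so the zero-lag value is
# strongly negative for this pair.
a = np.array([1, 1, 1, 0, 0, 0, 1, 1, 0, 0])
b = np.array([0, 0, 0, 1, 1, 1, 0, 0, 1, 1])
r0 = zero_lag_xcorr(a, b)
is_dialogue = abs(r0) > 0.5   # illustrative threshold, not the learned one
```

For perfectly alternating speakers, as above, the mean-centred functions are exact negatives of each other, so the statistic reaches its extreme value.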

    A neural network approach to audio-assisted movie dialogue detection

    A novel framework for audio-assisted dialogue detection based on indicator functions and neural networks is investigated. An indicator function specifies whether an actor is present at a particular time instant. The cross-correlation function of a pair of indicator functions and the magnitude of the corresponding cross-power spectral density are fed as input to neural networks for dialogue detection. Several types of artificial neural networks, including multilayer perceptrons, voted perceptrons, radial basis function networks, support vector machines, and particle swarm optimization-based multilayer perceptrons, are tested. Experiments are carried out to validate the feasibility of the aforementioned approach using ground-truth indicator functions determined by human observers on 6 different movies. A total of 41 dialogue instances and another 20 non-dialogue instances are employed. The average detection accuracy achieved is high, ranging between 84.78%±5.499% and 91.43%±4.239%.
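The network input described above can be sketched as follows, assuming discrete-time binary indicator functions. The cross-power spectral density is estimated here directly from the FFT rather than by whatever estimator the paper used, and all names are illustrative:

```python
import numpy as np

def dialogue_features(x, y):
    """Network input for one pair of indicator functions: the full
    cross-correlation sequence concatenated with the magnitude of the
    cross-power spectral density (here estimated directly from the FFT)."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    xcorr = np.correlate(x - x.mean(), y - y.mean(), mode="full")
    cpsd_mag = np.abs(np.conj(np.fft.rfft(x)) * np.fft.rfft(y))
    return np.concatenate([xcorr, cpsd_mag])

# Hypothetical alternating-speaker indicator functions of length 10:
# the feature vector has 2*10-1 correlation lags plus 10//2+1 spectral bins.
x = np.array([1, 1, 1, 0, 0, 0, 1, 1, 0, 0])
feats = dialogue_features(x, 1 - x)
```

A vector of this form would then be the per-scene input to any of the classifiers listed in the abstract.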

    Audio-assisted movie dialogue detection

    An audio-assisted system is investigated that detects whether a movie scene is a dialogue or not. The system is based on actor indicator functions, that is, functions that specify whether an actor speaks at a certain time instant. In particular, the cross-correlation and the magnitude of the corresponding cross-power spectral density of a pair of indicator functions are input to various classifiers, such as voted perceptrons, radial basis function networks, random trees, and support vector machines, for dialogue/non-dialogue detection. To boost classifier efficiency, AdaBoost is also exploited. The aforementioned classifiers are trained using ground-truth indicator functions determined by human annotators for 41 dialogue and another 20 non-dialogue audio instances. For testing, actual indicator functions are derived by applying audio activity detection and actor clustering to audio recordings. 23 instances are randomly chosen among the aforementioned 41 dialogue instances, 17 of which correspond to dialogue scenes and 6 to non-dialogue ones. Accuracy ranging between 0.739 and 0.826 is reported. © 2008 IEEE
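AdaBoost, mentioned above as a way to boost classifier efficiency, re-weights the training instances after each round so that later weak learners concentrate on the examples earlier ones misclassified, and combines the weak decisions by weighted vote. A self-contained sketch with one-dimensional threshold stumps (a generic illustration, not the paper's actual configuration):

```python
import numpy as np

def adaboost_stumps(X, y, n_rounds=10):
    """Minimal AdaBoost with threshold stumps; y takes values in {-1, +1}."""
    n = len(y)
    w = np.full(n, 1.0 / n)          # sample weights, updated each round
    ensemble = []                    # (feature, threshold, polarity, alpha)
    for _ in range(n_rounds):
        best = None                  # weak learner with lowest weighted error
        for j in range(X.shape[1]):
            for t in np.unique(X[:, j]):
                for pol in (1, -1):
                    pred = np.where(pol * (X[:, j] - t) >= 0, 1, -1)
                    err = np.sum(w[pred != y])
                    if best is None or err < best[0]:
                        best = (err, j, t, pol, pred)
        err, j, t, pol, pred = best
        err = max(err, 1e-10)                      # avoid log(0)
        alpha = 0.5 * np.log((1 - err) / err)      # vote weight
        w *= np.exp(-alpha * y * pred)             # up-weight mistakes
        w /= w.sum()
        ensemble.append((j, t, pol, alpha))
    return ensemble

def adaboost_predict(ensemble, X):
    score = np.zeros(len(X))
    for j, t, pol, alpha in ensemble:
        score += alpha * np.where(pol * (X[:, j] - t) >= 0, 1, -1)
    return np.where(score >= 0, 1, -1)

# Toy separable data: one feature, labels -1/-1/+1/+1.
X = np.array([[0.1], [0.4], [2.0], [3.0]])
y = np.array([-1, -1, 1, 1])
ensemble = adaboost_stumps(X, y, n_rounds=3)
train_acc = float((adaboost_predict(ensemble, X) == y).mean())
```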

    Research and Practice on Fusion of Visual and Audio Perception

    With the rapid development of intelligent surveillance systems, monitoring data plays an increasingly important role in traffic, environmental, security, and other fields. Inspired by models of human perception, exploiting the complementary nature of audio and video data to perceive a scene has considerable research value. However, the resulting mass of surveillance data is increasingly difficult to search, which forces the development of more effective analysis methods that free people from repetitive labour. Audio-visual fusion perception therefore not only has significant theoretical research value but also broad application prospects. This thesis reviews the current state of the audio-visual fusion perception field and, building on a traditional video surveillance platform, designs an architecture for audio-visual fusion perception. Grounded in audio-visual content analysis, it studies a violence-scene analysis model based on audio-visual fusion perception. The main contributions are as follows: 1. Taking the audio-visual fusion surveillance platform as a starting point, designing... Degree: Master of Engineering. Department/Major: School of Information Science and Technology, Computer Science and Technology. Student number: 2302012115292

    Learning Multimodal Temporal Representation for Dubbing Detection in Broadcast Media

    Person discovery in the absence of prior identity knowledge requires accurate association of visual and auditory cues. In broadcast data, multimodal analysis faces additional challenges due to narrated voices over muted scenes or dubbing in different languages. To address these challenges, we define and analyze the problem of dubbing detection in broadcast data, which has not been explored before. We propose a method to represent the temporal relationship between the auditory and visual streams. This method consists of canonical correlation analysis to learn a joint multimodal space, and long short-term memory (LSTM) networks to model cross-modality temporal dependencies. Our contributions also include the introduction of a newly acquired dataset of face-speech segments from TV data, which we have made publicly available. The proposed method achieves promising performance on this real-world dataset as compared to several baselines.
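The canonical correlation analysis step described above learns linear projections of the auditory and visual features into a joint space where corresponding dimensions are maximally correlated. A minimal NumPy sketch of that step under the usual whitening-plus-SVD formulation (the LSTM temporal modelling is omitted, and the synthetic data with a shared latent component are purely illustrative):

```python
import numpy as np

def cca(X, Y, reg=1e-6):
    """CCA via whitening + SVD. Returns projection matrices for X and Y
    and the canonical correlations in descending order."""
    X = X - X.mean(axis=0)
    Y = Y - Y.mean(axis=0)
    n = X.shape[0]
    Cxx = X.T @ X / n + reg * np.eye(X.shape[1])   # regularised covariances
    Cyy = Y.T @ Y / n + reg * np.eye(Y.shape[1])
    Cxy = X.T @ Y / n

    def inv_sqrt(C):
        vals, vecs = np.linalg.eigh(C)
        return vecs @ np.diag(1.0 / np.sqrt(vals)) @ vecs.T

    Wx_white, Wy_white = inv_sqrt(Cxx), inv_sqrt(Cyy)
    U, s, Vt = np.linalg.svd(Wx_white @ Cxy @ Wy_white)
    return Wx_white @ U, Wy_white @ Vt.T, s

# Synthetic "audio" and "visual" features sharing one latent signal z:
rng = np.random.default_rng(0)
z = rng.standard_normal(500)
X = np.column_stack([z + 0.05 * rng.standard_normal(500),
                     rng.standard_normal(500)])
Y = np.column_stack([z + 0.05 * rng.standard_normal(500),
                     rng.standard_normal(500)])
Wx, Wy, corrs = cca(X, Y)   # corrs[0] close to 1: shared component recovered
```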

    Language Factors Modulate Audiovisual Speech Perception. A Developmental Perspective

    [eng] In most natural situations, adults look at the eyes of faces in search of social information (Yarbus, 1967). However, when the auditory information becomes unclear (e.g. speech-in- noise) they switch their attention towards the mouth of a talking face and rely on the audiovisual redundant cues to help them process the speech signal (Barenholtz, Mavica, & Lewkowicz, 2016; Buchan, Paré, & Munhall, 2007; Lansing & McConkie, 2003; Vatikiotis- Bateson, Eigsti, Yano, & Munhall, 1998). Likewise, young infants are sensitive to the correspondence between acoustic and visual speech (Bahrick & Lickliter, 2012), and they also rely on the talker’s mouth during the second half of the first year of life, putatively to help them acquire language by the time they start babbling (Lewkowicz & Hansen-Tift, 2012), and also to aid language differentiation in the case of bilingual infants (Pons, Bosch & Lewkowicz, 2015). The current set of studies provides a detailed examination of the audiovisual (AV) speech cues contribution to speech processing at different language development stages, through the analysis of selective attention patterns when processing speech from talking faces. To do so, I compared different linguistic experience factors (i.e. types of bilingualism – distance between bilinguals’ two languages –, language familiarity and language proficiency) that modulate audiovisual speech perception in first language acquisition during infancy (Studies 1 and 2), early childhood (Studies 3 and 4), and in second language (L2) learning during adulthood (Studies 5, 6 and 7).
The findings of the present work demonstrate that (1) perceiving speech audiovisually hampers close bilingual infants’ ability to discriminate their languages, that (2) 15-month-old and 5-year-old close language bilinguals rely more on the mouth cues of a talking face than do their distant bilingual peers, that (3) children’s attention to the mouth follows a clear temporal pattern: it is maximal at the beginning of the presentation and diminishes gradually as speech continues, and that (4) adults also rely more on the mouth speech cues when they perceive fluent non-native vs. native speech, regardless of their L2 expertise. All in all, these studies shed new light on the field of audiovisual speech perception and language processing by showing that selective attention to a talker’s eyes and mouth is a dynamic, information-seeking process, which is largely modulated by perceivers’ early linguistic experience and the task’s demands. These results suggest that selectively attending to the redundant speech cues of a talker’s mouth at the adequate moment enhances speech perception and is crucial for normal language development and speech processing, not only in infancy – during first language acquisition – but also in more advanced language stages in childhood, as well as in L2 learning during adulthood. Ultimately, they confirm that mouth reliance is greater in close bilingual environments, where the presence of two related languages increases the necessity for disambiguation and for keeping the language systems separate.
[cat] Selectively attending to a talker’s mouth helps us benefit from audiovisual information and better process the speech signal when the auditory signal becomes unclear. Likewise, infants also attend to the mouth during the second half of the first year of life, which helps them acquire their language(s).
This thesis examines the contribution of the audiovisual signal to speech processing through analyses of selective attention to a talking face. It compares different linguistic factors (types of bilingualism, language familiarity, and language proficiency) that modulate audiovisual speech perception during first language acquisition in early infancy (Studies 1 and 2), in school-age children (Studies 3 and 4), and during second language learning in adulthood (Studies 5, 6 and 7). The results show that (1) audiovisual speech perception hampers bilingual infants’ ability to discriminate their close languages, that (2) 15-month-old and 5-year-old close-language bilinguals attend more to the audiovisual cues of the mouth than distant-language bilinguals, that (3) children’s attention to the talker’s mouth is maximal at the beginning and decreases gradually as speech continues, and that (4) adults also rely more on the audiovisual cues of the mouth when perceiving a non-native language (L2), regardless of their proficiency in it. These studies show that selective attention to a talker’s face is a dynamic, information-seeking process modulated by early linguistic experience and the demands of the communicative situation. The results suggest that attending to the audiovisual cues of the mouth at the right moments is crucial for normal language development, both in early infancy and at more advanced language stages, as well as in second language learning. Finally, these results confirm that the strategy of relying on audiovisual cues is used to a greater extent in close bilingual environments, where the presence of two related languages increases the need for disambiguation.

    Depression and anxiety in the postnatal period : an examination of mother–infant interactions and infants’ language development

    Infancy is a time period associated with significant and rapid social-emotional and cognitive development. Environmental influences, particularly the quality of the mother–infant interaction, assist in shaping these early capacities. Maternal factors such as depression and anxiety can have a negative impact on a mother’s sensitivity towards her infant and indirectly compromise child developmental outcomes. However, little is known about the impact of depression and anxiety on communicative interactions and language outcomes in young infants. This thesis reports a longitudinal study whose primary objective was to examine the mechanisms through which maternal depression and anxiety influence infant language development via the quantity and quality of mother–infant interactions. The second objective was to evaluate the effectiveness of a video feedback intervention aimed at promoting maternal responsiveness, a construct that captures the quality of early mother–infant interactions. To address these objectives, this longitudinal study followed a sample of mother–infant dyads in which the mothers were or were not affected by anxiety and depression symptoms, between the infants’ ages of 6 to 18 months. The study included four components that measured the quantity and quality of the mother–infant interactions and infant developmental outcomes between groups and across time. The first component of the longitudinal study involved home recordings examining the quantity of maternal speech input to the infants at 6 and 12 months of age. The second component involved the assessment of infants’ lexical abilities at 18 months of age. The third component consisted of assessments of the quality of mother–infant interactions at 9 and 12 months. The final component involved the evaluation of a short intervention aimed at promoting maternal responsiveness within mother–infant interactions.
Findings demonstrated that maternal depression and anxiety affect infants’ early lexical abilities via both the quantity and quality of mother–infant interactions. These results suggest that variability in mothers’ emotional health influences infants’ home language experience, the concurrent frequency of vocalisations, and their later vocabulary size and lexical processing efficiency at 18 months. Maternal responsiveness, a measure of the quality of mother–infant interactions, emerged as the strongest predictor of infant vocabulary size.

    MUSICAL COMPOSITION FOCUSING ON THE QUALITY OF PRESENCE IN PERFORMANCE

    This practice-based research into the quality of presence in performance explores a compositional approach that originates from the question of what might lead a person to seek musical or sounding utterance. It aims at opening the awareness-space towards a listening not only to the musical-acoustic event, but to the performer as a whole. Consequently, different forms of notation and processes of rehearsing that address the psycho-physical constitution of a performer are investigated; a strong focus lies on the sensorimotor aspect of playing an instrument. The portfolio comprises fourteen pieces (for soloists, chamber ensembles and orchestra) as well as four collaborative projects with performance artists. Most of the pieces have been performed live: documentation on CD and DVD is included. The written part of the thesis provides a commentary on the process of bringing these pieces into being. In particular, issues of notation and rehearsal are addressed here, which are of special concern as to the transmission of conceptions regarding presence, embodiment and kinaesthetic sensitivities. I explain how the body of compositions deals with various notions of listening: receptive listening and - in the chapter on the orchestral piece spun yarn - listening as a sense of touch as well as listening in wonder. Illustrated by several performance projects, I outline the concept of the audience as witness rather than as observer. Additionally, I describe how I use imagery to inscribe possible stimuli for musical or sounding utterance into my compositions. To demonstrate how this research contributes to new knowledge in the field of musical composition, I compare it with similar yet different positions exemplified by Mauricio Kagel's "instrumental theatre" as well as Helmut Lachenmann's "musique concrète instrumentale" and place it against more recent trends and developments.
These evaluations will show that there is no other approach to the quality of presence within musical composition coinciding exactly with mine.