Towards auto-documentary: Tracking the evolution of news stories
News videos constitute an important source of information for tracking and documenting important events. In these videos, news stories are often accompanied by short video shots that tend to be repeated during the course of the event. Automatic detection of such repetitions is essential for creating auto-documentaries and for alleviating the limitations of traditional textual topic detection methods. In this paper, we propose novel methods for detecting and tracking the evolution of news stories over time. The proposed method exploits both visual cues and textual information to summarize evolving news stories. Experiments are carried out on the TRECVID data set, consisting of 120 hours of news videos from two different channels.
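A minimal sketch of the repeated-shot detection idea described above, assuming shots are represented by simple colour-histogram signatures (a hypothetical representation; the paper combines visual cues with textual information, neither of which is reproduced here):

```python
# Hedged sketch: flag repeated news shots by comparing per-shot
# signature vectors with cosine similarity. The signatures and the
# threshold are illustrative assumptions, not the paper's method.

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = sum(x * x for x in a) ** 0.5
    nb = sum(y * y for y in b) ** 0.5
    return dot / (na * nb) if na and nb else 0.0

def find_repeated_shots(signatures, threshold=0.95):
    """Return index pairs (i, j) of shots whose signatures match."""
    pairs = []
    for i in range(len(signatures)):
        for j in range(i + 1, len(signatures)):
            if cosine(signatures[i], signatures[j]) >= threshold:
                pairs.append((i, j))
    return pairs

# Toy example: shots 0 and 2 share a near-identical signature.
shots = [[0.9, 0.1, 0.0], [0.1, 0.8, 0.1], [0.88, 0.12, 0.0]]
print(find_repeated_shots(shots))  # -> [(0, 2)]
```

In a real system the pairwise loop would be replaced by approximate nearest-neighbour search, since comparing every shot pair scales quadratically with the number of shots.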
Gesture in Automatic Discourse Processing
Computers cannot fully understand spoken language without access to the wide range of modalities that accompany speech. This thesis addresses the particularly expressive modality of hand gesture, and focuses on building structured statistical models at the intersection of speech, vision, and meaning. My approach is distinguished in two key respects. First, gestural patterns are leveraged to discover parallel structures in the meaning of the associated speech. This differs from prior work that attempted to interpret individual gestures directly, an approach that was prone to a lack of generality across speakers. Second, I present novel, structured statistical models for multimodal language processing, which enable learning about gesture in its linguistic context, rather than in the abstract. These ideas find successful application in a variety of language processing tasks: resolving ambiguous noun phrases, segmenting speech into topics, and producing keyframe summaries of spoken language. In all three cases, the addition of gestural features -- extracted automatically from video -- yields significantly improved performance over a state-of-the-art text-only alternative. This marks the first demonstration that hand gesture improves automatic discourse processing.
Multi-modal surrogates for retrieving and making sense of videos: is synchronization between the multiple modalities optimal?
Video surrogates can help people quickly make sense of the content of a video before downloading or seeking more detailed information. Visual and audio features of a video are primary information carriers and might become important components of video retrieval and video sense-making. In the past decades, most research and development efforts on video surrogates have focused on visual features of the video, and comparatively little work has been done on audio surrogates and their pros and cons in aiding users' retrieval and sense-making of digital videos. Even less work has been done on multi-modal surrogates, where more than one modality is employed for consuming the surrogates, for example the audio and visual modalities. This research examined the effectiveness of a number of multi-modal surrogates, and investigated whether synchronization between the audio and visual channels is optimal. A user study was conducted to evaluate six different surrogates on a set of six recognition and inference tasks to answer two main research questions: (1) How do automatically-generated multi-modal surrogates compare to manually-generated ones in video retrieval and video sense-making? and (2) Does synchronization between multiple surrogate channels enhance or inhibit video retrieval and video sense-making? Forty-eight participants took part in the study, in which the surrogates were measured on the time participants spent experiencing the surrogates, the time they spent on the tasks, their performance accuracy on the tasks, their confidence in their task responses, and their subjective ratings of the surrogates. On average, the uncoordinated surrogates were more helpful than the coordinated ones, but the manually-generated surrogates were only more helpful than the automatically-generated ones in terms of task completion time.
Participants' subjective ratings were more favorable for the coordinated surrogate C2 (Magic A + V) and the uncoordinated surrogate U1 (Magic A + Storyboard V) with respect to usefulness, usability, enjoyment, and engagement. The post-session questionnaire comments demonstrated participants' preference for the coordinated surrogates, but the comments also revealed the value of having uncoordinated sensory channels.
Deliverable D1.4 Visual, text and audio information analysis for hypervideo, final release
Having extensively evaluated the performance of the technologies included in the first release of the WP1 multimedia analysis tools, using content from the LinkedTV scenarios and by participating in international benchmarking activities, we made concrete decisions regarding the appropriateness and importance of each individual method or combination of methods. Combined with an updated list of information needs for each scenario, these decisions led to a new set of analysis requirements to be addressed by the final release of the WP1 analysis techniques. To this end, coordinated efforts in three directions, namely (a) improving a number of methods in terms of accuracy and time efficiency, (b) developing new technologies, and (c) defining synergies between methods for obtaining new types of information via multimodal processing, resulted in the final set of multimedia analysis methods for video hyperlinking. Moreover, the different analysis modules have been integrated into a web-based infrastructure, allowing fully automatic linking between the multitude of WP1 technologies and the overall LinkedTV platform.
CONTENT BASED RETRIEVAL OF LECTURE VIDEO REPOSITORY: LITERATURE REVIEW
Multimedia plays a significant role in communicating information, and a large number of multimedia repositories support the browsing, retrieval and delivery of video content. For higher education, using video as a tool for learning and teaching through multimedia applications holds considerable promise. Many universities adopt educational systems in which the teacher's lecture is video-recorded and made available to students with minimal post-processing effort. Since each video may cover many subjects, it is critical for an e-Learning environment to have content-based video search capabilities to meet diverse individual learning needs. The present paper reviews 120+ core research articles on content-based retrieval of lecture video repositories hosted on the cloud by government, academic and research organizations of India.
Social impact retrieval: measuring author influence on information retrieval
The increased presence of technologies collectively referred to as Web 2.0 means that the entire process of new media production and dissemination has moved away from an author-centric approach. Casual web users and browsers are increasingly able to play a more active role in the information creation process. This means that the traditional ways in which information sources may be validated and scored must adapt accordingly.
In this thesis we propose a new way to look at a user's contributions to the network in which they are present, using these interactions to provide a measure of authority and centrality for the user. This measure is then used to attribute a query-independent interest score to each of the contributions the author makes, enabling us to provide other users with relevant information which has been of greatest interest to a community of like-minded users. This is done through the development of two algorithms: AuthorRank and MessageRank.
We present two real-world user experiments focused on multimedia annotation and browsing systems that we built; these systems were novel in themselves, bringing together video and text browsing as well as free-text annotation. Using these systems as examples of real-world applications for our approaches, we then look at a larger-scale experiment based on the author and citation networks of a ten-year period (1997-2007) of the ACM SIGIR conference on information retrieval. We use the citation context of SIGIR publications as a proxy for annotations, constructing large social networks between authors. Against these networks we show the effectiveness of incorporating user-generated content, or annotations, to improve information retrieval.
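The abstract names AuthorRank without giving its update rule; the sketch below shows a standard PageRank-style power iteration over an author network, which is the family of algorithms such a score belongs to. The network, damping factor, and iteration count are illustrative assumptions, not the thesis's actual formulation:

```python
# Hedged sketch of a PageRank-style authority score over an author
# network, in the spirit of AuthorRank. Each author distributes a
# damped share of their rank to the authors they cite or annotate.

def authority_scores(links, damping=0.85, iterations=50):
    """links: dict mapping author -> list of authors they cite."""
    authors = set(links) | {a for tgts in links.values() for a in tgts}
    n = len(authors)
    rank = {a: 1.0 / n for a in authors}
    for _ in range(iterations):
        new = {a: (1 - damping) / n for a in authors}
        for src, tgts in links.items():
            if tgts:
                share = damping * rank[src] / len(tgts)
                for t in tgts:
                    new[t] += share
            else:  # dangling author: spread rank evenly
                for a in authors:
                    new[a] += damping * rank[src] / n
        rank = new
    return rank

# Toy network: alice is cited by both bob and carol.
net = {"alice": ["bob"], "bob": ["alice", "carol"], "carol": ["alice"]}
scores = authority_scores(net)
print(max(scores, key=scores.get))  # -> alice
```

MessageRank would presumably score individual contributions rather than authors, but the iteration structure over the interaction graph is analogous.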
Evaluation of the influence of personality types on performance of shared tasks in a collaborative environment
Computer Supported Cooperative Work (CSCW) is an area of computing that has been receiving much attention in recent years. Developments in groupware technology, such as MERL's DiamondTouch and Microsoft's Surface, have presented us with new, challenging and exciting ways to carry out group tasks. However, these groupware technologies present us with a novel area of research in the field of computing: multi-user Human-Computer Interaction (HCI). With multi-user HCI, we no longer have to cater for one person working on their own PC. We must now consider multiple users and their preferences as a group in order to design groupware applications that best suit the needs of that group.
In this thesis, we aim to identify how groups of two people (dyads), given their various personality types and preferences, work together on groupware technologies. We propose interface variants to both competitive and collaborative systems in an attempt to identify what aspects of an interface or task best suit the needs of the different dyads, maximising their performance and producing high levels of user satisfaction. To determine this, we introduce a series of user experiments carried out with 18 dyads and analyse their performance, behaviour and responses to each of 5 systems and their respective variants. Our research and user experiments were facilitated by the DiamondTouch, a collaborative, multi-user tabletop device.
Modelling the relationship between gesture motion and meaning
There are many ways to say "Hello," be it a wave, a nod, or a bow. We greet others not only with words, but also with our bodies. Embodied communication permeates our interactions. A fist bump, thumbs-up, or pat on the back can be even more meaningful than hearing "good job!" A friend crossing their arms with a scowl, turning away from you, or stiffening up can feel like a harsh rejection. Social communication is not exclusively linguistic, but is a multi-sensory affair. It's not that communication without these bodily cues is impossible, but it is impoverished. Embodiment is a fundamental human experience.
Expressing ourselves through our bodies provides a powerful channel through which we express a plethora of meta-social information. And integral to communication, expression, and social engagement is our utilization of conversational gesture. We use gestures to express extra-linguistic information, to emphasize our point, and to embody mental and linguistic metaphors that add depth and color to social interaction.
Compared to human-human conversation, the gesture behaviour of virtual humans is limited, depending on the approach taken to automate the performances of these characters. The generation of nonverbal behaviour for virtual humans can be broadly classified as either: 1) data-driven approaches that learn a mapping from aspects of the verbal channel, such as prosody, to gestures; or 2) rule-based approaches that are often tailored by designers for specific applications.
This thesis is an interdisciplinary exploration that bridges these two approaches, and brings data-driven analyses to observational gesture research. By marrying a rich history of gesture research in behavioral psychology with data-driven techniques, this body of work brings rigorous computational methods to gesture classification, analysis, and generation. It addresses how researchers can exploit computational methods to make virtual humans gesture with the same richness, complexity, and apparent effortlessness as you and I. Throughout this work the central focus is on metaphoric gestures. These gestures are capable of conveying rich, nuanced, multi-dimensional meaning, and raise several challenges in their generation, including establishing and interpreting a gesture's communicative meaning, and selecting a performance to convey it. As such, effectively utilizing these gestures remains an open challenge in virtual agent research. This thesis explores how metaphoric gestures are interpreted by an observer, how one can generate such rich gestures using a mapping between utterance meaning and gesture, as well as how one can use data-driven techniques to explore the mapping between utterance and metaphoric gestures.
The thesis begins in Chapter 1 by outlining the interdisciplinary space of gesture research in psychology and generation in virtual agents. It then presents several studies that address presupposed assumptions raised about the need for rich, metaphoric gestures and the risk of false implicature when gestural meaning is ignored in gesture generation. In Chapter 2, two studies on metaphoric gestures that embody multiple metaphors argue three critical points that inform the rest of the thesis: that people form rich inferences from metaphoric gestures, these inferences are informed by cultural context and, more importantly, that any approach to analyzing the relation between utterance and metaphoric gesture needs to take into account that multiple metaphors may be conveyed by a single gesture. A third study presented in Chapter 3 highlights the risk of false implicature and discusses this in the context of current subjective evaluations of the qualitative influence of gesture on viewers.
Chapters 4 and 5 then present a data-driven analysis approach to recovering an interpretable explicit mapping from utterance to metaphor. The approach described in detail in Chapter 4 clusters gestural motion and relates those clusters to the semantic analysis of the associated utterance. Then, Chapter 5 demonstrates how this approach can be used both as a framework for data-driven techniques in the study of gesture and as the basis of a gesture generation approach for virtual humans.
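The cluster-then-relate analysis described above can be sketched as follows, under loud assumptions: the motion features, centroids, and metaphor labels here are toy placeholders, not the thesis's actual data or clustering method:

```python
# Hedged sketch: assign gesture-motion feature vectors to motion
# clusters (nearest centroid, Euclidean), then tally the semantic
# labels of the utterances co-occurring with each cluster.
from collections import Counter

def assign_to_centroids(features, centroids):
    """Nearest-centroid assignment (one k-means-style step)."""
    def dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    return [min(range(len(centroids)), key=lambda k: dist(f, centroids[k]))
            for f in features]

def cluster_label_profile(features, labels, centroids):
    """Map each motion cluster to counts of co-occurring semantics."""
    profile = {k: Counter() for k in range(len(centroids))}
    for cluster, label in zip(assign_to_centroids(features, centroids),
                              labels):
        profile[cluster][label] += 1
    return profile

# Toy data: 2-D motion features with utterance-level metaphor labels.
feats = [(0.1, 0.2), (0.2, 0.1), (0.9, 0.8), (0.8, 0.9)]
labels = ["container", "container", "journey", "journey"]
print(cluster_label_profile(feats, labels,
                            centroids=[(0.0, 0.0), (1.0, 1.0)]))
```

The resulting per-cluster label counts are one simple way to expose an interpretable mapping between motion clusters and utterance semantics, which is the kind of relation the chapters analyze.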
The framework used in the last two chapters ties together the main themes of this thesis: how we can use observational behavioral gesture research to inform data-driven analysis methods, how embodied metaphor relates to fine-grained gestural motion, and how to exploit this relationship to generate rich, communicatively nuanced gestures on virtual agents. While gestures show huge variation, the goal of this thesis is to start to characterize and codify that variation using modern data-driven techniques.
The final chapter of this thesis reflects on the many challenges and obstacles the field of gesture generation continues to face. The potential for applications of virtual agents to have broad impacts on our daily lives increases with the growing pervasiveness of digital interfaces, technical breakthroughs, and collaborative interdisciplinary research efforts. It concludes with an optimistic vision of applications for virtual agents with deep models of non-verbal social behaviour and their potential to encourage multi-disciplinary collaboration.
CHORUS Deliverable 2.1: State of the Art on Multimedia Search Engines
Based on the information provided by European projects and national initiatives related to multimedia search, as well as by domain experts who participated in the CHORUS think-tanks and workshops, this document reports on the state of the art in multimedia content search from a technical and socio-economic perspective.
The technical perspective includes an up-to-date view of content-based indexing and retrieval technologies, multimedia search in the context of mobile devices and peer-to-peer networks, and an overview of current evaluation and benchmarking initiatives to measure the performance of multimedia search engines.
From a socio-economic perspective, we take stock of the impact and legal consequences of these technical advances and point out future directions of research.