503 research outputs found

    A perceptually based computational framework for the interpretation of spatial language

    Get PDF
    The goal of this work is to develop a semantic framework to underpin the development of natural language (NL) interfaces for 3 Dimensional (3-D) simulated environments. The thesis of this work is that the computational interpretation of language in such environments should be based on a framework that integrates a model of visual perception with a model of discourse. When interacting with a 3-D environment, users have two main goals the first is to move around in the simulated environment and the second is to manipulate objects in the environment. In order to interact with an object through language, users need to be able to refer to the object. There are many different types of referring expressions including definite descriptions, pronominals, demonstratives, one-anaphora, other-expressions, and locative-expressions Some of these expressions are anaphoric (e g , pronominals, oneanaphora, other-expressions). In order to computationally interpret these, it is necessary to develop, and implement, a discourse model. Interpreting locative expressions requires a semantic model for prepositions and a mechanism for selecting the user’s intended frame of reference. Finally, many of these expressions presuppose a visual context. In order to interpret them this context must be modelled and utilised. This thesis develops a perceptually grounded discourse-based computational model of reference resolution capable of handling anaphoric and locative expressions. There are three novel contributions in this framework a visual saliency algorithm, a semantic model for locative expressions containing projective prepositions, and a discourse model. The visual saliency algorithm grades the prominence of the objects in the user's view volume at each frame. This algorithm is based on the assumption that objects which are larger and more central to the user's view are more prominent than objects which are smaller or on the periphery of their view. The resulting saliency ratings for each frame are stored in a data structure linked to the NL system’s context model. This approach gives the system a visual memory that may be drawn upon in order to resolve references. The semantic model for locative expressions defines a computational algorithm for interpreting locatives that contain a projective preposition. Specifically, the prepositions in front of behind, to the right of, and to the left of. There are several novel components within this model. First, there is a procedure for handling the issue of frame of reference selection. Second, there is an algorithm for modelling the spatial templates of projective prepositions. This algonthm integrates a topological model with visual perceptual cues. This approach allows us to correctly define the regions described by projective preposition in the viewer-centred frame of reference, in situations that previous models (Yamada 1993, Gapp 1994a, Olivier et al 1994, Fuhr et al 1998) have found problematic. Thirdly, the abstraction used to represent the candidate trajectors of a locative expression ensures that each candidate is ascribed the highest rating possible. This approach guarantees that the candidate trajector that occupies the location with the highest applicability in the prepositions spatial template is selected as the locative’s referent. The context model extends the work of Salmon-Alt and Romary (2001) by integrating the perceptual information created by the visual saliency algonthm with a model of discourse. Moreover, the context model defines an interpretation process that provides an explicit account of how the visual and linguistic information sources are utilised when attributing a referent to a nominal expression. It is important to note that the context model provides the set of candidate referents and candidate trajectors for the locative expression interpretation algorithm. These are restncted to those objects that the user has seen. The thesis shows that visual salience provides a qualitative control in NL interpretation for 3-D simulated environments and captures interesting and significant effects such as graded judgments. Moreover, it provides an account for how object occlusion impacts on the semantics of projective prepositions that are canonically aligned with the front-back axis in the viewer-centred frame of reference

    The sensory screen: phenomenology of visual perception in early European avant-garde film

    Get PDF
    At the beginning of the twentieth century, certain artists, writers, and philosophers became intrigued by the profound ways in which filmic images could pervade aspects of modern thought and experience. For them, film had the potential to reveal radical new dimensions of sensory phenomena. The early development of avant-garde film-making in Europe is culturally crucial not only for its historical and conceptual context of creative transition, but also for its dynamic exploration of processes of visual perception. The central objective of this thesis is to expose and engage these profound perceptual issues within the specific sphere of graphic abstract film. The structural formation of the thesis entails the confluencing of material for analysis into a sequence of key areas comprising the central components of avant-garde cinematic visualisation. The visual implications of each area are analysed in specific depth, whilst acknowledging their respective interactivity. Significantly, the research applies analytic theories of phenomenology in order to focus incisively upon relevant early European avant-garde filmic imagery. The potential vitality of a phenomenological theorisation of early avant-garde film resides not only within their historical contemporaneity, but at the epistemological level of the mind's cognitive engagement with the realms of creative visualisation. It is a system of analysis which aims to establish a nuanced phenomenological theory of visual perception as a matter of prime sustenance to historically crucial cinematic art forms

    Roles of language and thought in conceptual representations

    Get PDF

    Dynamic adaptation of streamed real-time E-learning videos over the internet

    Get PDF
    Even though the e-learning is becoming increasingly popular in the academic environment, the quality of synchronous e-learning video is still substandard and significant work needs to be done to improve it. The improvements have to be brought about taking into considerations both: the network requirements and the psycho- physical aspects of the human visual system. One of the problems of the synchronous e-learning video is that the head-and-shoulder video of the instructor is mostly transmitted. This video presentation can be made more interesting by transmitting shots from different angles and zooms. Unfortunately, the transmission of such multi-shot videos will increase packet delay, jitter and other artifacts caused by frequent changes of the scenes. To some extent these problems may be reduced by controlled reduction of the quality of video so as to minimise uncontrolled corruption of the stream. Hence, there is a need for controlled streaming of a multi-shot e-learning video in response to the changing availability of the bandwidth, while utilising the available bandwidth to the maximum. The quality of transmitted video can be improved by removing the redundant background data and utilising the available bandwidth for sending high-resolution foreground information. While a number of schemes exist to identify and remove the background from the foreground, very few studies exist on the identification and separation of the two based on the understanding of the human visual system. Research has been carried out to define foreground and background in the context of e-learning video on the basis of human psychology. The results have been utilised to propose methods for improving the transmission of e-learning videos. In order to transmit the video sequence efficiently this research proposes the use of Feed- Forward Controllers that dynamically characterise the ongoing scene and adjust the streaming of video based on the availability of the bandwidth. In order to satisfy a number of receivers connected by varied bandwidth links in a heterogeneous environment, the use of Multi-Layer Feed-Forward Controller has been researched. This controller dynamically characterises the complexity (number of Macroblocks per frame) of the ongoing video sequence and combines it with the knowledge of availability of the bandwidth to various receivers to divide the video sequence into layers in an optimal way before transmitting it into network. The Single-layer Feed-Forward Controller inputs the complexity (Spatial Information and Temporal Information) of the on-going video sequence along with the availability of bandwidth to a receiver and adjusts the resolution and frame rate of individual scenes to transmit the sequence optimised to give the most acceptable perceptual quality within the bandwidth constraints. The performance of the Feed-Forward Controllers have been evaluated under simulated conditions and have been found to effectively regulate the streaming of real-time e-learning videos in order to provide perceptually improved video quality within the constraints of the available bandwidth

    Learning from Teacher's Eye Movement: Expertise, Subject Matter and Video Modeling

    Full text link
    How teachers' eye movements can be used to understand and improve education is the central focus of the present paper. Three empirical studies were carried out to understand the nature of teachers' eye movements in natural settings and how they might be used to promote learning. The studies explored 1) the relationship between teacher expertise and eye movement in the course of teaching, 2) how individual differences and the demands of different subjects affect teachers' eye movement during literacy and mathematics instruction, 3) whether including an expert's eye movement and hand information in instructional videos can promote learning. Each study looked at the nature and use of teacher eye movements from a different angle but collectively converge on contributions to answering the question: what can we learn from teachers' eye movements? The paper also contains an independent methodology chapter dedicated to reviewing and comparing methods of representing eye movements in order to determine a suitable statistical procedure for representing the richness of current and similar eye tracking data. Results show that there are considerable differences between expert and novice teachers' eye movement in a real teaching situation, replicating similar patterns revealed by past studies on expertise and gaze behavior in athletics and other fields. This paper also identified the mix of person-specific and subject-specific eye movement patterns that occur when the same teacher teaches different topics to the same children. The final study reports evidence that eye movement can be useful in teaching; by showing increased learning when learners saw an expert model's eye movement in a video modeling example. The implications of these studies regarding teacher education and instruction are discussed.PHDEducation & PsychologyUniversity of Michigan, Horace H. Rackham School of Graduate Studieshttps://deepblue.lib.umich.edu/bitstream/2027.42/145853/1/yizhenh_1.pd

    Semantic multimedia modelling & interpretation for annotation

    Get PDF
    The emergence of multimedia enabled devices, particularly the incorporation of cameras in mobile phones, and the accelerated revolutions in the low cost storage devices, boosts the multimedia data production rate drastically. Witnessing such an iniquitousness of digital images and videos, the research community has been projecting the issue of its significant utilization and management. Stored in monumental multimedia corpora, digital data need to be retrieved and organized in an intelligent way, leaning on the rich semantics involved. The utilization of these image and video collections demands proficient image and video annotation and retrieval techniques. Recently, the multimedia research community is progressively veering its emphasis to the personalization of these media. The main impediment in the image and video analysis is the semantic gap, which is the discrepancy among a user’s high-level interpretation of an image and the video and the low level computational interpretation of it. Content-based image and video annotation systems are remarkably susceptible to the semantic gap due to their reliance on low-level visual features for delineating semantically rich image and video contents. However, the fact is that the visual similarity is not semantic similarity, so there is a demand to break through this dilemma through an alternative way. The semantic gap can be narrowed by counting high-level and user-generated information in the annotation. High-level descriptions of images and or videos are more proficient of capturing the semantic meaning of multimedia content, but it is not always applicable to collect this information. It is commonly agreed that the problem of high level semantic annotation of multimedia is still far from being answered. This dissertation puts forward approaches for intelligent multimedia semantic extraction for high level annotation. This dissertation intends to bridge the gap between the visual features and semantics. It proposes a framework for annotation enhancement and refinement for the object/concept annotated images and videos datasets. The entire theme is to first purify the datasets from noisy keyword and then expand the concepts lexically and commonsensical to fill the vocabulary and lexical gap to achieve high level semantics for the corpus. This dissertation also explored a novel approach for high level semantic (HLS) propagation through the images corpora. The HLS propagation takes the advantages of the semantic intensity (SI), which is the concept dominancy factor in the image and annotation based semantic similarity of the images. As we are aware of the fact that the image is the combination of various concepts and among the list of concepts some of them are more dominant then the other, while semantic similarity of the images are based on the SI and concept semantic similarity among the pair of images. Moreover, the HLS exploits the clustering techniques to group similar images, where a single effort of the human experts to assign high level semantic to a randomly selected image and propagate to other images through clustering. The investigation has been made on the LabelMe image and LabelMe video dataset. Experiments exhibit that the proposed approaches perform a noticeable improvement towards bridging the semantic gap and reveal that our proposed system outperforms the traditional systems

    Audio-visual football video analysis, from structure detection to attention analysis

    Get PDF
    Sport video is an important video genre. Content-based sports video analysis attracts great interest from both industry and academic fields. A sports video is characterised by repetitive temporal structures, relatively plain contents, and strong spatio-temporal variations, such as quick camera switches and swift local motions. It is necessary to develop specific techniques for content-based sports video analysis to utilise these characteristics. For an efficient and effective sports video analysis system, there are three fundamental questions: (1) what are key stories for sports videos; (2) what incurs viewer’s interest; and (3) how to identify game highlights. This thesis is developed around these questions. We approached these questions from two different perspectives and in turn three research contributions are presented, namely, replay detection, attack temporal structure decomposition, and attention-based highlight identification. Replay segments convey the most important contents in sports videos. It is an efficient approach to collect game highlights by detecting replay segments. However, replay is an artefact of editing, which improves with advances in video editing tools. The composition of replay is complex, which includes logo transitions, slow motions, viewpoint switches and normal speed video clips. Since logo transition clips are pervasive in game collections of FIFA World Cup 2002, FIFA World Cup 2006 and UEFA Championship 2006, we take logo transition detection as an effective replacement of replay detection. A two-pass system was developed, including a five-layer adaboost classifier and a logo template matching throughout an entire video. The five-layer adaboost utilises shot duration, average game pitch ratio, average motion, sequential colour histogram and shot frequency between two neighbouring logo transitions, to filter out logo transition candidates. Subsequently, a logo template is constructed and employed to find all transition logo sequences. The precision and recall of this system in replay detection is 100% in a five-game evaluation collection. An attack structure is a team competition for a score. Hence, this structure is a conceptually fundamental unit of a football video as well as other sports videos. We review the literature of content-based temporal structures, such as play-break structure, and develop a three-step system for automatic attack structure decomposition. Four content-based shot classes, namely, play, focus, replay and break were identified by low level visual features. A four-state hidden Markov model was trained to simulate transition processes among these shot classes. Since attack structures are the longest repetitive temporal unit in a sports video, a suffix tree is proposed to find the longest repetitive substring in the label sequence of shot class transitions. These occurrences of this substring are regarded as a kernel of an attack hidden Markov process. Therefore, the decomposition of attack structure becomes a boundary likelihood comparison between two Markov chains. Highlights are what attract notice. Attention is a psychological measurement of “notice ”. A brief survey of attention psychological background, attention estimation from vision and auditory, and multiple modality attention fusion is presented. We propose two attention models for sports video analysis, namely, the role-based attention model and the multiresolution autoregressive framework. The role-based attention model is based on the perception structure during watching video. This model removes reflection bias among modality salient signals and combines these signals by reflectors. The multiresolution autoregressive framework (MAR) treats salient signals as a group of smooth random processes, which follow a similar trend but are filled with noise. This framework tries to estimate a noise-less signal from these coarse noisy observations by a multiple resolution analysis. Related algorithms are developed, such as event segmentation on a MAR tree and real time event detection. The experiment shows that these attention-based approach can find goal events at a high precision. Moreover, results of MAR-based highlight detection on the final game of FIFA 2002 and 2006 are highly similar to professionally labelled highlights by BBC and FIFA

    The language-cognition interface in bilinguals: an evaluation of the conceptual transfer hypothesis

    Get PDF
    Praca podejmuje temat wpływu języka na kategorie konceptualne u osób dwujęzycznych. Poruszana problematyka omawiana jest na podstawie najnowszych teorii pamięci bilingwalnej oraz stworzonej na ich kanwie hipotezy transferu konceptualnego autorstwa Scotta Jarvisa i Anety Pavlenko. Część teoretyczna przedstawia strukturę pamięci bilingwalnej, zwanej również słownikiem wewnętrznym, modele sfery konceptualnej oraz istniejące pomiędzy poziomem językowym i konceptualnym zależności. Te ostatnie rozpatrywane są przez pryzmat teorii względności językowej i jej zmodyfikowanych wersji: teorii „myślenie dla mowy” (ang. Thinking for Speaking) Dana Slobina, jak również hipotezy Christiane von Stutterheim. Ostatnim elementem dyskusji jest prezentacja hipotezy transferu konceptualnego oraz jej ocena pod kątem merytorycznym i empirycznym. Część badawcza przedstawia dwa projekty zrealizowane zgodnie z zaleceniami autorów hipotezy transferu konceptualnego. Projekt 1. dotyczy kategoryzacji semantycznej oraz niewerbalnej. Badane kategorie semantyczne oparte są na eksplikacjach Anny Wierzbickiej i dotyczą relacji międzyludzkich (przyjaciel, friend, kolega itd.). Projekt 2. to analiza ram konceptualizacyjnych pod kątem wydarzeń przedstawiających ruch ukierunkowany oraz konstrukcji narracji w pisemnych relacjach z obejrzanego filmu animowanego. Uzyskane dane w języku polskim i angielskim stanowią podstawę wniosków, które zaprezentowano w ostatnim rozdziale pracy. Badania przeprowadzono w Polsce i krajach anglojęzycznych (w Anglii i Irlandii). W skład badanych populacji weszli monolingwalni Polacy i rodzimi użytkownicy języka angielskiego (ang. native speakers) oraz Polacy posługujący się językiem angielskim w warunkach naturalnych (emigranci) i szkolnych (studenci filologii angielskiej). Każda z grup monolingwalnych uczestniczyła w sesjach badawczych dotyczących odpowiednio języka polskiego i angielskiego. Osoby dwujęzyczne testowane były w obydwu językach. Dane zebrano za pomocą scenariuszy sytuacyjnych, kwestionariuszy, oceny podobieństwa, a także opisu narracyjnego krótkometrażowego filmu animowanego pt. Katedra w reżyserii Tomasza Bagińskiego
    corecore