92 research outputs found

    Space-variant picture coding

    Get PDF
    PhD thesis. Space-variant picture coding techniques exploit the strong spatial non-uniformity of the human visual system in order to increase coding efficiency in terms of perceived quality per bit. This thesis extends space-variant coding research in two directions. The first of these directions is foveated coding. Past foveated coding research has been dominated by the single-viewer, gaze-contingent scenario. However, for research into the multi-viewer and probability-based scenarios, this thesis presents a missing piece: an algorithm for computing an additive multi-viewer sensitivity function based on an established eye resolution model, and, from this, a blur map that is optimal in the sense of discarding frequencies in least-noticeable-first order. Furthermore, for the application of a blur map, a novel algorithm is presented for the efficient computation of high-accuracy, smoothly space-variant Gaussian blurring, using a specialised filter bank which approximates perfect space-variant Gaussian blurring to arbitrarily high accuracy and at greatly reduced cost compared to the brute-force approach of employing a separate low-pass filter at each image location. The second direction is that of artificially increasing the depth of field of an image, an idea borrowed from photography, with the advantage of allowing an image to be reduced in bitrate while retaining or increasing overall aesthetic quality. Two synthetic depth-of-field algorithms are presented herein, with the desirable properties of aiming to mimic occlusion effects as they occur in natural blurring, and of handling any number of blurring and occlusion levels with the same level of computational complexity. The merits of this coding approach have been investigated by subjective experiments comparing it with single-viewer foveated image coding. The results found the depth-based preblurring to be generally significantly preferable to the same level of foveation blurring.
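    A minimal sketch of the filter-bank idea described above, assuming a fixed bank of Gaussian sigmas and per-pixel linear blending between the two nearest bank levels; the function name, the bank values and the blending rule are illustrative, not the thesis's algorithm:

        import cv2
        import numpy as np

        def space_variant_blur(image, blur_map, sigmas=(0.5, 1.0, 2.0, 4.0, 8.0)):
            # image: H x W float32 (grayscale for simplicity);
            # blur_map: H x W array of per-pixel target Gaussian sigmas.
            image = image.astype(np.float32)
            bank = np.array((0.0,) + tuple(sigmas), dtype=np.float32)
            # Level 0 is the unblurred image (sigma ~ 0); one full-image
            # convolution per bank level, however smooth the blur map.
            stack = np.stack([image] + [cv2.GaussianBlur(image, (0, 0), s)
                                        for s in sigmas])
            # Blend each pixel between the two bank levels bracketing its sigma.
            idx = np.clip(np.searchsorted(bank, blur_map), 1, len(bank) - 1)
            lo, hi = bank[idx - 1], bank[idx]
            w = np.clip((blur_map - lo) / np.maximum(hi - lo, 1e-6), 0.0, 1.0)
            rows, cols = np.mgrid[0:image.shape[0], 0:image.shape[1]]
            return (1 - w) * stack[idx - 1, rows, cols] + w * stack[idx, rows, cols]

    Only len(sigmas) + 1 full-image convolutions are needed regardless of how smoothly the blur map varies, which is where the cost saving over a separate filter at each image location comes from.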

    Content-prioritised video coding for British Sign Language communication.

    Get PDF
    Video communication of British Sign Language (BSL) is important for remote interpersonal communication and for the equal provision of services for deaf people. However, the use of video telephony and video conferencing applications for BSL communication is limited by inadequate video quality. BSL is a highly structured, linguistically complete, natural language system that expresses vocabulary and grammar visually and spatially using a complex combination of facial expressions (such as eyebrow movements, eye blinks and mouth/lip shapes), hand gestures, body movements and finger-spelling that changes in space and time. Accurate natural BSL communication places specific demands on visual media applications, which must compress video image data for efficient transmission. Current video compression schemes apply methods to reduce statistical redundancy and perceptual irrelevance in video image data based on a general model of Human Visual System (HVS) sensitivities. This thesis presents novel video image coding methods developed to achieve the conflicting requirements of high image quality and efficient coding. Novel methods of prioritising visually important video image content for optimised video coding are developed to exploit the HVS spatial and temporal response mechanisms of BSL users (determined by eye movement tracking) and the characteristics of BSL video image content. The methods implement an accurate model of HVS foveation, applied in the spatial and temporal domains, at the pre-processing stage of a current standard-based system (H.264). Comparison of the performance of the developed and standard coding systems, using methods of video quality evaluation developed for this thesis, demonstrates improved perceived quality at low bit rates. BSL users, broadcasters and service providers benefit from the perception of high-quality video over a range of available transmission bandwidths. The research community benefits from a new approach to video coding optimisation and a better understanding of the communication needs of deaf people.
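    As a rough illustration of foveation modelling at the pre-processing stage, the sketch below derives a per-pixel blur-sigma map from the widely used Geisler-Perry contrast-threshold model of acuity fall-off; the parameter defaults and the frequency-to-sigma mapping are assumptions, not the calibrated HVS model developed in the thesis:

        import numpy as np

        def foveation_sigma_map(h, w, gaze_xy, view_dist_px,
                                alpha=0.106, e2=2.3, ct0=1.0 / 64.0):
            # Pixel eccentricities (degrees) relative to the gaze point.
            ys, xs = np.mgrid[0:h, 0:w].astype(np.float32)
            dist = np.hypot(xs - gaze_xy[0], ys - gaze_xy[1])
            ecc = np.degrees(np.arctan2(dist, view_dist_px))
            # Geisler-Perry critical frequency (cycles/degree): the highest
            # spatial frequency still visible at eccentricity ecc.
            fc = e2 * np.log(1.0 / ct0) / (alpha * (ecc + e2))
            # Crude, illustrative mapping from cutoff frequency to a Gaussian
            # blur sigma in pixels (not a calibrated psychophysical conversion).
            px_per_deg = view_dist_px * np.tan(np.radians(1.0))
            return px_per_deg / (2.0 * np.pi * np.maximum(fc, 1e-3))

    Such a map could then drive a space-variant pre-filter (for example, the filter-bank sketch earlier in this listing) before frames are handed to a standard H.264 encoder.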

    Perception-driven approaches to real-time remote immersive visualization

    Get PDF
    In remote immersive visualization systems, real-time 3D perception through RGB-D cameras, combined with modern Virtual Reality (VR) interfaces, enhances the user's sense of presence in a remote scene through 3D reconstruction, particularly in situations where there is a need to visualize, explore and perform tasks in environments that are inaccessible, hazardous or distant. However, a remote visualization system requires that the entire pipeline, from 3D data acquisition to VR rendering, satisfies demands on speed, throughput and visual realism. Especially when using point clouds, there is a fundamental quality difference between the acquired data of the physical world and the displayed data, because network latency and throughput limitations negatively impact the sense of presence and provoke cybersickness. This thesis presents state-of-the-art research that addresses these problems by taking the human visual system as inspiration, from sensor data acquisition to VR rendering. The human visual system does not have uniform vision across the field of view: it has the sharpest visual acuity at the center of the field of view, and acuity falls off towards the periphery. Peripheral vision provides lower resolution to guide the eye movements so that the central vision visits all the interesting, crucial parts. As a first contribution, the thesis develops remote visualization strategies that utilize the acuity fall-off to facilitate the processing, transmission, buffering and rendering in VR of 3D reconstructed scenes while simultaneously reducing throughput requirements and latency. As a second contribution, the thesis looks into attentional mechanisms to select and draw user engagement to specific information from the dynamic spatio-temporal environment. It proposes a strategy to analyze the remote scene with respect to its 3D structure and layout, and the spatial, functional and semantic relationships between objects in the scene. The strategy primarily focuses on analyzing the scene with models that human visual perception uses; it allocates a greater proportion of computational resources to objects of interest and creates a more realistic visualization. As a supplementary contribution, a new volumetric point-cloud density-based Peak Signal-to-Noise Ratio (PSNR) metric is proposed to evaluate the introduced techniques. An in-depth evaluation of the presented systems, a comparative examination of the proposed point-cloud metric, user studies and experiments demonstrate that the methods introduced in this thesis are visually superior while significantly reducing latency and throughput.
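    A hedged sketch of acuity-driven data reduction for point clouds: points near the gaze direction are kept densely while peripheral points are thinned, mirroring the fall-off described above. All names and rates are illustrative, not the thesis's pipeline:

        import numpy as np

        def foveated_subsample(points, gaze_dir, keep_center=1.0,
                               keep_far=0.05, falloff_deg=20.0, rng=None):
            # points: N x 3 array in the viewer's frame (viewer at origin);
            # gaze_dir: unit gaze vector. The keep probability decays with
            # angular distance from gaze, mimicking the acuity fall-off.
            rng = rng or np.random.default_rng()
            d = points / np.linalg.norm(points, axis=1, keepdims=True)
            ang = np.degrees(np.arccos(np.clip(d @ gaze_dir, -1.0, 1.0)))
            p = keep_far + (keep_center - keep_far) * np.exp(-(ang / falloff_deg) ** 2)
            return points[rng.random(len(points)) < p]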

    Methods and Apparatus for Autonomous Robotic Control

    Get PDF
    Sensory processing of visual, auditory, and other sensor information (e.g., visual imagery, LIDAR, RADAR) is conventionally based on "stovepiped," or isolated, processing, with little interaction between modules. Biological systems, on the other hand, fuse multi-sensory information to identify nearby objects of interest more quickly, more efficiently, and with higher signal-to-noise ratios. Similarly, examples of the OpenSense technology disclosed herein use neurally inspired processing to identify and locate objects in a robot's environment. This enables the robot to navigate its environment more quickly and with lower computational and power requirements.

    Visual Attention in Dynamic Environments and its Application to Playing Online Games

    Get PDF
    In this thesis we present a prototype of Cognitive Programs (CPs): an executive controller built on top of the Selective Tuning (ST) model of attention. CPs enable top-down control of the visual system and interaction between low-level vision and higher-level task demands. We implement a subset of CPs for playing online video games in real time using only visual input. Two commercial closed-source games, Canabalt and Robot Unicorn Attack, are used for evaluation. Their simple gameplay and minimal controls put the emphasis on reaction speed and attention over planning. Our implementation of Cognitive Programs plays both games at human expert level, which experimentally demonstrates the validity of the concept. Additionally, we resolved multiple theoretical and engineering issues, e.g. extending CPs to dynamic environments, finding suitable data structures for describing the task and information flow within the network, and determining the correct timing for each process.

    Long Range Automated Persistent Surveillance

    Get PDF
    This dissertation addresses long range automated persistent surveillance with focus on three topics: sensor planning, size preserving tracking, and high magnification imaging. In sensor planning, sufficient overlapped field of view should be reserved so that camera handoff can be executed successfully before the object of interest becomes unidentifiable or untraceable. We design a sensor planning algorithm that not only maximizes coverage but also ensures uniform and sufficient overlap of the cameras' fields of view for an optimal handoff success rate. This algorithm works for environments with multiple dynamic targets using different types of cameras. Significantly improved handoff success rates are illustrated via experiments using floor plans of various scales. Size preserving tracking automatically adjusts the camera's zoom for a consistent view of the object of interest. Target scale estimation is carried out based on the paraperspective projection model, which compensates for the center offset and considers system latency and tracking errors. A computationally efficient foreground segmentation strategy, 3D affine shapes, is proposed. The 3D affine shapes feature direct and real-time implementation and improved flexibility in accommodating the target's 3D motion, including off-plane rotations. The effectiveness of the scale estimation and foreground segmentation algorithms is validated via both offline and real-time tracking of pedestrians at various resolution levels. Face image quality assessment and enhancement compensate for the performance degradations in face recognition rates caused by high system magnifications and long observation distances. A class of adaptive sharpness measures is proposed to evaluate and predict this degradation. A wavelet-based enhancement algorithm with automated frame selection is developed and proves effective, considerably elevating the face recognition rate for severely blurred long-range face images.
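    As a toy stand-in for the adaptive sharpness measures and automated frame selection mentioned above, the sketch below scores frames by gradient energy and keeps the sharpest ones; the dissertation's actual measures and wavelet-based enhancement are more elaborate, and all names here are illustrative:

        import numpy as np

        def sharpness(gray):
            # Gradient-energy sharpness score for a grayscale face frame.
            gy, gx = np.gradient(gray.astype(np.float32))
            return float(np.mean(gx * gx + gy * gy))

        def select_frames(frames, top_k=5):
            # Keep the top-k sharpest frames for enhancement/recognition.
            return sorted(frames, key=sharpness, reverse=True)[:top_k]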

    Attentional Sampling for Efficient Image Processing (ํšจ์œจ์  ์˜์ƒ์ฒ˜๋ฆฌ๋ฅผ ์œ„ํ•œ ์ฃผ์˜์ง‘์ค‘ ์ƒ˜ํ”Œ๋ง)

    Get PDF
    Doctoral dissertation (PhD), Seoul National University Graduate School, Department of Electrical and Computer Engineering, February 2013; advisor: Jin Young Choi. In many practical computer vision scenarios it is possible to use information gleaned from previous observations through the sampling process. In order to achieve good performance with little computation, it is desirable that the samples cover the domain of the target distribution with as few samples as possible, via a concept of active or adaptive sampling. Based on the active sampling strategy, sampling can be concentrated on attentional portions, which improves not only sampling efficiency but also the performance of the algorithms. In this thesis, we define three different attentional sampling concepts: structured attentional sampling, empirical attentional sampling and selective attentional sampling. The proposed attentional sampling methods are successfully applied to computer vision problems, achieving dramatic improvements in performance as well as computational load. The structured attentional sampling scheme uses an inherent structure to sample an interesting region densely instead of sampling equally over the entire region. This scheme is applied to a tracking-failure detection method that imitates the human visual system: we adopt a sampling structure based on the log-polar transformation, simulating the retinal structure. Since the log-polar structure is invariant to rotational changes and intensifies translational changes, it helps to reduce false alarms arising from rotational pose variations and to increase true alarms in abrupt translational changes.
    In addition, the foveal predominance property of the log-polar structure helps to detect the moment of tracking failure by amplifying the resolution around the focus (the tracking-box center) and blurring the peripheries. Each ganglion cell corresponds to a pixel of the log-polar image, and its adaptation is modeled as a Gaussian mixture model. The validity of the structured attentional sampling method is illustrated through various experiments. The empirical attentional sampling scheme uses previously obtained empirical knowledge when sampling at the current time. The empirical knowledge is modeled by a probability distribution through an empirical learning process. This scheme is applied to mask generation to speed up conventional background subtraction algorithms for moving-object detection. The proposed sampling strategy is designed to focus on attentional regions such as foreground regions. The attentional region is estimated from the detection results in the previous frame in a recursive probabilistic way. We generate a foreground probability map from the temporal, spatial and frequency properties of the foreground. Based on this map, randomly scattered sampling, spatially expanding importance sampling and surprise pixel sampling are performed sequentially to make the attentional sampling mask. The efficiency of the proposed empirical attentional sampling method is shown through various experiments: the proposed masking method speeds up pixel-wise background subtraction methods approximately 6.6 times without deteriorating detection performance, and real-time detection on Full HD video (1920x1080) is achieved by various conventional background subtraction algorithms together with the proposed sampling scheme. The selective attentional sampling scheme does not use the whole data but selects only data important enough to achieve a given classification objective. This scheme is applied to the recognition of pop dances. Pop dances are action streams consisting of diverse actions which cannot be simply annotated. For such "unannotatable" action streams, conventional methods cannot be applied directly due to their complexity and length. In order to describe an unannotatable action stream effectively, the proposed method employs a novel mid-level "feature flow" with low-dimensional embedding. For the purpose of recognition, "attentional motion spots" holding important information about the sequence are automatically selected. The feature values and temporal locations of each attentional motion spot are modeled with Gaussian mixtures as "Action Charts". An Action Chart describes the characteristics of an action stream in the spatio-temporal domain, much as a musical score records the onset, kind and duration of the important actions. Using the abstract information in the Action Charts, the proposed method efficiently recognizes pop dance sequences. In order to demonstrate the validity of the proposed method, we compare it against state-of-the-art methods on a newly built SNU Pop-Dance dataset containing long action streams composed of diverse actions.
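    A minimal sketch of the structured (log-polar) sampling step, using OpenCV's log-polar warp; the patch size and radius are illustrative assumptions, and the downstream per-pixel Gaussian-mixture adaptation is omitted:

        import cv2
        import numpy as np

        def log_polar_patch(frame, center, radius, out_size=(64, 64)):
            # Resample the region around the tracked-target center into a
            # log-polar patch: dense (foveal) sampling near the center,
            # coarse sampling toward the periphery, as in the retina.
            return cv2.warpPolar(frame, out_size, center, radius,
                                 cv2.INTER_LINEAR | cv2.WARP_POLAR_LOG)

        # In this representation, an in-plane rotation of the target mostly
        # shifts the patch along the angular axis, while an abrupt translation
        # changes it drastically -- the asymmetry the failure detector exploits.
        frame = np.zeros((480, 640, 3), np.uint8)
        patch = log_polar_patch(frame, center=(320.0, 240.0), radius=80.0)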

    Beyond Multi-Target Tracking: Statistical Pattern Analysis of People and Groups

    Get PDF
    Every day millions and millions of surveillance cameras monitor the world, recording and collecting huge amounts of data. The collected data can be extremely useful: from behavior analysis to prevent unpleasant events, to the analysis of traffic. However, these valuable data are seldom used, because of the amount of information that a human operator would have to attend to and examine manually; it would be like looking for a needle in a haystack. Automatic analysis of the data is becoming mandatory for extracting summarized high-level information (e.g., John, Sam and Anne are walking together in a group at the playground near the station) from the available redundant low-level data (e.g., an image sequence). The main goal of this thesis is to propose solutions and automatic algorithms that perform high-level analysis of a camera-monitored environment, so that the data are summarized in a high-level representation for better understanding. In particular, this work is focused on the analysis of moving people and their collective behaviors. The title of the thesis, beyond multi-target tracking, mirrors the purpose of the work: we propose methods that have target tracking as their common denominator, and go beyond the standard techniques in order to provide a high-level description of the data. First, we investigate the target tracking problem, as it is the basis of all the work that follows. Target tracking estimates the position of each target in the image and its trajectory over time. We analyze the problem from two complementary perspectives: 1) the engineering point of view, where we deal with the problem in order to obtain the best results in terms of accuracy and performance; 2) the neuroscience point of view, where we propose an attentional model for tracking and recognition of objects and people, motivated by theories of the human perceptual system. Second, target tracking is extended to the camera-network case, where the goal is to keep a unique identifier for each person in the whole network, i.e., to perform person re-identification: recognizing individuals across different non-overlapping camera views, or within the same camera, considering a large set of candidates. In this context, we propose a pipeline and appearance-based descriptors that enable us to define the problem properly and to reach state-of-the-art results. Finally, the highest level of description investigated in this thesis is the analysis (discovery and tracking) of social interactions between people. In particular, we focus on finding small groups of people. We introduce methods that embed notions of social psychology into computer vision algorithms, and then extend the detection of social interactions over time, proposing novel probabilistic models that deal with (joint) individual-group tracking.
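    As a hedged illustration of appearance-based re-identification, the sketch below builds a toy HSV-histogram descriptor and ranks gallery candidates by histogram distance; the thesis proposes richer descriptors and a full pipeline, and every name here is illustrative:

        import cv2
        import numpy as np

        def appearance_descriptor(bgr_crop, bins=(8, 8, 8)):
            # Normalized HSV color histogram of a pedestrian crop.
            hsv = cv2.cvtColor(bgr_crop, cv2.COLOR_BGR2HSV)
            hist = cv2.calcHist([hsv], [0, 1, 2], None, list(bins),
                                [0, 180, 0, 256, 0, 256])
            return cv2.normalize(hist, None).flatten()

        def rank_gallery(query_desc, gallery_descs):
            # Smaller Bhattacharyya distance = more similar appearance.
            d = [cv2.compareHist(query_desc, g, cv2.HISTCMP_BHATTACHARYYA)
                 for g in gallery_descs]
            return np.argsort(d)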

    Enhancing Perception and Immersion in Pre-Captured Environments through Learning-Based Eye Height Adaptation

    Full text link
    Pre-captured immersive environments using omnidirectional cameras enable a wide range of virtual reality applications. Previous research has shown that manipulating the eye height in egocentric virtual environments can significantly affect distance perception and immersion. However, the influence of eye height in pre-captured real environments has received less attention, owing to the difficulty of altering the perspective after the capture process is finished. To explore this influence, we first present a pilot study that captures real environments at multiple eye heights and asks participants to judge egocentric distances and immersion. If a significant influence is confirmed, an effective image-based approach to adapt pre-captured real-world environments to the user's eye height would be desirable. Motivated by the study, we propose a learning-based approach for synthesizing novel views for omnidirectional images with altered eye heights. This approach employs a multitask architecture that learns depth and semantic segmentation in two formats, and generates high-quality depth and semantic segmentation to facilitate the inpainting stage. With the improved omnidirectional-aware layered depth image, our approach synthesizes natural and realistic visuals for eye height adaptation. Quantitative and qualitative evaluation shows favorable results against state-of-the-art methods, and an extensive user study verifies improved perception and immersion for pre-captured real-world environments. Comment: 10 pages, 13 figures, 3 tables, submitted to ISMAR 202
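    A geometry-only sketch of what eye-height adaptation asks of an equirectangular panorama: back-project each pixel with its depth, raise the camera, and re-project. The naive forward splat below leaves exactly the disocclusion holes that the paper's learned layered-depth-image inpainting is designed to fill; conventions and names are assumptions, not the paper's implementation:

        import numpy as np

        def shift_eye_height(equirect_rgb, depth, dh):
            # Raise the camera by dh metres: scene points move down by dh in
            # the camera frame. Nearest-neighbour forward splat, no z-buffer.
            H, W, _ = equirect_rgb.shape
            v, u = np.mgrid[0:H, 0:W]
            theta = (u / W) * 2.0 * np.pi - np.pi        # longitude
            phi = (0.5 - v / H) * np.pi                  # latitude (y up)
            x = depth * np.cos(phi) * np.sin(theta)
            y = depth * np.sin(phi) - dh
            z = depth * np.cos(phi) * np.cos(theta)
            r = np.sqrt(x * x + y * y + z * z)
            phi2 = np.arcsin(np.clip(y / np.maximum(r, 1e-6), -1.0, 1.0))
            theta2 = np.arctan2(x, z)
            u2 = np.clip(((theta2 + np.pi) / (2.0 * np.pi) * W).astype(int), 0, W - 1)
            v2 = np.clip(((0.5 - phi2 / np.pi) * H).astype(int), 0, H - 1)
            out = np.zeros_like(equirect_rgb)
            out[v2, u2] = equirect_rgb[v, u]             # holes remain where disoccluded
            return out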