Scalable virtual viewpoint image synthesis for multiple camera environments
One of the main aims of emerging audio-visual (AV) applications is to provide interactive navigation within a captured event or scene. This paper presents a view synthesis algorithm that provides a scalable and flexible approach to virtual viewpoint synthesis in multiple camera environments. The multi-view synthesis (MVS) process consists of four different phases that are described in detail: surface identification, surface selection, surface boundary blending and surface reconstruction. MVS identifies and selects only the best quality surface areas from the set of available reference images, thereby reducing perceptual errors in virtual view reconstruction. The approach is camera setup independent and scalable, as virtual views can be created given 1 to N of the available video inputs. Thus, MVS provides interactive AV applications with a means to handle scenarios where camera inputs increase or decrease over time.
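To make the surface selection and boundary blending phases concrete, the following is a minimal sketch of per-pixel selection from multiple warped reference images followed by seam feathering; the function name, quality scores and feathering scheme are illustrative assumptions, not the authors' implementation.

# Hypothetical sketch of the surface selection, boundary blending and
# reconstruction ideas described above; illustrative only.
import numpy as np

def select_and_blend(candidates, qualities, feather=2):
    """candidates: (N, H, W, 3) warped reference images for the virtual view.
    qualities:  (N, H, W) per-pixel quality scores (higher is better)."""
    best = np.argmax(qualities, axis=0)                  # surface selection
    n = candidates.shape[0]
    # One-hot selection masks, softened along seams to approximate
    # surface boundary blending.
    weights = np.stack([(best == i).astype(float) for i in range(n)])
    kernel = np.ones(2 * feather + 1) / (2 * feather + 1)
    for i in range(n):
        weights[i] = np.apply_along_axis(
            lambda row: np.convolve(row, kernel, mode="same"), 1, weights[i])
    weights /= weights.sum(axis=0, keepdims=True) + 1e-9
    return (weights[..., None] * candidates).sum(axis=0)  # reconstruction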
Multiple image view synthesis for free viewpoint video applications
Interactive audio-visual (AV) applications such as free viewpoint video (FVV) aim to enable unrestricted spatio-temporal navigation within multiple camera environments. Current virtual viewpoint synthesis solutions for FVV are either purely image-based, implying large information redundancy, or involve reconstructing complex 3D models of the scene. In this paper we present a new multiple image view synthesis algorithm that only requires camera parameters and disparity maps. The multi-view synthesis (MVS) approach can be used in any multi-camera environment and is scalable, as virtual views can be created given 1 to N of the available video inputs, providing a means to gracefully handle scenarios where camera inputs decrease or increase over time. The algorithm identifies and selects only the best quality surface areas from available reference images, thereby reducing perceptual errors in virtual view reconstruction. Experimental results are presented and verified using both objective (PSNR) and subjective comparisons.
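As an illustration of synthesising from camera parameters and disparity maps alone, the sketch below warps a single reference pixel into a hypothetical virtual camera; the disparity-to-depth model and the example camera parameters are assumptions for illustration, not values from the paper.

# Hypothetical sketch: project a reference pixel into a virtual view using
# only camera parameters and its disparity value (illustrative only).
import numpy as np

def warp_pixel(u, v, disparity, K_ref, K_virt, R, t, focal, baseline):
    """Project reference pixel (u, v) with the given disparity into the
    virtual camera. R, t give the virtual camera pose relative to the
    reference camera; focal and baseline relate disparity to metric depth."""
    depth = focal * baseline / max(disparity, 1e-6)       # disparity -> depth
    ray = np.linalg.inv(K_ref) @ np.array([u, v, 1.0])    # back-project
    point = depth * ray                                   # 3D point, ref frame
    p_virt = K_virt @ (R @ point + t)                     # re-project
    return p_virt[:2] / p_virt[2]                         # virtual image coords

# Example with identity rotation and a small horizontal camera shift.
K = np.array([[800.0, 0, 320], [0, 800.0, 240], [0, 0, 1]])
print(warp_pixel(350, 260, 16.0, K, K, np.eye(3), np.array([0.1, 0, 0]),
                 focal=800.0, baseline=0.1))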
3D image analysis for pedestrian detection
A method for solving the dense disparity stereo correspondence problem is presented in this paper. This technique is designed specifically for pedestrian detection-type applications. A new Ground Control Points (GCPs) scheme is introduced, using ground-plane homography information to determine regions in which good GCPs are likely to occur. The method also introduces a dynamic disparity limit constraint to further improve GCP selection and dense disparity generation. The technique is applied to a real-world pedestrian detection scenario with a background modeling system based on disparity and edges.
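The following is a rough sketch of how a ground-plane homography could bound the disparity check when screening GCP candidates; the acceptance rule, tolerance and function names are assumptions for illustration and may differ from the paper's scheme.

# Hypothetical sketch: use the left-to-right ground-plane homography to
# predict the disparity of a ground pixel and constrain GCP candidates.
import numpy as np

def ground_disparity(u, v, H):
    """Disparity that pixel (u, v) would have if it lay on the ground plane."""
    p = H @ np.array([u, v, 1.0])
    u_right = p[0] / p[2]
    return u - u_right                   # rectified cameras: horizontal shift

def is_gcp_candidate(u, v, matched_disparity, H, max_disparity, tol=1.0):
    """Accept a match as a GCP candidate if it respects the dynamic disparity
    limit and is at least as close as the ground plane at that pixel."""
    d_ground = ground_disparity(u, v, H)
    return (0.0 <= matched_disparity <= max_disparity
            and matched_disparity >= d_ground - tol)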
Multispectral object segmentation and retrieval in surveillance video
This paper describes a system for object segmentation and feature extraction for surveillance video. Segmentation is performed by a dynamic vision system that fuses information from thermal infrared video with standard CCTV video in order to detect and track objects. Separate background modelling in each modality and dynamic mutual-information-based thresholding are used to provide initial foreground candidates for tracking. The belief in the validity of these candidates is ascertained using knowledge of foreground pixels and temporal linking of candidates. The transferable belief model is used to combine these sources of information and segment objects. Extracted objects are subsequently tracked using adaptive thermo-visual appearance models. In order to facilitate search and classification of objects in large archives, retrieval features from both modalities are extracted for tracked objects. Overall system performance is demonstrated in a simple retrieval scenario.
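A minimal sketch of the per-modality background modelling stage is given below; it uses a simple running-average model and fixed thresholds for brevity, whereas the system described above selects thresholds dynamically via mutual information, so all details are illustrative only.

# Hypothetical sketch of separate background modelling per modality and a
# simple combination of the resulting foreground candidates.
import numpy as np

class RunningBackground:
    """Running-average background model for one modality (visible or thermal)."""
    def __init__(self, first_frame, alpha=0.02):
        self.model = first_frame.astype(float)
        self.alpha = alpha

    def foreground(self, frame, threshold):
        diff = np.abs(frame.astype(float) - self.model)
        mask = diff > threshold
        # Update the background only where no foreground was detected.
        self.model[~mask] += self.alpha * (frame.astype(float) - self.model)[~mask]
        return mask

def fuse_candidates(visible_mask, thermal_mask):
    """Initial foreground candidates: pixels flagged in either modality,
    with stronger support where both modalities agree."""
    return visible_mask | thermal_mask, visible_mask & thermal_mask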
Comparison of fusion methods for thermo-visual surveillance tracking
In this paper, we evaluate the appearance tracking performance of multiple fusion schemes that combine information from standard CCTV and thermal infrared spectrum video for the tracking of surveillance objects, such as people, faces, bicycles and vehicles. We show results on numerous real-world multimodal surveillance sequences, tracking challenging objects whose appearance changes rapidly. Based on these results, we determine the most promising fusion scheme.
Detection thresholding using mutual information
In this paper, we introduce a novel non-parametric thresholding method that we term Mutual-Information Thresholding. In our approach, we choose the two detection thresholds for two input signals such that the mutual information between the thresholded signals is maximised. Two efficient algorithms implementing our idea are presented: one using dynamic programming to fully explore the quantised search space, and the other using the Simplex algorithm to perform gradient ascent and significantly speed up the search under the assumption of surface convexity. We demonstrate the effectiveness of our approach in foreground detection (using multi-modal data) and as a component in a person detection system.
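A small sketch of the exhaustive variant of this idea follows: quantise the two threshold ranges and pick the pair that maximises the mutual information between the thresholded (binary) signals. The dynamic-programming and Simplex accelerations described above are omitted, and the grid resolution is an assumption.

# Hypothetical sketch of mutual-information thresholding via exhaustive
# search over a quantised grid of threshold pairs.
import numpy as np

def binary_mutual_information(a, b):
    """Mutual information (in nats) between two boolean arrays."""
    a, b = a.ravel(), b.ravel()
    mi = 0.0
    for va in (False, True):
        for vb in (False, True):
            p_ab = np.mean((a == va) & (b == vb))
            p_a, p_b = np.mean(a == va), np.mean(b == vb)
            if p_ab > 0:
                mi += p_ab * np.log(p_ab / (p_a * p_b))
    return mi

def mi_thresholds(signal1, signal2, levels=32):
    """Exhaustively search a quantised grid of threshold pairs."""
    t1s = np.linspace(signal1.min(), signal1.max(), levels)
    t2s = np.linspace(signal2.min(), signal2.max(), levels)
    best = (0.0, t1s[0], t2s[0])
    for t1 in t1s:
        for t2 in t2s:
            mi = binary_mutual_information(signal1 > t1, signal2 > t2)
            if mi > best[0]:
                best = (mi, t1, t2)
    return best  # (max MI, threshold for signal1, threshold for signal2)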
Relating visual and semantic image descriptors
This paper addresses the automatic analysis of visual content and extraction of metadata beyond pure visual descriptors. Two approaches are described: Automatic Image Annotation (AIA) and Confidence Clustering (CC). AIA attempts to automatically classify images based on two binary classifiers and is designed for the consumer electronics domain. In contrast, the CC approach does not attempt to assign a unique label to images but rather to organise the database based on concepts.
Detecting shadows and low-lying objects in indoor and outdoor scenes using homographies
Many computer vision applications apply background suppression techniques for the detection and segmentation of moving objects in a scene. While these algorithms tend to work well in controlled conditions, they often fail when applied to unconstrained real-world environments. This paper describes a system that detects and removes erroneously segmented foreground regions that are close to a ground plane. These regions include shadows, changing background objects and other low-lying objects such as leaves and rubbish. The system uses a set-up of two or more cameras and requires no 3D reconstruction or depth analysis of the regions. Therefore, a strong camera calibration of the set-up is not necessary. A geometric constraint called a homography is exploited to determine whether foreground points are on or above the ground plane. The system takes advantage of the fact that image regions off the homography plane will not correspond after a homography transformation. Experimental results using real-world scenes from a pedestrian tracking application illustrate the effectiveness of the proposed approach.
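The homography constraint can be sketched as follows: a foreground point on the ground plane transfers correctly between the two views under the ground-plane homography, while a point above the plane does not. The pixel tolerance and function names below are illustrative assumptions, not the paper's implementation.

# Hypothetical sketch of the ground-plane homography correspondence test.
import numpy as np

def transfer(point, H):
    """Apply homography H to an image point (u, v)."""
    p = H @ np.array([point[0], point[1], 1.0])
    return p[:2] / p[2]

def on_ground_plane(point_cam1, point_cam2, H, tol_pixels=3.0):
    """True if the foreground point lies (approximately) on the ground plane:
    its homography transfer from camera 1 agrees with its observed location
    in camera 2."""
    error = np.linalg.norm(transfer(point_cam1, H) - np.asarray(point_cam2))
    return error < tol_pixels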
Bi & tri dimensional scene description and composition in the MPEG-4 standard
MPEG-4 is a new ISO/IEC standard being developed by MPEG (Moving Picture Experts Group). The standard is to be released in November 1998, and version 1 will become an International Standard in January 1999. The MPEG-4 standard addresses the new demands that arise in a world in which more and more audio-visual material is exchanged in digital form. MPEG-4 addresses the coding of objects of various types: not only traditional video and audio frames, but also natural video and audio objects as well as textures, text, 2- and 3-dimensional graphic primitives, and synthetic music and sound effects.
When using MPEG-4 to reconstruct an audio-visual scene at a terminal, it is hence no longer sufficient to encode the raw audio-visual data and transmit it, as MPEG-2 does in order to synchronize video and audio. In MPEG-4, all objects are multiplexed together at the encoder and transported to the terminal. Once de-multiplexed, these objects are composed at the terminal to construct and present to the end user a meaningful audio-visual scene. The placement of these elementary audio-visual objects in space and time is described in the scene description, while the action of putting these objects together in the same representation space is the composition of audio-visual objects.
My research was concerned with the scene description and composition of the audio-visual objects that are defined in an audio-visual scene. Scene descriptions are coded independently from the streams related to primitive audio-visual objects. The set of parameters belonging to the scene description is differentiated from the parameters that are used to improve the coding efficiency of an object. While the independent coding of different objects may achieve a higher compression rate, it also brings the ability to manipulate content at the terminal. This allows the modification of the scene description parameters without having to decode the primitive audio-visual objects themselves. This approach allows the development of a syntax that describes the spatio-temporal relationships of audio-visual scene objects. The behaviours of objects and their response to user inputs can thus also be represented in the scene description, allowing richer audio-visual content to be delivered as an MPEG-4 stream.
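Purely as a conceptual illustration of scene description and composition (not MPEG-4 BIFS syntax), the sketch below places independently coded objects in space and time and selects which of them the terminal composes at a given instant; all names and fields are assumed for the example.

# Conceptual illustration only: a toy "scene description" that places
# independently coded AV objects in space and time.
from dataclasses import dataclass

@dataclass
class SceneNode:
    object_id: str        # reference to an independently coded AV object
    x: float              # spatial placement in the composition space
    y: float
    start: float          # temporal placement (seconds)
    end: float

scene = [
    SceneNode("background_video", x=0, y=0, start=0.0, end=60.0),
    SceneNode("logo_texture",     x=10, y=10, start=5.0, end=60.0),
    SceneNode("jingle_audio",     x=0, y=0, start=5.0, end=8.0),
]

def compose(scene, t):
    """Objects to present at time t, with their placement: the terminal
    decodes only these and composites them per the scene description."""
    return [(n.object_id, (n.x, n.y)) for n in scene if n.start <= t <= n.end]

print(compose(scene, 6.0))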
The aceToolbox: low-level audiovisual feature extraction for retrieval and classification
In this paper we present an overview of a software platform that has been developed within the aceMedia project, termed the aceToolbox, that provides global and local low-level feature extraction from audio-visual content. The toolbox is based on the MPEG-7 eXperimental Model (XM), with extensions to provide descriptor extraction from arbitrarily shaped image segments, thereby supporting local descriptors reflecting real image content. We describe the architecture of the toolbox as well as providing an overview of the descriptors supported to date. We also briefly describe the segmentation algorithm provided. We then demonstrate the usefulness of the toolbox in the context of two different content processing scenarios: similarity-based retrieval in large collections and scene-level classification of still images.
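As an example of a local descriptor computed over an arbitrarily shaped segment, the sketch below restricts a colour histogram to a binary segment mask; it is an illustrative stand-in, not one of the MPEG-7 XM descriptors used by the aceToolbox.

# Hypothetical sketch: a local colour histogram restricted to an arbitrarily
# shaped segment mask (illustrative of a local low-level descriptor).
import numpy as np

def segment_colour_histogram(image, mask, bins=8):
    """image: (H, W, 3) uint8 array; mask: (H, W) boolean segment mask.
    Returns a normalised joint RGB histogram over the masked pixels only."""
    pixels = image[mask]                            # (N, 3) pixels in segment
    quantised = (pixels // (256 // bins)).astype(int)
    idx = quantised[:, 0] * bins * bins + quantised[:, 1] * bins + quantised[:, 2]
    hist = np.bincount(idx, minlength=bins ** 3).astype(float)
    return hist / max(hist.sum(), 1.0)

# Example: descriptor for a rectangular segment of a random image.
img = np.random.randint(0, 256, (120, 160, 3), dtype=np.uint8)
seg = np.zeros((120, 160), dtype=bool)
seg[30:80, 40:100] = True
print(segment_colour_histogram(img, seg).shape)     # (512,)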