21,147 research outputs found

    The aceToolbox: low-level audiovisual feature extraction for retrieval and classification

    Get PDF
    In this paper we present an overview of a software platform that has been developed within the aceMedia project, termed the aceToolbox, that provides global and local lowlevel feature extraction from audio-visual content. The toolbox is based on the MPEG-7 eXperimental Model (XM), with extensions to provide descriptor extraction from arbitrarily shaped image segments, thereby supporting local descriptors reflecting real image content. We describe the architecture of the toolbox as well as providing an overview of the descriptors supported to date. We also briefly describe the segmentation algorithm provided. We then demonstrate the usefulness of the toolbox in the context of two different content processing scenarios: similarity-based retrieval in large collections and scene-level classification of still images

    Demo: real-time indoors people tracking in scalable camera networks

    Get PDF
    In this demo we present a people tracker in indoor environments. The tracker executes in a network of smart cameras with overlapping views. Special attention is given to real-time processing by distribution of tasks between the cameras and the fusion server. Each camera performs tasks of processing the images and tracking of people in the image plane. Instead of camera images, only metadata (a bounding box per person) are sent from each camera to the fusion server. The metadata are used on the server side to estimate the position of each person in real-world coordinates. Although the tracker is designed to suit any indoor environment, in this demo the tracker's performance is presented in a meeting scenario, where occlusions of people by other people and/or furniture are significant and occur frequently. Multiple cameras insure views from multiple angles, which keeps tracking accurate even in cases of severe occlusions in some of the views

    Asynchronous displays for multi-UV search tasks

    Get PDF
    Synchronous video has long been the preferred mode for controlling remote robots with other modes such as asynchronous control only used when unavoidable as in the case of interplanetary robotics. We identify two basic problems for controlling multiple robots using synchronous displays: operator overload and information fusion. Synchronous displays from multiple robots can easily overwhelm an operator who must search video for targets. If targets are plentiful, the operator will likely miss targets that enter and leave unattended views while dealing with others that were noticed. The related fusion problem arises because robots' multiple fields of view may overlap forcing the operator to reconcile different views from different perspectives and form an awareness of the environment by "piecing them together". We have conducted a series of experiments investigating the suitability of asynchronous displays for multi-UV search. Our first experiments involved static panoramas in which operators selected locations at which robots halted and panned their camera to capture a record of what could be seen from that location. A subsequent experiment investigated the hypothesis that the relative performance of the panoramic display would improve as the number of robots was increased causing greater overload and fusion problems. In a subsequent Image Queue system we used automated path planning and also automated the selection of imagery for presentation by choosing a greedy selection of non-overlapping views. A fourth set of experiments used the SUAVE display, an asynchronous variant of the picture-in-picture technique for video from multiple UAVs. The panoramic displays which addressed only the overload problem led to performance similar to synchronous video while the Image Queue and SUAVE displays which addressed fusion as well led to improved performance on a number of measures. In this paper we will review our experiences in designing and testing asynchronous displays and discuss challenges to their use including tracking dynamic targets. © 2012 by the American Institute of Aeronautics and Astronautics, Inc

    Do less and achieve more: Training CNNs for action recognition utilizing action images from the Web

    Full text link
    Recently, attempts have been made to collect millions of videos to train Convolutional Neural Network (CNN) models for action recognition in videos. However, curating such large-scale video datasets requires immense human labor, and training CNNs on millions of videos demands huge computational resources. In contrast, collecting action images from the Web is much easier and training on images requires much less computation. In addition, labeled web images tend to contain discriminative action poses, which highlight discriminative portions of a video’s temporal progression. Through extensive experiments, we explore the question of whether we can utilize web action images to train better CNN models for action recognition in videos. We collect 23.8K manually filtered images from the Web that depict the 101 actions in the UCF101 action video dataset. We show that by utilizing web action images along with videos in training, significant performance boosts of CNN models can be achieved. We also investigate the scalability of the process by leveraging crawled web images (unfiltered) for UCF101 and ActivityNet. Using unfiltered images we can achieve performance improvements that are on-par with using filtered images. This means we can further reduce annotation labor and easily scale-up to larger problems. We also shed light on an artifact of finetuning CNN models that reduces the effective parameters of the CNN and show that using web action images can significantly alleviate this problem.https://arxiv.org/pdf/1512.07155v1.pdfFirst author draf

    Anti-social behavior detection in audio-visual surveillance systems

    Get PDF
    In this paper we propose a general purpose framework for detection of unusual events. The proposed system is based on the unsupervised method for unusual scene detection in web{cam images that was introduced in [1]. We extend their algorithm to accommodate data from different modalities and introduce the concept of time-space blocks. In addition, we evaluate early and late fusion techniques for our audio-visual data features. The experimental results on 192 hours of data show that data fusion of audio and video outperforms using a single modality

    Fusing MPEG-7 visual descriptors for image classification

    Get PDF
    This paper proposes three content-based image classification techniques based on fusing various low-level MPEG-7 visual descriptors. Fusion is necessary as descriptors would be otherwise incompatible and inappropriate to directly include e.g. in a Euclidean distance. Three approaches are described: A “merging” fusion combined with an SVM classifier, a back-propagation fusion combined with a KNN classifier and a Fuzzy-ART neurofuzzy network. In the latter case, fuzzy rules can be extracted in an effort to bridge the “semantic gap” between the low-level descriptors and the high-level semantics of an image. All networks were evaluated using content from the repository of the aceMedia project1 and more specifically in a beach/urban scene classification problem
    corecore