10 research outputs found



    Artifical Intelligence for Human Computing

    This book constitutes the thoroughly refereed post-proceedings of two events discussing AI for Human Computing: a Special Session during the Eighth International ACM Conference on Multimodal Interfaces (ICMI 2006), held in Banff, Canada, in November 2006, and a Workshop organized in conjunction with the 20th International Joint Conference on Artificial Intelligence (IJCAI 2007), held in Hyderabad, India, in January 2007. Many of the contributions in this state-of-the-art survey are updated and extended versions of the papers presented during these two events. To give a more complete overview of research efforts in the field of human computing, a number of additional invited contributions are also included. The 17 revised papers were carefully selected from numerous submissions to and presentations made at the two events, and include invited articles that round off coverage of the relevant topics in this emerging field. The papers are organized in three parts: foundational issues of human computing, sensing humans and their activities, and anthropocentric interaction models.

    Discriminative Dictionary Learning with Motion Weber Local Descriptor for Violence Detection

    © 1991-2012 IEEE. Automatic violence detection from video is a hot topic for many video surveillance applications. However, there has been little success in developing an algorithm that can detect violence in surveillance videos with high performance. In this paper, following our recently proposed idea of the motion Weber local descriptor (WLD), we make two major improvements and propose a more effective and efficient algorithm for detecting violence from motion images. First, we propose an improved WLD (IWLD) to better depict low-level image appearance information, and then extend the spatial descriptor IWLD by adding a temporal component to capture local motion information and hence form the motion IWLD. Second, we propose a modified sparse-representation-based classification model to both control the reconstruction error of coding coefficients and minimize the classification error. Based on the proposed sparse model, a class-specific dictionary containing dictionary atoms corresponding to the class labels is learned using class labels of training samples. With this learned dictionary, not only the representation residual but also the representation coefficients become discriminative. A classification scheme integrating the modified sparse model is developed to exploit such discriminative information. The experimental results on three benchmark data sets demonstrate the superior performance of the proposed approach over the state of the art.

    MoWLD: A Robust Motion Image Descriptor for Violence Detection

    Automatic violence detection from video is a hot topic for many video surveillance applications. However, there has been little success in designing an algorithm that can detect violence in surveillance videos with high performance. Existing methods typically apply the Bag-of-Words (BoW) model on local spatiotemporal descriptors. However, traditional spatiotemporal features are not discriminative enough, and the BoW model roughly assigns each feature vector to only one visual word and therefore ignores the spatial relationships among the features. To tackle these problems, in this paper we propose a novel Motion Weber Local Descriptor (MoWLD) in the spirit of the well-known WLD and make it a powerful and robust descriptor for motion images. We extend the WLD spatial descriptions by adding a temporal component to the appearance descriptor, which implicitly captures local motion information as well as low-level image appearance information. To eliminate redundant and irrelevant features, nonparametric Kernel Density Estimation (KDE) is employed on the MoWLD descriptor. To obtain more discriminative features, we adopt a sparse coding and max pooling scheme to further process the selected MoWLDs. Experimental results on three benchmark datasets demonstrate the superiority of the proposed approach over the state of the art.
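
    As background for the WLD that this abstract builds on, the following is a minimal sketch of the differential excitation component of the original spatial WLD: for each pixel, the arctangent of the summed intensity differences to its 8 neighbours, divided by the centre intensity. It does not cover the orientation component, the temporal extension, or any other detail of MoWLD, none of which the abstract specifies. The function name and the `eps` guard against division by zero are assumptions made for this illustration.

```python
import math


def differential_excitation(image, eps=1e-6):
    """Differential excitation of the spatial WLD on a 2D grayscale image.

    `image` is a list of equal-length rows of numbers. Border pixels are
    left at 0.0 because they lack a full 3x3 neighbourhood (an assumption
    of this sketch, not something the abstract prescribes).
    """
    rows, cols = len(image), len(image[0])
    out = [[0.0] * cols for _ in range(rows)]
    for r in range(1, rows - 1):
        for c in range(1, cols - 1):
            center = image[r][c]
            # Sum of intensity differences between the 8 neighbours and the centre.
            diff_sum = sum(
                image[r + dr][c + dc] - center
                for dr in (-1, 0, 1)
                for dc in (-1, 0, 1)
                if (dr, dc) != (0, 0)
            )
            # eps avoids division by zero for black pixels (hypothetical guard).
            out[r][c] = math.atan(diff_sum / (center + eps))
    return out
```

    A flat patch gives an excitation of zero, while a dark pixel surrounded by brighter neighbours gives a large positive value, which is the relative (Weber-like) contrast the descriptor is named after.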

    Spatiotemporal visual analysis of human actions

    No full text
    In this dissertation we propose four methods for the recognition of human activities. In all four, the representation of the activities is based on spatiotemporal features that are automatically detected in areas with a significant amount of independent motion, that is, motion due to ongoing activities in the scene. We use spatiotemporal salient points as features throughout this dissertation. The algorithms presented, however, can be used with any kind of features, as long as the latter are well localized and have a well-defined area of support in space and time. We introduce the spatiotemporal salient points in the first method. By extending previous work on spatial saliency, we measure the variations in the information content of pixel neighborhoods both in space and time, and detect the points at the locations and scales for which this information content is locally maximized. In this way, an activity is represented as a collection of spatiotemporal salient points. We propose an iterative linear space-time warping technique to align the representations in space and time, and use Relevance Vector Machines (RVM) to classify each example into an action category. The second method enhances the representations acquired by the first. More specifically, we propose to track each detected point in time, and create representations based on sets of trajectories, where each trajectory expresses how the information engulfed by each salient point evolves over time. To deal with imperfect localization of the detected points, we augment the observation model of the tracker with background information, acquired using a fully automatic background estimation algorithm. In this way, the tracker favors solutions that contain a large number of foreground pixels.
In addition, we perform experiments where the tracked templates are localized on specific parts of the body, like the hands and the head, and we further augment the tracker's observation model using a human skin color model. Finally, we use a variant of the Longest Common Subsequence algorithm (LCSS) to acquire a similarity measure between the resulting trajectory representations, and RVMs for classification. In the third method, we assume that neighboring salient points follow a similar motion. This is in contrast to the previous method, where each salient point was tracked independently of its neighbors. More specifically, we propose to extract a novel set of visual descriptors that are based on geometrical properties of three-dimensional piece-wise polynomials. The latter are fitted on the spatiotemporal locations of salient points that fall within local spatiotemporal neighborhoods, and are assumed to follow a similar motion. The extracted descriptors are invariant to translation and scaling in space-time. Coupling the neighborhood dimensions to the scale at which the corresponding spatiotemporal salient points are detected ensures the latter. The descriptors that are extracted across the whole dataset are subsequently clustered to create a codebook, which is used to represent the overall motion of the subjects within small temporal windows. Finally, we use boosting to select the most discriminative of these windows for each class, and RVMs for classification. The fourth and last method addresses the joint problem of localization and recognition of human activities depicted in unsegmented image sequences. Its main contribution is the use of an implicit representation of the spatiotemporal shape of the activity, which relies on the spatiotemporal localization of characteristic ensembles of spatiotemporal features. The latter are localized around automatically detected salient points.
Evidence for the spatiotemporal localization of the activity is accumulated in a probabilistic spatiotemporal voting scheme. During training, we use boosting to create codebooks of characteristic feature ensembles for each class. Subsequently, we construct class-specific spatiotemporal models, which encode where in space and time each codeword ensemble appears in the training set. During testing, each activated codeword ensemble casts probabilistic votes concerning the spatiotemporal localization of the activity, according to the information stored during training. We use a mean shift mode estimation algorithm to extract the most probable hypotheses from each resulting voting space. Each hypothesis corresponds to a spatiotemporal volume which potentially engulfs the activity, and is verified by performing action category classification with an RVM classifier.
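
The mode-extraction step of the fourth method can be illustrated with a generic mean shift iteration over a set of votes. This is a minimal sketch of the standard algorithm with a Gaussian kernel, not the dissertation's implementation: the votes here are unweighted points (the abstract describes probabilistic votes), and the function name, `bandwidth`, `tol`, and `max_iter` are assumptions introduced for this example.

```python
import math


def mean_shift_mode(points, start, bandwidth=1.0, tol=1e-6, max_iter=200):
    """Climb from `start` to a nearby density mode of `points`.

    `points` is a list of equal-length coordinate tuples, for example
    (x, y, t) votes. Each iteration moves the estimate to the
    Gaussian-weighted mean of all points around the current position.
    """
    current = list(start)
    for _ in range(max_iter):
        # Gaussian kernel weight of every vote relative to the current estimate.
        weights = [
            math.exp(
                -sum((p - c) ** 2 for p, c in zip(point, current))
                / (2.0 * bandwidth ** 2)
            )
            for point in points
        ]
        total = sum(weights)
        if total == 0.0:
            break  # no vote within numerical reach of the current estimate
        updated = [
            sum(w * point[d] for w, point in zip(weights, points)) / total
            for d in range(len(current))
        ]
        shift = math.sqrt(sum((u - c) ** 2 for u, c in zip(updated, current)))
        current = updated
        if shift < tol:
            break
    return current
```

Running the iteration from several starting points and keeping the distinct convergence points is one common way to obtain multiple hypotheses from a single voting space.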

    Individual and group dynamic behaviour patterns in bound spaces

    The behaviour analysis of individual and group dynamics in closed spaces is a subject of extensive research in both academia and industry. However, despite recent technological advancements, implementing the existing methods for visual behaviour data analysis in production systems remains difficult, and applications are available only in special cases in which resourcing is not a problem. Most approaches concentrate on direct extraction and classification of visual features from the video footage in order to recognise the dynamic behaviour directly from the source. Such an approach allows recognising the elementary actions of moving objects directly, which is a difficult task on its own. The major factor that impacts the performance of video analytics methods is the need to combine processing of an enormous volume of video data with complex analysis of this data using computationally resource-demanding analytical algorithms. This is not feasible for many applications, which must work in real time. In this research, an alternative simulation-based approach for behaviour analysis has been adopted. It can potentially reduce the requirements for extracting information from real video footage for the purpose of analysing the dynamic behaviour. This can be achieved by combining only limited data extracted from the original video footage with symbolic data about the events registered on the scene, which is generated by a 3D simulation synchronized with the original footage. Additionally, by incorporating some physical laws and the logic of dynamic behaviour directly in the 3D model of the visual scene, this framework allows capturing the behavioural patterns using simple syntactic pattern recognition methods.
Extensive experiments with the prototype implementation show that the 3D simulation generates sufficiently rich data to allow analysing the dynamic behaviour in real time with sufficient adequacy, without the need for precise physical data, using only limited data about the objects on the scene, their location and dynamic characteristics. This research can have wide applicability in areas where video analytics is necessary, ranging from public safety and video surveillance to marketing research, computer games and animation. Its limitations are linked to the dependence on some preliminary processing of the video footage, which is still less detailed and computationally demanding than methods that work directly on the video frames of the original footage.
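
One simple form of syntactic pattern recognition over symbolic event data is to encode each event as a symbol and match the resulting string against a regular grammar. The sketch below illustrates only that general idea. The event names, the symbol alphabet, and the "loitering" pattern are all hypothetical and are not taken from the research described above.

```python
import re

# Hypothetical event alphabet: one character per symbolic event type.
EVENT_SYMBOLS = {"enter": "E", "walk": "W", "stop": "S", "leave": "L"}

# Hypothetical behaviour pattern: enter, stop at least twice (with optional
# walking in between), then leave.
LOITERING = re.compile(r"E(?:W*S){2,}W*L")


def encode(events):
    """Turn a list of event names into a symbol string."""
    return "".join(EVENT_SYMBOLS[name] for name in events)


def is_loitering(events):
    """Return True when the whole event sequence matches the pattern."""
    return LOITERING.fullmatch(encode(events)) is not None
```

For instance, `is_loitering(["enter", "walk", "stop", "walk", "stop", "leave"])` matches the pattern, while a sequence that simply enters, walks, and leaves does not.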

    Trajectory-based Representation of Human Actions

    No full text
    This work addresses the problem of human action recognition by introducing a representation of a human action as a collection of short trajectories that are extracted in areas of the scene with a significant amount of visual activity. The trajectories are extracted by an auxiliary particle filtering tracking scheme that is initialized at points that are considered salient both in space and time. The spatiotemporal salient points are detected by measuring the variations in the information content of pixel neighborhoods in space and time. We implement an online background estimation algorithm to deal with inadequate localization of the salient points on the moving parts in the scene, and to improve the overall performance of the particle filter tracking scheme. We use a variant of the Longest Common Subsequence algorithm (LCSS) to compare different sets of trajectories corresponding to different actions. We use Relevance Vector Machines (RVM) to address the classification problem. We propose new kernels for use by the RVM, which are specifically tailored to the proposed representation of short trajectories. The basis of these kernels is the modified LCSS distance of the previous step. We present results on real image sequences from a small database depicting people performing 12 aerobic exercises.
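
    To make the trajectory comparison step concrete, here is a minimal sketch of a commonly used LCSS formulation for trajectories: two samples match when every coordinate differs by at most `epsilon` and their indices differ by at most `delta`. The abstract does not specify its modified LCSS, so this is the generic version rather than the paper's variant. The function names, both thresholds, and the normalisation by the shorter trajectory are assumptions of this example.

```python
def lcss_length(traj_a, traj_b, epsilon, delta):
    """Length of the longest common subsequence of two trajectories.

    Each trajectory is a list of equal-length coordinate tuples sampled
    over time. Computed with the standard dynamic programming table.
    """
    n, m = len(traj_a), len(traj_b)
    table = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            close_in_space = all(
                abs(a - b) <= epsilon
                for a, b in zip(traj_a[i - 1], traj_b[j - 1])
            )
            close_in_time = abs(i - j) <= delta
            if close_in_space and close_in_time:
                table[i][j] = table[i - 1][j - 1] + 1
            else:
                table[i][j] = max(table[i - 1][j], table[i][j - 1])
    return table[n][m]


def lcss_similarity(traj_a, traj_b, epsilon, delta):
    """LCSS length normalised by the shorter trajectory, in [0, 1]."""
    if not traj_a or not traj_b:
        return 0.0
    shortest = min(len(traj_a), len(traj_b))
    return lcss_length(traj_a, traj_b, epsilon, delta) / shortest
```

    A distance can then be taken as one minus the similarity, which is the kind of quantity a kernel over sets of trajectories could be built on.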