
    Spatiotemporal visual analysis of human actions

In this dissertation we propose four methods for the recognition of human activities. In all four, the representation of the activities is based on spatiotemporal features that are automatically detected at areas where there is a significant amount of independent motion, that is, motion that is due to ongoing activities in the scene. We use spatiotemporal salient points as features throughout this dissertation. The algorithms presented, however, can be used with any kind of features, as long as the latter are well localized and have a well-defined area of support in space and time.

We introduce the utilized spatiotemporal salient points in the first method presented in this dissertation. By extending previous work on spatial saliency, we measure the variations in the information content of pixel neighborhoods both in space and time, and detect the points at the locations and scales for which this information content is locally maximized. In this way, an activity is represented as a collection of spatiotemporal salient points. We propose an iterative linear space-time warping technique in order to align the representations in space and time, and use Relevance Vector Machines (RVM) in order to classify each example into an action category.

In the second method we propose to enhance the representations acquired by the first method. More specifically, we track each detected point in time and create representations based on sets of trajectories, where each trajectory expresses how the information engulfed by each salient point evolves over time. In order to deal with imperfect localization of the detected points, we augment the observation model of the tracker with background information, acquired using a fully automatic background estimation algorithm. In this way, the tracker favors solutions that contain a large number of foreground pixels. In addition, we perform experiments where the tracked templates are localized on specific parts of the body, like the hands and the head, and we further augment the tracker's observation model using a human skin color model. Finally, we use a variant of the Longest Common Subsequence (LCSS) algorithm in order to acquire a similarity measure between the resulting trajectory representations, and RVMs for classification.

In the third method, we assume that neighboring salient points follow a similar motion. This is in contrast to the previous method, where each salient point was tracked independently of its neighbors. More specifically, we extract a novel set of visual descriptors that are based on geometrical properties of three-dimensional piecewise polynomials. The latter are fitted to the spatiotemporal locations of salient points that fall within local spatiotemporal neighborhoods and are assumed to follow a similar motion. The extracted descriptors are invariant to translation and scaling in space-time; this is ensured by coupling the neighborhood dimensions to the scale at which the corresponding spatiotemporal salient points are detected. The descriptors extracted across the whole dataset are subsequently clustered in order to create a codebook, which is used to represent the overall motion of the subjects within small temporal windows. Finally, we use boosting in order to select the most discriminative of these windows for each class, and RVMs for classification.
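Since the second method hinges on a trajectory similarity measure, the following is a minimal sketch of an LCSS-style similarity between two 2D trajectories. The spatial threshold epsilon, the temporal window delta, and the normalisation by the shorter trajectory are illustrative assumptions, not the dissertation's exact formulation.

```python
import numpy as np

def lcss_similarity(traj_a, traj_b, epsilon=5.0, delta=10):
    """Normalised LCSS similarity between two trajectories.

    traj_a, traj_b : (N, 2) and (M, 2) arrays of (x, y) positions over time.
    epsilon        : maximum spatial distance for two points to 'match'.
    delta          : maximum temporal index offset allowed for a match.
    """
    n, m = len(traj_a), len(traj_b)
    dp = np.zeros((n + 1, m + 1), dtype=int)
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            close_in_space = np.linalg.norm(traj_a[i - 1] - traj_b[j - 1]) < epsilon
            close_in_time = abs(i - j) <= delta
            if close_in_space and close_in_time:
                dp[i, j] = dp[i - 1, j - 1] + 1
            else:
                dp[i, j] = max(dp[i - 1, j], dp[i, j - 1])
    # Normalise by the shorter trajectory so the score lies in [0, 1].
    return dp[n, m] / min(n, m)

# Example: two noisy versions of the same motion yield a high similarity.
t = np.linspace(0, 1, 50)
a = np.stack([100 * t, 50 * np.sin(2 * np.pi * t)], axis=1)
b = a + np.random.normal(scale=1.0, size=a.shape)
print(lcss_similarity(a, b))  # close to 1.0
```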
The fourth and last method addresses the joint problem of localization and recognition of human activities depicted in unsegmented image sequences. Its main contribution is the use of an implicit representation of the spatiotemporal shape of the activity, which relies on the spatiotemporal localization of characteristic ensembles of spatiotemporal features. The latter are localized around automatically detected salient points. Evidence for the spatiotemporal localization of the activity is accumulated in a probabilistic spatiotemporal voting scheme. During training, we use boosting in order to create codebooks of characteristic feature ensembles for each class. Subsequently, we construct class-specific spatiotemporal models, which encode where in space and time each codeword ensemble appears in the training set. During testing, each activated codeword ensemble casts probabilistic votes concerning the spatiotemporal localization of the activity, according to the information stored during training. We use a Mean Shift mode estimation algorithm in order to extract the most probable hypotheses from each resulting voting space. Each hypothesis corresponds to a spatiotemporal volume which potentially engulfs the activity, and is verified by performing action category classification with an RVM classifier.
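The following is an illustrative toy of the voting and mode-finding idea behind this fourth method: activated codeword ensembles cast weighted votes for the activity centre in (x, y, t), and a mean-shift iteration recovers the most probable hypothesis. The Gaussian kernel, bandwidth, and vote weights below are assumptions made for the example, not the method's actual settings.

```python
import numpy as np

def mean_shift_mode(votes, weights, bandwidth=10.0, iters=50):
    """Find one mode of a weighted point cloud in (x, y, t) voting space."""
    mode = np.average(votes, axis=0, weights=weights)  # start at the weighted mean
    for _ in range(iters):
        d2 = np.sum((votes - mode) ** 2, axis=1)
        k = weights * np.exp(-0.5 * d2 / bandwidth ** 2)  # Gaussian kernel responses
        new_mode = (k[:, None] * votes).sum(axis=0) / k.sum()
        if np.linalg.norm(new_mode - mode) < 1e-3:
            break
        mode = new_mode
    return mode

# Votes clustered around a true activity centre (60, 40, 25) plus uniform clutter.
rng = np.random.default_rng(0)
true_votes = rng.normal([60, 40, 25], 3.0, size=(200, 3))
clutter = rng.uniform([0, 0, 0], [120, 90, 50], size=(50, 3))
votes = np.vstack([true_votes, clutter])
weights = np.ones(len(votes))
print(mean_shift_mode(votes, weights))  # approximately [60, 40, 25]
```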

    Multimedia Decision Fusion

Ph.D. (Doctor of Philosophy)

BNAIC 2008: Proceedings of BNAIC 2008, the twentieth Belgian-Dutch Artificial Intelligence Conference


    Multimedia

The ubiquitous and effortless digital data capture and processing capabilities offered by the majority of today's devices have led to an unprecedented penetration of multimedia content into our everyday life. To make the most of this phenomenon, the rapidly increasing volume and usage of digitised content require constant re-evaluation and adaptation of multimedia methodologies, in order to meet the relentless change of requirements from both the user and system perspectives. Advances in Multimedia provides readers with an overview of the ever-growing field of multimedia by bringing together various research studies and surveys from different subfields that highlight these important aspects. Some of the main topics that this book deals with include: multimedia management in peer-to-peer structures and wireless networks, security characteristics in multimedia, semantic gap bridging for multimedia content, and novel multimedia applications.

    Learning Context-sensitive Human Emotions in Categorical and Dimensional Domains

Still image emotion recognition (ER) has been receiving increasing attention in recent years due to the tremendous amount of social media content on the Web. Many works offer both categorical and dimensional methods to detect image sentiments, while others focus on extracting the true social signals, such as happiness and anger. Deep learning architectures have delivered great success; however, their dependency on large-scale datasets labeled with (1) emotion, and (2) valence, arousal and dominance, in the categorical and dimensional domains respectively, introduces challenges the community tries to tackle. Emotions carry dissimilar semantics when aroused in different contexts; however, context-sensitive ER has so far been largely neglected in the literature. Moreover, while dimensional methods deliver higher accuracy, they have received less attention due to (1) the lack of reliable large-scale labeled datasets, and (2) the challenges involved in architecting unsupervised solutions to the problem. Despite the success of multi-modal ER, still image ER in the single-modal domain, i.e. using only still images, remains comparatively unexplored. In this work, (1) we first architect a novel fully automated dataset collection pipeline, equipped with a built-in semantic sanitizer, (2) we then build UCF-ER with 50K images, and LUCFER, the largest labeled ER dataset in the literature with more than 3.6M images, both datasets labeled with emotion and context, (3) next, we build a single-modal context-sensitive ER CNN model, fine-tuned on UCF-ER and LUCFER, (4) we then claim and show empirically that infusing context into the unified training process helps achieve a more balanced precision and recall, while boosting performance, yielding an overall classification accuracy of 73.12% compared to the state of the art 58.3%, (5) next, we propose an unsupervised approach for ranking of continuous emotions in images using canonical polyadic (CP) decomposition, providing theoretical proof that rank-1 CP decomposition can be used as a ranking machine, (6) finally, we provide empirical proof that our method achieves a Pearson Correlation Coefficient that outperforms the state of the art by a large margin, i.e. a 65.13% difference in one experiment and a 104.08% difference in another, when applied to valence rank estimation.
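As a toy illustration of the claim that a rank-1 CP decomposition can act as a ranking machine: if a 3-way feature tensor's image-mode variation is driven by a single latent (valence-like) score, the image-mode factor recovers that ordering. The synthetic tensor and the simple alternating-least-squares updates below are assumptions for illustration, not the paper's actual pipeline or data.

```python
import numpy as np

def rank1_cp(T, iters=200):
    """Rank-1 CP decomposition T ~ a o b o c via alternating least-squares updates."""
    _, J, K = T.shape
    b, c = np.ones(J), np.ones(K)
    for _ in range(iters):
        a = np.einsum('ijk,j,k->i', T, b, c) / ((b @ b) * (c @ c))
        b = np.einsum('ijk,i,k->j', T, a, c) / ((a @ a) * (c @ c))
        c = np.einsum('ijk,i,j->k', T, a, b) / ((a @ a) * (b @ b))
    return a, b, c

rng = np.random.default_rng(1)
valence = rng.uniform(0, 1, size=30)               # hidden per-image scores
pattern = rng.normal(size=(8, 8))                  # shared feature pattern
T = valence[:, None, None] * pattern + 0.05 * rng.normal(size=(30, 8, 8))

a, _, _ = rank1_cp(T)
a *= np.sign(np.corrcoef(a, valence)[0, 1])        # resolve the sign ambiguity
print(np.corrcoef(a, valence)[0, 1])               # Pearson correlation close to 1
```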

    An Ordinal Approach to Affective Computing

Both depression prediction and emotion recognition systems are often based on ordinal ground truth due to subjectively annotated datasets. Yet, both have so far been posed as classification or regression problems. These naive approaches have fundamental issues because they are not focused on ordering, unlike ordinal regression, which is the most appropriate for truly ordinal ground truth. Ordinal regression to date offers comparatively fewer and more limited methods than other branches of machine learning, and its usage has been limited to specific research domains. Accordingly, this thesis presents investigations into ordinal approaches for affective computing by describing a consistent framework to understand all ordinal system designs, proposing ordinal systems for large datasets, and introducing tools and principles to select suitable system designs and evaluation methods. First, three learning approaches are compared using the support vector framework to establish the empirical advantages of ordinal regression, which are lacking from the current literature. Results on depression and emotion corpora indicate that ordinal regression with proper tuning can improve existing depression and emotion systems. Ordinal logistic regression (OLR), which is an extension of logistic regression to ordinal scales, admits a number of model structures, from which the best structure must be chosen. Exploiting the newly proposed computationally efficient greedy algorithm for model structure selection (GREP), OLR outperformed or was comparable with state-of-the-art depression systems on two benchmark depression speech datasets. Deep learning has dominated many affective computing fields, and hence ordinal deep learning is an attractive prospect. However, it is under-studied even in the machine learning literature, which motivates an in-depth analysis of appropriate network architectures and loss functions. One of the significant outcomes of this analysis is the introduction of RankCNet, a novel ordinal network which utilises a surrogate loss function of rank correlation. Not only the modelling algorithm but also the choice of evaluation measure depends on the nature of the ground truth. Rank correlation measures, which are sensitive to ordering, are more apt for ordinal problems than common classification or regression measures that ignore ordering information. Although rank-based evaluation for ordinal problems is not new, so far in affective computing, ordinality of the ground truth has been widely ignored during evaluation. Hence, a systematic analysis in the affective computing context is presented, to provide clarity and encourage careful choice of evaluation measures. Another contribution is a neural network framework with a novel multi-term loss function to assess the ordinality of ordinally-annotated datasets, which can guide the selection of suitable learning and evaluation methods. Experiments on multiple synthetic and affective speech datasets reveal that the proposed system can offer reliable and meaningful predictions about the ordinality of a given dataset. Overall, the novel contributions and findings presented in this thesis not only improve prediction accuracy but also encourage future research towards ordinal affective computing: a different paradigm, but often the most appropriate.
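A small sketch of why ordering-sensitive evaluation matters for ordinal ground truth, as argued in this thesis: two hypothetical systems have the same accuracy, but one makes only near-miss errors and the other distant ones; Spearman's rank correlation separates them while plain accuracy cannot. The predictions below are invented purely for illustration.

```python
import numpy as np
from scipy.stats import spearmanr

truth     = np.array([0, 1, 2, 3, 4, 0, 1, 2, 3, 4])   # ordinal labels (e.g. severity levels)
near_miss = np.array([0, 2, 2, 3, 3, 0, 1, 1, 3, 4])   # errors are off by one level
far_miss  = np.array([0, 4, 2, 3, 0, 0, 1, 4, 3, 4])   # errors jump across the scale

for name, pred in [("near-miss", near_miss), ("far-miss", far_miss)]:
    acc = np.mean(pred == truth)                        # identical for both systems
    rho, _ = spearmanr(truth, pred)                     # much higher for near-miss errors
    print(f"{name}: accuracy={acc:.2f}, spearman={rho:.2f}")
```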

    Robust subspace learning for static and dynamic affect and behaviour modelling

Machine analysis of human affect and behavior in naturalistic contexts has witnessed growing attention in the last decade from various disciplines ranging from social and cognitive sciences to machine learning and computer vision. Endowing machines with the ability to seamlessly detect, analyze, model, predict as well as simulate and synthesize manifestations of internal emotional and behavioral states in real-world data is deemed essential for the deployment of next-generation, emotionally- and socially-competent human-centered interfaces. In this thesis, we are primarily motivated by the problem of modeling, recognizing and predicting spontaneous expressions of non-verbal human affect and behavior manifested through either low-level facial attributes in static images or high-level semantic events in image sequences. Both visual data and annotations of naturalistic affect and behavior naturally contain noisy measurements of unbounded magnitude at random locations, commonly referred to as 'outliers'. We present here machine learning methods that are robust to such gross, sparse noise. First, we deal with static analysis of face images, viewing the latter as a superposition of mutually-incoherent, low-complexity components corresponding to facial attributes, such as facial identity, expressions and activation of atomic facial muscle actions. We develop a robust, discriminant dictionary learning framework to extract these components from grossly corrupted training data and combine it with sparse representation to recognize the associated attributes. We demonstrate that our framework can jointly address interrelated classification tasks such as face and facial expression recognition. Inspired by the well-documented importance of the temporal aspect in perceiving affect and behavior, we direct the bulk of our research efforts into continuous-time modeling of dimensional affect and social behavior. Having identified a gap in the literature, namely the lack of data containing annotations of social attitudes in continuous time and scale, we first curate a new audio-visual database of multi-party conversations from political debates, annotated frame-by-frame in terms of real-valued conflict intensity, and use it to conduct the first study on continuous-time conflict intensity estimation. Our experimental findings corroborate previous evidence indicating the inability of existing classifiers to capture the hidden temporal structures of affective and behavioral displays. We present here a novel dynamic behavior analysis framework which models temporal dynamics in an explicit way, based on the natural assumption that continuous-time annotations of smoothly-varying affect or behavior can be viewed as outputs of a low-complexity linear dynamical system when behavioral cues (features) act as system inputs. A novel robust structured rank minimization framework is proposed to estimate the system parameters in the presence of gross corruptions and partially missing data. Experiments on prediction of dimensional conflict and affect as well as multi-object tracking from detection validate the effectiveness of our predictive framework and demonstrate for the first time that complex human behavior and affect can be learned and predicted based on small training sets of person(s)-specific observations.
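A minimal stand-in for the dynamic modelling idea in this framework: treat the continuous annotation y_t (e.g. conflict intensity) as the output of a low-order linear system driven by behavioral features x_t, and identify the parameters from data. The sketch below uses plain least squares on an ARX-style model with synthetic data; the thesis instead uses robust structured rank minimization to handle gross corruptions and missing data, which this toy does not attempt.

```python
import numpy as np

rng = np.random.default_rng(0)
T, d = 400, 3
x = rng.normal(size=(T, d))                      # behavioral cues (system inputs)
a_true, b_true = 0.85, np.array([0.5, -0.3, 0.2])
y = np.zeros(T)
for t in range(1, T):                            # simulate a smooth annotation trace
    y[t] = a_true * y[t - 1] + x[t] @ b_true + 0.05 * rng.normal()

# Stack regressors [y_{t-1}, x_t] and solve for [a, b] by least squares.
Phi = np.hstack([y[:-1, None], x[1:]])
theta, *_ = np.linalg.lstsq(Phi, y[1:], rcond=None)
print("estimated a:", theta[0], "estimated b:", theta[1:])
```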