Content modelling for human action detection via multidimensional approach
Video content analysis is an active research domain owing to the growing availability of audiovisual data in digital format. Video content needs to be extracted automatically to support efficient access, understanding, browsing and retrieval of videos. To obtain the information that is of interest and to provide better entertainment, tools are needed to help users extract relevant content and navigate effectively through the large amount of available video information. Existing methods do not seem to attempt to model and estimate the semantic content of the video. Detecting and interpreting human presence, actions and activities is one of the most valuable functions in the proposed framework. The general objectives of this research are to analyze and process the audio-video streams to build a robust audiovisual action recognition system by integrating, structuring and accessing multimodal information via a multidimensional retrieval and extraction model. The proposed technique characterizes the action scenes by integrating cues obtained from both the audio and video tracks. Information is combined based on visual features (motion, edge, and visual characteristics of objects), audio features and video for recognizing action. This model uses HMM and GMM to provide a framework for fusing these features and to represent the multidimensional structure of the framework. The action-related visual cues are obtained by computing the spatio-temporal dynamic activity from the video shots and by abstracting specific visual events. Simultaneously, the audio features are analyzed by locating and computing several sound effects of action events that are embedded in the video. Finally, these audio and visual cues are combined to identify the action scenes. Compared with using a single source, either the visual or the audio track alone, such combined audiovisual information provides more reliable performance and allows us to understand the story content of movies in more detail.
To evaluate the usefulness of the proposed framework, several experiments were conducted, and results were obtained using visual features only (77.89% precision; 72.10% recall), audio features only (62.52% precision; 48.93% recall) and combined audiovisual features (90.35% precision; 90.65% recall).
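The three precision/recall pairs reported above can be compared on a single scale with the F1 score, the harmonic mean of precision and recall. The sketch below is only an illustration: the abstract does not report F1, and the helper name `f1_score` and the dictionary labels are assumptions introduced here.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (same units as the inputs)."""
    if precision + recall == 0:
        return 0.0
    return 2.0 * precision * recall / (precision + recall)


# Precision and recall (in percent) as quoted in the abstract.
# F1 is derived here for comparison; it is not a figure from the source.
reported = {
    "visual only": (77.89, 72.10),
    "audio only": (62.52, 48.93),
    "combined audiovisual": (90.35, 90.65),
}

f1_by_setting = {name: f1_score(p, r) for name, (p, r) in reported.items()}

for name, f1 in f1_by_setting.items():
    print(f"{name}: F1 = {f1:.2f}%")
```

Under this derived measure the ordering matches the abstract's claim: the combined audiovisual setting scores highest, followed by visual only, then audio only.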
Automatic Detection of Pain from Spontaneous Facial Expressions
This paper presents a new approach for detecting pain in sequences of spontaneous facial expressions. The motivation for this work is to accompany mobile-based self-management of chronic pain as a virtual sensor for tracking patients' expressions in real-world settings. Operating under such constraints requires a resource-efficient approach for processing non-posed facial expressions from unprocessed temporal data. In this work, the facial action units of pain are modeled as sets of distances among related facial landmarks. Using standardized measurements of pain versus no-pain that are specific to each user, changes in the extracted features in relation to pain are detected. The activated features in each frame are combined using an adapted form of the Prkachin and Solomon Pain Intensity scale (PSPI) to detect the presence of pain per frame. Painful features must be activated in N consecutive frames (time window) to indicate the presence of pain in a session. The discussed method was tested on 171 video sessions for 19 subjects from the McMaster painful dataset for spontaneous facial expressions. The results show higher precision than coverage in detecting sequences of pain. Our algorithm achieves 94% precision (F-score=0.82) against human-observed labels, 74% precision (F-score=0.62) against automatically generated pain intensities and 100% precision (F-score=0.67) against self-reported pain intensities.
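The time-window rule in this abstract (pain is reported for a session only when painful features stay active for N frames in a row) can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function name `session_has_pain` and the use of one boolean flag per frame are assumptions, and the per-frame decision itself (the adapted PSPI step) is taken as given.

```python
from typing import Iterable


def session_has_pain(frame_flags: Iterable[bool], n: int) -> bool:
    """Return True if at least n consecutive frames are flagged as painful.

    frame_flags: one boolean per frame (hypothetical output of a
                 per-frame pain detector).
    n:           required length of the run of painful frames.
    """
    if n < 1:
        raise ValueError("n must be at least 1")
    run = 0  # length of the current run of painful frames
    for flag in frame_flags:
        run = run + 1 if flag else 0
        if run >= n:
            return True
    return False


# Three painful frames in a row satisfy a window of N = 3.
print(session_has_pain([False, True, True, True, False], n=3))
# Four painful frames split by a gap do not.
print(session_has_pain([True, True, False, True, True], n=3))
```

Requiring a run of frames rather than a single activation trades coverage for precision, which is consistent with the abstract's remark that precision is higher than coverage.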
PersonRank: Detecting Important People in Images
In events such as presentations, basketball games or speeches, some individuals
in an image are more important or attractive than others. However, it is
challenging to find important people among all individuals in images directly
based on their spatial or appearance information due to the existence of
diverse variations of pose, action, appearance of persons and various changes
of occasions. We overcome this difficulty by constructing a multiple
Hyper-Interaction Graph to treat each individual in an image as a node and
inferring the most active node from the interactions estimated by various
types of cues. We model pairwise interactions between persons as the edge
message communicated between nodes, resulting in a bidirectional
pairwise-interaction graph. To enrich the person-person interaction estimation,
we further introduce a unidirectional hyper-interaction graph that models the
consensus of interaction between a focal person and any person in a local
region around. Finally, we modify the PageRank algorithm to infer the
activeness of persons on the multiple Hybrid-Interaction Graph (HIG), the union
of the pairwise-interaction and hyper-interaction graphs, and we call our
algorithm PersonRank. In order to provide publicly available datasets for
evaluation, we have contributed a new dataset called Multi-scene Important
People Image Dataset and gathered an NCAA Basketball Image Dataset from sports
game sequences. We have demonstrated that the proposed PersonRank outperforms
related methods clearly and substantially.
Comment: 8 pages, conferenc
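For orientation, the sketch below shows the standard PageRank power iteration on a small weighted, directed interaction graph. It is not the modified algorithm from the paper, whose details the abstract does not give. The function name `pagerank`, the damping factor of 0.85, and the toy weight matrix are assumptions made for illustration only.

```python
from typing import List


def pagerank(weights: List[List[float]], damping: float = 0.85,
             iters: int = 200) -> List[float]:
    """Standard PageRank by power iteration.

    weights[i][j] is a non-negative interaction strength from person i
    to person j (hypothetical values; the paper estimates these from
    several types of cues).
    """
    n = len(weights)
    out_strength = [sum(row) for row in weights]
    rank = [1.0 / n] * n
    for _ in range(iters):
        # Every node receives the same teleportation share.
        new_rank = [(1.0 - damping) / n] * n
        for i in range(n):
            if out_strength[i] == 0:
                # Dangling node: spread its rank uniformly over all nodes.
                for j in range(n):
                    new_rank[j] += damping * rank[i] / n
            else:
                for j in range(n):
                    new_rank[j] += (damping * rank[i]
                                    * weights[i][j] / out_strength[i])
        rank = new_rank
    return rank


# Toy image with three people: persons 0 and 1 both direct their
# interaction at person 2, and person 2 interacts with person 0.
toy_weights = [
    [0.0, 0.0, 1.0],
    [0.0, 0.0, 1.0],
    [1.0, 0.0, 0.0],
]
scores = pagerank(toy_weights)
most_important = max(range(len(scores)), key=scores.__getitem__)
print(scores, most_important)
```

In this toy graph the person who receives interaction from both others ends up with the highest score, which is the intuition behind ranking people by activeness on an interaction graph.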
Self-Supervised Vision-Based Detection of the Active Speaker as Support for Socially-Aware Language Acquisition
This paper presents a self-supervised method for visual detection of the
active speaker in a multi-person spoken interaction scenario. Active speaker
detection is a fundamental prerequisite for any artificial cognitive system
attempting to acquire language in social settings. The proposed method is
intended to complement the acoustic detection of the active speaker, thus
improving the system robustness in noisy conditions. The method can detect an
arbitrary number of possibly overlapping active speakers based exclusively on
visual information about their face. Furthermore, the method does not rely on
external annotations, thus complying with cognitive development. Instead, the
method uses information from the auditory modality to support learning in the
visual domain. This paper reports an extensive evaluation of the proposed
method using a large multi-person face-to-face interaction dataset. The results
show good performance in a speaker-dependent setting. However, in a
speaker-independent setting the proposed method yields a significantly lower
performance. We believe that the proposed method represents an essential
component of any artificial cognitive system or robotic platform engaging in
social interactions.
Comment: 10 pages, IEEE Transactions on Cognitive and Developmental System