382 research outputs found
Action Recognition in Videos: from Motion Capture Labs to the Web
This paper presents a survey of human action recognition approaches based on
visual data recorded from a single video camera. We propose an organizing
framework which puts in evidence the evolution of the area, with techniques
moving from heavily constrained motion capture scenarios towards more
challenging, realistic, "in the wild" videos. The proposed organization is
based on the representation used as input for the recognition task, emphasizing
the hypothesis assumed and thus, the constraints imposed on the type of video
that each technique is able to address. Expliciting the hypothesis and
constraints makes the framework particularly useful to select a method, given
an application. Another advantage of the proposed organization is that it
allows categorizing newest approaches seamlessly with traditional ones, while
providing an insightful perspective of the evolution of the action recognition
task up to now. That perspective is the basis for the discussion in the end of
the paper, where we also present the main open issues in the area.Comment: Preprint submitted to CVIU, survey paper, 46 pages, 2 figures, 4
table
SocAoG: Incremental Graph Parsing for Social Relation Inference in Dialogues
Inferring social relations from dialogues is vital for building emotionally
intelligent robots to interpret human language better and act accordingly. We
model the social network as an And-or Graph, named SocAoG, for the consistency
of relations among a group and leveraging attributes as inference cues.
Moreover, we formulate a sequential structure prediction task, and propose an
-- strategy to incrementally parse SocAoG for the
dynamic inference upon any incoming utterance: (i) an process
predicting attributes and relations conditioned on the semantics of dialogues,
(ii) a process updating the social relations based on related
attributes, and (iii) a process updating individual's attributes based
on interpersonal social relations. Empirical results on DialogRE and MovieGraph
show that our model infers social relations more accurately than the
state-of-the-art methods. Moreover, the ablation study shows the three
processes complement each other, and the case study demonstrates the dynamic
relational inference.Comment: Long paper (oral) accepted by ACL-IJCNLP 202
REPRESENTATION LEARNING FOR ACTION RECOGNITION
The objective of this research work is to develop discriminative representations for human
actions. The motivation stems from the fact that there are many issues encountered while
capturing actions in videos like intra-action variations (due to actors, viewpoints, and duration),
inter-action similarity, background motion, and occlusion of actors. Hence, obtaining
a representation which can address all the variations in the same action while maintaining
discrimination with other actions is a challenging task. In literature, actions have been represented
either using either low-level or high-level features. Low-level features describe
the motion and appearance in small spatio-temporal volumes extracted from a video. Due
to the limited space-time volume used for extracting low-level features, they are not able
to account for viewpoint and actor variations or variable length actions. On the other hand,
high-level features handle variations in actors, viewpoints, and duration but the resulting
representation is often high-dimensional which introduces the curse of dimensionality. In
this thesis, we propose new representations for describing actions by combining the advantages
of both low-level and high-level features. Specifically, we investigate various linear
and non-linear decomposition techniques to extract meaningful attributes in both high-level
and low-level features. In the first approach, the sparsity of high-level feature descriptors is leveraged to build
action-specific dictionaries. Each dictionary retains only the discriminative information
for a particular action and hence reduces inter-action similarity. Then, a sparsity-based
classification method is proposed to classify the low-rank representation of clips obtained
using these dictionaries. We show that this representation based on dictionary learning improves
the classification performance across actions. Also, a few of the actions consist of
rapid body deformations that hinder the extraction of local features from body movements.
Hence, we propose to use a dictionary which is trained on convolutional neural network
(CNN) features of the human body in various poses to reliably identify actors from the
background. Particularly, we demonstrate the efficacy of sparse representation in the identification
of the human body under rapid and substantial deformation.
In the first two approaches, sparsity-based representation is developed to improve discriminability
using class-specific dictionaries that utilize action labels. However, developing
an unsupervised representation of actions is more beneficial as it can be used to both
recognize similar actions and localize actions. We propose to exploit inter-action similarity
to train a universal attribute model (UAM) in order to learn action attributes (common and
distinct) implicitly across all the actions. Using maximum aposteriori (MAP) adaptation,
a high-dimensional super action-vector (SAV) for each clip is extracted. As this SAV contains
redundant attributes of all other actions, we use factor analysis to extract a novel lowvi
dimensional action-vector representation for each clip. Action-vectors are shown to suppress
background motion and highlight actions of interest in both trimmed and untrimmed
clips that contributes to action recognition without the help of any classifiers.
It is observed during our experiments that action-vector cannot effectively discriminate
between actions which are visually similar to each other. Hence, we subject action-vectors
to supervised linear embedding using linear discriminant analysis (LDA) and probabilistic
LDA (PLDA) to enforce discrimination. Particularly, we show that leveraging complimentary
information across action-vectors using different local features followed by discriminative
embedding provides the best classification performance. Further, we explore
non-linear embedding of action-vectors using Siamese networks especially for fine-grained
action recognition. A visualization of the hidden layer output in Siamese networks shows
its ability to effectively separate visually similar actions. This leads to better classification
performance than linear embedding on fine-grained action recognition.
All of the above approaches are presented on large unconstrained datasets with hundreds
of examples per action. However, actions in surveillance videos like snatch thefts are
difficult to model because of the diverse variety of scenarios in which they occur and very
few labeled examples. Hence, we propose to utilize the universal attribute model (UAM)
trained on large action datasets to represent such actions. Specifically, we show that there
are similarities between certain actions in the large datasets with snatch thefts which help
in extracting a representation for snatch thefts using the attributes from the UAM. This
representation is shown to be effective in distinguishing snatch thefts from regular actions
with high accuracy.In summary, this thesis proposes both supervised and unsupervised approaches for representing
actions which provide better discrimination than existing representations. The
first approach presents a dictionary learning based sparse representation for effective discrimination
of actions. Also, we propose a sparse representation for the human body based
on dictionaries in order to recognize actions with rapid body deformations. In the next
approach, a low-dimensional representation called action-vector for unsupervised action
recognition is presented. Further, linear and non-linear embedding of action-vectors is
proposed for addressing inter-action similarity and fine-grained action recognition, respectively.
Finally, we propose a representation for locating snatch thefts among thousands of
regular interactions in surveillance videos
Content-Based Access Control
In conventional database, the most popular access control model specifies policies explicitly for each role of every user against each data object manually. Nowadays, in large-scale content-centric data sharing, conventional approaches could be impractical due to exponential explosion of the data growth and the sensitivity of data objects. What's more, conventional database access control policy will not be functional when the semantic content of data is expected to play a role in access decisions. Users are often over-privileged, and ex post facto auditing is enforced to detect misuse of the privileges. Unfortunately, it is usually difficult to reverse the damage, as (large amount of) data has been disclosed already. In this dissertation, we first introduce Content-Based Access Control (CBAC), an innovative access control model for content-centric information sharing. As a complement to conventional access control models, the CBAC model makes access control decisions based on the content similarity between user credentials and data content automatically. In CBAC, each user is allowed by a metarule to access "a subset" of the designated data objects of a content-centric database, while the boundary of the subset is dynamically determined by the textual content of data objects. We then present an enforcement mechanism for CBAC that exploits Oracles Virtual Private Database (VPD) to implement a row-wise access control and to prevent data objects from being abused by unnecessary access admission. To further improve the performance of the proposed approach, we introduce a content-based blocking mechanism to improve the efficiency of CBAC enforcement to further reveal a more relevant part of the data objects comparing with only using the user credentials and data content. We also utilized several tagging mechanisms for more accurate textual content matching for short text snippets (e.g. short VarChar attributes) to extract topics other than pure word occurrences to represent the content of data. In the tagging mechanism, the similarity of content is calculated not purely dependent on the word occurrences but the semantic topics underneath the text content. Experimental results show that CBAC makes accurate access control decisions with a small overhead
- …