Object Detection by Spatio-Temporal Analysis and Tracking of the Detected Objects in a Video with Variable Background
In this paper, we propose a novel approach for detecting and tracking objects
in videos with a variable background, i.e., videos captured by moving cameras
without any additional sensor. In a video captured by a moving camera, both the
background and foreground are changing in each frame of the image sequence. So
for these videos, modeling a single background with traditional background
modeling methods is infeasible, and thus detecting the actual moving object
in a variable background is a challenging task. To detect the actual moving object
in this work, spatio-temporal blobs are generated in each frame by
spatio-temporal analysis of the image sequence using a three-dimensional Gabor
filter. Individual blobs that are parts of one object are then merged using a
Minimum Spanning Tree to form the moving object in the variable background. The
height, width and four-bin gray-value histogram of the object are calculated as
its features and an object is tracked in each frame using these features to
generate the trajectories of the object through the video sequence. In this
work, the problem of data association during tracking is solved as a Linear
Assignment Problem, and occlusion is handled by applying a Kalman
filter. The major advantage of our method over most existing tracking
algorithms is that it does not require initialization in the
first frame or training on sample data. The algorithm
has been tested on benchmark videos, and very satisfactory results have been
achieved. Its performance is also comparable or superior
to that of some benchmark algorithms.
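The data-association step described in this abstract can be illustrated with a minimal sketch. It assumes each object is summarized by a feature tuple (height, width, four histogram bins) and that the cost of matching a track to a detection is the L1 distance between their features. The function names, the cost function, and the brute-force solver are illustrative assumptions, not the authors' implementation; a real tracker would typically use a polynomial-time solver such as the Hungarian algorithm.

```python
from itertools import permutations


def feature_cost(a, b):
    """L1 distance between two feature tuples (assumed cost, not from the paper)."""
    return sum(abs(x - y) for x, y in zip(a, b))


def associate(tracks, detections):
    """Solve a small linear assignment problem by exhaustive search.

    Returns (assignment, total_cost), where assignment[i] is the index of the
    detection matched to track i. Only practical for a handful of objects.
    """
    if len(tracks) > len(detections):
        raise ValueError("this sketch assumes at least as many detections as tracks")
    best_cost, best = float("inf"), ()
    # Try every way of giving each track a distinct detection.
    for perm in permutations(range(len(detections)), len(tracks)):
        cost = sum(feature_cost(tracks[i], detections[j]) for i, j in enumerate(perm))
        if cost < best_cost:
            best_cost, best = cost, perm
    return list(best), best_cost


# Hypothetical features: (height, width, four histogram bin counts).
tracks = [(40, 20, 1, 2, 3, 4), (80, 30, 4, 3, 2, 1)]
detections = [(82, 31, 4, 3, 2, 1), (41, 19, 1, 2, 3, 4)]
print(associate(tracks, detections))
```

Here the first track is matched to the second detection and vice versa, because that pairing gives the smallest summed feature distance.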
Thin-Slicing Network: A Deep Structured Model for Pose Estimation in Videos
Deep ConvNets have been shown to be effective for the task of human pose
estimation from single images. However, several challenging issues arise in the
video-based case such as self-occlusion, motion blur, and uncommon poses with
few or no examples in training data sets. Temporal information can provide
additional cues about the location of body joints and help to alleviate these
issues. In this paper, we propose a deep structured model to estimate a
sequence of human poses in unconstrained videos. This model can be efficiently
trained in an end-to-end manner and is capable of representing appearance of
body joints and their spatio-temporal relationships simultaneously. Domain
knowledge about the human body is explicitly incorporated into the network
providing effective priors to regularize the skeletal structure and to enforce
temporal consistency. The proposed end-to-end architecture is evaluated on two
widely used benchmarks (Penn Action dataset and JHMDB dataset) for video-based
pose estimation. Our approach significantly outperforms the existing
state-of-the-art methods.
Comment: Preliminary version to appear in CVPR201
Real-Time Action Detection in Video Surveillance using Sub-Action Descriptor with Multi-CNN
When we say a person is texting, can you tell whether the person is walking or
sitting? Emphatically, no. In order to solve this incomplete representation
problem, this paper presents a sub-action descriptor for detailed action
detection. The sub-action descriptor consists of three levels: the posture, the
locomotion, and the gesture level. The three levels give three sub-action
categories for one action to address the representation problem. The proposed
action detection model simultaneously localizes and recognizes the actions of
multiple individuals in video surveillance using appearance-based temporal
features with multi-CNN. The proposed approach achieved a mean average
precision (mAP) of 76.6% for the frame-based and 83.5% for the video-based
measurement on the new large-scale ICVL video surveillance dataset that the
authors introduce and make available to the community with this paper.
Extensive experiments on the benchmark KTH dataset demonstrate that the
proposed approach achieved better performance, which in turn boosts the action
recognition performance over the state-of-the-art. The action detection model
can run at around 25 fps on the ICVL and more than 80 fps on the KTH dataset,
which is suitable for real-time surveillance applications.
Comment: 29 pages, 16 figure
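The three-level descriptor in this abstract can be pictured as a small record type. The sketch below is only an illustration of the idea: the class name, field values, and label format are hypothetical and are not taken from the paper or the ICVL dataset.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SubActionDescriptor:
    """One action described at three levels (category names are assumed)."""
    posture: str      # e.g. "sitting" or "standing"
    locomotion: str   # e.g. "stationary" or "walking"
    gesture: str      # e.g. "texting" or "nothing"

    def label(self) -> str:
        # Join the three levels into a single readable label.
        return f"{self.posture}/{self.locomotion}/{self.gesture}"


# "Texting" alone is ambiguous; the full descriptor separates the two cases.
texting_walking = SubActionDescriptor("standing", "walking", "texting")
texting_sitting = SubActionDescriptor("sitting", "stationary", "texting")
print(texting_walking.label())
print(texting_sitting.label())
```

Two people with the same gesture then receive different descriptors whenever their posture or locomotion differs, which is the incompleteness the abstract sets out to address.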
Human Action Recognition and Prediction: A Survey
Driven by rapid advances in computer vision and machine learning, video
analysis tasks have been moving from inferring the present state to predicting
the future state. Vision-based action recognition and prediction from videos
are such tasks, where action recognition is to infer human actions (present
state) based upon complete action executions, and action prediction to predict
human actions (future state) based upon incomplete action executions. These two
tasks have become particularly prevalent topics recently because of their
rapidly emerging real-world applications, such as visual surveillance,
autonomous driving, entertainment, and video retrieval. Many
efforts have been devoted over the last few decades to building a robust
and effective framework for action recognition and prediction. In this paper,
we survey the complete state-of-the-art techniques in action recognition
and prediction. Existing models, popular algorithms, technical difficulties,
popular action databases, evaluation protocols, and promising future directions
are also provided with systematic discussions.
cvpaper.challenge in 2016: Futuristic Computer Vision through 1,600 Papers Survey
The paper presents futuristic challenges discussed in cvpaper.challenge. In
2015 and 2016, we thoroughly studied 1,600+ papers from several
conferences/journals such as CVPR/ICCV/ECCV/NIPS/PAMI/IJCV.
A Survey on Content-Aware Video Analysis for Sports
Sports data analysis is becoming increasingly large-scale, diversified, and
shared, but difficulty persists in rapidly accessing the most crucial
information. Previous surveys have focused on the methodologies of sports video
analysis from the spatiotemporal viewpoint instead of a content-based
viewpoint, and few of these studies have considered semantics. This study
develops a deeper interpretation of content-aware sports video analysis by
examining the insight offered by research into the structure of content under
different scenarios. On the basis of this insight, we provide an overview of
the themes particularly relevant to the research on content-aware systems for
broadcast sports. Specifically, we focus on the video content analysis
techniques applied in sportscasts over the past decade from the perspectives of
fundamentals and general review, a content hierarchical model, and trends and
challenges. Content-aware analysis methods are discussed with respect to
object-, event-, and context-oriented groups. In each group, the gap between
sensation and content excitement must be bridged using proper strategies. In
this regard, a content-aware approach is required to determine user demands.
Finally, the paper summarizes the future trends and challenges for sports video
analysis. We believe that our findings can advance the field of research on
content-aware video analysis for broadcast sports.
Comment: Accepted for publication in IEEE Transactions on Circuits and Systems
for Video Technology (TCSVT
A Hierarchical Deep Temporal Model for Group Activity Recognition
In group activity recognition, the temporal dynamics of the whole activity
can be inferred based on the dynamics of the individual people representing the
activity. We build a deep model to capture these dynamics based on LSTM
(long short-term memory) models. To make use of these observations, we
present a 2-stage deep temporal model for the group activity recognition
problem. In our model, an LSTM model is designed to represent action dynamics of
individual people in a sequence and another LSTM model is designed to
aggregate human-level information for whole activity understanding. We evaluate
our model over two datasets: the collective activity dataset and a new volleyball
dataset. Experimental results demonstrate that our proposed model improves
group activity recognition performance compared to baseline methods.
Comment: cs.cv Accepted to CVPR 201
A survey on trajectory clustering analysis
This paper comprehensively surveys the development of trajectory clustering.
Considering the critical role of trajectory data mining in modern intelligent
systems for surveillance security, abnormal behavior detection, crowd behavior
analysis, and traffic control, trajectory clustering has attracted growing
attention. Existing trajectory clustering methods can be grouped into three
categories: unsupervised, supervised and semi-supervised algorithms. In spite
of achieving a certain level of development, trajectory clustering is limited
in its success by complex conditions such as application scenarios and data
dimensions. This paper provides a holistic understanding and deep insight into
trajectory clustering, and presents a comprehensive analysis of representative
methods and promising future directions.
H+O: Unified Egocentric Recognition of 3D Hand-Object Poses and Interactions
We present a unified framework for understanding 3D hand and object
interactions in raw image sequences from egocentric RGB cameras. Given a single
RGB image, our model jointly estimates the 3D hand and object poses, models
their interactions, and recognizes the object and action classes with a single
feed-forward pass through a neural network. We propose a single architecture
that does not rely on external detection algorithms but rather is trained
end-to-end on single images. We further merge and propagate information in the
temporal domain to infer interactions between hand and object trajectories and
recognize actions. The complete model takes as input a sequence of frames and
outputs per-frame 3D hand and object pose predictions along with the estimates
of object and action categories for the entire sequence. We demonstrate
state-of-the-art performance of our algorithm even in comparison to the
approaches that work on depth data and ground-truth annotations.
Comment: CVPR 2019 (Oral
An End-to-end 3D Convolutional Neural Network for Action Detection and Segmentation in Videos
In this paper, we propose an end-to-end 3D CNN for action detection and
segmentation in videos. The proposed architecture is a unified deep network
that is able to recognize and localize action based on 3D convolution features.
A video is first divided into equal-length clips, and then for each clip a set
of tube proposals is generated based on 3D CNN features. Finally, the tube
proposals of different clips are linked together and spatio-temporal action
detection is performed using these linked video proposals. This top-down action
detection approach explicitly relies on a set of good tube proposals to perform
well and training the bounding box regression usually requires a large number
of annotated samples. To remedy this, we further extend the 3D CNN to an
encoder-decoder structure and formulate the localization problem as action
segmentation. The foreground regions (i.e., action regions) for each frame are
segmented first, and then the segmented foreground maps are used to generate the
bounding boxes. This bottom-up approach effectively avoids tube proposal
generation by leveraging the pixel-wise annotations of segmentation. The
segmentation framework also can be readily applied to a general problem of
video object segmentation. Extensive experiments on several video datasets
demonstrate the superior performance of our approach for action detection and
video object segmentation compared to the state of the art.
Comment: arXiv admin note: substantial text overlap with arXiv:1703.1066
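The bottom-up step in this abstract, going from a segmented foreground map to a bounding box, can be sketched in a few lines. This is a simplified illustration under stated assumptions: the mask is a list of rows of 0/1 values, all foreground pixels are treated as one region, and the function name `mask_to_bbox` is hypothetical. The paper's own procedure may differ, for example by handling several actors separately.

```python
def mask_to_bbox(mask):
    """Return (x_min, y_min, x_max, y_max) enclosing all nonzero pixels.

    Returns None when the mask contains no foreground pixels.
    """
    # Row indices that contain at least one foreground pixel.
    rows = [y for y, row in enumerate(mask) if any(row)]
    # Column indices of every foreground pixel.
    cols = [x for row in mask for x, value in enumerate(row) if value]
    if not rows:
        return None
    return (min(cols), min(rows), max(cols), max(rows))


# A 4x4 foreground map with a small L-shaped action region.
frame_mask = [
    [0, 0, 0, 0],
    [0, 1, 1, 0],
    [0, 0, 1, 0],
    [0, 0, 0, 0],
]
print(mask_to_bbox(frame_mask))
```

Because the box is derived directly from pixel labels, no box proposals or box regression are needed for this step, which matches the motivation given in the abstract.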