285 research outputs found
Search Tracker: Human-derived object tracking in-the-wild through large-scale search and retrieval
Humans use context and scene knowledge to easily localize moving objects in
conditions of complex illumination changes, scene clutter and occlusions. In
this paper, we present a method to leverage human knowledge in the form of
annotated video libraries in a novel search and retrieval based setting to
track objects in unseen video sequences. For every video sequence, a document
that represents motion information is generated. Documents of the unseen video
are queried against the library at multiple scales to find videos with similar
motion characteristics. This provides us with coarse localization of objects in
the unseen video. We further adapt these retrieved object locations to the new
video using an efficient warping scheme. The proposed method is validated on
in-the-wild video surveillance datasets where we outperform state-of-the-art
appearance-based trackers. We also introduce a new challenging dataset with
complex object appearance changes.
Comment: Under review with the IEEE Transactions on Circuits and Systems for Video Technology
Aggregation signature for small object tracking
Small object tracking is becoming an increasingly important task that has
nevertheless been largely unexplored in computer vision. The great challenges
stem from two facts: 1) small objects show extremely vague and variable
appearances, and 2) they are lost more easily than normal-sized objects due to
lens shake. In this paper, we propose a novel aggregation signature
suitable for small object tracking, especially aiming for the challenge of
sudden and large drift. We make three-fold contributions in this work. First,
technically, we propose a new descriptor, named aggregation signature, based on
saliency, which is able to represent highly distinctive features for small objects.
Second, theoretically, we prove that the proposed signature matches the
foreground object more accurately with a high probability. Third,
experimentally, the aggregation signature achieves a high performance on
multiple datasets, outperforming the state-of-the-art methods by large margins.
Moreover, we contribute with two newly collected benchmark datasets, i.e.,
small90 and small112, for visually small object tracking. The datasets will be
available at https://github.com/bczhangbczhang/.
Comment: IEEE Transactions on Image Processing, 201
Bio-Mimetic Models for Moving Object Detection and Tracking
Doctoral dissertation (Ph.D.), Seoul National University Graduate School, Department of Electrical and Computer Engineering, February 2014. Advisor: 최진영.
In this thesis, we propose bio-mimetic models for motion detection and visual tracking to overcome the limitations of existing methods in real environments. The models are inspired by the theory that human visual perception relies on four different forms of visual memory when representing a scene: visible persistence, informational persistence, visual short-term memory (VSTM), and visual long-term memory (VLTM). We view our problem as one of modeling and representing an observed scene with temporary short-term models (TSTM) and conservative long-term models (CLTM). We study how to build efficient and effective models for the TSTM and CLTM, and how to use them together to obtain robust detection and tracking results under occlusions, clumsy initializations, background clutter, drifting, and non-rigid deformations encountered in real environments.
First, we propose an efficient representation of the TSTM for moving object detection on non-stationary cameras, which runs within 5.8 milliseconds (ms) on a PC and in real-time on mobile devices. To achieve real-time capability with robust performance, our method models the background through the proposed dual-mode kernel model (DMKM) and compensates for the motion of the camera by mixing neighboring models. Modeling through the DMKM prevents the background model from being contaminated by foreground pixels while still allowing the model to adapt to changes of the background. Mixing neighboring models reduces the errors arising from motion compensation, and their influence is further reduced by keeping the age of each model. Also, to decrease the computational load, the proposed method applies one DMKM to multiple pixels without performance degradation. Experimental results show the computational lightness and real-time capability of our method on a smartphone with robust detection performance.
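The dual-mode idea above can be sketched in a few lines: keep two per-pixel modes, update only the mode that matches the observation, and weight the update by the model's age. The class below is a simplified, hypothetical illustration; the names, threshold, and promotion rule are assumptions, not the thesis's actual DMKM, which also models variances and compensates camera motion.

```python
class DualModeKernel:
    """Simplified sketch of a dual-mode per-pixel background model.

    Keeps an 'apparent' mode (trusted background) and a 'candidate' mode;
    only the matched mode is updated, which keeps foreground pixels from
    contaminating the background estimate. All thresholds are illustrative.
    """

    def __init__(self, init_value, match_thresh=20.0):
        self.mean = [float(init_value), float(init_value)]  # [apparent, candidate]
        self.age = [1.0, 0.0]
        self.match_thresh = match_thresh

    def update(self, obs):
        """Update with a new pixel intensity; return True if foreground."""
        # Try to match the apparent mode first, then the candidate.
        for k in (0, 1):
            if abs(obs - self.mean[k]) < self.match_thresh:
                self.age[k] += 1.0
                alpha = 1.0 / self.age[k]          # age-weighted learning rate
                self.mean[k] += alpha * (obs - self.mean[k])
                # Promote the candidate once it has become more reliable.
                if k == 1 and self.age[1] > self.age[0]:
                    self.mean.reverse(); self.age.reverse()
                return k == 1  # matching only the candidate => foreground
        # No match at all: restart the candidate mode on this observation.
        self.mean[1], self.age[1] = float(obs), 1.0
        return True
```

Feeding a stream of intensities, stable values are absorbed into the apparent mode while sudden changes are flagged as foreground until they persist long enough to be promoted.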
Second, by using concepts from both the TSTM and CLTM, a new visual tracking method using a novel tri-model is proposed. The proposed method aims to solve the problems of occlusions, background clutter, and drifting simultaneously with the new tri-model. The proposed tri-model is composed of three models that learn the target object, the background, and other non-target moving objects online, respectively. The proposed scheme performs tracking by finding the best explanation of the scene with the three learned models. By utilizing the information in the background and foreground models as well as the target object model, our method obtains robust results under occlusions and background clutter. Also, the target object model is updated in a conservative way to prevent drifting. Furthermore, our method is not restricted to bounding boxes when representing the target object and is able to give pixel-wise tracking results.
Third, we go beyond pixel-wise modeling and propose a local-feature-based tracking model using both the TSTM and CLTM to track objects under uncertain initializations and severe occlusions. To track objects accurately in such situations, the proposed scheme uses the ``motion saliency'' and ``descriptor saliency'' of local features and performs tracking based on the generalized Hough transform (GHT). The proposed motion saliency of a local feature utilizes the instantaneous velocity of features to form the TSTM and emphasizes features whose motions are distinctive from those of local features that do not belong to the object. The descriptor saliency models local features as the CLTM and emphasizes features that are likely to belong to the object in terms of their feature descriptors. Through these saliencies, the proposed method tries to ``learn and find'' the target object rather than look for what was given at initialization, becoming robust to initialization problems. Also, our tracking result is obtained by combining the results of each local feature of the target and the surroundings, thus being robust against severe occlusions as well. The proposed method is compared against eight other methods on nine image sequences with a hundred random initializations. The experimental results show that our method outperforms all the compared methods.
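The GHT-based voting that the third method builds on can be sketched as follows: each local feature casts a vote for the object centre at its position plus an offset stored at initialization, and the votes are weighted (in the thesis, by motion and descriptor saliency). This is an illustrative sketch with assumed names and a dense accumulator, not the thesis's implementation:

```python
import numpy as np

def ght_vote(feature_positions, offsets, weights, shape):
    """Minimal Generalized-Hough-style centre voting (illustrative).

    Each feature votes for the object centre at (position + stored offset)
    with its weight; the centre estimate is the accumulator maximum.
    Saliency scores would enter here as the per-feature `weights`.
    """
    acc = np.zeros(shape)
    for (x, y), (dx, dy), w in zip(feature_positions, offsets, weights):
        cx, cy = x + dx, y + dy
        if 0 <= cx < shape[0] and 0 <= cy < shape[1]:
            acc[cx, cy] += w
    return np.unravel_index(np.argmax(acc), acc.shape)
```

Because the estimate is an accumulator peak, a few features with wrong offsets (e.g. from the background) are outvoted by the consistent majority, which is what makes this style of tracking robust to partial occlusion.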
Fourth and last, we focus on building a robust CLTM with local patches and their neighboring structures. The proposed method is based on sequential Bayesian inference and focuses on solving both the problem of tracking under partial occlusions and the problem of non-rigid object tracking in real-time on desktop personal computers (PCs). The proposed scheme is mainly composed of two parts: (1) modeling the target object with an elastic structure of local patches for robust performance, and (2) an efficient hierarchical diffusion method that performs the tracking process in real-time. The elastic structure of local patches allows the proposed scheme to handle partial occlusions and non-rigid deformations through the relationships among neighboring patches. The proposed hierarchical diffusion generates samples from the region where the posterior is concentrated, reducing computation time. The method is extensively tested on a number of challenging image sequences with occlusion and non-rigid deformation. The experimental results show the real-time capability and the robustness of the proposed scheme under various situations.
1 Introduction
1.1 Background and Research Issues
1.1.1 Issues in Motion Detection
1.1.2 Issues in Object Tracking
1.2 The Human Visual Memory
1.2.1 Sensory Memory
1.2.2 Visual Short-Term Memory
1.2.3 Visual Long-Term Memory
1.3 Bio-mimetic Framework for Detection and Tracking
1.4 Contents of the Research
2 Detection by Pixel-wise Dual-Mode Kernel Model
2.1 Proposed Method
2.1.1 Approximated Gaussian Kernel Model
2.1.2 Dual-Mode Kernel Model (DMKM)
2.1.3 Motion Compensation by Mixing Models
2.1.4 Detection of Foreground Pixels
2.2 Experimental Results
2.2.1 Runtime Comparison
2.2.2 Qualitative Comparison
2.2.3 Quantitative Comparison
2.2.4 Effects of Dual-Mode Kernel Model
2.2.5 Effects of Motion Compensation
2.2.6 Mobile Results
2.3 Remarks and Discussion
3 Tracking by Pixel-wise Tri-Model Representation
3.1 Tri-Model Framework
3.1.1 Overall Scheme
3.1.2 Advantages
3.1.3 Practical Approximation
3.2 Tracking with the Tri-Model
3.2.1 Likelihood of the Tri-Model
3.2.2 Likelihood Maximization
3.2.3 Estimating Pixel-Wise Labels
3.3 Learning the Tri-Model
3.3.1 Target Model
3.3.2 Background Model
3.3.3 Foreground Model
3.4 Experimental Results
3.4.1 Experimental Settings
3.4.2 Tracking Accuracy: Bounding Box
3.4.3 Tracking Accuracy: Pixel-Wise
3.5 Remarks and Discussion
4 Tracking by Feature-point-wise Saliency Model
4.1 Proposed Method
4.1.1 Tracking based on GHT
4.1.2 Descriptor Saliency and Feature DB Update
4.1.3 Motion Saliency
4.2 Experimental Results
4.2.1 Tracking with Inaccurate Initializations
4.2.2 Tracking Under Occlusions
4.3 Remarks and Discussion
5 Tracking by Patch-wise Elastic Structure Model
5.1 Tracking with Elastic Structure of Local Patches
5.1.1 Sequential Bayesian Inference Framework
5.1.2 Elastic Structure of Local Patches
5.1.3 Modeling a Single Patch
5.1.4 Modeling the Relationship between Patches
5.1.5 Model Update
5.1.6 Hierarchical Diffusion
5.1.7 Summary of the Proposed Method
5.2 Experiments
5.2.1 Parameter Effects
5.2.2 Performance Evaluation
5.2.3 Discussion on Translation, Rotation, Illumination Changes
5.2.4 Discussion on Partial Occlusions
5.2.5 Discussion on Non-Rigid Deformations
5.2.6 Discussion on Additional Cases
5.2.7 Summary of Tracking Results
5.2.8 Effectiveness of Hierarchical Diffusion
5.2.9 Limitations
5.3 Remarks and Discussion
6 Concluding Remarks and Future Works
Bibliography
Abstract in Korean
Registration of 3D Point Clouds and Meshes: A Survey From Rigid to Non-Rigid
Three-dimensional surface registration transforms multiple three-dimensional data sets into the same coordinate system so as to align overlapping components of these sets. Recent surveys have covered different aspects of either rigid or nonrigid registration, but seldom discuss them as a whole. Our study serves two purposes: 1) to give a comprehensive survey of both types of registration, focusing on three-dimensional point clouds and meshes, and 2) to provide a better understanding of registration from the perspective of data fitting. Registration is closely related to data fitting in that it comprises three core interwoven components: model selection, correspondences and constraints, and optimization. Study of these components 1) provides a basis for comparing the novelties of different techniques, 2) reveals the similarity of rigid and nonrigid registration in terms of problem representations, and 3) shows how overfitting arises in nonrigid registration and why interest in intrinsic techniques is increasing. We further summarize some practical issues of registration, including initializations and evaluations, and discuss some of our own observations, insights, and foreseeable research trends.
Visual Tracking in Robotic Minimally Invasive Surgery
Intra-operative imaging and robotics are some of the technologies driving forward better and more effective minimally invasive surgical procedures. To advance surgical practice and capabilities further, one of the key requirements for computationally enhanced interventions is to know how instruments and tissues move during the operation. While endoscopic video captures motion, the complex appearance and dynamic effects of surgical scenes are challenging for computer vision algorithms to handle robustly. Tackling both tissue and instrument motion estimation, this thesis proposes a combined non-rigid surface deformation estimation method to track tissue surfaces robustly, including under poor illumination. For instrument tracking, a keypoint-based 2D tracker that relies on the Generalized Hough Transform is developed to initialize a 3D tracker, in order to robustly track surgical instruments through long sequences that contain complex motions. To handle appearance changes and occlusion, a patch-based adaptive weighting framework with segmentation and scale tracking is developed. It takes a tracking-by-detection approach, and a segmentation model is used to assign weights to template patches in order to suppress background information. The performance of the method is thoroughly evaluated, showing that without any offline training the tracker works well even in complex environments. Finally, the thesis proposes a novel 2D articulated instrument pose estimation framework, which includes a detection-regression fully convolutional network and a multiple instrument parsing component. The framework achieves compelling performance and illustrates interesting properties, including transfer between different instrument types and between ex vivo and in vivo data. In summary, the thesis advances the state-of-the-art in visual tracking for surgical applications for both tissue and instrument motion estimation.
It contributes to developing the technological capability of full surgical scene understanding from endoscopic video.
Spatiotemporal visual analysis of human actions
In this dissertation we propose four methods for the recognition of human activities. In all four of
them, the representation of the activities is based on spatiotemporal features that are automatically
detected at areas where there is a significant amount of independent motion, that is, motion that is
due to ongoing activities in the scene. We propose the use of spatiotemporal salient points as features
throughout this dissertation. The algorithms presented, however, can be used with any kind of features,
as long as the latter are well localized and have a well-defined area of support in space and time. We
introduce the utilized spatiotemporal salient points in the first method presented in this dissertation.
By extending previous work on spatial saliency, we measure the variations in the information content of
pixel neighborhoods both in space and time, and detect the points at the locations and scales for which
this information content is locally maximized. In this way, an activity is represented as a collection of
spatiotemporal salient points. We propose an iterative linear space-time warping technique in order
to align the representations in space and time and propose to use Relevance Vector Machines (RVM)
in order to classify each example into an action category. In the second method proposed in this
dissertation we propose to enhance the acquired representations of the first method. More specifically,
we propose to track each detected point in time, and create representations based on sets of trajectories,
where each trajectory expresses how the information engulfed by each salient point evolves over time.
In order to deal with imperfect localization of the detected points, we augment the observation model
of the tracker with background information, acquired using a fully automatic background estimation
algorithm. In this way, the tracker favors solutions that contain a large number of foreground pixels.
In addition, we perform experiments where the tracked templates are localized on specific parts of the
body, like the hands and the head, and we further augment the tracker’s observation model using a
human skin color model. Finally, we use a variant of the Longest Common Subsequence algorithm
(LCSS) in order to acquire a similarity measure between the resulting trajectory representations, and
RVMs for classification. In the third method that we propose, we assume that neighboring salient
points follow a similar motion. This is in contrast to the previous method, where each salient point was
tracked independently of its neighbors. More specifically, we propose to extract a novel set of visual
descriptors that are based on geometrical properties of three-dimensional piece-wise polynomials. The
latter are fitted on the spatiotemporal locations of salient points that fall within local spatiotemporal
neighborhoods, and are assumed to follow a similar motion. The extracted descriptors are invariant in
translation and scaling in space-time. Coupling the neighborhood dimensions to the scale at which the
corresponding spatiotemporal salient points are detected ensures the latter. The descriptors that are
extracted across the whole dataset are subsequently clustered in order to create a codebook, which is
used in order to represent the overall motion of the subjects within small temporal windows. Finally, we use boosting in order to select the most discriminative of these windows for each class, and RVMs for
classification. The fourth and last method addresses the joint problem of localization and recognition
of human activities depicted in unsegmented image sequences. Its main contribution is the use of an
implicit representation of the spatiotemporal shape of the activity, which relies on the spatiotemporal
localization of characteristic ensembles of spatiotemporal features. The latter are localized around
automatically detected salient points. Evidence for the spatiotemporal localization of the activity
is accumulated in a probabilistic spatiotemporal voting scheme. During training, we use boosting in
order to create codebooks of characteristic feature ensembles for each class. Subsequently, we construct
class-specific spatiotemporal models, which encode where in space and time each codeword ensemble
appears in the training set. During testing, each activated codeword ensemble casts probabilistic
votes concerning the spatiotemporal localization of the activity, according to the information stored
during training. We use a Mean Shift Mode estimation algorithm in order to extract the most probable
hypotheses from each resulting voting space. Each hypothesis corresponds to a spatiotemporal volume
which potentially engulfs the activity, and is verified by performing action category classification with
an RVM classifier.
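The LCSS variant used above for comparing trajectory representations can be sketched as a standard dynamic program: two trajectory samples match when they are close in space and their time indices lie within a temporal window. The function below is a common textbook form of LCSS; the parameter values and names are illustrative assumptions, not the dissertation's exact variant.

```python
def lcss(t1, t2, eps=1.0, delta=2):
    """Longest Common Subsequence similarity between two 2-D trajectories.

    Two samples match when they are within `eps` in each spatial coordinate
    and their time indices differ by at most `delta`; the LCSS length,
    normalized by the shorter trajectory, gives a similarity in [0, 1].
    """
    n, m = len(t1), len(t2)
    L = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            (x1, y1), (x2, y2) = t1[i - 1], t2[j - 1]
            close = abs(x1 - x2) <= eps and abs(y1 - y2) <= eps
            if close and abs(i - j) <= delta:
                L[i][j] = L[i - 1][j - 1] + 1
            else:
                L[i][j] = max(L[i - 1][j], L[i][j - 1])
    return L[n][m] / min(n, m)
```

Unlike Euclidean or DTW distances, LCSS simply skips unmatched samples, which is why it tolerates the noisy or missing trajectory segments that imperfect point tracking produces.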
Short-Term Visual Object Tracking in Real-Time
In this thesis, we propose two novel short-term object tracking methods, the
Flock of Trackers (FoT) and the Scale-Adaptive Mean-Shift (ASMS), a framework
for the fusion of multiple trackers and a detector, and contributions to the
problem of tracker evaluation within the Visual Object Tracking (VOT) initiative.
The Flock of Trackers partitions the object of interest into equally sized parts.
For each part, the FoT computes an optical flow correspondence and estimates its
reliability. Reliable correspondences are used to robustly estimate the target
pose with the RANSAC technique, which allows for a range of complex rigid
transformations (e.g. affine transformations) of the target. The scale-adaptive
mean-shift tracker is a gradient optimization method that iteratively moves a
search window to the position that minimizes the distance between an appearance
model extracted from the search window and the target model. The ASMS proposes
a theoretically justified modification of the mean-shift framework that
addresses one of the drawbacks of mean-shift trackers, namely the fixed-size
search window, i.e. fixed target scale. Moreover, the ASMS introduces a
technique that incorporates background information into the gradient
optimization to reduce tracker failures in the presence of background clutter.
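The core iteration that ASMS modifies can be sketched as follows: given a per-pixel map of target likelihoods (obtained in practice by histogram back-projection), the search window repeatedly moves to the weighted centroid of the pixels it covers. This minimal sketch deliberately omits ASMS's actual contributions, scale adaptation and background weighting, and all names are assumptions:

```python
import numpy as np

def mean_shift(weight_map, center, half_size, iters=20):
    """Plain mean-shift window localization (illustrative sketch).

    `weight_map` holds per-pixel likelihoods of belonging to the target.
    The square window of radius `half_size` moves to the weighted centroid
    of the pixels it covers until it stops moving. ASMS additionally adapts
    `half_size` and down-weights background-colored pixels.
    """
    cy, cx = center
    h, w = weight_map.shape
    for _ in range(iters):
        y0, y1 = max(0, cy - half_size), min(h, cy + half_size + 1)
        x0, x1 = max(0, cx - half_size), min(w, cx + half_size + 1)
        win = weight_map[y0:y1, x0:x1]
        total = win.sum()
        if total == 0:
            break  # no target evidence under the window
        ys, xs = np.mgrid[y0:y1, x0:x1]
        ny = int(round((ys * win).sum() / total))
        nx = int(round((xs * win).sum() / total))
        if (ny, nx) == (cy, cx):
            break  # converged
        cy, cx = ny, nx
    return cy, cx
```

The fixed `half_size` here is exactly the drawback the thesis addresses: when the target grows or shrinks, a fixed window either clips it or dilutes the centroid with background.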
To take advantage of the strengths of the previous methods, we introduce a novel
tracking framework, HMMTxD, that fuses multiple tracking methods together with a
proposed feature-based online detector. The framework utilizes a hidden Markov
model (HMM) to learn online how well each tracking method performs, using
sparsely "annotated" data provided by the detector, which are assumed to be
correct, and the confidence provided by the trackers. The HMM estimates the
probability that a tracker is correct in the current frame given the previously
learned HMM model and the current tracker confidence. This tracker fusion
alleviates the drawbacks of the individual tracking methods, since the HMMTxD
learns which trackers are performing well and switches off the rest.
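The per-frame belief update behind such an HMM fusion can be sketched with two states, "tracker correct" and "tracker failed": predict through a transition matrix, then reweight by an emission likelihood derived from the tracker's confidence. The transition values and the use of confidence directly as an emission likelihood are illustrative assumptions, not the learned parameters of HMMTxD:

```python
def hmm_correctness_update(belief, confidence, trans=((0.9, 0.1), (0.2, 0.8))):
    """One forward step of a two-state HMM over tracker correctness.

    `belief` is P(tracker correct) from the previous frame; `confidence`
    in [0, 1] serves as the emission likelihood of the observation under
    the 'correct' state (and 1 - confidence under 'failed'). `trans[i][j]`
    is the probability of moving from state i to state j per frame.
    """
    # Predict: propagate the two-state belief through the transition matrix.
    p_correct = belief * trans[0][0] + (1 - belief) * trans[1][0]
    p_failed = belief * trans[0][1] + (1 - belief) * trans[1][1]
    # Update: weight by the emission likelihoods and renormalize.
    num = p_correct * confidence
    den = num + p_failed * (1 - confidence)
    return num / den if den > 0 else p_correct
```

Running this per tracker and selecting the tracker with the highest belief gives the "switch off the rest" behavior described above: consistently low-confidence trackers see their belief decay and stop influencing the fused output.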
All of the proposed trackers were extensively evaluated on several benchmarks
and publicly available tracking sequences, and they achieve excellent results
under various evaluation criteria. The FoT achieved state-of-the-art performance
in the VOT2013 benchmark, finishing second. Today, the FoT is used as a building
block in complex applications such as multi-object tracking frameworks. The ASMS
achieved state-of-the-art results in the VOT2015 benchmark and was chosen as the
best-performing method in terms of the trade-off between performance and running
time. The HMMTxD demonstrated state-of-the-art performance in multiple
benchmarks (VOT2014, VOT2015 and OTB).
The thesis also contributes to, and provides an overview of, the Visual Object
Tracking (VOT) evaluation methodology. This methodology provides a means for the
unbiased comparison of different tracking methods across publications, which is
crucial for advancing the state-of-the-art over a longer timespan, and it also
provides tools for deeper performance analysis of tracking methods. Furthermore,
annual workshops are organized at major computer vision conferences, where
authors are encouraged to submit their novel methods to compete against each
other and where advances in visual object tracking are discussed.