Proposal Flow: Semantic Correspondences from Object Proposals
Finding image correspondences remains a challenging problem in the presence
of intra-class variations and large changes in scene layout. Semantic flow
methods are designed to handle images depicting different instances of the same
object or scene category. We introduce a novel approach to semantic flow,
dubbed proposal flow, that establishes reliable correspondences using object
proposals. Unlike prevailing semantic flow approaches that operate on pixels or
regularly sampled local regions, proposal flow benefits from the
characteristics of modern object proposals, which exhibit high repeatability at
multiple scales, and can take advantage of both local and geometric consistency
constraints among proposals. We also show that the corresponding sparse
proposal flow can effectively be transformed into a conventional dense flow
field. We introduce two new challenging datasets that can be used to evaluate
both general semantic flow techniques and region-based approaches such as
proposal flow. We use these benchmarks to compare different matching
algorithms, object proposals, and region features within proposal flow against
the state of the art in semantic flow. This comparison, along with experiments on
standard datasets, demonstrates that proposal flow significantly outperforms
existing semantic flow methods in various settings.
Comment: arXiv admin note: text overlap with arXiv:1511.0506
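As a rough illustration of proposal-level matching with a local geometric-consistency constraint (a sketch only, not the authors' algorithm; the cosine similarity, the neighborhood size, and the 50-pixel consistency scale are assumptions):

```python
import numpy as np

def match_proposals(feat_a, feat_b, boxes_a, boxes_b, n_neighbors=5):
    """Greedy proposal matching with a local geometric-consistency bonus.

    feat_a, feat_b: (Na, D), (Nb, D) region descriptors.
    boxes_a, boxes_b: (Na, 4), (Nb, 4) proposal boxes as (x, y, w, h).
    Returns, for each proposal in image A, its best match in B and a score.
    """
    # Appearance term: cosine similarity between region descriptors.
    fa = feat_a / np.linalg.norm(feat_a, axis=1, keepdims=True)
    fb = feat_b / np.linalg.norm(feat_b, axis=1, keepdims=True)
    sim = fa @ fb.T                                    # (Na, Nb)

    centers_a = boxes_a[:, :2] + boxes_a[:, 2:] / 2.0
    centers_b = boxes_b[:, :2] + boxes_b[:, 2:] / 2.0

    best = sim.argmax(axis=1)                          # best appearance match
    offsets = centers_b[best] - centers_a              # displacement per match
    scores = sim[np.arange(len(best)), best].copy()

    # Geometric term: a match is trusted more when spatially neighboring
    # proposals in image A are displaced by a similar amount.
    for i in range(len(best)):
        dist = np.linalg.norm(centers_a - centers_a[i], axis=1)
        nbrs = np.argsort(dist)[1:n_neighbors + 1]
        disagreement = np.linalg.norm(offsets[nbrs] - offsets[i], axis=1).mean()
        scores[i] *= np.exp(-disagreement / 50.0)      # 50 px: assumed scale
    return best, scores
```

Each candidate match is kept only as strongly as its displacement agrees with those of nearby proposals, which is the intuition behind combining appearance with geometric consistency.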
A robust and efficient video representation for action recognition
This paper introduces a state-of-the-art video representation and applies it
to efficient action recognition and detection. We first propose to improve the
popular dense trajectory features by explicit camera motion estimation. More
specifically, we extract feature point matches between frames using SURF
descriptors and dense optical flow. The matches are used to estimate a
homography with RANSAC. To improve the robustness of homography estimation, a
human detector is employed to remove outlier matches from the human body as
human motion is not constrained by the camera. Trajectories consistent with the
homography are considered as due to camera motion, and thus removed. We also
use the homography to cancel out camera motion from the optical flow. This
results in significant improvement on motion-based HOF and MBH descriptors. We
further explore the recent Fisher vector as an alternative feature encoding
approach to the standard bag-of-words histogram, and consider different ways to
include spatial layout information in these encodings. We present a large and
varied set of evaluations, considering (i) classification of short basic
actions on six datasets, (ii) localization of such actions in feature-length
movies, and (iii) large-scale recognition of complex events. We find that our
improved trajectory features significantly outperform previous dense
trajectories, and that Fisher vectors are superior to bag-of-words encodings
for video recognition tasks. In all three tasks, we show substantial
improvements over the state-of-the-art results.
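A hedged OpenCV sketch of the camera-motion cancellation step described above (ORB stands in for the paper's SURF descriptors, which require OpenCV's non-free build; the Farneback flow and all thresholds are illustrative choices):

```python
import cv2
import numpy as np

def camera_compensated_flow(prev, curr, human_boxes=()):
    """Cancel camera motion from dense optical flow between two gray frames.

    human_boxes: (x, y, w, h) person detections; matches inside them are
    discarded, since human motion is not constrained by the camera.
    """
    # Sparse matches for homography estimation.
    orb = cv2.ORB_create(2000)
    k1, d1 = orb.detectAndCompute(prev, None)
    k2, d2 = orb.detectAndCompute(curr, None)
    matches = cv2.BFMatcher(cv2.NORM_HAMMING, crossCheck=True).match(d1, d2)

    def in_human(pt):
        return any(x <= pt[0] <= x + w and y <= pt[1] <= y + h
                   for (x, y, w, h) in human_boxes)

    kept = [m for m in matches if not in_human(k1[m.queryIdx].pt)]
    pts1 = np.float32([k1[m.queryIdx].pt for m in kept])
    pts2 = np.float32([k2[m.trainIdx].pt for m in kept])
    H, _ = cv2.findHomography(pts1, pts2, cv2.RANSAC, 3.0)

    # Dense flow, then subtract the displacement the homography predicts
    # for every pixel (i.e. the flow induced by camera motion alone).
    flow = cv2.calcOpticalFlowFarneback(prev, curr, None,
                                        0.5, 3, 15, 3, 5, 1.2, 0)
    h, w = prev.shape[:2]
    xs, ys = np.meshgrid(np.arange(w), np.arange(h))
    grid = np.stack([xs, ys], axis=-1).astype(np.float32)
    warped = cv2.perspectiveTransform(grid.reshape(-1, 1, 2), H).reshape(h, w, 2)
    return flow - (warped - grid)
```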
Bio-Mimetic Models for Moving Object Detection and Tracking
Thesis (Ph.D.) -- Graduate School, Seoul National University: Department of Electrical and Computer Engineering, February 2014. Choi Jin-young.
In this thesis, we propose bio-mimetic models for motion detection and visual tracking to overcome the limitations of existing methods in actual environments. The models are inspired by the theory that there are four different forms of visual memory in human visual perception when representing a scene: visible persistence, informational persistence, visual short-term memory (VSTM), and visual long-term memory (VLTM). We view our problem as one of modeling and representing an observed scene with temporary short-term models (TSTM) and conservative long-term models (CLTM). We study how to build efficient and effective models for TSTM and CLTM, and how to utilize them together to obtain robust detection and tracking results under the occlusions, clumsy initializations, background clutter, drifting, and non-rigid deformations encountered in actual environments.
First, we propose an efficient representation of TSTM for moving object detection on non-stationary cameras, which runs within 5.8 milliseconds (ms) on a PC and in real time on mobile devices. To achieve real-time capability with robust performance, our method models the background through the proposed dual-mode kernel model (DMKM) and compensates for the motion of the camera by mixing neighboring models. Modeling through DMKM prevents the background model from being contaminated by foreground pixels while still allowing the model to adapt to changes of the background. Mixing neighboring models reduces the errors arising from motion compensation, and their influence is further reduced by keeping track of the age of each model. Also, to decrease the computational load, the proposed method applies one DMKM to multiple pixels without performance degradation. Experimental results show the computational lightness and the real-time capability of our method on a smartphone, with robust detection performance.
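As a rough illustration of the dual-mode idea (a toy Python sketch, not the thesis implementation: it models grayscale pixels, omits the model-mixing motion compensation, and the variance and threshold values are assumptions):

```python
import numpy as np

class DualModeBackground:
    """Toy per-pixel dual-mode background model (grayscale frames).

    Each pixel keeps two candidate modes, each with a mean, a variance,
    and an age; only the mode closest to the observation is updated, so
    foreground pixels do not contaminate the background mode.
    """

    def __init__(self, frame, var0=400.0, thresh=3.0):
        f = frame.astype(np.float32)
        self.mean = np.stack([f, f])            # (2, H, W)
        self.var = np.full(self.mean.shape, var0, np.float32)
        self.age = np.zeros_like(self.var)
        self.thresh = thresh

    def apply(self, frame):
        f = frame.astype(np.float32)
        d2 = (f - self.mean) ** 2               # distance to each mode
        k = d2.argmin(axis=0)[None]             # closest mode per pixel
        mean_k = np.take_along_axis(self.mean, k, 0)[0]
        var_k = np.take_along_axis(self.var, k, 0)[0]
        age_k = np.take_along_axis(self.age, k, 0)[0]

        fg = (f - mean_k) ** 2 > self.thresh ** 2 * var_k   # foreground mask

        # Age-weighted update of the matched mode on background pixels;
        # old models change slowly, fresh ones adapt quickly.
        lr = 1.0 / (age_k + 1.0)
        new_mean = np.where(fg, mean_k, (1 - lr) * mean_k + lr * f)
        new_var = np.where(fg, var_k, (1 - lr) * var_k + lr * (f - mean_k) ** 2)
        np.put_along_axis(self.mean, k, new_mean[None], 0)
        np.put_along_axis(self.var, k, new_var[None], 0)
        np.put_along_axis(self.age, k, np.where(fg, age_k, age_k + 1)[None], 0)
        return fg
```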
Second, using concepts from both TSTM and CLTM, a new visual tracking method based on a novel tri-model is proposed. The proposed method aims to solve the problems of occlusions, background clutter, and drifting simultaneously with the new tri-model. The tri-model is composed of three models that learn the target object, the background, and other non-target moving objects online, respectively. The proposed scheme performs tracking by finding the best explanation of the scene in terms of the three learned models. By utilizing the information in the background and foreground models as well as the target object model, our method obtains robust results under occlusions and background clutter. Also, the target object model is updated conservatively to prevent drifting. Furthermore, our method is not restricted to bounding boxes when representing the target object and is able to give pixel-wise tracking results.
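Schematically, once per-pixel likelihoods under the three models are available, the scene explanation reduces to a per-pixel argmax (a minimal sketch; the thesis's likelihood maximization is more involved):

```python
import numpy as np

def explain_scene(lik_target, lik_background, lik_foreground):
    """Label each pixel with the model that explains it best.

    Inputs are per-pixel likelihood maps of identical shape under the
    target, background, and (non-target) foreground models.
    Returns 0 for target, 1 for background, 2 for other moving objects.
    """
    return np.stack([lik_target, lik_background, lik_foreground]).argmax(axis=0)
```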
Third, we go beyond pixel-wise modeling and propose a local-feature-based tracking model using both TSTM and CLTM to track objects under uncertain initializations and severe occlusions. To track objects accurately in such situations, the proposed scheme uses the "motion saliency" and "descriptor saliency" of local features and performs tracking based on the generalized Hough transform (GHT). The proposed motion saliency of a local feature utilizes the instantaneous velocity of features to form the TSTM and emphasizes features having distinctive motions compared to the motion of local features that do not belong to the object. The descriptor saliency models local features as the CLTM and emphasizes features that are likely to belong to the object in terms of their feature descriptors. Through these saliencies, the proposed method tries to "learn and find" the target object rather than looking only for what was given at initialization, becoming robust to initialization problems. Also, our tracking result is obtained by combining the results of each local feature of the target and the surroundings, making it robust against severe occlusions as well. The proposed method is compared against eight other methods on nine image sequences with a hundred random initializations. The experimental results show that our method outperforms all other compared methods.
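A minimal sketch of the saliency-weighted GHT voting step (illustrative Python; the Gaussian smoothing and its bandwidth are assumptions, and the computation of the two saliency terms is left out):

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def ght_localize(points, offsets, motion_sal, desc_sal, img_shape, sigma=8.0):
    """Estimate the object center as the peak of a weighted GHT vote map.

    points: (N, 2) feature positions (x, y); offsets: (N, 2) each
    feature's stored displacement to the object center; the two (N,)
    saliency arrays weight every vote.
    """
    h, w = img_shape
    votes = np.zeros((h, w), dtype=np.float32)
    weights = motion_sal * desc_sal                 # combine both saliencies
    centers = np.round(points + offsets).astype(int)
    for (cx, cy), wgt in zip(centers, weights):
        if 0 <= cy < h and 0 <= cx < w:
            votes[cy, cx] += wgt                    # each feature casts one vote
    votes = gaussian_filter(votes, sigma)           # let nearby votes reinforce
    return np.unravel_index(votes.argmax(), votes.shape)   # (row, col) peak
```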
Fourth and last, we focus on building a robust CLTM with local patches and their neighboring structures. The proposed method is based on sequential Bayesian inference and focuses on solving both the problem of tracking under partial occlusions and the problem of non-rigid object tracking in real time on desktop personal computers (PCs). The proposed scheme is mainly composed of two parts: (1) modeling the target object using an elastic structure of local patches for robust performance, and (2) an efficient hierarchical diffusion method to perform the tracking process in real time. The elastic structure of local patches allows the proposed scheme to handle partial occlusions and non-rigid deformations through the relationships among neighboring patches. The proposed hierarchical diffusion generates samples from the region where the posterior is concentrated in order to reduce computation time. The method is extensively tested on a number of challenging image sequences with occlusion and non-rigid deformation. The experimental results show the real-time capability and the robustness of the proposed scheme under various situations.
1 Introduction
1.1 Background and Research Issues
1.1.1 Issues in Motion Detection
1.1.2 Issues in Object Tracking
1.2 The Human Visual Memory
1.2.1 Sensory Memory
1.2.2 Visual Short-Term Memory
1.2.3 Visual Long-Term Memory
1.3 Bio-mimetic Framework for Detection and Tracking
1.4 Contents of the Research
2 Detection by Pixel-wise Dual-Mode Kernel Model
2.1 Proposed Method
2.1.1 Approximated Gaussian Kernel Model
2.1.2 Dual-Mode Kernel Model (DMKM)
2.1.3 Motion Compensation by Mixing Models
2.1.4 Detection of Foreground Pixels
2.2 Experimental Results
2.2.1 Runtime Comparison
2.2.2 Qualitative Comparison
2.2.3 Quantitative Comparison
2.2.4 Effects of Dual-Mode Kernel Model
2.2.5 Effects of Motion Compensation
2.2.6 Mobile Results
2.3 Remarks and Discussion
3 Tracking by Pixel-wise Tri-Model Representation
3.1 Tri-Model Framework
3.1.1 Overall Scheme
3.1.2 Advantages
3.1.3 Practical Approximation
3.2 Tracking with the Tri-Model
3.2.1 Likelihood of the Tri-Model
3.2.2 Likelihood Maximization
3.2.3 Estimating Pixel-Wise Labels
3.3 Learning the Tri-Model
3.3.1 Target Model
3.3.2 Background Model
3.3.3 Foreground Model
3.4 Experimental Results
3.4.1 Experimental Settings
3.4.2 Tracking Accuracy: Bounding Box
3.4.3 Tracking Accuracy: Pixel-Wise
3.5 Remarks and Discussion
4 Tracking by Feature-point-wise Saliency Model
4.1 Proposed Method
4.1.1 Tracking based on GHT
4.1.2 Descriptor Saliency and Feature DB Update
4.1.3 Motion Saliency
4.2 Experimental Results
4.2.1 Tracking with Inaccurate Initializations
4.2.2 Tracking Under Occlusions
4.3 Remarks and Discussion
5 Tracking by Patch-wise Elastic Structure Model
5.1 Tracking with Elastic Structure of Local Patches
5.1.1 Sequential Bayesian Inference Framework
5.1.2 Elastic Structure of Local Patches
5.1.3 Modeling a Single Patch
5.1.4 Modeling the Relationship between Patches
5.1.5 Model Update
5.1.6 Hierarchical Diffusion
5.1.7 Summary of the Proposed Method
5.2 Experiments
5.2.1 Parameter Effects
5.2.2 Performance Evaluation
5.2.3 Discussion on Translation, Rotation, Illumination Changes
5.2.4 Discussion on Partial Occlusions
5.2.5 Discussion on Non-Rigid Deformations
5.2.6 Discussion on Additional Cases
5.2.7 Summary of Tracking Results
5.2.8 Effectiveness of Hierarchical Diffusion
5.2.9 Limitations
5.3 Remarks and Discussion
6 Concluding Remarks and Future Works
Bibliography
Abstract in Korean
STEm-Seg: Spatio-temporal Embeddings for Instance Segmentation in Videos
Existing methods for instance segmentation in videos typically involve multi-stage pipelines that follow the tracking-by-detection paradigm and model a video clip as a sequence of images. Multiple networks are used to detect objects in individual frames, and then associate these detections over time. Hence, these methods are often not end-to-end trainable and highly tailored to specific tasks. In this paper, we propose a different approach that is well-suited to a variety of tasks involving instance segmentation in videos. In particular, we model a video clip as a single 3D spatio-temporal volume, and propose a novel approach that segments and tracks instances across space and time in a single stage. Our problem formulation is centered around the idea of spatio-temporal embeddings which are trained to cluster pixels belonging to a specific object instance over an entire video clip. To this end, we introduce (i) novel mixing functions that enhance the feature representation of spatio-temporal embeddings, and (ii) a single-stage, proposal-free network that can reason about temporal context. Our network is trained end-to-end to learn spatio-temporal embeddings as well as the parameters required to cluster these embeddings, thus simplifying inference. Our method achieves state-of-the-art results across multiple datasets and tasks. Code and models are available at https://github.com/sabarim/STEm-Seg.
Comment: 28 pages, 6 figures
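To make the embedding-clustering idea concrete, a minimal sketch follows (not the paper's learned clustering: the seeds and the fixed radius stand in for the network's predicted instance centers and learned bandwidths):

```python
import numpy as np

def cluster_clip_embeddings(emb, seeds, radius=0.5):
    """Assign every pixel of a clip to the nearest instance seed.

    emb: (T, H, W, D) per-pixel embeddings for the whole clip;
    seeds: (K, D) one embedding per instance. Pixels farther than
    `radius` from every seed are labeled -1 (background). Because the
    clip is clustered as one volume, each label is an instance track.
    """
    t, h, w, d = emb.shape
    flat = emb.reshape(-1, d)
    dist = np.linalg.norm(flat[:, None, :] - seeds[None, :, :], axis=-1)
    labels = dist.argmin(axis=1)
    labels[dist.min(axis=1) > radius] = -1
    return labels.reshape(t, h, w)
```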
Single camera pose estimation using Bayesian filtering and Kinect motion priors
Traditional approaches to upper body pose estimation using monocular vision
rely on complex body models and a large variety of geometric constraints. We
argue that this is not ideal and somewhat inelegant as it results in large
processing burdens, and instead attempt to incorporate these constraints
through priors obtained directly from training data. A prior distribution
covering the probability of a human pose occurring is used to incorporate
likely human poses. This distribution is obtained offline, by fitting a
Gaussian mixture model to a large dataset of recorded human body poses, tracked
using a Kinect sensor. We combine this prior information with a random walk
transition model to obtain an upper body model, suitable for use within a
recursive Bayesian filtering framework. Our model can be viewed as a mixture of
discrete Ornstein-Uhlenbeck processes, in that states behave as random walks,
but drift towards a set of typically observed poses. This model is combined
with measurements of the human head and hand positions, using recursive
Bayesian estimation to incorporate temporal information. Measurements are
obtained using face detection and a simple skin colour hand detector, trained
using the detected face. The suggested model is designed with analytical
tractability in mind and we show that the pose tracking can be
Rao-Blackwellised using the mixture Kalman filter, allowing for computational
efficiency while still incorporating bio-mechanical properties of the upper
body. In addition, the use of the proposed upper body model allows reliable
three-dimensional pose estimates to be obtained indirectly for a number of
joints that are often difficult to detect using traditional object recognition
strategies. Comparisons with Kinect sensor results and the state of the art in
2D pose estimation highlight the efficacy of the proposed approach.
Comment: 25 pages, Technical report, related to Burke and Lasenby, AMDO 2014 conference paper. Code sample: https://github.com/mgb45/SignerBodyPose Video: https://www.youtube.com/watch?v=dJMTSo7-uF
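The transition model can be sketched as follows (a toy sampling version with illustrative constants; the paper instead treats the mixture analytically via Rao-Blackwellisation with a mixture Kalman filter):

```python
import numpy as np

def pose_transition(x, gmm_means, gmm_weights, alpha=0.1, noise_std=0.05,
                    rng=None):
    """One step of the drift-plus-random-walk pose dynamics.

    x: current pose vector; gmm_means/gmm_weights: the GMM pose prior.
    The state takes a random-walk step but is pulled with rate `alpha`
    toward the most responsible mixture mode, an OU-like mean reversion.
    """
    rng = rng or np.random.default_rng()
    d2 = ((gmm_means - x) ** 2).sum(axis=1)
    resp = gmm_weights * np.exp(-0.5 * d2)   # crude per-mode responsibility
    k = resp.argmax()
    drift = alpha * (gmm_means[k] - x)       # revert toward a typical pose
    return x + drift + rng.normal(0.0, noise_std, size=x.shape)
```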
Registration and categorization of camera captured documents
Camera-captured document image analysis concerns the processing of documents captured with hand-held sensors, smart phones, or other capturing devices using advanced image processing, computer vision, pattern recognition, and machine learning techniques. As there is no constrained capturing in the real world, the captured documents suffer from illumination variation, viewpoint variation, highly variable scale/resolution, background clutter, occlusion, and non-rigid deformations, e.g., folds and crumples. Document registration is the problem of registering the image of a template document, whose layout is known, with a test document image. The literature on camera-captured document mosaicing has addressed the registration of captured documents under the assumption of a considerable amount of single-chunk overlapping content. These methods cannot be directly applied to the registration of forms, bills, and other commercial documents where the fixed content is distributed in tiny portions across the document. On the other hand, most existing document image registration methods work with scanned documents under affine transformation. The literature on document image retrieval has addressed categorization of documents based on text, figures, etc.
However, the scalability of existing document categorization methodologies based on logo identification is very limited. This dissertation focuses on two problems: (i) registration of captured documents where the overlapping content is distributed in tiny portions across the documents, and (ii) categorization of captured documents into predefined logo classes that scales to large datasets using local invariant features. A novel methodology is proposed for the registration of user-defined Regions Of Interest (ROI) using corresponding local features from their neighborhood. The methodology enhances prior approaches to point-pattern-based registration, like RANdom SAmple Consensus (RANSAC) and Thin Plate Spline-Robust Point Matching (TPS-RPM), to enable registration of cell-phone and camera captured documents under non-rigid transformations. Three novel aspects are embedded into the methodology: (i) histogram-based uniformly transformed correspondence estimation, (ii) clustering of points located near the ROI to select only nearby regions for matching, and (iii) validation of the registration in the RANSAC and TPS-RPM algorithms. Experimental results on a dataset of 480 images captured using an iPhone 3GS and a Logitech webcam Pro 9000 have shown an average registration accuracy of 92.75% using the Scale Invariant Feature Transform (SIFT).
Robust local features for logo identification are determined empirically by comparisons among SIFT, Speeded-Up Robust Features (SURF), Hessian-Affine, Harris-Affine, and Maximally Stable Extremal Regions (MSER). Two different matching methods are presented for categorization: matching all features extracted from the query document as a single set, and segment-wise matching of query document features using a segmentation achieved by grouping the area under intersecting dense local affine covariant regions. The latter approach not only gives an approximate location of the predicted logo classes in the query document but also helps to increase the prediction accuracy. In order to facilitate scalability to large datasets, inverted indexing of logo class features has been incorporated in both approaches. Experimental results on a dataset of real camera-captured documents have shown a peak 13.25% increase in F-measure accuracy using the latter approach as compared to the former.
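A minimal sketch of the inverted-indexing idea (hypothetical Python; the word ids are assumed to come from some vector quantizer over the local descriptors, e.g. k-means on SIFT):

```python
from collections import defaultdict

class LogoIndex:
    """Inverted index from visual words to logo classes.

    `words` are quantized local descriptors (e.g. k-means cluster ids of
    SIFT features); querying touches only the posting lists of the query
    words rather than every logo class, which is what makes the
    approach scale to large datasets."""

    def __init__(self):
        self.posting = defaultdict(set)      # word id -> logo classes using it

    def add(self, logo_class, words):
        for w in words:
            self.posting[w].add(logo_class)

    def query(self, words, top_k=3):
        votes = defaultdict(int)
        for w in words:                      # each query word votes for every
            for cls in self.posting[w]:      # class whose model contains it
                votes[cls] += 1
        return sorted(votes.items(), key=lambda kv: -kv[1])[:top_k]
```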
Automating Bridge Inspection Procedures: Real-Time UAS-Based Detection and Tracking of Concrete Bridge Element
Bridge inspections are necessary to maintain the safety, health, and welfare of the public. All bridges in the United States are federally mandated to undergo routine evaluations to confirm their structural integrity throughout their lifetime. The traditional process relies on a bridge inspection team to conduct the inspection, depending heavily on visual measurements and subjective estimates of the existing state of the structure. Conducting unmanned, automated bridge inspections would offer a more efficient, accurate, and safer alternative to traditional bridge inspection procedures. Optimizing bridge inspections in this manner would enable frequent inspections, allowing the health of bridges to be monitored comprehensively and minor problems to be recognized quickly and corrected easily before they turn into more critical issues. To create an unmanned data acquisition procedure, unmanned aerial vehicles with high-resolution cameras will be employed to collect videos of the bridge under inspection. To automate the bridge inspection procedure, machine learning methods, such as neural networks, and machine vision methods, such as the Hough transform and Canny edge detection, will assist in identifying the entire beam. These methods, along with future work in damage detection and assessment, will be the main steps in creating an unmanned automated bridge inspection.
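A small sketch of the classical-vision portion described above (OpenCV Python; the thresholds and the near-horizontal girder assumption are illustrative, and the neural-network detector is a separate component):

```python
import cv2
import numpy as np

def detect_beam_edges(frame):
    """Find dominant straight edges (candidate beam boundaries) in a frame.

    Canny edge detection followed by a probabilistic Hough transform;
    near-horizontal segments are kept as girder candidates.
    """
    gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
    blurred = cv2.GaussianBlur(gray, (5, 5), 0)       # suppress surface texture
    edges = cv2.Canny(blurred, 50, 150)
    lines = cv2.HoughLinesP(edges, rho=1, theta=np.pi / 180, threshold=80,
                            minLineLength=100, maxLineGap=10)
    beams = []
    if lines is not None:
        for x1, y1, x2, y2 in lines[:, 0]:
            angle = abs(np.degrees(np.arctan2(y2 - y1, x2 - x1)))
            if angle < 10 or angle > 170:             # near-horizontal only
                beams.append((x1, y1, x2, y2))
    return beams
```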