Scene Parsing with Multiscale Feature Learning, Purity Trees, and Optimal Covers
Scene parsing, or semantic segmentation, is the task of labeling each pixel in
an image with the category of the object it belongs to. It is a challenging
task that involves the simultaneous detection, segmentation, and recognition of
all the objects in the image.
The scene parsing method proposed here starts by computing a tree of segments
from a graph of pixel dissimilarities. Simultaneously, a set of dense feature
vectors is computed which encodes regions of multiple sizes centered on each
pixel. The feature extractor is a multiscale convolutional network trained from
raw pixels. The feature vectors associated with the segments covered by each
node in the tree are aggregated and fed to a classifier which produces an
estimate of the distribution of object categories contained in the segment. A
subset of tree nodes that cover the image are then selected so as to maximize
the average "purity" of the class distributions, hence maximizing the overall
likelihood that each segment will contain a single object. The convolutional
network feature extractor is trained end-to-end from raw pixels, alleviating
the need for engineered features. After training, the system is parameter free.
The system yields record accuracies on the Stanford Background Dataset (8
classes), the SIFT Flow Dataset (33 classes) and the Barcelona Dataset (170
classes) while being an order of magnitude faster than competing approaches,
producing a 320 × 240 image labeling in less than 1 second.
Comment: 9 pages, 4 figures. Published in the 29th International Conference on
Machine Learning (ICML 2012), June 2012, Edinburgh, United Kingdom.
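The tree-cover step described above (selecting a set of tree nodes that covers the image while maximizing purity) can be pictured as a simple bottom-up dynamic program. This is an illustrative sketch only: the `Node` layout, the dominant-class purity proxy, and the size weighting are assumptions, not the paper's exact formulation.

```python
class Node:
    def __init__(self, size, class_dist, children=()):
        self.size = size              # number of pixels in the segment
        self.class_dist = class_dist  # estimated category distribution
        self.children = list(children)

def purity(dist):
    # One common purity proxy: probability mass of the dominant class.
    return max(dist)

def optimal_cover(node):
    """Return (score, nodes): a cover of `node`'s pixels maximizing
    size-weighted purity, choosing either the node or its subtrees."""
    own = node.size * purity(node.class_dist)
    if not node.children:
        return own, [node]
    child_score, child_nodes = 0.0, []
    for c in node.children:
        s, ns = optimal_cover(c)
        child_score += s
        child_nodes += ns
    if own >= child_score:
        return own, [node]           # keep the larger, purer segment
    return child_score, child_nodes  # split into purer children
```

Because each node is visited once, the selection runs in time linear in the tree size, consistent with the speed the abstract emphasizes.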
Teaching Compositionality to CNNs
Convolutional neural networks (CNNs) have shown great success in computer
vision, approaching human-level performance when trained for specific tasks via
application-specific loss functions. In this paper, we propose a method for
augmenting and training CNNs so that their learned features are compositional.
It encourages networks to form representations that disentangle objects from
their surroundings and from each other, thereby promoting better
generalization. Our method is agnostic to the specific details of the
underlying CNN to which it is applied and can in principle be used with any
CNN. As we show in our experiments, the learned representations lead to feature
activations that are more localized and improve performance over
non-compositional baselines in object recognition tasks.
Comment: Preprint appearing in CVPR 2017.
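One way to picture "compositional" training as described above is an auxiliary penalty that keeps features inside an object's mask stable when the background is removed from the input. The toy loss below is an illustration of that idea under assumed inputs (feature maps and a binary object mask), not the paper's actual objective.

```python
import numpy as np

def compositional_penalty(feat_full, feat_masked, obj_mask, lam=1.0):
    """Toy auxiliary loss: object-region features should not change when
    the background is masked out of the input image."""
    diff = (feat_full - feat_masked) * obj_mask
    return lam * float(np.mean(diff ** 2))
```

In training, such a term would simply be added to the task loss, which matches the abstract's claim that the method is agnostic to the underlying CNN.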
Data-driven Crowd Analysis in Videos
In this work we present a new crowd analysis algorithm powered by behavior priors learned on a large database of crowd videos gathered from the Internet. The algorithm works by first learning a set of crowd behavior priors off-line. During testing, crowd patches are matched to the database and behavior priors are transferred. Our approach rests on the insight that, although the space of possible crowd behaviors is infinite, the space of distinguishable crowd motion patterns may not be all that large. For many individuals in a crowd, we are able to find analogous crowd patches in our database containing similar patterns of behavior, which can effectively act as priors to constrain the difficult task of tracking an individual in a crowd. Our algorithm is data-driven and, unlike some crowd characterization methods, does not require us to have seen the test video beforehand. It performs on par with state-of-the-art methods for tracking people exhibiting common crowd behaviors, and outperforms them when the tracked individual behaves in an unusual way.
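The match-and-transfer step described above reduces, in its simplest form, to a nearest-neighbor lookup: describe the test crowd patch, find the closest database patch, and reuse its stored behavior prior. A minimal sketch, assuming patch descriptors are fixed-length vectors and priors are arbitrary stored objects:

```python
import numpy as np

def transfer_prior(query_desc, db_descs, db_priors):
    """Match a test crowd patch to the database by descriptor distance
    and return the behavior prior of its nearest neighbor."""
    dists = np.linalg.norm(db_descs - query_desc, axis=1)
    return db_priors[int(np.argmin(dists))]
```

The transferred prior would then constrain the tracker for that patch; the descriptor choice and distance metric here are assumptions for illustration.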
A Tree-Based Context Model for Object Recognition
There has been a growing interest in exploiting contextual information in addition to local features to detect and localize multiple object categories in an image. A context model can rule out some unlikely combinations or locations of objects and guide detectors to produce a semantically coherent interpretation of a scene. However, the performance benefit of context models has been limited because most of the previous methods were tested on datasets with only a few object categories, in which most images contain one or two object categories. In this paper, we introduce a new dataset with images that contain many instances of different object categories, and propose an efficient model that captures the contextual information among more than a hundred object categories using a tree structure. Our model incorporates global image features, dependencies between object categories, and outputs of local detectors into one probabilistic framework. We demonstrate that our context model improves object recognition performance and provides a coherent interpretation of a scene, which enables a reliable image querying system by multiple object categories. In addition, our model can be applied to scene understanding tasks that local detectors alone cannot solve, such as detecting objects out of context or querying for the most typical and the least typical scenes in a dataset.
This research was partially funded by Shell International Exploration and Production Inc., by the Army Research Office under award W911NF-06-1-0076, by an NSF Career Award (ISI 0747120), and by the Air Force Office of Scientific Research under Award No. FA9550-06-1-0324. Any opinions, findings, and conclusions or recommendations expressed in this publication are those of the author(s) and do not necessarily reflect the views of the Air Force.
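The key computational benefit of a tree-structured context model like the one above is that the joint probability of object presences factors along parent-child edges, so it can be evaluated (and learned) efficiently even for over a hundred categories. A minimal sketch with binary presence variables; the dictionary layout of the conditional probability tables is a hypothetical choice for illustration:

```python
def joint_presence_prob(parents, root, root_prob, cpts, presence):
    """P(presence) = P(b_root) * prod over non-root nodes of
    P(b_node | b_parent), following the tree's edges.

    parents:   {node: parent} for every non-root node
    root_prob: {0: p, 1: p} prior on the root's presence
    cpts:      {node: {(child_state, parent_state): prob}}
    presence:  {node: 0 or 1} assignment to evaluate
    """
    p = root_prob[presence[root]]
    for node, parent in parents.items():
        p *= cpts[node][(presence[node], presence[parent])]
    return p
```

With this factorization, exact inference over the tree takes time linear in the number of categories, which is what makes the hundred-category setting tractable.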
SIFT Flow: Dense Correspondence across Scenes and its Applications
While image alignment has been studied in different areas of computer vision for decades, aligning images depicting different scenes remains a challenging problem. Analogous to optical flow, where an image is aligned to its temporally adjacent frame, we propose SIFT flow, a method to align an image to its nearest neighbors in a large image corpus containing a variety of scenes. The SIFT flow algorithm consists of matching densely sampled, pixel-wise SIFT features between two images, while preserving spatial discontinuities. The SIFT features allow robust matching across different scene/object appearances, whereas the discontinuity-preserving spatial model allows matching of objects located at different parts of the scene. Experiments show that the proposed approach robustly aligns complex scene pairs containing significant spatial differences. Based on SIFT flow, we propose an alignment-based large database framework for image analysis and synthesis, where image information is transferred from the nearest neighbors to a query image according to the dense scene correspondence. This framework is demonstrated through concrete applications, such as motion field prediction from a single image, motion synthesis via object transfer, satellite image registration, and face recognition.
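The objective described above (dense descriptor matching plus a discontinuity-preserving spatial model) can be sketched as an energy over an integer flow field. This simplified version keeps the three characteristic terms — L1 data matching, a small-displacement bias, and pairwise smoothness — but omits the truncation thresholds and constants of the original formulation.

```python
import numpy as np

def sift_flow_energy(s1, s2, flow, eta=0.01, alpha=1.0):
    """Simplified SIFT-flow-style energy for dense descriptors
    s1, s2 of shape (H, W, D) and an integer flow of shape (H, W, 2)."""
    H, W, _ = s1.shape
    e = 0.0
    for y in range(H):
        for x in range(W):
            u, v = flow[y, x]
            yy = min(max(y + v, 0), H - 1)   # clamp target to image
            xx = min(max(x + u, 0), W - 1)
            e += np.abs(s1[y, x] - s2[yy, xx]).sum()  # data term (L1)
            e += eta * (abs(u) + abs(v))              # small displacement
            if x + 1 < W:                             # smoothness, right
                e += alpha * np.abs(flow[y, x] - flow[y, x + 1]).sum()
            if y + 1 < H:                             # smoothness, down
                e += alpha * np.abs(flow[y, x] - flow[y + 1, x]).sum()
    return float(e)
```

Minimizing an energy of this shape over all flows is what the paper's matching algorithm does; this function only evaluates a candidate flow.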
Image Classification Using SVM Classifiers Trained with the AdaBoost Method
Master's thesis -- Seoul National University Graduate School, Department of Electrical and Computer Engineering, February 2015. Advisor: 유석인.
This thesis presents an algorithm that categorizes images by the objects they contain. The images are encoded with the bag-of-features (BoF) model, which represents an image as a collection of unordered features extracted from local patches. To handle multiple object categories, the one-versus-all method is used to implement the multi-class classifier: one object classifier is built per object category, and each classifier decides whether an image belongs to its category or not. Each object classifier is built with the AdaBoost method and is given by the weighted sum of 200 support vector machine (SVM) component classifiers. Among the object classifiers, the one with the highest output value finally determines the category of the image. The classification performance of the presented algorithm is demonstrated on images from the Caltech-101 dataset.
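The decision rule of the thesis above — per-category boosted ensembles of SVM components, one-versus-all, highest weighted sum wins — can be sketched as follows. The ensemble layout (a list of (weight, classifier) pairs per category) is an assumption for illustration, and the toy classifiers below stand in for trained SVMs.

```python
import numpy as np

def ova_predict(category_ensembles, x):
    """One-versus-all prediction: each category's score is the weighted
    sum of its component classifiers' real-valued outputs on x; the
    category with the highest score is returned."""
    scores = [sum(w * clf(x) for w, clf in ensemble)
              for ensemble in category_ensembles]
    return int(np.argmax(scores))
```

In the thesis each ensemble would hold 200 AdaBoost-weighted SVM components; here any callables returning decision scores will do.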