814 research outputs found
Clothing Co-Parsing by Joint Image Segmentation and Labeling
This paper aims at developing an integrated system of clothing co-parsing, in
order to jointly parse a set of clothing images (unsegmented but annotated with
tags) into semantic configurations. We propose a data-driven framework
consisting of two phases of inference. The first phase, referred as "image
co-segmentation", iterates to extract consistent regions on images and jointly
refines the regions over all images by employing the exemplar-SVM (E-SVM)
technique [23]. In the second phase (i.e. "region co-labeling"), we construct a
multi-image graphical model by taking the segmented regions as vertices, and
incorporate several contexts of clothing configuration (e.g., item location and
mutual interactions). The joint label assignment can be solved using the
efficient Graph Cuts algorithm. In addition to evaluate our framework on the
Fashionista dataset [30], we construct a dataset called CCP consisting of 2098
high-resolution street fashion photos to demonstrate the performance of our
system. We achieve 90.29% / 88.23% segmentation accuracy and 65.52% / 63.89%
recognition rate on the Fashionista and the CCP datasets, respectively, which
are superior compared with state-of-the-art methods.Comment: 8 pages, 5 figures, CVPR 201
Semantic Object Parsing with Local-Global Long Short-Term Memory
Semantic object parsing is a fundamental task for understanding objects in
detail in computer vision community, where incorporating multi-level contextual
information is critical for achieving such fine-grained pixel-level
recognition. Prior methods often leverage the contextual information through
post-processing predicted confidence maps. In this work, we propose a novel
deep Local-Global Long Short-Term Memory (LG-LSTM) architecture to seamlessly
incorporate short-distance and long-distance spatial dependencies into the
feature learning over all pixel positions. In each LG-LSTM layer, local
guidance from neighboring positions and global guidance from the whole image
are imposed on each position to better exploit complex local and global
contextual information. Individual LSTMs for distinct spatial dimensions are
also utilized to intrinsically capture various spatial layouts of semantic
parts in the images, yielding distinct hidden and memory cells of each position
for each dimension. In our parsing approach, several LG-LSTM layers are stacked
and appended to the intermediate convolutional layers to directly enhance
visual features, allowing network parameters to be learned in an end-to-end
way. The long chains of sequential computation by stacked LG-LSTM layers also
enable each pixel to sense a much larger region for inference benefiting from
the memorization of previous dependencies in all positions along all
dimensions. Comprehensive evaluations on three public datasets well demonstrate
the significant superiority of our LG-LSTM over other state-of-the-art methods.Comment: 10 page
Matching-CNN Meets KNN: Quasi-Parametric Human Parsing
Both parametric and non-parametric approaches have demonstrated encouraging
performances in the human parsing task, namely segmenting a human image into
several semantic regions (e.g., hat, bag, left arm, face). In this work, we aim
to develop a new solution with the advantages of both methodologies, namely
supervision from annotated data and the flexibility to use newly annotated
(possibly uncommon) images, and present a quasi-parametric human parsing model.
Under the classic K Nearest Neighbor (KNN)-based nonparametric framework, the
parametric Matching Convolutional Neural Network (M-CNN) is proposed to predict
the matching confidence and displacements of the best matched region in the
testing image for a particular semantic region in one KNN image. Given a
testing image, we first retrieve its KNN images from the
annotated/manually-parsed human image corpus. Then each semantic region in each
KNN image is matched with confidence to the testing image using M-CNN, and the
matched regions from all KNN images are further fused, followed by a superpixel
smoothing procedure to obtain the ultimate human parsing result. The M-CNN
differs from the classic CNN in that the tailored cross image matching filters
are introduced to characterize the matching between the testing image and the
semantic region of a KNN image. The cross image matching filters are defined at
different convolutional layers, each aiming to capture a particular range of
displacements. Comprehensive evaluations over a large dataset with 7,700
annotated human images well demonstrate the significant performance gain from
the quasi-parametric model over the state-of-the-arts, for the human parsing
task.Comment: This manuscript is the accepted version for CVPR 201
Semantic Object Parsing with Graph LSTM
By taking the semantic object parsing task as an exemplar application
scenario, we propose the Graph Long Short-Term Memory (Graph LSTM) network,
which is the generalization of LSTM from sequential data or multi-dimensional
data to general graph-structured data. Particularly, instead of evenly and
fixedly dividing an image to pixels or patches in existing multi-dimensional
LSTM structures (e.g., Row, Grid and Diagonal LSTMs), we take each
arbitrary-shaped superpixel as a semantically consistent node, and adaptively
construct an undirected graph for each image, where the spatial relations of
the superpixels are naturally used as edges. Constructed on such an adaptive
graph topology, the Graph LSTM is more naturally aligned with the visual
patterns in the image (e.g., object boundaries or appearance similarities) and
provides a more economical information propagation route. Furthermore, for each
optimization step over Graph LSTM, we propose to use a confidence-driven scheme
to update the hidden and memory states of nodes progressively till all nodes
are updated. In addition, for each node, the forgets gates are adaptively
learned to capture different degrees of semantic correlation with neighboring
nodes. Comprehensive evaluations on four diverse semantic object parsing
datasets well demonstrate the significant superiority of our Graph LSTM over
other state-of-the-art solutions.Comment: 18 page
- …