Graph U-Nets
We consider the problem of representation learning for graph data.
Convolutional neural networks can naturally operate on images, but have
significant challenges in dealing with graph data. Given that images are
special cases of graphs whose nodes lie on 2D lattices, graph embedding tasks
have a
natural correspondence with image pixel-wise prediction tasks such as
segmentation. While encoder-decoder architectures like U-Nets have been
successfully applied on many image pixel-wise prediction tasks, similar methods
are lacking for graph data. This is due to the fact that pooling and
up-sampling operations are not natural on graph data. To address these
challenges, we propose novel graph pooling (gPool) and unpooling (gUnpool)
operations in this work. The gPool layer adaptively selects some nodes to form
a smaller graph based on their scalar projection values on a trainable
projection vector. We further propose the gUnpool layer as the inverse
operation of the gPool layer. The gUnpool layer restores the graph into its
original structure using the position information of nodes selected in the
corresponding gPool layer. Based on our proposed gPool and gUnpool layers, we
develop an encoder-decoder model on graphs, known as graph U-Nets. Our
experimental results on node classification and graph classification tasks
demonstrate that our methods achieve consistently better performance than
previous models.
Comment: 10 pages, ICML1
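The pooling mechanism described above can be sketched in a few lines. This is a minimal NumPy illustration of the idea, not the authors' implementation: node features are projected onto a trainable vector, the top-k nodes by scalar projection are kept (with a sigmoid gate on their scores), and the unpooling step restores the kept nodes to their original positions with zeros elsewhere. The function names `gpool` and `gunpool` follow the paper's layer names, but the exact gating and normalization details here are assumptions.

```python
import numpy as np

def gpool(X, A, p, k):
    """Sketch of the gPool idea: keep the top-k nodes ranked by their
    scalar projection onto a trainable projection vector p."""
    # Scalar projection of each node's feature vector onto p.
    scores = X @ p / np.linalg.norm(p)
    idx = np.argsort(scores)[-k:][::-1]          # indices of the top-k nodes
    gate = 1.0 / (1.0 + np.exp(-scores[idx]))    # sigmoid gate on the scores
    X_new = X[idx] * gate[:, None]               # gated features of kept nodes
    A_new = A[np.ix_(idx, idx)]                  # adjacency of induced subgraph
    return X_new, A_new, idx

def gunpool(X_small, idx, n):
    """Sketch of the gUnpool idea: restore the selected nodes to their
    original positions in an n-node graph, zero-filling the rest."""
    X_full = np.zeros((n, X_small.shape[1]))
    X_full[idx] = X_small
    return X_full
```

Because `gunpool` reuses the index array recorded by the corresponding `gpool` call, stacking these layers symmetrically yields the encoder-decoder shape of a U-Net on graphs.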
Wide-Area Crowd Counting: Multi-View Fusion Networks for Counting in Large Scenes
Crowd counting in single-view images has achieved outstanding performance on
existing counting datasets. However, single-view counting is not applicable to
large and wide scenes (e.g., public parks, long subway platforms, or event
spaces) because a single camera cannot capture the whole scene in adequate
detail for counting, e.g., when the scene is too large to fit into the
field-of-view of the camera, too long so that the resolution is too low on
faraway crowds, or when there are too many large objects that occlude large
portions of the crowd. Solving the wide-area counting task therefore requires
multiple cameras with overlapping fields-of-view. In this paper, we propose a
deep neural network framework for multi-view crowd counting, which fuses
information from multiple camera views to predict a scene-level density map on
the ground-plane of the 3D world. We consider three versions of the fusion
framework: the late fusion model fuses camera-view density maps; the naive early
fusion model fuses camera-view feature maps; and the multi-view multi-scale
early fusion model ensures that features aligned to the same ground-plane point
have consistent scales. A rotation selection module further ensures consistent
rotation alignment of the features. We test our three fusion models on three
multi-view
counting datasets, PETS2009, DukeMTMC, and a newly collected multi-view
counting dataset containing a crowded street intersection. Our methods achieve
state-of-the-art results compared to other multi-view counting baselines.
Comment: 29 pages, 13 figures, submitted to IJC
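The late-fusion variant above can be sketched simply, under the assumption that each camera-view density map has already been projected onto a common ground-plane grid (the paper's actual model learns this fusion with a deep network; the per-view weights here are a hypothetical stand-in for that learned combination, needed so that people visible in several overlapping views are not double-counted):

```python
import numpy as np

def late_fusion(view_density_maps, weights=None):
    """Sketch of late fusion: combine per-view density maps, already
    aligned on a shared ground-plane grid, into one scene-level map."""
    maps = np.stack(view_density_maps)           # (n_views, H, W)
    if weights is None:
        # Uniform weights as a placeholder for a learned fusion.
        weights = np.full(len(view_density_maps), 1.0 / len(view_density_maps))
    # Weighted sum over the view axis -> (H, W) scene-level density map.
    return np.tensordot(weights, maps, axes=1)
```

Summing the fused map over all ground-plane cells then gives the scene-level crowd count estimate.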