Scene classification has become an increasingly popular topic in computer vision, and its techniques underpin many related tasks, such as detection, action recognition, and content-based image retrieval. Recently, the stationary property of natural images has been leveraged in conjunction with convolutional networks to perform classification: in the existing approach, a single random patch is extracted from each training image to learn filters for the convolutional stages. However, learning features from only one random patch per image is not robust, because patches drawn from different areas of an image may contain distinct scene objects, so their features differ in descriptive power. In this dissertation, focusing on deep learning techniques, we propose a multi-scale network that learns image feature representations from multiple random patches at several patch sizes, improving on the existing approach.
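To make the patch-sampling idea concrete, the following is a minimal sketch of multi-scale random patch extraction in Python, assuming images are NumPy arrays of shape (H, W, C); the function name, default patch sizes, and patch count are illustrative assumptions, not the dissertation's actual configuration.

```python
import numpy as np

def extract_multiscale_patches(image, patch_sizes=(6, 8, 12),
                               num_patches=10, rng=None):
    """Draw `num_patches` random patches at each size in `patch_sizes`.

    Assumes the image is larger than every patch size.
    """
    rng = rng or np.random.default_rng()
    h, w, _ = image.shape
    patches = {s: [] for s in patch_sizes}
    for s in patch_sizes:
        for _ in range(num_patches):
            top = rng.integers(0, h - s + 1)    # high is exclusive
            left = rng.integers(0, w - s + 1)
            patches[s].append(image[top:top + s, left:left + s])
    return patches  # patches[s] would feed the filter learner for scale s
```

Each per-scale patch bank would then be used to learn the convolutional filters for that scale, so the network sees structure at several receptive-field sizes rather than one.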
Although the multi-scale network performs considerably better than the existing approach, both methods share a core limitation: they capture neither local features nor the spatial layout of the image. We therefore propose a novel Spatial Deep Network (SDN) that further improves on the existing approach by exploiting the image's spatial layout: random patch extraction is constrained to distinct areas of the image, so that the extracted patches retain the characteristics of each area. In this way, SDN yields compact yet discriminative features that combine global descriptors with local spatial information.
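As a rough illustration of this spatial constraint, the sketch below samples patches within each cell of a regular grid over the image; the grid shape, helper name, and parameter defaults are assumptions made for illustration only.

```python
import numpy as np

def sample_patches_per_region(image, grid=(2, 2), patch_size=8,
                              patches_per_region=5, rng=None):
    """Sample random patches inside each cell of a `grid` partition.

    Assumes every grid cell is larger than `patch_size`.
    """
    rng = rng or np.random.default_rng()
    h, w, _ = image.shape
    rh, rw = h // grid[0], w // grid[1]
    regions = []
    for i in range(grid[0]):
        for j in range(grid[1]):
            cell = image[i * rh:(i + 1) * rh, j * rw:(j + 1) * rw]
            ps = []
            for _ in range(patches_per_region):
                top = rng.integers(0, rh - patch_size + 1)
                left = rng.integers(0, rw - patch_size + 1)
                ps.append(cell[top:top + patch_size,
                               left:left + patch_size])
            regions.append(ps)
    return regions  # one patch set per region, in row-major order
```

Because every region contributes its own patches, per-region features can be concatenated in a fixed order, preserving the spatial layout that unconstrained random sampling discards.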
Experimental results show that SDN considerably outperforms both the existing approach and the multi-scale network, and achieves performance competitive with widely used classification techniques on the OT dataset (developed by Oliva and Torralba). To evaluate the robustness of the proposed SDN, we also apply it to content-based image retrieval on the Holidays dataset, where our features attain much better retrieval performance with far lower feature dimensionality than other state-of-the-art descriptors.
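For context, retrieval with such features typically reduces to ranking database images by the similarity of their feature vectors to the query's; a minimal sketch, assuming NumPy feature vectors with nonzero norms (the function name is hypothetical):

```python
import numpy as np

def retrieve(query_feat, db_feats, top_k=10):
    """Return indices of the `top_k` most similar database images."""
    q = query_feat / np.linalg.norm(query_feat)
    db = db_feats / np.linalg.norm(db_feats, axis=1, keepdims=True)
    sims = db @ q                       # cosine similarity per image
    return np.argsort(-sims)[:top_k]    # highest similarity first
```

Lower-dimensional features make this ranking step cheaper in both memory and compute, which is why compact descriptors matter for retrieval at scale.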