As the distinction between online and physical spaces rapidly degrades, social media have become an integral component of how many people's everyday experiences are mediated. As such, there is increasing interest in exploring how the content shared through these online platforms contributes to the collaborative creation of places in physical space at the urban scale. Exploring digital geographies of social media data using methods such as qualitative coding (i.e., content labelling) is a flexible but complex task, commonly limited to small samples because it is impractical for large datasets. In this paper, we propose a new tool for studies in digital geographies that bridges qualitative and quantitative approaches: it learns a set of arbitrary labels (qualitative codes) from a small, manually created sample and applies the same labels to a larger set. We introduce a semi-supervised, deep neural network approach to classify geo-located social media posts based on their textual and image content, as well as geographical and temporal aspects. Our innovative approach is rooted in our understanding of social media posts as augmentations of the time-space configurations that places are. It comprises a stacked multi-modal autoencoder neural network, which creates joint representations of text and images, and a spatio-temporal graph convolutional neural network for semi-supervised classification. The results presented in this paper show that our approach classifies social media content with higher accuracy than traditional machine learning models as well as two state-of-the-art deep learning frameworks.