User Preference Prediction in Visual Data on Mobile Devices
In this paper we consider user modeling based on the photos and videos in
the gallery of a mobile device. We propose a novel user preference prediction
engine based on scene understanding, object detection and face recognition.
First, all faces in the gallery are clustered, and all private photos and
videos containing faces from large clusters are processed on the embedded
system in offline mode. The remaining photos are sent to a remote server to be
analyzed by very deep models. The visual features of each photo are aggregated
into a single user descriptor using a neural attention block. The proposed
pipeline is implemented for the Android mobile platform. Experimental results
on a subset of the Amazon Home and Kitchen, Places2 and Open Images datasets
demonstrate that images can be processed very efficiently without accuracy
degradation.
Comment: 5 pages, 2 figures
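The attention-based aggregation step described in this abstract could look
roughly like the following PyTorch sketch. It is a minimal illustration only:
the feature dimension, the two-layer scoring network and all names are
assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn

class AttentionAggregator(nn.Module):
    """Aggregates per-photo feature vectors into one gallery descriptor via a
    learned attention weighting (a sketch; dimensions are assumptions)."""

    def __init__(self, feature_dim: int = 512, hidden_dim: int = 128):
        super().__init__()
        # Small MLP that scores each photo's relevance to the user profile.
        self.score = nn.Sequential(
            nn.Linear(feature_dim, hidden_dim),
            nn.Tanh(),
            nn.Linear(hidden_dim, 1),
        )

    def forward(self, features: torch.Tensor) -> torch.Tensor:
        # features: (num_photos, feature_dim) visual embeddings of a gallery.
        weights = torch.softmax(self.score(features), dim=0)  # (num_photos, 1)
        # Weighted sum yields a single (feature_dim,) user descriptor.
        return (weights * features).sum(dim=0)

# Usage: a gallery of 100 photos, each a 512-d CNN feature vector.
photos = torch.randn(100, 512)
descriptor = AttentionAggregator()(photos)
print(descriptor.shape)  # torch.Size([512])
```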
Event Recognition with Automatic Album Detection based on Sequential Processing, Neural Attention and Image Captioning
In this paper a new formulation of the event recognition task is examined:
event categories must be predicted for a gallery of images in which the albums
(groups of photos corresponding to a single event) are unknown. We propose a
novel two-stage approach. First, features are extracted from each photo using
a pre-trained convolutional neural network (CNN). These features are
classified individually, and the classifier scores are used to group
sequential photos into several clusters. Finally, the features of the photos
in each group are aggregated into a single descriptor using a neural attention
mechanism. This algorithm is optionally extended to improve the classification
accuracy for each image in an album. In contrast to conventional fine-tuning
of CNNs, we propose to use image captioning, i.e., a generative model that
converts images into textual descriptions. These captions are one-hot encoded
and summarized into a sparse feature vector suitable for training an arbitrary
classifier. An experimental study on the Photo Event Collection and the
Multi-Label Curation of Flickr Events Dataset demonstrates that our approach
is 9-20% more accurate than event recognition on single photos. Moreover, the
proposed method has a 13-16% lower error rate than classification of photo
groups obtained with hierarchical clustering. It is experimentally shown that
image captions trained on the Conceptual Captions dataset can be classified
more accurately than features from an object detector, although both are
clearly less rich than CNN-based features. However, our approach can be
combined with conventional CNNs in an ensemble to provide state-of-the-art
results on several event datasets.
Comment: 11 pages, 5 figures