Image Understanding by Socializing the Semantic Gap
Several technological developments, such as the Internet, mobile devices and social networks, have spurred the sharing of images in unprecedented volumes, making tagging and commenting a common habit. Despite the recent progress in image analysis, the Semantic Gap still hinders machines from fully understanding the rich semantics of a shared photo. In this book, we tackle this problem by exploiting social network contributions. A comprehensive treatise of three linked problems on image annotation is presented, with a novel experimental protocol used to test eleven state-of-the-art methods. Three novel approaches to annotate an image, understand its sentiment and predict its popularity are presented. We conclude with the many challenges and opportunities ahead for the multimedia community.
Localization of JPEG double compression through multi-domain convolutional neural networks
When attackers want to falsify an image, in most cases they will perform a JPEG recompression. Different techniques have been developed based on diverse theoretical assumptions, but truly effective solutions are still lacking. Recently, machine learning based approaches have started to appear in the field of image forensics to solve diverse tasks such as acquisition source identification and forgery detection. In the latter case, the ultimate aim would be a trained neural network able, given a to-be-checked image, to reliably localize the forged areas. With this in mind, our paper proposes a step forward in this direction by analyzing how a single or double JPEG compression can be revealed and localized using convolutional neural networks (CNNs). Different kinds of input to the CNN have been taken into consideration, and various experiments have been carried out, also highlighting potential issues to be further investigated. (Accepted to CVPRW 2017, Workshop on Media Forensics.)
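To make the detection-and-localization idea concrete, here is a minimal sketch of a patch-level binary classifier: a small CNN labels each image patch as single- or double-compressed, and sliding it over the image marks candidate forged areas. This is illustrative only; the paper's multi-domain network also consumes DCT-coefficient inputs, and all layer sizes below are assumptions.

```python
# Minimal sketch (not the authors' architecture): a small CNN that
# classifies 64x64 RGB patches as single- vs double-JPEG-compressed.
import torch
import torch.nn as nn

class JpegPatchCNN(nn.Module):
    def __init__(self, num_classes: int = 2):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool2d(2),                      # 64 -> 32
            nn.Conv2d(32, 64, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool2d(2),                      # 32 -> 16
            nn.Conv2d(64, 128, kernel_size=3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),              # global pooling
        )
        self.classifier = nn.Linear(128, num_classes)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.classifier(self.features(x).flatten(1))

model = JpegPatchCNN()
patch = torch.randn(1, 3, 64, 64)   # one image patch
logits = model(patch)               # [1, 2]: single vs double compression
```

Localization then follows by aggregating per-patch decisions into a forgery map over the full image.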
Tracing images back to their social network of origin: A CNN-based approach
Recovering information about the history of a digital content, such as an image or a video, can be strategic for addressing an investigation from its early stages. Storage devices, smartphones and PCs belonging to a suspect are usually confiscated as soon as a warrant is issued. Any multimedia content found is analyzed in depth in order to trace back its provenance and, if possible, its original source. This is particularly important when dealing with social networks, where most user-generated photos and videos are uploaded and shared daily. Being able to discern whether images were downloaded from a social network or directly captured by a digital camera can be crucial in leading subsequent investigations. In this paper, we propose a novel method based on convolutional neural networks (CNNs) to determine image provenance: whether it originates from a social network, a messaging application or directly from a photo camera. By considering only the visual content, the method works irrespective of any manipulation of metadata performed by an attacker. We have tested the proposed technique on three publicly available datasets of images downloaded from seven popular social networks, obtaining state-of-the-art results.
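As a rough sketch of the task setup, provenance identification can be framed as multi-class image classification; below, a pretrained backbone is fine-tuned to predict an origin class per image. The class names are hypothetical, and the paper trains its own dedicated CNN rather than this off-the-shelf backbone.

```python
# Minimal sketch (assumed setup, not the paper's network): fine-tune a
# pretrained CNN to label an image's provenance class.
import torch
import torch.nn as nn
from torchvision import models

CLASSES = ["facebook", "twitter", "whatsapp", "camera"]  # illustrative

backbone = models.resnet18(weights=models.ResNet18_Weights.DEFAULT)
backbone.fc = nn.Linear(backbone.fc.in_features, len(CLASSES))

x = torch.randn(4, 3, 224, 224)   # a batch of image crops
logits = backbone(x)              # [4, len(CLASSES)]
pred = logits.argmax(dim=1)       # predicted origin per image
```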
Am I Done? Predicting Action Progress in Videos
In this paper we deal with the problem of predicting action progress in videos. We argue that this is an extremely important task since it can be valuable for a wide range of interaction applications. To this end we introduce a novel approach, named ProgressNet, capable of predicting when an action takes place in a video, where it is located within the frames, and how far it has progressed during its execution. To provide a general definition of action progress, we ground our work in the linguistics literature, borrowing terms and concepts to understand which actions can be the subject of progress estimation. As a result, we define a categorization of actions and their phases. Motivated by the recent success obtained from the interaction of Convolutional and Recurrent Neural Networks, our model is based on a combination of the Faster R-CNN framework, to make frame-wise predictions, and LSTM networks, to estimate action progress through time. After introducing two evaluation protocols for the task at hand, we demonstrate the capability of our model to effectively predict action progress on the UCF-101 and J-HMDB datasets.
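The core temporal idea can be sketched as follows: per-frame visual features (stand-ins here for the Faster R-CNN region features) feed an LSTM whose output is squashed into a progress value in [0, 1] at every frame. Dimensions are assumptions, not the paper's.

```python
# Minimal sketch of the progress-regression idea behind ProgressNet.
import torch
import torch.nn as nn

class ProgressHead(nn.Module):
    def __init__(self, feat_dim: int = 512, hidden: int = 128):
        super().__init__()
        self.lstm = nn.LSTM(feat_dim, hidden, batch_first=True)
        self.progress = nn.Linear(hidden, 1)

    def forward(self, feats: torch.Tensor) -> torch.Tensor:
        # feats: [batch, time, feat_dim] per-frame features
        h, _ = self.lstm(feats)
        return torch.sigmoid(self.progress(h)).squeeze(-1)  # [batch, time]

model = ProgressHead()
clip_feats = torch.randn(2, 30, 512)   # 2 videos, 30 frames each
progress = model(clip_feats)           # per-frame progress in [0, 1]
```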
Effective Triplet Mining Improves Training of Multi-scale Pooled CNN for Image Retrieval
In this paper, we address the problem of content-based image retrieval (CBIR) by learning image representations based on the activations of a Convolutional Neural Network. We propose an end-to-end trainable network architecture that exploits a novel multi-scale local pooling based on the trainable aggregation layer NetVLAD (Arandjelovic et al., NetVLAD, in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016) and bags of local features obtained by splitting the activations, allowing us to reduce the dimensionality of the descriptor and to increase retrieval performance. Training is performed using an improved triplet mining procedure that selects samples based on their difficulty, to obtain an effective image representation while reducing the risk of overfitting and loss of generalization. Extensive experiments show that our approach, which can be effectively used with different CNN architectures, obtains state-of-the-art results on standard and challenging CBIR datasets.
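To illustrate difficulty-based triplet mining in general (not the paper's exact selection procedure), here is a common batch-hard variant: for each anchor, take the farthest same-label sample as the positive and the closest different-label sample as the negative. The margin value is an assumption.

```python
# Minimal sketch of batch-hard triplet mining over L2-normalized
# descriptors; illustrative, not the authors' exact procedure.
import torch
import torch.nn.functional as F

def batch_hard_triplet_loss(desc: torch.Tensor, labels: torch.Tensor,
                            margin: float = 0.1) -> torch.Tensor:
    # desc: [N, D] image descriptors, labels: [N] class ids
    desc = F.normalize(desc, dim=1)
    dist = torch.cdist(desc, desc)                 # pairwise distances
    same = labels[:, None] == labels[None, :]
    eye = torch.eye(len(labels), dtype=torch.bool)
    # hardest positive: farthest sample sharing the anchor's label
    d_pos = dist.masked_fill(~same | eye, float("-inf")).max(dim=1).values
    # hardest negative: closest sample with a different label
    d_neg = dist.masked_fill(same, float("inf")).min(dim=1).values
    return F.relu(d_pos - d_neg + margin).mean()

loss = batch_hard_triplet_loss(torch.randn(8, 256),
                               torch.tensor([0, 0, 1, 1, 2, 2, 3, 3]))
```

Mining hard samples this way keeps the triplet loss informative late in training, when random triplets are mostly already satisfied.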
Deep Sentiment Features of Context and Faces for Affective Video Analysis
Given the huge quantity of hours of video available on video sharing platforms such as YouTube and Vimeo, the development of automatic tools that help users find videos that fit their interests has attracted the attention of both scientific and industrial communities. So far the majority of works have addressed semantic analysis, to identify objects, scenes and events depicted in videos, but more recently affective analysis of videos has started to gain more attention. In this work we investigate the use of sentiment-driven features to classify the induced sentiment of a video, i.e. the sentiment reaction of the user. Instead of using standard computer vision features, such as CNN features or SIFT features trained to recognize objects and scenes, we exploit sentiment-related features such as those provided by Deep-SentiBank [4], and features extracted from models that exploit deep networks trained on face expressions. We experiment on two recently introduced datasets, LIRIS-ACCEDE [2] and MEDIAEVAL-2015, that provide sentiment annotations of a large set of short videos. We show that our approach not only outperforms the current state-of-the-art in terms of valence and arousal classification accuracy, but also uses a smaller number of features, thus requiring less video processing.
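A bare-bones view of the fusion step: sentiment-driven context features and face-expression features are concatenated and passed to a classifier over valence/arousal classes. The dimensions and the linear head below are assumptions for illustration, not the paper's pipeline.

```python
# Minimal sketch of late fusion of context and face sentiment features.
import torch
import torch.nn as nn

context_dim, face_dim, num_classes = 2089, 128, 3  # assumed sizes
fusion_clf = nn.Linear(context_dim + face_dim, num_classes)

context_feat = torch.randn(1, context_dim)  # per-video context feature
face_feat = torch.randn(1, face_dim)        # pooled face-expression feature
logits = fusion_clf(torch.cat([context_feat, face_feat], dim=1))
```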
Composed Image Retrieval using Contrastive Learning and Task-oriented CLIP-based Features
Given a query composed of a reference image and a relative caption, the goal of composed image retrieval is to retrieve images that are visually similar to the reference one while integrating the modifications expressed by the caption. Given that recent research has demonstrated the efficacy of large-scale vision and language pre-trained (VLP) models in various tasks, we rely on features from the OpenAI CLIP model to tackle the considered task. We initially perform a task-oriented fine-tuning of both CLIP encoders using the element-wise sum of visual and textual features. Then, in the second stage, we train a Combiner network that learns to combine the image-text features, integrating the bimodal information and providing combined features used to perform the retrieval. We use contrastive learning in both stages of training. Starting from the bare CLIP features as a baseline, experimental results show that the task-oriented fine-tuning and the carefully crafted Combiner network are highly effective and outperform more complex state-of-the-art approaches on FashionIQ and CIRR, two popular and challenging datasets for composed image retrieval. Code and pre-trained models are available at https://github.com/ABaldrati/CLIP4Cir. (Accepted in ACM Transactions on Multimedia Computing, Communications, and Applications (TOMM).)
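The shape of a Combiner-style network can be sketched as below: the CLIP features of the reference image and the relative caption are mixed by an MLP, with the element-wise sum kept as a residual path. Layer sizes are assumptions; the actual architecture is in the authors' repository.

```python
# Minimal sketch (assumed sizes) of a Combiner-style fusion network.
import torch
import torch.nn as nn
import torch.nn.functional as F

class Combiner(nn.Module):
    def __init__(self, clip_dim: int = 640, hidden: int = 1024):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(2 * clip_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, clip_dim),
        )

    def forward(self, img_f: torch.Tensor, txt_f: torch.Tensor) -> torch.Tensor:
        combined = self.mlp(torch.cat([img_f, txt_f], dim=-1))
        # keep the element-wise sum used in stage one as a residual term
        return F.normalize(combined + img_f + txt_f, dim=-1)

combiner = Combiner()
query = combiner(torch.randn(2, 640), torch.randn(2, 640))  # [2, 640]
```

Retrieval then ranks candidate images by cosine similarity between this combined query descriptor and their CLIP image features.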
Outdoor object recognition for smart audio guides
We present a smart audio guide that adapts itself to the environment the user is navigating. The system automatically builds a point-of-interest database, exploiting Wikipedia and Google APIs as sources. We rely on a computer vision system to overcome likely sensor limitations and to determine with high accuracy whether the user is facing a certain landmark or none at all. Thanks to this, the guide presents an audio description at the most appropriate moment without any user intervention, using text-to-speech to augment the experience.
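The "facing a landmark or none at all" decision can be sketched as classification with confidence-based rejection: the guide only triggers audio when the recognizer is sufficiently sure. The model, feature size and threshold below are placeholders, not the system's actual components.

```python
# Minimal sketch of landmark recognition with a rejection option.
import torch
import torch.nn.functional as F

def recognize(feat: torch.Tensor, model: torch.nn.Module,
              threshold: float = 0.8):
    # feat: [1, D] visual feature of the current camera frame
    probs = F.softmax(model(feat), dim=1)
    conf, idx = probs.max(dim=1)
    return int(idx) if float(conf) >= threshold else None  # None = no landmark

dummy_model = torch.nn.Linear(512, 20)  # stand-in for a trained classifier
poi_id = recognize(torch.randn(1, 512), dummy_model)
```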
A multimodal feature learning approach for sentiment analysis of social network multimedia
We investigate the use of a multimodal feature learning approach, using neural network based models such as Skip-gram and denoising autoencoders, to address sentiment analysis of micro-blogging content, such as Twitter short messages, that is composed of a short text and, possibly, an image.
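One ingredient named above, the denoising autoencoder, can be sketched in a few lines: it learns a compact code by reconstructing clean inputs from corrupted ones, here over a fused text-plus-image feature vector. Dimensions and noise level are assumptions.

```python
# Minimal sketch of a denoising autoencoder over fused multimodal features.
import torch
import torch.nn as nn

class DenoisingAE(nn.Module):
    def __init__(self, in_dim: int = 500, code_dim: int = 128):
        super().__init__()
        self.enc = nn.Sequential(nn.Linear(in_dim, code_dim), nn.ReLU())
        self.dec = nn.Linear(code_dim, in_dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        noisy = x + 0.1 * torch.randn_like(x)  # corrupt the input
        return self.dec(self.enc(noisy))       # reconstruct the clean input

ae = DenoisingAE()
x = torch.randn(16, 500)                       # fused text+image features
loss = nn.functional.mse_loss(ae(x), x)        # reconstruction objective
```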
Socializing the Semantic Gap: A Comparative Survey on Image Tag Assignment, Refinement and Retrieval
Where previous reviews on content-based image retrieval emphasize what can be seen in an image to bridge the semantic gap, this survey considers what people tag about an image. A comprehensive treatise of three closely linked problems, i.e., image tag assignment, refinement, and tag-based image retrieval, is presented. While existing works vary in terms of their targeted tasks and methodology, they rely on the key functionality of tag relevance, i.e., estimating the relevance of a specific tag with respect to the visual content of a given image and its social context. By analyzing what information a specific method exploits to construct its tag relevance function and how such information is exploited, this paper introduces a taxonomy to structure the growing literature, understand the ingredients of the main works, clarify their connections and differences, and recognize their merits and limitations. For a head-to-head comparison among the state-of-the-art, a new experimental protocol is presented, with training sets containing 10k, 100k and 1M images and an evaluation on three test sets contributed by various research groups. Eleven representative works are implemented and evaluated. Putting all this together, the survey aims to provide an overview of the past and foster progress for the near future. (To appear in ACM Computing Surveys.)