From Pixels to Sentiment: Fine-tuning CNNs for Visual Sentiment Prediction
Visual multimedia have become an inseparable part of our digital social
lives, and they often capture moments tied with deep affections. Automated
visual sentiment analysis tools can provide a means of extracting the rich
feelings and latent dispositions embedded in these media. In this work, we
explore how Convolutional Neural Networks (CNNs), a now de facto computational
machine learning tool particularly in the area of Computer Vision, can be
specifically applied to the task of visual sentiment prediction. We accomplish
this through fine-tuning experiments using a state-of-the-art CNN and, via
rigorous architecture analysis, present several modifications that lead to
accuracy improvements over prior art on a dataset of images from a popular
social media platform. We additionally present visualizations of local patterns
that the network learned to associate with image sentiment for insight into how
visual positivity (or negativity) is perceived by the model.
Comment: Accepted for publication in Image and Vision Computing. Models and
source code available at https://github.com/imatge-upc/sentiment-201
Semantic levels of domain-independent commonsense knowledgebase for visual indexing and retrieval applications
Intelligent tools for searching, indexing and retrieval applications are needed to manage the rapidly increasing amount of visual data. This raises the need for building and maintaining ontologies and knowledgebases to support textual semantic representation of visual content, which is an important building block in these applications. This paper proposes a commonsense knowledgebase that forms the link between the visual world and its semantic textual representation. This domain-independent knowledge is provided at different levels of semantics by a fully automated engine that analyses, fuses and integrates previous commonsense knowledgebases. The knowledgebase completes the levels of semantics by adding two new levels: temporal event scenarios and psycholinguistic understanding. Statistical properties and an experimental evaluation show the coherency and effectiveness of the proposed knowledgebase in providing the knowledge needed for wide-domain visual applications.
SoccerNet: A Scalable Dataset for Action Spotting in Soccer Videos
In this paper, we introduce SoccerNet, a benchmark for action spotting in
soccer videos. The dataset is composed of 500 complete soccer games from six
main European leagues, covering three seasons from 2014 to 2017 and a total
duration of 764 hours. A total of 6,637 temporal annotations are automatically
parsed from online match reports at a one minute resolution for three main
classes of events (Goal, Yellow/Red Card, and Substitution). As such, the
dataset is easily scalable. These annotations are manually refined to a one
second resolution by anchoring them at a single timestamp following
well-defined soccer rules. With an average of one event every 6.9 minutes, this
dataset focuses on the problem of localizing very sparse events within long
videos. We define the task of spotting as finding the anchors of soccer events
in a video. Making use of recent developments in the realm of generic action
recognition and detection in video, we provide strong baselines for detecting
soccer events. We show that our best model for classifying temporal segments of
length one minute reaches a mean Average Precision (mAP) of 67.8%. For the
spotting task, our baseline reaches an Average-mAP of 49.7% for tolerances
ranging from 5 to 60 seconds. Our dataset and models are available at
https://silviogiancola.github.io/SoccerNet.
Comment: CVPR Workshop on Computer Vision in Sports 201
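The spotting task described above can be illustrated with a minimal sketch. It first checks the stated sparsity (764 hours over 6,637 annotations), then matches predicted timestamps to ground-truth anchors within a tolerance. The function names, the greedy matching rule, the symmetric tolerance window, and the example timestamps are all assumptions made for illustration. The sketch also reports plain recall averaged over tolerances, which is a simpler stand-in for the Average-mAP metric used in the paper, not a reimplementation of it.

```python
def match_spots(predicted, anchors, tolerance):
    """Greedily match predicted timestamps (seconds) to ground-truth anchors.

    Assumption: a prediction is a true positive when it lies within
    `tolerance` seconds of an anchor that has not been matched yet.
    Returns (true_positives, false_positives, false_negatives).
    """
    remaining = list(anchors)
    true_pos = 0
    for p in sorted(predicted):
        candidates = [a for a in remaining if abs(a - p) <= tolerance]
        if candidates:
            # Consume the closest unmatched anchor so it cannot be reused.
            closest = min(candidates, key=lambda a: abs(a - p))
            remaining.remove(closest)
            true_pos += 1
    false_pos = len(predicted) - true_pos
    false_neg = len(remaining)
    return true_pos, false_pos, false_neg


def mean_recall(predicted, anchors, tolerances=range(5, 61, 5)):
    """Average recall over tolerances from 5 to 60 seconds (step is assumed)."""
    recalls = []
    for tol in tolerances:
        tp, _, fn = match_spots(predicted, anchors, tol)
        recalls.append(tp / (tp + fn) if tp + fn else 0.0)
    return sum(recalls) / len(recalls)


# Sparsity check from the figures in the abstract: minutes of video per event.
minutes_per_event = 764 * 60 / 6637

# Hypothetical timestamps, in seconds from the start of a video.
anchors = [100, 500, 900]
predicted = [103, 530, 2000]

if __name__ == "__main__":
    print(round(minutes_per_event, 1))
    print(match_spots(predicted, anchors, 5))
    print(mean_recall(predicted, anchors))
```

Loosening the tolerance lets the prediction at 530 s match the anchor at 500 s, which is why metrics of this kind are averaged over a range of tolerances.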
High-level feature detection from video in TRECVid: a 5-year retrospective of achievements
Successful and effective content-based access to digital video requires fast, accurate and scalable methods to determine the video content automatically. A variety of contemporary approaches to this rely on text taken from speech within the video, or on matching one video frame against others using low-level characteristics like colour, texture, or shapes, or on determining and matching objects appearing within the video. Possibly the most important technique, however, is one which determines the presence or absence of a high-level or semantic feature, within a video clip or shot. By utilizing dozens, hundreds or even thousands of such semantic features we can support many kinds of content-based video navigation. Critically however, this depends on being able to determine whether each feature is or is not present in a video clip.
The last 5 years have seen much progress in the development of techniques to determine the presence of semantic features within video. This progress can be tracked in the annual TRECVid benchmarking activity where dozens of research groups measure the effectiveness of their techniques on common data and using an open, metrics-based approach. In this chapter we summarise the work done on the TRECVid high-level feature task, showing the progress made year-on-year. This provides a fairly comprehensive statement on where the state-of-the-art is regarding this important task, not just for one research group or for one approach, but across the spectrum. We then use this past and on-going work as a basis for highlighting the trends that are emerging in this area, and the questions which remain to be addressed before we can achieve large-scale, fast and reliable high-level feature detection on video.
Learning to detect chest radiographs containing lung nodules using visual attention networks
Machine learning approaches hold great potential for the automated detection
of lung nodules in chest radiographs, but training the algorithms requires very
large amounts of manually annotated images, which are difficult to obtain. Weak
labels indicating whether a radiograph is likely to contain pulmonary nodules
are typically easier to obtain at scale by parsing historical free-text
radiological reports associated with the radiographs. Using a repository of
over 700,000 chest radiographs, in this study we demonstrate that promising
nodule detection performance can be achieved using weak labels through
convolutional neural networks for radiograph classification. We propose two
network architectures for the classification of images likely to contain
pulmonary nodules using both weak labels and manually-delineated bounding
boxes, when these are available. Annotated nodules are used at training time to
deliver a visual attention mechanism informing the model about its localisation
performance. The first architecture extracts saliency maps from high-level
convolutional layers and compares the estimated position of a nodule against
the ground truth, when this is available. A corresponding localisation error is
then back-propagated along with the softmax classification error. The second
approach consists of a recurrent attention model that learns to observe a short
sequence of smaller image portions through reinforcement learning. When a
nodule annotation is available at training time, the reward function is
modified accordingly so that exploring portions of the radiographs away from a
nodule incurs a larger penalty. Our empirical results demonstrate the potential
advantages of these architectures in comparison to competing methodologies.
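The first architecture's training signal, a classification error plus a localisation error that applies only when a nodule annotation exists, can be sketched as below. This is an illustrative assumption rather than the authors' formulation: the squared-distance localisation term, the `loc_weight` factor, and all function and parameter names are hypothetical.

```python
import math


def softmax_cross_entropy(logits, label):
    """Cross-entropy of a softmax over `logits` against the true class index."""
    peak = max(logits)  # subtract the maximum for numerical stability
    log_sum_exp = peak + math.log(sum(math.exp(z - peak) for z in logits))
    return log_sum_exp - logits[label]


def combined_loss(logits, label, est_pos=None, true_pos=None, loc_weight=1.0):
    """Classification loss, plus a localisation penalty when annotated.

    Assumptions: positions are (x, y) pairs in normalised image coordinates,
    the localisation error is a squared Euclidean distance, and `loc_weight`
    balances the two terms.
    """
    loss = softmax_cross_entropy(logits, label)
    if est_pos is not None and true_pos is not None:
        loc_error = sum((e - t) ** 2 for e, t in zip(est_pos, true_pos))
        loss += loc_weight * loc_error
    return loss


if __name__ == "__main__":
    # Weakly labelled image: only the classification term contributes.
    print(combined_loss([0.0, 0.0], 0))
    # Annotated image: the saliency-derived position is also penalised.
    print(combined_loss([0.0, 0.0], 0, est_pos=(0.2, 0.2),
                        true_pos=(0.5, 0.6), loc_weight=2.0))
```

With this structure, weakly labelled radiographs and annotated ones can share a single training loop, since the localisation term simply drops out when no bounding box is available.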
AudioPairBank: Towards A Large-Scale Tag-Pair-Based Audio Content Analysis
Recently, sound recognition has been used to identify sounds, such as car and
river. However, sounds have nuances that may be better described by
adjective-noun pairs such as slow car, and verb-noun pairs such as flying
insects, which are underexplored. Therefore, in this work we investigate the
relation between audio content and both adjective-noun pairs and verb-noun
pairs. Due to the lack of datasets with these kinds of annotations, we
collected and processed the AudioPairBank corpus consisting of a combined total
of 1,123 pairs and over 33,000 audio files. One contribution is the previously
unavailable documentation of the challenges and implications of collecting
audio recordings with these types of labels. A second contribution is to show
the degree of correlation between the audio content and the labels through
sound recognition experiments, which yielded results of 70% accuracy, hence
also providing a performance benchmark. The results and study in this paper
encourage further exploration of the nuances in audio and are meant to
complement similar research performed on images and text in multimedia
analysis.
Comment: This paper is a revised version of "AudioSentibank: Large-scale
Semantic Ontology of Acoustic Concepts for Audio Content Analysis
- …