25 research outputs found
Deep Multimodal Image-Text Embeddings for Automatic Cross-Media Retrieval
This paper considers the task of matching images and sentences by learning a
visual-textual embedding space for cross-modal retrieval. Finding such a space
is a challenging task since the features and representations of text and image
are not comparable. In this work, we introduce an end-to-end deep multimodal
convolutional-recurrent network for learning both vision and language
representations simultaneously to infer image-text similarity. The model learns
which pairs are a match (positive) and which ones are a mismatch (negative)
using a hinge-based triplet ranking. To learn about the joint representations,
we leverage our newly extracted collection of tweets from Twitter. The main
characteristic of our dataset is that the images and tweets are not
standardized the same as the benchmarks. Furthermore, there can be a higher
semantic correlation between the pictures and tweets contrary to benchmarks in
which the descriptions are well-organized. Experimental results on MS-COCO
benchmark dataset show that our model outperforms certain methods presented
previously and has competitive performance compared to the state-of-the-art.
The code and dataset have been made available publicly.Comment: 6 pages and 2 figures, Learn more about this project at
https://iasbs.ac.ir/~ansari/deeptwitte
Linking Image and Text with 2-Way Nets
Linking two data sources is a basic building block in numerous computer
vision problems. Canonical Correlation Analysis (CCA) achieves this by
utilizing a linear optimizer in order to maximize the correlation between the
two views. Recent work makes use of non-linear models, including deep learning
techniques, that optimize the CCA loss in some feature space. In this paper, we
introduce a novel, bi-directional neural network architecture for the task of
matching vectors from two data sources. Our approach employs two tied neural
network channels that project the two views into a common, maximally correlated
space using the Euclidean loss. We show a direct link between the
correlation-based loss and Euclidean loss, enabling the use of Euclidean loss
for correlation maximization. To overcome common Euclidean regression
optimization problems, we modify well-known techniques to our problem,
including batch normalization and dropout. We show state of the art results on
a number of computer vision matching tasks including MNIST image matching and
sentence-image matching on the Flickr8k, Flickr30k and COCO datasets.Comment: 14 pages, 2 figures, 6 table
SoccerNet: A Scalable Dataset for Action Spotting in Soccer Videos
In this paper, we introduce SoccerNet, a benchmark for action spotting in
soccer videos. The dataset is composed of 500 complete soccer games from six
main European leagues, covering three seasons from 2014 to 2017 and a total
duration of 764 hours. A total of 6,637 temporal annotations are automatically
parsed from online match reports at a one minute resolution for three main
classes of events (Goal, Yellow/Red Card, and Substitution). As such, the
dataset is easily scalable. These annotations are manually refined to a one
second resolution by anchoring them at a single timestamp following
well-defined soccer rules. With an average of one event every 6.9 minutes, this
dataset focuses on the problem of localizing very sparse events within long
videos. We define the task of spotting as finding the anchors of soccer events
in a video. Making use of recent developments in the realm of generic action
recognition and detection in video, we provide strong baselines for detecting
soccer events. We show that our best model for classifying temporal segments of
length one minute reaches a mean Average Precision (mAP) of 67.8%. For the
spotting task, our baseline reaches an Average-mAP of 49.7% for tolerances
ranging from 5 to 60 seconds. Our dataset and models are available at
https://silviogiancola.github.io/SoccerNet.Comment: CVPR Workshop on Computer Vision in Sports 201