3,810 research outputs found
A Dataset for Movie Description
Descriptive video service (DVS) provides linguistic descriptions of movies
and allows visually impaired people to follow a movie along with their peers.
Such descriptions are by design mainly visual and thus naturally form an
interesting data source for computer vision and computational linguistics. In
this work we propose a novel dataset which contains transcribed DVS, which is
temporally aligned to full length HD movies. In addition we also collected the
aligned movie scripts which have been used in prior work and compare the two
different sources of descriptions. In total the Movie Description dataset
contains a parallel corpus of over 54,000 sentences and video snippets from 72
HD movies. We characterize the dataset by benchmarking different approaches for
generating video descriptions. Comparing DVS to scripts, we find that DVS is
far more visual and describes precisely what is shown rather than what should
happen according to the scripts created prior to movie production
Adaptive Nonparametric Image Parsing
In this paper, we present an adaptive nonparametric solution to the image
parsing task, namely annotating each image pixel with its corresponding
category label. For a given test image, first, a locality-aware retrieval set
is extracted from the training data based on super-pixel matching similarities,
which are augmented with feature extraction for better differentiation of local
super-pixels. Then, the category of each super-pixel is initialized by the
majority vote of the -nearest-neighbor super-pixels in the retrieval set.
Instead of fixing as in traditional non-parametric approaches, here we
propose a novel adaptive nonparametric approach which determines the
sample-specific k for each test image. In particular, is adaptively set to
be the number of the fewest nearest super-pixels which the images in the
retrieval set can use to get the best category prediction. Finally, the initial
super-pixel labels are further refined by contextual smoothing. Extensive
experiments on challenging datasets demonstrate the superiority of the new
solution over other state-of-the-art nonparametric solutions.Comment: 11 page
Impact of Ground Truth Annotation Quality on Performance of Semantic Image Segmentation of Traffic Conditions
Preparation of high-quality datasets for the urban scene understanding is a
labor-intensive task, especially, for datasets designed for the autonomous
driving applications. The application of the coarse ground truth (GT)
annotations of these datasets without detriment to the accuracy of semantic
image segmentation (by the mean intersection over union - mIoU) could simplify
and speedup the dataset preparation and model fine tuning before its practical
application. Here the results of the comparative analysis for semantic
segmentation accuracy obtained by PSPNet deep learning architecture are
presented for fine and coarse annotated images from Cityscapes dataset. Two
scenarios were investigated: scenario 1 - the fine GT images for training and
prediction, and scenario 2 - the fine GT images for training and the coarse GT
images for prediction. The obtained results demonstrated that for the most
important classes the mean accuracy values of semantic image segmentation for
coarse GT annotations are higher than for the fine GT ones, and the standard
deviation values are vice versa. It means that for some applications some
unimportant classes can be excluded and the model can be tuned further for some
classes and specific regions on the coarse GT dataset without loss of the
accuracy even. Moreover, this opens the perspectives to use deep neural
networks for the preparation of such coarse GT datasets.Comment: 10 pages, 6 figures, 2 tables, The Second International Conference on
Computer Science, Engineering and Education Applications (ICCSEEA2019) 26-27
January 2019, Kiev, Ukrain
Movie Description
Audio Description (AD) provides linguistic descriptions of movies and allows
visually impaired people to follow a movie along with their peers. Such
descriptions are by design mainly visual and thus naturally form an interesting
data source for computer vision and computational linguistics. In this work we
propose a novel dataset which contains transcribed ADs, which are temporally
aligned to full length movies. In addition we also collected and aligned movie
scripts used in prior work and compare the two sources of descriptions. In
total the Large Scale Movie Description Challenge (LSMDC) contains a parallel
corpus of 118,114 sentences and video clips from 202 movies. First we
characterize the dataset by benchmarking different approaches for generating
video descriptions. Comparing ADs to scripts, we find that ADs are indeed more
visual and describe precisely what is shown rather than what should happen
according to the scripts created prior to movie production. Furthermore, we
present and compare the results of several teams who participated in a
challenge organized in the context of the workshop "Describing and
Understanding Video & The Large Scale Movie Description Challenge (LSMDC)", at
ICCV 2015
- …