Object-Centric Unsupervised Image Captioning
Image captioning is a longstanding problem in the field of computer vision
and natural language processing. To date, researchers have produced impressive
state-of-the-art performance in the age of deep learning. Most of these
state-of-the-art methods, however, require a large volume of annotated image-caption
pairs to train their models. Given an image dataset of interest, a
practitioner must annotate a caption for each image in the training set,
and this process must be repeated for each newly collected image dataset. In
this paper, we explore the task of unsupervised image captioning which utilizes
unpaired images and texts to train the model so that the texts can come from
different sources than the images. A main school of research on this topic that
has been shown to be effective is to construct pairs from the images and texts
in the training set according to their overlap of objects. Unlike in the
supervised setting, however, these constructed pairings are not guaranteed to
have a fully overlapping set of objects. Our work in this paper overcomes this by
harvesting objects corresponding to a given sentence from the training set,
even if they don't belong to the same image. When used as input to a
transformer, such a mixture of objects enables larger, if not full, object
coverage and, when supervised by the corresponding sentence, produces results
that outperform current state-of-the-art unsupervised methods by a significant
margin. Building upon this finding, we further show that (1) additional
information on relationships between objects and attributes of objects also
helps boost performance; and (2) our method also extends well to
non-English image captioning, which usually suffers from scarcer
annotations. Our findings are supported by strong empirical results. Our code
is available at https://github.com/zihangm/obj-centric-unsup-caption.
Comment: ECCV 202
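The object-harvesting idea described in this abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation (their repository is the reference for that): the function names, the toy detections, and the rule of taking the first matching detection are all assumptions made purely for illustration.

```python
from collections import defaultdict


def objects_in_sentence(sentence, vocabulary):
    """Return vocabulary words mentioned in the sentence, in order of first mention."""
    tokens = sentence.lower().replace(".", " ").replace(",", " ").split()
    mentioned = []
    for token in tokens:
        if token in vocabulary and token not in mentioned:
            mentioned.append(token)
    return mentioned


def harvest_objects(mentioned, detections):
    """For each mentioned object, take the first matching detection from ANY image."""
    harvested = []
    for label in mentioned:
        for detection in detections:
            if detection[1] == label:
                harvested.append(detection)
                break
    return harvested


def best_single_image(mentioned, detections):
    """Baseline: the single image whose detected labels overlap the sentence most."""
    labels_by_image = defaultdict(set)
    for image_id, label, _feature in detections:
        labels_by_image[image_id].add(label)
    wanted = set(mentioned)
    return max(
        ((image_id, len(labels & wanted)) for image_id, labels in labels_by_image.items()),
        key=lambda pair: pair[1],
    )


# Hypothetical toy data: (image id, detected label, placeholder feature vector).
detections = [
    ("img1", "dog", [0.1, 0.2]),
    ("img1", "grass", [0.3, 0.1]),
    ("img2", "frisbee", [0.9, 0.4]),
    ("img3", "dog", [0.2, 0.2]),
]
vocabulary = {"dog", "frisbee", "grass", "cat"}
sentence = "A dog catches a frisbee on the grass."

mentioned = objects_in_sentence(sentence, vocabulary)
harvested = harvest_objects(mentioned, detections)
print("mentioned objects:", mentioned)
print("harvested from images:", [d[0] for d in harvested])
print("best single image (id, overlap):", best_single_image(mentioned, detections))
```

In this toy example the harvested set covers all three objects named in the sentence by mixing detections from two images, whereas the best single image covers only two of them. In the setting the abstract describes, the harvested object features would be the transformer input and the sentence the supervision target.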
A New Geometric Approach to Latent Topic Modeling and Discovery
A new geometrically-motivated algorithm for nonnegative matrix factorization
is developed and applied to the discovery of latent "topics" for text and image
"document" corpora. The algorithm is based on robustly finding and clustering
extreme points of empirical cross-document word-frequencies that correspond to
novel "words" unique to each topic. In contrast to related approaches that are
based on solving non-convex optimization problems using suboptimal
approximations, locally-optimal methods, or heuristics, the new algorithm is
convex, has polynomial complexity, and has competitive qualitative and
quantitative performance compared to the current state-of-the-art approaches on
synthetic and real-world datasets.
Comment: This paper was submitted to the IEEE International Conference on
Acoustics, Speech and Signal Processing (ICASSP) 2013 on November 30, 201
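The geometric idea in this abstract, that novel words show up as extreme points of the row-normalized word-by-document frequency matrix, can be illustrated on a small noiseless example. The sketch below is a toy under stated assumptions and not the paper's algorithm: it finds extreme points by projecting onto random directions (a simple stand-in technique), the matrices `beta` and `theta` are hypothetical, and the robust clustering step mentioned in the abstract is omitted.

```python
import random


def row_normalize(matrix):
    """Scale each row so its entries sum to 1 (all-zero rows are left unchanged)."""
    normalized = []
    for row in matrix:
        total = sum(row)
        normalized.append([v / total for v in row] if total > 0 else list(row))
    return normalized


def extreme_rows(rows, n_directions=200, seed=0):
    """Indices of rows that maximize or minimize some random linear projection.

    A linear function over a finite point set attains its extremes at vertices
    of the convex hull, so the collected indices are extreme points.
    """
    rng = random.Random(seed)
    dim = len(rows[0])
    found = set()
    for _ in range(n_directions):
        direction = [rng.gauss(0.0, 1.0) for _ in range(dim)]
        scores = [sum(r * d for r, d in zip(row, direction)) for row in rows]
        found.add(max(range(len(rows)), key=scores.__getitem__))
        found.add(min(range(len(rows)), key=scores.__getitem__))
    return sorted(found)


# Hypothetical topic-by-document weights (3 topics, 4 documents).
theta = [
    [0.7, 0.1, 0.1, 0.4],
    [0.2, 0.8, 0.1, 0.3],
    [0.1, 0.1, 0.8, 0.3],
]
# Hypothetical word-by-topic matrix: words 0-2 are novel (one topic each),
# words 3-5 are shared across topics.
beta = [
    [0.3, 0.0, 0.0],
    [0.0, 0.3, 0.0],
    [0.0, 0.0, 0.3],
    [0.3, 0.3, 0.2],
    [0.2, 0.3, 0.3],
    [0.2, 0.1, 0.2],
]

# Noiseless word-by-document frequencies: X = beta * theta.
X = [[sum(b[k] * theta[k][m] for k in range(3)) for m in range(4)] for b in beta]
novel_words = extreme_rows(row_normalize(X))
print("candidate novel words:", novel_words)
```

After row normalization, each shared word is a strict convex combination of the novel-word rows, so only the novel words can be extreme points in this toy setup. With real, noisy word counts that no longer holds exactly, which is presumably why the abstract stresses robust detection and clustering.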
Evaluating a workspace's usefulness for image retrieval
Image searching is a creative process. We have proposed a novel image retrieval system that supports creative search sessions by allowing the user to organise their search results on a workspace. The workspace’s usefulness is evaluated in a task-oriented and user-centred comparative experiment, involving design professionals and several types of realistic search tasks. In particular, we focus on its effect on task conceptualisation and query formulation. A traditional relevance feedback system serves as a baseline. The results of this study show that the workspace is more useful in terms of both of the above aspects and that the proposed approach leads to a more effective and enjoyable search experience. This paper also highlights the influence of tasks on the users’ search and organisation strategy.
Automatic tagging and geotagging in video collections and communities
Automatically generated tags and geotags hold great promise to improve access to video collections and online communities. We give an overview of three tasks offered in the MediaEval 2010 benchmarking initiative, describing for each its use scenario, its definition, and the data set released. For each task, a reference algorithm is presented that was used within MediaEval 2010 and comments are included on lessons learned. The Tagging Task, Professional involves automatically matching episodes in a collection of Dutch television with subject labels drawn from the keyword thesaurus used by the archive staff. The Tagging Task, Wild Wild Web involves automatically predicting the tags that are assigned by users to their online videos. Finally, the Placing Task requires automatically assigning geo-coordinates to videos. The specification of each task admits the use of the full range of available information including user-generated metadata, speech recognition transcripts, audio, and visual features.