Visually-Aware Context Modeling for News Image Captioning
The goal of News Image Captioning is to generate an image caption according
to the content of both a news article and an image. To leverage the visual
information effectively, it is important to exploit the connection between the
context in the articles/captions and the images. Psychological studies indicate
that human faces in images attract higher attention priority. On top of that,
humans often play a central role in news stories, as evidenced by the
face-name co-occurrence pattern we discover in existing News Image Captioning
datasets. Therefore, we design a face-naming module for faces in images and
names in captions/articles to learn a better name embedding. Apart from names,
which can be directly linked to an image area (faces), news image captions
mostly contain context information that can only be found in the article.
Humans typically address this by searching for relevant information from the
article based on the image. To emulate this thought process, we design a
retrieval strategy using CLIP to retrieve sentences that are semantically close
to the image. We conduct extensive experiments to demonstrate the efficacy of
our framework. Without using additional paired data, we establish the new
state-of-the-art performance on two News Image Captioning datasets, exceeding
the previous state-of-the-art by 5 CIDEr points. We will release code upon
acceptance
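The retrieval step described above can be sketched in a few lines. This is a minimal illustration, assuming image and sentence embeddings have already been produced by a CLIP-style encoder; the function name and the toy vectors below are hypothetical stand-ins, not taken from the paper.

```python
import numpy as np

def retrieve_sentences(image_emb, sentence_embs, k=2):
    """Return indices of the k sentences whose (CLIP-style) embeddings
    are most similar to the image embedding, by cosine similarity."""
    image_emb = image_emb / np.linalg.norm(image_emb)
    sentence_embs = sentence_embs / np.linalg.norm(sentence_embs, axis=1, keepdims=True)
    sims = sentence_embs @ image_emb          # cosine similarity per sentence
    return np.argsort(-sims)[:k].tolist()     # indices of the k best matches

# Toy 3-d embeddings standing in for real CLIP outputs.
img = np.array([1.0, 0.0, 0.0])
sents = np.array([
    [0.9, 0.1, 0.0],   # semantically close to the image
    [0.0, 1.0, 0.0],   # unrelated
    [0.8, 0.0, 0.2],   # also close
])
print(retrieve_sentences(img, sents, k=2))  # → [0, 2]
```

In practice the sentence embeddings would come from encoding each article sentence with CLIP's text encoder, and the retrieved sentences would be passed to the captioning model as context.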
Taking the bite out of automated naming of characters in TV video
We investigate the problem of automatically labelling appearances of characters in TV or film material
with their names. This is tremendously challenging due to the huge variation in imaged appearance of each character and the weakness and ambiguity of available annotation. However, we demonstrate that high precision can be achieved by combining multiple sources of information, both visual and textual. The principal novelties that we introduce are: (i) automatic generation of time stamped character annotation by aligning subtitles and transcripts; (ii) strengthening the supervisory information by identifying
when characters are speaking. In addition, we incorporate complementary cues of face matching and clothing matching to propose common annotations for face tracks, and consider choices of classifier which can potentially correct errors made in the automatic extraction of training data from the weak textual annotation. Results are presented on episodes of the TV series "Buffy the Vampire Slayer".
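The subtitle/transcript alignment idea can be conveyed with a minimal sketch: subtitles carry timestamps but no speaker names, transcripts carry names but no timestamps, and matching the spoken text links the two. Real systems use timing constraints and approximate string matching; the exact-match toy below, with hypothetical data, only illustrates the principle.

```python
def align_names(subtitles, transcript):
    """Attach speaker names from an untimed transcript to timed subtitles
    by matching the spoken text (a toy stand-in for robust alignment)."""
    speaker_of = {text: name for name, text in transcript}
    return [(start, end, speaker_of.get(text), text)
            for start, end, text in subtitles]

# Subtitles: (start, end, text).  Transcript: (speaker, text).
subs = [(12.0, 14.5, "We have to stop him."),
        (15.0, 16.2, "How?")]
script = [("BUFFY", "We have to stop him."),
          ("WILLOW", "How?")]
print(align_names(subs, script))
# → [(12.0, 14.5, 'BUFFY', 'We have to stop him.'),
#    (15.0, 16.2, 'WILLOW', 'How?')]
```

The result is the time-stamped character annotation the paper describes, which can then supervise face-track labelling.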
Captured by the camera's eye: Guantanamo and the shifting frame of the Global War on Terror
In January 2002, images of the detention of prisoners held at US Naval Station Guantanamo Bay as part of the Global War on Terrorism were released by the US Department of Defense, a public relations move that Secretary of Defense Donald Rumsfeld later referred to as "probably unfortunate". These images, widely reproduced in the media,
quickly came to symbolise the facility and the practices at work there. Nine years on, the images of orange-clad "detainees" – the "orange series" – remain a powerful symbol of US military practices and play a significant role in the resistance to the site. However, as the site has evolved, so too has its visual representation. Official images of these new facilities not only document this evolution but work to constitute, through a careful (re)framing (literal and figurative), a new (re)presentation of the site, and therefore the identities of those
involved. The new series of images not only (re)inscribes the identities of detainees as dangerous but, more importantly, works to constitute the US State as humane and modern. These images are part of a broader effort by the US administration to resituate its image, and remind us, as IR scholars, to look at the diverse set of practices (beyond simply spoken language) to understand the complexity of international politics.
Level Playing Field for Million Scale Face Recognition
Face recognition is perceived as a solved problem; however, when tested
at the million scale, it exhibits dramatic variation in accuracy across
different algorithms. Are the algorithms very different? Is access to good/big
training data their secret weapon? Where should face recognition improve? To
address those questions, we created a benchmark, MF2, that requires all
algorithms to be trained on the same data and tested at the million scale. MF2 is
a public large-scale set with 672K identities and 4.7M photos, created to
level the playing field for large-scale face recognition. We contrast our
results with findings from the other two large-scale benchmarks MegaFace
Challenge and MS-Celeb-1M, where groups were allowed to train on any
private/public/big/small set. Some key discoveries: 1) algorithms, trained on
MF2, were able to achieve state-of-the-art results comparable to algorithms
trained on massive private sets; 2) some outperformed themselves once trained
on MF2; 3) invariance to aging suffers from low accuracy, as in MegaFace,
identifying the need for larger age variation, possibly within identities, or
adjustment of algorithms in future testing.
Information extraction from multimedia web documents: an open-source platform and testbed
The LivingKnowledge project aimed to enhance the current state of the art in search, retrieval and knowledge management on the web by advancing the use of sentiment and opinion analysis within multimedia applications. To achieve this aim, a diverse set of novel and complementary analysis techniques has been integrated into a single but extensible software platform on which such applications can be built. The platform combines state-of-the-art techniques for extracting facts, opinions and sentiment from multimedia documents, and unlike earlier platforms, it exploits both visual and textual techniques to support multimedia information retrieval. Foreseeing the usefulness of this software in the wider community, the platform has been made generally available as an open-source project. This paper describes the platform design, gives an overview of the analysis algorithms integrated into the system and describes two applications that utilise the system for multimedia information retrieval.
- …