Learnable PINs: Cross-Modal Embeddings for Person Identity
We propose and investigate an identity sensitive joint embedding of face and
voice. Such an embedding enables cross-modal retrieval from voice to face and
from face to voice. We make the following four contributions: first, we show
that the embedding can be learnt from videos of talking faces, without
requiring any identity labels, using a form of cross-modal self-supervision;
second, we develop a curriculum learning schedule for hard negative mining
targeted to this task, that is essential for learning to proceed successfully;
third, we demonstrate and evaluate cross-modal retrieval for identities unseen
and unheard during training over a number of scenarios and establish a
benchmark for this novel task; finally, we show an application of using the
joint embedding for automatically retrieving and labelling characters in TV
dramas.
Comment: To appear in ECCV 2018
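The curriculum schedule for hard negative mining described above can be sketched roughly as follows. This is a hypothetical illustration under assumed names and a toy random batch, not the paper's actual mining schedule or loss: the idea is that a difficulty parameter restricts negative sampling to the hardest fraction of candidates, and tightening it over training makes negatives progressively harder.

```python
import numpy as np

def hardest_negatives(face_emb, voice_emb, tau):
    """For each positive (face_i, voice_i) pair, pick a negative voice j != i.

    tau in (0, 1] controls curriculum difficulty: negatives are sampled
    from the hardest tau-fraction of candidates (the most similar wrong
    voices). Decreasing tau over training hardens the curriculum.
    """
    # cosine similarity between every face and every voice embedding
    f = face_emb / np.linalg.norm(face_emb, axis=1, keepdims=True)
    v = voice_emb / np.linalg.norm(voice_emb, axis=1, keepdims=True)
    sim = f @ v.T                          # (N, N) similarity matrix
    np.fill_diagonal(sim, -np.inf)         # exclude the positive itself

    negatives = []
    for i in range(sim.shape[0]):
        order = np.argsort(-sim[i])        # candidates, hardest first
        k = max(1, int(tau * (len(order) - 1)))
        negatives.append(np.random.choice(order[:k]))
    return np.array(negatives)

# toy batch: 8 identities with 64-dim face and voice embeddings
rng = np.random.default_rng(0)
faces = rng.normal(size=(8, 64))
voices = rng.normal(size=(8, 64))
neg = hardest_negatives(faces, voices, tau=0.25)
```

In an actual training loop the selected negatives would feed a contrastive or triplet loss while tau is annealed across epochs.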
Small Object Detection and Recognition Using Context and Representation Learning
University of Technology Sydney, Faculty of Engineering and Information Technology.
Small object detection and recognition are very common in real-world applications, such as remote-sensing image analysis for Earth Vision, Unmanned Aerial Vehicle vision, and video surveillance for identity recognition. Recently, existing methods have achieved impressive results on large and medium objects, but detection and recognition performance for small or even tiny objects is still far from satisfactory.
The problem is highly challenging because small objects in low-resolution images may contain fewer than a hundred pixels and lack sufficient detail. Context plays an important role in small object detection and recognition. To boost detection performance, we propose a novel discriminative learning and graph-cut framework that exploits the semantic information among a target object’s neighbours. Moreover, to depict a local neighbourhood relationship, we introduce a pairwise constraint into a tiny-face detector to improve detection accuracy. Finally, to describe such a constraint, we convert the regression problem of estimating the similarity between different candidates into a classification problem that produces a classification score for each pair of candidates.
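The regression-to-classification conversion for candidate pairs might look like the following minimal sketch. All names and the logistic form are assumptions for illustration; the thesis's actual pair classifier and features are not specified here:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def pair_score(feat_i, feat_j, w, b=0.0):
    """Classification score in (0, 1) for a pair of candidate detections.

    Instead of regressing a real-valued similarity, the pair is classified
    (here with a logistic unit over the concatenated candidate features);
    the weights `w` stand in for a trained pairwise classifier.
    """
    x = np.concatenate([feat_i, feat_j])
    return float(sigmoid(w @ x + b))

# toy candidates: two 4-dim detection features and random weights
rng = np.random.default_rng(3)
w = rng.normal(size=8)
feat_a, feat_b = rng.normal(size=4), rng.normal(size=4)
s = pair_score(feat_a, feat_b, w)
```

A score near 1 would indicate the pair satisfies the neighbourhood constraint; such scores could then feed the graph-cut stage.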
In representation learning, we propose an RL-GAN architecture, which enhances the discriminability of the low-resolution (LR) image representation, yielding classification performance comparable with that obtained on high-resolution (HR) images. In addition, we propose a method based on a Residual Representation to generate a more effective representation of LR images. The Residual Representation is designed to feed back the details lost in the representation space of LR images. Finally, we produce a new dataset, WIDER-SHIP, which provides paired multi-resolution images of ships in satellite imagery and can be used to evaluate not only LR image classification but also LR object recognition.
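The core of the residual idea can be illustrated on toy feature vectors: rather than mapping LR features directly to HR features, predict only the residual (HR minus LR) and add it back. The linear predictor and synthetic degradation below are assumptions for illustration, not the thesis's network:

```python
import numpy as np

rng = np.random.default_rng(1)
hr = rng.normal(size=(500, 32))                # features of HR images
degrade = np.eye(32) * 0.6                     # synthetic detail loss
lr = hr @ degrade + rng.normal(scale=0.05, size=hr.shape)

# learn a linear predictor of the residual (HR - LR) from the LR feature
W, *_ = np.linalg.lstsq(lr, hr - lr, rcond=None)
enhanced = lr + lr @ W                         # feed the residual back

err_lr = np.mean((lr - hr) ** 2)               # error of raw LR features
err_enh = np.mean((enhanced - hr) ** 2)        # error after residual fusion
```

On this toy data the residual-corrected features land much closer to the HR features than the raw LR features do, which is the effect the Residual Representation targets.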
In the domain of small-sample training, we explore a novel data augmentation framework that extends a training set to better cover the varying orientations of objects in the testing data, thereby improving the performance of CNNs for object detection. We then design a principal-axis orientation descriptor based on super-pixel segmentation to represent the orientation of an object in an image, and propose a similarity measure between two datasets based on their principal-axis orientation distributions. We evaluate the performance and show the effectiveness of CNNs for object detection with and without rotating images in the testing set.
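The orientation-coverage argument can be sketched with a simple histogram stand-in for the principal-axis descriptor. Everything below (bin count, histogram intersection as the similarity measure, 90-degree rotation steps) is an assumed simplification, not the thesis's actual descriptor:

```python
import numpy as np

def orientation_histogram(angles_deg, bins=12):
    """Normalised histogram of object orientations (a stand-in for a
    principal-axis orientation descriptor; axes are modulo 180 degrees)."""
    hist, _ = np.histogram(np.asarray(angles_deg) % 180, bins=bins,
                           range=(0, 180))
    return hist / max(hist.sum(), 1)

def histogram_intersection(p, q):
    """Similarity of two orientation distributions, in [0, 1]."""
    return float(np.minimum(p, q).sum())

def augment_with_rotations(angles_deg, steps=(0, 90, 180, 270)):
    """Extend a training set by rotating each sample, broadening the
    coverage of orientations seen at test time."""
    return [(a + s) % 360 for a in angles_deg for s in steps]

train = [5, 10, 12, 8]       # training objects oriented near 0 degrees
test = [95, 100, 93, 87]     # testing objects oriented near 90 degrees

before = histogram_intersection(orientation_histogram(train),
                                orientation_histogram(test))
after = histogram_intersection(orientation_histogram(augment_with_rotations(train)),
                               orientation_histogram(test))
```

The augmented training distribution overlaps the testing distribution far more than the original one, which is the coverage improvement the framework aims for.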
The dissertation is directed by Professor Xiangjian He and Doctor Wenjing Jia of the University of Technology Sydney, Australia, and Professor Jiangbin Zheng of Northwestern Polytechnical University, China.
Finding Person Relations in Image Data of the Internet Archive
The multimedia content in the World Wide Web is rapidly growing and contains
valuable information for many applications in different domains. For this
reason, the Internet Archive initiative has been gathering billions of
time-versioned web pages since the mid-nineties. However, the huge amount of
data is rarely labeled with appropriate metadata and automatic approaches are
required to enable semantic search. Normally, the textual content of the
Internet Archive is used to extract entities and their possible relations
across domains such as politics and entertainment, whereas image and video
content is usually neglected. In this paper, we introduce a system for person
recognition in image content of web news stored in the Internet Archive. Thus,
the system complements entity recognition in text and allows researchers and
analysts to track media coverage and relations of persons more precisely. Based
on a deep learning face recognition approach, the system automatically detects
persons of interest and gathers sample material, which is subsequently used to
identify them in the image data of the Internet Archive. We evaluate the
performance of the face recognition system on an appropriate standard benchmark
dataset and demonstrate the feasibility of the approach with two use cases.
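The identification step of such a system can be sketched as matching a detected face's embedding against a gallery built from the gathered sample material. The cosine-similarity matching, threshold value, and random toy embeddings below are assumptions for illustration, not the paper's configuration:

```python
import numpy as np

def identify(face_embedding, gallery, threshold=0.6):
    """Match a face embedding against a gallery of persons of interest.

    `gallery` maps a person's name to a reference embedding (e.g. the mean
    of their sample material). Returns the best-matching name whose cosine
    similarity exceeds the threshold, or None for an unknown face.
    """
    def cos(a, b):
        return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

    best_name, best_sim = None, threshold
    for name, ref in gallery.items():
        s = cos(face_embedding, ref)
        if s > best_sim:
            best_name, best_sim = name, s
    return best_name

# toy gallery: two persons of interest with 64-dim reference embeddings
rng = np.random.default_rng(2)
ref_a, ref_b = rng.normal(size=64), rng.normal(size=64)
gallery = {"person_a": ref_a, "person_b": ref_b}
query = ref_a + rng.normal(scale=0.1, size=64)   # noisy view of person_a
result = identify(query, gallery)
```

In the described system, each face detected in an archived news image would be embedded by the face recognition network and passed through a matcher like this to link it to a tracked person.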