Binary Constrained Deep Hashing Network for Image Retrieval Without Manual Annotation
Learning compact binary codes for image retrieval task using deep neural
networks has attracted increasing attention recently. However, training deep
hashing networks for the task is challenging due to the binary constraints on
the hash codes, the similarity preserving property, and the requirement for a
vast amount of labelled images. To the best of our knowledge, none of the
existing methods has tackled all of these challenges completely in a unified
framework. In this work, we propose a novel end-to-end deep learning approach
for the task, in which the network is trained to produce binary codes directly
from image pixels without the need for manual annotation. In particular, to deal
with the non-smoothness of binary constraints, we propose a novel pairwise
constrained loss function, which simultaneously encodes the distances between
pairs of hash codes, and the binary quantization error. In order to train the
network with the proposed loss function, we propose an efficient parameter
learning algorithm. In addition, to provide similar/dissimilar training images to train the network, we exploit 3D models reconstructed from unlabelled images to automatically generate an enormous number of training image pairs. The extensive
experiments on image retrieval benchmark datasets demonstrate the improvements
of the proposed method over the state-of-the-art compact representation methods
on the image retrieval problem.
Comment: Accepted to WACV 201
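The kind of objective described above — a pairwise term on code distances plus a binary quantization penalty — can be sketched as follows. This is an illustrative reconstruction, not the paper's exact loss: the margin `4 * len(u_i)` for dissimilar pairs and the weight `alpha` are assumptions.

```python
import numpy as np

def pairwise_constrained_loss(u_i, u_j, similar, alpha=0.5):
    """Illustrative pairwise loss for deep hashing: pull together (push apart)
    the real-valued network outputs of similar (dissimilar) image pairs, while
    penalising their distance from exact binary codes in {-1, +1}.

    u_i, u_j : real-valued network outputs for an image pair
    similar  : True if the pair should receive nearby hash codes
    alpha    : weight of the binary quantization penalty (assumed value)
    """
    dist = np.sum((u_i - u_j) ** 2)                              # code distance
    pair_term = dist if similar else max(0.0, 4 * len(u_i) - dist)  # margin for dissimilar pairs
    # quantization error: distance of each output from its nearest binary code
    quant = np.sum((u_i - np.sign(u_i)) ** 2) + np.sum((u_j - np.sign(u_j)) ** 2)
    return pair_term + alpha * quant

u = np.array([0.9, -0.8, 0.7])
v = np.array([1.0, -1.0, 0.6])
loss_sim = pairwise_constrained_loss(u, v, similar=True)
loss_dis = pairwise_constrained_loss(u, v, similar=False)
```

The quantization term is what lets the real-valued network outputs be thresholded to exact binary codes at test time with little loss.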
Cross-Paced Representation Learning with Partial Curricula for Sketch-based Image Retrieval
In this paper we address the problem of learning robust cross-domain
representations for sketch-based image retrieval (SBIR). While most SBIR
approaches focus on extracting low- and mid-level descriptors for direct
feature matching, recent works have shown the benefit of learning coupled
feature representations to describe data from two related sources. However,
cross-domain representation learning methods are typically cast into non-convex
minimization problems that are difficult to optimize, leading to unsatisfactory
performance. Inspired by self-paced learning, a learning methodology designed
to overcome convergence issues related to local optima by exploiting the
samples in a meaningful order (i.e. easy to hard), we introduce the cross-paced
partial curriculum learning (CPPCL) framework. Compared with existing
self-paced learning methods which only consider a single modality and cannot
deal with prior knowledge, CPPCL is specifically designed to assess the
learning pace by jointly handling data from dual sources and modality-specific
prior information provided in the form of partial curricula. Additionally,
thanks to the learned dictionaries, we demonstrate that the proposed CPPCL
embeds robust coupled representations for SBIR. Our approach is extensively
evaluated on four publicly available datasets (i.e. CUFS, Flickr15K, QueenMary
SBIR and TU-Berlin Extension datasets), showing superior performance over
competing SBIR methods.
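The "easy-to-hard" ordering that motivates CPPCL can be illustrated with the classic self-paced hard-weighting rule: a sample is admitted into training only once its loss falls below a pace parameter λ that grows over training. This is the generic single-modality scheme the paper builds on, not CPPCL's dual-source, partial-curriculum objective.

```python
def self_paced_weights(losses, lam):
    """Standard self-paced hard weighting: admit a sample into training
    only when its current loss is below the pace parameter lam."""
    return [1.0 if loss < lam else 0.0 for loss in losses]

losses = [0.2, 1.5, 0.7, 3.0]
early = self_paced_weights(losses, lam=0.5)   # only the easiest sample
late = self_paced_weights(losses, lam=2.0)    # all but the hardest
```

Growing `lam` across epochs gradually admits harder samples, which is the convergence aid the abstract refers to.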
Deep Learning for Free-Hand Sketch: A Survey
Free-hand sketches are highly illustrative, and have been widely used by
humans to depict objects or stories from ancient times to the present. The
recent prevalence of touchscreen devices has made sketch creation a much easier
task than ever and consequently made sketch-oriented applications increasingly
popular. The progress of deep learning has immensely benefited free-hand sketch
research and applications. This paper presents a comprehensive survey of the
deep learning techniques oriented at free-hand sketch data, and the
applications that they enable. The main contents of this survey include: (i) A
discussion of the intrinsic traits and unique challenges of free-hand sketch,
to highlight the essential differences between sketch data and other data
modalities, e.g., natural photos. (ii) A review of the developments of
free-hand sketch research in the deep learning era, by surveying existing
datasets, research topics, and the state-of-the-art methods through a detailed
taxonomy and experimental evaluation. (iii) Promotion of future work via a
discussion of bottlenecks, open problems, and potential research directions for
the community.
Comment: This paper is accepted by IEEE TPAM
Contrastive Masked Autoencoders for Self-Supervised Video Hashing
Self-Supervised Video Hashing (SSVH) models learn to generate short binary
representations for videos without ground-truth supervision, facilitating
large-scale video retrieval efficiency and attracting increasing research
attention. The success of SSVH lies in the understanding of video content and
the ability to capture the semantic relation among unlabeled videos. Typically,
state-of-the-art SSVH methods consider these two points in a two-stage training
pipeline, where they firstly train an auxiliary network by instance-wise
mask-and-predict tasks and secondly train a hashing model to preserve the
pseudo-neighborhood structure transferred from the auxiliary network. This
consecutive training strategy is inflexible and also unnecessary. In this
paper, we propose a simple yet effective one-stage SSVH method called ConMH,
which incorporates video semantic information and video similarity relationship
understanding in a single stage. To capture video semantic information for
better hashing learning, we adopt an encoder-decoder structure to reconstruct
the video from its temporal-masked frames. Particularly, we find that a higher
masking ratio helps video understanding. Besides, we fully exploit the
similarity relationship between videos by maximizing agreement between two
augmented views of a video, which contributes to more discriminative and robust
hash codes. Extensive experiments on three large-scale video datasets (i.e.,
FCVID, ActivityNet and YFCC) indicate that ConMH achieves state-of-the-art
results. Code is available at https://github.com/huangmozhi9527/ConMH.
Comment: This work is accepted by the AAAI 2023. 9 pages, 6 figures, 6 table
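The "maximizing agreement between two augmented views" component can be sketched with a standard InfoNCE-style contrastive loss over a batch; the temperature and cosine similarity below are conventional choices assumed for illustration, not details taken from the paper.

```python
import numpy as np

def contrastive_agreement(z1, z2, tau=0.5):
    """InfoNCE-style loss over a batch: row i of z1 should agree with row i
    of z2 (two augmented views of the same video) and disagree with every
    other row. Illustrative sketch, not ConMH's exact loss."""
    z1 = z1 / np.linalg.norm(z1, axis=1, keepdims=True)
    z2 = z2 / np.linalg.norm(z2, axis=1, keepdims=True)
    sim = z1 @ z2.T / tau                       # pairwise cosine similarities
    sim = sim - sim.max(axis=1, keepdims=True)  # numerical stability
    log_prob = sim - np.log(np.exp(sim).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_prob))          # matched pairs sit on the diagonal

rng = np.random.default_rng(0)
z = rng.normal(size=(8, 16))
aligned = contrastive_agreement(z, z + 0.01 * rng.normal(size=(8, 16)))
mismatched = contrastive_agreement(z, z[::-1].copy())
```

Matched views produce a much lower loss than mismatched ones, which is exactly the signal that shapes the hash codes.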
Faster Person Re-Identification: One-shot-Filter and Coarse-to-Fine Search
Fast person re-identification (ReID) aims to search person images quickly and accurately. The main idea of recent fast ReID methods is the hashing algorithm, which learns compact binary codes and performs fast Hamming-distance computation and counting sort. However, a very long code (e.g. 2048 bits) is needed for high accuracy, which compromises search speed. In this work, we introduce a new solution for fast ReID by formulating a novel Coarse-to-Fine (CtF) hashing code search strategy, which complementarily uses short and long codes, achieving both faster speed and better accuracy. It uses shorter codes to coarsely rank broad matching similarities and longer codes to refine only a few top candidates for more accurate instance ReID. Specifically, we design an All-in-One (AiO) module together with a Distance Threshold Optimization (DTO) algorithm. In AiO, we simultaneously learn and enhance multiple codes of different lengths in a single model. It learns multiple codes in a pyramid structure and encourages shorter codes to mimic longer codes by self-distillation. DTO solves a complex threshold search problem by a simple optimization process, and the balance between accuracy and speed is easily controlled by a single parameter. It formulates the optimization target as an Fβ score that can be optimised with Gaussian cumulative distribution functions. Besides, we find that even a short code (e.g. 32 bits) still takes a long time on a large-scale gallery due to the O(n) time complexity. To solve this problem, we propose a gallery-size-free, latent-attributes-based One-Shot-Filter (OSF) strategy, which always runs in O(1) time, to quickly filter out the many easy negative gallery images. Specifically, we design a Latent-Attribute-Learning (LAL) module supervised by a Single-Direction-Metric (SDM) loss. LAL is derived from principal component analysis (PCA): it keeps the largest variance in the shortest feature vector while enabling batch and end-to-end learning.
Every logit of the feature vector represents a meaningful attribute. SDM is carefully designed for fine-grained attribute supervision, outperforming common metrics such as the Euclidean and cosine metrics. Experimental results on two datasets show that CtF+OSF is not only 2% more accurate but also 5× faster than contemporary hashing ReID methods. Compared with non-hashing ReID methods, CtF is 50× faster with comparable accuracy. OSF further speeds up CtF by another 2×, up to 10× in total, with almost no accuracy drop.
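The coarse-to-fine search itself is easy to sketch with packed binary codes: rank the whole gallery by Hamming distance on the short codes, then re-rank only the top-k candidates with the long codes. The code lengths, the value of k, and the prefix relationship between short and long codes below are illustrative assumptions, not the paper's configuration.

```python
import numpy as np

def hamming(a, B):
    """Hamming distances between one packed uint8 code and a stack of codes."""
    return np.unpackbits(a ^ B, axis=1).sum(axis=1)

def coarse_to_fine_search(q_short, q_long, g_short, g_long, k=5):
    """Coarse-to-fine code search: the short code ranks the whole gallery
    cheaply, the long code re-ranks only the top-k candidates."""
    coarse = np.argsort(hamming(q_short, g_short), kind="stable")[:k]
    fine = np.argsort(hamming(q_long, g_long[coarse]), kind="stable")
    return coarse[fine]                       # gallery indices, best match first

rng = np.random.default_rng(1)
g_long = rng.integers(0, 256, size=(100, 256), dtype=np.uint8)  # 2048-bit codes
g_short = g_long[:, :4]                                         # 32-bit prefix (assumed layout)
q = g_long[42]                                                  # query = gallery item 42
ranked = coarse_to_fine_search(q[:4], q, g_short, g_long)
```

Only k gallery items ever touch the long codes, which is where the speedup over a single long-code search comes from.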
Semi-Supervised Learning for Scalable and Robust Visual Search
Unlike textual document retrieval, searching visual data is still far from satisfactory. There exist major gaps between the available solutions and practical needs in both accuracy and computational cost. This thesis aims at the development of robust and scalable solutions for visual search and retrieval. Specifically, we investigate two classes of approaches: graph-based semi-supervised learning and hashing techniques. The graph-based approaches are used to improve accuracy, while hashing approaches are used to improve efficiency and cope with large-scale applications. A common theme shared between these two subareas of our work is the focus on the semi-supervised learning paradigm, in which a small set of labeled data is complemented with large unlabeled datasets. Graph-based approaches have emerged as methods of choice for general semi-supervised tasks when no parametric information is available about the data distribution. These methods treat both labeled and unlabeled samples as vertices in a graph and then instantiate pairwise edges between these vertices to capture the affinity between the corresponding samples. A quadratic regularization framework has been widely used for label prediction over such graphs. However, most of the existing graph-based semi-supervised learning methods are sensitive to the graph construction process and the initial labels. We propose a new bivariate graph transduction formulation and an efficient solution via an alternating minimization procedure. Based on this bivariate framework, we also develop new methods to filter unreliable and noisy labels. Extensive experiments over diverse benchmark datasets demonstrate the superior performance of our proposed methods. However, graph-based approaches suffer from a critical scalability bottleneck, since graph construction has quadratic complexity and the inference procedure costs even more.
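For context, the quadratic regularization framework mentioned above admits the well-known closed-form solution F* = (I − αS)⁻¹Y over the symmetrically normalized graph. The sketch below shows that generic solution (in the style of standard label propagation), not the bivariate transduction formulation this thesis proposes.

```python
import numpy as np

def propagate_labels(W, Y, alpha=0.9):
    """Closed-form solution of the quadratic regularization framework for
    graph-based semi-supervised learning: F* = (I - alpha*S)^-1 Y, where
    S = D^-1/2 W D^-1/2 is the normalized affinity matrix. Generic sketch,
    not the bivariate formulation proposed in the thesis."""
    d = W.sum(axis=1)
    S = W / np.sqrt(np.outer(d, d))
    F = np.linalg.solve(np.eye(len(W)) - alpha * S, Y)
    return F.argmax(axis=1)                  # predicted class per vertex

# Two clusters of three vertices, one labeled vertex per cluster.
W = np.array([[0, 1, 1, 0, 0, 0],
              [1, 0, 1, 0, 0, 0],
              [1, 1, 0, 1, 0, 0],
              [0, 0, 1, 0, 1, 1],
              [0, 0, 0, 1, 0, 1],
              [0, 0, 0, 1, 1, 0]], dtype=float)
Y = np.zeros((6, 2))
Y[0, 0] = 1                                  # vertex 0 labeled class 0
Y[5, 1] = 1                                  # vertex 5 labeled class 1
pred = propagate_labels(W, Y)
```

The O(n²) cost of building W and the linear solve is precisely the scalability bottleneck the thesis then attacks with hashing.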
The widely used graph construction method relies on nearest neighbor search, which is prohibitive for large-scale applications. In addition, most large-scale visual search problems involve handling high-dimensional visual descriptors, which poses another challenge: excessive storage requirements. To handle the scalability issue of both computation and storage, the second part of the thesis focuses on efficient techniques for conducting approximate nearest neighbor (ANN) search, which is key to many machine learning algorithms, including graph-based semi-supervised learning and clustering. Specifically, we propose Semi-Supervised Hashing (SSH) methods that leverage semantic similarity over a small set of labeled data while preventing overfitting. We derive a rigorous formulation in which a supervised term minimizes the empirical errors on the labeled data and an unsupervised term provides effective regularization by maximizing variance and independence of individual bits. Experiments on several large datasets demonstrate a clear performance gain over several state-of-the-art methods without a significant increase in computational cost. The main contributions of the thesis include the following. Bivariate graph transduction: a) a bivariate formulation for graph-based semi-supervised learning with an efficient solution by alternating optimization; b) theoretical analysis, from the view of graph cuts, of the bivariate optimization procedure; c) novel applications of the proposed techniques, such as interactive image retrieval, automatic re-ranking for text-based image search, and a brain-computer interface (BCI) for image retrieval.
Semi-supervised hashing: a) a rigorous semi-supervised paradigm for hash function learning with a tradeoff between empirical fitness on pairwise label consistency and an information-theoretic regularizer; b) several efficient solutions for deriving semi-supervised hash functions, including an orthogonal solution using eigen-decomposition, a revised strategy for learning non-orthogonal hash functions, a sequential learning algorithm to derive boosted hash functions, and an extension to unsupervised cases by using pseudo labels. The two parts of the thesis - bivariate graph transduction and semi-supervised hashing - are complementary and can be combined to achieve significant performance improvements in both speed and accuracy. Hash methods can help build sparse graphs in linear time and greatly reduce the data size, but they lack sufficient accuracy. Graph-based methods provide unique capabilities to handle non-linear data structures with noisy labels but suffer from high computational complexity. The synergistic combination of the two offers great potential for advancing the state of the art in large-scale visual search and many other applications.
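The "orthogonal solution using eigen-decomposition" can be sketched as follows: hash projections are the top eigenvectors of an adjusted covariance matrix that combines a supervised fitness term on labeled pairs with a variance term over all data. The pair-label matrix S, the weight `eta`, and the scaling are illustrative assumptions, not the thesis's exact derivation.

```python
import numpy as np

def ssh_projections(X, Xl, S, n_bits, eta=1.0):
    """Sketch of an SSH-style orthogonal solution: take the top eigenvectors
    of the adjusted covariance M = Xl S Xl^T + eta * X X^T, trading off
    empirical fitness on labeled pairs (S_ij = +1 similar, -1 dissimilar)
    against the variance of the bits over all data.
    X : (d, n) all centered data;  Xl : (d, l) labeled subset."""
    M = Xl @ S @ Xl.T + eta * (X @ X.T)
    vals, vecs = np.linalg.eigh(M)
    return vecs[:, np.argsort(vals)[::-1][:n_bits]]   # top eigenvectors

def hash_codes(X, W):
    return (W.T @ X > 0).astype(np.uint8)             # one bit per projection

rng = np.random.default_rng(2)
X = rng.normal(size=(8, 200))
X -= X.mean(axis=1, keepdims=True)                    # center the data
A = rng.normal(size=(10, 10))
S = np.sign(A + A.T)                                  # symmetric +/-1 pair labels (toy)
Xl = X[:, :10]
codes = hash_codes(X, ssh_projections(X, Xl, S, n_bits=4))
```

With `eta = 0` this degenerates to purely supervised fitting; the variance term is what prevents overfitting to the few labeled pairs.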
Content-Based Image Retrieval under Various Deep Learning Training Environments
Thesis (Ph.D.) -- Graduate School of Seoul National University: Department of Electrical and Computer Engineering, February 2022. Advisor: Nam Ik Cho.
Content-based image retrieval, which finds images relevant to a query in a huge database, is one of the fundamental tasks in the field of computer vision. Especially for fast and accurate retrieval, Approximate Nearest Neighbor (ANN) search approaches, represented by Hashing and Product Quantization (PQ), have drawn attention in the image retrieval community. Ever since neural-network-based deep learning showed excellent performance on many computer vision tasks, both Hashing- and product-quantization-based image retrieval systems have adopted deep learning for improvement. In this dissertation, image retrieval methods under various deep learning conditions are investigated to suggest appropriate retrieval systems. Specifically, considering the purpose of image retrieval, supervised learning methods are proposed to develop deep Hashing systems that retrieve semantically similar images, and semi-supervised and unsupervised learning methods are proposed to establish deep product quantization systems that retrieve both semantically and visually similar images. Moreover, considering the characteristics of image retrieval databases, face image sets with numerous class categories and general image sets with one or more labels per image are explored separately when building a retrieval system.
First, supervised learning with the semantic labels given to images is introduced to build a Hashing-based retrieval system. To address the difficulties of distinguishing face images, such as inter-class similarities (similar appearance between different persons) and intra-class variations (the same person with different poses, facial expressions, and illuminations), the identity label of each image is employed to derive discriminative binary codes. To further improve face image retrieval quality, the Similarity Guided Hashing (SGH) scheme is proposed, in which self-similarity learning with multiple data augmentation results is employed during training. For Hashing-based general image retrieval, the Deep Hash Distillation (DHD) scheme is proposed, in which a trainable hash proxy that acts as a class-wise representative is introduced to take advantage of supervised signals. Moreover, a self-distillation scheme adapted for Hashing is utilized to improve general image retrieval performance by appropriately exploiting the potential of augmented data.
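The trainable hash proxy idea can be sketched as a proxy-based classification head: one learnable proxy vector per class, cosine-similarity logits, and cross-entropy pulling each continuous hash output toward its class proxy. The temperature and the toy proxies below are assumptions for illustration, not DHD's actual configuration.

```python
import numpy as np

def proxy_loss(h, proxies, label, tau=0.2):
    """Proxy-based supervision sketch for hashing: the continuous hash output
    h is compared to one trainable proxy per class by cosine similarity, and
    cross-entropy pulls h toward its class proxy. tau is assumed."""
    h = h / np.linalg.norm(h)
    P = proxies / np.linalg.norm(proxies, axis=1, keepdims=True)
    logits = P @ h / tau
    logits = logits - logits.max()           # numerical stability
    return -(logits[label] - np.log(np.exp(logits).sum()))

proxies = np.array([[1.0, 1.0, -1.0],        # toy class-0 proxy
                    [-1.0, 1.0, 1.0]])       # toy class-1 proxy
h = np.array([0.9, 0.8, -0.7])               # output close to the class-0 proxy
good = proxy_loss(h, proxies, label=0)
bad = proxy_loss(h, proxies, label=1)
```

Because the proxies themselves are trainable, they drift toward binary-like class representatives as training proceeds, which is what supplies the supervised signal to the hash layer.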
Second, semi-supervised learning that utilizes both labeled and unlabeled image data is investigated to build a PQ-based retrieval system. Even though supervised deep methods show excellent performance, they fall short of expectations unless sufficient, and expensive, label information is available. Besides, there is the limitation that vast amounts of unlabeled image data are excluded from training. To resolve this issue, a vector-quantization-based semi-supervised image retrieval scheme, the Generalized Product Quantization (GPQ) network, is proposed. A novel metric learning strategy that preserves semantic similarity between labeled data and an entropy regularization term that fully exploits the inherent potential of unlabeled data are employed to improve the retrieval system. This solution increases the generalization capacity of the quantization network, which allows it to overcome the previous limitations.
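The entropy regularization idea for unlabeled data can be sketched on a toy product quantizer: split the feature into sub-vectors, softly assign each to the codewords of its sub-codebook, and measure the entropy of that assignment. The softmax temperature and the exact form of the regularizer are assumptions here, not GPQ's published objective.

```python
import numpy as np

def soft_pq_entropy(x, codebooks, tau=1.0):
    """Average entropy of the soft codeword assignments of x across the m
    sub-codebooks of a product quantizer. Regularizing this entropy spreads
    unlabeled data over codewords instead of collapsing onto a few.
    x : (d,) feature;  codebooks : (m, k, d//m), m sub-codebooks of k words."""
    m, k, ds = codebooks.shape
    total = 0.0
    for i, sub in enumerate(x.reshape(m, ds)):
        dist = np.sum((codebooks[i] - sub) ** 2, axis=1)  # distances to k codewords
        p = np.exp(-dist / tau)
        p /= p.sum()                                      # soft assignment
        total += -(p * np.log(p + 1e-12)).sum()           # assignment entropy
    return total / m

rng = np.random.default_rng(3)
codebooks = rng.normal(size=(2, 4, 3))                    # m=2 sub-spaces, k=4 words, d=6
H = soft_pq_entropy(rng.normal(size=6), codebooks)
x0 = np.concatenate([codebooks[0, 0], codebooks[1, 0]])   # exactly on codewords
H0 = soft_pq_entropy(x0, codebooks, tau=0.01)             # near-hard assignment
```

An input sitting exactly on a codeword yields near-zero entropy; the regularizer pushes the embedding of unlabeled data away from such degenerate, over-confident assignments.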
Lastly, to enable the network to perform visually similar image retrieval on its own without any human supervision, an unsupervised learning algorithm is explored. Although deep supervised Hashing and PQ methods achieve outstanding retrieval performance compared to conventional methods by fully exploiting label annotations, it is painstaking to assign labels precisely for a vast amount of training data, and the annotation process is error-prone. To tackle these issues, a deep unsupervised image retrieval method dubbed the Self-supervised Product Quantization (SPQ) network, which is label-free and trained in a self-supervised manner, is proposed. A newly designed Cross Quantized Contrastive learning strategy is applied to jointly learn the PQ codewords and deep visual representations by comparing individually transformed images (views). This allows the network to understand the image content and extract descriptive features so that visually accurate retrieval can be performed.
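The cross-quantized comparison can be sketched as follows: the raw feature of one view is compared against the quantized reconstruction of the other view, and vice versa. The sketch uses a plain cosine agreement score on a toy quantizer; the actual method trains with a contrastive loss over a batch, and all sizes below are assumptions.

```python
import numpy as np

def quantize(x, codebooks):
    """Hard product quantization: replace each sub-vector by its nearest
    codeword (the reconstruction used on the quantized side of the contrast)."""
    m, k, ds = codebooks.shape
    parts = x.reshape(m, ds)
    out = [codebooks[i][np.argmin(np.sum((codebooks[i] - p) ** 2, axis=1))]
           for i, p in enumerate(parts)]
    return np.concatenate(out)

def cross_quantized_score(z1, z2, codebooks):
    """Cross-quantized agreement sketch: view 1 against quantized view 2 and
    view 2 against quantized view 1, averaged as a cosine score."""
    def cos(a, b):
        return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))
    return 0.5 * (cos(z1, quantize(z2, codebooks)) +
                  cos(z2, quantize(z1, codebooks)))

rng = np.random.default_rng(4)
codebooks = rng.normal(size=(2, 8, 4))        # m=2 sub-spaces, k=8 codewords
z = rng.normal(size=8)
same = cross_quantized_score(z, z + 0.05 * rng.normal(size=8), codebooks)
other = cross_quantized_score(z, rng.normal(size=8), codebooks)
```

Passing the gradient through the quantized side is what couples codeword learning with representation learning; the toy version above only shows the forward comparison.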
By conducting extensive image retrieval experiments on the benchmark datasets, the proposed methods are confirmed to yield outstanding results under various evaluation protocols. For supervised face image retrieval, SGH achieves the best retrieval performance for both low- and high-resolution face images, and DHD demonstrates its efficiency in general image retrieval experiments with state-of-the-art retrieval accuracy. For semi-supervised general image retrieval, GPQ shows the best search results under protocols that use both labeled and unlabeled image data. Finally, for unsupervised general image retrieval, the best retrieval scores are achieved with SPQ even without supervised pre-training, and visually similar images are successfully retrieved as search results.
Abstract i
Contents iv
List of Tables vii
List of Figures viii
1 Introduction 1
1.1 Contribution 3
1.2 Contents 4
2 Supervised Learning for Deep Hashing: Similarity Guided Hashing for Face Image Retrieval / Deep Hash Distillation for General Image Retrieval 5
2.1 Motivation and Overview for Face Image Retrieval 5
2.1.1 Related Works 9
2.2 Similarity Guided Hashing 10
2.3 Experiments 16
2.3.1 Datasets and Setup 16
2.3.2 Results on Small Face Images 18
2.3.3 Results on Large Face Images 19
2.4 Motivation and Overview for General Image Retrieval 20
2.5 Related Works 22
2.6 Deep Hash Distillation 24
2.6.1 Self-distilled Hashing 24
2.6.2 Teacher loss 27
2.6.3 Training 29
2.6.4 Hamming Distance Analysis 29
2.7 Experiments 32
2.7.1 Setup 32
2.7.2 Implementation Details 32
2.7.3 Results 34
2.7.4 Analysis 37
3 Semi-supervised Learning for Product Quantization: Generalized Product Quantization Network for Semi-supervised Image Retrieval 42
3.1 Motivation and Overview 42
3.1.1 Related Work 45
3.2 Generalized Product Quantization 47
3.2.1 Semi-Supervised Learning 48
3.2.2 Retrieval 52
3.3 Experiments 53
3.3.1 Setup 53
3.3.2 Results and Analysis 55
4 Unsupervised Learning for Product Quantization: Self-supervised Product Quantization for Deep Unsupervised Image Retrieval 58
4.1 Motivation and Overview 58
4.1.1 Related Works 61
4.2 Self-supervised Product Quantization 62
4.2.1 Overall Framework 62
4.2.2 Self-supervised Training 64
4.3 Experiments 67
4.3.1 Datasets 67
4.3.2 Experimental Settings 68
4.3.3 Results 71
4.3.4 Empirical Analysis 71
5 Conclusion 75
Abstract (In Korean) 88
- โฆ