Orthonormal Product Quantization Network for Scalable Face Image Retrieval
Recently, deep hashing with the Hamming distance metric has drawn increasing
attention for face image retrieval. However, its counterpart, deep
quantization, which learns binary code representations with
dictionary-related distance metrics, has seldom been explored for this task.
This paper makes the first attempt to integrate product quantization into an
end-to-end deep learning framework for face image retrieval. Unlike prior deep
quantization methods where the codewords for quantization are learned from
data, we propose a novel scheme using predefined orthonormal vectors as
codewords, which aims to enhance the quantization informativeness and reduce
the codewords' redundancy. To make the most of the discriminative information,
we design a tailored loss function that maximizes the identity discriminability
in each quantization subspace for both the quantized and the original features.
Furthermore, an entropy-based regularization term is imposed to reduce the
quantization error. We conduct experiments on three commonly used datasets
under both single-domain and cross-domain retrieval settings. The results show
that the proposed method significantly outperforms all compared deep
hashing/quantization methods in both settings. The proposed codeword scheme
consistently improves both regular model performance and model generalization
ability, confirming the importance of the codewords' distribution for
quantization quality. Moreover, our model's better generalization ability than
that of deep hashing models indicates that it is more suitable for scalable
face image retrieval tasks.
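To make the mechanism above concrete, here is a minimal sketch (not the authors' implementation) of product quantization with predefined orthonormal codewords: a feature is split into subvectors, and each subvector is snapped to its nearest codeword in a sub-codebook built from orthonormal vectors. All sizes and names (`M`, `K`, `d`) are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
M, K, d = 4, 8, 8            # M subspaces, K codewords per subspace, subvector dim d

# Predefined orthonormal codewords per subspace via QR decomposition (K <= d).
codebooks = []
for _ in range(M):
    q, _ = np.linalg.qr(rng.normal(size=(d, d)))
    codebooks.append(q[:K])   # rows of an orthogonal matrix are orthonormal

def quantize(x):
    """Split x into M subvectors and snap each to its nearest codeword."""
    parts = x.reshape(M, d)
    codes, recon = [], []
    for m, part in enumerate(parts):
        # Nearest codeword in L2; for unit-norm inputs and orthonormal
        # codewords this matches the largest inner product.
        dists = np.linalg.norm(codebooks[m] - part, axis=1)
        k = int(np.argmin(dists))
        codes.append(k)
        recon.append(codebooks[m][k])
    return codes, np.concatenate(recon)

x = rng.normal(size=M * d)
codes, x_q = quantize(x)
print(codes)   # M indices, one per subspace -> the compact code
```

The compact code stores only `M` small indices instead of the full float vector, which is what makes this representation scalable.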
Deep Hashing with Triplet Quantization Loss
With the explosive growth of image databases, deep hashing, which learns
compact binary descriptors for images, has become critical for fast image
retrieval. Many existing deep hashing methods leverage a quantization loss,
defined as the distance between features before and after quantization, to
reduce the error introduced by binarization. While minimizing this loss
guarantees that quantization has minimal effect on retrieval accuracy, it
unfortunately also significantly reduces the expressiveness of the features
even before quantization. In this paper, we show that this definition of
quantization loss is too restrictive and in fact unnecessary for maintaining
high retrieval accuracy. We therefore propose a new form of quantization loss
measured over triplets. The core idea of the triplet quantization loss is to
learn discriminative real-valued descriptors which lead to minimal loss on
retrieval accuracy after quantization. Extensive experiments on two widely used
benchmark data sets of different scales, CIFAR-10 and In-shop, demonstrate that
the proposed method outperforms the state-of-the-art deep hashing methods.
Moreover, we show that the compact binary descriptors obtained with triplet
quantization loss lead to a very small performance drop after quantization.
Comment: 4 pages, to be presented at IEEE VCIP 201
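The triplet idea above can be sketched as follows; this is a hedged toy version, not the paper's exact loss. Instead of pulling each feature toward its binarized version, the loss only asks that the anchor-positive / anchor-negative ranking survive binarization. The margin value and `sign` binarization are illustrative (in training one would need a differentiable surrogate such as a straight-through estimator).

```python
import numpy as np

def triplet_quantization_loss(anchor, positive, negative, margin=1.0):
    """Hinge triplet loss computed on sign-binarized descriptors."""
    qa, qp, qn = np.sign(anchor), np.sign(positive), np.sign(negative)
    d_ap = np.linalg.norm(qa - qp)   # anchor-positive distance after quantization
    d_an = np.linalg.norm(qa - qn)   # anchor-negative distance after quantization
    return max(0.0, d_ap - d_an + margin)

a = np.array([ 0.9,  0.8, -0.7,  0.6])
p = np.array([ 0.5,  0.4, -0.9,  0.3])   # same sign pattern as the anchor
n = np.array([-0.6, -0.5,  0.8, -0.4])   # opposite sign pattern
print(triplet_quantization_loss(a, p, n))  # 0.0: ranking is preserved with margin
```

Note that the real-valued descriptors are free to stay expressive: only the relative ordering after binarization is constrained, which is the point the abstract argues.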
Recent Advance in Content-based Image Retrieval: A Literature Survey
The explosive increase and ubiquitous accessibility of visual data on the Web
have led to the prosperity of research activity in image search or retrieval.
Because they ignore visual content as a ranking clue, text-based search
techniques applied to visual retrieval may suffer from inconsistency between
the query text and the visual content. Content-based image retrieval (CBIR),
which uses representations of visual content to identify relevant images,
has attracted sustained attention over the past two decades. The problem is
challenging due to the intention gap and the semantic gap. Numerous
techniques have been developed for content-based image retrieval in the last
decade. The purpose of this paper is to categorize and evaluate the
algorithms proposed between 2003 and 2016. We conclude with several
promising directions for future research.
Comment: 22 pages
EBoWs: An End-to-End Bag-of-Words Model via Deep Convolutional Neural Network
The traditional Bag-of-Visual-Words (BoWs) model is commonly built in many
steps, including local feature extraction, codebook generation, and feature
quantization. These steps are relatively independent of each other and
hard to optimize jointly. Moreover, the dependency on hand-crafted
local features makes the BoWs model ineffective at conveying high-level semantics.
These issues largely hinder the performance of the BoWs model in large-scale image
applications. To address them, we propose an End-to-End BoWs
(EBoWs) model based on a Deep Convolutional Neural Network (DCNN). Our model
takes an image as input, then identifies and separates the semantic objects in
it, and finally outputs the visual words with high semantic discriminative
power. Specifically, our model first generates Semantic Feature Maps (SFMs)
corresponding to different object categories through convolutional layers, then
introduces Bag-of-Words Layers (BoWL) to generate visual words for each
individual feature map. We also introduce a novel learning algorithm to
enforce sparsity in the generated EBoWs model, which further improves
time and memory efficiency. We evaluate the proposed EBoWs model on
several image search datasets, including CIFAR-10, CIFAR-100, MIRFLICKR-25K and
NUS-WIDE. Experimental results show that our method achieves promising accuracy
and efficiency compared with recent deep learning based retrieval works.
Comment: 8 pages, ChinaMM 2017, image retrieval
Deep LDA Hashing
Conventional supervised hashing methods based on classification do not
entirely meet the requirements of hashing, but Linear Discriminant
Analysis (LDA) does. In this paper, we propose to optimize a revised LDA
objective over deep networks to learn efficient hashing codes in a truly
end-to-end fashion. However, naively optimizing the deep network w.r.t. the
LDA objective requires a complicated eigenvalue decomposition within each
mini-batch in every epoch. In this work, the revised LDA objective is instead
transformed into a simple least-squares problem, which avoids this
intractability and can be easily handled by an off-the-shelf optimizer. This
deep extension also overcomes the weaknesses of LDA Hashing, namely its
limited linear projection and lack of feature learning. Extensive experiments
are conducted on three benchmark datasets. The proposed Deep LDA Hashing
improves over the conventional method by nearly 70 points on the CIFAR-10
dataset and also beats several state-of-the-art methods on various metrics.
Comment: 10 pages, 3 figures
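The eigendecomposition-to-least-squares trick alluded to above can be sketched in its generic form: LDA-style discriminative directions are approximated by regressing centered features onto centered one-hot class-indicator targets, which any off-the-shelf solver handles per mini-batch. This is an illustrative analogy under assumed encodings and random toy data, not the paper's exact revised objective.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 16))          # mini-batch features (toy data)
y = rng.integers(0, 4, size=100)        # class labels for 4 classes (toy data)

T = np.eye(4)[y]                        # one-hot indicator targets
T -= T.mean(axis=0)                     # center targets
Xc = X - X.mean(axis=0)                 # center features

# Least-squares projection replacing the per-batch eigendecomposition.
W, *_ = np.linalg.lstsq(Xc, T, rcond=None)
Z = Xc @ W                              # discriminative low-dim projection
codes = np.sign(Z)                      # binarize projections into hash bits
print(codes.shape)   # (100, 4)
```

Because `lstsq` is a single differentiable-friendly solve, it can sit inside a training loop where an eigensolver per mini-batch would be costly and numerically fragile.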
Exploring Auxiliary Context: Discrete Semantic Transfer Hashing for Scalable Image Retrieval
Unsupervised hashing can desirably support scalable content-based image
retrieval (SCBIR) for its appealing advantages of semantic label independence,
memory and search efficiency. However, the learned hash codes are embedded with
limited discriminative semantics due to the intrinsic limitation of image
representation. To address this problem, we propose a novel
hashing approach dubbed \emph{Discrete Semantic Transfer Hashing} (DSTH).
The key idea is to \emph{directly} augment the semantics of discrete image hash
codes by exploring auxiliary contextual modalities. To this end, a unified
hashing framework is formulated to simultaneously preserve the visual
similarities of images and perform semantic transfer from contextual
modalities. Further, to guarantee direct semantic transfer and avoid
information loss, we explicitly impose discrete, bit-uncorrelation, and
bit-balance constraints on the hash codes. A novel and effective discrete
optimization method based on the augmented Lagrangian multiplier is developed
to solve the optimization problem iteratively. The whole learning process has
linear computational complexity and desirable scalability. Experiments on
three benchmark datasets demonstrate the superiority of DSTH compared with
several state-of-the-art approaches.
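The two code-quality constraints named above can be made concrete with a small diagnostic sketch (illustrative, not the paper's optimizer): bit balance means each bit is +1 and -1 equally often across the database, and bit uncorrelation means the Gram matrix of the code matrix is proportional to the identity.

```python
import numpy as np

def code_quality(B):
    """B: (n, L) matrix of +/-1 hash codes. Returns (balance, uncorrelation),
    where 0.0 means the constraint is perfectly satisfied."""
    n, L = B.shape
    balance = np.abs(B.mean(axis=0)).mean()     # per-bit mean should be 0
    corr = B.T @ B / n - np.eye(L)              # off-diagonals should vanish
    uncorrelation = np.abs(corr).max()
    return float(balance), float(uncorrelation)

# Perfectly balanced, uncorrelated 2-bit codes over 4 items.
B = np.array([[ 1,  1],
              [ 1, -1],
              [-1,  1],
              [-1, -1]])
print(code_quality(B))   # (0.0, 0.0)
```

Balanced bits maximize the entropy of each bit, and uncorrelated bits avoid redundancy, which together explain why these constraints preserve information in short codes.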
Content-based Image Retrieval under Various Deep Learning Training Environments
Thesis (Ph.D.) -- Graduate School of Seoul National University: College of Engineering, Department of Electrical and Computer Engineering, 2022. 2. Nam Ik Cho.
Content-based image retrieval, which finds images relevant to a query in a
huge database, is one of the fundamental tasks in the field of computer
vision. Especially for fast and accurate retrieval, Approximate Nearest
Neighbor (ANN) search approaches, represented by Hashing and Product
Quantization (PQ), have drawn attention in the image retrieval community.
Ever since neural-network-based deep learning showed excellent performance in
many computer vision tasks, both Hashing- and product-quantization-based
image retrieval systems have adopted deep learning for improvement. In this
dissertation, image retrieval methods under various deep learning conditions
are investigated to suggest appropriate retrieval systems. Specifically,
considering the purpose of image retrieval, supervised learning methods are
proposed to develop deep Hashing systems that retrieve semantically similar
images, and semi-supervised and unsupervised learning methods are proposed to
establish deep product quantization systems that retrieve both semantically
and visually similar images. Moreover, considering the characteristics of
image retrieval databases, face image sets with numerous class categories and
general image sets with one or more labeled images are explored separately
when building a retrieval system.
First, supervised learning with the semantic labels given to images is introduced to build a Hashing-based retrieval system. To address the difficulties of distinguishing face images, such as inter-class similarities (similar appearance between different persons) and intra-class variations (the same person with different poses, facial expressions, and illuminations), the identity label of each image is employed to derive discriminative binary codes. To further improve face image retrieval quality, a Similarity Guided Hashing (SGH) scheme is proposed, in which self-similarity learning with multiple data augmentation results is employed during training. For Hashing-based general image retrieval, a Deep Hash Distillation (DHD) scheme is proposed, in which a trainable hash proxy representing each class is introduced to take advantage of supervised signals. Moreover, a self-distillation scheme adapted for Hashing is utilized to improve general image retrieval performance by appropriately exploiting the potential of augmented data.
Second, semi-supervised learning that utilizes both labeled and unlabeled image data is investigated to build a PQ-based retrieval system. Even though supervised deep methods show excellent performance, they fall short of expectations unless expensive label information is sufficient. Besides, vast amounts of unlabeled image data are excluded from training. To resolve these issues, a vector quantization-based semi-supervised image retrieval scheme, the Generalized Product Quantization (GPQ) network, is proposed. A novel metric learning strategy that preserves semantic similarity between labeled data and an entropy regularization term that fully exploits the inherent potential of unlabeled data are employed to improve the retrieval system. This solution increases the generalization capacity of the quantization network, allowing it to overcome previous limitations.
Lastly, to enable the network to perform visually similar image retrieval on its own without any human supervision, an unsupervised learning algorithm is explored. Although deep supervised Hashing and PQ methods achieve outstanding retrieval performance compared to conventional methods by fully exploiting label annotations, it is painstaking to assign labels precisely for a vast amount of training data, and the annotation process is error-prone. To tackle these issues, a deep unsupervised image retrieval method dubbed the Self-supervised Product Quantization (SPQ) network, which is label-free and trained in a self-supervised manner, is proposed. A newly designed Cross Quantized Contrastive learning strategy is applied to jointly learn the PQ codewords and deep visual representations by comparing individually transformed images (views). This allows the network to understand image content and extract descriptive features so that visually accurate retrieval can be performed.
By conducting extensive image retrieval experiments on the benchmark datasets, the proposed methods are confirmed to yield the outstanding results under various evaluation protocols. For supervised face image retrieval, SGH achieves the best retrieval performance for both low and high resolution face image, and DHD also demonstrates its efficiency in general image retrieval experiments with the state-of-the-art retrieval performance. For semi-supervised general image retrieval, GPQ shows the best search results for protocols that use both labeled and unlabeled image data. Finally, for unsupervised general image retrieval, the best retrieval scores are achieved with SPQ even without supervised pre-training, and it can be observed that visually similar images are successfully retrieved as search results.Abstract i
Contents
List of Tables
List of Figures
1 Introduction
1.1 Contribution
1.2 Contents
2 Supervised Learning for Deep Hashing: Similarity Guided Hashing for Face Image Retrieval / Deep Hash Distillation for General Image Retrieval
2.1 Motivation and Overview for Face Image Retrieval
2.1.1 Related Works
2.2 Similarity Guided Hashing
2.3 Experiments
2.3.1 Datasets and Setup
2.3.2 Results on Small Face Images
2.3.3 Results on Large Face Images
2.4 Motivation and Overview for General Image Retrieval
2.5 Related Works
2.6 Deep Hash Distillation
2.6.1 Self-distilled Hashing
2.6.2 Teacher loss
2.6.3 Training
2.6.4 Hamming Distance Analysis
2.7 Experiments
2.7.1 Setup
2.7.2 Implementation Details
2.7.3 Results
2.7.4 Analysis
3 Semi-supervised Learning for Product Quantization: Generalized Product Quantization Network for Semi-supervised Image Retrieval
3.1 Motivation and Overview
3.1.1 Related Work
3.2 Generalized Product Quantization
3.2.1 Semi-Supervised Learning
3.2.2 Retrieval
3.3 Experiments
3.3.1 Setup
3.3.2 Results and Analysis
4 Unsupervised Learning for Product Quantization: Self-supervised Product Quantization for Deep Unsupervised Image Retrieval
4.1 Motivation and Overview
4.1.1 Related Works
4.2 Self-supervised Product Quantization
4.2.1 Overall Framework
4.2.2 Self-supervised Training
4.3 Experiments
4.3.1 Datasets
4.3.2 Experimental Settings
4.3.3 Results
4.3.4 Empirical Analysis
5 Conclusion
Abstract (In Korean)
A Comprehensive Survey on Cross-modal Retrieval
In recent years, cross-modal retrieval has drawn much attention due to the
rapid growth of multimodal data. It takes one type of data as the query to
retrieve relevant data of another type. For example, a user can use a text to
retrieve relevant pictures or videos. Since the query and its retrieved results
can be of different modalities, how to measure the content similarity between
different modalities of data remains a challenge. Various methods have been
proposed to deal with such a problem. In this paper, we first review a number
of representative methods for cross-modal retrieval and classify them into two
main groups: 1) real-valued representation learning, and 2) binary
representation learning. Real-valued representation learning methods aim to
learn real-valued common representations for different modalities of data. To
speed up cross-modal retrieval, a number of binary representation learning
methods have been proposed to map different modalities of data into a common
Hamming space. Then, we introduce several multimodal datasets in the community and
show the experimental results on two commonly used multimodal datasets. The
comparison reveals the characteristics of different kinds of cross-modal
retrieval methods and is expected to benefit both practical applications and
future research. Finally, we discuss open problems and future research
directions.
Comment: 20 pages, 11 figures, 9 tables
Learning Structured Ordinal Measures for Video based Face Recognition
This paper presents a structured ordinal measure method for video-based face
recognition that simultaneously learns ordinal filters and structured ordinal
features. The problem is posed as a non-convex integer programming problem
with two parts. The first part learns stable ordinal filters to project
video data into a large-margin ordinal space. The second seeks self-correcting
and discrete codes by balancing the projected data and a rank-one ordinal
matrix in a structured low-rank way. Unsupervised and supervised structures are
considered for the ordinal matrix. In addition, as a complement to hierarchical
structures, deep feature representations are integrated into our method to
enhance coding stability. An alternating minimization method is employed to
handle the discrete and low-rank constraints, yielding high-quality codes that
capture prior structures well. Experimental results on three commonly used face
video databases show that our method with a simple voting classifier can
achieve state-of-the-art recognition rates using fewer features and samples.
A Decade Survey of Content Based Image Retrieval using Deep Learning
Content-based image retrieval aims to find images similar to a query image in
a large-scale dataset. Generally, the similarity between
the representative features of the query image and the dataset images is used
to rank the images for retrieval. In the early days, various hand-designed
feature descriptors were investigated based on visual cues such as color,
texture, and shape that represent the images. Over the past decade, however,
deep learning has emerged as a dominant alternative to hand-designed feature
engineering, as it learns features automatically from the data. This paper presents
a comprehensive survey of deep learning based developments in the past decade
for content based image retrieval. The categorization of existing
state-of-the-art methods from different perspectives is also performed for
greater understanding of the progress. The taxonomy used in this survey covers
different types of supervision, networks, descriptors, and retrieval. A
performance analysis is also performed using the state-of-the-art methods, and
insights are presented to help researchers observe the progress and make the
best choices. The survey presented in this paper will support further research
progress in image retrieval using deep learning.