Easy Transfer Learning By Exploiting Intra-domain Structures
Transfer learning aims at transferring knowledge from a well-labeled domain
to a similar but different domain with limited or no labels. Unfortunately,
existing learning-based methods often involve intensive model selection and
hyperparameter tuning to obtain good results. Moreover, cross-validation is not
possible for tuning hyperparameters since there are often no labels in the
target domain. This restricts the wide applicability of transfer learning,
especially on computationally constrained devices such as wearables. In this
paper, we propose a practically Easy Transfer Learning (EasyTL) approach which
requires no model selection or hyperparameter tuning, while achieving
competitive performance. By exploiting intra-domain structures, EasyTL is able
to learn both non-parametric transfer features and classifiers. Extensive
experiments demonstrate that, compared to state-of-the-art traditional and deep
methods, EasyTL satisfies the Occam's Razor principle: it is extremely easy to
implement and use while achieving comparable or better performance in
classification accuracy and much better computational efficiency. Additionally,
it is shown that EasyTL can increase the performance of existing transfer
feature learning methods.
Comment: Camera-ready version of IEEE International Conference on Multimedia
and Expo (ICME) 2019; code available at
http://transferlearning.xyz/code/traditional/EasyT
Snap and Find: Deep Discrete Cross-domain Garment Image Retrieval
With the increasing number of online stores, there is a pressing need for
intelligent search systems to understand the item photos snapped by customers
and search against large-scale product databases to find their desired items.
However, it is challenging for conventional retrieval systems to match up the
item photos captured by customers and the ones officially released by stores,
especially for garment images. To bridge the customer- and store-provided
garment photos, existing studies have widely exploited clothing
attributes (e.g., black) and landmarks (e.g., collar) to
learn a common embedding space for garment representations. Unfortunately, they
omit the sequential correlation of attributes and consume a large quantity of
human labor to label the landmarks. In this paper, we propose a deep
multi-task cross-domain hashing method termed DMCH, in which cross-domain
embedding and sequential attribute learning are modeled simultaneously.
Sequential attribute learning not only provides the semantic guidance for
embedding, but also generates rich attention on discriminative local details
(e.g., black buttons) of clothing items without requiring extra
landmark labels. This leads to promising performance and a 306× boost in
efficiency compared with state-of-the-art models, as
demonstrated through rigorous experiments on two public fashion datasets.
Fine-grained Apparel Classification and Retrieval without rich annotations
The ability to correctly classify and retrieve apparel images has a variety
of applications important to e-commerce, online advertising and internet
search. In this work, we propose a robust framework for fine-grained apparel
classification and in-shop and cross-domain retrieval that eliminates the
requirement for rich annotations such as bounding boxes, human joints, or
clothing landmarks, and for training bounding-box or key-landmark detectors to
obtain them. Factors such as subtle appearance differences, variations in human poses,
different shooting angles, apparel deformations, and self-occlusion add to the
challenges in classification and retrieval of apparel items. Cross-domain
retrieval is even harder because of the large variation between online
shopping images, usually taken with ideal lighting, pose, positive angle and
clean background, and street photos captured by users in
complicated conditions with poor lighting and cluttered scenes. Our framework
uses a compact bilinear CNN with the tensor sketch algorithm to generate embeddings
that capture local pairwise feature interactions in a translationally invariant
manner. For apparel classification, we pass the feature embeddings through a
softmax classifier, while the in-shop and cross-domain retrieval pipelines use
a triplet-loss based optimization approach, such that the squared Euclidean
distance between embeddings measures the dissimilarity between the images.
Unlike previous works that relied on bounding-box, key clothing landmark, or
human joint detectors to assist the final deep classifier, the proposed framework
can be trained directly on the provided category labels or generated triplets
for triplet loss optimization. Lastly, experimental results on the DeepFashion
fine-grained categorization, and in-shop and consumer-to-shop retrieval
datasets provide a comparative analysis with previous work performed in the
domain.
Comment: 14 pages, 6 figures, 3 tables, Submitted to Springer Journal of
Applied Intelligenc
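The retrieval objective described in the abstract above (a triplet loss over embeddings compared by squared Euclidean distance) can be illustrated with a minimal sketch. The function names and the margin value are assumptions made for illustration; the abstract does not state the margin or how triplets are generated, and this is not the authors' implementation.

```python
def squared_euclidean(a, b):
    """Squared Euclidean distance between two equal-length embedding vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def triplet_loss(anchor, positive, negative, margin=1.0):
    """Hinge-style triplet loss on squared Euclidean distances.

    The loss is zero once the negative is at least `margin` farther from
    the anchor than the positive is. `margin` is a hypothetical default.
    """
    d_pos = squared_euclidean(anchor, positive)
    d_neg = squared_euclidean(anchor, negative)
    return max(0.0, d_pos - d_neg + margin)


# Toy 2-D embeddings: the first negative is far away, the second is as
# close to the anchor as the positive.
easy = triplet_loss([0.0, 0.0], [0.0, 1.0], [3.0, 0.0])
hard = triplet_loss([0.0, 0.0], [0.0, 1.0], [1.0, 0.0])
```

In this toy case the well-separated triplet contributes no loss, while the triplet whose negative is as close as the positive is penalised by the full margin.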
Modeling Text with Graph Convolutional Network for Cross-Modal Information Retrieval
Cross-modal information retrieval aims to find heterogeneous data of various
modalities from a given query of one modality. The main challenge is to map
different modalities into a common semantic space, in which distance between
concepts in different modalities can be well modeled. For cross-modal
information retrieval between images and texts, existing work mostly uses
off-the-shelf Convolutional Neural Network (CNN) for image feature extraction.
For texts, word-level features such as bag-of-words or word2vec are employed to
build deep learning models to represent texts. Besides word-level semantics,
the semantic relations between words are also informative but less explored. In
this paper, we model texts by graphs using similarity measure based on
word2vec. A dual-path neural network model is proposed for couple feature
learning in cross-modal information retrieval. One path utilizes Graph
Convolutional Network (GCN) for text modeling based on graph representations.
The other path uses a neural network with layers of nonlinearities for image
modeling based on off-the-shelf features. The model is trained by a pairwise
similarity loss function to maximize the similarity of relevant text-image
pairs and minimize the similarity of irrelevant pairs. Experimental results
show that the proposed model outperforms the state-of-the-art methods
significantly, with 17% improvement on accuracy for the best case.
Comment: 7 pages, 11 figure
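The idea of a pairwise similarity loss that raises the similarity of relevant text-image pairs and lowers it for irrelevant ones can be sketched as follows. The abstract above does not give the exact loss, so this sketch substitutes a generic cosine-based form (similar in shape to a cosine embedding loss with zero margin); the function names and the choice of cosine similarity are assumptions.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def pairwise_similarity_loss(text_vec, image_vec, relevant):
    """Generic pairwise loss (an assumed form, not the paper's).

    Relevant pairs are pushed toward similarity 1; irrelevant pairs are
    penalised only while their similarity is positive.
    """
    sim = cosine_similarity(text_vec, image_vec)
    if relevant:
        return 1.0 - sim
    return max(0.0, sim)


# Toy embeddings in the shared space.
matched = pairwise_similarity_loss([1.0, 0.0], [1.0, 0.0], relevant=True)
mismatched = pairwise_similarity_loss([1.0, 0.0], [1.0, 0.0], relevant=False)
```

Minimising this quantity over a batch increases similarity for relevant pairs and decreases it for irrelevant ones, which is the behaviour the abstract describes.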
Cross Domain Knowledge Learning with Dual-branch Adversarial Network for Vehicle Re-identification
The widespread adoption of vehicles has made people's lives easier
over the last decades. However, the emergence of a large number of vehicles
poses the critical but challenging problem of vehicle re-identification (reID).
Until now, for most vehicle reID algorithms, both the training and testing
processes have been conducted on the same annotated datasets under supervision.
However, even a well-trained model will still suffer a severe performance drop
due to the domain bias between the training dataset and real-world
scenes.
To address this problem, this paper proposes a domain adaptation framework
for vehicle reID (DAVR), which narrows the cross-domain bias by fully
exploiting the labeled data from the source domain to adapt the target domain.
DAVR develops an image-to-image translation network named Dual-branch
Adversarial Network (DAN), which enables images from the source
domain (well-labeled) to learn the style of the target domain (unlabeled) without
any annotation while preserving identity information from the source domain. The
generated images, which have more reasonable styles, are then used to train the
vehicle reID model through a proposed attention-based feature learning model. Through the
proposed framework, the well-trained reID model has better domain adaptation
ability for various scenes in real-world situations. Comprehensive experimental
results demonstrate that our proposed DAVR achieves excellent
performance on both the VehicleID and VeRi-776 datasets.
Comment: arXiv admin note: substantial text overlap with arXiv:1903.0786
Cross-media Similarity Metric Learning with Unified Deep Networks
As a prominent research topic in the multimedia area, cross-media
retrieval aims to capture the complex correlations among multiple media types.
Learning better shared representations and distance metrics for multimedia data
is important for boosting cross-media retrieval. Motivated by the strong
ability of deep neural networks in feature representation and comparison
function learning, we propose the Unified Network for Cross-media Similarity
Metric (UNCSM) to associate cross-media shared representation learning with
distance metric in a unified framework. First, we design a two-pathway deep
network pretrained with contrastive loss, and employ double triplet similarity
loss for fine-tuning to learn the shared representation for each media type by
modeling the relative semantic similarity. Second, the metric network is
designed for effectively calculating the cross-media similarity of the shared
representation, by modeling the pairwise similar and dissimilar constraints.
Compared to the existing methods, which mostly ignore the dissimilar constraints
and only use a simple distance metric such as Euclidean distance separately, our UNCSM
approach unifies the representation learning and distance metric to preserve
the relative similarity as well as embrace more complex similarity functions
for further improving the cross-media retrieval accuracy. The experimental
results show that our UNCSM approach outperforms 8 state-of-the-art methods on
4 widely-used cross-media datasets.
Comment: 19 pages, submitted to Multimedia Tools and Application
PM-GANs: Discriminative Representation Learning for Action Recognition Using Partial-modalities
Data of different modalities generally convey complementary but heterogeneous
information, and a more discriminative representation is often obtained by
combining multiple data modalities such as RGB and infrared features. However,
in reality, obtaining both data channels is challenging due to many
limitations. For example, the RGB surveillance cameras are often restricted
from private spaces, which is in conflict with the need of abnormal activity
detection for personal security. As a result, using partial data channels to
build a full representation of multi-modalities is clearly desired. In this
paper, we propose novel Partial-modal Generative Adversarial Networks
(PM-GANs) that learn a full-modal representation using data from only partial
modalities. The full representation is achieved by a generated representation
in place of the missing data channel. Extensive experiments are conducted to
verify the performance of our proposed method on action recognition, compared
with four state-of-the-art methods. Meanwhile, a new Infrared-Visible Dataset
for action recognition is introduced, and will be the first publicly available
action dataset that contains paired infrared and visible spectrum
Improving noise robustness of automatic speech recognition via parallel data and teacher-student learning
For real-world speech recognition applications, noise robustness is still a
challenge. In this work, we adopt the teacher-student (T/S) learning technique
using a parallel clean and noisy corpus for improving automatic speech
recognition (ASR) performance under multimedia noise. On top of that, we apply
a logits selection method which only preserves the k highest values to prevent
wrong emphasis of knowledge from the teacher and to reduce bandwidth needed for
transferring data. We incorporate up to 8000 hours of untranscribed data for
training and present our results on sequence-trained models in addition to
cross-entropy-trained ones. The best sequence-trained student model yields relative
word error rate (WER) reductions of approximately 10.1%, 28.7% and 19.6% on our
clean, simulated noisy and real test sets, respectively, compared to a
sequence-trained teacher.
Comment: To Appear in ICASSP 201
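The logits selection step mentioned in the abstract above (keeping only the k highest teacher values) can be sketched as below. How the retained values are turned into student targets is not stated in the abstract; renormalising them with a softmax over the kept entries is an assumption made here, as are the function names.

```python
import math


def select_top_k_logits(logits, k):
    """Return a dict {class_index: logit} holding only the k largest logits."""
    order = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)
    return {i: logits[i] for i in order[:k]}


def soft_targets_from_top_k(logits, k):
    """Build soft targets from the k retained teacher logits.

    Assumption: a softmax is taken over the kept logits only, and every
    discarded class receives probability zero.
    """
    kept = select_top_k_logits(logits, k)
    peak = max(kept.values())  # subtract the maximum for numerical stability
    exps = {i: math.exp(v - peak) for i, v in kept.items()}
    total = sum(exps.values())
    targets = [0.0] * len(logits)
    for i, e in exps.items():
        targets[i] = e / total
    return targets


# Toy teacher output over four classes, keeping the two highest values.
teacher_logits = [2.0, 0.5, 1.0, -1.0]
kept = select_top_k_logits(teacher_logits, 2)
targets = soft_targets_from_top_k(teacher_logits, 2)
```

Only k index-value pairs per frame need to be stored or transmitted, which is consistent with the bandwidth reduction the abstract mentions.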
Current Challenges and Visions in Music Recommender Systems Research
Music recommender systems (MRS) have experienced a boom in recent years,
thanks to the emergence and success of online streaming services, which
nowadays make almost all music in the world available at the user's fingertips.
While today's MRS considerably help users to find interesting music in these
huge catalogs, MRS research is still facing substantial challenges. In
particular, when it comes to building, incorporating, and evaluating recommendation
strategies that integrate information beyond simple user-item interactions or
content-based descriptors and dig deep into the very essence of listener
needs, preferences, and intentions, MRS research becomes a big endeavor and
related publications are quite sparse.
The purpose of this trends and survey article is twofold. We first identify
and shed light on what we believe are the most pressing challenges MRS research
is facing, from both academic and industry perspectives. We review the state of
the art towards solving these challenges and discuss its limitations. Second,
we detail possible future directions and visions we contemplate for the further
evolution of the field. The article should therefore serve two purposes: giving
the interested reader an overview of current challenges in MRS research and
providing guidance for young researchers by identifying interesting, yet
under-researched, directions in the field.
Cross-Domain Complementary Learning Using Pose for Multi-Person Part Segmentation
Supervised deep learning with pixel-wise training labels has achieved great success
in multi-person part segmentation. However, data labeling at the pixel level is
very expensive. To solve this problem, researchers have explored using
synthetic data to avoid data labeling. Although it is easy to generate
labels for synthetic data, the results are much worse compared to those using
real data and manual labeling. The degradation of the performance is mainly due
to the domain gap, i.e., the discrepancy of the pixel value statistics between
real and synthetic data. In this paper, we observe that real and synthetic
humans both have a skeleton (pose) representation. We found that the skeletons
can effectively bridge the synthetic and real domains during the training. Our
proposed approach takes advantage of the rich and realistic variations of the
real data and the easily obtainable labels of the synthetic data to learn
multi-person part segmentation on real images without any human-annotated
labels. Through experiments, we show that without any human labeling, our
method performs comparably to several state-of-the-art approaches which require
human labeling on Pascal-Person-Parts and COCO-DensePose datasets. On the other
hand, if part labels are also available in the real-images during training, our
method outperforms the supervised state-of-the-art methods by a large margin.
We further demonstrate the generalizability of our method on predicting novel
keypoints in real images where no real data labels are available for the novel
keypoints detection. Code and pre-trained models are available at
https://github.com/kevinlin311tw/CDCL-human-part-segmentation
Comment: To appear in IEEE Transactions on Circuits and Systems for Video
Technology; Presented at ICCV 2019 Demonstratio