In apparel recognition, specialized models (e.g. models trained for a
particular vertical like dresses) can significantly outperform general models
(i.e. models that cover a wide range of verticals). Therefore, deep neural
network models are often trained separately for different verticals. However,
using specialized models for different verticals does not scale and is expensive
to deploy. This paper addresses the problem of learning one unified embedding
model for multiple object verticals (e.g. all apparel classes) without
sacrificing accuracy. The problem is tackled from two aspects: training data
and training difficulty. On the training data aspect, we find that for a
single model trained with triplet loss, there is an accuracy sweet spot in
terms of how many verticals are trained together. To ease the training
difficulty, we propose a novel learning scheme that uses the outputs of
specialized models as learning targets, so that an L2 loss can be used instead
of triplet loss. This new loss makes training easier and allows more efficient
use of the feature space. The end result is a unified model that achieves the
same retrieval accuracy as a number of separate specialized models while
having the model complexity of a single one. The effectiveness
of our approach is shown in experiments.
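
To make the contrast between the two objectives concrete, the following is a minimal NumPy sketch, not the implementation used in the paper. It assumes a squared-Euclidean triplet loss with a hypothetical margin of 0.2, and it assumes that each training image is paired with the embedding produced by the specialized model of its own vertical. The function names (`triplet_loss`, `l2_target_loss`, `select_targets`) and the dictionary layout are illustrative only.

```python
import numpy as np


def triplet_loss(anchor, positive, negative, margin=0.2):
    """Hinge-style triplet loss on squared Euclidean distances.

    All inputs are (N, D) arrays of embeddings. The margin value is an
    assumption made for illustration.
    """
    d_pos = np.sum((anchor - positive) ** 2, axis=-1)
    d_neg = np.sum((anchor - negative) ** 2, axis=-1)
    return float(np.mean(np.maximum(d_pos - d_neg + margin, 0.0)))


def select_targets(specialist_outputs, vertical_ids):
    """Pick, for each image, the embedding from its own vertical's specialist.

    specialist_outputs: dict mapping a vertical name to an (N, D) array of
        embeddings computed by that vertical's specialized model (assumed layout).
    vertical_ids: length-N sequence with the vertical name of each image.
    """
    return np.stack(
        [specialist_outputs[v][i] for i, v in enumerate(vertical_ids)]
    )


def l2_target_loss(unified_emb, target_emb):
    """Mean squared L2 distance between unified and specialist embeddings."""
    return float(np.mean(np.sum((unified_emb - target_emb) ** 2, axis=-1)))
```

Under these assumptions, the L2 objective needs only one image and one fixed target per training example, whereas the triplet objective needs an anchor, a positive, and a negative, which is one plausible reading of why the abstract describes the new loss as easier to train.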