In recent years, convolutional neural networks (CNNs) have achieved
impressive performance across a wide variety of visual recognition scenarios. CNNs trained
on large labeled datasets not only achieve strong performance on the most
challenging benchmarks but also learn powerful representations that can be
applied to a wide range of other tasks. However, these models require massive
amounts of training data, a major drawback in practice, where the available
data is often limited or imbalanced. Fine-tuning
(FT) is an effective way to transfer knowledge learned in a source dataset to a
target task. In this paper, we introduce and systematically investigate several
factors that influence the performance of fine-tuning for visual recognition.
These factors include parameters for the retraining procedure (e.g., the
initial learning rate of fine-tuning), the distribution of the source and
target data (e.g., the number of categories in the source dataset, the distance
between the source and target datasets), among others. We quantitatively and
qualitatively analyze these factors, evaluate their influence, and present many
empirical observations. The results reveal insights into how fine-tuning
changes CNN parameters and provide useful, evidence-backed intuitions about
how to implement fine-tuning for computer vision tasks.

Comment: Accepted by ACM Transactions on Data Science
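To make the role of the fine-tuning factors concrete, the following is a minimal, hypothetical sketch (not the paper's code): fine-tuning starts from weights transferred from a source task and typically retrains them with a smaller initial learning rate than training from scratch, so that the transferred knowledge is only gently adjusted. All values and names here are illustrative.

```python
# Minimal sketch of why the initial learning rate matters for fine-tuning.
# All weights, gradients, and learning rates below are illustrative values,
# not results from the paper.

def sgd_step(weights, grads, lr):
    """One plain SGD update: w <- w - lr * g."""
    return [w - lr * g for w, g in zip(weights, grads)]

# Weights transferred from a model pretrained on a source dataset.
pretrained = [0.8, -0.3, 0.5]
grads = [0.2, -0.1, 0.4]

# Fine-tuning: a small initial learning rate keeps the update close to the
# pretrained weights, preserving the transferred representation.
fine_tuned = sgd_step(pretrained, grads, lr=0.001)

# Training from scratch: random (here zero) initialization and a larger
# learning rate, with no transferred knowledge to preserve.
scratch = sgd_step([0.0, 0.0, 0.0], grads, lr=0.1)

print(fine_tuned)  # remains a small perturbation of the pretrained weights
print(scratch)
```

In practice the same idea appears as choosing a fine-tuning learning rate one or two orders of magnitude below the source-training rate; the paper investigates this choice, among other factors, empirically.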