Convolutional neural networks (CNNs) have achieved remarkable advances over
the past decade, defining the state of the art in several computer vision tasks.
CNNs are capable of learning robust representations of the data directly from
the RGB pixels. However, most image data are available in a compressed format,
of which JPEG is the most widely used for transmission and storage purposes,
so a preliminary decoding step with a high computational load and memory usage
is required before the images can be processed. For this reason, deep learning
methods capable of learning directly from the compressed domain have been gaining
attention in recent years. These methods adapt typical CNNs to work on the
compressed domain, but the common architectural modifications lead to an
increase in computational complexity and the number of parameters. In this
paper, we investigate the use of CNNs designed to work directly with the DCT
coefficients available in JPEG-compressed images, and we propose handcrafted
and data-driven techniques for reducing the computational complexity and the
number of parameters of these models in order to keep their computational cost
similar to that of their RGB baselines. We conduct initial ablation
studies on a subset of ImageNet in order to analyse the impact of different
frequency ranges, image resolution, JPEG quality and classification task
difficulty on the performance of the models. Then, we evaluate the models on
the complete ImageNet dataset. Our results indicate that DCT models can achieve
good performance, and that it is possible to reduce the computational complexity
and the number of parameters of these models while retaining similar
classification accuracy through the use of our proposed
techniques.
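
As a rough illustration of the general idea (not the paper's actual code), the sketch below shows one way DCT coefficients could be read directly from a JPEG file and fed to a small CNN. The jpeg2dct package, the file name, the toy network, and the channel-selection heuristic are all assumptions made for this example only.

```python
# Minimal sketch (illustrative only): feeding JPEG DCT coefficients to a CNN.
# Assumes the third-party "jpeg2dct" package and PyTorch are installed; each
# spatial position holds the 64 coefficients of one 8x8 DCT block.
import torch
import torch.nn as nn
from jpeg2dct.numpy import load  # per-block DCT coefficients of Y, Cb, Cr

# Read DCT coefficients straight from the entropy-decoded JPEG, skipping the
# inverse DCT and RGB conversion. For an HxW image, dct_y has shape
# (H/8, W/8, 64); dct_cb and dct_cr are chroma-subsampled to (H/16, W/16, 64).
dct_y, dct_cb, dct_cr = load("example.jpg")  # hypothetical file name

# Treat the 64 frequency bands of the luma channel as input channels.
x = torch.from_numpy(dct_y).float().permute(2, 0, 1).unsqueeze(0)  # (1, 64, H/8, W/8)

# Toy head standing in for a CNN backbone adapted to DCT inputs. Keeping only
# the first k channels (assumed here to be the lower frequencies) is one simple
# handcrafted way to cut complexity and parameters.
k = 16
model = nn.Sequential(
    nn.Conv2d(k, 128, kernel_size=3, padding=1),
    nn.ReLU(inplace=True),
    nn.AdaptiveAvgPool2d(1),
    nn.Flatten(),
    nn.Linear(128, 1000),
)
logits = model(x[:, :k])
print(logits.shape)  # torch.Size([1, 1000])
```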