In the current golden age of multimedia, human visualization is no longer the
sole target, as the final consumer is often a machine performing some
processing or computer vision task. In both cases, deep learning plays a
fundamental role in extracting features from the multimedia data, usually
producing a compressed representation referred to as the latent
representation. The increasing development and adoption of deep learning-based
solutions across a wide range of multimedia applications has opened an
exciting new vision in which a common compressed multimedia representation is
used for both man and machine. The main benefits of this vision are two-fold:
i) improved performance for the computer vision tasks, since the effects of
coding artifacts are mitigated; and ii) reduced computational complexity, since
prior decoding is not required. This paper proposes the first taxonomy for
designing compressed domain computer vision solutions, driven by the
compatibility of their architecture and weights with an available
spatio-temporal domain computer vision processor. The potential of the proposed taxonomy is
demonstrated for the specific case of point cloud classification by designing
novel compressed domain processors using the JPEG Pleno Point Cloud Coding
standard under development and adaptations of the PointGrid classifier.
Experimental results show that the designed compressed domain point cloud
classification solutions can significantly outperform the spatio-temporal
domain classification benchmarks when these are applied to decompressed data
containing coding artifacts, and can even surpass their performance when they
are applied to the original uncompressed data.