In multimodal-aware recommendation, the extraction of meaningful multimodal
features is the basis of high-quality recommendations. Generally, each
recommendation framework implements its multimodal extraction procedures with
specific strategies and tools. This is limiting for two reasons: (i) different
extraction strategies hinder interoperability among multimodal recommendation
frameworks, so they cannot be compared efficiently and fairly; (ii) given the
plethora of pre-trained deep learning models made available by different
open-source tools, model designers do not have access to shared interfaces to
extract features. Motivated by these limitations, we propose Ducho, a unified
framework for the extraction of multimodal
features in recommendation. By integrating three widely adopted deep learning
libraries as backends, namely TensorFlow, PyTorch, and Transformers, we
provide a shared interface to extract and process features, where each
backend's specific methods are abstracted away from the end user. Notably, the
extraction pipeline is easily configurable through a YAML-based file in which
the user can specify, for each modality, the list of models (and their specific
backends/parameters) to perform the extraction. Finally, to make Ducho
accessible to the community, we build a public Docker image equipped with a
ready-to-use CUDA environment and provide three demos to test its
functionalities for different scenarios and tasks. The GitHub repository and
the documentation are accessible at this link:
https://github.com/sisinflab/Ducho
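
To illustrate the idea of a shared interface driven by a per-modality
configuration, the following is a minimal sketch and not Ducho's actual API.
The configuration keys, the model names, the `register` and `extract` helpers,
and the stub backends are all hypothetical; the stubs return placeholder
numbers instead of calling TensorFlow, PyTorch, or Transformers.

```python
from typing import Callable, Dict, List, Tuple

# Hypothetical configuration, mirroring what a per-modality YAML file might
# hold. The keys and model names are illustrative, not Ducho's real schema.
CONFIG = {
    "visual": [{"model": "resnet50", "backend": "torch"}],
    "textual": [{"model": "sentence-bert", "backend": "transformers"}],
}

# An extractor takes a model name and a list of raw items and returns one
# feature vector per item.
Extractor = Callable[[str, List[str]], List[List[float]]]

# Registry that hides backend-specific code behind a common signature.
BACKENDS: Dict[str, Extractor] = {}


def register(name: str) -> Callable[[Extractor], Extractor]:
    """Register an extractor function under a backend name."""
    def wrap(fn: Extractor) -> Extractor:
        BACKENDS[name] = fn
        return fn
    return wrap


@register("torch")
def _torch_stub(model: str, items: List[str]) -> List[List[float]]:
    # Placeholder: a real backend would load `model` and run a forward pass.
    return [[float(len(item))] for item in items]


@register("transformers")
def _transformers_stub(model: str, items: List[str]) -> List[List[float]]:
    # Placeholder: a real backend would tokenize and encode each text.
    return [[float(len(item.split()))] for item in items]


def extract(
    config: Dict[str, List[Dict[str, str]]],
    data: Dict[str, List[str]],
) -> Dict[Tuple[str, str], List[List[float]]]:
    """Run every configured model on the items of its modality."""
    features = {}
    for modality, models in config.items():
        for spec in models:
            backend = BACKENDS[spec["backend"]]
            features[(modality, spec["model"])] = backend(
                spec["model"], data[modality]
            )
    return features
```

Under these assumptions, calling `extract(CONFIG, {"visual": ["img1.jpg"],
"textual": ["a red shoe"]})` returns a dictionary keyed by (modality, model)
pairs. The caller never touches backend-specific code, which is the design
point of a shared extraction interface.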