The agro-food industry is turning to robots to address labour shortages.
However, agro-food environments are difficult for robots because of high
variation and occlusions. Under these conditions, accurate world models,
which hold information about object location, shape, and properties, are
crucial for robots to perform tasks accurately. Building such models is
challenging because of the complex and unique nature of agro-food
environments, and errors in the model can lead to errors in task execution. In
this paper, we propose MinkSORT, a novel method for generating tracking
features using a 3D sparse convolutional network in a DeepSORT-like approach to
improve the accuracy of world models in agro-food environments. We evaluated
our feature extractor network on real-world data collected in a tomato
greenhouse, where it significantly improved the performance of our baseline
model, which tracks tomato positions in 3D using a Kalman filter and the
Mahalanobis distance. Our deep learning feature extractor improved the HOTA from 42.8% to
44.77%, the association accuracy from 32.55% to 35.55%, and the MOTA from
57.63% to 58.81%. We also evaluated different contrastive loss functions for
training our deep learning feature extractor and showed that our approach
improves performance on three separate precision and recall detection
outcomes. Our method improves world model accuracy, enabling robots
to perform tasks such as harvesting and plant maintenance with greater
efficiency and accuracy, which is essential for meeting the growing demand for
food in a sustainable manner.
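
As an illustration of how a position-based cost and a learned appearance feature can be combined in a DeepSORT-like tracker, the following minimal sketch gates each candidate match on the squared Mahalanobis distance between a detection and a track's predicted 3D position, then adds a cosine distance between feature vectors. It is not the MinkSORT implementation: the function names, the diagonal-covariance simplification, the equal weighting, and the 95% chi-square gate are all assumptions made for the example.

```python
import math

# 95% quantile of the chi-square distribution with 3 degrees of freedom,
# used here as an assumed gate on the squared Mahalanobis distance in 3D.
CHI2_95_3DOF = 7.815


def squared_mahalanobis_diag(detection, predicted_mean, predicted_var):
    """Squared Mahalanobis distance, assuming a diagonal covariance.

    A Kalman filter normally yields a full covariance matrix; the diagonal
    form is a simplification that keeps this sketch dependency-free.
    """
    return sum(
        (d - m) ** 2 / v
        for d, m, v in zip(detection, predicted_mean, predicted_var)
    )


def cosine_distance(a, b):
    """Return 1 - cosine similarity between two non-zero feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / norm


def association_cost(det_pos, det_feat, trk_mean, trk_var, trk_feat, weight=0.5):
    """Combined cost for matching one detection to one track.

    Returns infinity when the detection falls outside the position gate,
    so that an assignment solver never selects that pair.
    """
    d_pos = squared_mahalanobis_diag(det_pos, trk_mean, trk_var)
    if d_pos > CHI2_95_3DOF:
        return math.inf
    d_feat = cosine_distance(det_feat, trk_feat)
    return weight * d_pos + (1.0 - weight) * d_feat


# Hypothetical track: predicted position (metres), per-axis variance, feature.
track_mean = (0.0, 0.0, 0.0)
track_var = (0.01, 0.01, 0.01)
track_feat = (1.0, 0.0)

near_cost = association_cost((0.1, 0.0, 0.0), (1.0, 0.0), track_mean, track_var, track_feat)
far_cost = association_cost((1.0, 0.0, 0.0), (1.0, 0.0), track_mean, track_var, track_feat)
```

In this sketch the nearby detection receives a finite cost, while the distant one is rejected by the gate regardless of how similar its feature vector is. A full tracker would compute this cost for every detection-track pair and pass the resulting matrix to an assignment solver.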