Dense object counting, or crowd counting, has come a long way thanks to recent
developments in the vision community. However, indiscernible object counting,
which aims to count targets that blend into their surroundings, remains a
challenge. Publicly available object counting datasets are still predominantly
image-based. We therefore propose a large-scale video dataset, YoutubeFish-35,
which contains a total of 35 sequences of high-definition, high-frame-rate
video and more than 150,000 annotated center points across a variety of
carefully selected scenes. For benchmarking, we select three
mainstream methods for dense object counting and carefully evaluate them on the
newly collected dataset. We also propose TransVidCount, a new strong baseline
that combines density and regression branches along the temporal domain in a
unified framework, effectively tackling indiscernible object counting and
achieving state-of-the-art performance on the YoutubeFish-35 dataset.

Comment: Accepted by ICASSP 2024 (IEEE International Conference on Acoustics,
Speech, and Signal Processing).
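
As a rough illustration of the dual-branch design mentioned above, the sketch
below shows a toy counting head in PyTorch that pairs a density-map branch with
a count-regression branch over temporally aggregated features. Every concrete
choice here (the small convolutional backbone, attention-based temporal
aggregation, and fusion by averaging the two count estimates) is an assumption
made for illustration; the abstract does not specify the actual TransVidCount
architecture.

```python
# Minimal sketch of a dual-branch temporal counting head.
# All module names, shapes, and hyperparameters are illustrative assumptions,
# not the actual TransVidCount design.
import torch
import torch.nn as nn


class DualBranchTemporalCounter(nn.Module):
    """Toy counter: a shared backbone feeds a density-map branch and a
    count-regression branch, with features aggregated across frames."""

    def __init__(self, in_channels: int = 3, feat_dim: int = 64):
        super().__init__()
        # Per-frame feature extractor (stand-in for a real backbone).
        self.backbone = nn.Sequential(
            nn.Conv2d(in_channels, feat_dim, 3, padding=1), nn.ReLU(),
            nn.Conv2d(feat_dim, feat_dim, 3, padding=1), nn.ReLU(),
        )
        # Temporal aggregation over the frame axis (assumption: simple
        # self-attention over per-frame global descriptors).
        self.temporal = nn.MultiheadAttention(feat_dim, num_heads=4,
                                              batch_first=True)
        # Density branch: per-pixel density map; summing it gives one count.
        self.density_head = nn.Conv2d(feat_dim, 1, 1)
        # Regression branch: scalar count from the aggregated descriptor.
        self.count_head = nn.Sequential(nn.Linear(feat_dim, feat_dim), nn.ReLU(),
                                        nn.Linear(feat_dim, 1))

    def forward(self, clip: torch.Tensor):
        # clip: (B, T, C, H, W) video clip.
        b, t, c, h, w = clip.shape
        feats = self.backbone(clip.flatten(0, 1))        # (B*T, D, H, W)
        density = self.density_head(feats).relu()        # (B*T, 1, H, W)
        density = density.view(b, t, 1, h, w)

        # Global per-frame descriptors, refined across time.
        desc = feats.mean(dim=(2, 3)).view(b, t, -1)     # (B, T, D)
        desc, _ = self.temporal(desc, desc, desc)        # (B, T, D)
        reg_count = self.count_head(desc).squeeze(-1)    # (B, T)

        # Fuse the two branches, e.g. by averaging their count estimates.
        density_count = density.sum(dim=(2, 3, 4))       # (B, T)
        return density, 0.5 * (density_count + reg_count)


if __name__ == "__main__":
    model = DualBranchTemporalCounter()
    clip = torch.randn(2, 8, 3, 64, 64)                  # 2 clips of 8 frames
    density_maps, counts = model(clip)
    print(density_maps.shape, counts.shape)              # (2, 8, 1, 64, 64), (2, 8)
```

In this sketch the per-frame density maps provide one count estimate and the
temporally refined descriptors provide another; averaging them is just one
simple way to combine the two branches in a single forward pass.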