People detection in single 2D images has improved greatly in recent years.
However, comparatively little of this progress has percolated into multi-camera
multi-people tracking algorithms, whose performance still degrades severely
when scenes become very crowded. In this work, we introduce a new architecture
that combines Convolutional Neural Nets and Conditional Random Fields to
explicitly model those ambiguities. One of its key ingredients are high-order
CRF terms that model potential occlusions and give our approach its robustness
even when many people are present. Our model is trained end-to-end and we show
that it outperforms several state-of-art algorithms on challenging scenes