22 research outputs found
Scene-specific crowd counting using synthetic training images
Crowd counting is a computer vision task on which considerable progress has recently been made thanks to convolutional neural networks. However, it remains challenging even in scene-specific settings, particularly in real-world application scenarios where no representative images of the target scene are available for training or fine-tuning a crowd counting model, not even unlabelled ones. Inspired by previous work on other computer vision tasks, we propose a simple but effective solution for the above application scenario, which consists of automatically building a scene-specific training set of synthetic images. Our solution requires neither any manual annotation effort from end users nor the collection of representative images of the target scene. Extensive experiments on several benchmark data sets show that the proposed solution can improve the effectiveness of existing crowd counting methods.
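The core idea of building a synthetic training set with annotations that come for free can be illustrated with a minimal sketch. The function name, the opaque-paste compositing, and the head-at-sprite-top-centre convention below are illustrative assumptions, not the paper's actual pipeline:

```python
import numpy as np

def synthesize_training_image(background, sprite, positions):
    """Composite pedestrian sprites onto a scene background at given
    (row, col) top-left positions; head annotations come for free.
    Hypothetical sketch: a real pipeline would handle occlusion,
    scale and blending."""
    image = background.copy()
    h, w = sprite.shape[:2]
    head_points = []
    for r, c in positions:
        image[r:r + h, c:c + w] = sprite          # naive opaque paste
        head_points.append((r, c + w // 2))       # head assumed at sprite top-centre
    return image, head_points

# toy example: two "pedestrians" pasted onto a flat background
background = np.zeros((40, 60), dtype=np.uint8)
sprite = np.full((8, 4), 255, dtype=np.uint8)
img, points = synthesize_training_image(background, sprite, [(5, 10), (20, 30)])
```

Because the generator controls every pedestrian position, the ground-truth count and head locations are known exactly, with no human annotation.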
Human-in-the-loop cross-domain person re-identification
Person re-identification is a challenging cross-camera matching problem, which is inherently subject to domain shift. To mitigate it, many solutions have been proposed so far, based on four kinds of approaches: supervised and unsupervised domain adaptation, direct transfer, and domain generalisation; in particular, the first two approaches require target data during system design, labelled and unlabelled respectively. In this work, we consider a very different approach, known as human-in-the-loop (HITL), which consists of exploiting the user's feedback on target data processed during system operation to improve re-identification accuracy. Although it seems particularly suited to this application, given the inherent interaction with a human operator, HITL methods have been proposed for person re-identification by only a few works so far, and with a different purpose than addressing domain shift. However, we argue that HITL deserves further consideration in person re-identification, also as a potential alternative solution against domain shift. To substantiate our view, we consider simple HITL implementations which do not require model re-training or fine-tuning: they are based on well-known relevance feedback algorithms for content-based image retrieval, and on novel versions of them that we devise specifically for person re-identification. We then conduct an extensive, cross-data-set experimental evaluation of our HITL implementations on benchmark data sets, and compare them with a large set of existing methods against domain shift, belonging to the four categories mentioned above. Our results provide evidence that HITL can be as effective as, or even outperform, existing ad hoc solutions against domain shift for person re-identification, even under the simple implementations we consider. We believe that these results can foster further research on HITL in the person re-identification field, where, in our opinion, its potential has not been thoroughly explored so far.
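A classic relevance feedback algorithm of the kind mentioned above is Rocchio's query refinement. The sketch below, with assumed feature vectors and default weights, shows how operator feedback can re-rank a gallery without any model re-training; it is an illustration of the general technique, not the specific variants devised in the paper:

```python
import numpy as np

def rocchio_feedback(query, gallery, pos_idx, neg_idx,
                     alpha=1.0, beta=0.75, gamma=0.15):
    """One round of Rocchio-style relevance feedback: move the query
    towards features the operator marked as the same identity and away
    from those marked as different, then re-rank the gallery by cosine
    similarity. Weights alpha/beta/gamma are conventional defaults."""
    q = alpha * query
    if pos_idx:
        q = q + beta * gallery[pos_idx].mean(axis=0)
    if neg_idx:
        q = q - gamma * gallery[neg_idx].mean(axis=0)
    sims = gallery @ q / (np.linalg.norm(gallery, axis=1) * np.linalg.norm(q) + 1e-12)
    return np.argsort(-sims)   # gallery indices, best match first

# toy 2-D features: items 0-1 resemble the query's identity, 2-3 do not
gallery = np.array([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]])
query = np.array([0.7, 0.3])
ranking = rocchio_feedback(query, gallery, pos_idx=[0], neg_idx=[2])
```

After a single feedback round, the confirmed identity's look-alikes move to the top of the ranking and the rejected one drops to the bottom, which is exactly the accuracy gain HITL aims for.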
On the Effectiveness of Synthetic Data Sets for Training Person Re-identification Models
Person re-identification is a prominent topic in computer vision due to its security-related applications, and to the fact that issues such as variations in illumination, background, pedestrian pose and clothing appearance make it a very challenging task in real-world scenarios. State-of-the-art supervised methods require a huge manual annotation effort for training data and exhibit limited generalisation capability to unknown target domains. Synthetic data sets have recently been proposed as one possible solution to mitigate these problems, aimed at improving generalisation capability by encompassing a larger amount of variation in the above-mentioned visual factors, with no need for manual annotation. However, existing synthetic data sets differ in many respects, including the number of images, identities and cameras, and their degree of photorealism, and there is not yet a clear understanding of how all these factors affect person re-identification performance. This work takes a first step towards filling this gap through an in-depth empirical investigation, in which we use existing synthetic data sets for model training and real benchmark ones for performance evaluation. Our results provide interesting insights towards developing effective synthetic data sets for person re-identification.
Online domain adaptation for person re-identification with a human in the loop
Supervised deep learning methods have recently achieved remarkable performance in person re-identification. Unsupervised domain adaptation (UDA) approaches have also been proposed for application scenarios where only unlabelled data are available from target camera views. We consider a more challenging scenario in which even collecting a suitable amount of representative, unlabelled target data for offline training or fine-tuning is infeasible. In this context, we revisit the human-in-the-loop (HITL) approach, which exploits online the operator's feedback on a small amount of target data. We argue that HITL is a kind of online domain adaptation specifically suited to person re-identification. We then reconsider relevance feedback methods for content-based image retrieval, which are computationally much cheaper than state-of-the-art HITL methods for person re-identification, and devise a specific feedback protocol for them. Experimental results show that HITL can achieve comparable or better performance than UDA, and is therefore a valid alternative when the lack of unlabelled target data makes UDA infeasible.
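The shape of an online feedback protocol of this kind can be sketched as a simple loop. The round structure, the query-refinement rule (averaging confirmed matches into the query) and the filtering of rejected items below are illustrative assumptions, not the protocol actually devised in the paper:

```python
import numpy as np

def hitl_rounds(query, gallery, is_match, n_rounds=3):
    """Toy human-in-the-loop protocol: in each round the (simulated)
    operator judges the top-ranked not-yet-judged gallery item; confirmed
    matches are averaged into the query, rejected items are excluded.
    No model re-training or fine-tuning takes place."""
    q = query.copy()
    judged, positives = set(), []
    for _ in range(n_rounds):
        order = [i for i in np.argsort(-(gallery @ q)) if i not in judged]
        if not order:
            break
        top = order[0]
        judged.add(top)
        if is_match[top]:
            positives.append(gallery[top])
            q = np.mean([query] + positives, axis=0)   # refine the query online
    # final ranking: drop operator-rejected items, keep confirmed matches
    return [i for i in np.argsort(-(gallery @ q))
            if i not in judged or is_match[i]]

gallery = np.array([[1.0, 0.0], [0.6, 0.8], [0.0, 1.0]])
query = np.array([0.9, 0.5])
is_match = [False, True, False]    # ground truth used to simulate the operator
final = hitl_rounds(query, gallery, is_match)
```

The operator's few judgements are spent on the most promising candidates, which is what makes such a protocol cheap enough for use during system operation.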
How Realistic Should Synthetic Images Be for Training Crowd Counting Models?
Using synthetic images has been proposed as a way to avoid collecting and manually annotating a sufficiently large and representative training set for several computer vision tasks, including crowd counting. While existing crowd counting methods based on synthetic data focus on generating realistic images, we begin investigating how crowd counting accuracy is affected as the realism of synthetic training images increases. Preliminary experiments with state-of-the-art CNN-based methods, focused on image background and pedestrian appearance, show that realism in both is beneficial to a different extent, depending on the kind of model (regression- or detection-based) and on pedestrian size in the images.
On the Evaluation of Video-Based Crowd Counting Models
Crowd counting is a challenging and relevant computer vision task. Most existing methods are image-based, i.e., they exploit only the spatial information of a single image to estimate the corresponding people count. Recently, video-based methods have been proposed to improve counting accuracy by also exploiting the temporal information coming from the correlation between adjacent frames. In this work, we point out the need to properly evaluate the specific contribution of the temporal information over the spatial one. This issue has not been discussed by existing work, and in some cases such an evaluation has been carried out in a way that may lead to overestimating the contribution of the temporal information. To address this issue, we propose a categorisation of existing video-based models, discuss how the contribution of the temporal information has been evaluated by existing work, and propose an evaluation approach aimed at providing a more complete assessment for two different categories of video-based methods. We finally illustrate our approach, for one specific category, through experiments on several benchmark video data sets.
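The point about isolating the temporal contribution can be made concrete with the standard crowd counting metric, Mean Absolute Error (MAE): the temporal gain should be measured against a comparable spatial-only baseline on the same frames. The counts below are made-up toy numbers for illustration:

```python
def mae(pred_counts, true_counts):
    """Mean Absolute Error, the standard crowd counting metric:
    average absolute deviation between predicted and true counts."""
    return sum(abs(p - t) for p, t in zip(pred_counts, true_counts)) / len(true_counts)

true_counts = [10, 12, 11, 14]
image_based = [8, 13, 12, 17]   # per-frame estimates, spatial information only
video_based = [9, 12, 12, 15]   # same frames, temporal information added

# the temporal contribution is the MAE reduction over the spatial baseline
gain = mae(image_based, true_counts) - mae(video_based, true_counts)
```

Comparing the video-based model against an unrelated image-based method, rather than against its own spatial backbone, would conflate the two sources of improvement and overestimate the temporal one.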
Trustworthy AI in Video Surveillance: The IMMAGINA Project
The increasing adoption of machine learning and deep learning models in critical applications raises the issue of ensuring their trustworthiness, which can be addressed by quantifying the uncertainty of their predictions. However, the black-box nature of many such models only allows uncertainty to be quantified through ad hoc superstructures, which require a model to be developed and trained in an uncertainty-aware fashion. For applications where previously trained models are already in operation, it would instead be desirable to develop uncertainty quantification approaches acting as lightweight “plug-ins” that can be applied on top of such models without modifying or re-training them. In this contribution we present a research activity of the Pattern Recognition and Applications Lab of the University of Cagliari on a recently proposed post hoc uncertainty quantification method, named dropout injection, which is a variant of the well-known Monte Carlo dropout that requires neither re-training nor any further gradient descent-based optimisation; this makes it a promising, lightweight solution for integrating uncertainty quantification into any already-trained neural network. We are investigating a theoretically grounded solution to make dropout injection as effective as Monte Carlo dropout through a suitable rescaling of its uncertainty measure; we are also evaluating its effectiveness in the computer vision tasks of crowd counting and density estimation for intelligent video surveillance, thanks to our participation in a project funded by the European Space Agency.
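The mechanism behind dropout injection can be sketched in a few lines: dropout is switched on at inference time in an already-trained network, and repeated stochastic forward passes yield a predictive mean and an uncertainty estimate. The toy network, its random fixed weights, and the function names below are illustrative assumptions; in particular, the paper's key contribution, the rescaling of the uncertainty measure, is not shown here:

```python
import numpy as np

rng = np.random.default_rng(0)

# a toy "already-trained" regressor: one hidden layer with fixed weights
W1, b1 = rng.normal(size=(4, 8)), np.zeros(8)
W2, b2 = rng.normal(size=(8, 1)), np.zeros(1)

def forward(x, drop_rate=0.0):
    h = np.maximum(x @ W1 + b1, 0.0)          # ReLU hidden layer
    if drop_rate > 0.0:                        # dropout injected post hoc
        mask = rng.random(h.shape) > drop_rate
        h = h * mask / (1.0 - drop_rate)       # inverted-dropout scaling
    return h @ W2 + b2

def predict_with_uncertainty(x, T=100, drop_rate=0.2):
    """T stochastic forward passes through the already-trained network;
    the sample mean is the prediction, the sample std its uncertainty.
    No re-training or gradient-based optimisation is involved."""
    samples = np.stack([forward(x, drop_rate) for _ in range(T)])
    return samples.mean(axis=0), samples.std(axis=0)

x = rng.normal(size=(1, 4))
mean, std = predict_with_uncertainty(x)
```

Because the weights are untouched and only extra forward passes are needed, this behaves as the lightweight “plug-in” described above.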
BLUES: Before-reLU-EStimates Bayesian Inference for Crowd Counting
Ensuring the trustworthiness of artificial intelligence and machine learning systems is becoming a crucial requirement given their widespread application, including in crowd counting, which we focus on in this work. This is often addressed by integrating uncertainty measures into their predictions. Most Bayesian uncertainty quantification techniques use a Gaussian approximation of the output, whose variance is interpreted as the uncertainty measure. However, in neural network models for crowd counting based on density estimation, where the ReLU activation function is used for the output units, such an approximation may place a significant probability mass on negative values, even though the ReLU activation cannot produce them. Interestingly, we found that this is related to “false positive” pedestrian localisation errors in the density map. We propose to address this issue by shifting the Bayesian Inference Before the reLU EStimates (BLUES). This modification allows us to estimate a probability distribution over both the people density and the people presence in each pixel, from which we compute a crowd segmentation map that we exploit to filter out false positive localisations. Results on several benchmark data sets provide evidence that our BLUES approach improves the accuracy of the estimated density map and the quality of the corresponding uncertainty measure.
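Placing the Gaussian belief on the pre-ReLU activation leads to standard rectified-Gaussian formulas, which give exactly the two per-pixel quantities mentioned above: a presence probability and a non-negative expected density. The function name and the 0.5 segmentation threshold below are illustrative assumptions; the formulas themselves are the textbook mean of a rectified Gaussian:

```python
import math

def rectified_gaussian_stats(mu, sigma):
    """Given a Gaussian belief N(mu, sigma^2) on the pre-ReLU activation,
    return (presence probability, expected post-ReLU density):
        P(z > 0)   = Phi(mu / sigma)
        E[relu(z)] = mu * Phi(mu / sigma) + sigma * phi(mu / sigma)
    where Phi / phi are the standard normal CDF / PDF."""
    a = mu / sigma
    Phi = 0.5 * (1.0 + math.erf(a / math.sqrt(2.0)))
    phi = math.exp(-0.5 * a * a) / math.sqrt(2.0 * math.pi)
    return Phi, mu * Phi + sigma * phi

# a pixel with a weakly negative pre-activation mean: low presence
# probability and a small (but strictly non-negative) expected density
p, d = rectified_gaussian_stats(mu=-0.1, sigma=0.2)
mask = p > 0.5   # hypothetical threshold into a crowd segmentation map
```

A pixel like this would contribute spurious density under a Gaussian on the output, but is correctly suppressed by the segmentation mask when inference happens before the ReLU.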