    Unsupervised Adaptive Re-identification in Open World Dynamic Camera Networks

    Person re-identification is an open and challenging problem in computer vision. Existing approaches have concentrated on either designing the best feature representation or learning optimal matching metrics in a static setting where the number of cameras in a network is fixed. Most approaches have neglected the dynamic and open world nature of the re-identification problem, where a new camera may be temporarily inserted into an existing system to get additional information. To address such a novel and very practical problem, we propose an unsupervised adaptation scheme for re-identification models in a dynamic camera network. First, we formulate a domain perceptive re-identification method based on the geodesic flow kernel that can effectively find the best source camera (already installed) to adapt with a newly introduced target camera, without requiring a very expensive training phase. Second, we introduce a transitive inference algorithm for re-identification that can exploit the information from the best source camera to improve the accuracy across other camera pairs in a network of multiple cameras. Extensive experiments on four benchmark datasets demonstrate that the proposed approach significantly outperforms the state-of-the-art unsupervised learning based alternatives whilst being extremely efficient to compute. Comment: CVPR 2017 Spotlight

    Adaptive visual sampling

    PhD thesis. Various visual tasks may be analysed in the context of sampling from the visual field. In visual psychophysics, human visual sampling strategies have often been shown at a high level to be driven by various information and resource related factors such as the limited capacity of the human cognitive system, the quality of information gathered, its relevance in context and the associated efficiency of recovering it. At a lower level, we interpret many computer vision tasks to be rooted in similar notions of contextually-relevant, dynamic sampling strategies which are geared towards the filtering of pixel samples to perform reliable object association. In the context of object tracking, the reliability of such endeavours is fundamentally rooted in the continuing relevance of object models used for such filtering, a requirement complicated by real-world conditions such as dynamic lighting that inconveniently and frequently cause their rapid obsolescence. In the context of recognition, performance can be hindered by the lack of learned context-dependent strategies that satisfactorily filter out samples that are irrelevant or blunt the potency of models used for discrimination. In this thesis we interpret the problems of visual tracking and recognition in terms of dynamic spatial and featural sampling strategies and, in this vein, present three frameworks that build on previous methods to provide a more flexible and effective approach. Firstly, we propose an adaptive spatial sampling strategy framework to maintain statistical object models for real-time robust tracking under changing lighting conditions. We employ colour features in experiments to demonstrate its effectiveness.
The framework consists of five parts: (a) Gaussian mixture models for semi-parametric modelling of the colour distributions of multicolour objects; (b) a constructive algorithm that uses cross-validation for automatically determining the number of components for a Gaussian mixture given a sample set of object colours; (c) a sampling strategy for performing fast tracking using colour models; (d) a Bayesian formulation enabling models of object and the environment to be employed together in filtering samples by discrimination; and (e) a selectively-adaptive mechanism to enable colour models to cope with changing conditions and permit more robust tracking. Secondly, we extend the concept to an adaptive spatial and featural sampling strategy to deal with very difficult conditions such as small target objects in cluttered environments undergoing severe lighting fluctuations and extreme occlusions. This builds on previous work on dynamic feature selection during tracking by reducing redundancy in features selected at each stage as well as more naturally balancing short-term and long-term evidence, the latter to facilitate model rigidity under sharp, temporary changes such as occlusion whilst permitting model flexibility under slower, long-term changes such as varying lighting conditions. This framework consists of two parts: (a) Attribute-based Feature Ranking (AFR), which combines two attribute measures, discriminability and independence from other features; and (b) Multiple Selectively-adaptive Feature Models (MSFM), which involves maintaining a dynamic feature reference of target object appearance. We call this framework Adaptive Multi-feature Association (AMA).
Finally, we present an adaptive spatial and featural sampling strategy that extends established Local Binary Pattern (LBP) methods and overcomes many severe limitations of the traditional approach such as limited spatial support, restricted sample sets and ad hoc joint and disjoint statistical distributions that may fail to capture important structure. Our framework enables more compact, descriptive LBP-type models to be constructed, which may be employed in conjunction with many existing LBP techniques to improve their performance without modification. The framework consists of two parts: (a) a new LBP-type model known as Multiscale Selected Local Binary Features (MSLBF); and (b) a novel binary feature selection algorithm called Binary Histogram Intersection Minimisation (BHIM), which is shown to be more powerful than established methods used for binary feature selection such as Conditional Mutual Information Maximisation (CMIM) and AdaBoost.
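For orientation, the traditional LBP baseline that the thesis sets out to extend can be sketched in a few lines. This is a minimal illustration of the basic 8-neighbour operator and of histogram intersection as a similarity measure; it is not MSLBF or BHIM, whose details the abstract does not give. The function names and the clockwise bit ordering are assumptions made for the example.

```python
def lbp_code(image, row, col):
    """Basic 8-neighbour LBP code of the interior pixel at (row, col).

    `image` is a list of rows of grey levels. The bit ordering
    (clockwise from the top-left neighbour) is an illustrative choice.
    """
    center = image[row][col]
    offsets = [(-1, -1), (-1, 0), (-1, 1), (0, 1),
               (1, 1), (1, 0), (1, -1), (0, -1)]
    code = 0
    for bit, (dr, dc) in enumerate(offsets):
        # A neighbour at least as bright as the centre sets its bit.
        if image[row + dr][col + dc] >= center:
            code |= 1 << bit
    return code


def lbp_histogram(image):
    """256-bin histogram of LBP codes over all interior pixels."""
    hist = [0] * 256
    for r in range(1, len(image) - 1):
        for c in range(1, len(image[0]) - 1):
            hist[lbp_code(image, r, c)] += 1
    return hist


def histogram_intersection(h1, h2):
    """Sum of bin-wise minima; larger means more similar histograms."""
    return sum(min(a, b) for a, b in zip(h1, h2))
```

The fixed 3x3 neighbourhood in this sketch is an instance of the "limited spatial support" that the abstract lists among the limitations of the traditional approach.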

    Attentive monitoring of multiple video streams driven by a Bayesian foraging strategy

    In this paper we shall consider the problem of deploying attention to subsets of the video streams for collating the most relevant data and information of interest related to a given task. We formalize this monitoring problem as a foraging problem. We propose a probabilistic framework to model the observer's attentive behavior as the behavior of a forager. The forager, moment to moment, focuses its attention on the most informative stream/camera, detects interesting objects or activities, or switches to a more profitable stream. The approach proposed here is suitable to be exploited for multi-stream video summarization. Meanwhile, it can serve as a preliminary step for more sophisticated video surveillance, e.g. activity and behavior analysis. Experimental results achieved on the UCR Videoweb Activities Dataset, a publicly available dataset, are presented to illustrate the utility of the proposed technique. Comment: Accepted to IEEE Transactions on Image Processing

    Person re-Identification over distributed spaces and time

    PhD thesis. Replicating the human visual system and cognitive abilities that the brain uses to process the information it receives is an area of substantial scientific interest. With the prevalence of video surveillance cameras, a portion of this scientific drive has been into providing useful automated counterparts to human operators. A prominent task in visual surveillance is that of matching people between disjoint camera views, or re-identification. This allows operators to locate people of interest, to track people across cameras and can be used as a precursory step to multi-camera activity analysis. However, due to the contrasting conditions between camera views and their effects on the appearance of people, re-identification is a non-trivial task. This thesis proposes solutions for reducing the visual ambiguity in observations of people between camera views. This thesis first looks at a method for mitigating the effects on the appearance of people under differing lighting conditions between camera views. This thesis builds on work modelling inter-camera illumination based on known pairs of images. A Cumulative Brightness Transfer Function (CBTF) is proposed to estimate the mapping of colour brightness values based on limited training samples. Unlike previous methods that use a mean-based representation for a set of training samples, the cumulative nature of the CBTF retains colour information from underrepresented samples in the training set. Additionally, the bi-directionality of the mapping function is explored to try and maximise re-identification accuracy by ensuring samples are accurately mapped between cameras. Secondly, an extension is proposed to the CBTF framework that addresses the issue of changing lighting conditions within a single camera. As the CBTF requires manually labelled training samples it is limited to static lighting conditions and is less effective if the lighting changes.
This Adaptive CBTF (A-CBTF) differs from previous approaches that either do not consider lighting change over time, or rely on camera transition time information to update. By utilising contextual information drawn from the background in each camera view, an estimation of the lighting change within a single camera can be made. This background lighting model allows the mapping of colour information back to the original training conditions and thus removes the need for retraining. Thirdly, a novel reformulation of re-identification as a ranking problem is proposed. Previous methods use a score based on a direct distance measure of set features to form a correct/incorrect match result. Rather than offering an operator a single outcome, the ranking paradigm is to give the operator a ranked list of possible matches and allow them to make the final decision. By utilising a Support Vector Machine (SVM) ranking method, a weighting on the appearance features can be learned that capitalises on the fact that not all image features are equally important to re-identification. Additionally, an Ensemble-RankSVM is proposed to address scalability issues by separating the training samples into smaller subsets and boosting the trained models. Finally, the thesis looks at a practical application of the ranking paradigm in a real-world setting. The system encompasses both the re-identification stage and the precursory extraction and tracking stages to form an aid for CCTV operators. Segmentation and detection are combined to extract relevant information from the video, while several combinations of matching techniques are combined with temporal priors to form a more comprehensive overall matching criterion. The effectiveness of the proposed approaches is tested on datasets obtained from a variety of challenging environments including offices, apartment buildings, airports and outdoor public spaces.
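The idea of a cumulative brightness mapping can be illustrated with a short sketch. It assumes, since the abstract does not spell out the formulation, that brightness values from all training samples of each camera are pooled into one histogram, and that a source level is mapped to the lowest target level whose cumulative frequency is at least that of the source level. The function names and the `levels` parameter are hypothetical.

```python
def cumulative_histogram(samples, levels=256):
    """Normalised cumulative brightness histogram pooled over all samples.

    `samples` is a list of pixel lists (one per training image), with
    integer brightness values in range(levels).
    """
    hist = [0] * levels
    for pixels in samples:
        for value in pixels:
            hist[value] += 1
    total = sum(hist)
    if total == 0:
        raise ValueError("at least one pixel is required")
    cumulative, running = [], 0
    for count in hist:
        running += count
        cumulative.append(running / total)
    return cumulative


def brightness_transfer(source_samples, target_samples, levels=256):
    """Lookup table mapping source brightness to target brightness.

    Assumption for this sketch: level i maps to the lowest target level
    whose cumulative frequency reaches that of source level i.
    """
    src = cumulative_histogram(source_samples, levels)
    tgt = cumulative_histogram(target_samples, levels)
    mapping, j = [], 0
    for i in range(levels):
        # Both curves are non-decreasing, so j never needs to move back.
        while j < levels - 1 and tgt[j] < src[i]:
            j += 1
        mapping.append(j)
    return mapping
```

Because the histograms are pooled before matching, a sparsely populated sample still contributes its pixels to the mapping, which is one way to read the abstract's point about retaining information from underrepresented samples.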

    Interactive and life-long learning for identification and categorization tasks

    This thesis focuses on life-long and interactive learning for recognition tasks. To achieve these targets, a separation into a short-term memory (STM) and a long-term memory (LTM) is proposed. For the incremental build-up of the STM, a similarity-based one-shot learning method was developed. Furthermore, two consolidation algorithms were proposed that enable the incremental learning of LTM representations. Based on the Learning Vector Quantization (LVQ) network architecture, an error-based node insertion rule and a node-dependent learning rate are proposed to enable life-long learning. For the learning of categories, a forward feature selection method was additionally introduced to separate co-occurring categories. In experiments, the performance of these learning methods was demonstrated on difficult visual recognition problems.
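As background for the LVQ-based part, a single update step of the standard LVQ1 rule can be sketched as follows. The per-node `rates` list stands in for a node-dependent learning rate; how the thesis sets those rates, and how its error-based node insertion works, is not stated in the abstract, so both are left out. All names here are illustrative.

```python
def nearest_prototype(prototypes, x):
    """Index of the prototype closest to x (squared Euclidean distance)."""
    def sq_dist(w):
        return sum((wi - xi) ** 2 for wi, xi in zip(w, x))
    return min(range(len(prototypes)), key=lambda k: sq_dist(prototypes[k]))


def lvq1_step(prototypes, labels, rates, x, y):
    """Apply one LVQ1 update in place and return the winning node index.

    The winner moves towards x if its label matches y and away from x
    otherwise. `rates[k]` is the learning rate of node k (an assumption
    standing in for a node-dependent rate).
    """
    k = nearest_prototype(prototypes, x)
    sign = 1.0 if labels[k] == y else -1.0
    prototypes[k] = [w + sign * rates[k] * (xi - w)
                     for w, xi in zip(prototypes[k], x)]
    return k
```

Keeping one rate per node means a frequently updated node can be given a smaller step than a newly inserted one without touching the update rule itself.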

    Adaptive classifier ensembles for face recognition in video-surveillance

    When implementing security systems such as intelligent video surveillance, the use of face images offers many advantages over other biometric traits. In particular, it allows potential individuals of interest to be detected discreetly and non-intrusively, which can be especially advantageous in situations such as detecting blacklisted individuals, searching archived data, or face re-identification. Nevertheless, face recognition still faces many difficulties specific to video surveillance. Among others, the lack of control over the observed environment implies many variations in lighting conditions, image resolution, motion blur, and the orientation and expression of faces. To recognize individuals, face models are usually generated from a limited number of reference images or videos collected during enrollment sessions. However, since these acquisitions do not necessarily take place under the same observation conditions, the reference data do not always represent the complexity of the real problem. Moreover, although face models can be adapted when new reference data become available, incremental learning based on significantly different data exposes the system to a risk of knowledge corruption. Finally, only part of this knowledge is actually relevant to the classification of a given image. In this thesis, a new system is proposed for the automatic detection of individuals of interest in video surveillance.
More specifically, the thesis focuses on a user-centered scenario in which a face recognition system is integrated into a decision-support tool that alerts an operator when an individual of interest is detected in video streams. Such a system must be able to add or remove individuals of interest during operation, and to update their face models over time with new reference data. To this end, the proposed system relies on concept change detection to guide a learning strategy involving classifier ensembles. Each individual enrolled in the system is represented by an ensemble of two-class classifiers, each specialized in different observation conditions detected in the reference data. In addition, a new rule for the dynamic fusion of classifier ensembles is proposed, using concept models to estimate the relevance of the classifiers with respect to each image to be classified. Finally, faces are tracked from frame to frame in order to group them into trajectories and accumulate decisions over time. In Chapter 2, concept change detection is first used to limit the growth in complexity of a template matching system that adopts a strategy of automatically updating its galleries. A new context-sensitive approach is proposed, in which only high-confidence images captured under different observation conditions are used to update the face models. Experiments were conducted with three public face databases. A standard template matching system was used, combined with a module for detecting changes in illumination conditions.
The results show that the proposed approach reduces the complexity of these systems while maintaining performance over time. In Chapter 3, a new adaptive system based on classifier ensembles is proposed for face recognition in video surveillance. It consists of an ensemble of incremental classifiers for each enrolled individual, and relies on concept change detection to refine the face models when new data become available. A hybrid strategy is proposed, in which classifiers are added to the ensembles only when an abrupt change is detected in the reference data. When the change is gradual, the associated classifiers are updated, which refines the knowledge specific to the corresponding concept. A particular implementation of this system is proposed, using ensembles of probabilistic Fuzzy-ARTMAP classifiers, generated and updated with a strategy based on dynamic particle swarm optimization, and using the Hellinger distance between histograms to detect changes. Simulations carried out on the Faces in Action (FIA) video surveillance database show that the proposed system maintains a high level of performance over time while limiting knowledge corruption. It achieves higher classification performance than a similar passive system (without change detection), as well as reference systems of the probabilistic kNN and TCM-kNN types. In Chapter 4, an evolution of the system presented in Chapter 3 is proposed, integrating mechanisms that dynamically adapt the system's behavior to changing observation conditions in operational mode.
A new fusion rule based on dynamic weighting is proposed, assigning each classifier a weight proportional to its estimated level of competence with respect to each image to be classified. Moreover, these competences are estimated using the concept models employed during training for change detection, which reduces the resources required in operational mode. An evolution of the implementation proposed in Chapter 3 is presented, in which concepts are modeled with the Fuzzy C-Means clustering algorithm and classifier fusion is performed with a weighted average. Experimental simulations with the FIA and Chokepoint video surveillance databases show that the proposed fusion method yields better results than the DSOLA dynamic selection method while using considerably fewer computational resources. In addition, the proposed method shows higher classification performance than reference systems of the probabilistic kNN, TCM-kNN and Adaptive Sparse Coding types.
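Two of the building blocks named in this abstract, the Hellinger distance between histograms and weighted-average fusion of classifier scores, are simple enough to sketch. The code below is a generic illustration only: how the thesis thresholds the distance to separate abrupt from gradual change, and how it derives the competence weights from the concept models, is not given in the abstract. Function names are hypothetical.

```python
import math


def hellinger_distance(hist_p, hist_q):
    """Hellinger distance between two histograms with matching bins.

    Both histograms are normalised to sum to 1 first. The result lies
    in [0, 1]: 0 for identical distributions, 1 for disjoint support.
    """
    total_p, total_q = sum(hist_p), sum(hist_q)
    if total_p == 0 or total_q == 0:
        raise ValueError("histograms must not be empty")
    # Bhattacharyya coefficient of the two normalised histograms.
    bc = sum(math.sqrt((p / total_p) * (q / total_q))
             for p, q in zip(hist_p, hist_q))
    # Clamp to guard against tiny negative values from rounding.
    return math.sqrt(max(0.0, 1.0 - bc))


def weighted_average_fusion(scores, weights):
    """Fuse classifier scores with a weighted average.

    `weights` play the role of per-classifier competence estimates for
    the image being classified (an assumption for this sketch).
    """
    total = sum(weights)
    if total == 0:
        raise ValueError("at least one weight must be non-zero")
    return sum(s * w for s, w in zip(scores, weights)) / total
```

In a change-detection setting, the distance would be computed between a histogram of new reference data and a stored concept histogram, and compared against a threshold chosen by the system designer.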

    Self-Supervised Joint Encoding of Motion and Appearance for First Person Action Recognition

    Wearable cameras are becoming more and more popular in several applications, increasing the interest of the research community in developing approaches for recognizing actions from the first-person point of view. An open challenge in egocentric action recognition is that videos lack detailed information about the main actor's pose and thus tend to record only parts of the movement when focusing on manipulation tasks. Thus, the amount of information about the action itself is limited, making it crucial to understand the manipulated objects and their context. Many previous works addressed this issue with two-stream architectures, where one stream is dedicated to modeling the appearance of objects involved in the action, and another to extracting motion features from optical flow. In this paper, we argue that learning features jointly from these two information channels is beneficial for better capturing the spatio-temporal correlations between them. To this end, we propose a single-stream architecture able to do so, thanks to the addition of a self-supervised block that uses a pretext motion prediction task to intertwine motion and appearance knowledge. Experiments on several publicly available databases show the power of our approach.