
    Multigranularity Representations for Human Inter-Actions: Pose, Motion and Intention

    Tracking people and their body pose in videos is a central problem in computer vision. Standard tracking representations reason about the temporal coherence of detected people and body parts. They have difficulty tracking targets under partial occlusions or in rare body poses, where detectors often fail, since the number of training examples is often too small to cover the exponential variability of such configurations. We propose tracking representations that track and segment people and their body pose in videos by exploiting information at multiple detection and segmentation granularities when available: whole body, parts, or point trajectories. Detections and motion estimates provide contradictory information in the case of false-alarm detections or leaking motion affinities. We consolidate this contradictory information via graph steering, an algorithm for simultaneous detection and co-clustering in a two-granularity graph of motion trajectories and detections that corrects motion leakage between correctly detected objects while being robust to false alarms and spatially inaccurate detections. We first present a motion segmentation framework that exploits the long-range motion of point trajectories and the large spatial support of image regions. We show that the resulting video segments adapt to targets under partial occlusions and deformations. Second, we augment motion-based representations with object detection to deal with motion leakage. We demonstrate how to combine dense optical flow trajectory affinities with repulsions from confident detections to reach a global consensus of detection and tracking in crowded scenes. Third, we study human motion and pose estimation. We segment hard-to-detect, fast-moving body limbs from their surrounding clutter and match them against pose exemplars to detect body pose under fast motion. We employ on-the-fly human body kinematics to improve the tracking of body joints under wide deformations. We use the motion segmentability of body parts to re-rank a set of body joint candidate trajectories and jointly infer multi-frame body pose and video segmentation. We show empirically that such a multi-granularity tracking representation is worthwhile, obtaining significantly more accurate multi-object tracking and detailed body pose estimation on popular datasets.
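
    The consolidation of attractive motion affinities with detection-driven repulsions can be illustrated as clustering on a signed trajectory graph. The sketch below is a minimal illustration of that general idea, not the thesis's graph steering algorithm: the velocity-based affinity, the rule that trajectories claimed by different detection boxes repel, and all parameters are assumptions.

```python
import numpy as np
from scipy.linalg import eigh

def motion_affinity(trajs, sigma=1.0):
    """Attraction between point trajectories from velocity similarity.
    trajs: (N, T, 2) array of N point trajectories over T frames."""
    vel = np.diff(trajs, axis=1)                          # (N, T-1, 2)
    d = np.linalg.norm(vel[:, None] - vel[None, :], axis=-1).max(axis=-1)
    return np.exp(-d ** 2 / (2 * sigma ** 2))

def detection_repulsion(trajs, boxes, strength=1.0):
    """Repulsion between trajectories starting inside different
    detection boxes (a crude stand-in for the detection granularity)."""
    p = trajs[:, 0]                                       # first-frame points
    owner = np.full(len(trajs), -1)
    for k, (x0, y0, x1, y1) in enumerate(boxes):
        inside = ((p[:, 0] >= x0) & (p[:, 0] <= x1) &
                  (p[:, 1] >= y0) & (p[:, 1] <= y1))
        owner[inside] = k
    conflict = ((owner[:, None] != owner[None, :]) &
                (owner[:, None] >= 0) & (owner[None, :] >= 0))
    return -strength * conflict.astype(float)

def steer_split(trajs, boxes):
    """Two-way trajectory split via the signed graph Laplacian:
    motion attraction pulls trajectories together, detection
    repulsion pushes apart those claimed by different objects."""
    W = motion_affinity(trajs) + detection_repulsion(trajs, boxes)
    L = np.diag(np.abs(W).sum(axis=1)) - W                # signed Laplacian
    _, vecs = eigh(L)
    return (vecs[:, 1] > 0).astype(int)                   # Fiedler-style cut
```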

    Vision based system for detecting and counting mobility aids in surveillance videos

    Automatic surveillance video analysis is popular among computer vision researchers due to its wide range of applications that require automated systems. Automated systems aim to replace manual analysis of videos, which is tiresome, expensive, and time-consuming. Image and video processing techniques are often used in the design of automatic detection and monitoring systems. Compared with indoor videos, outdoor surveillance videos are often difficult to process due to the uncontrolled environment, camera angle, and varying lighting and weather conditions. This research aims to contribute to the computer vision field by proposing an object detection and tracking algorithm that can handle multi-object and multi-class scenarios. The problem is solved by developing an application that counts disabled pedestrians in surveillance videos by automatically detecting and tracking mobility aids and pedestrians. The application demonstrates that the proposed ideas achieve the desired outcomes. There are extensive studies on pedestrian detection and gait analysis in the computer vision field, but limited work has been carried out on identifying disabled pedestrians or mobility aids. Detecting mobility aids in videos is challenging since the disabled person often occludes the mobility aid, and its visibility depends on the walking direction with respect to the camera. For example, a walking stick is visible most of the time in a front-on view but is occluded when it is on the walker's rear side. Furthermore, people use various mobility aids, and their make and type change over time as technology advances. The system should detect the majority of mobility aids to report reliable counting data. The literature review revealed that no system exists for detecting disabled pedestrians or mobility aids in surveillance videos. A lack of annotated image data containing mobility aids is also an obstacle to developing a machine-learning-based solution for detecting mobility aids. In the first part of this thesis, we explored video data of moving pedestrians to extract gait signals using manual and automated procedures. Manual extraction involved marking the pedestrians' head and leg locations and analysing those signals in the time domain. Analysis of stride length and velocity features indicates an abnormality if a walker is physically disabled. The automated system was built by combining the YOLO object detector, GMM-based foreground modelling, and star skeletonisation in a pipeline to extract the gait signal. This automated system failed to recognise a disabled person from their gait due to various factors: poor localisation by YOLO, environmental constraints, viewing angle, occlusions, shadows, and imperfections in foreground modelling, object segmentation, and silhouette extraction. In the later part of this thesis, we developed a CNN-based approach to detect mobility aids and pedestrians. The task of identifying and counting disabled pedestrians in surveillance videos is divided into three sub-tasks: mobility aid and person detection, tracking and data association of detected objects, and counting healthy and disabled pedestrians. A modern object detector (YOLO), an improved data association algorithm (SORT), and a new pairing approach are applied to complete the three sub-tasks. Improving the SORT algorithm and introducing the pairing approach are notable contributions to the computer vision field. The original SORT algorithm is strictly single-class and lacks an object counting feature; it is enhanced here to be multi-class and to track accelerating or temporarily occluded objects. The pairing strategy associates a mobility aid with the nearest pedestrian and monitors the pair over time to see if it is reliable. A reliable pair represents a disabled pedestrian, and counting reliable pairs yields the number of disabled people in the video. The thesis also introduces an image database gathered as part of this study. The dataset comprises 5819 images belonging to eight object classes: five mobility aids, pedestrians, cars, and bicycles. The dataset was needed to train a CNN that can detect mobility aids in videos. The proposed mobility aid counting system is evaluated on a range of outdoor surveillance videos with real-world scenarios. The results show that the proposed solution performs satisfactorily in detecting mobility aids in outdoor surveillance videos. The counting accuracy of 94% on test videos meets the design goals set by the advocacy group that needs this application. Most test videos contained objects from multiple classes. The system detected five mobility aids (wheelchair, crutch, walking stick, walking frame, and mobility scooter), pedestrians, and two distractors (car and bicycle). Training the system on the distractor classes ensured that it can distinguish mobility aids from visually similar objects. In some cases, the convolutional neural network reports a mobility aid with an incorrect type; for example, the shapes of a crutch and a walking stick are very much alike, so the system confuses one with the other. However, this does not affect the final counts, as the aim was to obtain the overall count of mobility aids (of any type), and determining the exact type of mobility aid is optional.
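
    The pairing strategy described above admits a compact illustration. This is a minimal sketch under stated assumptions: centroid distance as the proximity measure, and hypothetical thresholds min_frames and max_dist standing in for the thesis's reliability test.

```python
from collections import defaultdict
import numpy as np

def centroid(box):
    x0, y0, x1, y1 = box
    return np.array([(x0 + x1) / 2.0, (y0 + y1) / 2.0])

class PairCounter:
    """Pair each tracked mobility aid with its nearest tracked
    pedestrian; once a pair persists for min_frames frames it is
    deemed reliable and counted as one disabled pedestrian."""

    def __init__(self, min_frames=10, max_dist=80.0):
        self.min_frames = min_frames      # frames before a pair is reliable
        self.max_dist = max_dist          # max centroid distance in pixels
        self.pair_age = defaultdict(int)  # (aid_id, person_id) -> frames seen
        self.counted = set()              # aid ids already counted

    def update(self, aids, persons):
        """aids, persons: dict of track_id -> box (x0, y0, x1, y1)."""
        for aid_id, aid_box in aids.items():
            if not persons:
                continue
            # nearest pedestrian by centroid distance
            pid, d = min(((p, np.linalg.norm(centroid(b) - centroid(aid_box)))
                          for p, b in persons.items()), key=lambda t: t[1])
            if d <= self.max_dist:
                self.pair_age[(aid_id, pid)] += 1
                if (self.pair_age[(aid_id, pid)] >= self.min_frames
                        and aid_id not in self.counted):
                    self.counted.add(aid_id)

    @property
    def disabled_count(self):
        return len(self.counted)
```

    Calling update once per frame with the multi-class tracker's output and reading disabled_count at the end of the video would reproduce the counting behaviour sketched here.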

    Multi-object tracking in video using labeled random finite sets

    The safety of industrial mobile platforms (such as forklifts and boom lifts) is of major concern in the world today as industry embraces the concepts of Industry 4.0. Existing safety methods are predominantly based on Radio Frequency Identification (RFID) technology and can therefore only determine the distance at which a pedestrian wearing an RFID tag is standing. Other methods use expensive laser scanners to map the surroundings and warn the driver accordingly. The aim of this research project is to improve the safety of industrial mobile platforms by detecting and tracking pedestrians in the path of the mobile platform using readily available, cheap camera modules. To achieve this aim, this research focuses on multi-object tracking, one of the most ubiquitously addressed problems in the field of computer vision. Algorithms that can track targets under severe conditions, such as a varying number of objects, occlusion, illumination changes, and abrupt object movements, are investigated in this research project. Furthermore, a substantial focus is given to improving accuracy and performance, and to handling misdetections and false alarms. To formulate these algorithms, the recently introduced concept of Random Finite Sets (RFS) is used as the underlying mathematical framework. The algorithms formulated to meet the above criteria were tested on standard visual tracking datasets, as well as on a dataset created by our research group, using performance and accuracy metrics that are widely used in the computer vision literature. The results were compared with numerous state-of-the-art methods and are shown to outperform them or perform favourably in terms of these metrics.
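
    To give a flavour of RFS-based filtering, the sketch below implements one predict/update cycle of the (unlabeled) Gaussian-mixture PHD filter, a simpler relative of the labeled RFS filters this line of work builds on: the multi-target state is a set, and its intensity is approximated by a weighted Gaussian mixture. The constant-velocity model, survival/detection probabilities, and clutter density are illustrative assumptions; a labeled filter would additionally attach a track label to every component.

```python
import numpy as np

# Constant-velocity model on the image plane: state [x, y, vx, vy].
dt, pS, pD, clutter = 1.0, 0.99, 0.9, 1e-4
F = np.array([[1, 0, dt, 0], [0, 1, 0, dt], [0, 0, 1, 0], [0, 0, 0, 1]])
H = np.array([[1.0, 0, 0, 0], [0, 1.0, 0, 0]])
Q = 0.1 * np.eye(4)   # process noise
R = 4.0 * np.eye(2)   # measurement noise

def predict(comps, births):
    """comps: list of (weight, mean, cov) Gaussian components of the
    PHD intensity; births: components for possible new targets."""
    return [(w * pS, F @ m, F @ P @ F.T + Q) for w, m, P in comps] + births

def update(comps, Z):
    """One GM-PHD measurement update for a list of 2-D detections Z."""
    out = [(w * (1 - pD), m, P) for w, m, P in comps]   # missed detections
    for z in Z:
        cand = []
        for w, m, P in comps:
            S = H @ P @ H.T + R
            K = P @ H.T @ np.linalg.inv(S)
            r = z - H @ m
            q = np.exp(-0.5 * r @ np.linalg.solve(S, r)) / np.sqrt(
                (2 * np.pi) ** 2 * np.linalg.det(S))
            cand.append((pD * w * q, m + K @ r, (np.eye(4) - K @ H) @ P))
        norm = clutter + sum(w for w, _, _ in cand)
        out += [(w / norm, m, P) for w, m, P in cand]
    return out

def estimate(comps, thresh=0.5):
    """Report a target for every component with sufficient weight."""
    return [m for w, m, _ in comps if w > thresh]
```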

    Model-Based Environmental Visual Perception for Humanoid Robots

    The visual perception of a robot should answer two fundamental questions: What? and Where? In order to reply to these questions properly and efficiently, it is essential to establish a bidirectional coupling between the external stimuli and the internal representations. This coupling links the physical world with the inner abstraction models through sensor transformation, recognition, matching, and optimization algorithms. The objective of this PhD thesis is to establish this sensor-model coupling.

    QUIS-CAMPI: Biometric Recognition in Surveillance Scenarios

    Concerns about individual security have justified the increasing number of surveillance cameras deployed in both private and public spaces. However, contrary to popular belief, these devices are in most cases used solely for recording, instead of feeding intelligent analysis processes capable of extracting information about the observed individuals. Thus, even though video surveillance has already proved to be essential for solving multiple crimes, obtaining relevant details about the subjects that took part in a crime depends on the manual inspection of recordings. As such, the current goal of the research community is the development of automated surveillance systems capable of monitoring and identifying subjects in surveillance scenarios. Accordingly, the main goal of this thesis is to improve the performance of biometric recognition algorithms on data acquired in surveillance scenarios. In particular, we aim to design a visual surveillance system capable of acquiring biometric data at a distance (e.g., face, iris, or gait) without requiring human intervention in the process, as well as to devise biometric recognition methods robust to the degradation factors resulting from the unconstrained acquisition process. Regarding the first goal, the analysis of data acquired by typical surveillance systems shows that large acquisition distances significantly decrease the resolution of biometric samples, so their discriminability is not sufficient for recognition purposes. In the literature, several works identify Pan-Tilt-Zoom (PTZ) cameras as the most practical way of acquiring high-resolution imagery at a distance, particularly when using a master-slave configuration. In the master-slave configuration, the video acquired by a typical surveillance camera is analyzed to obtain regions of interest (e.g., car, person), and these regions are subsequently imaged at high resolution by the PTZ camera. Several methods have already shown that this configuration can be used for acquiring biometric data at a distance. Nevertheless, these methods failed to provide effective solutions to the typical challenges of this strategy, restricting its use in surveillance scenarios. Accordingly, this thesis proposes two methods to support the development of a biometric data acquisition system based on the cooperation of a PTZ camera with a typical surveillance camera. The first proposal is a camera calibration method capable of accurately mapping the coordinates of the master camera to the pan/tilt angles of the PTZ camera. The second proposal is a camera scheduling method for determining, in real time, the sequence of acquisitions that maximizes the number of different targets observed while minimizing the cumulative transition time. To achieve the first goal of this thesis, both methods were combined with state-of-the-art approaches from the human monitoring field to develop a fully automated surveillance system capable of acquiring biometric data at a distance without human cooperation, designated the QUIS-CAMPI system. The QUIS-CAMPI system is the basis for pursuing the second goal of this thesis. The analysis of the performance of state-of-the-art biometric recognition approaches shows that they attain almost ideal recognition rates on unconstrained data. However, this performance is incongruous with the recognition rates observed in surveillance scenarios.
Taking into account the drawbacks of current biometric datasets, this thesis introduces a novel dataset comprising biometric samples (face images and gait videos) acquired by the QUIS-CAMPI system at distances ranging from 5 to 40 meters and without human intervention in the acquisition process. This set allows an objective assessment of the performance of state-of-the-art biometric recognition methods on data that truly encompass the covariates of surveillance scenarios. As such, this set was used to promote the first international challenge on biometric recognition in the wild. This thesis describes the evaluation protocols adopted, along with the results obtained by the nine methods specially designed for this competition. In addition, the data acquired by the QUIS-CAMPI system were crucial for accomplishing the second goal of this thesis, i.e., the development of methods robust to the covariates of surveillance scenarios. The first proposal is a method for detecting corrupted features in biometric signatures by means of a redundancy analysis algorithm. The second proposal is a caricature-based face recognition approach capable of enhancing recognition performance by automatically generating a caricature from a 2D photo. The experimental evaluation of these methods shows that both approaches improve recognition performance on unconstrained data.
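
    The master-to-PTZ mapping can be illustrated as a regression fitted to offline correspondences (point the PTZ at landmarks visible in the master view and record pixel coordinates alongside pan/tilt angles). The quadratic polynomial model and least-squares fit below are assumptions for illustration; the calibration method actually proposed in the thesis may differ substantially.

```python
import numpy as np

def design(uv):
    """Quadratic polynomial features of master-camera pixel coords."""
    u, v = uv[:, 0], uv[:, 1]
    return np.stack([np.ones_like(u), u, v, u * v, u ** 2, v ** 2], axis=1)

def fit_master_to_ptz(uv, pan_tilt):
    """Least-squares fit from master pixels to PTZ pan/tilt angles.
    uv: (n, 2) pixel coordinates; pan_tilt: (n, 2) recorded angles."""
    coeffs, *_ = np.linalg.lstsq(design(uv), pan_tilt, rcond=None)
    return coeffs                     # (6, 2): one column per angle

def master_to_ptz(coeffs, uv):
    """Predict the pan/tilt command for new master-camera pixels."""
    return design(np.atleast_2d(np.asarray(uv, float))) @ coeffs
```

    With a few dozen well-spread correspondences, any region of interest detected in the master view can then be mapped to a pan/tilt command for high-resolution acquisition.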

    Human body analysis using depth data

    Human body analysis is one of the broadest areas within the computer vision field. Researchers have put strong effort into human body analysis, especially over the last decade, due to technological improvements in both video cameras and processing power. Human body analysis covers topics such as person detection and segmentation, human motion tracking, and action and behavior recognition. Although human beings perform all these tasks naturally, they pose a challenging problem from a computer vision point of view. Adverse situations such as viewing perspective, clutter and occlusions, lighting conditions, or variability of behavior amongst persons may turn human body analysis into an arduous task. In the computer vision field, the evolution of research is usually tightly related to the technological progress of camera sensors and computer processing power. Traditional human body analysis methods are based on color cameras; the information is extracted from raw color data, strongly limiting such approaches. An important quality leap was achieved by introducing the multiview concept, that is, having multiple color cameras recording a single scene at the same time. With multiview approaches, 3D information becomes available by means of stereo matching algorithms. Having 3D information is a key aspect in human motion analysis, since the human body moves in a three-dimensional space; problems such as occlusion and clutter may thus be overcome. The appearance of commercial depth cameras has brought a second leap to the human body analysis field. While traditional multiview approaches required a cumbersome and expensive setup, as well as precise camera calibration, novel depth cameras directly provide 3D information with a single camera sensor. Furthermore, depth cameras can be rapidly installed in a wide range of situations, enlarging the range of applications with respect to multiview approaches. Moreover, since depth cameras are based on infrared light, they do not suffer from illumination variations. In this thesis, we focus on the study of depth data applied to the human body analysis problem. We propose novel ways of describing depth data through specific descriptors that emphasize characteristics of the scene helpful for further body analysis. These descriptors exploit the special 3D structure of depth data to outperform generalist 3D descriptors and color-based ones. We also study the problem of person detection, proposing a highly robust and fast method to detect heads. This method is extended to a hand tracker, which is used throughout the thesis as a tool to enable further research. In the remainder of this dissertation, we focus on the hand analysis problem as a subarea of human body analysis. Given the recent appearance of depth cameras, there is a lack of public datasets. We contribute a dataset for hand gesture recognition and fingertip localization using depth data. This dataset acts as the starting point of two proposals for hand gesture recognition and fingertip localization based on classification techniques. In these methods, we also exploit the above-mentioned descriptor proposals to finely adapt to the nature of depth data.
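
    As one concrete example of a descriptor that exploits the 3D structure of depth data (an illustrative stand-in, not one of the descriptors proposed in the thesis), per-pixel surface normals can be estimated directly from a depth map and the camera focal lengths:

```python
import numpy as np

def depth_normals(depth, fx, fy):
    """Approximate per-pixel surface normals from a depth map.
    depth: (H, W) array in meters; fx, fy: focal lengths in pixels.
    Under the pinhole model, one pixel spans roughly depth/fx meters
    horizontally, which converts pixel gradients to metric ones."""
    dzdu = np.gradient(depth, axis=1)            # depth change per pixel, x
    dzdv = np.gradient(depth, axis=0)            # depth change per pixel, y
    z = np.maximum(depth, 1e-6)                  # guard against zero depth
    nx = -dzdu * fx / z                          # -dz/dx in metric units
    ny = -dzdv * fy / z                          # -dz/dy in metric units
    nz = np.ones_like(depth)
    n = np.stack([nx, ny, nz], axis=-1)
    return n / np.linalg.norm(n, axis=-1, keepdims=True)
```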

    3D Human Motion Tracking and Pose Estimation using Probabilistic Activity Models

    This thesis presents work on generative approaches to human motion tracking and pose estimation, where a geometric model of the human body is used for comparison with observations. The existing generative tracking literature can be quite clearly divided into two groups. First, approaches that attempt to solve a difficult high-dimensional inference problem in the body model's full or ambient pose space, recovering freeform or unknown activity. Second, approaches that restrict inference to a low-dimensional latent embedding of the full pose space, recovering activity for which training data is available, or known activity. Significant advances have been made in each of these subgroups. Given sufficiently rich multiocular observations and plentiful computational resources, high-dimensional approaches have been proven to track fast and complex unknown activities robustly. Conversely, low-dimensional approaches have been able to support monocular tracking and to significantly reduce computational costs for the recovery of known activity. However, their competing advantages, although complementary, have remained disjoint. The central aim of this thesis is to combine low- and high-dimensional generative tracking techniques to benefit from the best of both approaches. First, a simple generative tracking approach is proposed for tracking known activities in a latent pose space using only monocular or binocular observations. A hidden Markov model (HMM) is used to provide dynamics and constrain a particle-based search for poses. The ability of the HMM to classify as well as synthesise poses means that the approach naturally extends to modelling a number of different known activities in a single joint-activity latent space. Second, an additional low-dimensional approach is introduced to permit transitions between segmented known-activity training data by allowing particles to move between activity manifolds. Both low-dimensional approaches are then fairly and efficiently combined with a simultaneous high-dimensional generative tracking task in the ambient pose space. This combination allows the recovery of sequences containing multiple known and unknown human activities at an appropriate (dynamic) computational cost. Finally, a rich hierarchical embedding of the ambient pose space is investigated. This representation allows inference to progress from a single full-body or global non-linear latent pose space, through a number of gradually smaller part-based latent models, to the full ambient pose space. By preserving long-range correlations present in training data, the positions of occluded limbs can be inferred during tracking. Alternatively, by breaking the implied coordination between part-based models, novel activity combinations, or composite activity, may be recovered.
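
    The first contribution, an HMM providing dynamics for a particle-based search in a latent pose space, can be sketched as a particle filter in which each particle carries a discrete activity state plus a low-dimensional latent coordinate. In the sketch below, emit_pose and likelihood are placeholder callables for the pose synthesis and observation models, and the Gaussian diffusion and multinomial resampling are assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def hmm_particle_step(particles, states, A, emit_pose, likelihood, obs,
                      noise=0.05):
    """One filtering step. particles: (n, d) latent coordinates;
    states: (n,) discrete HMM activity states; A: (S, S) transition
    matrix; emit_pose(state, latent) -> full-body pose;
    likelihood(pose, obs) -> scalar observation likelihood."""
    n = len(particles)
    # HMM dynamics: sample each particle's next activity state
    states = np.array([rng.choice(len(A), p=A[s]) for s in states])
    # diffuse latent coordinates within the activity manifold
    particles = particles + rng.normal(0.0, noise, particles.shape)
    # weight by how well the synthesised pose explains the observation
    w = np.array([likelihood(emit_pose(s, x), obs)
                  for s, x in zip(states, particles)])
    w = w / w.sum()
    # multinomial resampling
    idx = rng.choice(n, size=n, p=w)
    return particles[idx], states[idx]
```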

    Toward Realizing Human-Robot Interaction with Flying Robots: A User-Accompanying Model and Sensing Interfaces

    Degree type: Doctorate by coursework. Dissertation committee: (Chair) Associate Professor 矢入 健久, The University of Tokyo; Professor 堀 浩一, The University of Tokyo; Professor 岩崎 晃, The University of Tokyo; Professor 土屋 武司, The University of Tokyo; Professor 溝口 博, Tokyo University of Science. The University of Tokyo (東京大学).

    Visual Data Association: Tracking, Re-identification and Retrieval

    As the information society develops rapidly, large amounts of multimedia data are generated, shared, and transferred on various electronic devices and the Internet every minute. Hence, building intelligent systems capable of associating these visual data at diverse locations and different times is essential, and will significantly facilitate understanding and identifying where an object came from and where it is going. The estimated traces of motions or changes increasingly make it feasible to apply advanced algorithms to real-world applications, including human-computer interaction, robotic navigation, security in surveillance, biological characteristics association, and civil structure vibration detection. However, due to inherent challenges such as ambiguity, heterogeneity, noisy data, large-scale properties, and unknown variations, visual data association is currently far from established. Therefore, this thesis focuses on associating visual data at diverse locations and different times for the tasks of tracking, re-identification, and retrieval. More specifically, three situations (a single camera, across multiple cameras, and across multiple modalities) have been investigated, and four algorithms have been developed at different levels.

Chapter 3: The first algorithm explores an ensemble system for robust object tracking, primarily considering the independence of classifier members. An empirical analysis first shows that object tracking is a non-i.i.d. sampling, under-sampled, and incomplete-dataset problem. A set of independent classifiers, trained sequentially on different small datasets, is then dynamically maintained to overcome this particular machine learning problem. Thus, for every challenge, an optimal classifier can be approximated in a subspace spanned by the selected competitive classifiers.

Chapter 4: The second method improves object tracking by exploiting a winner-take-all strategy to select the most suitable trackers. This naturally extends the ensemble concept of the first topic to a more general idea: a multi-expert system in which members come from different function spaces, so the diversity of the system is more likely to be amplified. Based on a large public dataset, a model predicting the performance of different trackers on various challenges can be obtained offline. The learned structural regression model can then be used directly to efficiently select the winning tracker online.

Chapter 5: The third algorithm learns cross-view identities for fast person re-identification in a cross-camera setting, which differs significantly from the single-view object tracking of the first two topics. Two sets of discriminative hash functions, one per view, are learned by simultaneously minimising their distance in the Hamming space and maximising the cross-covariance and margin. Thus, by embedding images into the Hamming space, similar binary codes can be found for images of the same person captured from different views.

Chapter 6: The fourth model develops a novel hetero-manifold regularisation framework for efficient cross-modal retrieval. Compared with the first two settings, this is a more general and complex topic, in which the samples may be images captured at a very great distance or after a very long time, or even text, voice, and other formats. Taking advantage of the hetero-manifold, the similarity between each pair of heterogeneous data items can be naturally measured by third-order random walks on the hetero-manifold.

It is concluded that, by fully exploiting the algorithms developed for these three situations, an integrated trace of an object moving anywhere can be discovered.
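
    For the cross-view hashing of Chapter 5, once the two view-specific projections have been learned (the joint objective, minimising Hamming distance while maximising cross-covariance and margin, is outside this sketch), re-identification reduces to binarisation and Hamming ranking. W_probe and W_gallery below stand in for already-learned projections; the linear sign-hash form is an assumption.

```python
import numpy as np

def hash_codes(X, W):
    """Binary codes as the sign of a learned linear projection.
    X: (n, d) image features; W: (d, b) projection for this view."""
    return (X @ W > 0).astype(np.uint8)

def rank_gallery(probe_feat, gallery_feats, W_probe, W_gallery):
    """Rank gallery identities by Hamming distance to the probe:
    images of the same person should map to nearby binary codes
    even though they come from different camera views."""
    p = hash_codes(probe_feat[None, :], W_probe)[0]
    G = hash_codes(gallery_feats, W_gallery)
    dists = np.count_nonzero(G != p, axis=1)
    return np.argsort(dists), dists
```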