14 research outputs found

    Temporal Model Adaptation for Person Re-Identification

    Person re-identification is an open and challenging problem in computer vision. Most efforts have gone into either designing the best feature representation or learning the optimal matching metric. Most approaches have neglected the problem of adapting the selected features or the learned model over time. To address this problem, we propose a temporal model adaptation scheme with a human in the loop. We first introduce a similarity-dissimilarity learning method that can be trained incrementally by means of a stochastic alternating direction method of multipliers (ADMM) optimization procedure. Then, to achieve temporal adaptation with limited human effort, we exploit a graph-based approach to present the user with only the most informative probe-gallery matches that should be used to update the model. Results on three datasets show that our approach performs on par with or even better than state-of-the-art approaches while reducing the manual pairwise labeling effort by about 80%.
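    The idea of showing the user only the most informative probe-gallery matches can be illustrated with a much simpler stand-in than the graph-based selection the abstract describes. The sketch below uses plain uncertainty sampling: it assumes each probe-gallery pair already has a signed matching score and picks the pairs closest to the decision threshold. The function name, the `threshold` and `budget` parameters, and the example scores are all hypothetical.

```python
from typing import List, Sequence


def select_pairs_for_labeling(
    scores: Sequence[float], threshold: float = 0.0, budget: int = 2
) -> List[int]:
    """Return indices of the `budget` pairs whose score is closest to `threshold`.

    This is uncertainty sampling, used here as an assumed stand-in for the
    graph-based selection mentioned in the abstract. It is not that method.
    """
    # Rank pairs by how close their score lies to the decision boundary.
    ranked = sorted(range(len(scores)), key=lambda i: abs(scores[i] - threshold))
    return ranked[:budget]


# Hypothetical matching scores for five probe-gallery pairs.
pair_scores = [2.1, -0.1, 0.4, -3.0, 0.05]
chosen = select_pairs_for_labeling(pair_scores, threshold=0.0, budget=2)
print(chosen)
```

    Only the selected pairs would be sent to the human annotator, and their labels would then drive the incremental model update.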

    Agent-based framework for person re-identification

    In computer-based human object re-identification, a detected human is recognised to a level sufficient to re-identify a tracked person either in a different camera capturing the same individual, often at a different angle, or in the same camera at a different time and/or with the person approaching the camera at a different angle. Instead of relying on face recognition technology, such systems study the clothing of the individuals being monitored and/or the objects being carried to establish correspondence and hence re-identify the human object. Unfortunately, present human-object re-identification systems treat the entire human object as one connected region when deciding on the similarity of two objects being matched. This assumption has a major drawback: when a person is partially occluded, part of the occluding foreground is picked up and used in matching. Our research revealed that when a human observer carries out a manual human-object re-identification task, attention is often drawn to some parts of the human figure more than others, e.g. the face, a brightly coloured shirt, or texture patterns in clothing, and occluding parts are ignored. In this thesis, a novel multi-agent based framework is proposed for the design of a human object re-identification system. Initially, HOG-based feature extraction is used in an SVM-based classification of a human object as either full-body or half-body. Subsequently, the relative visual significance of the top and bottom parts of the human in re-identification is quantified by analysing Gray Level Co-occurrence based texture features and colour histograms obtained in the HSV colour space. Accordingly, different weights are assigned to the top and bottom of the human body using a novel probabilistic approach. The weights are then used to modify the adopted Hybrid Spatiogram and Covariance Descriptor (HSCD) feature based re-identification algorithm.
    A significant novelty of the human object re-identification systems proposed in this thesis is the agent-based design procedure adopted, which separates the use of computer vision algorithms for feature extraction, comparison, etc. from the decision-making process of re-identification. Multiple agents are assigned to execute different algorithmic tasks, and the agents communicate to make the required logical decisions. Detailed experimental results are provided to demonstrate that the proposed multi-agent based framework for human object re-identification performs significantly better than state-of-the-art algorithms. Further, it is shown that the design flexibility and scalability of the proposed system allow it to be effectively utilised in more complex computer vision based video analytic/forensic tasks, often conducted within distributed, multi-camera systems.
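    The effect of weighting the top and bottom of the body differently can be sketched as a convex combination of two part-level distances. The abstract does not give the thesis's probabilistic weighting scheme or the HSCD distance, so the function names, the single weight `w_top`, and the example distances below are assumptions made for illustration only.

```python
from typing import Dict, List, Tuple


def weighted_part_distance(d_top: float, d_bottom: float, w_top: float) -> float:
    """Combine top and bottom body-part distances with weights w_top and 1 - w_top."""
    if not 0.0 <= w_top <= 1.0:
        raise ValueError("w_top must lie in [0, 1]")
    return w_top * d_top + (1.0 - w_top) * d_bottom


def rank_gallery(part_dists: Dict[str, Tuple[float, float]], w_top: float) -> List[str]:
    """Sort gallery identities by combined distance to the probe, best match first."""
    return sorted(
        part_dists,
        key=lambda gid: weighted_part_distance(*part_dists[gid], w_top),
    )


# Hypothetical (top, bottom) descriptor distances between one probe and two gallery entries.
gallery = {"A": (0.2, 0.9), "B": (0.6, 0.3)}

top_heavy = rank_gallery(gallery, w_top=0.8)     # the top half dominates the match
bottom_heavy = rank_gallery(gallery, w_top=0.2)  # the bottom half dominates the match
print(top_heavy, bottom_heavy)
```

    With these made-up distances the ranking flips depending on which body part is trusted more, which is the behaviour part-level weighting is meant to exploit, for example when one half of the body is occluded.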

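    Convolutional neural networks based on visual rhythms and adaptive fusion for a multi-stream architecture applied to human action recognition (original title: Redes neurais convolucionais baseadas em ritmos visuais e fusão adaptativa para uma arquitetura de múltiplos canais aplicada ao reconhecimento de ações humanas)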

    Advisors: Hélio Pedrini, Marcelo Bernardes Vieira. Thesis (doctorate), Universidade Estadual de Campinas, Instituto de Computação.
    Abstract: The large amount of video data produced and released every day makes visual inspection by a human operator impractical. However, the content of these videos can be useful for various important tasks, such as surveillance and health monitoring. Therefore, automatic methods are needed to detect and understand relevant events in videos. The problem addressed in this work is the recognition of human actions in videos, which aims to classify the action being performed by one or more actors. The complexity of the problem and the volume of video data suggest the use of deep learning-based techniques. However, unlike image-related problems, there is neither a great variety of well-established specific architectures nor annotated datasets as large as image-based ones. To circumvent these limitations, we propose and analyze a multi-stream architecture containing image-based networks pre-trained on the large ImageNet dataset. Different image representations are extracted from the videos to feed the streams, in order to provide complementary information to the system. Here, we propose new streams based on visual rhythm that encode longer-term information than still frames and optical flow.
    As important as the definition of representative and complementary aspects is the choice of proper combination methods that exploit the strengths of each modality. Thus, we also analyze different fusion approaches to combine the modalities. In order to define the best parameters of our fusion methods using the training set, we have to reduce overfitting in individual modalities; otherwise, the 100%-accurate outputs would not offer a realistic and relevant representation for the fusion method. Thus, we investigate an early stopping technique to train individual networks. In addition to reducing overfitting, this method also reduces the training cost, since it usually requires fewer epochs to complete the classification process, and adapts to new streams and datasets thanks to its trainable parameters. Experiments are conducted on the UCF101 and HMDB51 datasets, which are two challenging benchmarks in the context of action recognition.
    Degree: Doctorate in Computer Science (Doutora em Ciência da Computação). Funding and grant identifiers as listed: 001, 2017/09160-1, CAPES, FAPES.
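    Two ideas from this abstract, combining per-stream class scores and stopping training early to limit overfitting, can be sketched generically. The code below is not the thesis's adaptive fusion or its trainable early-stopping method, neither of which is specified in the abstract. It assumes fixed fusion weights and a standard patience rule on validation loss, and every name and number in it is hypothetical.

```python
from typing import List, Sequence


def fuse_scores(stream_scores: Sequence[Sequence[float]], weights: Sequence[float]) -> List[float]:
    """Weighted average of per-class scores produced by several streams."""
    if len(stream_scores) != len(weights):
        raise ValueError("need exactly one weight per stream")
    total = sum(weights)
    n_classes = len(stream_scores[0])
    return [
        sum(w * s[c] for w, s in zip(weights, stream_scores)) / total
        for c in range(n_classes)
    ]


def early_stopping_epoch(val_losses: Sequence[float], patience: int = 2) -> int:
    """Return the 0-based epoch at which a patience rule would stop training.

    Training stops once the validation loss has failed to improve on its best
    value for `patience` consecutive epochs. If that never happens, the index
    of the last epoch is returned.
    """
    best = float("inf")
    epochs_without_improvement = 0
    for epoch, loss in enumerate(val_losses):
        if loss < best:
            best = loss
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                return epoch
    return len(val_losses) - 1


# Hypothetical two-class scores from an RGB, an optical-flow, and a visual-rhythm stream.
rgb, flow, rhythm = [0.6, 0.4], [0.2, 0.8], [0.5, 0.5]
fused = fuse_scores([rgb, flow, rhythm], weights=[1.0, 2.0, 1.0])
predicted = max(range(len(fused)), key=fused.__getitem__)

stop_at = early_stopping_epoch([1.0, 0.8, 0.7, 0.75, 0.72, 0.6], patience=2)
print(fused, predicted, stop_at)
```

    The connection to the abstract's argument is that fusion weights fitted on training-set outputs are only meaningful if the individual streams have not already memorised the training set, which is why stopping each stream early matters before fitting the fusion.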