14 research outputs found
Temporal Model Adaptation for Person Re-Identification
Person re-identification is an open and challenging problem in computer
vision. The majority of efforts have been spent either on designing the best
feature representation or on learning the optimal matching metric, and most
approaches have neglected the problem of adapting the selected features or the
learned model over time. To address this problem, we propose a temporal model
adaptation scheme with a human in the loop. We first introduce a
similarity-dissimilarity learning method that can be trained incrementally
by means of a stochastic alternating direction method of multipliers (ADMM)
optimization procedure. Then, to achieve temporal adaptation with limited human
effort, we exploit a graph-based approach to present the user with only the most
informative probe-gallery matches that should be used to update the model.
Results on three datasets show that our approach performs on par with or even
better than state-of-the-art approaches while reducing the manual pairwise
labeling effort by about 80%.
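The human-in-the-loop idea can be illustrated with a minimal sketch. This is not the paper's graph-based criterion; it simply assumes that pairs whose match scores sit closest to the decision threshold are the most informative ones to show a human annotator before updating the model.

```python
# Hypothetical sketch: pick the probe-gallery pairs whose match scores are
# least certain (closest to the decision threshold), so the human labels
# only the most informative pairs before the model is updated.
def select_informative_pairs(scores, threshold=0.5, k=3):
    """scores: dict mapping (probe_id, gallery_id) -> similarity in [0, 1]."""
    # Uncertainty is highest when a score sits near the threshold.
    ranked = sorted(scores, key=lambda pair: abs(scores[pair] - threshold))
    return ranked[:k]

scores = {("p1", "g1"): 0.95, ("p1", "g2"): 0.52, ("p2", "g1"): 0.48,
          ("p2", "g3"): 0.10, ("p3", "g2"): 0.61}
print(select_informative_pairs(scores, k=2))  # the two most ambiguous pairs
```

Confident matches (0.95) and confident non-matches (0.10) are skipped, which is how the labeling effort can shrink while the model still receives useful updates.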
Agent-based framework for person re-identification
In computer-based human-object re-identification, a detected human is recognised to a
level sufficient to re-identify a tracked person either in a different camera capturing the
same individual, often at a different angle, or in the same camera at a different time and/or
with the person approaching the camera at a different angle. Instead of relying on face
recognition technology, such systems study the clothing of the individuals being monitored
and/or the objects being carried to establish correspondence and hence re-identify the human
object.
Unfortunately, present human-object re-identification systems consider the entire human
object as one connected region when making decisions about the similarity of the two objects
being matched. This assumption has a major drawback: when a person is partially
occluded, a part of the occluding foreground will be picked up and used in matching. Our
research revealed that when a human observer carries out a manual human-object
re-identification task, their attention is often drawn to some parts of the human
figure/body more than others, e.g. the face, a brightly coloured shirt, the presence of
texture patterns in clothing, etc., while occluding parts are ignored.
In this thesis, a novel multi-agent-based framework is proposed for the design of a
human-object re-identification system. Initially, HOG-based feature extraction is used in an
SVM-based classification of a human object as being of a full-body or half-body nature.
Subsequently, the relative visual significance of the top and bottom parts of the human
in re-identification is quantified by the analysis of Gray Level Co-occurrence Matrix
(GLCM) based texture features and colour histograms obtained in the HSV colour space.
Accordingly, different weights are assigned to the top and bottom of the human body using
a novel probabilistic approach. The weights are then used to modify the adopted Hybrid
Spatiogram and Covariance Descriptor (HSCD) feature-based re-identification algorithm.
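The part-weighting step can be sketched as follows. This is a minimal illustration, not the thesis's HSCD descriptor: it assumes simple per-half colour histograms and fixed example weights standing in for the probabilistic significance analysis.

```python
import numpy as np

# Illustrative sketch (not the HSCD descriptor itself): weight the top and
# bottom halves of a person image differently when comparing colour
# histograms, so an occluded half contributes less to the match score.
def half_histograms(image, bins=8):
    """Split an HxWx3 image into top/bottom halves and return a colour
    histogram for each half, normalised to sum to 1."""
    h = image.shape[0] // 2
    hists = []
    for half in (image[:h], image[h:]):
        hist, _ = np.histogram(half, bins=bins, range=(0, 256))
        hists.append(hist / hist.sum())
    return hists

def weighted_distance(img_a, img_b, w_top=0.7, w_bottom=0.3):
    """Weighted L1 distance between per-half histograms; the weights would
    come from the probabilistic significance analysis described above."""
    (ta, ba), (tb, bb) = half_histograms(img_a), half_histograms(img_b)
    return w_top * np.abs(ta - tb).sum() + w_bottom * np.abs(ba - bb).sum()

rng = np.random.default_rng(0)
img = rng.integers(0, 256, (128, 64, 3))
print(weighted_distance(img, img))  # identical images -> distance 0.0
```

Lowering `w_bottom` toward zero is the mechanism by which a partially occluded lower half stops polluting the match, which is the behaviour the thesis quantifies probabilistically.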
A significant novelty of the human-object re-identification systems proposed in this thesis
is the agent-based design procedure adopted, which separates the use of computer vision
algorithms for feature extraction, comparison, etc., from the decision-making process of
re-identification. Multiple agents are assigned to execute different algorithmic tasks, and
the agents communicate to make the required logical decisions.
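The separation of concerns described above can be sketched minimally: feature agents each run one vision algorithm and report a score, while a decision agent combines the reports. All names and numbers here are illustrative, not the thesis's actual agent protocol.

```python
# Hypothetical sketch of the agent-based separation: "vision" agents compute
# similarity scores and a decision agent combines their reports.
class FeatureAgent:
    def __init__(self, name, extractor):
        self.name, self.extractor = name, extractor

    def report(self, image_a, image_b):
        # Each agent runs one vision algorithm and reports a similarity score.
        return self.name, self.extractor(image_a, image_b)

class DecisionAgent:
    def __init__(self, weights):
        self.weights = weights  # per-agent trust, e.g. from the weighting scheme

    def decide(self, reports, threshold=0.5):
        score = sum(self.weights[name] * s for name, s in reports)
        return score >= threshold

# Stub extractors stand in for real colour/texture algorithms.
colour_agent = FeatureAgent("colour", lambda a, b: 0.9 if a == b else 0.2)
texture_agent = FeatureAgent("texture", lambda a, b: 0.8 if a == b else 0.4)
judge = DecisionAgent({"colour": 0.6, "texture": 0.4})

reports = [agent.report("person1", "person1")
           for agent in (colour_agent, texture_agent)]
print(judge.decide(reports))  # True: the agents agree the images match
```

The point of the design is that swapping a vision algorithm only changes one `FeatureAgent`, leaving the decision logic untouched.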
Detailed experimental results are provided to demonstrate that the proposed multi-agent-based
framework for human-object re-identification performs significantly better than
state-of-the-art algorithms. Further, it is shown that the design flexibility and scalability
of the proposed system allow it to be effectively utilised in more complex computer-vision-based
video analytics/forensics tasks often conducted within distributed, multi-camera systems.
Convolutional neural networks based on visual rhythms and adaptive fusion for a multi-stream architecture applied to human action recognition
Advisors: Hélio Pedrini, Marcelo Bernardes Vieira. Doctoral thesis, Universidade Estadual de Campinas, Instituto de Computação.
Abstract: The large amount of video data produced and released every day makes visual inspection by a human operator impracticable. However, the content of these videos can be useful for various important tasks, such as surveillance and health monitoring. Therefore, automatic methods are needed to detect and understand relevant events in videos. The problem addressed in this work is the recognition of human actions in videos, which aims to classify the action being performed by one or more actors. The complexity of the problem and the volume of video data suggest the use of deep-learning-based techniques; however, unlike image-related problems, there is neither a great variety of well-established task-specific architectures nor annotated datasets as large as image-based ones. To circumvent these limitations, we propose and analyze a multi-stream architecture containing image-based networks pre-trained on the large ImageNet dataset. Different image representations are extracted from the videos to feed the streams, in order to provide complementary information to the system. Here, we propose new streams based on visual rhythm that encode longer-term information when compared to still frames and optical flow.
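One common visual-rhythm construction can be sketched in a few lines. The slicing choice below (the central row of every frame, stacked over time) is an assumption for illustration; the thesis explores its own rhythm variants.

```python
import numpy as np

# Minimal sketch of a visual-rhythm image: take one fixed line of every
# frame and stack the lines over time, turning a whole video into a single
# 2D image that an image-based CNN stream can read.
def visual_rhythm(video):
    """video: array of shape (T, H, W); returns a (T, W) rhythm image whose
    rows are the central horizontal line of each successive frame."""
    center = video.shape[1] // 2
    return video[:, center, :]

video = np.arange(2 * 4 * 3).reshape(2, 4, 3)  # 2 frames of 4x3 pixels
print(visual_rhythm(video).shape)  # (2, 3): one row per frame
```

Because the vertical axis of the rhythm image is time, patterns spanning many frames become spatial patterns, which is why such a stream encodes longer-term information than a single still frame or a two-frame optical-flow field.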
As important as the definition of representative and complementary aspects is the choice of proper combination methods that exploit the strengths of each modality. Thus, here we also analyze different fusion approaches to combine the modalities. In order to define the best parameters of our fusion methods using the training set, we have to reduce overfitting in the individual modalities; otherwise, the 100%-accurate outputs would not offer a realistic and relevant representation for the fusion method. Thus, we investigate an early stopping technique to train the individual networks. In addition to reducing overfitting, this method also reduces the training cost, since it usually requires fewer epochs to complete the classification process, and it adapts to new streams and datasets thanks to its trainable parameters. Experiments are conducted on the UCF101 and HMDB51 datasets, which are two challenging benchmarks in the context of action recognition.
Doctorate in Computer Science. Funding: 001; 2017/09160-1; CAPES; FAPES.
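A weighted late-fusion scheme of the kind analyzed above can be sketched as follows. The per-stream weights are fixed here for illustration; in the work they would be learned on the training set.

```python
import numpy as np

# Hedged sketch of weighted late fusion: each stream outputs class
# probabilities, and per-stream weights combine them into one prediction.
def fuse(stream_scores, weights):
    """stream_scores: list of (num_classes,) probability vectors, one per
    stream; weights: per-stream fusion weights (normalised internally)."""
    weights = np.asarray(weights) / np.sum(weights)
    return sum(w * s for w, s in zip(weights, stream_scores))

rgb = np.array([0.7, 0.2, 0.1])      # e.g. still-frame stream
flow = np.array([0.4, 0.5, 0.1])     # e.g. optical-flow stream
rhythm = np.array([0.6, 0.3, 0.1])   # e.g. visual-rhythm stream

fused = fuse([rgb, flow, rhythm], weights=[0.4, 0.3, 0.3])
print(int(np.argmax(fused)))  # predicted class index
```

Note why early stopping matters for this step: if each stream is trained to near-100% training accuracy, its outputs on the training set are saturated one-hot vectors, and fitting the fusion weights on them tells you nothing about which stream to trust at test time.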