    Hierarchical representations for spatio-temporal visual attention: modeling and understanding

    International Mention in the doctoral degree. Within the field of Artificial Intelligence, Computer Vision is a scientific discipline that aims to automatically simulate the functions of the human visual system, addressing tasks such as object localization and recognition, event detection, and object tracking.... Official Doctoral Program in Multimedia and Communications. Chair: Luis Salgado Álvarez de Sotomayor. Secretary: Ascensión Gallardo Antolín. Committee member: Jenny Benois Pinea

    Symbolic and Deep Learning Based Data Representation Methods for Activity Recognition and Image Understanding at Pixel Level

    Efficient representation of large amounts of data, particularly images and video, helps in the analysis, processing, and overall understanding of the data. In this work, we present two frameworks that encapsulate the information present in such data. First, we present an automated symbolic framework to recognize particular activities in real time from videos. The framework uses regular expressions to symbolically represent (possibly infinite) sets of motion characteristics obtained from a video. It is a uniform framework that handles trajectory-based and periodic articulated activities and provides polynomial-time graph algorithms for fast recognition. The regular expressions representing motion characteristics can either be provided manually or learnt automatically from positive and negative examples of strings (that describe dynamic behavior) using offline automata learning frameworks. Confidence measures are associated with recognitions using the Levenshtein distance between a string representing a motion signature and the regular expression describing an activity. We have used our framework to recognize trajectory-based activities like vehicle turns (U-turns, left and right turns, and K-turns), vehicle start and stop, person running and walking, and periodic articulated activities like digging, waving, boxing, and clapping in videos from the VIRAT public dataset, the KTH dataset, and a set of videos obtained from YouTube. Next, we present a core sampling framework that uses activation maps from several layers of a Convolutional Neural Network (CNN) as features for another neural network, using transfer learning to provide an understanding of an input image. The intermediate map responses of a CNN contain information about an image that can be used to extract contextual knowledge about it.
Our framework creates a representation that combines features from the test data and the contextual knowledge gained from the responses of a pretrained network, processes it and feeds it to a separate Deep Belief Network. We use this representation to extract more information from an image at the pixel level, hence gaining understanding of the whole image. We experimentally demonstrate the usefulness of our framework using a pretrained VGG-16 model to perform segmentation on the BAERI dataset of Synthetic Aperture Radar (SAR) imagery and the CAMVID dataset. Using this framework, we also reconstruct images by removing noise from noisy character images. The reconstructed images are encoded using Quadtrees. Quadtrees can be an efficient representation in learning from sparse features. When we are dealing with handwritten character images, they are quite susceptible to noise. Hence, preprocessing stages to make the raw data cleaner can improve the efficacy of their use. We improve upon the efficiency of probabilistic quadtrees by using a pixel level classifier to extract the character pixels and remove noise from the images. The pixel level denoiser uses a pretrained CNN trained on a large image dataset and uses transfer learning to aid the reconstruction of characters. In this work, we primarily deal with classification of noisy characters and create the noisy versions of handwritten Bangla Numeral and Basic Character datasets and use them and the Noisy MNIST dataset to demonstrate the usefulness of our approach
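
The Levenshtein-based confidence measure mentioned in the abstract above can be illustrated with a minimal sketch. It makes a simplifying assumption: instead of measuring the distance from a motion string to a full regular expression, as the abstract describes, it measures the distance to a small set of example strings. The function names, the normalization to [0, 1], and the motion alphabet (for instance "L" for a left-turn step) are hypothetical and are not taken from the thesis.

```python
def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance (insert, delete, substitute)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def confidence(signature: str, templates: list[str]) -> float:
    """Hypothetical score in [0, 1]; 1.0 means an exact match to some template.

    Assumption: a finite list of template strings stands in for the
    regular expression that describes an activity.
    """
    best = min(
        levenshtein(signature, t) / max(len(signature), len(t), 1)
        for t in templates
    )
    return 1.0 - best
```

A motion signature that differs from its closest template in one symbol out of four scores 0.75 under this normalization. An empty template list raises a ValueError, which a real system would need to handle.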

    Scalable visualization of spatial data in 3D terrain

    Designing visualizations of spatial data in 3D terrain is challenging because various heterogeneous data aspects need to be considered, including the terrain itself, multiple data attributes, and data uncertainty. It is hardly possible to visualize these data at full detail in a single image. Therefore, this thesis devises a scalable visualization approach that emphasizes relevant information while attenuating less relevant information. In this context, a novel concept for visualizing spatial data in 3D terrain and different software and hardware solutions are proposed.

    Compressing Deep Neural Networks via Knowledge Distillation

    There has been a continuous evolution in deep neural network architectures since Alex Krizhevsky proposed AlexNet in 2012. Part of this has been due to the increased complexity of the data and easier availability of datasets, and part of it has been due to the increased complexity of applications. These two factors form a self-sustaining cycle and have thereby pushed the boundaries of deep learning to new domains in recent years. Many datasets have been proposed for different tasks. In computer vision, notable datasets like ImageNet, CIFAR-10, CIFAR-100, and MS-COCO provide large amounts of training data for tasks like classification, segmentation, and object localization. Interdisciplinary datasets like the Visual Genome Dataset connect computer vision to tasks like natural language processing. All of these have fuelled the advent of architectures like AlexNet, VGG-Net, and ResNet that achieve better predictive performance on these datasets. In object detection, networks like YOLO, SSD, and Faster-RCNN have made great strides in achieving state-of-the-art performance. However, amidst the growth of neural networks, one aspect that has been neglected is the problem of deploying them on devices which can support the computational and memory requirements of Deep Neural Networks (DNNs). Modern technology is only as good as the number of platforms it can support. Many applications like face detection, person classification, and pedestrian detection require real-time execution, with devices mounted on cameras. These devices are low powered and do not have the computational resources to run the data through a DNN and get instantaneous results. A natural solution to this problem is to make the DNN smaller through compression. However, unlike file compression, DNN compression has the goal of not significantly impacting the overall accuracy of the network.
In this thesis we consider the problem of model compression and present our end-to-end algorithm for training a smaller model under the influence of a collection of expert models. The smaller model can then be deployed on resource-constrained hardware independently of the expert models. We call this approach a form of compression because deploying a smaller model saves the memory that would have been consumed by one or more expert models. We additionally introduce memory-efficient architectures with a very small memory footprint, built on key ideas from the literature, and show the results of training them using our approach
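
As an illustration of training a small model "under the influence of a collection of expert models", the sketch below computes a soft-target distillation loss for one example. It is not the thesis's algorithm, which the abstract does not specify: it assumes the widely used formulation in which teacher probabilities are softened with a temperature, averaged across experts, and combined with the ordinary cross-entropy on the true label. The names `temperature` and `alpha`, their default values, and the equal-weight averaging are assumptions.

```python
import math


def softmax(logits, temperature=1.0):
    """Numerically stable softmax with an optional softening temperature."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def ensemble_soft_targets(teacher_logits, temperature):
    """Average the softened class probabilities of all teachers (assumed equal weights)."""
    probs = [softmax(logits, temperature) for logits in teacher_logits]
    n = len(probs)
    return [sum(p[k] for p in probs) / n for k in range(len(probs[0]))]


def distillation_loss(student_logits, teacher_logits, label,
                      temperature=4.0, alpha=0.5):
    """Weighted sum of a soft-target term and a hard-label term for one example."""
    soft_targets = ensemble_soft_targets(teacher_logits, temperature)
    student_soft = softmax(student_logits, temperature)
    # Cross-entropy between the ensemble's soft targets and the student's soft output.
    soft_loss = -sum(t * math.log(s) for t, s in zip(soft_targets, student_soft))
    # Ordinary cross-entropy against the ground-truth label (temperature 1).
    hard_loss = -math.log(softmax(student_logits)[label])
    # The T^2 factor keeps the soft term's gradient scale comparable across temperatures.
    return alpha * temperature ** 2 * soft_loss + (1 - alpha) * hard_loss
```

A student whose logits agree with the teachers receives a lower loss than one that ranks the classes in the opposite order. Only the student is needed at inference time, which is where the memory saving described in the abstract comes from.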

    A COMPUTATION METHOD/FRAMEWORK FOR HIGH LEVEL VIDEO CONTENT ANALYSIS AND SEGMENTATION USING AFFECTIVE LEVEL INFORMATION

    Video segmentation facilitates efficient video indexing and navigation in large digital video archives. It is an important process in a content-based video indexing and retrieval (CBVIR) system. Many automated solutions performed segmentation by utilizing information about the "facts" of the video. These "facts" come in the form of labels that describe the objects which are captured by the camera. This type of solution was able to achieve good and consistent results for some video genres such as news programs and informational presentations. The content format of this type of video is generally quite standard, and automated solutions were designed to follow these format rules. For example in [1], the presence of news anchor persons was used as a cue to determine the start and end of a meaningful news segment. The same cannot be said for video genres such as movies and feature films. This is because makers of this type of video utilized different filming techniques to design their videos in order to elicit certain affective responses from their targeted audience. Humans usually perform manual video segmentation by trying to relate changes in time and locale to discontinuities in meaning [2]. As a result, viewers usually have doubts about the boundary locations of a meaningful video segment due to their different affective responses. This thesis presents an entirely new view of the problem of high level video segmentation. We developed a novel probabilistic method for affective level video content analysis and segmentation. Our method had two stages. In the first stage, affective content labels were assigned to video shots by means of a dynamic Bayesian network (DBN). A novel hierarchical-coupled dynamic Bayesian network (HCDBN) topology was proposed for this stage. The topology was based on the pleasure-arousal-dominance (P-A-D) model of affect representation [3]. In principle, this model can represent a large number of emotions.
In the second stage, the visual, audio and affective information of the video was used to compute a statistical feature vector to represent the content of each shot. Affective level video segmentation was achieved by applying spectral clustering to the feature vectors. We evaluated the first stage of our proposal by comparing its emotion detection ability with all the existing works which are related to the field of affective video content analysis. To evaluate the second stage, we used the time adaptive clustering (TAC) algorithm as our performance benchmark. The TAC algorithm was the best high level video segmentation method [2]. However, it is a very computationally intensive algorithm. To accelerate its computation speed, we developed a modified TAC (modTAC) algorithm which was designed to be mapped easily onto a field programmable gate array (FPGA) device. Both the TAC and modTAC algorithms were used as performance benchmarks for our proposed method. Since affective video content is a perceptual concept, the segmentation performance and human agreement rates were used as our evaluation criteria. To obtain our ground truth data and viewer agreement rates, a pilot panel study which was based on the work of Gross et al. [4] was conducted. Experimental results show the feasibility of our proposed method. For the first stage of our proposal, our experimental results show that an average improvement of as high as 38% was achieved over previous works. As for the second stage, an improvement of as high as 37% was achieved over the TAC algorithm

    Principles and Guidelines for Advancement of Touchscreen-Based Non-visual Access to 2D Spatial Information

    Graphical materials such as graphs and maps are often inaccessible to millions of blind and visually-impaired (BVI) people, which negatively impacts their educational prospects, ability to travel, and vocational opportunities. To address this longstanding issue, a three-phase research program was conducted that builds on and extends previous work establishing touchscreen-based haptic cuing as a viable alternative for conveying digital graphics to BVI users. Although promising, this approach poses unique challenges that can only be addressed by schematizing the underlying graphical information based on perceptual and spatio-cognitive characteristics pertinent to touchscreen-based haptic access. Towards this end, this dissertation empirically identified a set of design parameters and guidelines through a logical progression of seven experiments. Phase I investigated perceptual characteristics related to touchscreen-based graphical access using vibrotactile stimuli, with results establishing three core perceptual guidelines: (1) a minimum line width of 1mm should be maintained for accurate line-detection (Exp-1), (2) a minimum interline gap of 4mm should be used for accurate discrimination of parallel vibrotactile lines (Exp-2), and (3) a minimum angular separation of 4mm should be used for accurate discrimination of oriented vibrotactile lines (Exp-3). 
Building on these parameters, Phase II studied the core spatio-cognitive characteristics pertinent to touchscreen-based non-visual learning of graphical information, with results leading to the specification of three design guidelines: (1) a minimum width of 4mm should be used for supporting tasks that require tracing of vibrotactile lines and judging their orientation (Exp-4), (2) a minimum width of 4mm should be maintained for accurate line tracing and learning of complex spatial path patterns (Exp-5), and (3) vibrotactile feedback should be used as a guiding cue to support the most accurate line tracing performance (Exp-6). Finally, Phase III demonstrated that schematizing line-based maps based on these design guidelines leads to development of an accurate cognitive map. Results from Experiment-7 provide theoretical evidence in support of learning from vision and touch as leading to the development of functionally equivalent amodal spatial representations in memory. Findings from all seven experiments contribute to new theories of haptic information processing that can guide the development of new touchscreen-based non-visual graphical access solutions

    Towards Developing Computer Vision Algorithms and Architectures for Real-world Applications

    Computer vision technology automatically extracts high-level, meaningful information from visual data such as images or videos, and object recognition and detection algorithms are essential in most computer vision applications. In this dissertation, we focus on developing algorithms for real-life computer vision applications, presenting innovative algorithms for object segmentation and feature extraction for object and action recognition in video data, sparse feature selection algorithms for medical image analysis, and automated feature extraction using a convolutional neural network for blood cancer grading. To detect and classify objects in video, the objects have to be separated from the background, and the discriminant features are then extracted from the region of interest before being fed to a classifier. Effective object segmentation and feature extraction are often application specific and pose major challenges for object detection and classification tasks. In this dissertation, we present an effective optical flow based ROI generation algorithm for segmenting moving objects in video data, which can be applied in surveillance and self-driving vehicle areas. Optical flow can also be used as a feature in human action recognition algorithms, and we present the use of optical flow features in a pre-trained convolutional neural network to improve the performance of human action recognition algorithms. Both algorithms outperformed the state of the art at the time. Medical images and videos pose unique challenges for image understanding, mainly because tissues and cells are often irregularly shaped, colored, and textured, and hand-selecting the most discriminant features is often difficult; thus an automated feature selection method is desired. Sparse learning is a technique to extract the most discriminant and representative features from raw visual data.
However, sparse learning with L1 regularization only takes sparsity in the feature dimension into consideration; we improve the algorithm so that it selects the type of features as well, so that less important or noisy feature types are removed from the feature set entirely. We apply this algorithm to endoscopy images to detect abnormalities in the esophagus and stomach, such as ulcers and cancer. Besides the sparsity constraint, other application-specific constraints and prior knowledge may also need to be incorporated in the loss function in sparse learning to obtain the desired results. We demonstrate how to incorporate a similar-inhibition constraint and gaze and attention priors in sparse dictionary selection for gastroscopic video summarization, enabling intelligent key frame extraction from gastroscopic video data. With recent advancements in multi-layer neural networks, automatic end-to-end feature learning has become feasible. A convolutional neural network mimics the mammalian visual cortex and can extract the most discriminant features automatically from training samples. We present the use of a convolutional neural network with a hierarchical classifier to grade the severity of Follicular Lymphoma, a type of blood cancer; it reaches 91% accuracy, on par with analysis by expert pathologists. Developing real-world computer vision applications is more than just developing core vision algorithms to extract and understand information from visual data; it is also subject to many practical requirements and constraints, such as hardware and computing infrastructure, cost, robustness to lighting changes and deformation, ease of use and deployment, etc. The general processing pipelines and system architectures of computer vision based applications share many design principles.
We developed common processing components and a generic framework for computer vision applications, and a versatile scale-adaptive template matching algorithm for object detection. We demonstrate the design principles and best practices by developing and deploying a complete real-life computer vision application, a multi-channel water level monitoring system, whose techniques and design methodology can be generalized to other real-life applications. General software engineering principles, such as modularity, abstraction, robustness to requirement changes, and generality, are all demonstrated in this research. Dissertation/Thesis. Doctoral Dissertation, Computer Science, 201

    Biologically motivated keypoint detection for RGB-D data

    With the emerging interest in active vision, computer vision researchers have been increasingly concerned with the mechanisms of attention. Therefore, several computational models of visual attention inspired by the human visual system have been developed, aiming at the detection of regions of interest in images. This thesis is focused on selective visual attention, which provides a mechanism for the brain to focus computational resources on one object at a time, guided by low-level image properties (bottom-up attention). The task of recognizing objects in different locations is achieved by focusing on different locations, one at a time. Given the computational requirements of the models proposed, the research in this area has been mainly of theoretical interest. More recently, psychologists, neurobiologists and engineers have developed collaborations, and this has resulted in considerable benefits. The first objective of this doctoral work is to bring together concepts and ideas from these different research areas, providing a study of the biological research on the human visual system and a discussion of the interdisciplinary knowledge in this area, as well as the state of the art in computational models of (bottom-up) visual attention. Engineers normally refer to visual attention as saliency: when people fix their gaze on a particular region of the image, it is because that region is salient. In this research work, saliency methods are presented according to their classification (biologically plausible, computational or hybrid) and in chronological order. A few salient structures can be used for applications like object registration, retrieval or data simplification, and these few salient structures can be treated as keypoints when the aim is to perform object recognition.
Generally, object recognition algorithms use a large number of descriptors extracted from a dense set of points, which comes with a very high computational cost and prevents real-time processing. To avoid this computational complexity, the features have to be extracted from a small set of points, usually called keypoints. The use of keypoint-based detectors reduces both the processing time and the redundancy in the data. Local descriptors extracted from images have been extensively reported in the computer vision literature. Since there is a large set of keypoint detectors, a comparative evaluation between them is needed. We therefore provide a description of 2D and 3D keypoint detectors and 3D descriptors, and an evaluation of the existing 3D keypoint detectors in a publicly available point cloud library with real 3D objects. The invariance of the 3D keypoint detectors was evaluated with respect to rotations, scale changes and translations. This evaluation reports the robustness of a particular detector to changes of point of view, and the criteria used are the absolute and the relative repeatability rate. In our experiments, the method that achieved the best repeatability rate was the ISS3D method. The analysis of the human visual system and of biologically inspired saliency map detectors led to the idea of extending a keypoint detector based on the color information in the retina. This proposal produced a 2D keypoint detector inspired by the behavior of the early visual system. Our method is a color extension of the BIMP keypoint detector, in which we include both the color and intensity channels of an image: color information is included in a biologically plausible way and multi-scale image features are combined into a single keypoint map. This detector is compared against state-of-the-art detectors and found to be particularly well suited for tasks such as category and object recognition.
The recognition process is performed by comparing the 3D descriptors extracted at the locations indicated by the keypoints, after mapping the 2D keypoint locations to 3D space. This was possible because the dataset used contains the location of each point in both 2D and 3D space. The evaluation allowed us to obtain the best keypoint detector/descriptor pair on an RGB-D object dataset. Using our keypoint detector and the SHOTCOLOR descriptor, a good category recognition rate was obtained, while for object recognition the best results were obtained with the PFHRGB descriptor. A 3D recognition system involves the choice of a keypoint detector and a descriptor. A new method for the detection of 3D keypoints on point clouds is presented, and each pair of 3D keypoint detector and 3D descriptor is benchmarked to evaluate its performance on object and category recognition. These evaluations are done on a public database of real 3D objects. Our keypoint detector is inspired by the behavior and neural architecture of the primate visual system: the 3D keypoints are extracted based on a bottom-up 3D saliency map, that is, a map that encodes the saliency of objects in the visual environment. The saliency map is determined by computing conspicuity maps (a combination across different modalities) of the orientation, intensity and color information, in a bottom-up and purely stimulus-driven manner. These three conspicuity maps are fused into a 3D saliency map and, finally, the focus of attention (or "keypoint location") is sequentially directed to the most salient points in this map. Inhibiting this location automatically allows the system to attend to the next most salient location. The main conclusions are: with a similar average number of keypoints, our 3D keypoint detector outperforms the other eight 3D keypoint detectors evaluated, achieving the best result in 32 of the evaluated metrics in the category and object recognition experiments, whereas the second best detector obtained the best result in only 8 of these metrics.
The only drawback is the computational time, since BIK-BUS is slower than the other detectors. Given that the differences in recognition performance, size and time requirements are large, the selection of the keypoint detector and descriptor has to be matched to the desired task, and we give some directions to facilitate this choice. After proposing the 3D keypoint detector, the research focused on a robust detection and tracking method for 3D objects that uses keypoint information in a particle filter. This method consists of three distinct steps: Segmentation, Tracking Initialization and Tracking. The segmentation removes all the background information, reducing the number of points for further processing. In the initialization, we use a keypoint detector with biological inspiration. The information about the object that we want to follow is given by the extracted keypoints. The particle filter tracks the keypoints, so that we can predict where the keypoints will be in the next frame. One problem in a recognition system is the computational cost of keypoint detectors, and this method is intended to address it. The experiments with the PFBIK-Tracking method are done indoors in an office/home environment, where personal robots are expected to operate. We also evaluate this method quantitatively using a "Tracking Error", which measures the stability of the general tracking method. The evaluation is based on the computation of the keypoint and particle centroids. Compared with the tracking method that exists in the Point Cloud Library, our system achieves better results, with a much smaller number of points and less computational time. Our method is faster and more robust to occlusion when compared to the OpenniTracker.
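
The fuse-then-attend procedure described in the abstract above (three conspicuity maps fused into a saliency map, then attention directed to the most salient point while the attended location is inhibited) can be sketched as follows. This is a simplified illustration under stated assumptions and not the BIK-BUS implementation: it works on a small 2D grid rather than a 3D point cloud, fuses the maps with an equal-weight average, and inhibits a square neighbourhood. The names `fuse`, `select_keypoints` and `inhibition_radius` are hypothetical.

```python
def fuse(color, intensity, orientation):
    """Fuse three equally sized conspicuity maps by averaging (assumed weighting)."""
    return [
        [(c + i + o) / 3.0 for c, i, o in zip(c_row, i_row, o_row)]
        for c_row, i_row, o_row in zip(color, intensity, orientation)
    ]


def select_keypoints(saliency, num_keypoints, inhibition_radius):
    """Greedy winner-take-all selection with inhibition of return on a 2D grid."""
    grid = [row[:] for row in saliency]  # copy so the caller's map is not modified
    rows, cols = len(grid), len(grid[0])
    keypoints = []
    for _ in range(num_keypoints):
        # Winner-take-all: pick the most salient remaining location.
        value, r, c = max(
            (grid[i][j], i, j) for i in range(rows) for j in range(cols)
        )
        if value == float("-inf"):
            break  # every location has already been inhibited
        keypoints.append((r, c))
        # Inhibition of return: suppress the winner's neighbourhood so the
        # next iteration attends to the next most salient location.
        for i in range(max(0, r - inhibition_radius),
                       min(rows, r + inhibition_radius + 1)):
            for j in range(max(0, c - inhibition_radius),
                           min(cols, c + inhibition_radius + 1)):
                grid[i][j] = float("-inf")
    return keypoints
```

The inhibition radius trades keypoint density against coverage: a larger radius spreads the keypoints across the scene but returns fewer of them.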