1,212 research outputs found

    Attention mechanism in deep neural networks for computer vision tasks

    Get PDF
    “Attention mechanism, which is one of the most important algorithms in the deep Learning community, was initially designed in the natural language processing for enhancing the feature representation of key sentence fragments over the context. In recent years, the attention mechanism has been widely adopted in solving computer vision tasks by guiding deep neural networks (DNNs) to focus on specific image features for better understanding the semantic information of the image. However, the attention mechanism is not only capable of helping DNNs understand semantics, but also useful for the feature fusion, visual cue discovering, and temporal information selection, which are seldom researched. In this study, we take the classic attention mechanism a step further by proposing the Semantic Attention Guidance Unit (SAGU) for multi-level feature fusion to tackle the challenging Biomedical Image Segmentation task. Furthermore, we propose a novel framework that consists of (1) Semantic Attention Unit (SAU), which is an advanced version of SAGU for adaptively bringing high-level semantics to mid-level features, (2) Two-level Spatial Attention Module (TSPAM) for discovering multiple visual cues within the image, and (3) Temporal Attention Module (TAM) for temporal information selection to solve the Videobased Person Re-identification task. To validate our newly proposed attention mechanisms, extensive experiments are conducted on challenging datasets. Our methods obtain competitive performance and outperform state-of-the-art methods. Selective publications are also presented in the Appendix”--Abstract, page iii

    Action recognition in visual sensor networks: a data fusion perspective

    Get PDF
    Visual Sensor Networks have emerged as a new technology to bring computer vision algorithms to the real world. However, they impose restrictions in the computational resources and bandwidth available to solve target problems. This thesis is concerned with the definition of new efficient algorithms to perform Human Action Recognition with Visual Sensor Networks. Human Action Recognition systems apply sequence modelling methods to integrate the temporal sensor measurements available. Among sequence modelling methods, the Hidden Conditional Random Field has shown a great performance in sequence classification tasks, outperforming many other methods. However, a parameter estimation procedure has not been proposed with feature and model selection properties. This thesis fills this lack proposing a new objective function to optimize during training. The L2 regularizer employed in the standard objective function is replaced by an overlapping group-L1 regularizer that produces feature and model selection effects in the optima. A gradient-based search strategy is proposed to find the optimal parameters of the objective function. Experimental evidence shows that Hidden Conditional Random Fields with their parameters estimated employing the proposed method have a higher predictive accuracy than those estimated with the standard method, with an smaller inference cost. This thesis also deals with the problem of human action recognition from multiple cameras, with the focus on reducing the amount of network bandwidth required. A multiple view dimensionality reduction framework is developed to obtain similar low dimensional representation for the motion descriptors extracted from multiple cameras. An alternative is proposed predicting the action class locally at each camera with the motion descriptors extracted from each view and integrating the different action decisions to make a global decision on the action performed. The reported experiments show that the proposed framework has a predictive performance similar to 3D state of the art methods, but with a lower computational complexity and lower bandwidth requirements. ----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------Las Redes de Sensores Visuales son una nueva tecnología que permite el despliegue de algoritmos de visión por computador en el mundo real. Sin embargo, estas imponen restricciones en los recursos de computo y de ancho de banda disponibles para la resolución del problema en cuestión. Esta tesis tiene por objeto la definición de nuevos algoritmos con los que realizar reconocimiento de actividades humanas en redes de sensores visuales, teniendo en cuenta las restricciones planteadas. Los sistemas de reconocimiento de acciones aplican métodos de modelado de secuencias para la integración de las medidas temporales proporcionadas por los sensores. Entre los modelos para el modelado de secuencias, el Hidden Conditional Random Field a mostrado un gran rendimiento en la clasificación de secuencias, superando a otros métodos existentes. Sin embargo, no se ha definido un procedimiento para la integración de sus parámetros que incluya selección de atributos y selección de modelo. Esta tesis tiene por objeto cubrir esta carencia proponiendo una nueva función objetivo para optimizar durante la estimación de los parámetros obtimos. El regularizador L2 empleado en la función objetivo estandar se va a remplazar for un regularizador grupo-L1 solapado que va a producir los efectos de selección de modelo y atributos deseados en el óptimo. Se va a proponer una estrategia de búsqueda con la que obtener el valor óptimo de estos parámetros. Los experimentos realizados muestran que los modelos estimados utilizando la función objetivo prouesta tienen un mayor poder de predicción, reduciendo al mismo tiempo el coste computacional de la inferencia. Esta tesis también trata el problema del reconocimiento de acciones humanas emepleando multiples cámaras, centrándonos en reducir la cantidad de ancho de banda requerido par el proceso. Para ello se propone un nueva estructura en la que definir algoritmos de reducción de dimensionalidad para datos definidos en multiples vistas. Mediante su aplicación se obtienen representaciones de baja dimensionalidad similares para los descriptores de movimiento calculados en cada una de las cámaras.También se propone un método alternativo basado en la predicción de la acción realizada con los descriptores obtenidos en cada una de las cámaras, para luego combinar las diferentes predicciones en una global. La experimentación realizada muestra que estos métodos tienen una eficacia similar a la alcanzada por los métodos existentes basados en reconstrucción 3D, pero con una menor complejidad computacional y un menor uso de la red

    Proceedings of the 2011 Joint Workshop of Fraunhofer IOSB and Institute for Anthropomatics, Vision and Fusion Laboratory

    Get PDF
    This book is a collection of 15 reviewed technical reports summarizing the presentations at the 2011 Joint Workshop of Fraunhofer IOSB and Institute for Anthropomatics, Vision and Fusion Laboratory. The covered topics include image processing, optical signal processing, visual inspection, pattern recognition and classification, human-machine interaction, world and situation modeling, autonomous system localization and mapping, information fusion, and trust propagation in sensor networks

    Vision-based Person Re-identification in a Queue

    Get PDF

    Towards Developing Computer Vision Algorithms and Architectures for Real-world Applications

    Get PDF
    abstract: Computer vision technology automatically extracts high level, meaningful information from visual data such as images or videos, and the object recognition and detection algorithms are essential in most computer vision applications. In this dissertation, we focus on developing algorithms used for real life computer vision applications, presenting innovative algorithms for object segmentation and feature extraction for objects and actions recognition in video data, and sparse feature selection algorithms for medical image analysis, as well as automated feature extraction using convolutional neural network for blood cancer grading. To detect and classify objects in video, the objects have to be separated from the background, and then the discriminant features are extracted from the region of interest before feeding to a classifier. Effective object segmentation and feature extraction are often application specific, and posing major challenges for object detection and classification tasks. In this dissertation, we address effective object flow based ROI generation algorithm for segmenting moving objects in video data, which can be applied in surveillance and self driving vehicle areas. Optical flow can also be used as features in human action recognition algorithm, and we present using optical flow feature in pre-trained convolutional neural network to improve performance of human action recognition algorithms. Both algorithms outperform the state-of-the-arts at their time. Medical images and videos pose unique challenges for image understanding mainly due to the fact that the tissues and cells are often irregularly shaped, colored, and textured, and hand selecting most discriminant features is often difficult, thus an automated feature selection method is desired. Sparse learning is a technique to extract the most discriminant and representative features from raw visual data. However, sparse learning with \textit{L1} regularization only takes the sparsity in feature dimension into consideration; we improve the algorithm so it selects the type of features as well; less important or noisy feature types are entirely removed from the feature set. We demonstrate this algorithm to analyze the endoscopy images to detect unhealthy abnormalities in esophagus and stomach, such as ulcer and cancer. Besides sparsity constraint, other application specific constraints and prior knowledge may also need to be incorporated in the loss function in sparse learning to obtain the desired results. We demonstrate how to incorporate similar-inhibition constraint, gaze and attention prior in sparse dictionary selection for gastroscopic video summarization that enable intelligent key frame extraction from gastroscopic video data. With recent advancement in multi-layer neural networks, the automatic end-to-end feature learning becomes feasible. Convolutional neural network mimics the mammal visual cortex and can extract most discriminant features automatically from training samples. We present using convolutinal neural network with hierarchical classifier to grade the severity of Follicular Lymphoma, a type of blood cancer, and it reaches 91\% accuracy, on par with analysis by expert pathologists. Developing real world computer vision applications is more than just developing core vision algorithms to extract and understand information from visual data; it is also subject to many practical requirements and constraints, such as hardware and computing infrastructure, cost, robustness to lighting changes and deformation, ease of use and deployment, etc.The general processing pipeline and system architecture for the computer vision based applications share many similar design principles and architecture. We developed common processing components and a generic framework for computer vision application, and a versatile scale adaptive template matching algorithm for object detection. We demonstrate the design principle and best practices by developing and deploying a complete computer vision application in real life, building a multi-channel water level monitoring system, where the techniques and design methodology can be generalized to other real life applications. The general software engineering principles, such as modularity, abstraction, robust to requirement change, generality, etc., are all demonstrated in this research.Dissertation/ThesisDoctoral Dissertation Computer Science 201
    corecore