
    Contributions to region-based image and video analysis: feature aggregation, background subtraction and description constraining

    Unpublished doctoral thesis, defended at the Universidad Autónoma de Madrid, Escuela Politécnica Superior, Departamento de Tecnología Electrónica y de las Comunicaciones, on 22-01-2016. Full-text access is embargoed until 22-07-2017. The use of regions for image and video analysis has traditionally been motivated by their ability to reduce the number of processed units and hence the number of required decisions. However, as we explore in this thesis, this is just one of the potential advantages that regions provide. When dealing with regions, two description spaces can be differentiated: the decision space, on which regions are shaped (region segmentation), and the feature space, on which regions are used for analysis (region-based applications). These two spaces are highly related: the choices made in the decision space severely affect performance in the feature space. Accordingly, this thesis contributes to both spaces. The contributions to region segmentation are two-fold. Firstly, we give a twist to a classical region segmentation technique, Mean-Shift, by exploring new solutions for automatically setting the spectral kernel bandwidth. Secondly, we propose a method to describe the micro-texture of a pixel neighbourhood using an easily customisable filter-bank methodology based on the discrete cosine transform (DCT). The rest of the thesis is devoted to region-based approaches to several highly topical issues in computer vision; two broad tasks are explored: background subtraction (BS) and local description (LD). Concerning BS, regions are used as complementary cues to refine pixel-based BS algorithms: by providing illumination-robust cues and by storing the background dynamics in a region-driven background model. Concerning LD, the region is used to reshape the description area usually fixed for local descriptors.
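    The DCT-based micro-texture idea above can be sketched in a few lines. This is an illustrative reconstruction, not the thesis's actual filter bank: it describes an n x n neighbourhood by the magnitudes of its 2-D DCT coefficients, discarding the DC term so that only local variability remains. The function names and the 8 x 8 patch size are our own choices.

```python
import numpy as np

def dct_matrix(n):
    # Orthonormal DCT-II basis matrix (n x n).
    k = np.arange(n)[:, None]
    i = np.arange(n)[None, :]
    c = np.cos(np.pi * (2 * i + 1) * k / (2 * n))
    c[0, :] *= 1 / np.sqrt(2)
    return c * np.sqrt(2 / n)

def micro_texture(patch):
    """Describe an n x n pixel neighbourhood by its 2-D DCT coefficient
    magnitudes, discarding the DC term so the feature reacts to local
    variability rather than mean intensity."""
    n = patch.shape[0]
    c = dct_matrix(n)
    coeffs = c @ patch @ c.T            # separable 2-D DCT-II
    return np.abs(coeffs).ravel()[1:]   # drop the DC coefficient

flat = np.full((8, 8), 100.0)              # textureless neighbourhood
stripes = np.tile([0.0, 255.0], (8, 4))    # strongly textured neighbourhood
flat_feat = micro_texture(flat)
stripe_feat = micro_texture(stripes)
```

    In a filter-bank reading, each retained DCT basis function acts as one band-pass filter: the flat patch yields a near-zero feature vector, while the striped patch excites the high-frequency bands.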
Region-masked versions of classical two-dimensional and three-dimensional local descriptors are designed. The resulting descriptions are applied to the task of object identification under a novel neural-oriented strategy. Furthermore, a local description scheme based on fuzzy region membership is derived. This characterisation scheme has been geometrically adapted to account for projective deformations, providing a suitable tool for finding corresponding points in wide-baseline scenarios. Experiments have been conducted for every contribution, discussing the potential benefits and the limitations of the proposed schemes. Overall, the obtained results suggest that the region, conditioned on a successful aggregation process, is a reliable and useful tool to extrapolate pixel-level results, diminish semantic noise, isolate significant object cues and constrain local descriptions. The methods and approaches described throughout this thesis present alternative or complementary solutions to pixel-based image processing.
This work was partially supported by the Spanish Government through its FPU grant program and the projects TEC2007-65400 (SemanticVideo), TEC2011-25995 (Event Video) and TEC2014-53176-R (HAVideo); the European Commission (IST-FP6-027685, Mesh); the Comunidad de Madrid (S-0505/TIC-0223, ProMultiDis-CM); and the Spanish Administration Agency CENIT 2007-1007 (VISION)

    Efficient multi-level scene understanding in videos

    Automatic video parsing is a key step towards human-level dynamic scene understanding, and a fundamental problem in computer vision. A core issue in video understanding is to infer multiple scene properties of a video in an efficient and consistent manner. This thesis addresses the problem of holistic scene understanding from monocular videos, jointly reasoning about semantic and geometric scene properties at multiple levels, including pixel-wise annotation of video frames, object instance segmentation in the spatio-temporal domain, and scene-level description in terms of scene categories and layouts. We focus on four main issues in holistic video understanding: 1) what is the representation for consistent semantic and geometric parsing of videos? 2) how do we integrate high-level reasoning (e.g., objects) with pixel-wise video parsing? 3) how can we do efficient inference for multi-level video understanding? and 4) what is the representation learning strategy for efficient, cost-aware scene parsing? We discuss three multi-level video scene segmentation scenarios based on different aspects of scene properties and efficiency requirements. The first case addresses the problem of consistent geometric and semantic video segmentation for outdoor scenes. We propose a geometric scene layout representation, the stage scene model, to efficiently capture the dependency between the semantic and geometric labels. We build a unified conditional random field for joint modeling of the semantic class, geometric label and stage representation, and design an alternating inference algorithm to minimize the resulting energy function. The second case focuses on the problem of simultaneous pixel-level and object-level segmentation in videos. We propose to incorporate foreground object information into pixel labeling by jointly reasoning about the semantic labels of supervoxels, object instance tracks and geometric relations between objects.
    In order to model objects, we take an exemplar-based approach that uses a small set of object annotations to generate a set of object proposals. We then design a conditional random field framework that jointly models the supervoxel labels and object instance segments. To scale up our method, we develop an active inference strategy that improves the efficiency of multi-level video parsing by adaptively selecting an informative subset of object proposals and performing inference on the resulting compact model. The last case explores the problem of learning a flexible representation for efficient scene labeling. We propose a dynamic hierarchical model that allows us to achieve flexible trade-offs between efficiency and accuracy. Our approach incorporates the cost of feature computation and model inference, and optimizes the model performance for any given test-time budget. We evaluate all our methods on several publicly available video and image semantic segmentation datasets, and demonstrate superior performance in efficiency and accuracy. Keywords: Semantic video segmentation, Multi-level scene understanding, Efficient inference, Cost-aware scene parsing
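    The idea of selecting an informative, affordable subset of proposals can be illustrated with a deliberately simplified greedy selector. The CRF inference and the information measure of the thesis are not reproduced here: `scores` stands in for whatever informativeness measure the model provides, and `costs` for per-proposal inference cost.

```python
import numpy as np

def select_proposals(scores, costs, budget):
    """Toy active-selection step: greedily keep the proposals with the
    best score-per-cost ratio until the inference budget is spent.
    Inference would then run only on this compact subset."""
    order = np.argsort(-scores / costs)   # best ratio first
    chosen, spent = [], 0.0
    for i in order:
        if spent + costs[i] <= budget:
            chosen.append(int(i))
            spent += costs[i]
    return chosen, spent

scores = np.array([0.9, 0.2, 0.7, 0.4])   # hypothetical informativeness
costs  = np.array([2.0, 1.0, 1.0, 3.0])   # hypothetical inference costs
chosen, spent = select_proposals(scores, costs, budget=3.0)
```

    Under the budget of 3.0, the selector keeps proposals 2 and 0 and skips the rest, which is the flavour of trade-off the active inference strategy exploits at a much larger scale.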

    Adaptive Sensor Optimization and Cognitive Image Processing Using Autonomous Optical Neuroprocessors


    Audio Event Classification Using Deep Learning Methods

    Whether crossing the road or enjoying a concert, sound carries important information about the world around us. Audio event classification refers to recognition tasks involving the assignment of one or several labels, such as ‘dog bark’ or ‘doorbell’, to a particular audio signal. Teaching machines to perform this classification task can therefore help humans in many fields. Since deep learning has shown great potential and usefulness in many AI applications, this thesis focuses on studying deep learning methods and building suitable neural networks for the audio event classification task. In order to evaluate the performance of different neural networks, we tested them on both Google AudioSet and the dataset for DCASE 2018 Task 2. Instead of providing the original audio files, AudioSet offers compact 128-dimensional embeddings produced by a modified VGG model for audio, with a frame length of 960 ms. For DCASE 2018 Task 2, we first preprocessed the soundtracks and then fine-tuned the VGG model that AudioSet used as a feature extractor. Thus, each soundtrack from both tasks is represented as a series of 128-dimensional features. We then compared DNN, LSTM, and multi-level attention models with different hyperparameters. The results show that fine-tuning the feature generation model for the DCASE task greatly improved the evaluation score. In addition, the attention models performed best in our settings on both tasks. The results indicate that using a CNN-like model as a feature extractor for the log-mel spectrograms and modeling the dynamics with an attention model can achieve state-of-the-art results in audio event classification. For future research, the thesis suggests training a better CNN model for feature extraction, utilizing multi-scale and multi-level features for better classification, and combining the audio features with other multimodal information for audiovisual data analysis
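    A single attention head over the 128-dimensional frame embeddings can be sketched as follows. The thesis's multi-level attention combines several such layers over intermediate representations; here the projection `w` is random rather than learned, purely for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(x):
    # Numerically stable softmax over a 1-D score vector.
    e = np.exp(x - x.max())
    return e / e.sum()

def attention_pool(frames, w):
    """Pool a (T, 128) sequence of frame embeddings into one clip-level
    vector: the projection w scores each frame, softmax turns the scores
    into weights, and the clip vector is the weighted mean of frames."""
    alpha = softmax(frames @ w)   # (T,) attention weights, sum to 1
    return alpha @ frames         # (128,) clip embedding

frames = rng.standard_normal((10, 128))  # e.g. ten 960 ms VGG-style frames
w = rng.standard_normal(128)             # stand-in for a learned projection
clip = attention_pool(frames, w)
```

    The clip-level vector would then feed a small classifier; frames the head finds irrelevant receive weights near zero instead of diluting the average, which is the advantage attention pooling has over plain mean pooling.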

    Breaking Down the Barriers To Operator Workload Estimation: Advancing Algorithmic Handling of Temporal Non-Stationarity and Cross-Participant Differences for EEG Analysis Using Deep Learning

    This research focuses on two barriers to using EEG data for workload assessment: day-to-day variability and cross-participant applicability. Several signal processing techniques and deep learning approaches are evaluated in multi-task environments. These methods account for temporal, spatial, and frequential data dependencies. Variance of frequency-domain power distributions for cross-day workload classification is statistically significant. Skewness and kurtosis are not significant in an environment without workload transitions, but are salient when transitions are present. LSTMs improve day-to-day feature stationarity, decreasing error by 59% compared to previous best results. A multi-path convolutional recurrent model using bi-directional, residual recurrent layers significantly increases predictive accuracy and decreases cross-participant variance. Deep learning regression approaches are applied to a multi-task environment with workload transitions. Accounting for temporal dependence significantly reduces error and increases correlation compared to baselines. Visualization techniques for LSTM feature saliency are developed to understand EEG analysis model biases
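    As a point of reference for the day-to-day stationarity problem, a common baseline (not the LSTM approach evaluated above) is to z-score each feature channel within each recording day, removing day-specific mean and scale drift before classification:

```python
import numpy as np

def zscore_per_day(features, day_ids):
    """Standardize each feature channel within each recording day.
    features: (n_samples, n_channels) array; day_ids: (n_samples,) labels.
    Removes per-day mean/scale drift, a simple stationarity baseline."""
    out = np.empty_like(features, dtype=float)
    for day in np.unique(day_ids):
        m = day_ids == day
        mu = features[m].mean(axis=0)
        sd = features[m].std(axis=0) + 1e-12   # guard against flat channels
        out[m] = (features[m] - mu) / sd
    return out

rng = np.random.default_rng(1)
x = np.concatenate([rng.normal(5, 2, (50, 4)),    # day 1: one drift regime
                    rng.normal(-3, 7, (50, 4))])  # day 2: a different one
days = np.array([1] * 50 + [2] * 50)
z = zscore_per_day(x, days)
```

    After this step both days share zero mean and unit scale per channel, which is the kind of feature stationarity the LSTM approach is reported to improve far more effectively.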

    Optimization of Automatic Target Recognition with a Reject Option Using Fusion and Correlated Sensor Data

    This dissertation examines the optimization of automatic target recognition (ATR) systems when a rejection option is included. First, a comprehensive review of the literature inclusive of ATR assessment, fusion, correlated sensor data, and classifier rejection is presented. An optimization framework for the fusion of multiple sensors is then developed. This framework identifies preferred fusion rules and sensors along with rejection and receiver operating characteristic (ROC) curve thresholds without the use of explicit misclassification costs as required by a Bayes' loss function. This optimization framework is the first to integrate both vertical warfighter output label analysis and horizontal engineering confusion matrix analysis. In addition, optimization is performed for the true positive rate, which incorporates the time required by classification systems. The mathematical programming framework is used to assess different fusion methods and to characterize correlation effects both within and across sensors. A synthetic classifier fusion-testing environment is developed by controlling the correlation levels of generated multivariate Gaussian data. This synthetic environment is used to demonstrate the utility of the optimization framework and to assess the performance of fusion algorithms as correlation varies. The mathematical programming framework is then applied to collected radar data. This radar fusion experiment optimizes Boolean and neural network fusion rules across four levels of sensor correlation. Comparisons are presented for the maximum true positive rate and the percentage of feasible thresholds to assess system robustness. Empirical evidence suggests ATR performance may improve by reducing the correlation within and across polarimetric radar sensors. Sensitivity analysis shows ATR performance is affected by the number of forced looks, prior probabilities, the maximum allowable rejection level, and the acceptable error rates
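    The rejection option discussed above follows the classic Chow-style rule: classify only when the top posterior clears a threshold, otherwise withhold a decision. A minimal sketch of that rule follows; the dissertation's contribution is optimizing such thresholds jointly with fusion rules, which is not shown here, and the posterior values are invented for illustration.

```python
import numpy as np

def classify_with_reject(posteriors, threshold):
    """Return the arg-max class for each row of a (n, k) posterior
    matrix when the top posterior is at least `threshold`; otherwise
    return -1 to signal rejection (deferral to another sensor/look)."""
    top = posteriors.max(axis=1)
    labels = posteriors.argmax(axis=1)
    labels[top < threshold] = -1   # reject low-confidence decisions
    return labels

p = np.array([[0.90, 0.10],   # confident: accept class 0
              [0.55, 0.45],   # ambiguous: reject
              [0.20, 0.80]])  # confident: accept class 1
out = classify_with_reject(p, threshold=0.7)
```

    Raising the threshold trades declarations for reliability, which is why the rejection level appears as an explicit constraint in the optimization framework.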

    HUMAN FACE RECOGNITION BASED ON FRACTAL IMAGE CODING

    Human face recognition is an important area in the field of biometrics. It has been an active area of research for several decades, but still remains a challenging problem because of the complexity of the human face. In this thesis we describe fully automatic solutions that can locate faces and then perform identification and verification. We present a solution for face localisation using eye locations. We derive an efficient representation for the decision hyperplane of linear and nonlinear Support Vector Machines (SVMs). For this we introduce the novel concept of ρ and η prototypes. The standard formulation for the decision hyperplane is reformulated and expressed in terms of the two prototypes. Different kernels are treated separately to achieve further classification efficiency and to facilitate adaptation to the fast Fourier transform for fast eye detection. Using the eye locations, we extract and normalise the face for size and in-plane rotations. Our method produces a more efficient representation of the SVM decision hyperplane than the well-known reduced set methods. As a result, our eye detection subsystem is faster and more accurate. The use of fractals and fractal image coding for object recognition has been proposed and used by others. Fractal codes have been used as features for recognition, but this requires taking into account the distance between codes and ensuring the continuity of the code parameters. We use a method based on fractal image coding for recognition, which we call the Fractal Neighbour Distance (FND). The FND relies on the Euclidean metric and the uniqueness of the attractor of a fractal code. An advantage of using the FND over fractal codes as features is that we do not have to worry about the uniqueness of, and distance between, codes. We only require the uniqueness of the attractor, which is already an implied property of a properly generated fractal code.
Similar methods to the FND have been proposed by others, but what distinguishes our work is that we investigate the FND in greater detail and use our findings to improve the recognition rate. Our investigations reveal that the FND has some inherent invariance to translation, scale, rotation and changes in illumination. These invariances are image-dependent and are affected by the fractal encoding parameters. The parameters that have the greatest effect on recognition accuracy are the contrast scaling factor, the luminance shift factor and the type of range block partitioning. The contrast scaling factor affects the convergence, and the eventual convergence rate, of a fractal decoding process. We propose a novel method of controlling the convergence rate by altering the contrast scaling factor in a controlled manner, which has not been possible before. This helped us improve the recognition rate, because under certain conditions better results are achievable using a slower rate of convergence. We also investigate the effects of varying the luminance shift factor, and examine three different types of range block partitioning schemes: quad-tree, HV and uniform partitioning. We performed experiments using various face datasets, and the results show that our method indeed performs better than many accepted methods such as eigenfaces. The experiments also show that the FND-based classifier increases the separation between classes. The standard FND is further improved by incorporating localised weights. A local search algorithm is introduced to find the best matching local feature using this locally weighted FND. The scores from a set of these locally weighted FND operations are then combined to obtain a global score, which is used as a measure of the similarity between two face images. Each local FND operation possesses the distortion-invariant properties described above.
Combined with the search procedure, the method has the potential to be invariant to a larger class of non-linear distortions. We also present a set of locally weighted FNDs that concentrate on the upper part of the face encompassing the eyes and nose. This design was motivated by the fact that the region around the eyes carries more discriminative information. Better performance is achieved by using different sets of weights for identification and verification. For facial verification, performance is further improved by using normalised scores and client-specific thresholding. In this case, our results are competitive with current state-of-the-art methods, and in some cases outperform all those to which they were compared. For facial identification, under some conditions the weighted FND performs better than the standard FND. However, the weighted FND still has its shortcomings on some datasets, where its performance is not much better than the standard FND. To alleviate this problem we introduce a voting scheme that operates with normalised versions of the weighted FND. Although there are no improvements at lower matching ranks using this method, there are significant improvements at larger matching ranks. Our methods offer advantages over some well-accepted approaches such as eigenfaces, neural networks and those that use statistical learning theory. Some of the advantages are: new faces can be enrolled without re-training involving the whole database; faces can be removed from the database without the need for re-training; there are inherent invariances to face distortions; it is relatively simple to implement; and it is not model-based, so there are no model parameters that need to be tweaked
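    The role of the contrast scaling factor in decoding convergence can be seen in a one-block toy model: fractal decoding iterates an affine contractive map whose attractor is independent of the starting image, and the magnitude of the scaling factor s sets the geometric rate of convergence. This is a simplified scalar analogue of our own devising, not the thesis's block-partitioned codec.

```python
def decode_steps(s, o, v0=0.0, tol=1e-6, max_iter=1000):
    """Iterate the affine fractal-code map v <- s*v + o (contrast scale s,
    luminance shift o) and count the steps needed to come within `tol`
    of the attractor o / (1 - s). Smaller |s| means faster convergence."""
    v, target = v0, o / (1.0 - s)
    for k in range(max_iter):
        if abs(v - target) < tol:
            return k
        v = s * v + o
    return max_iter

fast = decode_steps(s=0.3, o=1.0)  # strong contraction
slow = decode_steps(s=0.9, o=1.0)  # weak contraction, many more iterations
```

    Because |s| < 1 guarantees contraction, the attractor is reached from any starting value; tuning s therefore controls the convergence rate without changing the attractor itself, which mirrors the controlled-convergence idea above.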

    Intelligent Sensors for Human Motion Analysis

    The book, "Intelligent Sensors for Human Motion Analysis," contains 17 articles published in the Special Issue of the Sensors journal. These articles deal with many aspects related to the analysis of human movement. New techniques and methods for pose estimation, gait recognition, and fall detection have been proposed and verified. Some of them will trigger further research, and some may become the backbone of commercial systems