83 research outputs found

    POL-LWIR Vehicle Detection: Convolutional Neural Networks Meet Polarised Infrared Sensors

    Get PDF
    For vehicle autonomy, driver assistance and situational awareness, it is necessary to operate at day and night, and in all weather conditions. In particular, long wave infrared (LWIR) sensors that receive predominantly emitted radiation have the capability to operate at night as well as during the day. In this work, we employ a polarised LWIR (POL-LWIR) camera to acquire data from a mobile vehicle, to compare and contrast four different convolutional neural network (CNN) configurations to detect other vehicles in video sequences. We evaluate two distinct and promising approaches, two-stage detection (Faster-RCNN) and one-stage detection (SSD), in four different configurations. We also employ two different image decompositions: the first based on the polarisation ellipse and the second on the Stokes parameters themselves. To evaluate our approach, the experimental trials were quantified by mean average precision (mAP) and processing time, showing a clear trade-off between the two factors. For example, the best mAP result of 80.94% was achieved using Faster-RCNN, but at a frame rate of 6.4 fps. In contrast, MobileNet SSD achieved only 64.51% mAP, but at 53.4 fps.Comment: Computer Vision and Pattern Recognition Workshop 201

    POL-LWIR Vehicle Detection: Convolutional Neural Networks Meet Polarised Infrared Sensors

    Get PDF

    Human Pose Estimation from Monocular Images : a Comprehensive Survey

    Get PDF
    Human pose estimation refers to the estimation of the location of body parts and how they are connected in an image. Human pose estimation from monocular images has wide applications (e.g., image indexing). Several surveys on human pose estimation can be found in the literature, but they focus on a certain category; for example, model-based approaches or human motion analysis, etc. As far as we know, an overall review of this problem domain has yet to be provided. Furthermore, recent advancements based on deep learning have brought novel algorithms for this problem. In this paper, a comprehensive survey of human pose estimation from monocular images is carried out including milestone works and recent advancements. Based on one standard pipeline for the solution of computer vision problems, this survey splits the problema into several modules: feature extraction and description, human body models, and modelin methods. Problem modeling methods are approached based on two means of categorization in this survey. One way to categorize includes top-down and bottom-up methods, and another way includes generative and discriminative methods. Considering the fact that one direct application of human pose estimation is to provide initialization for automatic video surveillance, there are additional sections for motion-related methods in all modules: motion features, motion models, and motion-based methods. Finally, the paper also collects 26 publicly available data sets for validation and provides error measurement methods that are frequently used

    Automatic target recognition with deep metric learning.

    Get PDF
    An Automatic Target Recognizer (ATR) is a real or near-real time understanding system where its input (images, signals) are obtained from sensors and its output is the detected and recognized target. ATR is an important task in many civilian and military computer vision applications. The used sensors, such as infrared (IR) imagery, enlarge our knowledge of the surrounding environment, especially at night as they provide continuous surveillance. However, ATR based on IR faces major challenges such as meteorological conditions, scale and viewpoint invariance. In this thesis, we propose solutions that are based on Deep Metric Learning (DML). DML is a technique that has been recently proposed to learn a transformation to a representation space (embedding space) in end-to-end manner based on convolutional neural networks. We explore three distinct approaches. The first one, is based on optimizing a loss function based on a set of triplets [47]. The second one is based on a method that aims to capture the explicit distributions of the different classes in the transformation space [45]. The third method aims to learn a compact hyper-spherical embedding based on Von Mises-Fisher distribution [64]. For these methods, we propose strategies to select and update the constraints to reduce the intra-class variations and increase the inter-class variations. To validate, analyze and compare the three different DML approaches, we use a large real benchmark data that contain multiple target classes of military and civilian vehicles. These targets are captured at different viewing angles, different ranges, and different times of the day. We validate the effectiveness of these methods by evaluating their classification performance as well as analyzing the compactness of their learned features. We show that the three considered methods can learn models that achieve their objectives

    Unsupervised Selection and Estimation of Non-Gaussian Mixtures for High Dimensional Data Analysis

    Get PDF
    Lately, the enormous generation of databases in almost every aspect of life has created a great demand for new, powerful tools for turning data into useful information. Therefore, researchers were encouraged to explore and develop new machine learning ideas and methods. Mixture models are one of the machine learning techniques receiving considerable attention due to their ability to handle efficiently and effectively multidimensional data. Generally, four critical issues have to be addressed when adopting mixture models in high dimensional spaces: (1) choice of the probability density functions, (2) estimation of the mixture parameters, (3) automatic determination of the number of components M in the mixture, and (4) determination of what features best discriminate among the different components. The main goal of this thesis is to summarize all these challenging interrelated problems in one unified model. In most of the applications, the Gaussian density is used in mixture modeling of data. Although a Gaussian mixture may provide a reasonable approximation to many real-world distributions, it is certainly not always the best approximation especially in computer vision and image processing applications where we often deal with non-Gaussian data. Therefore, we propose to use three highly flexible distributions: the generalized Gaussian distribution (GGD), the asymmetric Gaussian distribution (AGD), and the asymmetric generalized Gaussian distribution (AGGD). We are motivated by the fact that these distributions are able to fit many distributional shapes and then can be considered as a useful class of flexible models to address several problems and applications involving measurements and features having well-known marked deviation from the Gaussian shape. Recently, researches have shown that model selection and parameter learning are highly dependent and should be performed simultaneously. For this purpose, many approaches have been suggested. The vast majority of these approaches can be classified, from a computational point of view, into two classes: deterministic and stochastic methods. Deterministic methods estimate the model parameters for a set of candidate models using the Expectation-Maximization (EM) framework, then choose the model that maximizes a model selection criterion. Stochastic methods such as Markov chain Monte Carlo (MCMC) can be used in order to sample from the full a posteriori distribution with M considered unknown. Hence, in this thesis, we propose three learning techniques capable of automatically determining model complexity while learning its parameters. First, we incorporate a Minimum Message Length (MML) penalty in the model learning step performed using the EM algorithm. Our second approach employs the Rival Penalized EM (RPEM) algorithm which is able to select an appropriate number of densities by fading out the redundant densities from a density mixture. Last but not least, we incorporate the nonparametric aspect of mixture models by assuming a countably infinite number of components and using Markov Chain Monte Carlo (MCMC) simulations for the estimation of the posterior distributions. Hence, the difficulty of choosing the appropriate number of clusters is sidestepped by assuming that there are an infinite number of mixture components. Another essential issue in the case of statistical modeling in general and finite mixtures in particular is feature selection (i.e. identification of the relevant or discriminative features describing the data) especially in the case of high-dimensional data. Indeed, feature selection has been shown to be a crucial step in several image processing, computer vision and pattern recognition applications not only because it speeds up learning but also because it improves model accuracy and generalization. Moreover, the learning of the mixture parameters ( i.e. both model selection and parameters estimation) is greatly affected by the quality of the features used. Hence, in this thesis, we are trying to solve the feature selection problem in unsupervised learning by casting it as an estimation problem, thus avoiding any combinatorial search. Finally, the effectiveness of our approaches is evaluated by applying them to different computer vision and image processing applications

    The Understanding of Human Activities by Computer Vision Techniques

    Get PDF
    Esta tesis propone nuevas metodologías para el aprendizaje de actividades humanas y su clasificación en categorías. Aunque este tema ha sido ampliamente estudiado por la comunidad investigadora en visión por computador, aún encontramos importantes dificultades por resolver. En primer lugar hemos encontrado que la literatura sobre técnicas de visión por computador para el aprendizaje de actividades humanas empleando pocas secuencias de entrenamiento es escasa y además presenta resultados pobres [1] [2]. Sin embargo, este aprendizaje es una herramienta crucial en varios escenarios. Por ejemplo, un sistema de reconocimiento recién desplegado necesita mucho tiempo para adquirir nuevas secuencias de entrenamiento así que el entrenamiento con pocos ejemplos puede acelerar la puesta en funcionamiento. También la detección de comportamientos anómalos, ejemplos de los cuales son difíciles de obtener, puede beneficiarse de estas técnicas. Existen soluciones mediante técnicas de cruce dominios o empleando características invariantes, sin embargo estas soluciones omiten información del escenario objetivo la cual reduce el ruido en el sistema mejorando los resultados cuando se tiene en cuenta y ejemplos de actividades anómalas siguen siendo difíciles de obtener. Estos sistemas entrenados con poca información se enfrentan a dos problemas principales: por una parte el sistema de entrenamiento puede sufrir de inestabilidades numéricas en la estimación de los parámetros del modelo, por otra, existe una falta de información representativa proveniente de actividades diversas. Nos hemos enfrentado a estos problemas proponiendo novedosos métodos para el aprendizaje de actividades humanas usando tan solo un ejemplo, lo que se denomina one-shot learning. Nuestras propuestas se basan en sistemas generativos, derivadas de los Modelos Ocultos de Markov[3][4], puesto que cada clase de actividad debe ser aprendida con tan solo un ejemplo. Además, hemos ampliado la diversidad de información en los modelos aplicado una transferencia de información desde fuentes externas al escenario[5]. En esta tesis se explican varias propuestas y se muestra como con ellas hemos conseguidos resultados en el estado del arte en tres bases de datos públicas [6][7][8]. La segunda dificultad a la que nos hemos enfrentado es el reconocimiento de actividades sin restricciones en el escenario. En este caso no tiene por qué coincidir el escenario de entrenamiento y el de evaluación por lo que la reducción de ruido anteriormente expuesta no es aplicable. Esto supone que se pueda emplear cualquier ejemplo etiquetado para entrenamiento independientemente del escenario de origen. Esta libertad nos permite extraer vídeos desde cualquier fuente evitando la restricción en el número de ejemplos de entrenamiento. Teniendo suficientes ejemplos de entrenamiento tanto métodos generativos como discriminativos pueden ser empleados. En el momento de realización de esta tesis encontramos que el estado del arte obtiene los mejores resultados empleando métodos discriminativos, sin embargo, la mayoría de propuestas no suelen considerar la información temporal a largo plazo de las actividades[9]. Esta información puede ser crucial para distinguir entre actividades donde el orden de sub-acciones es determinante, y puede ser una ayuda en otras situaciones[10]. Para ello hemos diseñado un sistema que incluye dicha información en una Máquina de Vectores de Soporte. Además, el sistema permite cierta flexibilidad en la alineación de las secuencias a comparar, característica muy útil si la segmentación de las actividades no es perfecta. Utilizando este sistema hemos obtenido resultados en el estado del arte para cuatro bases de datos complejas sin restricciones en los escenarios[11][12][13][14]. Los trabajos realizados en esta tesis han servido para realizar tres artículos en revistas del primer cuartil [15][16][17], dos ya publicados y otro enviado. Además, se han publicado 8 artículos en congresos internacionales y uno nacional [18][19][20][21][22][23][24][25][26]. [1]Seo, H. J. and Milanfar, P. (2011). Action recognition from one example. IEEE Transactions on Pattern Analysis and Machine Intelligence, 33(5):867–882.(2011) [2]Yang, Y., Saleemi, I., and Shah, M. Discovering motion primitives for unsupervised grouping and one-shot learning of human actions, gestures, and expressions. IEEE Transactions on Pattern Analysis and Machine Intelligence, 35(7):1635–1648. (2013) [3]Rabiner, L. R. A tutorial on hidden markov models and selected applications in speech recognition. Proceedings of the IEEE, 77(2):257–286. (1989) [4]Bishop, C. M. Pattern Recognition and Machine Learning (Information Science and Statistics). Springer-Verlag New York, Inc., Secaucus, NJ, USA. (2006) [5]Cook, D., Feuz, K., and Krishnan, N. Transfer learning for activity recognition: a survey. Knowledge and Information Systems, pages 1–20. (2013) [6]Schuldt, C., Laptev, I., and Caputo, B. Recognizing human actions: a local svm approach. In International Conference on Pattern Recognition (ICPR). (2004) [7]Weinland, D., Ronfard, R., and Boyer, E. Free viewpoint action recognition using motion history volumes. Computer Vision and Image Understanding, 104(2-3):249–257. (2006) [8]Gorelick, L., Blank, M., Shechtman, E., Irani, M., and Basri, R. Actions as space-time shapes. IEEE Transactions on Pattern Analysis and Machine Intelligence, 29(12):2247–2253. (2007) [9]Wang, H. and Schmid, C. Action recognition with improved trajectories. In IEEE International Conference on Computer Vision (ICCV). (2013) [10]Choi, J., Wang, Z., Lee, S.-C., and Jeon, W. J. A spatio-temporal pyramid matching for video retrieval. Computer Vision and Image Understanding, 117(6):660 – 669. (2013) [11]Oh, S., Hoogs, A., Perera, A., Cuntoor, N., Chen, C.-C., Lee, J. T., Mukherjee, S., Aggarwal, J. K., Lee, H., Davis, L., Swears, E., Wang, X., Ji, Q., Reddy, K., Shah, M., Vondrick, C., Pirsiavash, H., Ramanan, D., Yuen, J., Torralba, A., Song, B., Fong, A., Roy-Chowdhury, A., and Desai, M. A large-scale benchmark dataset for event recognition in surveillance video. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 3153–3160. (2011) [12] Niebles, J. C., Chen, C.-W., and Fei-Fei, L. Modeling temporal structure of decomposable motion segments for activity classification. In European Conference on Computer Vision (ECCV), pages 392–405.(2010) [13]Reddy, K. K. and Shah, M. Recognizing 50 human action categories of web videos. Machine Vision and Applications, 24(5):971–981. (2013) [14]Kuehne, H., Jhuang, H., Garrote, E., Poggio, T., and Serre, T. HMDB: a large video database for human motion recognition. In IEEE International Conference on Computer Vision (ICCV). (2011) [15]Rodriguez, M., Orrite, C., Medrano, C., and Makris, D. One-shot learning of human activity with an map adapted gmm and simplex-hmm. IEEE Transactions on Cybernetics, PP(99):1–12. (2016) [16]Rodriguez, M., Orrite, C., Medrano, C., and Makris, D. A time flexible kernel framework for video-based activity recognition. Image and Vision Computing 48-49:26 – 36. (2016) [17]Rodriguez, M., Orrite, C., Medrano, C., and Makris, D. Extended Study for One-shot Learning of Human Activity by a Simplex-HMM. IEEE Transactions on Cybernetics (Enviado) [18]Orrite, C., Rodriguez, M., Medrano, C. One-shot learning of temporal sequences using a distance dependent Chinese Restaurant Process. In Proceedings of the 23nd International Conference Pattern Recognition ICPR (December 2016) [19]Rodriguez, M., Medrano, C., Herrero, E., and Orrite, C. Spectral Clustering Using Friendship Path Similarity Proceedings of the 7th Iberian Conference, IbPRIA (June 2015) [20]Orrite, C., Soler, J., Rodriguez, M., Herrero, E., and Casas, R. Image-based location recognition and scenario modelling. In Proceedings of the 10th International Conference on Computer Vision Theory and Applications, VISAPP (March 2015) [21]Castán, D., Rodríguez, M., Ortega, A., Orrite, C., and Lleida, E. Vivolab and cvlab - mediaeval 2014: Violent scenes detection affect task. In Working Notes Proceedings of the MediaEval (October 2014) [22]Orrite, C., Rodriguez, M., Herrero, E., Rogez, G., and Velastin, S. A. Automatic segmentation and recognition of human actions in monocular sequences In Proceedings of the 22nd International Conference Pattern Recognition ICPR (August 2014) [23]Rodriguez, M., Medrano, C., Herrero, E., and Orrite, C. Transfer learning of human poses for action recognition. In 4th International Workshop of Human Behavior Unterstanding (HBU). (October 2013) [24]Rodriguez, M., Orrite, C., and Medrano, C. Human action recognition with limited labelled data. In Actas del III Workshop de Reconocimiento de Formas y Analisis de Imagenes, WSRFAI. (September 2013) [25]Orrite, C., Monforte, P., Rodriguez, M., and Herrero, E. Human Action Recognition under Partial Occlusions . Proceedings of the 6th Iberian Conference, IbPRIA (June 2013) [26]Orrite, C., Rodriguez, M., and Montañes, M. One sequence learning of human actions. In 2nd International Workshop of Human Behavior Unterstanding (HBU). (November 2011)This thesis provides some novel frameworks for learning human activities and for further classifying them into categories. This field of research has been largely studied by the computer vision community however there are still many drawbacks to solve. First, we have found few proposals in the literature for learning human activities from limited number of sequences. However, this learning is critical in several scenarios. For instance, in the initial stage after a system installation the capture of activity examples is time expensive and therefore, the learning with limited examples may accelerate the operational launch of the system. Moreover, examples for training abnormal behaviour are hardly obtainable and their learning may benefit from the same techniques. This problem is solved by some approaches, such as cross domain implementations or the use of invariant features, but they do not consider the specific scenario information which is useful for reducing the clutter and improving the results. Systems trained with scarce information face two main problems: on the one hand, the training process may suffer from numerical instabilities while estimating the model parameters; on the other hand, the model lacks of representative information coming from a diverse set of activity classes. We have dealt with these problems providing some novel approaches for learning human activities from one example, what is called a one-shot learning method. To do so, we have proposed generative approaches based on Hidden Markov Models as we need to learn each activity class from only one example. In addition, we have transferred information from external sources in order to introduce diverse information into the model. This thesis explains our proposals and shows how these methods achieve state-of-the-art results in three public datasets. Second, we have studied the recognition of human activities in unconstrained scenarios. In this case, the scenario may or may not be repeated in training and evaluation and therefore the clutter reduction previously mentioned does not happen. On the other hand, we can use any labelled video for training the system independently of the target scenario. This freedom allows the extraction of videos from the Internet dismissing the implicit constrains when training with limited examples. Having plenty of training examples both, generative and discriminative, methods can be used and by the time this thesis has been made the state-of-the-art has been achieved by discriminative ones. However, most of the methods usually fail when taking into consideration long-term information of the activities. This information is critical when comparing activities where the order of sub-actions is important, and may be useful in other comparisons as well. Thus, we have designed a framework that incorporates this information in a discriminative classifier. In addition, this method introduces some flexibility for sequence alignment, useful feature when the activity segmentation is not exact. Using this framework we have obtained state-of-the-art results in four challenging public datasets with unconstrained scenarios

    Deep Architectures for Visual Recognition and Description

    Get PDF
    In recent times, digital media contents are inherently of multimedia type, consisting of the form text, audio, image and video. Several of the outstanding computer Vision (CV) problems are being successfully solved with the help of modern Machine Learning (ML) techniques. Plenty of research work has already been carried out in the field of Automatic Image Annotation (AIA), Image Captioning and Video Tagging. Video Captioning, i.e., automatic description generation from digital video, however, is a different and complex problem altogether. This study compares various existing video captioning approaches available today and attempts their classification and analysis based on different parameters, viz., type of captioning methods (generation/retrieval), type of learning models employed, the desired output description length generated, etc. This dissertation also attempts to critically analyze the existing benchmark datasets used in various video captioning models and the evaluation metrics for assessing the final quality of the resultant video descriptions generated. A detailed study of important existing models, highlighting their comparative advantages as well as disadvantages are also included. In this study a novel approach for video captioning on the Microsoft Video Description (MSVD) dataset and Microsoft Video-to-Text (MSR-VTT) dataset is proposed using supervised learning techniques to train a deep combinational framework, for achieving better quality video captioning via predicting semantic tags. We develop simple shallow CNN (2D and 3D) as feature extractors, Deep Neural Networks (DNNs and Bidirectional LSTMs (BiLSTMs) as tag prediction models and Recurrent Neural Networks (RNNs) (LSTM) model as the language model. The aim of the work was to provide an alternative narrative to generating captions from videos via semantic tag predictions and deploy simpler shallower deep model architectures with lower memory requirements as solution so that it is not very memory extensive and the developed models prove to be stable and viable options when the scale of the data is increased. This study also successfully employed deep architectures like the Convolutional Neural Network (CNN) for speeding up automation process of hand gesture recognition and classification of the sign languages of the Indian classical dance form, ‘Bharatnatyam’. This hand gesture classification is primarily aimed at 1) building a novel dataset of 2D single hand gestures belonging to 27 classes that were collected from (i) Google search engine (Google images), (ii) YouTube videos (dynamic and with background considered) and (iii) professional artists under staged environment constraints (plain backgrounds). 2) exploring the effectiveness of CNNs for identifying and classifying the single hand gestures by optimizing the hyperparameters, and 3) evaluating the impacts of transfer learning and double transfer learning, which is a novel concept explored for achieving higher classification accuracy

    Robust Methods for Accurate and Efficient Reconstruction from Motion Imagery

    Get PDF
    Creating virtual representations of real-world scenes has been a long-standing goal in photogrammetry and computer vision, and has high practical relevance in industries involved in creating intelligent urban solutions. This includes a wide range of applications such as urban and community planning, reconnaissance missions by the military and government, autonomous robotics, virtual reality, cultural heritage preservation, and many others. Over the last decades, image-based modeling emerged as one of the most popular solutions. The objective is to extract metric information directly from images. Many procedural techniques achieve good results in terms of robustness, accuracy, completeness, and efficiency. More recently, deep-learning-based techniques were proposed to tackle this problem by training on vast amounts of data to learn to associate features between images through deep convolutional neural networks and were shown to outperform traditional procedural techniques. However, many of the key challenges such as large displacement and scalability still remain, especially when dealing with large-scale aerial imagery. This thesis investigates image-based modeling and proposes robust and scalable methods for large-scale aerial imagery. First, we present a method for reconstructing large-scale areas from aerial imagery that formulates the solution as a single-step process, reducing the processing time considerably. Next, we address feature matching and propose a variational optical flow technique (HybridFlow) for dense feature matching that leverages the robustness of graph matching to large displacements. The proposed solution efficiently handles arbitrary-sized aerial images. Finally, for general-purpose image-based modeling, we propose a deep-learning-based approach, an end-to-end multi-view structure from motion employing hypercorrelation volumes for learning dense feature matches. We demonstrate the application of the proposed techniques on several applications and report on task-related measures
    • …
    corecore