51 research outputs found

    Dynamic Spatio-Temporal Bag of Expressions (D-STBoE) model for human action recognition

    Get PDF
    This article belongs to the Section Intelligent Sensors.
    Human action recognition (HAR) has emerged as a core research domain for video understanding and analysis, attracting many researchers. Although significant results have been achieved in simple scenarios, HAR remains challenging due to view independence, occlusion, and inter-class variation observed in realistic scenarios. Previous research efforts have widely used the classical bag of visual words approach and its variations. In this paper, we propose a Dynamic Spatio-Temporal Bag of Expressions (D-STBoE) model for human action recognition that retains the strengths of the classical bag of visual words approach. Expressions are formed based on the density of a spatio-temporal cube around a visual word. To handle inter-class variation, we use class-specific visual word representations for visual expression generation. In contrast to the Bag of Expressions (BoE) model, which constructs neighborhoods with a fixed number of neighbors and can therefore include non-relevant information that makes a visual expression less discriminative under occlusion and changing viewpoints, the proposed formation of visual expressions is based on the density of the spatio-temporal cube built around each visual word. This makes the model more robust to the occlusion and viewpoint-change challenges present in realistic scenarios. Furthermore, we train a multi-class Support Vector Machine (SVM) to classify bags of expressions into action classes. Comprehensive experiments on four publicly available datasets (KTH, UCF Sports, UCF11, and UCF50) show that the proposed model outperforms existing state-of-the-art human action recognition methods, reaching accuracies of 99.21%, 98.60%, 96.94%, and 94.10%, respectively.
    Sergio A. Velastin is grateful for funding received from the Universidad Carlos III de Madrid, the European Union's Seventh Framework Programme for research, technological development and demonstration under grant agreement N° 600371, el Ministerio de Economía, Industria y Competitividad (COFUND2013-51509), el Ministerio de Educación, Cultura y Deporte (CEI-15-17), and Banco Santander. Muhammad Haroon Yousaf received funding from the Higher Education Commission, Pakistan for the Swarm Robotics Lab under the National Centre for Robotics and Automation (NCRA). The authors also acknowledge support from the Directorate of ASR&TD, University of Engineering and Technology Taxila, Pakistan.
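    For readers unfamiliar with the classical pipeline this paper builds on, the sketch below shows a plain bag-of-visual-words action classifier in Python with scikit-learn. It is a simplified illustration under stated assumptions, not the D-STBoE method itself: the density-based expression construction is replaced by ordinary k-means quantization, and the function names (build_histograms, train_and_score) are hypothetical.

        # Minimal bag-of-visual-words sketch, assuming local spatio-temporal
        # descriptors have already been extracted per video (e.g., with a STIP
        # detector). The density-based spatio-temporal cubes of D-STBoE are
        # NOT reproduced here; this is only the classical baseline pipeline.
        import numpy as np
        from sklearn.cluster import KMeans
        from sklearn.svm import SVC

        def build_histograms(video_descriptors, kmeans):
            """Quantize each video's descriptors into a normalized histogram."""
            k = kmeans.n_clusters
            hists = []
            for desc in video_descriptors:           # desc: (n_points, d) array
                words = kmeans.predict(desc)         # assign points to visual words
                h = np.bincount(words, minlength=k).astype(float)
                hists.append(h / max(h.sum(), 1.0))  # L1-normalize
            return np.vstack(hists)

        def train_and_score(train_desc, train_labels, test_desc, test_labels, k=1000):
            # learn the visual vocabulary from all training descriptors
            kmeans = KMeans(n_clusters=k, n_init=4).fit(np.vstack(train_desc))
            X_train = build_histograms(train_desc, kmeans)
            X_test = build_histograms(test_desc, kmeans)
            clf = SVC(kernel="rbf", C=10.0)          # multi-class SVM (one-vs-one)
            clf.fit(X_train, train_labels)
            return clf.score(X_test, test_labels)

    D-STBoE replaces the fixed vocabulary histogram with class-specific, density-selected expressions, but the surrounding quantize-then-classify structure is the same.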

    Foundations and Recent Trends in Multimodal Machine Learning: Principles, Challenges, and Open Questions

    Full text link
    Multimodal machine learning is a vibrant multi-disciplinary research field that aims to design computer agents with intelligent capabilities such as understanding, reasoning, and learning by integrating multiple communicative modalities, including linguistic, acoustic, visual, tactile, and physiological messages. With the recent interest in video understanding, embodied autonomous agents, text-to-image generation, and multisensor fusion in application domains such as healthcare and robotics, multimodal machine learning has brought unique computational and theoretical challenges to the machine learning community, given the heterogeneity of data sources and the interconnections often found between modalities. However, the breadth of progress in multimodal research has made it difficult to identify the common themes and open questions in the field. By synthesizing a broad range of application domains and theoretical frameworks from both historical and recent perspectives, this paper provides an overview of the computational and theoretical foundations of multimodal machine learning. We start by defining the two key principles of modality heterogeneity and interconnection that have driven subsequent innovations, and propose a taxonomy of six core technical challenges covering historical and recent trends: representation, alignment, reasoning, generation, transference, and quantification. Recent technical achievements are presented through the lens of this taxonomy, allowing researchers to understand the similarities and differences across new approaches. We end by motivating several open problems for future research as identified by our taxonomy.

    Automatic generation of natural language descriptions of visual data: describing images and videos using recurrent and self-attentive models

    Get PDF
    Humans are faced with a constant flow of visual stimuli, e.g., from the environment or when looking at social media. In contrast, visually impaired people are often unable to perceive and process this beneficial information, which could help them maneuver through everyday situations and activities. However, audible feedback such as natural language can make them more aware of their surroundings, thus enabling them to master everyday challenges autonomously. One possibility for creating audible feedback is to produce natural language descriptions for visual data such as still images and then read this text to the person. Moreover, textual descriptions of images can be further utilized for text analysis (e.g., sentiment analysis) and information aggregation. In this work, we investigate different approaches and techniques for the automatic generation of natural language descriptions of visual data such as still images and video clips. In particular, we look at language models that generate textual descriptions with recurrent neural networks: First, we present a model that generates image captions for scenes depicting interactions between humans and branded products. We focus on the correct identification of the brand name in a multi-task training setting and present two new metrics that allow us to evaluate this requirement. Second, we explore the automatic answering of questions posed about an image. We propose a model that generates answers from scratch instead of predicting an answer from a limited set of possible answers. In comparison to related works, we are therefore able to generate rare answers that are not contained in the pool of frequent answers. Third, we address the automatic generation of doctors' reports for chest X-ray images. We introduce a model that can cope with the bias of medical datasets (i.e., abnormal cases are very rare) and generates reports with a hierarchical recurrent model. We also investigate the correlation between the distinctiveness of a report and its score under traditional metrics, and find a discrepancy between good scores and accurate reports. We then examine self-attentive language models that improve computational efficiency and performance over the recurrent models. Specifically, we utilize the Transformer architecture. First, we expand automatic description generation to the domain of videos, where we present a video-to-text (VTT) model that can easily synchronize audio-visual features. With an extensive experimental exploration, we verify the effectiveness of our video-to-text translation pipeline. Finally, we revisit our recurrent models with this self-attentive approach.
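    As a concrete illustration of the recurrent captioning models discussed above, the PyTorch sketch below shows a generic CNN-feature/LSTM-decoder image captioner. It is a textbook-style design under stated assumptions (feature and embedding dimensions, class and argument names are all hypothetical), not the dissertation's exact architecture.

        # Minimal LSTM caption decoder: a pretrained-CNN image feature is
        # projected into token space and prepended as the first "token".
        import torch
        import torch.nn as nn

        class CaptionDecoder(nn.Module):
            def __init__(self, vocab_size, feat_dim=2048, embed_dim=256, hidden_dim=512):
                super().__init__()
                self.img_proj = nn.Linear(feat_dim, embed_dim)  # image -> token space
                self.embed = nn.Embedding(vocab_size, embed_dim)
                self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
                self.out = nn.Linear(hidden_dim, vocab_size)

            def forward(self, img_feats, captions):
                # img_feats: (B, feat_dim); captions: (B, T) token ids
                img_tok = self.img_proj(img_feats).unsqueeze(1)  # (B, 1, E)
                tok_emb = self.embed(captions[:, :-1])           # teacher forcing
                seq = torch.cat([img_tok, tok_emb], dim=1)       # (B, T, E)
                hidden, _ = self.lstm(seq)
                return self.out(hidden)                          # (B, T, vocab) logits

        # training step sketch:
        #   logits = model(feats, caps)
        #   loss = nn.functional.cross_entropy(
        #       logits.reshape(-1, logits.size(-1)), caps.reshape(-1),
        #       ignore_index=pad_id)

    The self-attentive models discussed later swap the LSTM for a Transformer decoder over the same image-conditioned token sequence.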

    Recognizing complex faces and gaits via novel probabilistic models

    Get PDF
    In the field of computer vision, developing automated systems to recognize people under unconstrained scenarios is a partially solved problem. In unconstrained scenarios, a number of common variations and complexities, such as occlusion, illumination, and cluttered backgrounds, impose vast uncertainty on the recognition process. Among the various biometrics that have been emerging recently, this dissertation focuses on two of them, namely face and gait recognition. First, we address the problem of recognizing faces with major occlusions amidst other variations such as pose, scale, expression, and illumination using a novel PRObabilistic Component based Interpretation Model (PROCIM) inspired by key psychophysical principles that are closely related to reasoning under uncertainty. The model employs Bayesian Networks to establish, learn, interpret, and exploit intrinsic similarity mappings from the face domain. Then, by incorporating efficient inference strategies, robust decisions are made for successfully recognizing faces under uncertainty. PROCIM reports improved recognition rates over recent approaches. Second, we address the emerging gait recognition problem and show that PROCIM can be easily adapted to the gait domain as well. We scientifically define and formulate sub-gaits and propose a novel modular training scheme to efficiently learn subtle sub-gait characteristics from the gait domain. Our results show that the proposed model is robust to several uncertainties and yields significant recognition performance. Apart from PROCIM, we finally show how simple component-based gait reasoning can be coherently modeled using the recently prominent Markov Logic Networks (MLNs) by intuitively fusing imaging, logic, and graphs. We have discovered that the face and gait domains exhibit interesting similarity mappings between object entities and their components. We have proposed intuitive probabilistic methods to model these mappings and perform recognition under various uncertainty elements. Extensive experimental validation justifies the robustness of the proposed methods over state-of-the-art techniques.
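    The component-based, reason-under-occlusion idea can be illustrated with a toy probabilistic scorer in Python. This is a naive-Bayes simplification, not the Bayesian-network structure of PROCIM; the function names and the per-component Gaussian models are hypothetical assumptions for illustration.

        import numpy as np

        # Toy component-based face scorer: each identity is represented by
        # per-component Gaussian models (eyes, nose, mouth, ...); occluded
        # components are skipped, so the remaining evidence still yields a
        # posterior over identities.
        def log_gaussian(x, mean, var):
            return -0.5 * (np.log(2 * np.pi * var) + (x - mean) ** 2 / var).sum()

        def recognize(components, models, prior):
            """components: {name: feature vector, or None if occluded}
            models: {identity: {name: (mean, var)}}; prior: {identity: p}"""
            scores = {}
            for ident, comps in models.items():
                logp = np.log(prior[ident])
                for name, feat in components.items():
                    if feat is None:      # occluded component: no evidence
                        continue
                    mean, var = comps[name]
                    logp += log_gaussian(feat, mean, var)
                scores[ident] = logp
            # normalize log-scores to a posterior over identities
            m = max(scores.values())
            z = sum(np.exp(s - m) for s in scores.values())
            return {k: np.exp(v - m) / z for k, v in scores.items()}

    A full Bayesian network additionally models dependencies between components, which is what lets PROCIM exploit similarity mappings rather than treating components as independent.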

    Automatic assessment of honey bee cells using deep learning

    Get PDF
    Temporal assessment of honey bee colony strength is required for many research projects, and it often involves counting the number of comb cells with brood and food reserves multiple times a year. There are thousands of cells in each comb, which makes manual counting a time-consuming, tedious, and thereby error-prone task. The automation of this task using modern image processing techniques therefore represents a major advance. Herein, we developed software capable of (i) detecting each cell in comb images, (ii) classifying its content, and (iii) displaying the results to the researcher in a simple way. The cells' contents typically display a high variation of patterns, which makes their classification by software a challenging endeavour. To address this challenge, we used Deep Neural Networks (DNNs). DNNs are known for achieving the state of the art in many fields of study, including image classification, because they learn by themselves the features that best describe the content being classified. Our DNN model was trained with over 70,000 manually labelled cell images separated into seven classes. Our contribution is end-to-end software that performs automatic background removal, cell detection, and classification of cell content from an input comb image. With this software, colony assessment achieves an average accuracy of 94% across the seven classes in our dataset, representing substantial progress over the approximation methods (e.g., Lieberfeld) currently used by honey bee researchers and over previous machine learning techniques that used handmade features like colour and texture.
    This research was conducted in the framework of the project BEEHOPE, funded through the 2013-2014 BiodivERsA/FACCE-JPI Joint call for research proposals, with the national funders FCT (Portugal), CNRS (France), and MEC (Spain).
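    A minimal version of the cell-content classification stage might look like the PyTorch sketch below: a small CNN over fixed-size cell crops with seven output classes. The architecture, crop size, and all names are assumptions for illustration, not the classifier actually used in this work.

        import torch
        import torch.nn as nn

        NUM_CLASSES = 7  # seven cell-content classes, as stated in the abstract

        class CellNet(nn.Module):
            """Small CNN for classifying 64x64 RGB crops of individual comb cells."""
            def __init__(self):
                super().__init__()
                self.features = nn.Sequential(
                    nn.Conv2d(3, 32, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),   # 64 -> 32
                    nn.Conv2d(32, 64, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),  # 32 -> 16
                    nn.Conv2d(64, 128, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2), # 16 -> 8
                )
                self.classifier = nn.Linear(128 * 8 * 8, NUM_CLASSES)

            def forward(self, x):          # x: (B, 3, 64, 64) cell crops
                h = self.features(x)
                return self.classifier(h.flatten(1))

        # training step sketch:
        #   logits = CellNet()(batch_imgs)
        #   loss = nn.functional.cross_entropy(logits, batch_labels)

    In the full pipeline, cell detection supplies the crops and the per-cell predictions are aggregated into colony-level counts.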

    Domain Adaptive Computational Models for Computer Vision

    Get PDF
    The widespread adoption of computer vision models is often constrained by the issue of domain mismatch: models that are trained with data belonging to one distribution perform poorly when tested with data from a different distribution. Variations in vision-based data can be attributed to the following reasons, viz., differences in image quality (resolution, brightness, occlusion, and color), changes in camera perspective, dissimilar backgrounds, and an inherent diversity of the samples themselves. Machine learning techniques like transfer learning are employed to adapt computational models across distributions. Domain adaptation is a special case of transfer learning, where knowledge from a source domain is transferred to a target domain in the form of learned models and efficient feature representations. The dissertation outlines novel domain adaptation approaches across different feature spaces: (i) a linear Support Vector Machine model for domain alignment; (ii) a nonlinear kernel-based approach that embeds domain-aligned data for enhanced classification; (iii) a hierarchical model, implemented using deep learning, that estimates domain-aligned hash values for the source and target data; and (iv) a proposal for a feature selection technique to reduce cross-domain disparity. These adaptation procedures are tested and validated across a range of computer vision applications like object classification, facial expression recognition, digit recognition, and activity recognition. The dissertation also provides a unique perspective on the domain adaptation literature from the point of view of linear, nonlinear, and hierarchical feature spaces. The dissertation concludes with a discussion of future directions for research that highlight the role of domain adaptation in an era of rapid advancements in artificial intelligence.
    Dissertation/Thesis. Doctoral Dissertation, Computer Science, 201
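    As one simple, generic example of feature-space domain alignment (in the spirit of the linear approaches above, though not any of the dissertation's specific models), the sketch below implements a CORAL-style second-order alignment: source features are whitened and then re-colored with the target covariance. The function name and the added mean shift are assumptions of this sketch.

        import numpy as np
        from scipy.linalg import fractional_matrix_power

        def coral_align(source, target, eps=1e-5):
            """Align source features (n_s, d) to target (n_t, d) by matching
            second-order statistics (covariances), plus a mean shift."""
            d = source.shape[1]
            cs = np.cov(source, rowvar=False) + eps * np.eye(d)
            ct = np.cov(target, rowvar=False) + eps * np.eye(d)
            whiten = fractional_matrix_power(cs, -0.5)   # remove source correlations
            recolor = fractional_matrix_power(ct, 0.5)   # impose target correlations
            return (source - source.mean(0)) @ whiten @ recolor + target.mean(0)

        # A classifier (e.g., a linear SVM) trained on coral_align(Xs, Xt) with
        # the source labels can then be applied directly to target features Xt.

    The appeal of such linear alignment is that it needs no target labels; the hierarchical and hashing-based approaches in the dissertation pursue the same goal with learned, nonlinear mappings.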

    Programmable Image-Based Light Capture for Previsualization

    Get PDF
    Previsualization is a class of techniques for creating approximate previews of a movie sequence in order to visualize a scene prior to shooting it on the set. Often these techniques are used to convey the artistic direction of the story in terms of cinematic elements such as camera movement, angle, lighting, dialogue, and character motion. Essentially, a movie director uses previsualization (previs) to convey movie visuals as he sees them in his mind's eye. Traditional methods for previs include hand-drawn sketches, storyboards, scaled models, and photographs, which are created by artists to convey how a scene or character might look or move. A recent trend has been to use 3D graphics applications such as video game engines to perform previs, which is called 3D previs. This type of previs is generally used prior to shooting a scene in order to choreograph camera or character movements. To visualize a scene while it is being recorded on set, directors and cinematographers use a technique called on-set previs, which provides a real-time view with little to no processing. Other types of previs, such as technical previs, emphasize accurately capturing scene properties but lack any interactive manipulation and are usually employed by visual effects crews rather than cinematographers or directors. This dissertation's focus is on creating a new method for interactive visualization that automatically captures the on-set lighting and provides interactive manipulation of cinematic elements to facilitate the movie maker's artistic expression, validate cinematic choices, and provide guidance to production crews. Our method overcomes the drawbacks of all previous previs methods by combining photorealistic rendering with accurately captured scene details, interactively displayed on a mobile capture and rendering platform. This dissertation describes a new hardware and software previs framework that enables interactive visualization of on-set post-production elements. The three-tiered framework, which is the main contribution of this dissertation, consists of: 1) a novel programmable camera architecture that provides programmability of low-level features and a visual programming interface; 2) new algorithms that analyze and decompose the scene photometrically; and 3) a previs interface that leverages the previous two to perform interactive rendering and manipulation of the photometric and computer-generated elements. For this dissertation we implemented a programmable camera with a novel visual programming interface. We developed the photometric theory and implementation of our novel relighting technique, called Symmetric Lighting, which can be used to relight a scene with multiple illuminants with respect to color, intensity, and location on our programmable camera. We analyzed the performance of Symmetric Lighting on synthetic and real scenes to evaluate its benefits and limitations with respect to the reflectance composition of the scene and the number and color of lights within the scene. Since our method is based on a Lambertian reflectance assumption, it works well under that assumption, but scenes with large amounts of specular reflection can show higher relighting errors, and additional steps are required to mitigate this limitation. Also, scenes containing lights whose colors are too similar can lead to degenerate cases in terms of relighting.
    Despite these limitations, an important contribution of our work is that Symmetric Lighting can also be leveraged as a solution for performing multi-illuminant white balancing and light color estimation within a scene with multiple illuminants, without limits on the color range or number of lights. We compared our method to other white balance methods and show that it is superior when at least one of the light colors is known a priori.
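    The Lambertian assumption behind the relighting can be illustrated with a toy model in Python: each pixel is a sum of per-light contributions, so once those contributions are separated, relighting reduces to rescaling and re-tinting each term. The decomposition step assumed here (one captured image per light) is a simplification for illustration, not the Symmetric Lighting estimation procedure itself, and the function name is hypothetical.

        import numpy as np

        # Toy Lambertian relighting: pixel = albedo * sum_i shading_i * light_color_i.
        # Assume img_per_light[i] is the scene lit only by light i; relighting
        # then just re-weights and re-tints each light's contribution.
        def relight(img_per_light, old_colors, new_colors, gains):
            """img_per_light: list of (H, W, 3) float images, one per light.
            old_colors/new_colors: (3,) RGB per light; gains: intensity scales."""
            out = np.zeros_like(img_per_light[0])
            for img, c_old, c_new, g in zip(img_per_light, old_colors, new_colors, gains):
                # divide out the old light color, apply the new color and intensity
                contrib = img / np.maximum(c_old, 1e-6)
                out += g * contrib * c_new
            return np.clip(out, 0.0, 1.0)

    Strong specular reflections break the additive albedo-times-lighting model above, which mirrors the limitation the dissertation reports for its Lambertian assumption.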