
    Adapting RGB pose estimation to new domains

    2019 Spring. Includes bibliographical references. Many multi-modal human computer interaction (HCI) systems interact with users in real time by estimating the user's pose. Generally, they estimate human poses using depth sensors such as the Microsoft Kinect. For multi-modal HCI interfaces to gain traction in the real world, however, it would be better for pose estimation to be based on data from RGB cameras, which are more common and less expensive than depth sensors. This has motivated research into pose estimation from RGB images. Convolutional Neural Networks (CNNs) represent the state of the art in this literature, for example [1–5] and [6]. These systems estimate 2D human poses from RGB images. A problem with current CNN-based pose estimators is that they require large amounts of labeled data for training. If the goal is to train an RGB pose estimator for a new domain, the cost of collecting and, more importantly, labeling data can be prohibitive. A common solution is to train on publicly available pose data sets, but then the trained system is not tailored to the domain. We propose using RGB+D sensors to collect domain-specific data in the lab, and then training the RGB pose estimator using skeletons automatically extracted from the RGB+D data. This paper presents a case study of adapting the RMPE pose estimation network [4] to the domain of the DARPA Communicating with Computers (CWC) program [7], as represented by the EGGNOG data set [8]. We chose RMPE because it predicts both joint locations and Part Affinity Fields (PAFs) in real time. Our adaptation of RMPE trained on automatically labeled data outperforms the original RMPE on the EGGNOG data set.
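
    As a rough illustration of how RGB+D skeletons can become training labels for an RGB pose network, the Python sketch below projects 3D joints reported by a depth sensor into RGB pixel coordinates. The joint list, calibration values, and label format are illustrative assumptions, not the paper's actual pipeline.

    # Hypothetical sketch: turning RGB+D skeleton output into 2D keypoint labels
    # for an RGB pose network. Calibration values below are assumed, not measured.
    import numpy as np

    def project_joints_to_rgb(joints_3d, intrinsics, extrinsics):
        """Project 3D skeleton joints (N x 3, meters, depth-camera frame)
        into RGB pixel coordinates using a pinhole camera model."""
        # Transform from depth-camera coordinates to RGB-camera coordinates.
        joints_h = np.hstack([joints_3d, np.ones((joints_3d.shape[0], 1))])
        cam_pts = (extrinsics @ joints_h.T).T
        # Perspective projection with the RGB camera intrinsics.
        fx, fy, cx, cy = intrinsics
        u = fx * cam_pts[:, 0] / cam_pts[:, 2] + cx
        v = fy * cam_pts[:, 1] / cam_pts[:, 2] + cy
        return np.stack([u, v], axis=1)

    # Example: one automatically labeled training sample.
    joints_3d = np.array([[0.1, -0.2, 2.0],    # e.g. right wrist
                          [0.0, -0.5, 2.1]])   # e.g. right elbow
    intrinsics = (525.0, 525.0, 319.5, 239.5)  # assumed RGB intrinsics
    extrinsics = np.eye(4)[:3]                 # assumed depth-to-RGB transform
    keypoints_2d = project_joints_to_rgb(joints_3d, intrinsics, extrinsics)
    # keypoints_2d can now serve as pseudo-ground-truth joint locations
    # for fine-tuning a 2D pose network such as RMPE on the new domain.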

    Multimodal agents for cooperative interaction

    2020 Fall. Includes bibliographical references. Embodied virtual agents offer the potential to interact with a computer in a more natural manner, similar to how we interact with other people. Reaching this potential requires multimodal interaction, including both speech and gesture. This project builds on earlier work at Colorado State University and Brandeis University on just such a multimodal system, referred to as Diana. I designed and developed a new software architecture to directly address some of the difficulties of the earlier system, particularly with regard to asynchronous communication, e.g., interrupting the agent after it has begun to act. Various other enhancements were made to the agent systems, including the model itself, as well as speech recognition, speech synthesis, motor control, and gaze control. Further refactoring and new code were developed to achieve software engineering goals that are not outwardly visible but no less important: decoupling, testability, improved networking, and independence from a particular agent model. This work, combined with the effort of others in the lab, has produced a "version 2" Diana system that is well positioned to serve the lab's research needs in the future. In addition, to pursue new research opportunities related to developmental and intervention science, a "Faelyn Fox" agent was developed. This is a different model, with a simplified cognitive architecture, and a system for defining an experimental protocol (for example, a toy-sorting task) based on Unity's visual state machine editor. This version, too, lays a solid foundation for future research.
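
    The interruption problem mentioned above, stopping the agent after it has begun to act, is essentially one of cancellable asynchronous tasks. A minimal Python sketch of that pattern follows; the function names are hypothetical and do not reflect Diana's actual API, which is built on Unity and networked components.

    # Minimal sketch of asynchronous interruption: a long-running action is
    # started as a task and cancelled cleanly when the user says to stop.
    import asyncio

    async def perform_action(name: str, duration: float):
        """Stand-in motor-control routine that can be cancelled mid-way."""
        try:
            print(f"agent: starting {name}")
            await asyncio.sleep(duration)   # stands in for incremental motion updates
            print(f"agent: finished {name}")
        except asyncio.CancelledError:
            print(f"agent: {name} interrupted, returning to idle pose")
            raise

    async def on_user_input():
        """Simulate the user interrupting the agent after 0.5 s."""
        await asyncio.sleep(0.5)
        return "never mind, stop"

    async def main():
        action = asyncio.create_task(perform_action("reach-for-block", 2.0))
        utterance = await on_user_input()
        if "stop" in utterance:
            action.cancel()    # decoupled components communicate by cancelling tasks
        try:
            await action
        except asyncio.CancelledError:
            pass

    asyncio.run(main())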

    One-shot learning with pretrained convolutional neural network

    2019 Summer. Includes bibliographical references. Recent progress in convolutional neural networks and deep learning has revolutionized the image classification field, and computers can now classify images with very high accuracy. However, unlike the human vision system, which efficiently recognizes a new object after seeing a similar one, recognizing new classes of images requires a time- and resource-consuming process of retraining a neural network due to several restrictions. Since a pretrained neural network has seen a large amount of training data, it may generalize to recognize new classes effectively and efficiently, given that it can extract patterns from training images. This has inspired research in one-shot learning, the process of learning to classify a novel class from a single training image of that class. One-shot learning can help expand the use of a trained convolutional neural network without costly model retraining. In addition to the practical application of one-shot learning, it is also important to understand how a convolutional neural network supports one-shot learning. More specifically, how is the feature space structured to support one-shot learning? Answering this can potentially help us better understand the mechanisms of convolutional neural networks. This thesis proposes an approximate nearest neighbor-based method for one-shot learning. The method uses the features produced by a pretrained convolutional neural network and builds a proximity forest to classify new classes. The algorithm is tested on two datasets of different scales and achieves reasonably high classification accuracy on both. Furthermore, this thesis tries to explain the success of the proposed method by examining the feature space. A novel tool, generalized curvature analysis, is used to probe the feature space structure of the convolutional neural network. The results show that the feature space curves around samples from both known classes and unknown in-domain classes, but not around transition samples between classes or out-of-domain samples. In addition, the low curvature of out-of-domain samples is correlated with the inability of a pretrained convolutional neural network to classify out-of-domain classes, indicating that a pretrained model cannot generate useful feature representations for out-of-domain samples. In summary, this thesis proposes a new method for one-shot learning and provides insight into the feature space of convolutional neural networks.
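
    A minimal sketch of the general idea, not the thesis's exact method: features from a pretrained CNN are compared with a plain nearest-neighbor rule, whereas the thesis builds a proximity forest (an approximate nearest-neighbor structure) over those features. The backbone choice and tensor shapes below are assumptions for illustration.

    # One-shot classification from pretrained-CNN features via nearest neighbor.
    import torch
    import torchvision
    import torch.nn.functional as F

    # Pretrained backbone with the classification head removed -> feature extractor.
    weights = torchvision.models.ResNet18_Weights.DEFAULT
    backbone = torchvision.models.resnet18(weights=weights)
    backbone.fc = torch.nn.Identity()
    backbone.eval()

    @torch.no_grad()
    def embed(images: torch.Tensor) -> torch.Tensor:
        """Map a batch of preprocessed images (N, 3, 224, 224) to unit-length feature vectors."""
        return F.normalize(backbone(images), dim=1)

    # One-shot setting: a single support image per novel class.
    support_images = torch.randn(3, 3, 224, 224)   # stand-ins for 3 novel-class exemplars
    query_images = torch.randn(5, 3, 224, 224)     # stand-ins for test images

    support_feats = embed(support_images)          # (3, 512)
    query_feats = embed(query_images)              # (5, 512)

    # Assign each query to the novel class whose single exemplar is nearest
    # in feature space (cosine similarity on the normalized embeddings).
    similarity = query_feats @ support_feats.T     # (5, 3)
    predicted_class = similarity.argmax(dim=1)
    print(predicted_class)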

    Improving gesture recognition through spatial focus of attention

    2018 Fall. Includes bibliographical references. Gestures are a common form of human communication and important for human computer interfaces (HCI). Most recent approaches to gesture recognition use deep learning within multi-channel architectures. We show that when spatial attention is focused on the hands, gesture recognition improves significantly, particularly when the channels are fused using a sparse network. We propose an architecture (FOANet) that divides processing among four modalities (RGB, depth, RGB flow, and depth flow) and three spatial focus-of-attention regions (global, left hand, and right hand). The resulting 12 channels are fused using sparse networks. This architecture improves performance on the ChaLearn IsoGD dataset from a previous best of 67.71% to 82.07%, and on the NVIDIA dynamic hand gesture dataset from 83.8% to 91.28%.
    We extend FOANet to perform gesture recognition on continuous streams of data. We show that the best temporal fusion strategies for multi-channel networks depend on the modality (RGB vs. depth vs. flow field) and target (global vs. left hand vs. right hand) of the channel. The extended architecture achieves optimum performance using Gaussian pooling for global channels, LSTMs for focused (left hand or right hand) flow field channels, and late pooling for focused RGB and depth channels. The resulting system achieves a mean Jaccard Index of 0.7740, compared to the previous best result of 0.6103, on the ChaLearn ConGD dataset without first pre-segmenting the videos into single gesture clips.
    Human vision has α and β channels for processing different modalities, in addition to spatial attention similar to FOANet. However, unlike FOANet, attention is not implemented through separate neural channels; instead, attention is implemented through top-down excitation of neurons corresponding to specific spatial locations within the α and β channels. Motivated by covert attention in human vision, we propose a new architecture called CANet (Covert Attention Net) that merges spatial attention channels while preserving the concept of attention. The focus layers of CANet allow it to focus attention on hands without having dedicated attention channels. CANet outperforms FOANet by achieving an accuracy of 84.79% on the ChaLearn IsoGD dataset while being efficient (≈35% of FOANet parameters and ≈70% of FOANet operations).
    In addition to producing state-of-the-art results on multiple gesture recognition datasets, this thesis also tries to understand the behavior of multi-channel networks (à la FOANet). Multi-channel architectures are becoming increasingly common, setting the state of the art for performance in gesture recognition and other domains. Unfortunately, we lack a clear explanation of why multi-channel architectures outperform single-channel ones. This thesis considers two hypotheses. The Bagging hypothesis says that multi-channel architectures succeed because they average the results of multiple unbiased weak estimators in the form of different channels. The Society of Experts (SoE) hypothesis suggests that multi-channel architectures succeed because the channels differentiate themselves, developing expertise with regard to different aspects of the data; fusion layers then get to combine complementary information. This thesis presents two sets of experiments to distinguish between these hypotheses, and both sets support the SoE hypothesis, suggesting that multi-channel architectures succeed because their channels become specialized.
    Finally, we demonstrate the practical impact of the gesture recognition techniques discussed in this thesis in the context of a sophisticated human computer interaction system. We developed a prototype system with a limited form of peer-to-peer communication in the context of blocks world. The prototype allows users to communicate with the avatar using gestures and speech and to make the avatar build virtual block structures.
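
    To make the multi-channel idea concrete, the sketch below fuses class scores from twelve hypothetical modality/attention channels with a small learned weighting layer. This is a simplified stand-in for FOANet's sparse fusion networks, not the thesis's implementation; the channel names and class count are illustrative assumptions.

    # Simplified channel fusion in the spirit of FOANet: per-channel class scores
    # are combined by a learned softmax weighting over channels.
    import torch
    import torch.nn as nn

    NUM_CLASSES = 249   # e.g. ChaLearn IsoGD gesture classes (assumed here)
    CHANNELS = [f"{m}_{r}"
                for m in ("rgb", "depth", "rgb_flow", "depth_flow")
                for r in ("global", "left_hand", "right_hand")]   # 12 channels

    class FusionHead(nn.Module):
        """Learned per-channel weights applied to stacked channel scores."""
        def __init__(self, num_channels: int):
            super().__init__()
            self.channel_weights = nn.Parameter(torch.ones(num_channels))

        def forward(self, channel_scores: torch.Tensor) -> torch.Tensor:
            # channel_scores: (batch, num_channels, num_classes)
            w = torch.softmax(self.channel_weights, dim=0)
            return (channel_scores * w[None, :, None]).sum(dim=1)

    # Stand-in scores from 12 independently trained channel networks.
    batch = 4
    scores = torch.randn(batch, len(CHANNELS), NUM_CLASSES)
    fused = FusionHead(len(CHANNELS))(scores)
    print(fused.shape)   # torch.Size([4, 249])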

    Intelligent Systems and Applications: Proceedings of the 2018 Intelligent Systems Conference (IntelliSys), Volume 1

    Gathering the Proceedings of the 2018 Intelligent Systems Conference (IntelliSys 2018), this book offers a remarkable collection of chapters covering a wide range of topics in intelligent systems and computing, and their real-world applications. The Conference attracted a total of 568 submissions from pioneering researchers, scientists, industrial engineers, and students from all around the world. These submissions underwent a double-blind peer review process, after which 194 (including 13 poster papers) were selected to be included in these proceedings. As intelligent systems continue to replace and sometimes outperform human intelligence in decision-making processes, they have made it possible to tackle many problems more effectively. This branching out of computational intelligence in several directions, and the use of intelligent systems in everyday applications, have created the need for such an international conference, which serves as a venue for reporting on cutting-edge innovations and developments. This book collects both theory- and application-based chapters on all aspects of artificial intelligence, from classical to intelligent scope. Readers are sure to find the book both interesting and valuable, as it presents state-of-the-art intelligent methods and techniques for solving real-world problems, along with a vision of future research directions.
    Contents: ViZDoom: DRQN with Prioritized Experience Replay, Double-Q Learning and Snapshot Ensembling -- Ship Classification from SAR Images based on Deep Learning -- HIMALIA: Recovering Compiler Optimization Levels from Binaries by Deep Learning -- Architecture of Management Game for Reinforced Deep Learning -- The Cognitive Packet Network with QoS and Cybersecurity Deep Learning Clusters -- Convolution Neural Network Application for Road Asset Detection and Classification in LiDAR Point Cloud -- PerceptionNet: A Deep Convolutional Neural Network for Late Sensor Fusion -- Reinforcement Learning for Fair Dynamic Pricing -- A Classification-Regression Deep Learning Model for People Counting -- The Impact of Replacing Complex Hand-Crafted Features with Standard Features for Melanoma Classification using Both Hand-Crafted and Deep Features -- Deep Learning in Classifying Depth of Anesthesia (DoA) -- Content based Video Retrieval using Convolutional Neural Network -- Proposal and Evaluation of an Indirect Reward Assignment Method for Reinforcement Learning by Profit Sharing Method -- Eye-Tracking to Enhance Usability: A Race Game -- Automatized Approach to Assessment of Degree of Delamination around a Scribe -- Face Detection and Recognition for Automatic Attendance System -- Fine Localization of Complex Components for Bin Picking -- Intrusion Detection in Computer Networks based on KNN, K-Means++ and J48 -- Cooperating with Avatars through Gesture, Language and Action -- A Safer YouTube Kids: An Extra Layer of Content Filtering using Automated Multimodal Analysis -- Designing an Augmented Reality Multimodal Interface for 6DOF Manipulation Techniques -- InstaSent: A Novel Framework for Sentiment Analysis based on Instagram Selfies -- Segmentation of Heart Sound by Clustering using Spectral and Temporal Features -- Evaluation of Classifiers for Emotion Detection while Performing Physical and Visual Tasks: Tower of Hanoi and IAPS -- Investigating Input Protocols, Image Analysis, and Machine Learning Methods for an Intelligent Identification System of Fusarium Oxysporum Sp. in Soil Samples -- Intelligent System Design for Massive Collection and Recognition of Faces in Integrated Control Centres -- Wheat Plots Segmentation for Experimental Agricultural Field from Visible and Multispectral UAV Imaging -- Evaluation of Image Spatial Resolution for Machine Learning Mapping of Wildland Fire Effects -- Region-based Poisson Blending for Image Repairing -- Modified Radial Basis Function and Orthogonal Bipolar Vector for Better Performance of Pattern Recognition -- Video Detection for Dynamic Fire Texture by using Motion Pattern Recognition -- A Gaussian-Median Filter for Moving Objects Segmentation Applied for Static Scenarios -- Straight Boundary Detection Algorithm based on Orientation Filter -- Using Motion Detection and Facial Recognition to Secure Places of High Security: A Case Study at Banking Vaults of Ghana -- Kinect-Based Frontal View Gait Recognition using Support Vector Machine -- Curve Evolution based on Edge Following Algorithm for Medical Image Segmentation -- Enhancing Translation from English to Arabic using Two-Phase Decoder Translation -- On Character vs Word Embeddings as Input for English Sentence Classification.