
    THE PERCEPTION-ACTION LOOP IN ATTENTION-BASED PREDICTIVE AGENTS: APPLICATION TO MULTIMODAL DATA GENERATION AND RECOGNITION

    With the proliferation of soft and hard sensors, data in multiple sensor modalities has become commonplace. In this dissertation, we propose a general-purpose agent model that operates using a closed perception-action loop. The agent actively and sequentially samples its environment, driven by sensory prediction error. It learns where and what to sample by minimizing this prediction error, without any reinforcement. This end-to-end model is evaluated on three applications: (1) generation and recognition of handwritten numerals and alphabets from images and videos, (2) generation and recognition of human-human interactions from videos, and (3) recognition of emotions from speech via generation. For each application, the model yields state-of-the-art accuracy on benchmark datasets while also maintaining sample and model-size efficiency. To validate our model with respect to human performance, we collect mouse-click attention tracking (mcAT) data from 382 participants trying to recognize handwritten numerals and alphabets (upper- and lowercase) from images via sequential sampling. Images from benchmark datasets are presented as stimuli. The collected data consist of a sequence of sample (click) locations, predicted class label(s) at each sampling, and the duration of each sampling. We show that, on average, participants observe only 12.8% of an image for recognition. When exposed to the same stimuli and experimental conditions as the participants, our agent model performs handwritten numeral/alphabet recognition more efficiently than both the participants and a highly cited attention-based reinforcement model.
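
    To make the error-driven sampling idea concrete, here is a minimal sketch of one possible perception-action loop, written in plain NumPy. It is an illustration rather than the dissertation's implementation: the patch size, the greedy argmax rule, and above all the fact that the sketch scores error against the true image (the actual agent must estimate where its prediction error is high from its own learned model) are simplifying assumptions.

```python
# Minimal sketch of an error-driven perception-action loop (illustrative only).
# `predictor` stands in for the agent's learned generative model; scoring error
# against the full image below is a simplification of what a real agent can do.
import numpy as np

def perception_action_loop(image, predictor, n_glimpses=5, patch=7):
    """Sequentially glimpse `image` where the current prediction error is highest."""
    h, w = image.shape
    canvas = np.zeros_like(image)                # what has been reconstructed so far
    observed = np.zeros_like(image, dtype=bool)  # which pixels have been glimpsed
    locations = []
    for _ in range(n_glimpses):
        prediction = predictor(canvas, observed)   # complete the partial percept
        error = np.abs(image - prediction)         # sensory prediction error
        error[observed] = 0                        # ignore already-seen pixels
        y, x = np.unravel_index(np.argmax(error), error.shape)  # where to look next
        y0, y1 = max(0, y - patch // 2), min(h, y + patch // 2 + 1)
        x0, x1 = max(0, x - patch // 2), min(w, x + patch // 2 + 1)
        canvas[y0:y1, x0:x1] = image[y0:y1, x0:x1]  # act: take the glimpse
        observed[y0:y1, x0:x1] = True
        locations.append((y, x))
    return canvas, locations

# Toy usage with a trivial "predictor" that simply returns the current canvas;
# a trained model would instead inpaint the unobserved regions.
rng = np.random.default_rng(0)
img = rng.random((28, 28))
recon, locs = perception_action_loop(img, lambda canvas, observed: canvas, n_glimpses=6)
```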

    An Attention-Based Predictive Agent for Static and Dynamic Environments

    Real-world applications of intelligent agents demand accuracy and efficiency, and seldom provide reinforcement signals. Currently, most agent models are reinforcement-based and concentrate exclusively on accuracy. We propose a general-purpose agent model consisting of proprioceptive and perceptual pathways. The agent actively samples its environment via a sequence of glimpses. It completes the partial propriocept and percept sequences observed until each sampling instant, and learns where and what to sample by minimizing prediction error, without reinforcement or supervision (class labels). The model is evaluated by exposing it to two kinds of stimuli: images of fully-formed handwritten numerals and alphabets, and videos of the gradual formation of numerals. It yields state-of-the-art prediction accuracy upon sampling only 22.6% of the scene on average. The model saccades when exposed to images and tracks when exposed to videos. This is the first known attention-based agent to generate realistic handwriting with state-of-the-art accuracy and efficiency by interacting with and learning end-to-end from static and dynamic environments.
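
    As a rough illustration of the two-pathway idea, the following PyTorch sketch pairs a proprioceptive encoding (where the agent has looked) with a perceptual encoding (what it saw there) and predicts both the next glimpse location and the expected glimpse content. The layer sizes, module names, and single-GRU core are assumptions for illustration, not the paper's architecture.

```python
# Assumed two-pathway agent sketch: proprioception encodes glimpse locations,
# perception encodes glimpse contents, and a recurrent core predicts what and
# where to sample next. Dimensions are arbitrary placeholders.
import torch
import torch.nn as nn

class TwoPathwayAgent(nn.Module):
    def __init__(self, glimpse_dim=49, loc_dim=2, hidden=128):
        super().__init__()
        self.percept_enc = nn.Linear(glimpse_dim, hidden)
        self.propriocept_enc = nn.Linear(loc_dim, hidden)
        self.core = nn.GRUCell(2 * hidden, hidden)
        self.next_loc = nn.Linear(hidden, loc_dim)          # where to sample next
        self.next_glimpse = nn.Linear(hidden, glimpse_dim)  # what it expects to see there

    def forward(self, glimpse, loc, h):
        x = torch.cat([self.percept_enc(glimpse), self.propriocept_enc(loc)], dim=-1)
        h = self.core(x, h)
        return self.next_loc(h), self.next_glimpse(h), h
```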

    Intent Prediction in Human-Human Interactions

    The human ability to infer others' intent is innate and crucial to development. Machines ought to acquire this ability for seamless interaction with humans. We propose an agent model for predicting the intent of actors in human-human interactions. This requires simultaneous generation and recognition of an interaction at any time, for which end-to-end models are scarce. The proposed agent actively samples its environment via a sequence of glimpses. At each sampling instant, the model infers the observation class and completes the partially observed body motion. It learns the sequence of body locations to sample by jointly minimizing the classification and generation errors. The model is evaluated on videos of two-skeleton interactions under two settings: (first person) one skeleton is the modeled agent and the other skeleton's joint movements constitute its visual observation, and (third person) an audience is the modeled agent and the two interacting skeletons' joint movements constitute its visual observation. Three methods for implementing the attention mechanism are analyzed using benchmark datasets. One of them, where attention is driven by sensory prediction error, achieves the highest classification accuracy in both settings by sampling less than 50% of the skeleton joints, while also being the most efficient in terms of model size. This is the first known attention-based agent to learn end-to-end from two-person interactions for intent prediction, with high accuracy and efficiency.
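
    The joint objective described above can be pictured as a weighted sum of a classification term and a motion-completion term. The sketch below is one plausible formulation in PyTorch; the mean-squared generation error and the weighting factor alpha are assumptions, not the paper's exact losses.

```python
# Illustrative joint objective: the agent is trained to both recognize the
# interaction class and complete the unobserved body motion, so the two error
# terms are minimized together. `alpha` is an assumed trade-off weight.
import torch.nn.functional as F

def joint_loss(class_logits, class_target, generated_motion, true_motion, alpha=1.0):
    classification_error = F.cross_entropy(class_logits, class_target)
    generation_error = F.mse_loss(generated_motion, true_motion)
    return classification_error + alpha * generation_error
```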

    AttentionMNIST: A Mouse-Click Attention Tracking Dataset for Handwritten Numeral and Alphabet Recognition

    Multiple attention-based models that recognize objects via a sequence of glimpses have reported results on handwritten numeral recognition. However, no attention-tracking data for handwritten numeral or alphabet recognition is available. Availability of such data would allow attention-based models to be evaluated in comparison to human performance. We collect mouse-click attention tracking (mcAT) data from 382 participants trying to recognize handwritten numerals and alphabets (upper- and lowercase) from images via sequential sampling. Images from benchmark datasets are presented as stimuli. The collected dataset, called AttentionMNIST, consists of a sequence of sample (mouse click) locations, predicted class label(s) at each sampling, and the duration of each sampling. On average, our participants observe only 12.8% of an image for recognition. We propose a baseline model to predict the location and the class(es) a participant will select at the next sampling. When exposed to the same stimuli and experimental conditions as our participants, a highly cited attention-based reinforcement model falls short of human efficiency.
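
    For readers who want to picture the dataset, here is a hypothetical per-trial record layout together with a helper that estimates the fraction of an image revealed by a click sequence. The field names, the 5-pixel reveal radius, and the 28x28 image size are assumptions; the released AttentionMNIST files may use a different schema.

```python
# Hypothetical AttentionMNIST trial record and a coverage estimate; the actual
# dataset schema and reveal radius may differ from these assumptions.
from dataclasses import dataclass
from typing import List, Tuple
import numpy as np

@dataclass
class Trial:
    stimulus_id: str                   # which benchmark image was shown
    clicks: List[Tuple[int, int]]      # (x, y) of each sampled location, in order
    guesses: List[List[str]]           # class label(s) predicted after each click
    durations_ms: List[float]          # time spent on each sampling

def fraction_observed(trial: Trial, image_shape=(28, 28), patch=5) -> float:
    """Fraction of pixels revealed across all clicks (the abstract reports ~12.8% on average)."""
    seen = np.zeros(image_shape, dtype=bool)
    h, w = image_shape
    for x, y in trial.clicks:
        y0, y1 = max(0, y - patch // 2), min(h, y + patch // 2 + 1)
        x0, x1 = max(0, x - patch // 2), min(w, x + patch // 2 + 1)
        seen[y0:y1, x0:x1] = True
    return float(seen.mean())
```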

    Synthesizing Skeletal Motion and Physiological Signals as a Function of a Virtual Human's Actions and Emotions

    Round-the-clock monitoring of human behavior and emotions is required in many healthcare applications; it is very expensive, but can be automated using machine learning (ML) and sensor technologies. Unfortunately, the lack of infrastructure for collecting and sharing such data is a bottleneck for ML research applied to healthcare. Our goal is to circumvent this bottleneck by simulating a human body in a virtual environment. This allows the generation of potentially unlimited amounts of shareable data from an individual as a function of their actions, interactions, and emotions in a care facility or at home, with no risk of confidentiality breach or privacy invasion. In this paper, we develop for the first time a system of computational models for synchronously synthesizing skeletal motion, electrocardiogram, blood pressure, respiration, and skin conductance signals as a function of an open-ended set of actions and emotions. Our experimental evaluations, involving user studies, benchmark datasets, and comparisons to findings in the literature, show that our models can generate skeletal motion and physiological signals with high fidelity. The proposed framework is modular and allows the flexibility to experiment with different models. In addition to facilitating ML research for round-the-clock monitoring at reduced cost, the proposed framework allows reuse of code and data, and may be used as a training tool for ML practitioners and healthcare professionals.
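
    The modularity claim can be made concrete with a small interface sketch: each signal gets its own synthesizer, all driven by the same action and emotion specification, so individual models can be swapped out. The class names and the sinusoidal ECG stand-in below are illustrative toys, not the paper's computational models.

```python
# Sketch of a modular synthesis interface: one synthesizer per signal, all
# driven by the same (action, emotion) specification. The ECG model here is a
# deliberately crude placeholder.
from abc import ABC, abstractmethod
import numpy as np

class SignalSynthesizer(ABC):
    @abstractmethod
    def synthesize(self, action: str, emotion: str, duration_s: float, fs: int) -> np.ndarray:
        ...

class ToyECGSynthesizer(SignalSynthesizer):
    def synthesize(self, action, emotion, duration_s, fs):
        # crude assumption: heart rate rises with an arousing emotion or vigorous action
        hr = 60 + 30 * (emotion == "fear") + 40 * (action == "run")
        t = np.arange(0, duration_s, 1.0 / fs)
        return np.sin(2 * np.pi * (hr / 60.0) * t)

synthesizers = {"ecg": ToyECGSynthesizer()}  # skeletal motion, BP, respiration, etc. plug in here
signals = {name: s.synthesize("run", "fear", 10.0, 256) for name, s in synthesizers.items()}
```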

    Truveta Mapper: A Zero-shot Ontology Alignment Framework

    In this paper, a new perspective is suggested for unsupervised Ontology Matching (OM), or Ontology Alignment (OA), by treating it as a translation task. Ontologies are represented as graphs, and the translation is performed from a node in the source ontology graph to a path in the target ontology graph. The proposed framework, Truveta Mapper (TM), leverages a multi-task sequence-to-sequence transformer model to perform alignment across multiple ontologies in a zero-shot, unified, and end-to-end manner. Multi-tasking enables the model to implicitly learn the relationships between different ontologies via transfer learning, without requiring any explicit cross-ontology manually labeled data. This also enables the framework to outperform existing solutions in both runtime latency and alignment quality. The model is pre-trained and fine-tuned only on publicly available text corpora and inner-ontologies data. The proposed solution outperforms state-of-the-art approaches such as Edit-Similarity, LogMap, AML, BERTMap, and the new OM frameworks recently presented in the Ontology Alignment Evaluation Initiative (OAEI22). It offers log-linear complexity, in contrast to the quadratic complexity of existing end-to-end methods, and overall makes the OM task efficient and more straightforward, without much post-processing such as mapping extension or mapping repair.
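
    The node-to-path translation framing can be illustrated with a generic Hugging Face sequence-to-sequence pipeline. The checkpoint ("t5-small"), the prompt format, and the example node label are placeholders: the actual Truveta Mapper model, its pre-training corpus, and its tokenization are not specified here.

```python
# Illustrative inference pattern only: a seq2seq model reads a source-ontology
# node as text and emits a path in the target ontology as text. "t5-small" is a
# placeholder base model, not the Truveta Mapper checkpoint.
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

tokenizer = AutoTokenizer.from_pretrained("t5-small")
model = AutoModelForSeq2SeqLM.from_pretrained("t5-small")

source_node = "snomed: Myocardial infarction"   # example node label from a source ontology
inputs = tokenizer(source_node, return_tensors="pt")
output_ids = model.generate(**inputs, max_new_tokens=64)
target_path = tokenizer.decode(output_ids[0], skip_special_tokens=True)
# A fine-tuned mapper would emit something like a root-to-node path in the target graph.
print(target_path)
```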

    Speech Emotion Recognition via Generation using an Attention-based Variational Recurrent Neural Network

    The last decade has seen an exponential rise in the number of attention-based models for speech emotion recognition (SER). Most of these models use a spectrogram as the input speech representation and a CNN, RNN, or convolutional RNN as the key machine learning (ML) component, and learn feature weights to implement attention. We propose an attention-based model for SER that uses MFCCs as the input speech representation and a variational RNN (VRNN) as the key ML component. Since the MFCC is of lower dimension than a spectrogram, the model is size- and data-efficient. The VRNN has been used for problems in vision but rarely for SER. Our model is predictive in nature. At each instant, it infers the emotion class, generates the next observation, computes the generation error, and selectively samples (attends to) the locations of high error. Thus, attention emerges in our model and does not require learned feature weights. This simple model provides interesting insights when evaluated for SER on benchmark datasets. The model can operate on variable-length audio files of arbitrarily long duration. This work is the first to explore simultaneous generation and recognition for SER, where the generation capability is necessary for efficient recognition.
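
    The input pipeline and the error-driven attention idea can be sketched as follows: extract an MFCC sequence with librosa and, given the frames a trained generative model produced (the VRNN itself is omitted here), attend to the frames whose generation error is largest. The filename, n_mfcc=13, and the top-k selection rule are assumptions rather than the paper's settings.

```python
# Sketch of MFCC extraction plus error-driven frame selection; the generative
# model that would produce `generated_seq` (a VRNN in the paper) is omitted.
import numpy as np
import librosa

y, sr = librosa.load("utterance.wav", sr=16000)     # placeholder audio file
mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)  # shape: (13, n_frames)

def attended_frames(mfcc_seq, generated_seq, k=10):
    """Indices of the k frames with the largest generation error."""
    err = np.linalg.norm(mfcc_seq - generated_seq, axis=0)  # per-frame error
    return np.argsort(err)[-k:]
```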

    A multimodal predictive agent model for human interaction generation

    Perception and action are inextricably tied together. We propose an agent model which consists of perceptual and proprioceptive pathways. The agent actively samples a sequence of percepts from its environment using the perception-action loop. The model predicts to complete the partial percept and propriocept sequences observed until each sampling instant, and learns where and what to sample from the prediction error, without supervision or reinforcement. The model is implemented using a multimodal variational recurrent neural network. The model is exposed to videos of two-person interactions, where one person is the modeled agent and the other person's actions constitute its visual observation. For each interaction class, the model learns to selectively attend to locations on the other person's body. The proposed attention-based agent is the first of its kind to interact with and learn end-to-end from human interactions, and it generates realistic interactions with performance comparable to that of models that do not use attention and that use significantly more computational resources.

    The Perception-Action Loop in a Predictive Agent

    We propose an agent model consisting of perceptual and proprioceptive pathways. It actively samples a sequence of percepts from its environment using the perception-action loop. The model predicts to complete the partial percept and propriocept sequences observed until each sampling instant, and learns where and what to sample from the prediction error, without supervision or reinforcement. The model is exposed to two kinds of stimuli: images of fully-formed handwritten numerals/alphabets, and videos of the gradual formation of numerals. For each object class, the model learns a set of salient locations to attend to in images and a policy consisting of a sequence of eye fixations in videos. Behaviorally, the same model gives rise to saccades while observing images and tracking while observing videos. The proposed agent is the first of its kind to interact with and learn end-to-end from static and dynamic environments to generate realistic handwriting with state-of-the-art performance.