27 research outputs found

    Grounded Semantic Composition for Visual Scenes

    Full text link
    We present a visually grounded language understanding model based on a study of how people verbally describe objects in scenes. The emphasis of the model is on combining individual word meanings to produce meanings for complex referring expressions. The model has been implemented, and it is able to understand a broad range of spatial referring expressions. We describe our implementation of word-level visually grounded semantics and their embedding in a compositional parsing framework. The implemented system selects the correct referents in response to natural language expressions for a large percentage of test cases. In an analysis of the system's successes and failures, we reveal how visual context influences the semantics of utterances and propose future extensions to the model that take such context into account.
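
    As a minimal sketch of the compositional idea (our own illustration, not the authors' implementation), each word can be modeled as a grounded scoring function over scene objects, and a referring expression resolved by multiplying the scores of its words; the toy scene, object attributes, and scoring functions below are all hypothetical.

```python
# Hypothetical sketch: word meanings as grounded scoring functions over
# scene objects, composed multiplicatively to resolve a referring
# expression such as "the leftmost red one".
import math
from dataclasses import dataclass

@dataclass
class Obj:
    color: str
    x: float  # horizontal position, 0 = left edge of the scene

# Word-level visually grounded semantics: each word scores how well an
# object fits it, given the rest of the scene as context.
def red(o, scene):
    return 1.0 if o.color == "red" else 0.0

def leftmost(o, scene):
    # Context-dependent: graded by position relative to the other objects.
    xs = [p.x for p in scene]
    span = (max(xs) - min(xs)) or 1.0
    return 1.0 - (o.x - min(xs)) / span

# Composition: an expression's score for an object is the product of its
# words' scores; the referent is the highest-scoring object.
def resolve(words, scene):
    return max(scene, key=lambda o: math.prod(w(o, scene) for w in words))

scene = [Obj("red", 0.9), Obj("red", 0.1), Obj("blue", 0.4)]
print(resolve([red, leftmost], scene))  # -> Obj(color='red', x=0.1)
```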

    Rethinking Pen Input Interaction: Enabling Freehand Sketching Through Improved Primitive Recognition

    Get PDF
    Online sketch recognition uses machine learning and artificial intelligence techniques to interpret markings made by users via an electronic stylus or pen. The goal of sketch recognition is to understand the intention and meaning of a particular user's drawing. Diagramming applications have been the primary beneficiaries of sketch recognition technology, as it is commonplace for the users of these tools to first create a rough sketch of a diagram on paper before translating it into a machine-understandable model, using computer-aided design tools, which can then be used to perform simulations or other meaningful tasks. Traditional methods for performing sketch recognition can be broken down into three distinct categories: appearance-based, gesture-based, and geometric-based. Although each approach has its advantages and disadvantages, geometric-based methods have proven to be the most generalizable for multi-domain recognition. Tools such as the LADDER symbol description language have been shown to be capable of recognizing sketches from over 30 different domains using generalizable, geometric techniques. The LADDER system is limited, however, by the fact that it uses a low-level recognizer that supports only a few primitive shapes, the building blocks for describing higher-level symbols. Systems that support a larger number of primitive shapes have been shown to have questionable accuracy as the number of primitives increases, or they place constraints on how users must input shapes (e.g. circles can only be drawn in a clockwise motion; rectangles must be drawn starting at the top-left corner). This dissertation significantly expands the possibilities for free-sketch recognition systems, those which place little to no drawing constraints on users. In this dissertation, we describe multiple techniques to recognize upwards of 18 primitive shapes while maintaining high accuracy. We also provide methods for producing confidence values and generating multiple interpretations, and explore the difficulties of recognizing multi-stroke primitives. In addition, we show the need for a standardized data repository for sketch recognition algorithm testing and propose SOUSA (sketch-based online user study application), our online system for performing and sharing user study sketch data. Finally, we show how the principles we have learned through our work extend to other domains, including activity recognition using trained hand posture cues.
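
    To make the geometric approach concrete, here is a small hypothetical sketch (not the dissertation's recognizers) of how a stroke can be tested against two primitive fits, line and circle, with the lowest-error fit winning; because both fits ignore drawing direction and start point, the classification imposes no drawing constraints on the user.

```python
# Hypothetical sketch of geometric primitive recognition: fit each
# candidate primitive to the stroke points and pick the lowest-error fit.
# Both fits are invariant to drawing direction and start point.
import numpy as np

def line_fit_error(pts):
    # Orthogonal distance to the best-fit line via SVD: the smallest
    # singular value of the centered points measures non-collinearity.
    centered = pts - pts.mean(axis=0)
    return np.linalg.svd(centered, compute_uv=False)[-1] / np.sqrt(len(pts))

def circle_fit_error(pts):
    # Algebraic (Kasa) circle fit: solve for center and radius in a
    # linear least-squares problem, then measure mean radial deviation.
    x, y = pts[:, 0], pts[:, 1]
    A = np.column_stack([2 * x, 2 * y, np.ones(len(pts))])
    b = x**2 + y**2
    (cx, cy, c), *_ = np.linalg.lstsq(A, b, rcond=None)
    r = np.sqrt(c + cx**2 + cy**2)
    return np.abs(np.hypot(x - cx, y - cy) - r).mean()

def classify(pts):
    fits = {"line": line_fit_error(pts), "circle": circle_fit_error(pts)}
    return min(fits, key=fits.get)  # lowest-error primitive wins

theta = np.linspace(0, 2 * np.pi, 50)
print(classify(np.column_stack([np.cos(theta), np.sin(theta)])))  # circle
print(classify(np.column_stack([theta, 2 * theta + 1])))          # line
```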

    Hidden Markov models and neural networks for speech recognition

    Get PDF
    The hidden Markov model (HMM) is one of the most successful modeling approaches for acoustic events in speech recognition, and more recently it has proven useful for several problems in biological sequence analysis. Although the HMM is good at capturing the temporal nature of processes such as speech, it has a very limited capacity for recognizing complex patterns involving more than first-order dependencies in the observed data sequences. This is due to the first-order state process and the assumption of state-conditional independence between observations. Artificial neural networks (NNs) are almost the opposite: they cannot model dynamic, temporally extended phenomena very well, but are good at static classification and regression tasks. Combining the two frameworks in a sensible way can therefore lead to a more powerful model with better classification abilities. The overall aim of this work has been to develop a probabilistic hybrid of hidden Markov models and neural networks and …
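
    One widely used form of such a hybrid (a minimal sketch under our own assumptions, not necessarily the exact model developed in this work) has the network estimate per-frame state posteriors, which are divided by state priors to give scaled likelihoods and then decoded with the HMM's Viterbi algorithm; the toy probabilities below are illustrative.

```python
# Hybrid HMM/NN sketch: NN softmax outputs P(state | frame) become scaled
# emission likelihoods P(frame | state) ∝ P(state | frame) / P(state),
# which a standard Viterbi decoder combines with HMM transition structure.
import numpy as np

def viterbi(log_emit, log_trans, log_init):
    T, S = log_emit.shape
    delta = log_init + log_emit[0]           # best log-score per state
    back = np.zeros((T, S), dtype=int)       # best-predecessor table
    for t in range(1, T):
        scores = delta[:, None] + log_trans  # scores[i, j]: i -> j
        back[t] = scores.argmax(axis=0)
        delta = scores.max(axis=0) + log_emit[t]
    path = [int(delta.argmax())]
    for t in range(T - 1, 0, -1):            # trace back the best path
        path.append(int(back[t, path[-1]]))
    return path[::-1]

# posteriors[t, s] would come from the network; priors from training data.
posteriors = np.array([[0.9, 0.1], [0.6, 0.4], [0.1, 0.9]])
priors = np.array([0.5, 0.5])
log_emit = np.log(posteriors / priors)       # scaled likelihoods
log_trans = np.log([[0.8, 0.2], [0.2, 0.8]])
log_init = np.log([0.5, 0.5])
print(viterbi(log_emit, log_trans, log_init))  # [0, 0, 1]
```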

    Embodied object schemas for grounding language use

    Get PDF
    Thesis (Ph.D.)--Massachusetts Institute of Technology, School of Architecture and Planning, Program in Media Arts and Sciences, 2007. This electronic version was submitted by the student author. The certified thesis is available in the Institute Archives and Special Collections. Includes bibliographical references (p. 139-146).
    This thesis presents the Object Schema Model (OSM) for grounded language interaction. Dynamic representations of objects are used as the central point of coordination between actions, sensations, planning, and language use. Objects are modeled as object schemas -- sets of multimodal, object-directed behavior processes -- each of which can make predictions, take actions, and collate sensations in the modalities of touch, vision, and motor control. This process-centered view allows the system to respond continuously to real-world activity, while still viewing objects as stabilized representations for planning and speech interaction. The model can be described from four perspectives, each organizing and manipulating behavior processes in a different way. The first perspective views behavior processes like thread objects, running concurrently to carry out their respective functions. The second perspective organizes the behavior processes into object schemas. The third perspective organizes the behavior processes into plan hierarchies to coordinate actions. The fourth perspective creates new behavior processes in response to language input. Results from interactions with objects are used to update the object schemas, which then influence subsequent plans and actions. A continuous planning algorithm examines the current object schemas to choose between candidate processes according to a set of primary motivations, such as responding to collisions, exploring objects, and interacting with the human. An instance of the model has been implemented using a physical robotic manipulator. The implemented system is able to interpret basic speech acts that relate to perception of, and actions upon, objects in the robot's physical environment.
    by Kai-yuh Hsiao. Ph.D.
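
    A minimal sketch of the process-centered view (our own simplification, not the OSM implementation): behavior processes share an object schema into which they collate sensations, and a continuous planning loop steps whichever process currently has the highest motivation; all classes, motivations, and sensor stand-ins below are hypothetical.

```python
# Hypothetical sketch: an object schema as shared state updated by
# object-directed behavior processes, selected by a motivation score.
from dataclasses import dataclass, field

@dataclass
class ObjectSchema:
    name: str
    beliefs: dict = field(default_factory=dict)  # collated sensations

class BehaviorProcess:
    def __init__(self, schema):
        self.schema = schema
    def motivation(self):   # how urgent this process is right now
        raise NotImplementedError
    def step(self):         # predict / act / collate for one tick
        raise NotImplementedError

class TrackVisually(BehaviorProcess):
    def motivation(self):
        return 0.3 if "position" in self.schema.beliefs else 0.9
    def step(self):
        self.schema.beliefs["position"] = (0.4, 0.1)  # stand-in for vision

class ReachAndTouch(BehaviorProcess):
    def motivation(self):   # only worth reaching once we know where it is
        return 0.6 if "position" in self.schema.beliefs else 0.0
    def step(self):
        self.schema.beliefs["touched"] = True  # stand-in for motor/touch

ball = ObjectSchema("ball")
processes = [TrackVisually(ball), ReachAndTouch(ball)]
for _ in range(2):  # continuous planning: highest motivation steps next
    max(processes, key=lambda p: p.motivation()).step()
print(ball.beliefs)  # {'position': (0.4, 0.1), 'touched': True}
```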

    Detecting Periods of Eating in Everyday Life by Tracking Wrist Motion — What is a Meal?

    Get PDF
    Eating is one of the most basic activities observed in sentient animals, a behavior so natural that humans often eat without giving the activity a second thought. Unfortunately, this often leads to consuming more calories than expended, which can cause weight gain, a leading cause of disease and death. This proposal describes research in methods to automatically detect periods of eating by tracking wrist motion so that calorie consumption can be tracked. We first briefly discuss how obesity is caused by an imbalance in calorie intake and expenditure. Calorie consumption and expenditure can be tracked manually using tools like paper diaries; however, it is well known that human bias can affect the accuracy of such tracking. Researchers in the emerging field of automated dietary monitoring (ADM) are attempting to track diet using electronic methods in an effort to mitigate this bias. We attempt to replicate a previous algorithm that detects eating by tracking wrist motion electronically. The previous algorithm was evaluated on data collected from 43 subjects using an iPhone as the sensor. Periods of time are segmented first and then classified using a naive Bayesian classifier. For replication, we describe the collection of the Clemson all-day dataset (CAD), a free-living eating activity dataset containing 4,680 hours of wrist motion collected from 351 participants, the largest of its kind known to us. We learn that while different sensors are available to log wrist acceleration data, no unified convention exists, and this data must thus be transformed between conventions. We learn that the performance of the eating detection algorithm is affected by changes in the sensors used to track wrist motion, by increased variability in behavior due to a larger participant pool, and by the ratio of eating to non-eating in the dataset. We learn that commercially available acceleration sensors contain noise in their reported readings, which affects wrist tracking specifically because of the low magnitude of wrist acceleration. Commercial accelerometers can have noise up to 0.06g, which is acceptable in applications like automobile crash testing or pedestrian indoor navigation, but not in ones using wrist motion. We quantify linear acceleration noise in our free-living dataset. We explain sources of noise, describe a method to mitigate it, and evaluate the effect of this noise on the eating detection algorithm. By visualizing periods of eating in the collected dataset, we learn that people often conduct secondary activities while eating, such as walking, watching television, working, and doing household chores. These secondary activities cause wrist motions that obfuscate the wrist motions associated with eating, which increases the difficulty of detecting periods of eating (meals). Subjects reported conducting secondary activities in 72% of meals. Analysis of wrist motion data revealed that the wrist was resting 12.8% of the time during self-reported meals, compared to only 6.8% of the time in a cafeteria dataset. Walking motion was found during 5.5% of the time during meals in free-living, compared to 0% in the cafeteria. Augmenting an eating detection classifier to include walking and resting detection improved the average per-person accuracy from 74% to 77% on our free-living dataset (t[353]=7.86, p<0.001). This suggests that future data collections for eating activity detection should also collect detailed ground truth on secondary activities conducted during eating.
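
    The replicated detection pipeline, reduced to a hypothetical sketch (window length, features, and the synthetic data below are our own stand-ins, not the original study's choices): fixed-length windows of wrist acceleration are summarized by simple motion features and classified with a naive Bayesian classifier.

```python
# Hypothetical, simplified sketch: segment wrist motion into windows,
# extract per-window features, classify eating vs non-eating with a
# naive Bayes model. Values are illustrative, not the study's settings.
import numpy as np
from sklearn.naive_bayes import GaussianNB

FS = 15                       # sample rate in Hz (assumed)
WIN = 6 * 60 * FS             # six-minute windows (assumed)

def windows(accel):           # accel: (n_samples, 3), units of g
    for start in range(0, len(accel) - WIN + 1, WIN):
        yield accel[start:start + WIN]

def features(w):
    mag = np.linalg.norm(w, axis=1)
    return [mag.mean(),                   # overall motion level
            mag.std(),                    # motion variability
            np.abs(np.diff(mag)).mean()]  # short-term movement energy

# Synthetic stand-in for labeled training windows (1 = eating).
rng = np.random.default_rng(0)
eating = rng.normal(1.0, 0.05, (20, WIN, 3))   # gentler wrist motion
other = rng.normal(1.0, 0.20, (20, WIN, 3))    # more vigorous motion
X = [features(w) for w in np.concatenate([eating, other])]
y = [1] * 20 + [0] * 20

clf = GaussianNB().fit(X, y)
day = rng.normal(1.0, 0.05, (4 * WIN, 3))      # unseen quiet stretch
print([int(clf.predict([features(w)])[0]) for w in windows(day)])
```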
    Finally, learning from this data collection, we describe a convolutional neural network (CNN) to detect periods of eating by tracking wrist motion during everyday life. Eating uses hand-to-mouth gestures for ingestion, each of which lasts approximately 1-5 seconds. The novelty of our new approach is that we analyze a much longer window (0.5-15 min) that can contain other gestures related to eating, such as cutting or manipulating food, preparing food for consumption, and resting between ingestion events. The context of these other gestures can improve the detection of periods of eating. We found that accuracy at detecting eating increased by 15% in longer windows compared to shorter windows. Overall results on CAD were 89% detection of meals with 1.7 false positives for every true positive (FP/TP), and a time-weighted accuracy of 80%.
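
    A hypothetical sketch of the long-window idea (layer sizes and the six-minute window are our assumptions, not the paper's architecture): a 1D CNN consumes several minutes of tri-axial wrist motion at once, so its receptive field can cover the context gestures surrounding individual intakes.

```python
# Hypothetical 1D CNN over a long wrist-motion window; convolutions see
# enough context to pick up food preparation and rests between intakes.
import torch
import torch.nn as nn

FS, WIN_MIN = 15, 6
T = FS * 60 * WIN_MIN                 # samples per window

model = nn.Sequential(
    nn.Conv1d(3, 16, kernel_size=15, stride=2), nn.ReLU(),
    nn.Conv1d(16, 32, kernel_size=15, stride=2), nn.ReLU(),
    nn.AdaptiveAvgPool1d(1),          # summarize the whole window
    nn.Flatten(),
    nn.Linear(32, 2),                 # eating vs non-eating logits
)

x = torch.randn(8, 3, T)              # batch of 8 windows, 3 accel axes
print(model(x).shape)                 # torch.Size([8, 2])
```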

    Automatic recognition of multiparty human interactions using dynamic Bayesian networks

    Get PDF
    Relating statistical machine learning approaches to the automatic analysis of multiparty communicative events, such as meetings, is an ambitious research area. We have investigated automatic meeting segmentation both in terms of “Meeting Actions” and “Dialogue Acts”. Dialogue acts model the discourse structure at a fine-grained level, highlighting individual speaker intentions. Group meeting actions describe the same process at a coarse level, highlighting interactions between different meeting participants and showing overall group intentions. A framework based on probabilistic graphical models such as dynamic Bayesian networks (DBNs) has been investigated for both tasks. Our first set of experiments is concerned with the segmentation and structuring of meetings (recorded using multiple cameras and microphones) into sequences of group meeting actions such as monologue, discussion, and presentation. We outline four families of multimodal features based on speaker turns, lexical transcription, prosody, and visual motion that are extracted from the raw audio and video recordings. We relate these low-level multimodal features to complex group behaviours, proposing a multi-stream modelling framework based on dynamic Bayesian networks. Later experiments are concerned with the automatic recognition of Dialogue Acts (DAs) in multiparty conversational speech. We present a joint generative approach based on a switching DBN for DA recognition in which segmentation and classification of DAs are carried out in parallel. This approach models a set of features, related to lexical content and prosody, and incorporates a weighted interpolated factored language model. In conjunction with this joint generative model, we have also investigated the use of a discriminative approach, based on conditional random fields, to perform a reclassification of the segmented DAs. The DBN-based approach yielded significant improvements when applied both to the meeting action and the dialogue act recognition task. On both tasks, the DBN framework provided an effective factorisation of the state space and a flexible infrastructure able to integrate a heterogeneous set of resources such as continuous and discrete multimodal features, and statistical language models. Although our experiments have principally targeted multiparty meetings, the features, models, and methodologies developed in this thesis can be employed for a wide range of applications. Moreover, both group meeting actions and DAs offer valuable insights into the current conversational context, providing cues and features for several related research areas, such as speaker addressing and focus-of-attention modelling, automatic speech recognition and understanding, and topic and decision detection.
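
    The multi-stream factorisation can be sketched as follows (a toy illustration under our own assumptions, with made-up streams, weights, and one-dimensional Gaussian observation models, not the thesis's DBNs): with streams treated as conditionally independent given the meeting-action state, the combined emission score is a weighted sum of per-stream log-likelihoods.

```python
# Hypothetical sketch of multi-stream emission scoring: per-stream
# log-likelihoods are weighted and summed, assuming the streams are
# conditionally independent given the meeting-action state.
import numpy as np

STATES = ["monologue", "discussion", "presentation"]

def gauss_loglik(x, mean, var):
    return -0.5 * (np.log(2 * np.pi * var) + (x - mean) ** 2 / var)

# Toy per-stream, per-state observation models: (mean, variance).
stream_models = {
    "speaker_turn_rate": {"monologue": (0.1, 0.02),
                          "discussion": (0.8, 0.10),
                          "presentation": (0.2, 0.05)},
    "visual_motion":     {"monologue": (0.2, 0.05),
                          "discussion": (0.5, 0.10),
                          "presentation": (0.7, 0.10)},
}
stream_weights = {"speaker_turn_rate": 0.6, "visual_motion": 0.4}

def emission_logprob(obs, state):
    # obs maps stream name -> observed value for the current frame.
    return sum(stream_weights[s] * gauss_loglik(v, *stream_models[s][state])
               for s, v in obs.items())

obs = {"speaker_turn_rate": 0.75, "visual_motion": 0.55}
print(max(STATES, key=lambda st: emission_logprob(obs, st)))  # discussion
```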

    Getting Past the Language Gap: Innovations in Machine Translation

    Get PDF
    In this chapter, we review state-of-the-art machine translation systems and discuss innovative methods for machine translation, highlighting the most promising techniques and applications. Machine translation (MT) has benefited from a revitalization in the last 10 years or so, after a period of relatively slow activity. In 2005 the field received a jumpstart when a powerful, complete experimental package for building MT systems from scratch became freely available as a result of the unified efforts of the MOSES international consortium. Around the same time, hierarchical methods were introduced by Chinese researchers, which allowed the introduction and use of syntactic information in translation modeling. Furthermore, advances in the related field of computational linguistics, making off-the-shelf taggers and parsers readily available, helped give MT an additional boost. Yet there is still more progress to be made. For example, MT will be enhanced greatly when both syntax and semantics are on board: this still presents a major challenge, though many advanced research groups are currently pursuing ways to meet it head-on. The next generation of MT will consist of a collection of hybrid systems. The outlook is also good for the mobile environment, as we look forward to more advanced and improved technologies that enable speech-to-speech machine translation on hand-held devices, i.e. speech recognition and speech synthesis. We review all of these developments and point out in the final section some of the most promising research avenues for the future of MT.

    A review of technical factors to consider when designing neural networks for semantic segmentation of Earth Observation imagery

    Full text link
    Semantic segmentation (classification) of Earth Observation imagery is a crucial task in remote sensing. This paper presents a comprehensive review of technical factors to consider when designing neural networks for this purpose. The review focuses on Convolutional Neural Networks (CNNs), Recurrent Neural Networks (RNNs), Generative Adversarial Networks (GANs), and transformer models, discussing prominent design patterns for these ANN families and their implications for semantic segmentation. Common pre-processing techniques for ensuring optimal data preparation are also covered. These include methods for image normalization and chipping, as well as strategies for addressing data imbalance in training samples and techniques for overcoming limited data, including augmentation techniques, transfer learning, and domain adaptation. By encompassing both the technical aspects of neural network design and the data-related considerations, this review provides researchers and practitioners with a comprehensive and up-to-date understanding of the factors involved in designing effective neural networks for semantic segmentation of Earth Observation imagery.
    Comment: 145 pages with 32 figures
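
    Two of the data-preparation steps mentioned, per-band normalization and chipping, can be sketched as follows (chip size, stride, and the random four-band scene are illustrative assumptions, not the review's prescriptions).

```python
# Hypothetical sketch: per-band normalization of an Earth Observation
# scene, then chipping it into fixed-size tiles for network training.
import numpy as np

def normalize_bands(img):
    # img: (H, W, bands); scale each band to zero mean, unit variance.
    mean = img.mean(axis=(0, 1), keepdims=True)
    std = img.std(axis=(0, 1), keepdims=True)
    return (img - mean) / (std + 1e-8)

def chip(img, size=256, stride=256):
    # Yield size x size tiles; set stride < size for overlapping chips.
    h, w, _ = img.shape
    for i in range(0, h - size + 1, stride):
        for j in range(0, w - size + 1, stride):
            yield img[i:i + size, j:j + size]

scene = np.random.rand(1024, 1024, 4).astype(np.float32)  # 4-band scene
chips = list(chip(normalize_bands(scene)))
print(len(chips), chips[0].shape)  # 16 (256, 256, 4)
```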