Grounded Semantic Composition for Visual Scenes
We present a visually-grounded language understanding model based on a study
of how people verbally describe objects in scenes. The emphasis of the model is
on the combination of individual word meanings to produce meanings for complex
referring expressions. The model has been implemented, and it is able to
understand a broad range of spatial referring expressions. We describe our
implementation of word level visually-grounded semantics and their embedding in
a compositional parsing framework. The implemented system selects the correct
referents in response to natural language expressions for a large percentage of
test cases. In an analysis of the system's successes and failures, we reveal how
visual context influences the semantics of utterances and propose future
extensions to the model that take such context into account.
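As a hedged illustration (not the authors' implementation), word-level groundings can be modeled as scoring functions over scene objects, with intersective composition combining them to resolve a referring expression:

```python
# Hypothetical sketch of grounded semantic composition; the scene encoding,
# word functions, and scoring scheme are illustrative assumptions.

def red(obj):
    # word-level visual grounding: score in [0, 1]
    return 1.0 if obj["color"] == "red" else 0.0

def ball(obj):
    return 1.0 if obj["shape"] == "ball" else 0.0

def compose(*word_fns):
    # intersective composition: an object must satisfy every word
    return lambda obj: min(fn(obj) for fn in word_fns)

def resolve(expr_fn, scene):
    # select the highest-scoring referent
    return max(scene, key=expr_fn)

scene = [
    {"id": 1, "color": "red", "shape": "ball"},
    {"id": 2, "color": "blue", "shape": "ball"},
    {"id": 3, "color": "red", "shape": "cube"},
]
referent = resolve(compose(red, ball), scene)  # "the red ball"
```

A spatial modifier such as "leftmost" would fit the same pattern, mapping a scored candidate set to a referent using geometric features of the scene.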
Rethinking Pen Input Interaction: Enabling Freehand Sketching Through Improved Primitive Recognition
Online sketch recognition uses machine learning and artificial intelligence techniques
to interpret markings made by users via an electronic stylus or pen. The
goal of sketch recognition is to understand the intention and meaning of a particular
user's drawing. Diagramming applications have been the primary beneficiaries
of sketch recognition technology, as it is commonplace for the users of these tools to
first create a rough sketch of a diagram on paper before translating it, using
computer-aided design tools, into a machine-understandable model, which can then be used to
perform simulations or other meaningful tasks.
Traditional methods for performing sketch recognition can be broken down into
three distinct categories: appearance-based, gesture-based, and geometric-based. Although
each approach has its advantages and disadvantages, geometric-based methods
have proven to be the most generalizable for multi-domain recognition. Tools, such as
the LADDER symbol description language, have been shown to be capable of recognizing
sketches from over 30 different domains using generalizable, geometric techniques.
The LADDER system is limited, however, by its low-level recognizer, which
supports only a few primitive shapes, the building blocks for describing
higher-level symbols. Systems that support a larger number of primitive shapes
have been shown to have questionable accuracy as the number of primitives
increases, or they place constraints on how users must input shapes (e.g. circles
can only be drawn in a clockwise motion; rectangles must be drawn starting at the
top-left corner).
This dissertation significantly expands the possibilities of free-sketch
recognition systems, which place little to no drawing constraints on users. In
this dissertation, we describe multiple techniques to recognize upwards of 18 primitive
shapes while maintaining high accuracy. We also provide methods for producing
confidence values and generating multiple interpretations, and explore the difficulties
of recognizing multi-stroke primitives. In addition, we show the need for a standardized
data repository for sketch recognition algorithm testing and propose SOUSA
(sketch-based online user study application), our online system for performing and
sharing user study sketch data. Finally, we will show how the principles we have
learned through our work extend to other domains, including activity recognition
using trained hand posture cues.
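The geometric approach can be illustrated with a toy test; this sketch (an assumption, far simpler than the dissertation's recognizers) classifies a stroke as a line or a circle by fit error, independent of drawing direction or start point:

```python
# Minimal sketch of direction-independent geometric primitive tests;
# the thresholds (1.1, 0.2) are illustrative assumptions.
import math

def classify_stroke(points):
    """Classify a stroke as 'line', 'circle', or 'unknown' by fit error."""
    # line test: total path length vs. straight endpoint distance
    path = sum(math.dist(points[i], points[i + 1]) for i in range(len(points) - 1))
    chord = math.dist(points[0], points[-1])
    if chord > 0 and path / chord < 1.1:   # nearly straight
        return "line"
    # circle test: distances from the centroid should be nearly constant
    cx = sum(p[0] for p in points) / len(points)
    cy = sum(p[1] for p in points) / len(points)
    radii = [math.hypot(p[0] - cx, p[1] - cy) for p in points]
    mean_r = sum(radii) / len(radii)
    if mean_r > 0 and (max(radii) - min(radii)) / mean_r < 0.2:
        return "circle"
    return "unknown"

line = [(i, i) for i in range(10)]
circle = [(math.cos(t / 16 * 2 * math.pi), math.sin(t / 16 * 2 * math.pi))
          for t in range(16)]
```

Because both tests use only aggregate geometry, the same stroke is accepted whether drawn clockwise or counterclockwise and from any starting corner.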
Hidden Markov models and neural networks for speech recognition
The hidden Markov model (HMM) is one of the most successful modeling approaches for acoustic events in speech recognition, and more recently it has proven useful for several problems in biological sequence analysis. Although the HMM is good at capturing the temporal nature of processes such as speech, it has a very limited capacity for recognizing complex patterns involving more than first-order dependencies in the observed data sequences. This is due to the first-order state process and the assumption of state-conditional independence between observations. Artificial neural networks (NNs) are almost the opposite: they cannot model dynamic, temporally extended phenomena very well, but are good at static classification and regression tasks. Combining the two frameworks in a sensible way can therefore lead to a more powerful model with better classification abilities. The overall aim of this work has been to develop a probabilistic hybrid of hidden Markov models and neural networks and …
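A common way to realize such a hybrid (sketched here as an assumed illustration, not the thesis system) is to let the network estimate per-frame state posteriors, divide by state priors to obtain scaled likelihoods, and decode with HMM Viterbi:

```python
# Hybrid HMM/NN sketch: the NN replaces the HMM emission model.
# The toy numbers below stand in for real NN outputs (an assumption).
import math

def viterbi(scaled_lik, log_trans, log_init):
    """scaled_lik[t][s]: NN posterior / state prior for state s at frame t."""
    T, S = len(scaled_lik), len(log_init)
    delta = [[0.0] * S for _ in range(T)]
    back = [[0] * S for _ in range(T)]
    for s in range(S):
        delta[0][s] = log_init[s] + math.log(scaled_lik[0][s])
    for t in range(1, T):
        for s in range(S):
            best = max(range(S), key=lambda r: delta[t - 1][r] + log_trans[r][s])
            back[t][s] = best
            delta[t][s] = delta[t - 1][best] + log_trans[best][s] + math.log(scaled_lik[t][s])
    # backtrack from the best final state
    path = [max(range(S), key=lambda s: delta[T - 1][s])]
    for t in range(T - 1, 0, -1):
        path.append(back[t][path[-1]])
    return path[::-1]

log_trans = [[math.log(0.8), math.log(0.2)], [math.log(0.2), math.log(0.8)]]
log_init = [math.log(0.9), math.log(0.1)]
scaled = [[0.9, 0.1], [0.9, 0.1], [0.1, 0.9], [0.1, 0.9]]  # "NN" switches midway
path = viterbi(scaled, log_trans, log_init)  # -> [0, 0, 1, 1]
```

The HMM contributes the temporal structure (transitions), while the network contributes the static frame classification it is good at, which is exactly the complementarity the abstract describes.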
Embodied object schemas for grounding language use
Thesis (Ph. D.)--Massachusetts Institute of Technology, School of Architecture and Planning, Program in Media Arts and Sciences, 2007. This electronic version was submitted by the student author. The certified thesis is available in the Institute Archives and Special Collections. Includes bibliographical references (p. 139-146).

This thesis presents the Object Schema Model (OSM) for grounded language interaction. Dynamic representations of objects are used as the central point of coordination between actions, sensations, planning, and language use. Objects are modeled as object schemas -- sets of multimodal, object-directed behavior processes -- each of which can make predictions, take actions, and collate sensations, in the modalities of touch, vision, and motor control. This process-centered view allows the system to respond continuously to real-world activity, while still viewing objects as stabilized representations for planning and speech interaction.

The model can be described from four perspectives, each organizing and manipulating behavior processes in a different way. The first perspective views behavior processes like thread objects, running concurrently to carry out their respective functions. The second perspective organizes the behavior processes into object schemas. The third perspective organizes the behavior processes into plan hierarchies to coordinate actions. The fourth perspective creates new behavior processes in response to language input. Results from interactions with objects are used to update the object schemas, which then influence subsequent plans and actions. A continuous planning algorithm examines the current object schemas to choose between candidate processes according to a set of primary motivations, such as responding to collisions, exploring objects, and interacting with the human.

An instance of the model has been implemented using a physical robotic manipulator. The implemented system is able to interpret basic speech acts that relate to perception of, and actions upon, objects in the robot's physical environment.

by Kai-yuh Hsiao. Ph.D.
Detecting Periods of Eating in Everyday Life by Tracking Wrist Motion — What is a Meal?
Eating is one of the most basic activities observed in sentient animals, a behavior so natural that humans often eat without giving the activity a second thought. Unfortunately, this often leads to consuming more calories than are expended, which can cause weight gain, a leading cause of disease and death. This proposal describes research into methods to automatically detect periods of eating by tracking wrist motion so that calorie consumption can be tracked. We first briefly discuss how obesity is caused by an imbalance between calorie intake and expenditure. Calorie consumption and expenditure can be tracked manually using tools like paper diaries; however, it is well known that human bias can affect the accuracy of such tracking. Researchers in the emerging field of automated dietary monitoring (ADM) are attempting to track diet using electronic methods in an effort to mitigate this bias.
We attempt to replicate a previous algorithm that detects eating by tracking wrist motion electronically. The previous algorithm was evaluated on data collected from 43 subjects using an iPhone as the sensor. Periods of time are segmented first and then classified using a naive Bayesian classifier. For replication, we describe the collection of the Clemson all-day dataset (CAD), a free-living eating activity dataset containing 4,680 hours of wrist motion collected from 351 participants - the largest of its kind known to us. We learn that while different sensors are available to log wrist acceleration data, no unified convention exists, and this data must thus be transformed between conventions. We learn that the performance of the eating detection algorithm is affected by changes in the sensors used to track wrist motion, by increased variability in behavior due to a larger participant pool, and by the ratio of eating to non-eating in the dataset.
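Converting between axis conventions amounts to a signed permutation of the (x, y, z) channels; this sketch is a hypothetical illustration (the specific mapping is an assumption, not a documented sensor pairing):

```python
# Hypothetical axis-convention remapping for wrist acceleration samples.
# The default mapping ("y", "-x", "z") is an illustrative assumption.
def remap_axes(sample, mapping=("y", "-x", "z")):
    """Remap an (x, y, z) sample to a target convention.

    Each entry of `mapping` names the source axis feeding the target axis,
    with a leading '-' indicating negation (e.g. "-x" means negated source x).
    """
    src = {"x": sample[0], "y": sample[1], "z": sample[2]}
    out = []
    for spec in mapping:
        sign = -1.0 if spec.startswith("-") else 1.0
        out.append(sign * src[spec.lstrip("-")])
    return tuple(out)

converted = remap_axes((1.0, 2.0, 3.0))  # -> (2.0, -1.0, 3.0)
```

Applying a fixed mapping like this per sensor model lets data logged under different conventions be merged into one consistently oriented dataset.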
We learn that commercially available acceleration sensors contain noise in their reported readings, which affects wrist tracking in particular because of the low magnitude of wrist acceleration. Commercial accelerometers can exhibit noise of up to 0.06 g, which is acceptable in applications like automobile crash testing or pedestrian indoor navigation, but not in ones using wrist motion. We quantify linear acceleration noise in our free-living dataset. We explain sources of noise, describe a method to mitigate it, and evaluate the effect of this noise on the eating detection algorithm.
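One common way to mitigate such noise is a simple low-pass filter; this exponential-moving-average sketch is a generic assumption, not necessarily the smoothing method used in the dissertation:

```python
# Generic low-pass smoothing of a 1-D acceleration stream (illustrative;
# alpha = 0.1 is an assumed setting, not a tuned parameter).
def ema_filter(samples, alpha=0.1):
    """Exponential moving average; smaller alpha = stronger smoothing."""
    smoothed = [samples[0]]
    for x in samples[1:]:
        smoothed.append(alpha * x + (1 - alpha) * smoothed[-1])
    return smoothed

# a resting wrist reported with ~0.06 g of alternating sensor noise
noisy = [0.0, 0.06, -0.06, 0.06, -0.06, 0.0]
smooth = ema_filter(noisy)
```

After filtering, the residual wobble is an order of magnitude below the raw 0.06 g noise floor, at the cost of attenuating genuine fast wrist motion, which is the usual trade-off when choosing alpha.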
By visualizing periods of eating in the collected dataset, we learn that people often conduct secondary activities while eating, such as walking, watching television, working, and doing household chores. These secondary activities cause wrist motions that obfuscate the wrist motions associated with eating, which increases the difficulty of detecting periods of eating (meals). Subjects reported conducting secondary activities in 72% of meals. Analysis of wrist motion data revealed that the wrist was resting 12.8% of the time during self-reported meals, compared to only 6.8% of the time in a cafeteria dataset. Walking motion was found during 5.5% of the time during meals in free-living, compared to 0% in the cafeteria. Augmenting an eating detection classifier to include walking and resting detection improved the average per-person accuracy from 74% to 77% on our free-living dataset (t[353]=7.86, p < 0.001). This suggests that future data collections for eating activity detection should also collect detailed ground truth on secondary activities conducted during eating.
Finally, learning from this data collection, we describe a convolutional neural network (CNN) to detect periods of eating by tracking wrist motion during everyday life. Eating uses hand-to-mouth gestures for ingestion, each of which lasts approximately 1-5 seconds. The novelty of our new approach is that we analyze a much longer window (0.5-15 min) that can contain other gestures related to eating, such as cutting or manipulating food, preparing foods for consumption, and resting between ingestion events. The context of these other gestures can improve the detection of periods of eating.
We found that accuracy at detecting eating increased by 15% in longer windows compared to shorter windows. Overall results on CAD were 89% detection of meals with 1.7 false positives for every true positive (FP/TP), and a time-weighted accuracy of 80%.
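The long-window idea can be sketched (in simplified, assumed form) as slicing the continuous wrist-motion stream into overlapping multi-minute windows for a classifier to label as eating vs. non-eating:

```python
# Sliding long-window segmentation of a wrist-motion stream (illustrative;
# the 6-minute window and 15 s stride are assumed values within the
# 0.5-15 min range discussed above, not the paper's exact settings).
def make_windows(stream_len_s, window_s=360, stride_s=15):
    """Return (start, end) second offsets of overlapping analysis windows."""
    windows = []
    start = 0
    while start + window_s <= stream_len_s:
        windows.append((start, start + window_s))
        start += stride_s
    return windows

# a 10-minute recording yields windows every 15 s
w = make_windows(600)
```

Each window spans many individual 1-5 s gestures, so the classifier sees the surrounding context (cutting, resting, preparing food) rather than isolated hand-to-mouth motions.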
Automatic recognition of multiparty human interactions using dynamic Bayesian networks
Relating statistical machine learning approaches to the automatic analysis of multiparty
communicative events, such as meetings, is an ambitious research area. We
have investigated automatic meeting segmentation both in terms of “Meeting Actions”
and “Dialogue Acts”. Dialogue acts model the discourse structure at a
fine-grained level, highlighting individual speaker intentions. Group meeting actions describe
the same process at a coarse level, highlighting interactions between different
meeting participants and showing overall group intentions.
A framework based on probabilistic graphical models such as dynamic Bayesian
networks (DBNs) has been investigated for both tasks. Our first set of experiments
is concerned with the segmentation and structuring of meetings (recorded using
multiple cameras and microphones) into sequences of group meeting actions such
as monologue, discussion and presentation. We outline four families of multimodal
features based on speaker turns, lexical transcription, prosody, and visual motion
that are extracted from the raw audio and video recordings. We relate these
low-level multimodal features to complex group behaviours, proposing a
multistream modelling framework based on dynamic Bayesian networks. Later experiments are
concerned with the automatic recognition of Dialogue Acts (DAs) in multiparty
conversational speech. We present a joint generative approach based on a switching
DBN for DA recognition in which segmentation and classification of DAs are
carried out in parallel. This approach models a set of features, related to lexical
content and prosody, and incorporates a weighted interpolated factored language
model. In conjunction with this joint generative model, we have also investigated
the use of a discriminative approach, based on conditional random fields, to perform
a reclassification of the segmented DAs.
The DBN based approach yielded significant improvements when applied both
to the meeting action and the dialogue act recognition task. On both tasks, the DBN
framework provided an effective factorisation of the state-space and a flexible infrastructure
able to integrate a heterogeneous set of resources such as continuous
and discrete multimodal features, and statistical language models. Although our
experiments have principally targeted multiparty meetings, the features, models,
and methodologies developed in this thesis can be employed for a wide range
of applications. Moreover, both group meeting actions and DAs offer valuable insights
into the current conversational context, providing cues and features
for several related research areas such as speaker addressing, focus-of-attention
modelling, automatic speech recognition and understanding, and topic and decision detection.
Getting Past the Language Gap: Innovations in Machine Translation
In this chapter, we review state-of-the-art machine translation systems and discuss innovative methods for machine translation, highlighting the most promising techniques and applications. Machine translation (MT) has benefited from a revitalization in the last 10 years or so, after a period of relatively slow activity. In 2005 the field received a jumpstart when a powerful complete experimental package for building MT systems from scratch became freely available as a result of the unified efforts of the MOSES international consortium. Around the same time, hierarchical methods were introduced by Chinese researchers, which allowed the introduction and use of syntactic information in translation modeling. Furthermore, advances in the related field of computational linguistics, making off-the-shelf taggers and parsers readily available, helped give MT an additional boost. Yet there is still more progress to be made. For example, MT will be enhanced greatly when both syntax and semantics are on board: this still presents a major challenge, though many advanced research groups are currently pursuing ways to meet it head-on. The next generation of MT will consist of a collection of hybrid systems. The outlook is also good for the mobile environment, as we look forward to more advanced and improved technologies, i.e. speech recognition and speech synthesis, that enable speech-to-speech machine translation on hand-held devices. We review all of these developments and point out in the final section some of the most promising research avenues for the future of MT.
A review of technical factors to consider when designing neural networks for semantic segmentation of Earth Observation imagery
Semantic segmentation (classification) of Earth Observation imagery is a
crucial task in remote sensing. This paper presents a comprehensive review of
technical factors to consider when designing neural networks for this purpose.
The review focuses on Convolutional Neural Networks (CNNs), Recurrent Neural
Networks (RNNs), Generative Adversarial Networks (GANs), and transformer
models, discussing prominent design patterns for these ANN families and their
implications for semantic segmentation. Common pre-processing techniques for
ensuring optimal data preparation are also covered. These include methods for
image normalization and chipping, as well as strategies for addressing data
imbalance in training samples, and techniques for overcoming limited data,
including augmentation techniques, transfer learning, and domain adaptation. By
encompassing both the technical aspects of neural network design and the
data-related considerations, this review provides researchers and practitioners
with a comprehensive and up-to-date understanding of the factors involved in
designing effective neural networks for semantic segmentation of Earth
Observation imagery.

Comment: 145 pages with 32 figures.
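Chipping, for example, tiles a large scene into network-sized inputs; a minimal sketch (the 256-pixel tile size and 32-pixel overlap are illustrative assumptions):

```python
# Sketch of image chipping for semantic segmentation: compute tile offsets
# that cover a large scene with overlap, ensuring the edges are included.
def chip_offsets(size, chip, stride):
    """Offsets along one axis; the last tile is snapped to the image edge."""
    offs = list(range(0, max(size - chip, 0) + 1, stride))
    if offs[-1] != size - chip:   # ensure the right/bottom edge is covered
        offs.append(size - chip)
    return offs

def chip_image(height, width, chip=256, overlap=32):
    """Return (row, col) top-left offsets of chip-sized tiles."""
    stride = chip - overlap
    rows = chip_offsets(height, chip, stride)
    cols = chip_offsets(width, chip, stride)
    return [(r, c) for r in rows for c in cols]

tiles = chip_image(512, 512)
```

The overlap lets predictions near tile borders be blended or cropped at inference time, avoiding the seam artifacts that disjoint tiling produces.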