433 research outputs found

    Symbol Emergence in Robotics: A Survey

    Humans learn the use of language through physical interaction with their environment and semiotic communication with other people. Obtaining a computational understanding of how humans form a symbol system and acquire semiotic skills through autonomous mental development is therefore very important. Recently, many studies have constructed robotic systems and machine-learning methods that learn the use of language through embodied multimodal interaction with their environment and with other systems. Understanding human social interactions, and developing a robot that can smoothly communicate with human users over the long term, requires an understanding of the dynamics of symbol systems: the embodied cognition and social interaction of participants gradually change a symbol system in a constructive manner. In this paper, we introduce a field of research called symbol emergence in robotics (SER). SER is a constructive approach towards an emergent symbol system that is socially self-organized through both semiotic communication and physical interaction among autonomous cognitive developmental agents, i.e., humans and developmental robots. Specifically, we describe state-of-the-art research topics in SER, e.g., multimodal categorization, word discovery, and double articulation analysis, which enable a robot to obtain words and their embodied meanings from raw sensorimotor information, including visual, haptic, auditory, and acoustic speech signals, in a totally unsupervised manner. Finally, we suggest future directions of research in SER. Comment: submitted to Advanced Robotics
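    As a rough illustration of the unsupervised multimodal categorization mentioned in this abstract, the Python sketch below clusters concatenated visual, haptic, and auditory feature vectors with a truncated Dirichlet-process Gaussian mixture, so the number of object categories is inferred rather than fixed in advance. The data, dimensions, and the choice of scikit-learn's BayesianGaussianMixture are illustrative assumptions; the surveyed SER systems typically use richer models such as multimodal LDA.

```python
# Toy stand-in for unsupervised multimodal categorization: standardize each
# modality, concatenate, and let a Dirichlet-process Gaussian mixture decide
# how many categories the data support.
import numpy as np
from sklearn.mixture import BayesianGaussianMixture
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Hypothetical observations: 200 objects described by visual, haptic, and
# auditory feature vectors (dimensions chosen arbitrarily for illustration).
visual = rng.normal(size=(200, 32))
haptic = rng.normal(size=(200, 8))
audio = rng.normal(size=(200, 16))

# Standardize each modality so no single one dominates, then concatenate.
X = np.hstack([StandardScaler().fit_transform(m) for m in (visual, haptic, audio)])

# Truncated Dirichlet-process mixture: the effective number of categories is inferred.
model = BayesianGaussianMixture(
    n_components=20,
    weight_concentration_prior_type="dirichlet_process",
    covariance_type="diag",
    random_state=0,
).fit(X)

categories = model.predict(X)
print("categories used:", np.unique(categories).size)
```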

    A Simultaneous Extraction of Context and Community from pervasive signals using nested Dirichlet process

    Understanding user contexts and group structures plays a central role in pervasive computing. These contexts and community structures are difficult to mine from data collected in the wild because of the unprecedented growth of data, noise, uncertainty, and complexity. Typical existing approaches first extract latent patterns that explain human dynamics or behaviors and then use them to formulate numerical representations for community detection, often via a clustering method. While such approaches can capture high-order and complex representations, the two steps are performed separately. More importantly, they face a fundamental difficulty in determining the correct number of latent patterns and communities. This paper presents an approach that seamlessly addresses these challenges by simultaneously discovering latent patterns and communities in a unified Bayesian nonparametric framework. Our Simultaneous Extraction of Context and Community (SECC) model is rooted in nested Dirichlet process theory, which allows a nested structure to be built to summarize data at multiple levels. We demonstrate our framework on five datasets, on which the advantages of the proposed approach are validated.
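    To make the nested Dirichlet process structure behind SECC more concrete, here is a small generative sketch, not the paper's inference algorithm: an outer DP assigns each user to a community-level distribution, and each community is itself a DP over context atoms that emit discrete signals. The truncation levels, concentration parameters, and symbol vocabulary are illustrative assumptions.

```python
# Truncated generative sketch of a nested Dirichlet process (nDP):
# users who share the same community-level DP form a community; the atoms
# of that DP are context patterns over observed discrete signals.
import numpy as np

rng = np.random.default_rng(1)

def stick_breaking(concentration, truncation):
    """Truncated stick-breaking weights of a DP, renormalized to sum to one."""
    betas = rng.beta(1.0, concentration, size=truncation)
    remaining = np.concatenate(([1.0], np.cumprod(1.0 - betas[:-1])))
    weights = betas * remaining
    return weights / weights.sum()

K, T = 10, 15            # truncation levels: communities, contexts per community
alpha, gamma = 1.0, 1.0  # concentration parameters (illustrative values)
V = 50                   # vocabulary of discrete "signal" symbols

# Outer DP: weights over community-level distributions G_1..G_K.
community_weights = stick_breaking(alpha, K)

# Each community-level distribution is itself a DP draw: weights over context
# atoms, each atom being a multinomial over observed signals.
context_weights = np.stack([stick_breaking(gamma, T) for _ in range(K)])
context_atoms = rng.dirichlet(np.ones(V), size=(K, T))

def sample_user(n_obs=30):
    """Assign a user to a community, then emit signals via its contexts."""
    k = rng.choice(K, p=community_weights)               # community membership
    contexts = rng.choice(T, size=n_obs, p=context_weights[k])
    signals = [rng.choice(V, p=context_atoms[k, t]) for t in contexts]
    return k, signals

users = [sample_user() for _ in range(100)]
print("users per community:", np.bincount([k for k, _ in users], minlength=K))
```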

    Deep Hypernetworks for Learning from Dynamic Multimodal Data

    Doctoral dissertation, Seoul National University Graduate School, Department of Electrical and Computer Engineering, February 2015. Advised by Byoung-Tak Zhang. Recent advances in information and communication technology have led to an explosive increase in data. Unlike traditional data, which are structured and unimodal, data generated from dynamic environments are characterized by high dimensionality, multimodality, and lack of structure, as well as huge scale. Learning from non-stationary multimodal data is essential for solving many difficult problems in artificial intelligence. However, despite many successful reports, existing machine learning methods have mainly focused on solving practical problems represented by large-scale but static databases, such as image classification, tagging, and retrieval. Hypernetworks are a probabilistic graphical model that represents an empirical distribution using a hypergraph structure, i.e., a large collection of hyperedges encoding the associations among variables. This representation makes the model suitable for characterizing complex relationships between features with a population of building blocks. However, since a hypernetwork is represented over a huge combinatorial feature space, the model requires a large number of hyperedges to handle multimodal large-scale data and thus faces a scalability problem. In this dissertation, we propose a deep architecture of hypernetworks, i.e., deep hypernetworks, to deal with this scalability issue when learning from multimodal data with non-stationary properties, such as videos. Deep hypernetworks handle these issues through abstraction at multiple levels using a hierarchy of hypergraphs. We use a stochastic method based on Monte-Carlo simulation, graph Monte-Carlo, to efficiently construct hypergraphs representing the empirical distribution of the observed data. The structure of a deep hypernetwork continuously changes as learning proceeds, a flexibility that contrasts with other deep learning models. The proposed model learns incrementally from the data, thus handling non-stationary properties such as concept drift. The abstract representations in the learned models serve as multimodal knowledge about the data, which is used for content-aware crossmodal transformation, including vision-language conversion. We view vision-language conversion as machine translation and thus formulate vision-language translation in terms of statistical machine translation. Since knowledge about the video stories is used for translation, we call this story-aware vision-language translation. We evaluate deep hypernetworks on large-scale vision-language multimodal data, including benchmark datasets and cartoon video series. The experimental results show that deep hypernetworks effectively represent visual-linguistic information abstracted at multiple levels of the data content, as well as the associations between vision and language. We explain how the introduction of a hierarchy deals with the scalability and non-stationary properties. In addition, we present story-aware vision-language translation on cartoon videos, generating scene images from sentences and descriptive subtitles from scene images.
    Furthermore, we discuss the implications of our model for lifelong learning and directions for improvement towards human-level artificial intelligence.
    Table of contents:
    1 Introduction
      1.1 Background and Motivation
      1.2 Problems to be Addressed
      1.3 The Proposed Approach and its Contribution
      1.4 Organization of the Dissertation
    2 Related Work
      2.1 Multimodal Learning
      2.2 Models for Learning from Multimodal Data
        2.2.1 Topic Model-Based Multimodal Learning
        2.2.2 Deep Network-Based Multimodal Learning
      2.3 Higher-Order Graphical Models
        2.3.1 Hypernetwork Models
        2.3.2 Bayesian Evolutionary Learning of Hypernetworks
    3 Multimodal Hypernetworks for Text-to-Image Retrievals
      3.1 Overview
      3.2 Hypernetworks for Multimodal Associations
        3.2.1 Multimodal Hypernetworks
        3.2.2 Incremental Learning of Multimodal Hypernetworks
      3.3 Text-to-Image Crossmodal Inference
        3.3.1 Representation of Textual-Visual Data
        3.3.2 Text-to-Image Query Expansion
      3.4 Text-to-Image Retrieval via Multimodal Hypernetworks
        3.4.1 Data and Experimental Settings
        3.4.2 Text-to-Image Retrieval Performance
        3.4.3 Incremental Learning for Text-to-Image Retrieval
      3.5 Summary
    4 Deep Hypernetworks for Multimodal Concept Learning from Cartoon Videos
      4.1 Overview
      4.2 Visual-Linguistic Concept Representation of Cartoon Videos
      4.3 Deep Hypernetworks for Modeling Visual-Linguistic Concepts
        4.3.1 Sparse Population Coding
        4.3.2 Deep Hypernetworks for Concept Hierarchies
        4.3.3 Implication of Deep Hypernetworks on Cognitive Modeling
      4.4 Learning of Deep Hypernetworks
        4.4.1 Problem Space of Deep Hypernetworks
        4.4.2 Graph Monte-Carlo Simulation
        4.4.3 Learning of Concept Layers
        4.4.4 Incremental Concept Construction
      4.5 Incremental Concept Construction from Cartoon Videos
        4.5.1 Data Description and Parameter Setup
        4.5.2 Concept Representation and Development
        4.5.3 Character Classification via Concept Learning
        4.5.4 Vision-Language Conversion via Concept Learning
      4.6 Summary
    5 Story-aware Vision-Language Translation using Deep Concept Hierarchies
      5.1 Overview
      5.2 Vision-Language Conversion as a Machine Translation
        5.2.1 Statistical Machine Translation
        5.2.2 Vision-Language Translation
      5.3 Story-aware Vision-Language Translation using Deep Concept Hierarchies
        5.3.1 Story-aware Vision-Language Translation
        5.3.2 Vision-to-Language Translation
        5.3.3 Language-to-Vision Translation
      5.4 Story-aware Vision-Language Translation on Cartoon Videos
        5.4.1 Data and Experimental Setting
        5.4.2 Scene-to-Sentence Generation
        5.4.3 Sentence-to-Scene Generation
        5.4.4 Visual-Linguistic Story Summarization of Cartoon Videos
      5.5 Summary
    6 Concluding Remarks
      6.1 Summary of the Dissertation
      6.2 Directions for Further Research
    Bibliography
    Abstract in Korean
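    To give a feel for the hyperedge-population idea described in the abstract above, the toy Python sketch below samples small subsets of co-occurring multimodal features as "hyperedges", accumulates their weights incrementally, and uses overlapping hyperedges for a crude cross-modal completion. The feature names and sampling scheme are invented for illustration and are not the dissertation's graph Monte-Carlo procedure or sparse population coding.

```python
# Toy hyperedge population: random fixed-order subsets of co-occurring features,
# weighted by how often they recur across observations, then used to suggest
# features of one modality given features of another.
import random
from collections import Counter

random.seed(0)

def sample_hyperedges(features, order=3, n_samples=20):
    """Draw random fixed-order subsets of an observation's active features."""
    features = list(features)
    edges = set()
    for _ in range(n_samples):
        if len(features) >= order:
            edges.add(frozenset(random.sample(features, order)))
    return edges

# Hypothetical multimodal observations: visual tags plus words from a subtitle.
observations = [
    {"v:red", "v:round", "v:small", "w:apple", "w:eat"},
    {"v:red", "v:round", "w:apple", "w:fruit"},
    {"v:yellow", "v:long", "w:banana", "w:fruit"},
]

# Weights grow as the same association recurs, giving the incremental flavour
# of the learning described in the abstract.
population = Counter()
for obs in observations:
    population.update(sample_hyperedges(obs))

def complete(query, population, top=3):
    """Score candidate features by the weight of hyperedges overlapping the query."""
    scores = Counter()
    for edge, w in population.items():
        overlap = len(edge & query)
        if overlap >= 2:
            for f in edge - query:
                scores[f] += w * overlap
    return scores.most_common(top)

print(complete({"v:red", "v:round"}, population))
```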

    Latent variable models for understanding user behavior in software applications

    Thesis: Ph.D., Massachusetts Institute of Technology, Department of Electrical Engineering and Computer Science, 2018. Cataloged from the PDF version of the thesis. Includes bibliographical references (pages 147-157). Understanding user behavior in software applications is of significant interest to software developers and companies. With a better understanding of user needs and usage patterns, developers can design a more efficient workflow, add new features, or even automate the user's workflow. In this thesis, I propose novel latent variable models to understand, predict, and eventually automate the user's interaction with a software application. I start by analyzing users' clicks using time series models; I introduce models and inference algorithms for time series segmentation that scale to large user datasets. Next, using a conditional variational autoencoder and related models, I introduce a framework for automating the user's interaction with a software application. I focus on photo enhancement applications, but this framework can be applied to any domain where segmentation, prediction, and personalization are valuable. Finally, by combining sequential Monte Carlo and variational inference, I propose a new inference scheme with better convergence properties than other reasonable baselines. By Ardavan Saeedi. Ph.D.
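    As a minimal sketch of the conditional variational autoencoder idea mentioned above for predicting a user's next actions given context, the PyTorch snippet below learns an encoder q(z | action, context) and a decoder p(action | z, context) on synthetic data. The architecture, dimensions, and training loop are illustrative assumptions, not the thesis's models.

```python
# Minimal conditional VAE: encode (action, context) to a Gaussian posterior,
# decode (latent, context) back to an action, train with a standard ELBO.
import torch
import torch.nn as nn
import torch.nn.functional as F

class CVAE(nn.Module):
    def __init__(self, x_dim=16, c_dim=8, z_dim=4, hidden=64):
        super().__init__()
        # Encoder q(z | x, c): maps an action and its context to a Gaussian posterior.
        self.enc = nn.Sequential(nn.Linear(x_dim + c_dim, hidden), nn.ReLU())
        self.mu = nn.Linear(hidden, z_dim)
        self.logvar = nn.Linear(hidden, z_dim)
        # Decoder p(x | z, c): reconstructs the action from the latent code and context.
        self.dec = nn.Sequential(
            nn.Linear(z_dim + c_dim, hidden), nn.ReLU(), nn.Linear(hidden, x_dim)
        )

    def forward(self, x, c):
        h = self.enc(torch.cat([x, c], dim=-1))
        mu, logvar = self.mu(h), self.logvar(h)
        z = mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)  # reparameterization
        return self.dec(torch.cat([z, c], dim=-1)), mu, logvar

def elbo_loss(x_hat, x, mu, logvar):
    recon = F.mse_loss(x_hat, x, reduction="sum")
    kl = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp())
    return recon + kl

# Train on synthetic (action, context) pairs; real data would be user edits
# paired with features of the photo being enhanced.
model = CVAE()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
c = torch.randn(256, 8)
x = torch.tanh(c @ torch.randn(8, 16))  # actions correlated with context
for _ in range(200):
    x_hat, mu, logvar = model(x, c)
    loss = elbo_loss(x_hat, x, mu, logvar)
    opt.zero_grad()
    loss.backward()
    opt.step()

# At suggestion time, sample z from the prior and decode given a new context.
with torch.no_grad():
    z = torch.randn(1, 4)
    suggestion = model.dec(torch.cat([z, torch.randn(1, 8)], dim=-1))
```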