Search CORE

59,978 research outputs found

OBJ2TEXT: Generating Visually Descriptive Language from Object Layouts

Author: Ordonez Vicente
Yin Xuwang
Publication venue
Publication date: 01/01/2017
Field of study

Generating captions for images is a task that has recently received considerable attention. In this work we focus on caption generation for abstract scenes, or object layouts where the only information provided is a set of objects and their locations. We propose OBJ2TEXT, a sequence-to-sequence model that encodes a set of objects and their locations as an input sequence using an LSTM network, and decodes this representation using an LSTM language model. We show that our model, despite encoding object layouts as a sequence, can represent spatial relationships between objects, and generate descriptions that are globally coherent and semantically relevant. We test our approach in a task of object-layout captioning by using only object annotations as inputs. We additionally show that our model, combined with a state-of-the-art object detector, improves an image captioning model from 0.863 to 0.950 (CIDEr score) in the test benchmark of the standard MS-COCO Captioning task.Comment: Accepted at EMNLP 201

arXiv.org e-Print Archive

Crossref

Symbol Emergence in Robotics: A Survey

Author: Asoh Hideki
Iwahashi Naoto
Nagai Takayuki
Nakamura Tomoaki
Ogata Tetsuya
Taniguchi Tadahiro
Publication venue
Publication date: 29/09/2015
Field of study

Humans can learn the use of language through physical interaction with their environment and semiotic communication with other people. It is very important to obtain a computational understanding of how humans can form a symbol system and obtain semiotic skills through their autonomous mental development. Recently, many studies have been conducted on the construction of robotic systems and machine-learning methods that can learn the use of language through embodied multimodal interaction with their environment and other systems. Understanding human social interactions and developing a robot that can smoothly communicate with human users in the long term, requires an understanding of the dynamics of symbol systems and is crucially important. The embodied cognition and social interaction of participants gradually change a symbol system in a constructive manner. In this paper, we introduce a field of research called symbol emergence in robotics (SER). SER is a constructive approach towards an emergent symbol system. The emergent symbol system is socially self-organized through both semiotic communications and physical interactions with autonomous cognitive developmental agents, i.e., humans and developmental robots. Specifically, we describe some state-of-art research topics concerning SER, e.g., multimodal categorization, word discovery, and a double articulation analysis, that enable a robot to obtain words and their embodied meanings from raw sensory--motor information, including visual information, haptic information, auditory information, and acoustic speech signals, in a totally unsupervised manner. Finally, we suggest future directions of research in SER.Comment: submitted to Advanced Robotic

arXiv.org e-Print Archive

A two-way translation system of Chinese sign language based on computer vision

Author: Lan Yan
Wei Shengzhuo
Publication venue
Publication date: 03/06/2023
Field of study

As the main means of communication for deaf people, sign language has a special grammatical order, so it is meaningful and valuable to develop a real-time translation system for sign language. In the research process, we added a TSM module to the lightweight neural network model for the large Chinese continuous sign language dataset . It effectively improves the network performance with high accuracy and fast recognition speed. At the same time, we improve the Bert-Base-Chinese model to divide Chinese sentences into words and mapping the natural word order to the statute sign language order, and finally use the corresponding word videos in the isolated sign language dataset to generate the sentence video, so as to achieve the function of text-to-sign language translation. In the last of our research we built a system with sign language recognition and translation functions, and conducted performance tests on the complete dataset. The sign language video recognition accuracy reached about 99.3% with a time of about 0.05 seconds, and the sign language generation video time was about 1.3 seconds. The sign language system has good performance performance and is feasible

arXiv.org e-Print Archive