Systems that perform in real environments need to bind the internal state to externally perceived objects, events, or complete scenes. How to learn this correspondence has been a long-standing problem in computer vision as well as artificial intelligence. Augmented Reality provides an interesting perspective on this problem because a human user can directly relate displayed system results to real environments. In the following we present a system that is able to bootstrap internal models from user-system interactions. Starting from pictorial representations it learns symbolic object labels that provide the basis for storing observed episodes. In a second step, more complex relational information is extracted from stored episodes that enables the system to react to specific scene contexts.

Bauckhage, Christian

Hanheide, Marc

Wachsmuth, Sven

Wrede, Sebastian

English

University of Lincoln Institutional Repository

From Images via Symbols to Contexts: Using Augmented Reality for Interactive Model Acquisition

Sven Wachsmuth+, Marc Hanheide+, Sebastian Wrede+ and Christian Bauckhage∗

+ Bielefeld University, Faculty of Technology, D-33594 Bielefeld, Germany
{swachsmu,mhanheid,swrede}@techfak.uni-bielefeld.de
∗ York University, Centre for Vision Research, Toronto ON, M3J 1P3, Canada
bauckhag@cs.yorku.ca

1 Introduction

Mixed reality systems combine real world views with views of a virtual environment [4]. In the sub-field of augmented reality, virtual augmentations are added to the real world view of the user. This is typically realized by using a setup with a head-mounted device which is equipped with cameras and a display. Most of the research on computer vision in this field is dedicated to the problem of aligning real and virtual objects (cf. e.g. [4, 8]). This is mostly based on pre-defined 3-d CAD models. The AR system is either used to present a virtually changed environment to the user or to support the user in a pre-defined task, e.g. [8]. In the VAMPIRE¹ project we take a different approach in that we focus on the problem of how a system can bootstrap its knowledge about an unknown real environment.
By using Augmented Reality techniques, the computer vision system is embodied through the tight interaction with the user. In this kind of scenario, augmentations, like bounding boxes, text labels, or arrows, are used in order to close the feedback cycle to the user. In turn, the user is able to react based on the augmentations by changing the view or acting in the scene. Thus, the coupling between the user and the vision system is highly dynamic and depends on the interaction history.

The learning of visual models based on human feedback has been explored in several different scenarios. Roy uses video data from mother-child interactions in order to learn the association between acoustic and visual patterns [9]. Steels introduces the term social learning in a scenario where a human teaches different kinds of objects to an Aibo robot [10]. In [2], imitation learning is explored as a social learning and teaching process that aims at socially intelligent robots. Finally, Heidemann et al. [7] present an augmented reality system for interactive object learning which was developed within the VAMPIRE project. However, most systems limit the learning capability to a single aspect, like learning a classifier for an individual object. A more general approach needs to deal with various kinds of data structures and needs to integrate different learning processes in a single framework.

¹ Visual Active Memory Processes and Interactive REtrieval – IST-2001-34401

Fig. 1. AR system and memory organization. (a) AR gear. (b) Memory organization with pictorial memory, feature-based memory, episodic memory, and conceptual memory: object models are learned from image patches, the symbolic description is grounded through object models, and contextual models are learned from recorded episodes in grounded sub-scenes. The drawing shows an example memory entry: <OBJECT><HYPOTHESIS><RELIABILITY value="0.6"/><CLASS>Cup</CLASS></OBJECT>...

2 AR interaction and formation of memory content

In Fig. 1(a) the scenario of the system is shown.
The user is sitting at a regular office table and wears a head-mounted device which is equipped with cameras and a display. Information about recognized objects and results of user queries are visualized using augmented reality (AR). The head of the user is tracked using a CMOS camera and an inertial sensor that are mounted on the top of the helmet. The head pose is computed from an artificial landmark that is placed in the scene and defines a global coordinate system. The system is able to detect objects and user activities, like moving an object. It copes with varying lighting conditions as well as cluttered video signals. By selecting from a menu displayed on the right of the field of view, by speech or a mouse wheel, the user can trigger learning sessions or retrieve information.

In order to realize a bootstrapping behavior of the system starting from image-based representations to symbol-based representations, the organization of memory content plays a key role. The technical basis for storing and retrieving various kinds of information as well as the coordination of different visual behaviors is provided by the Active Memory Infrastructure which is also described in [11]. The persistence back-end is the native Berkeley XML DB. Binary data is stored directly in the underlying relational database and is referenced from stored XML documents. Thus, XML provides a unified data model for structured information that is exchanged between system components and stored in the memory.

On the conceptual level, we distinguish four different kinds of abstraction layers in the memory representation (see Fig. 1(b)). On the pictorial layer images and image patches are temporarily stored. The feature-based layer includes learned object models and configuration data of the object recognition components. In the episodic memory layer recognition results are stored that have been reliably detected during an interactive session with a user.
Finally, the categorical layer consists of a couple of contextual models that, e.g., describe typical configurations of objects. Each higher layer is grounded in a layer that is nearer to the signal. Object models in the feature-based memory are learned from image patches that are stored during system usage; detected objects and events are related to learned prototypes in the feature space; finally, contextual models are learned from episodic sequences that capture a spatial context, e.g. the user was looking around on the writing area of his or her desk.

Interpretation as well as learning processes work asynchronously on the memory representation. They are coordinated through memory event notification [11], e.g. the object anchoring component is triggered if a new object hypothesis is stored in the memory.

3 Image-based scene decomposition and acquisition of object views

In the Augmented Reality scenario, the user and the system share a common view. The images of the head-mounted cameras are directly shown on the head-mounted stereo display, so that the user sees what the camera records and the system knows which part of the scene is focused by the user. Two different visual behaviors are used on this pictorial representation level.

Mosaicing: In indoor environments meaningful sub-scenes are typically defined by planes, e.g. table top, front side of a shelf, walls. However, if we keep a sufficient level of image detail, these kinds of contextual areas cannot be seen completely through a single view. In [5] we present a unique approach to create mosaics for arbitrarily moving head-mounted cameras. It uses a three stage architecture. First, we decompose the scene into approximated planes using stereo information, which afterwards can be tracked and integrated to mosaics individually (see Fig. 2). This avoids the problem of parallax errors usually occurring from arbitrary motion and provides a compact and non-redundant representation of the scene. Each plane defines a coarse spatial context from which contextual models can be learned that interrelate objects that frequently co-occur in such a sub-scene.

Fig. 2. Constructing and tracking of planar sub-scenes. The mosaicing approach has constructed three different planar sub-scenes that are stored in the pictorial memory. They were constructed from an image sequence of the head-mounted cameras which is incrementally processed in soft real-time. The user turned his or her head from the right side of the table to the left side. The system has correctly identified the two different desk levels.

Object tracking: The acquisition of object models is a key to higher-level descriptions of a scene. For object recognition an appearance-based VPL-classifier [1] is used that can directly be trained from image patches. These are automatically extracted while a user is focusing the target object. An entropy measure is used in order to segment unknown objects from a more or less homogeneous table plane. In the learning mode of the system the detected area is augmented to the view of the user. Once the first view is registered by the system, a data-driven tracking technique [6] is started that provides additional views of the object. Each view that the system collects for learning is checked with the user so that he or she can control the learning process. The patches can be stored in the pictorial memory of the system for a fast online learning of objects as well as a more accurate object learning on a longer time scale [1].
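
The text does not specify how the entropy measure for segmenting objects from the table plane is computed. The following is only a minimal sketch of one possible reading, assuming a gray-level image that is split into square tiles, with a tile marked as an object candidate when the Shannon entropy of its gray-level histogram exceeds a threshold. The function names, the tile size, the number of histogram bins, the threshold, and the synthetic test scene are all illustrative assumptions and are not taken from the described system.

```python
import math
import random
from collections import Counter


def tile_entropy(pixels, bins=16):
    """Shannon entropy (in bits) of the gray-level histogram of one tile.

    `pixels` is a flat list of gray values in 0..255 (assumed format).
    """
    counts = Counter(p * bins // 256 for p in pixels)
    total = len(pixels)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


def entropy_mask(gray, tile=8, threshold=1.0):
    """Mark each tile whose histogram entropy exceeds `threshold`.

    A homogeneous table plane yields entropy near zero, textured
    objects yield higher values. `tile` and `threshold` are assumptions.
    """
    rows, cols = len(gray) // tile, len(gray[0]) // tile
    mask = []
    for r in range(rows):
        row = []
        for c in range(cols):
            pixels = [gray[y][x]
                      for y in range(r * tile, (r + 1) * tile)
                      for x in range(c * tile, (c + 1) * tile)]
            row.append(tile_entropy(pixels) > threshold)
        mask.append(row)
    return mask


def bounding_box(mask, tile=8):
    """Pixel bounding box (top, left, bottom, right) of all marked tiles."""
    marked = [(r, c) for r, row in enumerate(mask)
              for c, hit in enumerate(row) if hit]
    if not marked:
        return None
    rs = [r for r, _ in marked]
    cs = [c for _, c in marked]
    return (min(rs) * tile, min(cs) * tile,
            (max(rs) + 1) * tile, (max(cs) + 1) * tile)


def synthetic_scene(size=64, seed=0):
    """Uniform gray 'table' with one randomly textured square 'object'."""
    rng = random.Random(seed)
    image = [[120] * size for _ in range(size)]
    for y in range(16, 40):
        for x in range(24, 48):
            image[y][x] = rng.randrange(256)
    return image


if __name__ == "__main__":
    print(bounding_box(entropy_mask(synthetic_scene())))
```

In such a sketch the resulting bounding box would play the role of the detected area that is augmented to the user's view and from which image patches are extracted for learning.
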
A label is currently given by speech input based on a pre-defined lexicon.

Fig. 3. Contextual models are learned from episodic memory content. (a) Object localization and anchoring based on 3-d pose (labels in the drawing: Tpose, Tobject, Ptable, Landmark). (b) A Bayesian network for scenery classification can be learned from anchored object hypotheses. The drawing shows the nodes scene, monitor, keyboard, cup, and sharpener. It contains four conditional probability tables with the rows computer and desk and the columns false and true, holding the value pairs 0.517/0.483 and 0.264/0.736; 0.212/0.788 and 0.336/0.664; 0.585/0.415 and 0.227/0.773; 0.144/0.856 and 0.482/0.518; the scene table lists 0.518/0.482 for computer/desk.

4 Object anchoring and the role of context

Object anchoring links corresponding object hypotheses that are detected at different points in time to the same symbol. This is essential for representing episodes over an extended period of time. In addition to the trajectory information from object tracking, a second strategy is applied for linking that takes the 3-d position of the object hypotheses into account. This can be estimated based on a self-localization of the cameras [3]. Currently, we assume that each object is lying on a table plane. Object hypotheses are fused over time if the 3-d positions are close enough to each other. A Gaussian curve models the probability that two hypotheses refer to the same object (see Fig. 3(a)). For the final classification result the labels provided by the object recognition component are integrated over a short period of time. Thereby, the reliability value of a specific hypothesis is adapted. Only those hypotheses that have a highly rated reliability value are used for contextual model learning.

Based on this kind of episodic data, contextual models can be estimated that represent typical configurations of objects in a sub-scene. For that, we use simple Bayesian networks with discrete conditional probability tables. In Fig. 3(b) a learned parameterization of a Bayesian network is shown.

The contextual models in turn can be used to judge certain object hypotheses given their context as well as to classify more general scene contexts, like 'office table' if a keyboard and a computer mouse have been found. Thus, higher-level categories can be detected that are defined through relations between objects.

5 Conclusion and Outlook

In this paper we presented a bootstrapping approach for the acquisition of knowledge in unknown environments. Augmented Reality techniques are used in order to close the interaction loop with the user. This acquisition process combines several visual behaviors that are integrated using the active memory infrastructure. It is shown how the tight coupling with the user can be used in order to acquire grounded higher-level representations. The demonstration system is running on 5 different laptops allowing a soft real-time behavior. New objects can be learned in about 2-3 minutes, acquiring between 4 and 6 object views. Contextual models are learned on a longer time scale. Parameters of Bayesian networks are estimated from about 5 minutes of regular system usage where the corresponding scenery label is given by the user. Further system development will focus on a further integration of the mosaiced sub-scenes and the structural learning of contextual models. We think that the triadic interaction between the system, the human, and the environment provides an ideal basis for pushing the cognitive development of artificial systems to a further level. Augmented reality offers strong interaction patterns for this purpose. On the other side, cognitive system capabilities will lead to a next generation of assistance technology offering a variety of applications.

References

1. H. Bekel, I. Bax, G. Heidemann, and H. Ritter. Adaptive Computer Vision: Online Learning for Object Recognition. In Proc. Pattern Recognition Symposium (DAGM), 2004.
2. C. Breazeal, D. Buchsbaum, J. Gray, D. Gatenby, and B. Blumberg. Learning from and about Others: Towards Using Imitation to Bootstrap the Social Understanding of Others by Robots. Artificial Life, 2004. (Forthcoming 2004).
3. M.K. Chandraker, C. Stock, and A. Pinz. Real Time Camera Pose in a Room. In Int. Conf. on Computer Vision Systems, volume 2626 of LNCS, pages 98–110, 2003.
4. D. Drascic and P. Milgram. Perceptual Issues in Augmented Reality. In Mark T. Bolas, Scott S. Fisher, and John O. Merritt, editors, Stereoscopic Displays and Virtual Reality Systems III, volume 2653 of SPIE, pages 123–134, San Jose, California, USA, January–February 1996.
5. N. Gorges, M. Hanheide, W. Christmas, C. Bauckhage, G. Sagerer, and J. Kittler. Mosaics from Arbitrary Stereo Video Sequences. In Proc. Pattern Recognition Symposium (DAGM), 2004.
6. Ch. Gräßl, T. Zinßer, and H. Niemann. Efficient Hyperplane Tracking by Intelligent Region Selection. In Proc. IEEE Southwest Symposium on Image Analysis and Interpretation, pages 51–55, 2004.
7. G. Heidemann, H. Bekel, I. Bax, and H. Ritter. Interactive Online Learning. Pattern Recognition and Image Analysis, 15(1):55–58, 2005.
8. G. Klinker, K. Ahlers, D. Breen, P.-Y. Chevalier, Ch. Crampton, D. Greer, D. Koller, A. Kramer, E. Rose, M. Tuceryan, and R. Whitaker. Confluence of Computer Vision and Interactive Graphics for Augmented Reality. Presence: Teleoperators and Virtual Environments, 6(4):433–451, August 1997.
9. D. Roy. Learning visually grounded words and syntax of natural spoken language. Evolution of Communication, 4(1), 2002.
10. L. Steels and F. Kaplan. AIBO's first words: The social learning of language and meaning. Evolution of Communication, 4(1):3–32, 2001.
11. S. Wachsmuth, S. Wrede, M. Hanheide, and C. Bauckhage. An Active Memory Model for Cognitive Computer Vision Systems. Künstliche Intelligenz, 19(2):25–31, 2005.
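
As a supplementary illustration of the anchoring step in Section 4, the following minimal sketch fuses object hypotheses whose table-plane positions are close, using a Gaussian of the distance as the same-object score, and integrates the recognizer labels over the fused hypotheses. It is a sketch under stated assumptions, not the implementation of the described system: the class and function names, the standard deviation `sigma`, the fusion threshold `min_prob`, the reliability-weighted voting, and the example observations are all hypothetical.

```python
import math
from dataclasses import dataclass, field


@dataclass
class Anchor:
    """One symbol that collects hypotheses referring to the same object."""
    symbol: str
    position: tuple                      # (x, y) on the table plane, metres (assumed)
    votes: dict = field(default_factory=dict)


def same_object_probability(p, q, sigma=0.05):
    """Gaussian score (1.0 at zero distance) that two positions coincide."""
    d2 = sum((a - b) ** 2 for a, b in zip(p, q))
    return math.exp(-d2 / (2 * sigma ** 2))


def anchor_hypothesis(anchors, position, label, reliability, min_prob=0.5):
    """Fuse a new hypothesis with the best matching anchor or create one."""
    best = max(anchors,
               key=lambda a: same_object_probability(a.position, position),
               default=None)
    if best is None or same_object_probability(best.position, position) < min_prob:
        best = Anchor(symbol=f"obj{len(anchors)}", position=position)
        anchors.append(best)
    else:
        best.position = position         # follow the most recent observation
    # Integrate labels over time, weighted by the recognizer's reliability.
    best.votes[label] = best.votes.get(label, 0.0) + reliability
    return best


def classification(anchor):
    """Winning label and its share of the accumulated reliability."""
    label = max(anchor.votes, key=anchor.votes.get)
    return label, anchor.votes[label] / sum(anchor.votes.values())


def demo():
    anchors = []
    observations = [                     # hypothetical recognizer output
        ((0.10, 0.20), "Cup", 0.6),
        ((0.11, 0.21), "Cup", 0.7),
        ((0.12, 0.20), "Sharpener", 0.3),
        ((0.60, 0.40), "Keyboard", 0.9),
    ]
    for position, label, reliability in observations:
        anchor_hypothesis(anchors, position, label, reliability)
    return anchors


if __name__ == "__main__":
    for anchor in demo():
        print(anchor.symbol, classification(anchor))
```

In this sketch the first three observations fall within a few centimetres of each other and are fused into one symbol, while the distant keyboard hypothesis opens a second one. A threshold on the returned share could stand in for the selection of highly rated hypotheses that are passed on to contextual model learning.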
