We propose a novel method for learning visual concepts and their correspondence to the words of a natural language. The concepts and correspondences are jointly inferred from video clips depicting simple actions involving multiple objects, together with corresponding natural language commands that would elicit these actions. Individual objects are first detected, together with quantitative measurements of their colour, shape, location and motion. Visual concepts emerge from the co-occurrence of regions within a measurement space and words of the language. The method is evaluated on a set of videos generated automatically using computer graphics from a database of initial and goal configurations of objects. Each video is annotated with multiple commands in natural language obtained from human annotators using crowd sourcing

Alomari, M.

Chinellato, E.

Cohn, A.

Gatsoulis, Y.

Hogg, D.

English

Middlesex University Research Repository

Unsupervised Grounding of Textual Descriptions of Object Features and Actionsin VideoMuhannad Alomari, Eris Chinellato, Yiannis Gatsoulis, David C. Hogg and Anthony G. CohnSchool of Computing, University of Leeds, UK{scmara, E.Chinellato, Y.Gatsoulis, D.C.Hogg, A.G.Cohn}@leeds.ac.ukAbstractWe propose a novel method for learning visual conceptsand their correspondence to the words of a natural lan-guage. The concepts and correspondences are jointly inferredfrom video clips depicting simple actions involving multi-ple objects, together with corresponding natural languagecommands that would elicit these actions. Individual objectsare first detected, together with quantitative measurementsof their colour, shape, location and motion. Visual conceptsemerge from the co-occurrence of regions within a measure-ment space and words of the language. The method is eval-uated on a set of videos generated automatically using com-puter graphics from a database of initial and goal configura-tions of objects. Each video is annotated with multiple com-mands in natural language obtained from human annotatorsusing crowd sourcing.IntroductionLearning linguistic and visual concepts from videos and tex-tual descriptions without having a pre-defined set of repre-sentations is a challenging yet important task. For example,humans are born without the knowledge of how many repre-sentations for directions there are in the world, or how theyare described in natural language. In some situations, it isbetter to use the 4 directions representation (front, right, left,back), in others, one can use the 8 directions (front, frontright, right, etc.). Humans are capable of learning these dif-ferent representations of directions, and at the same timelearn the words that describe them in natural language, with-out having these concepts pre-programmed into their brains.Such learning ability makes us more capable of operating indifferent situations, hence the importance of this task.We exemplify our approach by showing how unsuper-vised learning of concepts in the following feature spacesis possible: colours, shapes, locations, and actions. For ex-ample, in the sentence “pick up the blue block” our aim isto learn that the phrase pick up is a concept in the actionsfeature space, the word blue is a concept in the colour fea-ture space and the word block is a concept in the shape fea-ture space. A key challenge of this task arises from the factthat the system does not know how many concepts there areto learn in each feature space. To avoid this dilemma, re-searchers have used constraints to simplify the setting suf-ficiently to enable learning to take place. For example, theyonly presented the system with a single concept to learn ata time (e.g. a single object in the scene), such that it al-ways knows which concept is to be learned (Roy et al. 1999,Steels et al. 2002, Kumar et al. 2014, Parde et al. 2015). Inother cases, certain hard-coded knowledge was provided ini-tially (e.g. the colours, or directions), which the system usedas fundamental basis to expand its knowledge (Siskind 1996,Dominey et al. 2005, Sridhar et al. 2010, Dubba et al. 2014,Yu et al. 2015).In this paper, we present a novel system that uses a morerelaxed set of assumptions, and does not use any hard-codedknowledge in any feature space. Yet we show that our systemis still capable of learning about language and vision, fromlinguistic and visual inputs, such as video sequences withtextual descriptions. The main goal is to learn natural lan-guage words and their representation in visual domains (e.g.the word blue is represented by a subset of the colour fea-ture space). We will refer to the words that have visual rep-resentations as concrete linguistic concepts (e.g. the wordblue has a representation in the colour space, therefore, blueis a concrete linguistic concept). We will refer to these vi-sual representations as visual concepts (e.g. the blue colourin the colour feature space is a visual concept). Finally, wewill use the term groundings to refer to the connections be-tween the different linguistic concepts and visual concepts.The word ‘concrete’ in the concrete linguistic concepts isused to distinguish it from abstract concepts (e.g. love, hate,real numbers). In this paper we will focus on learning con-crete concepts only, so we will omit ‘concrete’ in the sequelsince no confusion will arise.Connecting Language and Vision FrameworkThe architecture of the learning framework can be sum-marised in the following steps (i) the system receives lin-guistic and visual inputs, a video and a sentence describingit, (ii) the inputs are used to generate candidates for both vi-sual concepts and linguistic concepts, (iii) these candidatesare used to build all possible hypotheses that might groundlanguage and vision, (iv) the system tests the hypotheses,and uses the accepted hypotheses to learn about languageand vision. Steps (i,ii) are discussed in §Visual-LinguisticRepresentation of the World, and steps (iii,iv) are discussedin §Connecting Language and Vision.AssumptionsWe assume no hard-coded knowledge is given in any featurespace, but we make a number of assumptions that help thesystem in attaining its ambitious goal to connect languageand vision. In order to focus on the learning and groundingissues rather than on basic vision processing, we assume thatour system is capable of distinguishing and tracking objectsin the world, and is capable of computing the basic percep-tual properties: colour, shape, and location. We also assume(at least for the location feature), that the camera is static(so location values refer to the same position across frames).Also, since it will be helpful to segment each video into anumber of intervals, we make the assumption that whetherthe values in an object feature space are changing or not pro-vide suitable segmentation points.Visual-Linguistic Representation of the WorldThe system receives as inputs (i) a video sequence (withobjects already tracked), and (ii) a sentence describing thevideo, with upper case letters changed to lower case andpunctuation characters removed (as these would not be ex-plicitly present in spoken language). Both inputs are rep-resented in a way that allows for efficient and incrementallearning as will be discussed in the following sections.Linguistic Input RepresentationThe linguistic inputs are represented as n-grams. An n-gramis a sequence of n consecutive words. These n-grams are ex-tracted from the input sentence, and are used as candidatesfor linguistic concepts as shown in Fig. 1; the term ‘can-didates’ is used to indicate that these n-grams have not yetbeen connected to a visual concept.Visual Input RepresentationThe system receives a video sequence as a visual input, fromwhich it extracts visual information about the different (i)objects (colours, shapes, and locations), and (ii) actions (in-tervals). A mixture of Gaussian models is used to abstractand represent the information from the different objects’ fea-tures, and a number of intervals are used to represent the ac-tions (as shown in the visual concepts in Fig. 1). We willdiscuss these two representations further below.Objects’ Representation: Three pre-defined visual fea-tures are computed for the objects in the input video:colours, shapes, and locations (These features are just ex-amples; many further features could be added easily). It isassumed that the system has no pre-given knowledge in anyof these feature spaces. The features we use are as follows:1. colour : Object → [0, 360) × [0, 1] × [0, 1]; colour(o)gives a hue, saturation, and value (HSV) colour value perpixel.2. shape : Object → Rh; shape(o) gives a histogram oforiented gradients (HOG) values, with a size of 7200 (30by 30 pixels with 8 directions per pixel).3. loc : Object→ R×R×R; loc(o) gives an x,y,z coordi-nate location with respect to the system’s base frame.Figure 1: The video sequence and description are used toextract all possible visual and linguistic concept candidates.Actions’ Representations: In this paper, we say that anaction has happened if there is a change in the state of anobject (e.g. the action pick up is defined by a change in thelocation feature of an object). The changes in a video arerepresented using intervals. We divide the video sequenceinto intervals based on whether an object feature is chang-ing or not; the changes are measured across all the featuresfor an object. For example, if an object was static and thenstarted moving, we segment the video at this point; the firstsegment for the interval during which the object is static, andthe second during which it is moving. The system uses theseintervals as a way to represent the different actions that hap-pened in a video. This representation can be applicable toa wide variety of basic actions (verbs), such as (pick up, putdown, move, place, shift, take, remove) which are manifestedby changes in the location feature, (paint) by a change in thecolour feature, or (cut) by a change in the shape feature. Inthis paper, we aim to learn (i) the representations of the dif-ferent actions in vision, and (ii) the words that describe thesedifferent actions in natural language.Connecting Language and VisionTo learn how to connect the candidate linguistic and visualconcepts, we use an approach similar to that presented inHebbian theory, which can be summarized by the phrase:“Cells that fire together, wire together” (Schatz 1992). Thisidea can be translated into our system as: candidate visualand linguistic concepts that appear together, are connectedtogether. As an example, the 1-gram red and the colour redwill appear consistently together more often than the 1-gramred and the colour green. Based on this idea, the system usesthe inputs (linguistic descriptions and videos) to find the can-didate concepts with the strongest associations.The system will create one to many mapping betweeneach candidate linguistic concept and candidate visual con-cepts (e.g. the 1-gram red is associated with a high prob-ability to a subset of the colour space associated with thecolour red, and other visual concepts). In order to find outwhich of these mappings are correct, the system tests the va-lidity of each one of them. The correct mappings will thenbe used to learn about language and vision. The validationand learning procedures are done in an incremental way byprocessing each video and description individually. Theseprocedures can be described by the following three steps: (i)Compute the strength of the association links between thecandidate visual and linguistic concepts, (ii) test the valid-ity of the visual-linguistic associations using the language-vision matching test described bellow, and finally (iii) usethe concepts that pass the test to update the system’s knowl-edge about language and vision.1) Associating Candidate ConceptsTo determine which candidate concepts should be connectedtogether, we follow the frequentist approach (Everitt andSkrondal 2002). We keep track of the frequency at whicheach candidate visual and linguistic concept appears indi-vidually in all videos, and the frequency with which the twoappear together. The system uses these frequencies to com-pute the conditional probabilities that associate each candi-date linguistic concept with a candidate visual concept.In our incremental learning process, the system is intro-duced to new candidate concepts over time (e.g. new n-grams, colours, actions, etc). When this happens, new con-cepts that are seen for the first time should be created, andthe ones that have been seen before should have their fre-quencies updated. In order to update or create new candidatevisual concepts (Gaussian component), we use an Incremen-tal Gaussian Mixture Model (IGMM) approach (Song andWang 2005) to merge or create new candidate concepts.In order to find which candidate concepts have the highestassociation between them, we use their frequencies to com-pute the conditional probabilities between them. The condi-tional probability between each pair of candidates representsthe strength of associating these pairs together. The com-putation of this conditional probability is shown in Eq. 1,where l is a candidate linguistic concept, v is a candidate vi-sual concept, Fl is the frequency at which l appeared in allthe videos processed so far, Fvl is the frequency of seeingboth l and v together in the same video in all videos so far.This probability function is computed between every pair ofcandidate linguistic and visual concepts.P (v|l) = FvlFl(1)2) Language-Vision Matching Test (LVMT)For each new video, once the frequencies of the candidateconcepts have been updated, the system tests which of thecandidates are correct. At this stage, the system has a setof the strongest associations between candidate concepts. Inthe absence of a supervising teacher, the system needs to val-idate the correctness of these associations in an unsupervisedway using a Language Vision Matching Test (LVMT) whichwe have developed for this purpose. This is done by compar-ing the input video with multiple synthesized virtual videos,where each virtual video reflects a different association. Forexample, if the system does not know what the 1-gram redmeans, but it found that it has 3 potential strong associationswith the colours red, blue and the location bottom left cor-ner. The system generates 3 virtual videos that reflect theseassociations as shown in Fig. 2 and checks which of thesevirtual videos match the input video. Since substituting the1-gram red with the colour red leads to a match with the in-put video, the system accepts the grounding (1-gram red↔red colour Gaussian component) as the correct grounding.Figure 2: Generating different virtual videos from differentgroundings of the 1-gram red for the LVMT.Experimental ValidationTo validate our system, we used the Train Robots datasetwhich was designed to develop systems capable of under-standing verbal spatial commands described in a naturalway (Dukes 2013). Non-expert users were asked to anno-tate appropriate commands to 1000 pairs of different scenes.Each scene pair is represented by an initial and desired goalconfiguration; we automatically animated these to producevideos. 7752 commands were collected using Amazon Me-chanical Turk describing the 1000 scenes. We also translatedall the commands from English to Arabic, particular carewas taken on not to alter any command or change any mis-takes in any of them. An example of the dataset if shown inFig. 3. The original dataset along with the videos and trans-lated commands can be found at http://doi.org/10.5518/32.Figure 3: An Example from the Train Robots dataset, theArabic sentence is translated from the English one.Evaluation and ResultsWe evaluated the performance of our system based on itsability to acquire correct visual-linguistic groundings. TheTrain Robots dataset contains 20 different visual conceptsexpressible in our chosen feature spaces (e.g. the colourblue, the shape cube, etc) which our system managed tolearn all of them correctly. It also has 71 English and 91 Ara-bic linguistic concepts (which map to visual features in ourchosen feature spaces), from which the system managed tolearn 62 (87.3%) and 80 (87.9%) concepts respectively. Ta-ble 1 shows these results in more detail and some examplesof the learned concepts are shown in Fig. 4.Grounding Visual and Linguistic ConceptsEnglish ArabicVisual Features Linguistic Visual Linguistic VisualColours 15/16 8/8 30/31 8/8Shapes 18/18 4/4 22/24 4/4Locations 17/17 5/5 16/16 5/5Actions 12/20 3/3 12/20 3/3Table 1: The results of grounding visual and linguistic con-cepts in both English and Arabic. The numbers in each col-umn (A/B) mean A correctly acquired concepts out of Bavailable concepts.Figure 4: A sample of the learned English Linguist Conceptsand their perceptual representations (Visual Concepts). Thedifferent images show the visual concepts learned, and thewords next to each image show the linguistic concept asso-ciated with that visual concept.Conclusion and Future WorkWe have demonstrated for the first time that a system cansimultaneously learn about object features and actions byconnecting language and vision. The segmentation of videosbased on feature space changes corresponding to actions isalso a key contribution of the paper, acting as an intermedi-ary representation between the continuous perceptual space,and the purely symbolic linguistic structures. We plan to ex-tend our system to learn: (1) relations between objects suchas (distance, direction); (2) grammar rules that govern thesentence structure; (3) collective words such as (tower, pile);(4) comparative and superlative relations such as (further,furthest); and (5) higher arity relations such as (between).AcknowledgmentsWe thank colleagues in the School of Computing Roboticslab and in the STRANDS project consortium (http://strands-project.eu) for their valuable comments. We also acknowl-edge the financial support provided by EU FP7 project600623 (STRANDS).ReferencesDominey, P. F., and Boucher, J.-D. 2005. DevelopmentalStages of Perception and Language Acquisition in a Per-ceptually Grounded Robot. Cognitive Systems Research6(3):243–259.Dubba, K. S.; De Oliveira, M. R.; Lim, G. H.; Kasaei, H.;Lopes, L. S.; Tomé, A.; and Cohn, A. G. 2014. Ground-ing Language in Perception for Scene Conceptualization inAutonomous Robots. In Qualitative Representations forRobots: Papers from the AAAI Spring Symposium. Techni-cal report, 26–33.Dukes, K. 2013. Train Robots: A Dataset for NaturalLanguage Human-Robot Spatial Interaction through VerbalCommands. In International Conference on Social Robotics(ICSR). Embodied Communication of Goals and IntentionsWorkshop.Everitt, B. S., and Skrondal, A. 2002. The Cambridge dictio-nary of statistics. University of Cambride Press: Cambridge.Kumar, S.; Dhiman, V.; and Corso, J. J. 2014. Learn-ing Compositional Sparse Models of Bimodal Percepts. InTwenty-Eighth AAAI Conference on Artificial Intelligence.Parde, N.; Hair, A.; Papakostas, M.; Tsiakas, K.; Dagioglou,M.; Karkaletsis, V.; and Nielsen, R. D. 2015. Grounding theMeaning of Words through Vision and Interactive Game-play. Proc IJCAI 2015.Roy, D.; Schiele, B.; and Pentland, A. 1999. LearningAudio-Visual Associations using Mutual Information. InIntegration of Speech and Image Understanding, 1999. Pro-ceedings, 147–163. IEEE.Schatz, C. J. 1992. The Developing Brain. Scientific Amer-ican 267(3):60–67.Siskind, J. M. 1996. A Computational Study of Cross-Situational Techniques for Learning Word-to-MeaningMappings. Cognition 61(1):39–91.Song, M., and Wang, H. 2005. Highly Efficient Incremen-tal Estimation of Gaussian Mixture Models for Online DataStream Clustering. In Defense and Security, 174–183. In-ternational Society for Optics and Photonics.Sridhar, M.; Cohn, A. G.; and Hogg, D. C. 2010. Unsuper-vised Learning of Event Classes from Video. In Proceedingsof the Twenty-Fourth AAAI Conference on Artificial Intelli-gence, 1631–1638. AAAI Press.Steels, L., and Kaplan, F. 2002. Aibo’s First Words: TheSocial Learning of Language and Meaning. Evolution ofCommunication 4(1):3–32.Yu, H.; Siddharth, N.; Barbu, A.; and Siskind, J. M. 2015.A Compositional Framework for Grounding Language In-ference, Generation, and Acquisition in Video. Journal ofArtificial Intelligence Research 601–713.

Unsupervised grounding of textual descriptions of object features and actions in video

https://repository.mdx.ac.uk/download/b3df0b21863eb885850916d0fb43a1c0ba1163991914abdc5b02a70b47cdd888/940825/Alomari%20%281%29.pdf

Unsupervised grounding of textual descriptions of object features and actions in video

Abstract

Similar works

Full text

Available Versions

Middlesex University Research Repository