Jointly Modeling Embedding and Translation to Bridge Video and Language

Author: Houqiang Li
Tao Mei
Ting Yao
Yingwei Pan
Yong Rui
Publication date: 04/06/2015

Automatically describing video content with natural language is a fundamental challenge of multimedia. Recurrent Neural Networks (RNNs), which model sequence dynamics, have attracted increasing attention for visual interpretation. However, most existing approaches generate each word locally, given the previous words and the visual content, while the relationship between sentence semantics and visual content is not holistically exploited. As a result, the generated sentences may be contextually correct while their semantics (e.g., subjects, verbs or objects) are wrong. This paper presents a novel unified framework, named Long Short-Term Memory with visual-semantic Embedding (LSTM-E), which simultaneously explores the learning of LSTM and visual-semantic embedding. The former aims to locally maximize the probability of generating the next word given the previous words and the visual content, while the latter creates a visual-semantic embedding space that enforces the relationship between the semantics of the entire sentence and the visual content. The proposed LSTM-E consists of three components: 2-D and/or 3-D deep convolutional neural networks for learning a powerful video representation, a deep RNN for generating sentences, and a joint embedding model for exploring the relationships between visual content and sentence semantics. Experiments on the YouTube2Text dataset show that LSTM-E achieves the best performance reported to date in generating natural sentences: 45.3% and 31.0% in terms of BLEU@4 and METEOR, respectively. We also demonstrate that LSTM-E is superior to several state-of-the-art techniques in predicting Subject-Verb-Object (SVO) triplets.

arXiv.org e-Print Archive

CiteSeerX

Crossref

Neural Task Programming: Learning to Generalize Across Hierarchical Tasks

Author: Fei-Fei Li
Gao Julian
Garg Animesh
Nair Suraj
Savarese Silvio
Xu Danfei
Zhu Yuke
Publication date: 14/03/2018

In this work, we propose a novel robot learning framework called Neural Task Programming (NTP), which bridges the ideas of few-shot learning from demonstration and neural program induction. NTP takes as input a task specification (e.g., a video demonstration of a task) and recursively decomposes it into finer sub-task specifications. These specifications are fed to a hierarchical neural program, where bottom-level programs are callable subroutines that interact with the environment. We validate our method in three robot manipulation tasks. NTP achieves strong generalization across sequential tasks that exhibit hierarchical and compositional structures. The experimental results show that NTP learns to generalize well towards unseen tasks with increasing lengths, variable topologies, and changing objectives. Comment: ICRA 201

arXiv.org e-Print Archive

Crossref

Caltech Authors

Semantically Consistent Regularization for Zero-Shot Recognition

Author: Morgado Pedro
Vasconcelos Nuno
Publication date: 10/04/2017

The role of semantics in zero-shot learning is considered. The effectiveness of previous approaches is analyzed according to the form of supervision provided. While some learn semantics independently, others only supervise the semantic subspace explained by training classes. Thus, the former is able to constrain the whole space but lacks the ability to model semantic correlations. The latter addresses this issue but leaves part of the semantic space unsupervised. This complementarity is exploited in a new convolutional neural network (CNN) framework, which proposes the use of semantics as constraints for recognition. Although a CNN trained for classification has no transfer ability, this can be encouraged by learning a hidden semantic layer together with a semantic code for classification. Two forms of semantic constraints are then introduced. The first is a loss-based regularizer that introduces a generalization constraint on each semantic predictor. The second is a codeword regularizer that favors semantic-to-class mappings consistent with prior semantic knowledge while allowing these to be learned from data. Significant improvements over the state-of-the-art are achieved on several datasets. Comment: Accepted to CVPR 201

arXiv.org e-Print Archive

Crossref

Orthographic facilitation in oral vocabulary acquisition

Author: Allerup P.
Bates T. C.
Beck I.
Bowers J.
Bowey J. A.
Burt J. S.
Caravolas M.
Castles A.
Cossu G.
Cunningham A. E.
Ehri L. C.
Gaskell M. G.
Harm M.
Hu C. F.
Hulme C.
Kessler B.
Kirk R.
Laws G.
McKague M.
McKay A.
Nation K.
Nelson J. R.
Perfetti C. A.
Ratcliff R.
Reitsma P.
Ricketts J.
Roch M.
Rosenthal J.
Schneider W.
Seidenberg M. S.
Share D. L.
Torgesen J.
Ventura P.
Wechsler D.
Ziegler J. C.
Publication venue: 'Informa UK Limited'
Publication date: 01/01/2009

An experiment investigated whether exposure to orthography facilitates oral vocabulary learning. A total of 58 typically developing children aged 8-9 years were taught 12 nonwords. Children were trained to associate novel phonological forms with pictures of novel objects. Pictures were used as referents to represent novel word meanings. For half of the nonwords, children were additionally exposed to orthography, although they were not alerted to its presence, nor were they instructed to use it. After this training phase, a nonword-picture matching posttest was used to assess learning of nonword meaning, and a spelling posttest was used to assess learning of nonword orthography. Children showed robust learning of novel spelling patterns after incidental exposure to orthography. Further, we observed stronger learning for nonword-referent pairings trained with orthography. The degree of orthographic facilitation observed in posttests was related to children's reading levels, with more advanced readers showing more benefit from the presence of orthography.

Central Archive at the University of Reading

Crossref

Royal Holloway - Pure

Warwick Research Archives Portal Repository

Oxford University Research Archive

Mind the Gap: Another look at the problem of the semantic gap in image retrieval

Author: Enser Peter G. B.
Hare Jonathon S.
Lewis Paul H.
Sandom Christine J.
Publication date: 01/01/2006

This paper attempts to review and characterise the problem of the semantic gap in image retrieval and the attempts being made to bridge it. In particular, we draw from our own experience in user queries, automatic annotation and ontological techniques. The first section of the paper describes a characterisation of the semantic gap as a hierarchy between the raw media and full semantic understanding of the media's content. The second section discusses real users' queries with respect to the semantic gap. The final sections of the paper describe our own experience in attempting to bridge the semantic gap. In particular, we discuss our work on auto-annotation and semantic-space models of image retrieval, which bridge the gap from the bottom up, and the use of ontologies, which capture more semantics than keyword object labels alone, as a technique for bridging the gap from the top down.

Southampton (e-Prints Soton)

Multimodal Visual Concept Learning with Weakly Supervised Techniques

Author: Bouritsas Giorgos
Koutras Petros
Maragos Petros
Zlatintsi Athanasia
Publication date: 04/04/2018

Despite the availability of a huge amount of video data accompanied by descriptive texts, it is not always easy to exploit the information contained in natural language in order to automatically recognize video concepts. Towards this goal, in this paper we use textual cues as a means of supervision, introducing two weakly supervised techniques that extend the Multiple Instance Learning (MIL) framework: the Fuzzy Sets Multiple Instance Learning (FSMIL) and the Probabilistic Labels Multiple Instance Learning (PLMIL). The former encodes the spatio-temporal imprecision of the linguistic descriptions with Fuzzy Sets, while the latter models different interpretations of each description's semantics with Probabilistic Labels, both formulated through a convex optimization algorithm. In addition, we provide a novel technique, based on semantic similarity computations, to extract weak labels in the presence of complex semantics. We evaluate our methods on two distinct problems, namely face and action recognition, in the challenging and realistic setting of movies accompanied by their screenplays, contained in the COGNIMUSE database. We show that, on both tasks, our method considerably outperforms a state-of-the-art weakly supervised approach, as well as other baselines. Comment: CVPR 201

arXiv.org e-Print Archive

Crossref