    Learning language through pictures

    We propose Imaginet, a model for learning visually grounded representations of language from coupled textual and visual input. The model consists of two Gated Recurrent Unit networks with shared word embeddings and uses a multi-task objective: given a textual description of a scene, it concurrently predicts the scene's visual representation and the next word in the sentence. Mimicking an important aspect of human language learning, the model acquires meaning representations for individual words from descriptions of visual scenes. Moreover, it learns to use sequential structure effectively in the semantic interpretation of multi-word phrases.

    Comment: To appear at ACL 2015.
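    To make the architecture concrete, here is a minimal sketch of an Imaginet-style model in PyTorch. It is an illustration under assumptions, not the authors' implementation: the class name, layer sizes, and the loss weighting `alpha` are all hypothetical.

```python
# Minimal sketch of an Imaginet-style model (hypothetical sizes and names).
import torch
import torch.nn as nn

VOCAB, EMBED_DIM, HIDDEN_DIM, IMG_DIM = 10_000, 300, 512, 4096  # assumed

class Imaginet(nn.Module):
    def __init__(self):
        super().__init__()
        # Shared word embeddings feed both pathways.
        self.embed = nn.Embedding(VOCAB, EMBED_DIM)
        # Two GRUs: one grounds the sentence in its image,
        # the other models the next word in the sequence.
        self.visual_gru = nn.GRU(EMBED_DIM, HIDDEN_DIM, batch_first=True)
        self.textual_gru = nn.GRU(EMBED_DIM, HIDDEN_DIM, batch_first=True)
        self.to_image = nn.Linear(HIDDEN_DIM, IMG_DIM)  # predict visual vector
        self.to_vocab = nn.Linear(HIDDEN_DIM, VOCAB)    # predict next word

    def forward(self, tokens):
        e = self.embed(tokens)            # (batch, seq, EMBED_DIM)
        _, h_vis = self.visual_gru(e)     # final hidden state only
        out_txt, _ = self.textual_gru(e)  # hidden state at every step
        img_pred = self.to_image(h_vis.squeeze(0))
        next_word_logits = self.to_vocab(out_txt)
        return img_pred, next_word_logits

def multitask_loss(img_pred, img_true, logits, next_tokens, alpha=0.5):
    # Weighted sum of the visual (MSE) and textual (cross-entropy) objectives.
    visual = nn.functional.mse_loss(img_pred, img_true)
    textual = nn.functional.cross_entropy(
        logits.reshape(-1, VOCAB), next_tokens.reshape(-1))
    return alpha * visual + (1 - alpha) * textual
```

    The shared embedding layer is the point of the multi-task design: gradients from both the visual and the next-word objective flow into the same word representations.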

    TRECVid 2006 experiments at Dublin City University

    In this paper we describe our retrieval system and the experiments performed for the automatic search task in TRECVid 2006. We submitted the following six automatic runs:
    • F A 1 DCU-Base 6: baseline run using only ASR/MT text features.
    • F A 2 DCU-TextVisual 2: run using text and visual features.
    • F A 2 DCU-TextVisMotion 5: run using text, visual, and motion features.
    • F B 2 DCU-Visual-LSCOM 3: text and visual features combined with concept detectors.
    • F B 2 DCU-LSCOM-Filters 4: text, visual, and motion features with concept detectors.
    • F B 2 DCU-LSCOM-2 1: text, visual, motion, and concept detectors with negative concepts.
    The experiments were designed to study the effect of adding motion features and separately constructed semantic concept models to runs using only textual and visual features, and to establish a baseline for the manually-assisted search runs performed within the collaborative K-Space project and described in the corresponding TRECVid 2006 notebook paper. The results indicate that the performance of automatic search can be improved with suitable concept models. This improvement, however, is highly topic-dependent, and the questions of when to include such models, and which ones, remain unanswered. Second, using motion features did not lead to a performance improvement in our experiments. Finally, we observed that our text features, despite their rather poor performance overall, may still be useful even for generic search topics.
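    As an illustration of how runs like these might combine evidence from several features, below is a minimal sketch of weighted late fusion of per-feature retrieval scores. It is a generic technique shown under assumptions, not DCU's actual fusion scheme; the function name, weights, and score dictionaries are hypothetical.

```python
# Minimal sketch of weighted late fusion of per-feature retrieval scores.
def fuse_scores(score_lists, weights):
    """score_lists: dict feature_name -> {shot_id: score};
    weights: dict with the same keys, summing to 1."""
    fused = {}
    for feature, scores in score_lists.items():
        if not scores:
            continue
        lo, hi = min(scores.values()), max(scores.values())
        span = (hi - lo) or 1.0
        for shot, s in scores.items():
            # Min-max normalise each feature before weighting, so features
            # on different scales contribute comparably to the final rank.
            fused[shot] = fused.get(shot, 0.0) + weights[feature] * (s - lo) / span
    # Return shot ids ranked by fused score, best first.
    return sorted(fused, key=fused.get, reverse=True)

ranking = fuse_scores(
    {"text":   {"shot1": 2.1, "shot2": 0.4},
     "visual": {"shot1": 0.2, "shot2": 0.9}},
    {"text": 0.7, "visual": 0.3})
```

    Tuning the per-feature weights is where the topic dependence noted above shows up: a weighting that helps one topic can hurt another.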