
    Contrastive Language-Image Pretrained Models are Zero-Shot Human Scanpath Predictors

    Full text link
    Understanding the mechanisms underlying human attention is a fundamental challenge for both vision science and artificial intelligence. While numerous computational models of free-viewing have been proposed, less is known about the mechanisms underlying task-driven image exploration. To address this gap, we present CapMIT1003, a database of captions and click-contingent image explorations collected during captioning tasks. CapMIT1003 is based on the same stimuli as the well-known MIT1003 benchmark, for which eye-tracking data under free-viewing conditions is available, offering a promising opportunity to study human attention under both conditions concurrently. We make this dataset publicly available to facilitate future research in this field. In addition, we introduce NevaClip, a novel zero-shot method for predicting visual scanpaths that combines contrastive language-image pretrained (CLIP) models with biologically inspired neural visual attention (NeVA) algorithms. NevaClip simulates human scanpaths by aligning the representation of the foveated visual stimulus with the representation of the associated caption, employing gradient-driven visual exploration to generate scanpaths. Our experimental results demonstrate that NevaClip outperforms existing unsupervised computational models of human visual attention in terms of scanpath plausibility, for both captioning and free-viewing tasks. Furthermore, we show that conditioning NevaClip on incorrect or misleading captions leads to random behavior, highlighting the significant impact of caption guidance on the decision-making process. These findings contribute to a better understanding of the mechanisms that guide human attention and pave the way for more sophisticated computational approaches to scanpath prediction that can integrate direct top-down guidance from downstream tasks.
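    The gradient-driven exploration loop can be illustrated with a small sketch. The following is a simplified toy, not the paper's implementation: the Gaussian foveation mask, the greedy per-fixation optimization, and the stand-in linear encoder are illustrative assumptions, whereas the actual method uses pretrained CLIP encoders and the NeVA foveation mechanism.

```python
import torch
import torch.nn.functional as F

def gaussian_fovea(h, w, cx, cy, sigma=0.1):
    # Differentiable foveation mask: a Gaussian blob centred at (cx, cy) in [0, 1]^2.
    ys = torch.linspace(0, 1, h).view(h, 1)
    xs = torch.linspace(0, 1, w).view(1, w)
    return torch.exp(-((xs - cx) ** 2 + (ys - cy) ** 2) / (2 * sigma ** 2))

def predict_scanpath(image, caption_emb, image_encoder, n_fixations=8, steps=20, lr=5e-2):
    # Greedy scanpath: each fixation maximises the cosine similarity between
    # the embedding of the foveated stimulus and the caption embedding.
    _, h, w = image.shape
    cx = torch.tensor(0.5, requires_grad=True)
    cy = torch.tensor(0.5, requires_grad=True)
    scanpath = []
    for _ in range(n_fixations):
        opt = torch.optim.Adam([cx, cy], lr=lr)
        for _ in range(steps):
            opt.zero_grad()
            foveated = image * gaussian_fovea(h, w, cx, cy)
            emb = image_encoder(foveated.flatten())
            loss = -F.cosine_similarity(emb, caption_emb, dim=0)
            loss.backward()
            opt.step()
        scanpath.append((cx.item(), cy.item()))
    return scanpath

# Toy usage with random stand-ins for the CLIP image encoder and caption embedding.
torch.manual_seed(0)
img = torch.rand(3, 64, 64)
encoder = torch.nn.Linear(3 * 64 * 64, 128)
caption_emb = torch.randn(128)
print(predict_scanpath(img, caption_emb, encoder))
```

    The key design point is that the fixation coordinates enter the loss only through the differentiable foveation mask, so the caption embedding can steer the next fixation by gradient ascent; conditioning on a mismatched caption gives gradients that point nowhere consistent, which is why the paper observes near-random scanpaths with misleading captions.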

    Learning in text streams: discovery and disambiguation of entity and relation instances

    No full text
    We consider a scenario where an artificial agent is reading a stream of text composed of a set of narrations, and it is informed about the identity of some of the individuals mentioned in the text portion currently being read. The agent is expected to learn to follow the narrations, disambiguating mentions and discovering new individuals. We focus on the case in which individuals are entities and relations, and we propose an end-to-end trainable memory network that learns to discover and disambiguate them in an online manner, performing one-shot learning and dealing with a small number of sparse supervisions. Our system builds a knowledge base that is not given in advance, and it improves its skills while reading unsupervised text. The model deals with abrupt changes in the narration, taking their effects into account when resolving coreferences. We showcase the strong disambiguation and discovery skills of our model on a corpus of Wikipedia documents and on a newly introduced dataset that we make publicly available.
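    As a rough illustration of the online discover-or-disambiguate behaviour, the sketch below keeps one key vector per discovered individual and compares each incoming mention against the memory. The mention encoding, the similarity threshold, and the slot-update rule are assumptions made for illustration; the paper's memory network is trained end-to-end rather than driven by a fixed threshold.

```python
import numpy as np

class OnlineMemory:
    def __init__(self, dim, threshold=0.8, lr=0.2):
        self.keys = np.empty((0, dim))   # one row per discovered individual
        self.threshold = threshold
        self.lr = lr

    def read(self, mention_vec):
        # Disambiguate a mention against known individuals, or discover a new one.
        v = mention_vec / np.linalg.norm(mention_vec)
        if len(self.keys) > 0:
            sims = self.keys @ v
            best = int(np.argmax(sims))
            if sims[best] >= self.threshold:
                # Disambiguation: nudge the matched slot towards the mention.
                self.keys[best] = (1 - self.lr) * self.keys[best] + self.lr * v
                self.keys[best] /= np.linalg.norm(self.keys[best])
                return best, False
        # Discovery: allocate a fresh slot for a previously unseen individual.
        self.keys = np.vstack([self.keys, v])
        return len(self.keys) - 1, True

# Toy usage: each mention is either matched to a slot or triggers discovery.
mem = OnlineMemory(dim=4)
rng = np.random.default_rng(0)
for vec in rng.normal(size=(6, 4)):
    slot, is_new = mem.read(vec)
    print(slot, "new" if is_new else "known")
```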

    Linguistic Feature Injection for Efficient Natural Language Processing

    No full text
    Transformers have been established as one of the most effective neural approaches to a wide range of Natural Language Processing tasks. However, following the common trend in modern deep architectures, their scale has grown to the point where training such models from scratch is no longer feasible for many organizations. Indeed, despite their strong performance, Transformers have the general drawback of requiring huge amounts of training data, computational resources, and energy to be successfully optimized. For this reason, more recent architectures such as Bidirectional Encoder Representations from Transformers (BERT) rely on unlabeled data to pre-train the model, which is later fine-tuned for a specific downstream task using a relatively small amount of training data. In a similar fashion, this paper considers a plug-and-play framework that can be used to inject multiple syntactic features, such as Part-of-Speech tags or Dependency Parsing structures, into any pre-trained Transformer. This approach makes it possible to perform sequence-to-sequence labeling tasks by exploiting: (i) the (more abundant) available training data, which is also used to learn the syntactic features, and (ii) the language data used to pre-train the Transformer model. The experimental results show that our approach improves over the baseline performance of the underlying model on different datasets, proving the effectiveness of employing syntactic language information for semantic regularization. In addition, our architecture has a substantial efficiency advantage over pure large language models: by using a model of limited size whose input data are enriched with syntactic information, we obtain a significant reduction in CO2 emissions without decreasing prediction performance.
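    A minimal sketch of the injection idea follows, assuming fusion by concatenating a POS-tag embedding with each token embedding and projecting back to the model dimension; the embedding sizes, the small encoder, and the labeling head are illustrative assumptions, whereas the actual framework plugs the features into a full pre-trained Transformer.

```python
import torch
import torch.nn as nn

class SyntaxInjectedEncoder(nn.Module):
    # Toy Transformer for per-token labeling with injected POS-tag features.
    def __init__(self, vocab_size, n_pos_tags, n_labels, d_model=256, d_pos=32):
        super().__init__()
        self.tok_emb = nn.Embedding(vocab_size, d_model)
        self.pos_emb = nn.Embedding(n_pos_tags, d_pos)    # syntactic feature embedding
        self.fuse = nn.Linear(d_model + d_pos, d_model)   # inject by concat + project
        layer = nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)
        self.classifier = nn.Linear(d_model, n_labels)    # sequence-labeling head

    def forward(self, token_ids, pos_ids):
        x = torch.cat([self.tok_emb(token_ids), self.pos_emb(pos_ids)], dim=-1)
        return self.classifier(self.encoder(self.fuse(x)))

# Toy usage: POS tags come from an external tagger and ride along with the tokens.
model = SyntaxInjectedEncoder(vocab_size=1000, n_pos_tags=17, n_labels=5)
tokens = torch.randint(0, 1000, (2, 8))   # batch of 2 sequences, length 8
pos_tags = torch.randint(0, 17, (2, 8))   # one POS tag id per token
print(model(tokens, pos_tags).shape)      # torch.Size([2, 8, 5])
```

    Because the syntactic signal arrives as an extra input feature rather than extra layers, the base model can stay small, which is the source of the efficiency (and CO2) advantage claimed above.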