
    Evidence of Human-Like Visual-Linguistic Integration in Multimodal Large Language Models During Predictive Language Processing

    The advanced language processing abilities of large language models (LLMs) have stimulated debate over their capacity to replicate human-like cognitive processes. One differentiating factor between language processing in LLMs and humans is that human language input is typically grounded in several perceptual modalities, whereas most LLMs process solely text-based information. Multimodal grounding allows humans to integrate, for example, visual context with linguistic information and thereby place constraints on the space of upcoming words, reducing cognitive load and improving comprehension. Recent multimodal LLMs (mLLMs) combine a visual-linguistic embedding space with a transformer-style attention mechanism for next-word prediction. Here we ask whether predictive language processing based on multimodal input in mLLMs aligns with that of humans. Two hundred participants watched short audio-visual clips and estimated the predictability of an upcoming verb or noun. The same clips were processed by the mLLM CLIP, with predictability scores derived by comparing image and text feature vectors. Eye-tracking was used to estimate which visual features participants attended to, and CLIP's visual attention weights were recorded. We find that the alignment of predictability scores was driven by the multimodality of CLIP (no alignment for a unimodal state-of-the-art LLM) and by its attention mechanism (no alignment when attention weights were perturbed or when the same input was fed to a multimodal model without attention). We further find a significant spatial overlap between CLIP's visual attention weights and human eye-tracking data. The results suggest that comparable processes of integrating multimodal information, guided by attention to relevant visual features, support predictive language processing in mLLMs and humans. Comment: 13 pages, 4 figures, submitted to journal
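    As a rough illustration of the scoring step described above, the sketch below computes CLIP-style predictability scores by comparing an image embedding against text embeddings of candidate sentence completions. It is a minimal sketch, not the study's pipeline: the checkpoint name, image file, and candidate sentences are assumptions.

    ```python
    # Minimal sketch: scoring candidate completions against a visual scene
    # with CLIP via the Hugging Face `transformers` package. The checkpoint,
    # image path, and candidate sentences are illustrative assumptions.
    import torch
    from PIL import Image
    from transformers import CLIPModel, CLIPProcessor

    model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
    processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

    image = Image.open("scene_frame.jpg")  # a frame from an audio-visual clip
    candidates = [
        "The woman is peeling an apple.",   # sentence completed with each
        "The woman is peeling a banana.",   # candidate noun
    ]

    inputs = processor(text=candidates, images=image,
                       return_tensors="pt", padding=True)
    with torch.no_grad():
        outputs = model(**inputs)

    # logits_per_image holds scaled cosine similarities between the image
    # embedding and each text embedding; a softmax turns them into relative
    # predictability scores over the candidate set.
    scores = outputs.logits_per_image.softmax(dim=-1)
    print(dict(zip(candidates, scores.squeeze().tolist())))
    ```

    In this setup the softmax yields predictability relative to the candidate set only; the study's analysis of CLIP's visual attention weights is not touched by this sketch.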

    The pain matrix reloaded: a salience detection system for the body

    Neuroimaging and neurophysiological studies have shown that nociceptive stimuli elicit responses in an extensive cortical network including somatosensory, insular and cingulate areas, as well as frontal and parietal areas. This network, often referred to as the "pain matrix", is viewed as reflecting the activity by which the intensity and unpleasantness of the perception elicited by a nociceptive stimulus are represented. However, recent experiments have reported (i) that pain intensity can be dissociated from the magnitude of responses in the "pain matrix", (ii) that the responses in the "pain matrix" are strongly influenced by the context within which the nociceptive stimuli appear, and (iii) that non-nociceptive stimuli can elicit cortical responses with a spatial configuration similar to that of the "pain matrix". For these reasons, we propose an alternative view of the functional significance of this cortical network, in which it reflects a system involved in detecting, orienting attention towards, and reacting to the occurrence of salient sensory events. This cortical network might represent a basic mechanism through which significant events for the body's integrity are detected, regardless of the sensory channel through which these events are conveyed. This function would involve the construction of a multimodal cortical representation of the body and nearby space. Under the assumption that this network acts as a defensive system signaling potentially damaging threats for the body, emphasis is no longer on the quality of the sensation elicited by noxious stimuli but on the action prompted by the occurrence of potential threats.

    Information and communication technology solutions for outdoor navigation in dementia

    INTRODUCTION: Information and communication technology (ICT) is potentially mature enough to empower outdoor and social activities in dementia. However, current ICT-based devices have limited functionality and impact, mostly restricted to safety. What is an ideal operational framework for advancing this field to support outdoor and social activities? METHODS: Review of the literature and cross-disciplinary expert discussion. RESULTS: A situation-aware ICT requires that stakeholders can flexibly fine-tune system usability against complexity of function, and user safety against autonomy. It should operate through artificial intelligence/machine learning and should reflect harmonized stakeholder values, the social context, and the user's residual cognitive functions. ICT services should be offered at the prodromal stage of dementia and carefully validated within the life space of users in terms of quality of life, social activities, and costs. DISCUSSION: The operational framework has the potential to produce ICT and services with high clinical impact but requires substantial investment.

    Integration of top-down and bottom-up information for audio organization and retrieval


    Learning of local predictable representations in partially learnable environments

    PROPRE is a generic, cortically inspired framework that provides online learning of input/output relationships. The input data flow is projected onto a self-organizing map that provides an internal representation of the current stimulus. From this representation, the system predicts the value of the output target. A predictability measure, based on monitoring the prediction quality, modulates the projection learning so as to favor representations that are helpful for predicting the output. In this article, we study PROPRE when the input/output relationship is defined only in a small subspace of the input space, which we call a partially learnable environment. This problem, while not typical of machine learning, is crucial for developmental robotics: robots face high-dimensional sensory-motor environments in which large areas of the sensory-motor space are not learnable, since a motor action does not have a consequence on every perception at every moment. We show that the use of the predictability measure in PROPRE leads to an autonomous gathering of local representations where the input data are related to the output value, providing good classification performance, since the system learns the input/output function only where it is defined.
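    A minimal sketch of the gating idea described above follows, assuming a toy self-organizing map and linear readout. All sizes, learning rates, and the toy data are illustrative assumptions, and the per-sample gate is a crude stand-in for PROPRE's monitored predictability measure, not the paper's implementation.

    ```python
    # Minimal sketch of predictability-gated representation learning in the
    # spirit of PROPRE. Sizes, rates, and the toy data are illustrative
    # assumptions; this is not the paper's implementation.
    import numpy as np

    rng = np.random.default_rng(0)
    n_units, in_dim, n_classes = 25, 4, 2
    prototypes = rng.normal(size=(n_units, in_dim))  # self-organizing projection
    readout = np.zeros((n_classes, n_units))         # linear output prediction

    for step in range(5000):
        # Half the input space is "learnable": only there does x determine y.
        learnable = rng.random() < 0.5
        x = rng.normal(size=in_dim)
        y = int(x[0] > 0) if learnable else int(rng.integers(n_classes))

        # Internal representation: soft activity over map units.
        dists = np.linalg.norm(prototypes - x, axis=1)
        act = np.exp(-dists ** 2)
        act /= act.sum()

        # Predict the output from the representation (delta-rule readout).
        pred = readout @ act
        correct = float(np.argmax(pred) == y)
        target = np.zeros(n_classes)
        target[y] = 1.0
        readout += 0.1 * np.outer(target - pred, act)

        # Predictability-modulated projection learning: the best-matching
        # prototype moves toward the input only when the prediction
        # succeeded, so prototypes accumulate where inputs predict outputs.
        bmu = int(np.argmin(dists))
        prototypes[bmu] += 0.05 * correct * (x - prototypes[bmu])
    ```

    Because predictions in the non-learnable region succeed only at chance, gated updates concentrate prototypes in the subspace where the input/output function is actually defined, mirroring the paper's qualitative result.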