
    Learning semantic representations through multimodal graph neural networks

    Graduation Project (Licenciatura in Mechatronics Engineering), Instituto Tecnológico de Costa Rica, Área Académica de Ingeniería Mecatrónica, 2021.
    To provide semantic knowledge about the objects that robotic systems will interact with, the problem of learning semantic representations from the language and vision modalities must be addressed. Semantic knowledge refers to conceptual information, including semantic (meaning) and lexical (word) information, and it provides the basis for many of our everyday non-verbal behaviors. It is therefore necessary to develop methods that enable robots to process sentences in a real-world environment, so this project introduces a novel approach that uses Graph Convolutional Networks to learn grounded meaning representations of words. The proposed model consists of a first layer that encodes unimodal representations and a second layer that integrates these unimodal representations into one, learning a single representation from both modalities. Experimental results show that the proposed model outperforms the state of the art in semantic similarity and that it can simulate human similarity judgments. To the best of our knowledge, this approach is novel in its use of Graph Convolutional Networks to enhance the quality of word representations.
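
    The abstract describes the architecture only at a high level. A minimal PyTorch sketch of that two-layer idea is given below: a first graph-convolution layer encodes each modality (language, vision) separately, and a second layer integrates the unimodal outputs into one multimodal word embedding. The class names, dimensions, and concatenation-based fusion are illustrative assumptions, not the thesis' exact architecture.

        import torch
        import torch.nn as nn

        class GraphConv(nn.Module):
            """One graph-convolution step: H' = relu(A_norm @ H @ W)."""
            def __init__(self, in_dim, out_dim):
                super().__init__()
                self.linear = nn.Linear(in_dim, out_dim, bias=False)

            def forward(self, adj_norm, features):
                # adj_norm: (N, N) normalized adjacency over word nodes
                # features: (N, in_dim) node features
                return torch.relu(adj_norm @ self.linear(features))

        class MultimodalGCN(nn.Module):
            """First layer: unimodal encodings; second layer: multimodal fusion."""
            def __init__(self, text_dim, image_dim, hidden_dim, out_dim):
                super().__init__()
                self.text_gcn = GraphConv(text_dim, hidden_dim)        # language modality
                self.image_gcn = GraphConv(image_dim, hidden_dim)      # vision modality
                self.fusion_gcn = GraphConv(2 * hidden_dim, out_dim)   # integration layer

            def forward(self, adj_norm, text_feats, image_feats):
                h_text = self.text_gcn(adj_norm, text_feats)
                h_image = self.image_gcn(adj_norm, image_feats)
                # Concatenate the unimodal encodings and integrate them with a
                # second graph convolution to get one embedding per word node.
                return self.fusion_gcn(adj_norm, torch.cat([h_text, h_image], dim=-1))

    Semantic similarity between two words could then be read off as the cosine similarity between the corresponding rows of the fused output, which is the usual way such embeddings are compared against human similarity judgments.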

    Recent Advances of Local Mechanisms in Computer Vision: A Survey and Outlook of Recent Work

    Inspired by the fact that human brains can emphasize discriminative parts of the input and suppress irrelevant ones, a wide range of local mechanisms has been designed to advance computer vision. They can not only focus on target parts to learn discriminative local representations, but also process information selectively to improve efficiency. Local mechanisms have different characteristics depending on the application scenario and paradigm. In this survey, we provide a systematic review of local mechanisms for various computer vision tasks and approaches, including fine-grained visual recognition, person re-identification, few-/zero-shot learning, multi-modal learning, self-supervised learning, Vision Transformers, and so on. The categorization of local mechanisms in each field is summarized. The advantages and disadvantages of every category are then analyzed in depth, leaving room for further exploration. Finally, future research directions for local mechanisms that may benefit future work are discussed. To the best of our knowledge, this is the first survey of local mechanisms in computer vision. We hope that this survey can shed light on future research in the computer vision field.
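
    As a concrete illustration of the kind of mechanism the survey covers, the sketch below shows one generic local mechanism: a learned spatial attention mask that up-weights discriminative regions of a CNN feature map and suppresses the rest. This is a hypothetical PyTorch example, not a method proposed in the survey itself.

        import torch
        import torch.nn as nn

        class SpatialAttention(nn.Module):
            """Emphasize discriminative spatial locations, suppress irrelevant ones."""
            def __init__(self, channels):
                super().__init__()
                # A 1x1 convolution scores each spatial location of the feature map.
                self.score = nn.Conv2d(channels, 1, kernel_size=1)

            def forward(self, feats):
                # feats: (B, C, H, W) feature map from a backbone network
                mask = torch.sigmoid(self.score(feats))  # (B, 1, H, W), values in (0, 1)
                return feats * mask                      # re-weight local regions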

    Multimodal sequential fashion attribute prediction

    We address multimodal product attribute prediction of fashion items based on product images and titles. The product attributes, such as type, sub-type, cut or fit, are in a chain format, with previous attribute values constraining the values of the next attributes. We propose to address this task with a sequential prediction model that can learn to capture the dependencies between the different attribute values in the chain. Our experiments on three product datasets show that the sequential model outperforms two non-sequential baselines on all experimental datasets. Compared to other models, the sequential model is also better able to generate sequences of attribute chains not seen during training. We also measure the contributions of both image and textual input and show that, while text-only models always outperform image-only models, only the multimodal sequential model combining both image and text improves over the text-only model on all experimental datasets.
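
    The abstract sketches the model only at a high level; one plausible way to wire up such a chained predictor is shown below: a GRU decoder, initialized from a fused image+title embedding, predicts the attribute chain (type, sub-type, cut/fit, ...) one value at a time and feeds each predicted value back in so earlier attributes constrain later ones. All names, dimensions, and the greedy decoding are assumptions for illustration, not the paper's actual model.

        import torch
        import torch.nn as nn

        class SequentialAttributePredictor(nn.Module):
            def __init__(self, image_dim, text_dim, hidden_dim, num_attr_values):
                super().__init__()
                self.fuse = nn.Linear(image_dim + text_dim, hidden_dim)  # multimodal fusion
                self.embed = nn.Embedding(num_attr_values, hidden_dim)   # previous attribute value
                self.gru = nn.GRUCell(hidden_dim, hidden_dim)
                self.classify = nn.Linear(hidden_dim, num_attr_values)

            def forward(self, image_feat, text_feat, chain_length, start_token=0):
                # Initialize the decoder state from the fused image + title features.
                h = torch.tanh(self.fuse(torch.cat([image_feat, text_feat], dim=-1)))
                prev = torch.full((image_feat.size(0),), start_token, dtype=torch.long)
                outputs = []
                for _ in range(chain_length):
                    h = self.gru(self.embed(prev), h)  # condition on previous attribute value
                    logits = self.classify(h)          # scores over attribute values
                    prev = logits.argmax(dim=-1)       # greedy choice for the next step
                    outputs.append(logits)
                return torch.stack(outputs, dim=1)     # (B, chain_length, num_attr_values)

    During training one would typically replace the greedy feedback with teacher forcing, feeding in the ground-truth previous attribute value instead of the predicted one.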