
    MPCHAT: Towards Multimodal Persona-Grounded Conversation

    In order to build self-consistent personalized dialogue agents, previous research has mostly focused on textual personas that convey personal facts or personalities. However, to fully capture the multi-faceted nature of persona, the image modality can help better reveal the speaker's personal characteristics and experiences in episodic memory (Rubin et al., 2003; Conway, 2009). In this work, we extend persona-based dialogue to the multimodal domain and make two main contributions. First, we present the first multimodal persona-based dialogue dataset, named MPCHAT, which extends persona with both text and images to contain episodic memories. Second, we empirically show that incorporating multimodal persona, as measured by three proposed multimodal persona-grounded dialogue tasks (i.e., next response prediction, grounding persona prediction, and speaker identification), leads to statistically significant performance improvements across all tasks. Thus, our work highlights that multimodal persona is crucial for improving multimodal dialogue comprehension, and our MPCHAT serves as a high-quality resource for this research.
    Comment: Accepted at ACL 202
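    To make the notion of a multimodal persona concrete, the sketch below shows one plausible way such an example could be represented in code. The field names and structure are purely illustrative assumptions, not the actual MPCHAT schema.

```python
from dataclasses import dataclass, field
from typing import List

# Hypothetical representation of a multimodal persona element and a dialogue
# example; names are illustrative, not the actual MPCHAT data format.

@dataclass
class PersonaElement:
    text: str            # sentence describing an episodic memory
    image_path: str      # image grounding that memory

@dataclass
class DialogueExample:
    context: List[str]                       # preceding utterances
    response: str                            # gold next response
    persona: List[PersonaElement] = field(default_factory=list)

example = DialogueExample(
    context=["Went hiking again this weekend!", "Nice, where did you go?"],
    response="Back to the same ridge as last fall -- the view never gets old.",
    persona=[PersonaElement("I hiked a ridge trail last fall.", "memories/ridge.jpg")],
)
print(example.response)
```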

    Match me if you can: Semantic Correspondence Learning with Unpaired Images

    Recent approaches for semantic correspondence have focused on obtaining high-quality correspondences using complicated networks that refine ambiguous or noisy matching points. Despite their performance improvements, they remain constrained by limited training pairs due to costly point-level annotations. This paper proposes a simple yet effective method that trains with unlabeled pairs to complement both limited image pairs and sparse point pairs, requiring neither extra labeled keypoints nor trainable modules. We fundamentally extend the data quantity and variety by augmenting new unannotated pairs that are not originally provided as training pairs in benchmarks. Using a simple teacher-student framework, we provide reliable pseudo correspondences to the student network via machine supervision. Finally, the performance of our network is steadily improved by the proposed iterative training, which reuses the student as a teacher to generate refined labels and trains a new student repeatedly. Our models outperform the milestone baselines, including state-of-the-art methods, on semantic correspondence benchmarks.
    Comment: 12 page
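    The loop below is a minimal sketch of the iterative teacher-student idea described in the abstract: a frozen teacher produces pseudo correspondences on unlabeled pairs, only confident matches are kept, a student is trained on them, and the trained student becomes the next teacher. It is not the authors' code; `MatchingNet`, the confidence threshold, and the loss are placeholder assumptions.

```python
import copy
import torch

class MatchingNet(torch.nn.Module):
    """Toy matcher: cosine-similarity scores between projected features."""
    def __init__(self):
        super().__init__()
        self.proj = torch.nn.Linear(128, 128)

    def forward(self, feats_src, feats_tgt):
        a = torch.nn.functional.normalize(self.proj(feats_src), dim=-1)
        b = torch.nn.functional.normalize(self.proj(feats_tgt), dim=-1)
        return a @ b.transpose(-1, -2)            # (N_src, N_tgt) score matrix

def train_round(teacher, student, unlabeled_pairs, conf_thresh=0.8, lr=1e-4):
    opt = torch.optim.Adam(student.parameters(), lr=lr)
    for feats_src, feats_tgt in unlabeled_pairs:
        with torch.no_grad():
            scores = teacher(feats_src, feats_tgt)
            conf, pseudo_tgt = scores.softmax(dim=-1).max(dim=-1)
            keep = conf > conf_thresh             # keep only reliable pseudo matches
        if keep.sum() == 0:
            continue
        logits = student(feats_src, feats_tgt)
        loss = torch.nn.functional.cross_entropy(logits[keep], pseudo_tgt[keep])
        opt.zero_grad()
        loss.backward()
        opt.step()
    return student

# Iterative training: the student of round t becomes the teacher of round t+1.
teacher = MatchingNet()
unlabeled_pairs = [(torch.randn(32, 128), torch.randn(32, 128)) for _ in range(4)]
for _ in range(3):
    student = train_round(teacher, MatchingNet(), unlabeled_pairs)
    teacher = copy.deepcopy(student)
```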

    Neural Relation Graph: A Unified Framework for Identifying Label Noise and Outlier Data

    Diagnosing and cleaning data is a crucial step for building robust machine learning systems. However, identifying problems within large-scale datasets with real-world distributions is challenging due to the presence of complex issues such as label errors, under-representation, and outliers. In this paper, we propose a unified approach for identifying problematic data by utilizing a largely ignored source of information: the relational structure of data in the feature-embedded space. To this end, we present scalable and effective algorithms for detecting label errors and outlier data based on the relational graph structure of the data. We further introduce a visualization tool that provides contextual information about a data point in the feature-embedded space, serving as an effective aid for interactively diagnosing data. We evaluate the label error and outlier/out-of-distribution (OOD) detection performance of our approach on large-scale image, speech, and language tasks, including ImageNet, ESC-50, and MNLI. Our approach achieves state-of-the-art detection performance on all tasks considered and demonstrates its effectiveness in debugging large-scale real-world datasets across various domains.
    Comment: preprin
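    As a rough illustration of how relational structure in feature space can expose label errors (not the paper's actual algorithm), the sketch below scores each sample by how often its nearest neighbours disagree with its own label; the choice of k and the neighbourhood construction are placeholder assumptions.

```python
import numpy as np

def label_error_scores(features, labels, k=10):
    """Score each sample by the fraction of its k nearest neighbours
    (cosine similarity in feature space) that carry a different label."""
    feats = features / np.linalg.norm(features, axis=1, keepdims=True)
    sims = feats @ feats.T
    np.fill_diagonal(sims, -np.inf)               # exclude self-matches
    nn_idx = np.argsort(-sims, axis=1)[:, :k]     # indices of k nearest neighbours
    disagree = labels[nn_idx] != labels[:, None]
    return disagree.mean(axis=1)                  # 1.0 = all neighbours disagree

# Toy usage: two well-separated clusters with one flipped label.
rng = np.random.default_rng(0)
feats = np.vstack([rng.normal(1.0, 0.1, (50, 16)), rng.normal(-1.0, 0.1, (50, 16))])
labels = np.array([0] * 50 + [1] * 50)
labels[10] = 1                                    # inject a label error
scores = label_error_scores(feats, labels, k=5)
print(int(np.argmax(scores)))                     # likely prints 10, the corrupted sample
```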

    SeiT: Storage-Efficient Vision Training with Tokens Using 1% of Pixel Storage

    We need billion-scale images to achieve more generalizable and ground-breaking vision models, as well as massive dataset storage to ship the images (e.g., the LAION-4B dataset needs 240TB of storage space). However, it has become challenging to deal with unlimited dataset storage on limited storage infrastructure. A number of storage-efficient training methods have been proposed to tackle the problem, but they are rarely scalable or suffer from severe performance degradation. In this paper, we propose a storage-efficient training strategy for vision classifiers on large-scale datasets (e.g., ImageNet) that uses only 1024 tokens per instance without the raw pixels; our token storage needs less than 1% of the space of the original JPEG-compressed raw pixels. We also propose token augmentations and a Stem-adaptor module so that our approach can use the same architecture as pixel-based approaches, with only minimal modifications to the stem layer and carefully tuned optimization settings. Our experimental results on ImageNet-1k show that our method significantly outperforms other storage-efficient training methods by a large margin. We further show the effectiveness of our method in other practical scenarios, storage-efficient pre-training, and continual learning. Code is available at https://github.com/naver-ai/seit
    Comment: ICCV 2023; First two authors contributed equally; code url: https://github.com/naver-ai/seit; 17 pages, 1.2M
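    To illustrate the token-based training idea (a rough sketch under assumptions, not the released SeiT code), the example below trains a classifier directly on pre-computed discrete tokens, with a small stem adaptor that embeds token ids and reshapes them into the feature map an otherwise unchanged backbone expects; the vocabulary size, token grid, and backbone are placeholder choices.

```python
import torch
import torch.nn as nn

VOCAB_SIZE = 8192          # codebook size of the (assumed) image tokenizer
TOKENS_PER_IMAGE = 1024
NUM_CLASSES = 1000

class StemAdaptor(nn.Module):
    """Replace the pixel stem: embed token ids and reshape to a feature map."""
    def __init__(self, dim=64):
        super().__init__()
        self.embed = nn.Embedding(VOCAB_SIZE, dim)

    def forward(self, tokens):                    # tokens: (B, 1024) int64
        x = self.embed(tokens)                    # (B, 1024, dim)
        b, n, d = x.shape
        side = int(n ** 0.5)                      # 1024 tokens -> 32x32 grid
        return x.transpose(1, 2).reshape(b, d, side, side)

class TokenClassifier(nn.Module):
    def __init__(self):
        super().__init__()
        self.stem = StemAdaptor(dim=64)
        self.backbone = nn.Sequential(            # stand-in for a ViT/ResNet body
            nn.Conv2d(64, 128, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(128, NUM_CLASSES),
        )

    def forward(self, tokens):
        return self.backbone(self.stem(tokens))

model = TokenClassifier()
tokens = torch.randint(0, VOCAB_SIZE, (2, TOKENS_PER_IMAGE))   # stored instead of pixels
print(model(tokens).shape)                        # torch.Size([2, 1000])
```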

    Cream: Visually-Situated Natural Language Understanding with Contrastive Reading Model and Frozen Large Language Models

    Advances in Large Language Models (LLMs) have inspired a surge of research exploring their expansion into the visual domain. While recent models exhibit promise in generating abstract captions for images and conducting natural conversations, their performance on text-rich images leaves room for improvement. In this paper, we propose the Contrastive Reading Model (Cream), a novel neural architecture designed to enhance the language-image understanding capability of LLMs by capturing intricate details typically overlooked by existing methods. Cream integrates vision and auxiliary encoders, complemented by a contrastive feature alignment technique, resulting in a more effective understanding of textual information within document images. Our approach, thus, seeks to bridge the gap between vision and language understanding, paving the way for more sophisticated Document Intelligence Assistants. Rigorous evaluations across diverse tasks, such as visual question answering on document images, demonstrate the efficacy of Cream as a state-of-the-art model in the field of visual document understanding. We provide our codebase and newly-generated datasets at https://github.com/naver-ai/crea
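    The contrastive feature alignment mentioned above can be pictured with a standard symmetric InfoNCE-style loss between pooled features from a vision encoder and an auxiliary encoder; the snippet below is a hedged sketch of that generic construction, not the Cream implementation, and the dimensions and temperature are illustrative.

```python
import torch
import torch.nn.functional as F

def contrastive_alignment_loss(vision_feats, aux_feats, temperature=0.07):
    """Symmetric InfoNCE loss pulling together vision and auxiliary features
    that come from the same document image (diagonal of the similarity matrix)."""
    v = F.normalize(vision_feats, dim=-1)
    a = F.normalize(aux_feats, dim=-1)
    logits = v @ a.t() / temperature              # (B, B) similarity matrix
    targets = torch.arange(v.size(0))             # matching pairs lie on the diagonal
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))

# Toy usage with random features standing in for encoder outputs.
loss = contrastive_alignment_loss(torch.randn(8, 256), torch.randn(8, 256))
print(loss.item())
```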

    Who Wrote this Code? Watermarking for Code Generation

    Large language models for code have recently shown remarkable performance in generating executable code. However, this rapid advancement has been accompanied by many legal and ethical concerns, such as code licensing issues, code plagiarism, and malware generation, making watermarking machine-generated code a very timely problem. Despite such imminent needs, we discover that existing watermarking and machine-generated text detection methods for LLMs fail to function properly on code generation tasks. Hence, in this work, we propose a new watermarking method, SWEET, that significantly improves upon previous approaches to watermarking machine-generated code. Our proposed method selectively applies watermarking to tokens whose entropy is high enough to surpass a defined threshold. Experiments on code generation benchmarks show that our watermarked code has superior quality compared to code produced by the previous state-of-the-art LLM watermarking method. Furthermore, our watermarking method also outperforms DetectGPT on the task of machine-generated code detection.
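    The snippet below sketches the selective, entropy-gated watermarking idea in simplified form (not the SWEET release): at each generation step, the entropy of the model's next-token distribution is computed, and a green-list logit bias is applied only when that entropy exceeds a threshold, leaving near-deterministic code tokens untouched. The hash-keyed green list, threshold, and bias value are illustrative assumptions.

```python
import torch

def green_mask(prev_token_id, vocab_size, gamma=0.5, seed=0):
    """Pseudo-randomly mark a gamma-fraction of the vocabulary as 'green',
    keyed on the previous token (a common watermarking construction)."""
    g = torch.Generator().manual_seed(seed + int(prev_token_id))
    perm = torch.randperm(vocab_size, generator=g)
    mask = torch.zeros(vocab_size, dtype=torch.bool)
    mask[perm[: int(gamma * vocab_size)]] = True
    return mask

def watermarked_logits(logits, prev_token_id, entropy_threshold=1.5, delta=2.0):
    """Add a bias `delta` to green tokens only if the step is high-entropy."""
    probs = torch.softmax(logits, dim=-1)
    entropy = -(probs * torch.log(probs + 1e-9)).sum()
    if entropy < entropy_threshold:
        return logits                             # low entropy: leave untouched
    biased = logits.clone()
    biased[green_mask(prev_token_id, logits.numel())] += delta
    return biased

# Toy step: a fairly flat (high-entropy) distribution receives the bias,
# so the output differs from the input logits and this prints False.
logits = torch.randn(32000)
print(torch.equal(logits, watermarked_logits(logits, prev_token_id=42)))
```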

    High-performance and scalable metal-chalcogenide semiconductors and devices via chalco-gel routes

    We report a general strategy for obtaining high-quality, large-area metal-chalcogenide semiconductor films from precursors combining chelated metal salts with chalcoureas or chalcoamides. Using conventional organic solvents, such precursors enable the expeditious formation of chalco-gels, which are easily transformed into the corresponding high-performance metal-chalcogenide thin films with large, uniform areas. Diverse metal chalcogenides and their alloys (MQx: M = Zn, Cd, In, Sb, Pb; Q = S, Se, Te) are successfully synthesized at relatively low processing temperatures (<400°C). The versatility of this scalable route is demonstrated by the fabrication of large-area thin-film transistors (TFTs), optoelectronic devices, and integrated circuits on a 4-inch Si wafer and 2.5-inch borosilicate glass substrates in ambient air using CdS, CdSe, and In₂Se₃ active layers. The CdSe TFTs exhibit a maximum field-effect mobility greater than 300 cm² V⁻¹ s⁻¹ with an on/off current ratio of >10⁷ and good operational stability (threshold voltage shift < 0.5 V at a positive gate bias stress of 10 ks). In addition, metal chalcogenide-based phototransistors with a photodetectivity of >10¹³ Jones and seven-stage ring oscillators operating at a speed of ∼2.6 MHz (propagation delay of <27 ns per stage) are demonstrated. © 2018 The Authors.