32 research outputs found

    Embodied Executable Policy Learning with Language-based Scene Summarization

    Full text link
    Large Language Models (LLMs) have shown remarkable success in assisting robot learning tasks, e.g., complex household planning. However, the performance of pretrained LLMs heavily relies on domain-specific templated text data, which may be infeasible in real-world robot learning tasks with image-based observations. Moreover, existing LLMs with text inputs lack the capability to evolve through non-expert interactions with environments. In this work, we introduce a novel learning paradigm that generates robots' executable actions in the form of text, derived solely from visual observations, using language-based summarization of these observations as the bridge between the two domains. Our proposed paradigm stands apart from previous works, which utilized either language instructions or a combination of language and visual data as inputs. Moreover, our method does not require oracle text summarization of the scene, eliminating the need for human involvement in the learning loop and making it more practical for real-world robot learning tasks. Our proposed paradigm consists of two modules: the SUM module, which interprets the environment using visual observations and produces a text summary of the scene, and the APM module, which generates executable action policies based on the natural language descriptions provided by the SUM module. We demonstrate that our proposed method can employ two fine-tuning strategies, imitation learning and reinforcement learning, to adapt to the target test tasks effectively. We conduct extensive experiments involving various SUM/APM model selections, environments, and tasks across 7 house layouts in the VirtualHome environment. Our experimental results demonstrate that our method surpasses existing baselines, confirming the effectiveness of this novel learning paradigm.
    Comment: 15 pages. arXiv admin note: text overlap with arXiv:2107.06912 by other authors
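    The SUM -> APM decomposition lends itself to a simple two-stage pipeline. The sketch below is a minimal illustration under stated assumptions: an off-the-shelf image captioner stands in for the paper's SUM model and a small text generator stands in for APM; the class names, prompt format, and model choices are hypothetical, not the authors' released code.

    # Minimal sketch of a SUM -> APM pipeline (assumed stand-in models).
    from transformers import pipeline

    class SceneSummarizer:
        """SUM stand-in: turns a visual observation into a text summary."""
        def __init__(self):
            self.captioner = pipeline("image-to-text",
                                      model="Salesforce/blip-image-captioning-base")

        def summarize(self, image):
            return self.captioner(image)[0]["generated_text"]

    class ActionPolicy:
        """APM stand-in: maps a scene summary to an executable action string."""
        def __init__(self):
            self.generator = pipeline("text-generation", model="gpt2")

        def act(self, summary):
            prompt = f"Scene: {summary}\nNext action:"
            out = self.generator(prompt, max_new_tokens=16)[0]["generated_text"]
            return out[len(prompt):].strip()

    # Usage: action = ActionPolicy().act(SceneSummarizer().summarize(frame))

    In the paper both modules are fine-tuned with imitation or reinforcement learning; here they are frozen pretrained models, shown purely for the shape of the interface.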

    Can Brain Signals Reveal Inner Alignment with Human Languages?

    Full text link
    Brain signals, such as Electroencephalography (EEG), and human languages have been widely explored independently for many downstream tasks; however, the connection between them has not been well explored. In this study, we explore the relationship and dependency between EEG and language. To study this at the representation level, we introduced MTAM, a Multimodal Transformer Alignment Model, to observe coordinated representations between the two modalities. We used various relationship alignment-seeking techniques, such as Canonical Correlation Analysis and Wasserstein Distance, as loss functions to transform features. On downstream applications, sentiment analysis and relation detection, we achieved new state-of-the-art results on two datasets, ZuCo and K-EmoCon. Our method achieved an F1-score improvement of 1.7% on K-EmoCon and 9.3% on ZuCo for sentiment analysis, and 7.4% on ZuCo for relation detection. In addition, we provide interpretations of the performance improvement: (1) feature distributions show the effectiveness of the alignment module for discovering and encoding the relationship between EEG and language; (2) alignment weights show the influence of different language semantics as well as EEG frequency features; (3) brain topographical maps provide an intuitive demonstration of the connectivity of brain regions. Our code is available at https://github.com/Jason-Qiu/EEG_Language_Alignment.
    Comment: EMNLP 2023 Findings
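    Using the Wasserstein distance directly as a training loss can be sketched concretely. Below is a minimal PyTorch sketch of an entropy-regularized (Sinkhorn) approximation of the Wasserstein distance between a batch of EEG embeddings and a batch of language embeddings; it is a generic textbook routine under assumed batch sizes and dimensions, not the authors' MTAM implementation.

    # Generic Sinkhorn approximation of the Wasserstein distance between two
    # batches of embeddings, usable as an auxiliary alignment loss.
    import torch

    def sinkhorn_wasserstein(x, y, eps=0.1, n_iters=50):
        cost = torch.cdist(x, y, p=2)          # pairwise transport costs
        n, m = cost.shape
        mu = torch.full((n,), 1.0 / n)         # uniform source marginal
        nu = torch.full((m,), 1.0 / m)         # uniform target marginal
        u = torch.zeros(n)
        v = torch.zeros(m)
        for _ in range(n_iters):               # log-domain Sinkhorn updates
            u = eps * (torch.log(mu) - torch.logsumexp((v[None, :] - cost) / eps, dim=1))
            v = eps * (torch.log(nu) - torch.logsumexp((u[:, None] - cost) / eps, dim=0))
        pi = torch.exp((u[:, None] + v[None, :] - cost) / eps)  # transport plan
        return (pi * cost).sum()

    eeg = torch.randn(32, 128)                 # placeholder EEG embeddings
    txt = torch.randn(32, 128)                 # placeholder language embeddings
    loss = sinkhorn_wasserstein(eeg, txt)      # add to the task loss when training

    In training, such a term would be weighted and added to the downstream sentiment or relation-detection loss so the two modalities are pulled toward coordinated representations.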

    Transfer Knowledge from Natural Language to Electrocardiography: Can We Detect Cardiovascular Disease Through Language Models?

    Full text link
    Recent advancements in Large Language Models (LLMs) have drawn increasing attention, since embeddings pretrained on large-scale datasets have shown powerful capability in various downstream applications. However, whether the knowledge learned by LLMs can be transferred to clinical cardiology remains unknown. In this work, we aim to bridge this gap by transferring the knowledge of LLMs to clinical Electrocardiography (ECG). We propose an approach for cardiovascular disease diagnosis and automatic ECG diagnosis report generation. We also introduce an additional loss function based on Optimal Transport (OT) to align the distributions of ECG and language embeddings. The learned embeddings are evaluated on two downstream tasks: (1) automatic ECG diagnosis report generation, and (2) zero-shot cardiovascular disease detection. Our approach is able to generate high-quality cardiac diagnosis reports and also achieves competitive zero-shot classification performance even compared with supervised baselines, which demonstrates the feasibility of transferring knowledge from LLMs to the cardiac domain.
    Comment: EACL 2023
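    Once the OT alignment places ECG and text embeddings in a shared space, zero-shot disease detection reduces to nearest-label scoring. The sketch below shows only that final step under stated assumptions: the encoders are omitted, the random tensors stand in for their outputs, and the label set and dimensions are illustrative, not the paper's.

    # Zero-shot detection as nearest-label retrieval in a shared embedding space.
    import torch
    import torch.nn.functional as F

    def zero_shot_diagnose(ecg_emb, label_embs, labels):
        # Cosine similarity between one ECG embedding and each label embedding.
        sims = F.cosine_similarity(ecg_emb[None, :], label_embs, dim=-1)
        return labels[int(sims.argmax())]

    labels = ["atrial fibrillation", "myocardial infarction", "normal sinus rhythm"]
    label_embs = torch.randn(len(labels), 256)  # stand-ins for text-encoder outputs
    ecg_emb = torch.randn(256)                  # stand-in for the ECG-encoder output
    print(zero_shot_diagnose(ecg_emb, label_embs, labels))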

    Semantics-Consistent Cross-domain Summarization via Optimal Transport Alignment

    Full text link
    Multimedia summarization with multimodal output (MSMO) is a recently explored application in language grounding. It plays an essential role in real-world applications, e.g., automatically generating cover images and titles for news articles or providing introductions to online videos. However, existing methods extract features from the whole video and article and use fusion methods to select a representative one, thus usually ignoring critical structure and varying semantics. In this work, we propose a Semantics-Consistent Cross-domain Summarization (SCCS) model based on optimal transport alignment with visual and textual segmentation. Specifically, our method first decomposes both the video and the article into segments to capture their structural semantics. SCCS then follows a cross-domain alignment objective with an optimal transport distance, which leverages multimodal interaction to match and select the visual and textual summary. We evaluated our method on three recent multimodal datasets and demonstrated its effectiveness in producing high-quality multimodal summaries.
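    The segment-level matching at the heart of SCCS can be illustrated compactly. The sketch below scores every video segment against every article segment and picks the best-aligned pair as the multimodal summary; plain cosine similarity stands in for the paper's optimal-transport objective, and the random tensors are placeholders for real segment embeddings.

    # Cross-domain segment matching: pick the best-aligned (video, text) pair.
    import torch
    import torch.nn.functional as F

    video_segs = torch.randn(6, 512)   # one embedding per video segment
    text_segs = torch.randn(9, 512)    # one embedding per article segment

    v = F.normalize(video_segs, dim=-1)
    t = F.normalize(text_segs, dim=-1)
    scores = v @ t.T                   # pairwise alignment scores

    vi, ti = divmod(int(scores.argmax()), scores.shape[1])
    print(f"visual summary: video segment {vi}; textual summary: text segment {ti}")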

    Converting ECG Signals to Images for Efficient Image-text Retrieval via Encoding

    Full text link
    Automated interpretation of electrocardiograms (ECG) has garnered significant attention with the advancement of machine learning methodologies. Despite the growing interest in automated ECG interpretation using machine learning, most current studies focus solely on classification or regression tasks and overlook a crucial aspect of clinical cardio-disease diagnosis: the diagnostic report generated by experienced human clinicians. In this paper, we introduce a novel approach to ECG interpretation, leveraging recent breakthroughs in Large Language Models (LLMs) and Vision Transformer (ViT) models. Rather than treating ECG diagnosis as a classification or regression task, we propose an alternative method of automatically identifying the most similar clinical cases based on the input ECG data. Also, since interpreting ECGs as images is more affordable and accessible, we process ECGs as encoded images and adopt a vision-language learning paradigm to jointly learn vision-language alignment between encoded ECG images and ECG diagnosis reports. Encoding ECGs into images yields an efficient ECG retrieval system, which will be highly practical in clinical applications. More importantly, our findings could serve as a crucial resource for providing diagnostic services in regions where only paper-printed ECG images are accessible due to past underdevelopment.
    Comment: 26 pages
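    The "ECG as image" idea can be made concrete with a small rendering step. The sketch below plots a 1-D ECG trace to a fixed-size RGB image that a ViT-style encoder could consume; the rendering parameters and the synthetic signal are assumptions for illustration, not the paper's exact encoding recipe.

    # Render a 1-D ECG trace to an image suitable for a vision encoder.
    import io
    import numpy as np
    import matplotlib
    matplotlib.use("Agg")              # headless rendering
    import matplotlib.pyplot as plt
    from PIL import Image

    def ecg_to_image(signal, size=(224, 224)):
        fig, ax = plt.subplots(figsize=(2.24, 2.24), dpi=100)
        ax.plot(signal, linewidth=0.8)
        ax.axis("off")
        buf = io.BytesIO()
        fig.savefig(buf, format="png", bbox_inches="tight")
        plt.close(fig)
        buf.seek(0)
        return Image.open(buf).resize(size).convert("RGB")

    signal = np.sin(np.linspace(0, 20 * np.pi, 2000))  # toy stand-in for one lead
    img = ecg_to_image(signal)          # feed to the ViT encoder for retrieval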

    MMSum: A Dataset for Multimodal Summarization and Thumbnail Generation of Videos

    Full text link
    Multimodal summarization with multimodal output (MSMO) has emerged as a promising research direction. Nonetheless, numerous limitations exist within existing public MSMO datasets, including insufficient maintenance, data inaccessibility, limited size, and the absence of proper categorization, which pose significant challenges. To address these challenges and provide a comprehensive dataset for this new direction, we have meticulously curated the MMSum dataset. Our new dataset features (1) human-validated summaries for both video and textual content, providing superior human instruction and labels for multimodal learning; (2) comprehensively and meticulously arranged categorization, spanning 17 principal categories and 170 subcategories to encapsulate a diverse array of real-world scenarios; (3) benchmark tests performed on the proposed dataset to assess various tasks and methods, including video summarization, text summarization, and multimodal summarization. To champion accessibility and collaboration, we will release the MMSum dataset and the data collection tool as fully open-source resources, fostering transparency and accelerating future developments. Our project website can be found at https://mmsum-dataset.github.io/

    Recent progress in 2D/quasi-2D layered metal halide perovskites for solar cells

    No full text
    © The Royal Society of Chemistry 2018. As an important category of perovskite materials, two-dimensional (2D) perovskites are attracting increasing research attention. Their potential to combine high performance and stability in perovskite-based optoelectronic devices has triggered a new wave of research. This review focuses on the application of 2D perovskite materials in solar cells. We start with a brief introduction to 2D perovskite structures and their unique properties. Recent progress in 2D perovskite solar cells is then summarized in three aspects, according to the form the perovskite materials take in the devices. Finally, a short outlook gives our opinion on the likely development trends for this class of perovskite materials.
    status: published

    Exploiting Two-Step Processed Mixed 2D/3D Perovskites for Bright Green Light Emitting Diodes

    No full text
    © 2019 WILEY-VCH Verlag GmbH & Co. KGaA, Weinheim. Mixed 2D/3D perovskite films with self-assembled quantum wells have significantly improved the performance of perovskite light-emitting diodes (PeLEDs). In this work, such films are fabricated through a two-step interdiffusion method that is widely employed in the processing of perovskite solar cells but remains rarely explored for PeLEDs. The effects of incorporating a large-cation ligand, butylammonium bromide (BABr), into formamidinium lead bromide (FAPbBr3) based perovskites are thoroughly investigated in terms of film composition, morphology, optoelectronic properties, and device performance. By modulating the BABr:PbBr2 ratio in the precursor solution, the optimal device shows a maximum external quantum efficiency (EQE) of 7.36% at 147.7 mA cm−2 and a brightness of 37 720 cd m−2 at 5 V. This performance is remarkably higher than that of a reference device without BABr, which shows a maximum EQE of 2.53% and a brightness of 6190 cd m−2 at 5 V. The versatility of the method is further demonstrated with another large-cation ligand, 4-fluoro-benzylammonium bromide (F-BZABr), which leads to a maximum EQE of 8.55%. This work indicates that two-step-processed mixed 2D/3D perovskites are promising for bright green PeLEDs.
    status: published