Embodied Executable Policy Learning with Language-based Scene Summarization
Large Language Models (LLMs) have shown remarkable success in assisting robot learning tasks, e.g., complex household planning. However, the performance of pretrained LLMs relies heavily on domain-specific templated text data, which may be unavailable in real-world robot learning tasks with image-based observations. Moreover, existing LLMs with text inputs lack the ability to evolve through non-expert interactions with the environment. In this work, we
introduce a novel learning paradigm that generates robots' executable actions
in the form of text, derived solely from visual observations, using
language-based summarization of these observations as the connecting bridge
between both domains. Our proposed paradigm stands apart from previous works,
which utilized either language instructions or a combination of language and
visual data as inputs. Moreover, our method does not require oracle text
summarization of the scene, eliminating the need for human involvement in the
learning loop, which makes it more practical for real-world robot learning
tasks. Our proposed paradigm consists of two modules: the SUM module, which
interprets the environment using visual observations and produces a text
summary of the scene, and the APM module, which generates executable action
policies based on the natural language descriptions provided by the SUM module.
We demonstrate that our proposed method can employ two fine-tuning strategies, namely imitation learning and reinforcement learning, to adapt effectively to the target test tasks. We conduct extensive experiments involving
various SUM/APM model selections, environments, and tasks across 7 house
layouts in the VirtualHome environment. Our experimental results demonstrate
that our method surpasses existing baselines, confirming the effectiveness of
this novel learning paradigm.Comment: 15 pages. arXiv admin note: text overlap with arXiv:2107.06912 by
other author
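As an illustration, the two-module paradigm above can be sketched in a few lines. The classes below are hand-written, rule-based stand-ins for the learned SUM and APM models, and the VirtualHome-style action strings are assumptions for illustration only:

```python
# Minimal sketch of the SUM -> APM pipeline: a visual observation is turned
# into a language summary, which is then mapped to an executable action.
# Both modules are illustrative stand-ins, not the paper's learned models.

class SceneSummarizer:
    """SUM-module stand-in: turns a structured observation into a text summary."""
    def summarize(self, observation):
        objects = ", ".join(observation["objects"])
        return f"The robot is in the {observation['room']} and sees: {objects}."

class ActionPolicy:
    """APM-module stand-in: maps a text summary to an executable action string."""
    def act(self, summary):
        if "cup" in summary:
            return "[grab] <cup>"
        return "[walk] <kitchen>"

def run_step(observation):
    # One perception-to-action step of the pipeline.
    summary = SceneSummarizer().summarize(observation)
    action = ActionPolicy().act(summary)
    return summary, action

summary, action = run_step({"room": "kitchen", "objects": ["cup", "table"]})
print(summary)
print(action)  # -> [grab] <cup>
```

In the paper's setting, the rule-based stubs would be replaced by a vision-language summarizer and a fine-tuned language policy, with the text summary acting as the bridge between the two.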
Can Brain Signals Reveal Inner Alignment with Human Languages?
Brain signals, such as Electroencephalography (EEG), and human languages have been widely explored independently for many downstream tasks; however, the connection between them has received little attention. In this study, we explore
the relationship and dependency between EEG and language. To study at the
representation level, we introduced \textbf{MTAM}, a \textbf{M}ultimodal
\textbf{T}ransformer \textbf{A}lignment \textbf{M}odel, to observe coordinated
representations between the two modalities. We used various relationship-alignment techniques, such as Canonical Correlation Analysis and Wasserstein Distance, as loss functions to transform features. On downstream
applications, sentiment analysis and relation detection, we achieved new
state-of-the-art results on two datasets, ZuCo and K-EmoCon. Our method achieved an F1-score improvement of 1.7% on K-EmoCon and 9.3% on ZuCo for sentiment analysis, and 7.4% on ZuCo for relation detection. In addition,
we provide interpretations of the performance improvement: (1) feature
distribution shows the effectiveness of the alignment module for discovering
and encoding the relationship between EEG and language; (2) alignment weights
show the influence of different language semantics as well as EEG frequency
features; (3) brain topographical maps provide an intuitive demonstration of
the connectivity in the brain regions. Our code is available at \url{https://github.com/Jason-Qiu/EEG_Language_Alignment}.
Comment: EMNLP 2023 Findings
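To make the alignment-loss idea concrete, here is a minimal sketch of a 1-D Wasserstein-1 distance used as a feature-alignment objective. This is an illustrative simplification (per-dimension, equal sample counts), not the authors' MTAM implementation:

```python
import numpy as np

def wasserstein_1d(u, v):
    """Wasserstein-1 distance between two 1-D empirical distributions with
    equal sample counts: mean absolute difference of the sorted samples."""
    u, v = np.sort(np.asarray(u, float)), np.sort(np.asarray(v, float))
    assert u.shape == v.shape
    return float(np.mean(np.abs(u - v)))

def alignment_loss(eeg_feats, text_feats):
    """Average per-dimension Wasserstein-1 between EEG and text feature
    matrices (samples x dims) -- a toy stand-in for an alignment loss."""
    return float(np.mean([wasserstein_1d(eeg_feats[:, d], text_feats[:, d])
                          for d in range(eeg_feats.shape[1])]))

rng = np.random.default_rng(0)
eeg = rng.normal(0.0, 1.0, size=(128, 4))
text_aligned = eeg + rng.normal(0.0, 0.05, size=(128, 4))   # nearly aligned
text_random = rng.normal(3.0, 1.0, size=(128, 4))           # misaligned
print(alignment_loss(eeg, text_aligned) < alignment_loss(eeg, text_random))  # True
```

Minimizing such a distance pushes the EEG and language feature distributions toward each other, which is the role the CCA and Wasserstein losses play during training.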
Transfer Knowledge from Natural Language to Electrocardiography: Can We Detect Cardiovascular Disease Through Language Models?
Recent advancements in Large Language Models (LLMs) have drawn increasing
attention since the learned embeddings pretrained on large-scale datasets have
shown powerful ability in various downstream applications. However, whether the knowledge learned by LLMs can be transferred to clinical cardiology remains unknown. In this work, we aim to bridge this gap by transferring the knowledge of LLMs to clinical Electrocardiography (ECG). We propose an approach for
cardiovascular disease diagnosis and automatic ECG diagnosis report generation.
We also introduce an additional loss function based on Optimal Transport (OT) to align the distributions of ECG and language embeddings. The learned
embeddings are evaluated on two downstream tasks: (1) automatic ECG diagnosis
report generation, and (2) zero-shot cardiovascular disease detection. Our
approach is able to generate high-quality cardiac diagnosis reports and also achieves competitive zero-shot classification performance even compared with supervised baselines, demonstrating the feasibility of transferring knowledge from LLMs to the cardiac domain.
Comment: EACL 202
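An OT alignment loss of the kind mentioned above can be sketched with a small entropy-regularized Sinkhorn solver. The code below is a simplified illustration under assumed uniform marginals and a squared-distance cost; it is not the paper's exact loss:

```python
import numpy as np

def sinkhorn_ot(cost, reg=0.1, n_iter=200):
    """Entropy-regularized OT (Sinkhorn iterations) between two uniform
    marginals. Returns the transport plan and the resulting transport cost."""
    n, m = cost.shape
    a, b = np.full(n, 1.0 / n), np.full(m, 1.0 / m)
    K = np.exp(-cost / reg)          # Gibbs kernel
    u = np.ones(n)
    for _ in range(n_iter):          # alternate marginal scalings
        v = b / (K.T @ u)
        u = a / (K @ v)
    plan = u[:, None] * K * v[None, :]
    return plan, float(np.sum(plan * cost))

def ot_alignment_loss(ecg_emb, text_emb):
    """Scalar OT loss between ECG and text embedding sets, using a pairwise
    squared-distance cost (an illustrative sketch of the OT loss idea)."""
    cost = np.sum((ecg_emb[:, None, :] - text_emb[None, :, :]) ** 2, axis=-1)
    cost = cost / (cost.max() + 1e-12)   # rescale for numerical stability
    return sinkhorn_ot(cost)[1]

rng = np.random.default_rng(1)
ecg = rng.normal(size=(8, 16))
# Nearly identical embedding sets cost less to transport than shifted ones.
print(ot_alignment_loss(ecg, ecg + 0.01) < ot_alignment_loss(ecg, ecg + 1.0))  # True
```

During training, this scalar would be added to the task loss so that gradient descent pulls the two embedding distributions together.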
Semantics-Consistent Cross-domain Summarization via Optimal Transport Alignment
Multimedia summarization with multimodal output (MSMO) is a recently explored
application in language grounding. It plays an essential role in real-world applications, e.g., automatically generating cover images and titles for news
articles or providing introductions to online videos. However, existing methods
extract features from the whole video and article and use fusion methods to
select the representative one, thus usually ignoring the critical structure and
varying semantics. In this work, we propose a Semantics-Consistent Cross-domain
Summarization (SCCS) model based on optimal transport alignment with visual and
textual segmentation. Specifically, our method first decomposes both the video and the article into segments to capture their structural semantics. Then SCCS follows a cross-domain alignment objective with optimal
transport distance, which leverages multimodal interaction to match and select
the visual and textual summary. We evaluated our approach on three recent multimodal datasets and demonstrated its effectiveness in producing high-quality multimodal summaries.
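The segment-level matching step can be illustrated with a cross-domain cost matrix over segment embeddings. The greedy best-pair selection below is a simplified stand-in for the OT-based matching used in SCCS:

```python
import numpy as np

def cosine_cost(a, b):
    """Pairwise cosine cost (1 - cosine similarity) between two embedding sets."""
    a = a / np.linalg.norm(a, axis=1, keepdims=True)
    b = b / np.linalg.norm(b, axis=1, keepdims=True)
    return 1.0 - a @ b.T

def select_summary(video_segs, text_segs):
    """Pick the (video segment, text segment) pair with the lowest cross-domain
    cost -- a greedy stand-in for SCCS's optimal-transport matching."""
    cost = cosine_cost(video_segs, text_segs)
    i, j = np.unravel_index(np.argmin(cost), cost.shape)
    return int(i), int(j)

# Toy segment embeddings: video segment 1 is closest to text segment 0.
video = np.array([[1.0, 0.0, 0.0],
                  [0.0, 1.0, 0.0]])
text = np.array([[0.0, 0.9, 0.1],
                 [0.7, 0.7, 0.0]])
print(select_summary(video, text))  # -> (1, 0)
```

Segmenting first and matching segments, rather than fusing whole-video and whole-article features, is what lets the method respect structure and varying semantics.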
Converting ECG Signals to Images for Efficient Image-text Retrieval via Encoding
Automated interpretation of electrocardiograms (ECG) has garnered significant attention with advances in machine learning methodologies. However, most current studies focus solely on classification or regression tasks and overlook
a crucial aspect of clinical cardio-disease diagnosis: the diagnostic report
generated by experienced human clinicians. In this paper, we introduce a novel
approach to ECG interpretation, leveraging recent breakthroughs in Large
Language Models (LLMs) and Vision-Transformer (ViT) models. Rather than
treating ECG diagnosis as a classification or regression task, we propose an
alternative method of automatically identifying the most similar clinical cases
based on the input ECG data. Also, since interpreting ECGs as images is more affordable and accessible, we process ECGs as encoded images and adopt a
vision-language learning paradigm to jointly learn vision-language alignment
between encoded ECG images and ECG diagnosis reports. Encoding ECG into images
can result in an efficient ECG retrieval system, which will be highly practical
and useful in clinical applications. More importantly, our findings could serve
as a crucial resource for providing diagnostic services in regions where only
paper-printed ECG images are accessible due to past underdevelopment.
Comment: 26 pages
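The encode-then-retrieve idea can be sketched as follows. The window-stacking encoding and the L2 nearest-neighbor search are illustrative simplifications (the paper works with encoded ECG images, vision-language embeddings, and diagnosis reports, not raw pixel distances):

```python
import numpy as np

def ecg_to_image(signal, height=32):
    """Encode a 1-D ECG trace as a 2-D image by stacking fixed-length windows
    into rows (one simple encoding; the paper's actual encoding may differ)."""
    width = len(signal) // height
    return np.asarray(signal[: height * width], float).reshape(height, width)

def retrieve(query_img, database_imgs):
    """Return the index of the most similar stored case (lowest L2 distance),
    standing in for similarity search over learned embeddings."""
    dists = [np.linalg.norm(query_img - img) for img in database_imgs]
    return int(np.argmin(dists))

# Toy "database" of three encoded ECG traces and a noisy query.
rng = np.random.default_rng(2)
t = np.linspace(0, 8 * np.pi, 1024)
db_signals = [np.sin(t), np.sin(2 * t), np.sin(3 * t)]
db_imgs = [ecg_to_image(s) for s in db_signals]
query = ecg_to_image(np.sin(2 * t) + rng.normal(0, 0.05, t.size))
print(retrieve(query, db_imgs))  # -> 1, the sin(2*t) case
```

In the full system, each retrieved case would carry its clinician-written diagnosis report, so retrieval doubles as report suggestion.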
MMSum: A Dataset for Multimodal Summarization and Thumbnail Generation of Videos
Multimodal summarization with multimodal output (MSMO) has emerged as a
promising research direction. Nonetheless, numerous limitations exist within
existing public MSMO datasets, including insufficient maintenance, data
inaccessibility, limited size, and the absence of proper categorization, which
pose significant challenges. To address these challenges and provide a
comprehensive dataset for this new direction, we have meticulously curated the
\textbf{MMSum} dataset. Our new dataset features (1) Human-validated summaries
for both video and textual content, providing superior human instruction and
labels for multimodal learning. (2) Comprehensively and meticulously arranged
categorization, spanning 17 principal categories and 170 subcategories to
encapsulate a diverse array of real-world scenarios. (3) Benchmark tests
performed on the proposed dataset to assess various tasks and methods,
including \textit{video summarization}, \textit{text summarization}, and
\textit{multimodal summarization}. To champion accessibility and collaboration,
we will release the \textbf{MMSum} dataset and the data collection tool as
fully open-source resources, fostering transparency and accelerating future
developments. Our project website can be found
at~\url{https://mmsum-dataset.github.io/}.
Comment: Project website: https://mmsum-dataset.github.io
Recent progress in 2D/quasi-2D layered metal halide perovskites for solar cells
© The Royal Society of Chemistry 2018. As an important category of perovskite materials, two-dimensional (2D) perovskites are attracting increasing research attention. Their potential to combine high performance and stability in perovskite-based optoelectronic devices has triggered a new wave of research. This review focuses mainly on the application of 2D perovskite materials in solar cells. We start with a brief introduction to 2D perovskite structures and their unique properties. Recent progress in 2D perovskite solar cells is summarized in three aspects according to the existing forms of the perovskite materials in the devices. Finally, a short outlook with our opinion indicates the likely development trend for this kind of perovskite material.
Status: published
Exploiting Two-Step Processed Mixed 2D/3D Perovskites for Bright Green Light Emitting Diodes
© 2019 WILEY-VCH Verlag GmbH & Co. KGaA, Weinheim. Mixed 2D/3D perovskite films with self-assembled quantum wells have significantly improved the performance of perovskite light-emitting diodes (PeLEDs). In this work, such films are fabricated through a two-step interdiffusion method that is widely employed in the processing of perovskite solar cells but remains rarely explored for PeLEDs. The effects of incorporating a large-cation ligand, namely butylammonium bromide (BABr), into formamidinium lead bromide (FAPbBr3)-based perovskites on film composition, morphology, optoelectronic properties, and device performance are thoroughly investigated. By modulating the BABr:PbBr2 ratio in the precursor solution, the optimal device shows a maximum external quantum efficiency (EQE) of 7.36% at 147.7 mA cm^-2 and a brightness of 37,720 cd m^-2 at 5 V, remarkably higher than a reference device without BABr, which shows a maximum EQE of 2.53% and a brightness of 6,190 cd m^-2 at 5 V. The versatility of this method is further extended to another large-cation ligand, 4-fluoro-benzylammonium bromide (F-BZABr), which leads to a maximum EQE of 8.55%. This work indicates that two-step processed mixed 2D/3D perovskites are promising for bright green PeLEDs.
Status: published