105 research outputs found
Cross-Lingual Knowledge Editing in Large Language Models
Knowledge editing aims to change a language model's behavior on specific
cases (i.e., the editing scope) by infusing the corresponding expected
knowledge into the model. With the recent advancements in large language models
(LLMs), knowledge editing has been shown as a promising technique to adapt LLMs
to new knowledge without retraining from scratch. However, most of the previous
studies neglect the multi-lingual nature of some mainstream LLMs (e.g., LLaMA,
ChatGPT and GPT-4), and typically focus on monolingual scenarios, where LLMs
are edited and evaluated in the same language. As a result, the effect of
editing in a source language on a different target language remains unknown. In
this paper, we aim to characterize this cross-lingual effect in knowledge editing.
Specifically, we first collect a large-scale cross-lingual synthetic dataset by
translating ZsRE from English to Chinese. Then, we apply various knowledge
editing methods covering different paradigms to perform edits in English and
evaluate them in Chinese, and vice versa. To give deeper analyses of the
cross-lingual effect, the evaluation includes four aspects, i.e., reliability,
generality, locality and portability. Furthermore, we analyze the inconsistent
behaviors of the edited models and discuss their specific challenges.
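The abstract does not give the scoring details, but the four aspects are commonly computed as exact-match rates over curated case sets. The following is an illustrative sketch only; the toy lookup "model", the field names, and the prompts are all hypothetical, not taken from the paper:

```python
def exact_match(pred, gold):
    # Simple string-level exact match between prediction and reference.
    return int(pred.strip().lower() == gold.strip().lower())

def evaluate_edit(edited, cases):
    """Score an edited model on reliability, generality, locality, portability.

    `edited(prompt)` returns the model's answer after the edit; each case
    holds prompts/answers in the evaluation (target) language.
    """
    scores = {"reliability": [], "generality": [], "locality": [], "portability": []}
    for c in cases:
        # Reliability: the edited fact itself is recalled.
        scores["reliability"].append(exact_match(edited(c["edit_prompt"]), c["new_answer"]))
        # Generality: paraphrases of the edit prompt still elicit the new fact.
        scores["generality"].append(exact_match(edited(c["paraphrase"]), c["new_answer"]))
        # Locality: unrelated facts are left untouched by the edit.
        scores["locality"].append(exact_match(edited(c["unrelated_prompt"]), c["unrelated_answer"]))
        # Portability: reasoning that depends on the new fact carries over.
        scores["portability"].append(exact_match(edited(c["hop_prompt"]), c["hop_answer"]))
    return {k: sum(v) / len(v) for k, v in scores.items()}

# Toy "edited model": a lookup table standing in for a real LLM.
table = {"Capital of X?": "Paris", "Capital de X? (paraphrase)": "Paris",
         "Capital of Y?": "Rome", "Currency used in X's capital?": "Euro"}
model = lambda p: table.get(p, "")
cases = [{"edit_prompt": "Capital of X?", "new_answer": "Paris",
          "paraphrase": "Capital de X? (paraphrase)",
          "unrelated_prompt": "Capital of Y?", "unrelated_answer": "Rome",
          "hop_prompt": "Currency used in X's capital?", "hop_answer": "Euro"}]
print(evaluate_edit(model, cases))  # every aspect scores 1.0 on this toy case
```

In a cross-lingual setting, the case prompts would be in the target language (e.g., Chinese) while the edit itself was performed in the source language (e.g., English).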
DTV: Dual Knowledge Distillation and Target-oriented Vision Modeling for Many-to-Many Multimodal Summarization
The many-to-many multimodal summarization (MS) task aims to generate
summaries in any language from document inputs in any language and the
corresponding image sequence, which essentially comprises multimodal
monolingual summarization (MMS) and multimodal cross-lingual summarization
(MXLS) tasks. Although much work has been devoted to MMS or MXLS individually
and both have attracted increasing attention in recent years, little research
addresses the unified MS task. Besides, existing studies mainly focus on 1) utilizing MMS
to enhance MXLS via knowledge distillation without considering the performance
of MMS or 2) improving MMS models by filtering summary-unrelated visual
features with implicit learning or explicitly complex training objectives. In
this paper, we first introduce a general and practical task, i.e., MS.
Further, we propose a dual knowledge distillation and target-oriented vision
modeling framework for the MS task. Specifically, the dual knowledge
distillation method guarantees that the knowledge of MMS and MXLS can be
transferred between the two tasks so that each mutually improves the other. To offer
target-oriented visual features, a simple yet effective target-oriented
contrastive objective is designed to discard needless visual information.
Extensive experiments in the many-to-many setting show the
effectiveness of the proposed approach. Additionally, we will contribute a
many-to-many multimodal summarization (MSum) dataset. Comment: Accepted to the Findings of EMNLP 2023.
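The abstract names a target-oriented contrastive objective without giving its form. A generic InfoNCE-style sketch (not the paper's actual objective) that pulls the target representation toward the relevant visual feature and pushes it away from the rest might look like:

```python
import math

def cosine(u, v):
    # Cosine similarity between two dense feature vectors.
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def contrastive_loss(target_vec, visual_vecs, pos_idx, temperature=0.1):
    """InfoNCE-style loss: pull the target (summary) representation toward the
    relevant visual feature at `pos_idx`, push it away from the others."""
    logits = [cosine(target_vec, v) / temperature for v in visual_vecs]
    m = max(logits)  # subtract the max for numerical stability
    denom = sum(math.exp(l - m) for l in logits)
    return -(logits[pos_idx] - m - math.log(denom))
```

With a target vector aligned to the first visual feature, the loss is near zero for `pos_idx=0` and large otherwise, which is one way "needless visual information" can be penalized during training.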
When to Pre-Train Graph Neural Networks? An Answer from Data Generation Perspective!
Recently, graph pre-training has attracted wide research attention, which
aims to learn transferable knowledge from unlabeled graph data so as to improve
downstream performance. Despite these recent attempts, negative transfer
remains a major issue when applying graph pre-trained models to downstream tasks.
Existing works have made great efforts on what to pre-train and how to
pre-train by designing a number of graph pre-training and fine-tuning
strategies. However, there are cases where, no matter how advanced the
strategy is, the "pre-train and fine-tune" paradigm still cannot achieve clear
benefits. This paper introduces a generic framework W2PGNN to answer the
crucial question of when to pre-train (i.e., in what situations could we take
advantage of graph pre-training) before performing effortful pre-training or
fine-tuning. We start from a new perspective to explore the complex generative
mechanisms from the pre-training data to downstream data. In particular, W2PGNN
first fits the pre-training data into graphon bases; each element of a graphon
basis (i.e., a graphon) identifies a fundamental transferable pattern shared by
a collection of pre-training graphs. All convex combinations of the graphon bases
give rise to a generator space, and the graphs generated from this space form the solution
space for those downstream data that can benefit from pre-training. In this
manner, the feasibility of pre-training can be quantified as the generation
probability of the downstream data from any generator in the generator space.
W2PGNN provides three broad applications, including providing the application
scope of graph pre-trained models, quantifying the feasibility of performing
pre-training, and helping select pre-training data to enhance downstream
performance. We give a theoretically sound solution for the first application
and extensive empirical justifications for the latter two applications.
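As an illustration of the idea (not the paper's implementation), graphons can be approximated by step functions, i.e., matrices of edge probabilities, and the generator space by convex combinations of fitted basis matrices. A crude feasibility score is then the closeness of the downstream graphon to that space; all names below are hypothetical:

```python
def convex_combo(bases, weights):
    # Pointwise convex combination of step-function graphons (matrices).
    k = len(bases[0])
    return [[sum(w * b[i][j] for w, b in zip(weights, bases)) for j in range(k)]
            for i in range(k)]

def distance(g1, g2):
    # Mean absolute difference: a crude stand-in for the cut distance.
    k = len(g1)
    return sum(abs(g1[i][j] - g2[i][j]) for i in range(k) for j in range(k)) / (k * k)

def feasibility(downstream, bases, grid=11):
    """Coarse search over convex weights for two basis graphons; a score near
    1 means the downstream graphon lies close to the generator space."""
    best = float("inf")
    for t in range(grid):
        w = t / (grid - 1)
        best = min(best, distance(downstream, convex_combo(bases, [w, 1.0 - w])))
    return 1.0 - best  # entries are probabilities, so best is in [0, 1]
```

A downstream graphon that is an exact convex mixture of the bases scores 1.0; one far from every mixture scores lower, signaling that pre-training is unlikely to help.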
Understanding Translationese in Cross-Lingual Summarization
Given a document in a source language, cross-lingual summarization (CLS) aims
at generating a concise summary in a different target language. Unlike
monolingual summarization (MS), naturally occurring source-language documents
paired with target-language summaries are rare. To collect large-scale CLS
data, existing datasets typically involve translation in their creation.
However, translated text differs from text originally written in that
language, a phenomenon known as translationese. In this paper, we first confirm
that different approaches to constructing CLS datasets lead to different
degrees of translationese. Then we systematically investigate how
translationese affects CLS model evaluation and performance when it appears in
source documents or target summaries. In detail, we find that (1) the
translationese in documents or summaries of test sets might lead to the
discrepancy between human judgment and automatic evaluation; (2) the
translationese in training sets would harm model performance in real-world
applications; (3) though machine-translated documents involve translationese,
they are very useful for building CLS systems on low-resource languages under
specific training strategies. Lastly, we give suggestions for future CLS
research including dataset and model developments. We hope that our work could
let researchers notice the phenomenon of translationese in CLS and take it into
account in the future. Comment: Accepted to the Findings of EMNLP 2023.
Zero-Shot Cross-Lingual Summarization via Large Language Models
Given a document in a source language, cross-lingual summarization (CLS) aims
to generate a summary in a different target language. Recently, the emergence
of Large Language Models (LLMs), such as GPT-3.5, ChatGPT and GPT-4, has
attracted wide attention from the computational linguistics community. However,
the performance of LLMs on CLS is not yet known. In this report, we
empirically use various prompts to guide LLMs to perform zero-shot CLS from
different paradigms (i.e., end-to-end and pipeline), and provide a preliminary
evaluation on the generated summaries. We find that ChatGPT and GPT-4
initially tend to produce lengthy summaries with detailed information. These
two LLMs can further balance informativeness and conciseness with the help of
an interactive prompt, significantly improving their CLS performance.
Experimental results on three widely-used CLS datasets show that GPT-4 achieves
state-of-the-art zero-shot CLS performance, and performs competitively compared
with the fine-tuned mBART-50. Moreover, we also find some multi-lingual and
bilingual LLMs (i.e., BLOOMZ, ChatGLM-6B, Vicuna-13B and ChatYuan) have limited
zero-shot CLS ability. Due to the composite nature of CLS, which requires
models to perform summarization and translation simultaneously, accomplishing
this task in a zero-shot manner remains challenging even for LLMs. We therefore
recommend that future LLM research use CLS as a testbed. Comment: Technical Report, 11 pages.
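The end-to-end and pipeline paradigms, plus the interactive conciseness prompt described above, can be sketched as plain prompt builders. The exact wording of the prompts below is illustrative, not taken from the report:

```python
def e2e_prompt(doc, tgt_lang="Chinese"):
    # End-to-end paradigm: summarize directly into the target language.
    return f"Summarize the following document in {tgt_lang}:\n\n{doc}"

def pipeline_prompt(doc, tgt_lang="Chinese"):
    # Pipeline paradigm: summarize first, then translate the summary.
    return [f"Summarize the following document:\n\n{doc}",
            f"Now translate the summary above into {tgt_lang}."]

def interactive_refine(summary, max_words=40):
    # Interactive follow-up nudging the model toward conciseness.
    return (f"Your summary was:\n{summary}\n\n"
            f"Rewrite it in at most {max_words} words, keeping only the key facts.")
```

The interactive follow-up is what lets the model rebalance informativeness and conciseness after an initially verbose first attempt.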
Is ChatGPT a Good NLG Evaluator? A Preliminary Study
Recently, the emergence of ChatGPT has attracted wide attention from the
computational linguistics community. Many prior studies have shown that ChatGPT
achieves remarkable performance on various NLP tasks in terms of automatic
evaluation metrics. However, the ability of ChatGPT to serve as an evaluation
metric is still underexplored. Given that assessing the quality of natural
language generation (NLG) models is an arduous task and NLG metrics notoriously
correlate poorly with human judgments, we ask whether ChatGPT is
a good NLG evaluation metric. In this report, we provide a preliminary
meta-evaluation on ChatGPT to show its reliability as an NLG metric. In detail,
we regard ChatGPT as a human evaluator and give task-specific (e.g.,
summarization) and aspect-specific (e.g., relevance) instruction to prompt
ChatGPT to evaluate the generated results of NLG models. We conduct experiments
on five NLG meta-evaluation datasets (including summarization, story generation
and data-to-text tasks). Experimental results show that compared with previous
automatic metrics, ChatGPT achieves state-of-the-art or competitive correlation
with human judgments in most cases. In addition, we find that the effectiveness
of the ChatGPT evaluator might be influenced by the creation method of the
meta-evaluation datasets. For meta-evaluation datasets whose creation depends
heavily on the references, and which are thus biased, the ChatGPT evaluator
might lose its effectiveness. We hope our preliminary study could prompt the
emergence of a general-purpose, reliable NLG metric. Comment: Both first authors contributed equally. Technical Report, 11 pages. Accepted to the 4th New Frontiers in Summarization Workshop (NewSumm@EMNLP 2023).
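A meta-evaluation of this kind reduces to prompting the LLM with a task- and aspect-specific instruction and correlating its scores with human judgments. A minimal stdlib sketch (prompt wording hypothetical; Spearman correlation is one common choice of correlation measure):

```python
def evaluator_prompt(task, aspect, source, output):
    # Task- and aspect-specific instruction for an LLM judge (wording illustrative).
    return (f"Score the {aspect} of the following {task} output on a 1-5 scale.\n"
            f"Source:\n{source}\n\nOutput:\n{output}\n\nScore:")

def rankdata(xs):
    # Average ranks (1-based), with ties sharing their mean rank.
    order = sorted(range(len(xs)), key=lambda i: xs[i])
    ranks = [0.0] * len(xs)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and xs[order[j + 1]] == xs[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

def spearman(a, b):
    # Spearman correlation = Pearson correlation of the ranks.
    ra, rb = rankdata(a), rankdata(b)
    n = len(a)
    ma, mb = sum(ra) / n, sum(rb) / n
    cov = sum((x - ma) * (y - mb) for x, y in zip(ra, rb))
    sa = sum((x - ma) ** 2 for x in ra) ** 0.5
    sb = sum((y - mb) ** 2 for y in rb) ** 0.5
    return cov / (sa * sb)
```

In practice the LLM's numeric scores for many system outputs would be collected via `evaluator_prompt` and then fed to `spearman` against the human scores for the same outputs.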
High-Resolution Boundary Detection for Medical Image Segmentation with Piece-Wise Two-Sample T-Test Augmented Loss
Deep learning methods have contributed substantially to the rapid advancement
of medical image segmentation, the quality of which relies on the suitable
design of loss functions. Popular loss functions, including the cross-entropy
and Dice losses, often fall short in boundary detection, thereby limiting
high-resolution downstream applications such as automated diagnoses and
procedures. We develop a novel loss function tailored to reflect boundary
information and thus enhance boundary detection. As the contrast between
segmentation and background regions along the classification boundary naturally
induces heterogeneity over the pixels, we propose the piece-wise two-sample
t-test augmented (PTA) loss that is infused with the statistical test for such
heterogeneity. We demonstrate the improved boundary detection power of the PTA
loss compared to benchmark losses without a t-test component.
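The abstract does not give the exact loss. One plausible reading, sketched below under that assumption, is a Welch two-sample t statistic over pixel intensities on either side of a candidate boundary, turned into a penalty that shrinks as the two sides become well separated:

```python
import math

def welch_t(sample_a, sample_b):
    # Welch's two-sample t statistic: heterogeneity between the pixel
    # intensities on the two sides of a candidate boundary.
    na, nb = len(sample_a), len(sample_b)
    ma, mb = sum(sample_a) / na, sum(sample_b) / nb
    va = sum((x - ma) ** 2 for x in sample_a) / (na - 1)
    vb = sum((x - mb) ** 2 for x in sample_b) / (nb - 1)
    return (ma - mb) / math.sqrt(va / na + vb / nb)

def pta_penalty(inside, outside, scale=1.0):
    # Hypothetical penalty term: small when the two sides are well separated
    # (large |t|), approaching `scale` when they are not.
    t = abs(welch_t(inside, outside))
    return scale / (1.0 + t)
```

Applied piece-wise along the predicted boundary and added to a base loss such as cross-entropy, a term like this would push the network toward boundaries with high between-region contrast.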
The effect of peak serum estradiol level during ovarian stimulation on cumulative live birth and obstetric outcomes in freeze-all cycles
Objective: To determine whether the peak serum estradiol (E2) level during ovarian stimulation affects the cumulative live birth rate (CLBR) and obstetric outcomes in freeze-all cycles.
Methods: This retrospective cohort study involved patients who underwent their first cycle of in vitro fertilization followed by a freeze-all strategy and frozen embryo transfer cycles between January 2014 and June 2019 at a tertiary care center. Patients were categorized into four groups according to quartiles of peak serum E2 levels during ovarian stimulation (Q1-Q4). The primary outcome was CLBR. Secondary outcomes included obstetric and neonatal outcomes of singleton and twin pregnancies. Poisson or logistic regression was applied to control for potential confounders for outcome measures, as appropriate. Generalized estimating equations were used to account for multiple cycles from the same patient for the outcome of CLBR.
Results: A total of 11,237 patients were included in the analysis. Cumulatively, live births occurred in 8,410 women (74.8%). The live birth rate (LBR) and CLBR improved as quartiles of peak E2 levels increased (49.7%, 52.1%, 54.9%, and 56.4% for LBR; 65.1%, 74.3%, 78.4%, and 81.6% for CLBR, from the lowest to the highest quartile of estradiol levels, respectively; P<0.001). This association remained significant for CLBR after accounting for potential confounders in multivariable regression models, whereas the relationship between LBR and peak E2 levels did not reach statistical significance. In addition, no significant differences were noticed in adverse obstetric and neonatal outcomes (gestational diabetes mellitus, pregnancy-induced hypertension, preeclampsia, placental disorders, preterm birth, low birthweight, and small for gestational age) among E2 quartiles for either singleton or twin live births, both before and after adjustment.
Conclusion: In freeze-all cycles, higher peak serum E2 levels during ovarian stimulation were associated with increased CLBR, without increasing the risks of adverse obstetric and neonatal outcomes.
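The quartile analysis of CLBR can be illustrated with a small stand-alone computation on synthetic data. This sketch covers only the unadjusted grouping step; the study itself additionally adjusts for confounders via Poisson/logistic regression and accounts for repeated cycles with generalized estimating equations:

```python
def quartile_clbr(records):
    """records: (peak_e2, live_birth 0/1) pairs; returns CLBR per E2 quartile.
    Cut-points come from the empirical quartiles, mirroring the Q1-Q4 grouping."""
    xs = sorted(e2 for e2, _ in records)
    n = len(xs)
    cuts = [xs[n // 4], xs[n // 2], xs[3 * n // 4]]
    buckets = {q: [0, 0] for q in ("Q1", "Q2", "Q3", "Q4")}
    for e2, lb in records:
        q = ("Q1" if e2 < cuts[0] else "Q2" if e2 < cuts[1]
             else "Q3" if e2 < cuts[2] else "Q4")
        buckets[q][0] += lb
        buckets[q][1] += 1
    return {q: births / total for q, (births, total) in buckets.items()}

# Synthetic example: live birth becomes more likely at higher E2 levels.
data = [(float(i), 1 if i >= 8 else 0) for i in range(16)]
print(quartile_clbr(data))  # CLBR rises across quartiles on this toy data
```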