49 research outputs found
Contrastive Hierarchical Discourse Graph for Scientific Document Summarization
The extended structural context has made scientific paper summarization a
challenging task. This paper proposes CHANGES, a contrastive hierarchical graph
neural network for extractive scientific paper summarization. CHANGES
represents a scientific paper with a hierarchical discourse graph and learns
effective sentence representations with specially designed hierarchical graph
information aggregation. We also propose a graph contrastive learning module to
learn global theme-aware sentence representations. Extensive experiments on the
PubMed and arXiv benchmark datasets prove the effectiveness of CHANGES and the
importance of capturing hierarchical structure information in modeling
scientific papers.
Comment: CODI at ACL 202
HEGEL: Hypergraph Transformer for Long Document Summarization
Extractive summarization for long documents is challenging due to the
extended structured input context. The long-distance sentence dependency
hinders cross-sentence relations modeling, the critical step of extractive
summarization. This paper proposes HEGEL, a hypergraph neural network for long
document summarization by capturing high-order cross-sentence relations. HEGEL
updates and learns effective sentence representations with hypergraph
transformer layers and fuses different types of sentence dependencies,
including latent topics, keyword coreference, and section structure. We
validate HEGEL by conducting extensive experiments on two benchmark datasets,
and experimental results demonstrate the effectiveness and efficiency of HEGEL.
Comment: EMNLP 202
SummIt: Iterative Text Summarization via ChatGPT
Existing text summarization systems have made significant progress in recent
years but typically generate summaries in a single step. The one-shot
summarization setting is sometimes inadequate, however, as the generated
summary may contain hallucinations or overlook important details related to the
reader's interests. In this paper, we address this limitation by proposing
SummIt, an iterative text summarization framework based on large language
models like ChatGPT. Our framework enables the model to refine the generated
summary iteratively through self-evaluation and feedback, closely resembling
the iterative process humans undertake when drafting and revising summaries. We
also explore using in-context learning to guide the rationale generation and
summary refinement. Furthermore, we explore the potential benefits of
integrating knowledge and topic extractors into the framework to enhance
summary faithfulness and controllability. We evaluate the performance of our
framework on three benchmark summarization datasets through empirical and
qualitative analyses. We also conduct a human evaluation to validate the
effectiveness of the model's refinements and find a potential issue of
over-correction. Our code is available at
https://github.com/hpzhang94/summ_it.
Comment: work in progress
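The refine-by-feedback loop described in this abstract can be sketched as follows. This is a minimal illustration, not the SummIt implementation: `iterative_summarize`, `toy_summarize`, and `toy_evaluate` are hypothetical names, and the two toy functions stand in for the language-model calls that would produce and critique summaries in practice.

```python
from typing import Callable, Optional, Tuple


def iterative_summarize(
    document: str,
    summarize: Callable[[str, Optional[str], Optional[str]], str],
    evaluate: Callable[[str, str], Tuple[bool, str]],
    max_rounds: int = 3,
) -> Tuple[str, int]:
    """Refine a summary until the evaluator accepts it or max_rounds is reached.

    Returns the final summary and the number of refinement rounds performed.
    """
    summary = summarize(document, None, None)  # initial one-shot draft
    for rounds_done in range(max_rounds):
        accepted, feedback = evaluate(document, summary)
        if accepted:
            return summary, rounds_done
        # Revise the previous draft using the evaluator's feedback.
        summary = summarize(document, summary, feedback)
    return summary, max_rounds


# Hypothetical stand-ins for the summarizer and evaluator models.
def toy_summarize(document: str, previous: Optional[str], feedback: Optional[str]) -> str:
    sentences = document.split(". ")
    # First draft keeps one sentence; any revision keeps two.
    return sentences[0] if previous is None else ". ".join(sentences[:2])


def toy_evaluate(document: str, summary: str) -> Tuple[bool, str]:
    accepted = "budget" in summary
    return accepted, "" if accepted else "mention the budget"


doc = "The team met. The budget was cut. Lunch followed."
final_summary, rounds = iterative_summarize(doc, toy_summarize, toy_evaluate)
```

Capping the loop with `max_rounds` is a design choice of this sketch; a bounded number of rounds is one simple guard against the over-correction issue the abstract mentions.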
Extractive Summarization via ChatGPT for Faithful Summary Generation
Extractive summarization is a crucial task in natural language processing
that aims to condense long documents into shorter versions by directly
extracting sentences. The recent introduction of ChatGPT has attracted
significant interest in the NLP community due to its remarkable performance on
a wide range of downstream tasks. However, concerns regarding factuality and
faithfulness have hindered its practical applications for summarization
systems. This paper first presents a thorough evaluation of ChatGPT's
performance on extractive summarization and compares it with traditional
fine-tuning methods on various benchmark datasets. Our experimental analysis
reveals that ChatGPT's extractive summarization performance is still inferior
to existing supervised systems in terms of ROUGE scores. In addition, we
explore the effectiveness of in-context learning and chain-of-thought reasoning
for enhancing its performance. Furthermore, we find that applying an
extract-then-generate pipeline with ChatGPT yields significant performance
improvements over abstractive baselines in terms of summary faithfulness. These
observations highlight potential directions for enhancing ChatGPT's
capabilities for faithful text summarization tasks using two-stage approaches.
Comment: Work in progress
Deciphering protein glycosylation through novel mass spectrometry-based proteomic strategies
Protein glycosylation is essential for cell survival and proliferation. Comprehensive analysis of protein glycosylation can aid in a better understanding of protein functions, cellular activities, and the molecular mechanisms of diseases. Emerging mass spectrometry (MS)-based proteomics enables comprehensive analysis of protein glycosylation and many other types of modifications. However, due to the heterogeneity of glycans and the low abundance of many glycoproteins in complex biological samples, it is extraordinarily challenging to globally and site-specifically analyze glycoproteins. This thesis focuses on the development of new methods for global analysis of glycoproteins, and the applications of the newly developed methods for biomedical research. This thesis consists of six chapters. Chapter 1 is an overview of MS-based glycoproteomics analysis, with an emphasis on the endeavors in the literature to solve the two major problems for global analysis of glycoproteins mentioned above. This chapter retraces the developments of important chemical and enzymatic methods in this field, and discusses how these methods have enabled qualitative and quantitative analyses of glycoproteins in a variety of biological systems. Chapter 2 focuses on the development of a strategy that utilizes the universal recognition between boronic acid and sugars to enrich glycopeptides for LC-MS/MS analysis. Chapter 3 presents an approach to quantitative analysis of protein glycosylation that combines boronic acid enrichment with quantitative proteomics. Chapter 4 describes a strategy for cell-surface N-glycoproteome analysis. Metabolic labeling, click chemistry, and MS-based proteomics were combined to specifically map the glycoproteins located only on the cell surface.
The labeling efficiencies of different sugar analogs were compared, and this method was combined with either stable isotope labeling in cell culture (SILAC) or tandem mass tag (TMT) labeling to quantitatively study the surface N-glycoproteins. Chapter 5 explains how protein S-GlcNAcylation was unexpectedly found in human cells. Starting with an attempt to profile protein O-GlcNAc, hundreds of S-GlcNAcylation sites were surprisingly identified on cysteine residues. This modification was demonstrated to be caused neither by chemical reactions with the cleavable linker during sample preparation nor by false site assignment. Furthermore, protein S-GlcNAcylation events were investigated with different sugar analog labeling in three cell lines. Chapter 6 features an application of MS-based proteomics in biomedical research. In this chapter, the cellular responses and pleiotropic effects in statin-treated cells were analyzed at the proteome, glycoproteome, and phosphoproteome levels. In addition to the independent projects discussed above, collaborative projects were also conducted to investigate the cellular mechanisms of gold-nanorod-assisted cancer photothermal therapy and the discordance between mRNA and proteome in ovarian cancer tissues. The abstracts of the publications resulting from the collaborations are shown in the appendix. In conclusion, the work presented in this thesis mainly combines chemical biology and modern MS-based proteomics to study protein modifications, especially glycosylation. This thesis strives to advance the techniques of glycoproteomics and apply the state-of-the-art methods to investigate biological and biomedical problems.
Ph.D.
DiffuSum: Generation Enhanced Extractive Summarization with Diffusion
Extractive summarization aims to form a summary by directly extracting
sentences from the source document. Existing works mostly formulate it as a
sequence labeling problem by making individual sentence label predictions. This
paper proposes DiffuSum, a novel paradigm for extractive summarization, by
directly generating the desired summary sentence representations with diffusion
models and extracting sentences based on sentence representation matching. In
addition, DiffuSum jointly optimizes a contrastive sentence encoder with a
matching loss for sentence representation alignment and a multi-class
contrastive loss for representation diversity. Experimental results show that
DiffuSum achieves the new state-of-the-art extractive results on CNN/DailyMail
with ROUGE scores of . Experiments on the other two datasets
with different summary lengths also demonstrate the effectiveness of DiffuSum.
The strong performance of our framework shows the great potential of adapting
generative models for extractive summarization. To encourage follow-up
work, we have released our code at
https://github.com/hpzhang94/DiffuSum.
Comment: ACL 2023 Findings
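The final step described in this abstract, extracting sentences by matching generated summary representations against document sentence representations, can be illustrated with a small sketch. Everything here is an assumption made for illustration: the diffusion model that would produce the `generated` vectors is not shown, the vectors are toy values, and greedy cosine matching without reuse is a simple stand-in rather than the matching procedure used by DiffuSum. The names `cosine` and `match_sentences` are hypothetical.

```python
import math
from typing import List, Sequence


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity between two vectors (0.0 if either has zero norm)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def match_sentences(
    generated: Sequence[Sequence[float]],
    candidates: Sequence[Sequence[float]],
) -> List[int]:
    """Greedily map each generated vector to its most similar unused candidate.

    Returns the selected document sentence indices in document order.
    """
    chosen: List[int] = []
    for g in generated:
        remaining = [i for i in range(len(candidates)) if i not in chosen]
        if not remaining:  # more generated vectors than document sentences
            break
        best = max(remaining, key=lambda i: cosine(g, candidates[i]))
        chosen.append(best)
    return sorted(chosen)


# Toy document sentence embeddings and two "generated" summary vectors.
doc_vectors = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
summary_vectors = [[0.9, 0.1], [0.1, 0.8]]
selected = match_sentences(summary_vectors, doc_vectors)
```

Returning indices in document order keeps the extracted summary readable as a sequence of source sentences.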
Simultaneous Quantitation of Glycoprotein Degradation and Synthesis Rates by Integrating Isotope Labeling, Chemical Enrichment, and Multiplexed Proteomics
Protein glycosylation
is essential for cell survival and regulates
many cellular events. Reversible glycosylation is also dynamic in
biological systems. The functions of glycoproteins are regulated by
their dynamics to adapt to the ever-changing inter- and intracellular
environments. Glycans on proteins not only mediate a variety of protein
activities, but also create steric hindrance that protects the
glycoproteins from degradation by proteases. In this work, a novel
strategy integrating isotopic labeling, chemical enrichment and multiplexed
proteomics was developed to simultaneously quantify the degradation
and synthesis rates of many glycoproteins in human cells. We quantified
the synthesis rates of 847 N-glycoproteins and the degradation rates
of 704 glycoproteins in biological triplicate experiments, including
many important glycoproteins such as CD molecules. Through comparing
the synthesis and degradation rates, we found that most proteins have
higher synthesis rates since cells are still growing throughout the
time course, while a small group of proteins with lower synthesis
rates mainly participate in adhesion, locomotion, localization, and
signaling. This method can be widely applied in biochemical and biomedical
research and provide insights into glycoprotein functions
and the molecular mechanisms of many biological events.
Global and Site-Specific Analysis Revealing Unexpected and Extensive Protein S‑GlcNAcylation in Human Cells
Protein glycosylation
is highly diverse and essential for mammalian
cell survival. Heterogeneous glycans may be bound to different amino
acid residues, forming multiple types of protein glycosylation. In
this work, unexpected protein S-GlcNAcylation on cysteine residues
was observed to extensively exist in human cells through global and
site-specific analysis of protein GlcNAcylation by mass spectrometry.
Three independent experiments produced similar results of many cysteine
residues bound to N-acetylglucosamine (GlcNAc). Among
well-localized S-GlcNAcylation sites, several motifs with an acidic
amino acid around the sites were identified, which strongly suggests
that a particular type of enzyme is responsible for this modification.
Clustering results show that glycoproteins modified with S-GlcNAc
are mainly involved in cell–cell adhesion and gene expression.
For the first time, we found that proteins were extensively bound
to GlcNAc through the side chains of cysteine residues in human cells,
and the current discovery further advances our understanding of protein
glycosylation.