ScisummNet: A Large Annotated Corpus and Content-Impact Models for Scientific Paper Summarization with Citation Networks
Scientific article summarization is challenging: large, annotated corpora are
not available, and the summary should ideally include the article's impact on
the research community. This paper provides novel solutions to these two
challenges. We 1) develop and release the first large-scale manually-annotated
corpus for scientific papers (on computational linguistics) by enabling faster
annotation, and 2) propose summarization methods that integrate the authors'
original highlights (abstract) and the article's actual impacts on the
community (citations), to create comprehensive, hybrid summaries. We conduct
experiments to demonstrate the efficacy of our corpus in training data-driven
models for scientific paper summarization and the advantage of our hybrid
summaries over abstracts and traditional citation-based summaries. Our large
annotated corpus and hybrid methods provide a new framework for scientific
paper summarization research.
Comment: AAAI 201
SPEC5G: A Dataset for 5G Cellular Network Protocol Analysis
5G is the 5th generation cellular network protocol. It is the
state-of-the-art global wireless standard that enables an advanced kind of
network designed to connect virtually everyone and everything with increased
speed and reduced latency. Therefore, its development, analysis, and security
are critical. However, all approaches to 5G protocol development and
security analysis (e.g., property extraction, protocol summarization, and
semantic analysis of the protocol specifications and implementations) are
completely manual. To reduce such manual effort, in this paper we curate
SPEC5G, the first-ever public 5G dataset for NLP research. The dataset contains
3,547,586 sentences with 134M words, from 13,094 cellular network specifications
and 13 online websites. By leveraging large-scale pre-trained language models
that have achieved state-of-the-art results on NLP tasks, we use this dataset
for security-related text classification and summarization. Security-related
text classification can be used to extract relevant security-related properties
for protocol testing. Summarization, in turn, can help developers and
practitioners understand the protocol at a high level, which is itself a
daunting task. Our results show the value of our 5G-centric dataset in 5G
protocol analysis automation. We believe that SPEC5G will enable a new research
direction into automatic analyses for the 5G cellular network protocol and
numerous related downstream tasks. Our data and code are publicly available