Unsupervised Dual-Cascade Learning with Pseudo-Feedback Distillation for Query-based Extractive Summarization
We propose Dual-CES -- a novel unsupervised, query-focused, multi-document
extractive summarizer. Dual-CES is designed to better handle the tradeoff
between saliency and focus in summarization. To this end, Dual-CES employs a
two-step dual-cascade optimization approach with saliency-based pseudo-feedback
distillation. Overall, Dual-CES significantly outperforms all other
state-of-the-art unsupervised alternatives. Dual-CES is even shown to
outperform strong supervised summarizers.
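The abstract does not spell out how the two cascade steps interact. As a purely illustrative toy sketch (the function names, scoring heuristics, and data flow here are assumptions, not the paper's actual method): step 1 ranks sentences by a generic saliency signal, and terms from its output serve as pseudo-feedback that is distilled into the query-focused ranking of step 2.

```python
from collections import Counter

def score(sentence: str, weights: Counter) -> float:
    """Sum of term weights over the sentence's whitespace tokens."""
    return sum(weights[w] for w in sentence.lower().split())

def dual_cascade_summarize(sentences: list[str], query: str, k: int = 2) -> list[str]:
    """Toy two-step cascade: step 1 ranks sentences by corpus-level
    saliency; terms from its top sentences act as pseudo-feedback that
    is folded into the query-focused ranking of step 2."""
    # Step 1: saliency-only ranking (here, plain term frequency over the input).
    saliency = Counter(w for s in sentences for w in s.lower().split())
    step1 = sorted(sentences, key=lambda s: score(s, saliency), reverse=True)[:k]
    # Pseudo-feedback: terms appearing in the step-1 summary.
    feedback = Counter(w for s in step1 for w in s.lower().split())
    # Step 2: query-focused ranking with the distilled feedback terms.
    focus = Counter(query.lower().split()) + feedback
    return sorted(sentences, key=lambda s: score(s, focus), reverse=True)[:k]
```

The point of the sketch is only the shape of the pipeline: an unsupervised first pass whose output re-weights a second, query-aware pass.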
Learning Concept Abstractness Using Weak Supervision
We introduce a weakly supervised approach for inferring the property of
abstractness of words and expressions in the complete absence of labeled data.
Exploiting only minimal linguistic clues and the contextual usage of a concept
as manifested in textual data, we train sufficiently powerful classifiers,
obtaining high correlation with human labels. The results imply the
applicability of this approach to additional properties of concepts, additional
languages, and resource-scarce scenarios. Comment: 6 pages, EMNLP 201
An Editorial Network for Enhanced Document Summarization
We propose the Editorial Network - a mixed extractive-abstractive
summarization approach applied as a post-processing step over a given
sequence of extracted sentences. Our network tries to imitate the decision
process of a human editor during summarization. Within such a process, each
extracted sentence may be either kept untouched, rephrased or completely
rejected. We further suggest an effective way for training the "editor" based
on a novel soft-labeling approach. Using the CNN/DailyMail dataset we
demonstrate the effectiveness of our approach compared to state-of-the-art
extractive-only or abstractive-only baseline methods.
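The abstract mentions soft-labeling for training the "editor" without giving the scheme. One plausible, purely hypothetical variant: derive a soft distribution over the three editorial actions from each extracted sentence's similarity to the reference summary.

```python
def soft_labels(similarity: float) -> dict[str, float]:
    """Map a sentence-to-reference similarity in [0, 1] to a soft
    distribution over the editor's three actions. This quadratic
    mapping is hypothetical; the actual scheme is defined in the paper.
    High similarity favors "keep", low similarity favors "reject",
    and mid-range similarity favors "rephrase"."""
    keep = similarity ** 2
    reject = (1.0 - similarity) ** 2
    rephrase = 2.0 * similarity * (1.0 - similarity)  # remainder; the three sum to 1
    return {"keep": keep, "rephrase": rephrase, "reject": reject}
```

Any mapping with this shape yields a valid target distribution for training a three-way classifier per extracted sentence.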
orgFAQ: A New Dataset and Analysis on Organizational FAQs and User Questions
Frequently Asked Questions (FAQ) webpages are created by organizations for
their users. FAQs are used in several scenarios, e.g., to answer user
questions; conversely, the content of FAQs is, by definition, shaped by user
questions. Several FAQ datasets exist to promote research in this field.
However, we argue that, having been collected from community websites, they do
not accurately represent the challenges associated with FAQs in an
organizational context. Thus, we release orgFAQ, a new dataset composed of user
questions and corresponding FAQs that were extracted from organizations'
FAQ webpages in the Jobs domain. In this paper, we provide an analysis of the
properties of such FAQs, and demonstrate the usefulness of our new dataset by
utilizing it in a relevant task from the Jobs domain. We also show the value of
the orgFAQ dataset in a task from a different domain - the COVID-19 pandemic.
A Study of Human Summaries of Scientific Articles
Researchers and students face an explosion of newly published papers which
may be relevant to their work. This led to a trend of sharing human summaries
of scientific papers. We analyze the summaries shared in one of these platforms
Shortscience.org. The goal is to characterize human summaries of scientific
papers, and use some of the insights obtained to improve and adapt existing
automatic summarization systems to the domain of scientific papers.
TalkSumm: A Dataset and Scalable Annotation Method for Scientific Paper Summarization Based on Conference Talks
Currently, no large-scale training data is available for the task of
scientific paper summarization. In this paper, we propose a novel method that
automatically generates summaries for scientific papers, by utilizing videos of
talks at scientific conferences. We hypothesize that such talks constitute a
coherent and concise description of the papers' content, and can form the basis
for good summaries. We collected 1716 papers and their corresponding videos,
and created a dataset of paper summaries. A model trained on this dataset
achieves similar performance to models trained on a dataset of summaries
created manually. In addition, we validated the quality of our summaries by
human experts. Comment: Accepted to ACL 201
A Study of BERT for Non-Factoid Question-Answering under Passage Length Constraints
We study the use of BERT for non-factoid question-answering, focusing on the
passage re-ranking task under varying passage lengths. To this end, we explore
the fine-tuning of BERT in different learning-to-rank setups, comprising both
point-wise and pair-wise methods, resulting in substantial improvements over
the state-of-the-art. We then analyze the effectiveness of BERT for different
passage lengths and suggest how to cope with large passages.
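The abstract names point-wise and pair-wise learning-to-rank setups without stating their objectives. Assuming the fine-tuned BERT model emits a scalar relevance score per (query, passage) pair, the two standard losses can be sketched as follows (function names are illustrative, not the paper's):

```python
import math

def pointwise_loss(score: float, label: int) -> float:
    """Point-wise setup: binary cross-entropy on one (query, passage)
    pair, with label 1 for a relevant passage and 0 otherwise."""
    p = 1.0 / (1.0 + math.exp(-score))  # sigmoid of the model's score
    return -(label * math.log(p) + (1 - label) * math.log(1.0 - p))

def pairwise_loss(score_pos: float, score_neg: float) -> float:
    """Pair-wise setup: RankNet-style logistic loss that pushes the
    relevant passage's score above the non-relevant one's."""
    return math.log(1.0 + math.exp(-(score_pos - score_neg)))
```

The point-wise loss treats each passage independently, while the pair-wise loss only constrains the relative order of a relevant/non-relevant pair for the same query.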
Controversy in Context
With the growing interest in social applications of Natural Language
Processing and Computational Argumentation, a natural question is how
controversial a given concept is. Prior works relied on Wikipedia's metadata
and on content analysis of the articles pertaining to a concept in question.
Here we show that the immediate textual context of a concept is strongly
indicative of this property, and, using simple and language-independent
machine-learning tools, we leverage this observation to achieve
state-of-the-art results in controversiality prediction. In addition, we
analyze and make available a new dataset of concepts labeled for
controversiality. It is significantly larger than existing datasets, and grades
concepts on a 0-10 scale, rather than treating controversiality as a binary
label. Comment: 5 pages
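The abstract's claim - that a concept's immediate textual context predicts controversiality with simple, language-independent tools - can be illustrated with a toy centroid classifier over bag-of-words contexts (the representation, scoring rule, and 0-10 mapping below are assumptions for illustration, not the paper's model):

```python
from collections import Counter
import math

def bow(text: str) -> Counter:
    """Language-independent bag-of-words over whitespace tokens."""
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[w] * b[w] for w in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def controversiality_score(context: str,
                           controversial_ctx: list[str],
                           neutral_ctx: list[str]) -> float:
    """Score a concept's immediate textual context on a 0-10 scale by
    comparing it with centroids built from labeled example contexts
    (a toy stand-in for the paper's classifiers)."""
    pos = sum((bow(t) for t in controversial_ctx), Counter())
    neg = sum((bow(t) for t in neutral_ctx), Counter())
    sp, sn = cosine(bow(context), pos), cosine(bow(context), neg)
    return 10.0 * sp / (sp + sn) if sp + sn else 5.0
```

Since only surface tokens are used, the same code works unchanged for any language with whitespace tokenization, which is the sense in which the approach is language-independent.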
Conversational Document Prediction to Assist Customer Care Agents
A frequent pattern in customer care conversations is the agents responding
with appropriate webpage URLs that address users' needs. We study the task of
predicting the documents that customer care agents can use to facilitate users'
needs. We also introduce a new public dataset which supports the aforementioned
problem. Using this dataset and two others, we investigate state-of-the-art
deep learning (DL) and information retrieval (IR) models for the task.
Additionally, we analyze the practicality of such systems in terms of inference
time complexity. Our results show that a hybrid IR+DL approach provides the best of
both worlds. Comment: EMNLP 2020. The released Twitter dataset is available at:
https://github.com/IBM/twitter-customer-care-document-predictio
A Summarization System for Scientific Documents
We present a novel system providing summaries for Computer Science
publications. Through a qualitative user study, we identified the most valuable
scenarios for discovery, exploration and understanding of scientific documents.
Based on these findings, we built a system that retrieves and summarizes
scientific documents for a given information need, either in form of a
free-text query or by choosing categorized values such as scientific tasks,
datasets and more. Our system ingested 270,000 papers, and its summarization
module aims to generate concise yet detailed summaries. We validated our
approach with human experts. Comment: Accepted to EMNLP 201