Search CORE

135 research outputs found

LCSTS: A Large Scale Chinese Short Text Summarization Dataset

Author: Chen Qingcai
Hu Baotian
Zhu Fangze
Publication venue
Publication date: 04/12/2015
Field of study

Automatic text summarization is widely regarded as the highly difficult problem, partially because of the lack of large text summarization data set. Due to the great challenge of constructing the large scale summaries for full text, in this paper, we introduce a large corpus of Chinese short text summarization dataset constructed from the Chinese microblogging website Sina Weibo, which is released to the public {http://icrc.hitsz.edu.cn/Article/show/139.html}. This corpus consists of over 2 million real Chinese short texts with short summaries given by the author of each text. We also manually tagged the relevance of 10,666 short summaries with their corresponding short texts. Based on the corpus, we introduce recurrent neural network for the summary generation and achieve promising results, which not only shows the usefulness of the proposed corpus for short text summarization research, but also provides a baseline for further research on this topic.Comment: Recently, we received feedbacks from Yuya Taguchi from NAIST in Japan and Qian Chen from USTC of China, that the results in the EMNLP2015 version seem to be underrated. So we carefully checked our results and find out that we made a mistake while using the standard ROUGE. Then we re-evaluate all methods in the paper and get corrected results listed in Table 2 of this versio

arXiv.org e-Print Archive

CiteSeerX

Answer Sequence Learning with Neural Networks for Answer Selection in Community Question Answering

Author: Chen Qingcai
Hu Baotian
Tang Buzhou
Wang Xiaolong
Zhou Xiaoqiang
Publication venue
Publication date: 01/01/2015
Field of study

In this paper, the answer selection problem in community question answering (CQA) is regarded as an answer sequence labeling task, and a novel approach is proposed based on the recurrent architecture for this problem. Our approach applies convolution neural networks (CNNs) to learning the joint representation of question-answer pair firstly, and then uses the joint representation as input of the long short-term memory (LSTM) to learn the answer sequence of a question for labeling the matching quality of each answer. Experiments conducted on the SemEval 2015 CQA dataset shows the effectiveness of our approach.Comment: 6 page

arXiv.org e-Print Archive

Crossref

Calibration Meets Explanation: A Simple and Effective Approach for Model Confidence Estimates

Author: Chen Qingcai
Hu Baotian
Li Dongfang
Publication venue
Publication date: 06/11/2022
Field of study

Calibration strengthens the trustworthiness of black-box models by producing better accurate confidence estimates on given examples. However, little is known about if model explanations can help confidence calibration. Intuitively, humans look at important features attributions and decide whether the model is trustworthy. Similarly, the explanations can tell us when the model may or may not know. Inspired by this, we propose a method named CME that leverages model explanations to make the model less confident with non-inductive attributions. The idea is that when the model is not highly confident, it is difficult to identify strong indications of any class, and the tokens accordingly do not have high attribution scores for any class and vice versa. We conduct extensive experiments on six datasets with two popular pre-trained language models in the in-domain and out-of-domain settings. The results show that CME improves calibration performance in all settings. The expected calibration errors are further reduced when combined with temperature scaling. Our findings highlight that model explanations can help calibrate posterior estimates.Comment: EMNLP 202

arXiv.org e-Print Archive

Prompt-based Text Entailment for Low-Resource Named Entity Recognition

Author: Chen Qingcai
Hu Baotian
Li Dongfang
Publication venue
Publication date: 06/11/2022
Field of study

Pre-trained Language Models (PLMs) have been applied in NLP tasks and achieve promising results. Nevertheless, the fine-tuning procedure needs labeled data of the target domain, making it difficult to learn in low-resource and non-trivial labeled scenarios. To address these challenges, we propose Prompt-based Text Entailment (PTE) for low-resource named entity recognition, which better leverages knowledge in the PLMs. We first reformulate named entity recognition as the text entailment task. The original sentence with entity type-specific prompts is fed into PLMs to get entailment scores for each candidate. The entity type with the top score is then selected as final label. Then, we inject tagging labels into prompts and treat words as basic units instead of n-gram spans to reduce time complexity in generating candidates by n-grams enumeration. Experimental results demonstrate that the proposed method PTE achieves competitive performance on the CoNLL03 dataset, and better than fine-tuned counterparts on the MIT Movie and Few-NERD dataset in low-resource settings.Comment: COLING 202

arXiv.org e-Print Archive

SeDR: Segment Representation Learning for Long Documents Dense Retrieval

Author: Chen Junying
Chen Qingcai
Huang Yutao
Li Dongfang
Publication venue
Publication date: 19/11/2022
Field of study

Recently, Dense Retrieval (DR) has become a promising solution to document retrieval, where document representations are used to perform effective and efficient semantic search. However, DR remains challenging on long documents, due to the quadratic complexity of its Transformer-based encoder and the finite capacity of a low-dimension embedding. Current DR models use suboptimal strategies such as truncating or splitting-and-pooling to long documents leading to poor utilization of whole document information. In this work, to tackle this problem, we propose Segment representation learning for long documents Dense Retrieval (SeDR). In SeDR, Segment-Interaction Transformer is proposed to encode long documents into document-aware and segment-sensitive representations, while it holds the complexity of splitting-and-pooling and outperforms other segment-interaction patterns on DR. Since GPU memory requirements for long document encoding causes insufficient negatives for DR training, Late-Cache Negative is further proposed to provide additional cache negatives for optimizing representation learning. Experiments on MS MARCO and TREC-DL datasets show that SeDR achieves superior performance among DR models, and confirm the effectiveness of SeDR on long document retrieval

arXiv.org e-Print Archive

A Survey on Table Question Answering: Recent Advances

Author: Chen Qingcai
Jin Nengzheng
Li Dongfang
Siebert Joanna
Publication venue
Publication date: 11/07/2022
Field of study

Table Question Answering (Table QA) refers to providing precise answers from tables to answer a user's question. In recent years, there have been a lot of works on table QA, but there is a lack of comprehensive surveys on this research topic. Hence, we aim to provide an overview of available datasets and representative methods in table QA. We classify existing methods for table QA into five categories according to their techniques, which include semantic-parsing-based, generative, extractive, matching-based, and retriever-reader-based methods. Moreover, as table QA is still a challenging task for existing methods, we also identify and outline several key challenges and discuss the potential future directions of table QA.Comment: 13 page

arXiv.org e-Print Archive