Prior studies have shown that approaches that generate an answer summary
for a given technical query on Software Question and Answer (SQA) sites are
highly desirable. We find that existing approaches are assessed solely through
user studies, and a benchmark with ground truth summaries is needed to
complement such assessment. Unfortunately, no such benchmark exists for answer
summarization of technical queries from SQA sites. To fill this gap, we
manually construct a high-quality benchmark that enables automatic evaluation
of answer summarization for technical queries from SQA sites. Using the
benchmark, we comprehensively evaluate the performance of existing approaches
and find that there is still considerable room for improvement.
Motivated by these results, we propose a new approach, TechSumBot, with three
key modules: 1) a Usefulness Ranking module, 2) a Centrality Estimation
module, and 3) a Redundancy Removal module. We evaluate TechSumBot both
automatically (i.e., using our benchmark) and manually (i.e., via a user
study). The results
from both evaluations consistently demonstrate that TechSumBot outperforms the
best-performing baseline approaches from both the SE and NLP domains by a
large margin, i.e., by 10.83%-14.90%, 32.75%-36.59%, and 12.61%-17.54% in
terms of ROUGE-1, ROUGE-2, and ROUGE-L in the automatic evaluation, and by
5.79%-9.23% and 17.03%-17.68% in terms of average usefulness and diversity
scores in the human evaluation. This highlights that automatic evaluation on
our benchmark can
uncover findings similar to those found through user studies. More
importantly, automatic evaluation has a much lower cost, especially when it is
used to assess a new approach. We also conducted an ablation study, which
demonstrates that each module in TechSumBot contributes to its overall
performance.

Comment: Accepted by ASE 202