GeneGPT: Teaching Large Language Models to Use NCBI Web APIs
In this paper, we present GeneGPT, a novel method for teaching large language
models (LLMs) to use the Web Application Programming Interfaces (APIs) of the
National Center for Biotechnology Information (NCBI) and answer genomics
questions. Specifically, we prompt Codex (code-davinci-002) to solve the
GeneTuring tests with few-shot URL requests of NCBI API calls as demonstrations
for in-context learning. During inference, we stop the decoding once a call
request is detected and make the API call with the generated URL. We then
append the raw execution results returned by NCBI APIs to the generated texts
and continue the generation until the answer is found or another API call is
detected. Our preliminary results show that GeneGPT achieves state-of-the-art
results on three out of four one-shot tasks and four out of five zero-shot
tasks in the GeneTuring dataset. Overall, GeneGPT achieves a macro-average
score of 0.76, which is much higher than retrieval-augmented LLMs such as the
New Bing (0.44), biomedical LLMs such as BioMedLM (0.08) and BioGPT (0.04), as
well as other LLMs such as GPT-3 (0.16) and ChatGPT (0.12). Comment: Work in progress.
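The decode-call-append loop described in the abstract can be pictured in a few lines of Python. The sketch below is a hypothetical illustration, not the authors' implementation: llm_complete stands in for whatever completion API is used (the paper prompts Codex, code-davinci-002), and the convention for detecting a call request is an assumption; only the NCBI E-utilities base URL is real.

```python
import re
import urllib.request

# Minimal sketch of a GeneGPT-style inference loop (assumptions noted above):
# decode until the model emits an NCBI E-utilities URL, execute the call,
# append the raw response, and let generation continue.
EUTILS_URL = re.compile(r"https://eutils\.ncbi\.nlm\.nih\.gov/entrez/eutils/\S+")

def llm_complete(prompt: str) -> str:
    """Hypothetical stand-in for an LLM completion call; plug in a real client."""
    raise NotImplementedError

def answer_genomics_question(question: str, few_shot_prompt: str,
                             max_calls: int = 5) -> str:
    context = f"{few_shot_prompt}\nQuestion: {question}\nAnswer:"
    for _ in range(max_calls):
        completion = llm_complete(context)
        match = EUTILS_URL.search(completion)
        if match is None:
            return context + completion          # no API call requested: done
        # Keep the generated text up to and including the URL, execute the
        # call, and append the raw NCBI response before continuing generation.
        context += completion[:match.end()]
        with urllib.request.urlopen(match.group(0)) as resp:
            context += " -> " + resp.read().decode("utf-8", errors="replace") + "\n"
    return context
```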
Engage Wider Audience or Facilitate Quality Answers? A Mixed-methods Analysis of Questioning Strategies for Research Sensemaking on a Community Q&A Site
Discussing research-sensemaking questions on Community Question and Answering
(CQA) platforms has been an increasingly common practice for the public to
participate in science communication. Nonetheless, how users strategically
craft research-sensemaking questions to engage public participation and
facilitate knowledge construction is a significant yet less understood problem.
To fill this gap, we collected 837 science-related questions and 157,684
answers from Zhihu, and conducted a mixed-methods study to explore
user-developed strategies in proposing research-sensemaking questions, and
their potential effects on public engagement and knowledge construction.
Through open coding, we captured a comprehensive taxonomy of question-crafting
strategies, such as eye-catching narratives with counter-intuitive claims and
rigorous descriptions with data use. Regression analysis indicated that these
strategies correlated with user engagement and answer construction in different
ways (e.g., emotional questions attracted more views and answers), yet there
was a general divergence between wide participation and quality knowledge
establishment, as most questioning strategies could not ensure both. Based on
log analysis, we further found that collaborative editing afforded unique
values in refining research-sensemaking questions regarding accuracy, rigor,
comprehensiveness and attractiveness. We propose design implications to
facilitate accessible, accurate and engaging science communication on CQA
platforms. Comment: 31 pages, 5 figures. Accepted for publication in Proceedings of the
ACM on Human-Computer Interaction (CSCW 2024).
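The regression step mentioned in the abstract can be pictured as fitting engagement metrics against indicator variables for the coded questioning strategies. The sketch below is purely illustrative: the column names and the choice of an OLS model on log-transformed views are assumptions, not the authors' analysis code.

```python
import pandas as pd
import statsmodels.formula.api as smf

# Hypothetical coded dataset: one row per question, with binary indicators for
# question-crafting strategies and engagement outcomes (column names assumed).
df = pd.read_csv("zhihu_questions.csv")

# Regress a log-transformed engagement metric on strategy indicators.
model = smf.ols("log_views ~ uses_emotion + counter_intuitive + cites_data",
                data=df).fit()
print(model.summary())
```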
Error Analysis Prompting Enables Human-Like Translation Evaluation in Large Language Models: A Case Study on ChatGPT
Generative large language models (LLMs), e.g., ChatGPT, have demonstrated
remarkable proficiency across several NLP tasks, such as machine translation
and text summarization. Recent research (Kocmi and Federmann, 2023) has shown that
utilizing ChatGPT for assessing the quality of machine translation (MT)
achieves state-of-the-art performance at the system level but performs poorly
at the segment level. To further improve the performance of LLMs on MT quality
assessment, we conduct an investigation into several prompting methods, and
propose a new prompting method called Error Analysis Prompting (EAPrompt) by
combining Chain-of-Thoughts (Wei et al., 2022) and Error Analysis (Lu et al.,
2022). Our results on WMT22 indicate that prompting LLMs like ChatGPT with
error analysis can generate human-like MT evaluations at both the system and
segment level. Additionally, we are the first to identify some limitations of
ChatGPT as an MT evaluator: for example, changing the order of inputs may
significantly influence its judgment when multiple translations are provided in a single query.
This work provides preliminary experience with prompting LLMs as evaluators
to improve the reliability of translation evaluation metrics under the error
analysis paradigm.
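As a rough illustration of this error-analysis style of prompting, the sketch below asks the model to list major and minor translation errors and then report counts that are converted into a segment-level score. The prompt wording, the MQM-style severity weights, and the helper names are illustrative assumptions, not the EAPrompt template from the paper.

```python
import re

def build_error_analysis_prompt(source: str, reference: str, hypothesis: str) -> str:
    # Two-step, chain-of-thought-style instruction: identify errors first,
    # then count them by severity (wording is assumed, not the paper's).
    return (
        "You are evaluating a machine translation.\n"
        f"Source: {source}\n"
        f"Reference: {reference}\n"
        f"Translation: {hypothesis}\n\n"
        "Step 1: List the major errors (those that distort the meaning) and the "
        "minor errors (smaller fluency or word-choice issues), one per line.\n"
        "Step 2: Report the counts as 'Major: <n>' and 'Minor: <n>'."
    )

def parse_counts(llm_output: str) -> tuple[int, int]:
    # Extract the reported counts; default to 0 if a count is missing.
    major = re.search(r"Major:\s*(\d+)", llm_output)
    minor = re.search(r"Minor:\s*(\d+)", llm_output)
    return (int(major.group(1)) if major else 0,
            int(minor.group(1)) if minor else 0)

def segment_score(n_major: int, n_minor: int) -> float:
    # MQM-style penalty with heavier weight on major errors (weights assumed).
    return -(5 * n_major + 1 * n_minor)
```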