
    GeneGPT: Teaching Large Language Models to Use NCBI Web APIs

    In this paper, we present GeneGPT, a novel method for teaching large language models (LLMs) to use the Web Application Programming Interfaces (APIs) of the National Center for Biotechnology Information (NCBI) and answer genomics questions. Specifically, we prompt Codex (code-davinci-002) to solve the GeneTuring tests with few-shot URL requests of NCBI API calls as demonstrations for in-context learning. During inference, we stop the decoding once a call request is detected and make the API call with the generated URL. We then append the raw execution results returned by the NCBI APIs to the generated text and continue the generation until the answer is found or another API call is detected. Our preliminary results show that GeneGPT achieves state-of-the-art results on three out of four one-shot tasks and four out of five zero-shot tasks in the GeneTuring dataset. Overall, GeneGPT achieves a macro-average score of 0.76, which is much higher than retrieval-augmented LLMs such as the New Bing (0.44), biomedical LLMs such as BioMedLM (0.08) and BioGPT (0.04), as well as other LLMs such as GPT-3 (0.16) and ChatGPT (0.12). Comment: Work in progress.
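
    As a rough illustration of the inference loop the abstract describes, the following Python sketch alternates between decoding and API execution. The generate() wrapper, the stop-on-URL behavior, and the extraction regex are hypothetical stand-ins rather than the paper's implementation; only the NCBI E-utils base URL is real.

        import re
        import urllib.request

        # Matches an NCBI E-utils URL emitted by the model as a call request.
        CALL_PATTERN = re.compile(r"https://eutils\.ncbi\.nlm\.nih\.gov/\S+")

        def answer(prompt, generate, max_calls=5):
            # `generate` is a hypothetical LLM wrapper that continues `text`
            # and stops decoding as soon as a call request appears.
            text = prompt
            for _ in range(max_calls):
                chunk = generate(text)
                text += chunk
                match = CALL_PATTERN.search(chunk)
                if match is None:  # no new API call: the answer is complete
                    return text
                with urllib.request.urlopen(match.group()) as resp:
                    # Append the raw NCBI output and continue generating.
                    text += resp.read().decode()
            return text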

    Engage Wider Audience or Facilitate Quality Answers? a Mixed-methods Analysis of Questioning Strategies for Research Sensemaking on a Community Q&A Site

    Discussing research-sensemaking questions on Community Question Answering (CQA) platforms has become an increasingly common way for the public to participate in science communication. Nonetheless, how users strategically craft research-sensemaking questions to engage public participation and facilitate knowledge construction is a significant yet poorly understood problem. To fill this gap, we collected 837 science-related questions and 157,684 answers from Zhihu, and conducted a mixed-methods study to explore user-developed strategies for proposing research-sensemaking questions and their potential effects on public engagement and knowledge construction. Through open coding, we captured a comprehensive taxonomy of question-crafting strategies, such as eye-catching narratives with counter-intuitive claims and rigorous descriptions with data use. Regression analysis indicated that these strategies correlated with user engagement and answer construction in different ways (e.g., emotional questions attracted more views and answers), yet there existed a general divergence between wide participation and quality knowledge establishment, as most questioning strategies could not ensure both. Based on log analysis, we further found that collaborative editing afforded unique value in refining research-sensemaking questions regarding accuracy, rigor, comprehensiveness and attractiveness. We propose design implications to facilitate accessible, accurate and engaging science communication on CQA platforms. Comment: 31 pages, 5 figures. Accepted for publication in Proceedings of the ACM on Human-Computer Interaction (CSCW 2024).
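
    To make the reported analysis concrete, here is a minimal sketch of the kind of regression the abstract mentions, assuming a pandas DataFrame with hypothetical column names; the paper's actual variables and model family may differ.

        import pandas as pd
        import statsmodels.formula.api as smf

        # Hypothetical data: one row per question, with binary indicators
        # for questioning strategies and engagement outcomes.
        df = pd.read_csv("questions.csv")

        # Regress views on strategy indicators (illustrative specification).
        model = smf.ols("views ~ emotional + counter_intuitive + uses_data",
                        data=df).fit()
        print(model.summary())  # coefficients link strategies to engagement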

    Error Analysis Prompting Enables Human-Like Translation Evaluation in Large Language Models: A Case Study on ChatGPT

    Generative large language models (LLMs), e.g., ChatGPT, have demonstrated remarkable proficiency across several NLP tasks, such as machine translation and text summarization. Recent research (Kocmi and Federmann, 2023) has shown that using ChatGPT to assess machine translation (MT) quality achieves state-of-the-art performance at the system level but performs poorly at the segment level. To further improve the performance of LLMs on MT quality assessment, we investigate several prompting methods and propose a new one, Error Analysis Prompting (EAPrompt), which combines Chain-of-Thought (Wei et al., 2022) and Error Analysis (Lu et al., 2022). Our results on WMT22 indicate that prompting LLMs like ChatGPT with error analysis can generate human-like MT evaluations at both the system and segment level. Additionally, we are the first to identify some limitations of ChatGPT as an MT evaluator; for example, when multiple translations are provided in a single query, changing their order may significantly influence its judgment. This work provides preliminary experience with prompting LLMs as evaluators to improve the reliability of translation evaluation metrics under the error analysis paradigm.
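
    A minimal sketch of what error-analysis-style prompting could look like, assuming a hypothetical chat() wrapper around the LLM; the two-step template wording is illustrative, not the paper's exact EAPrompt.

        from textwrap import dedent

        # Illustrative two-step template: identify errors, then score.
        EA_PROMPT = dedent("""\
            Source: {src}
            Translation: {hyp}
            First, identify the major and minor errors in this translation,
            reasoning step by step. Then count the major and minor errors
            and give a final quality score.""")

        def evaluate(src, hyp, chat):
            # `chat` is a hypothetical single-turn wrapper around a chat LLM.
            return chat(EA_PROMPT.format(src=src, hyp=hyp))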