
    Evaluating Human-Language Model Interaction

    Many real-world applications of language models (LMs), such as writing assistance and code autocomplete, involve human-LM interaction. However, most benchmarks are non-interactive in that a model produces output without human involvement. To evaluate human-LM interaction, we develop a new framework, Human-AI Language-based Interaction Evaluation (HALIE), that defines the components of interactive systems and dimensions to consider when designing evaluation metrics. Compared to standard, non-interactive evaluation, HALIE captures (i) the interactive process, not only the final output; (ii) the first-person subjective experience, not just a third-party assessment; and (iii) notions of preference beyond quality (e.g., enjoyment and ownership). We then design five tasks to cover different forms of interaction: social dialogue, question answering, crossword puzzles, summarization, and metaphor generation. With four state-of-the-art LMs (three variants of OpenAI's GPT-3 and AI21 Labs' Jurassic-1), we find that better non-interactive performance does not always translate to better human-LM interaction. In particular, we highlight three cases where the results from non-interactive and interactive metrics diverge and underscore the importance of human-LM interaction for LM evaluation.
    (Comment: Authored by the Center for Research on Foundation Models (CRFM) at the Stanford Institute for Human-Centered Artificial Intelligence (HAI).)
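    The distinction HALIE draws between evaluating a final output and evaluating the interactive process can be illustrated with a small logging sketch. The class and field names below are purely illustrative assumptions, not the framework's actual API: a session records the full event trace and first-person survey ratings (e.g., enjoyment, ownership) alongside the final output.

```python
from dataclasses import dataclass, field

# Hypothetical sketch, not HALIE's real interface: a session keeps the
# interaction trace and subjective ratings, so metrics can be computed
# over the process rather than only over the final output.
@dataclass
class InteractiveSession:
    task: str
    events: list = field(default_factory=list)   # (timestamp, actor, action) tuples
    final_output: str = ""
    survey: dict = field(default_factory=dict)   # e.g. {"enjoyment": 4, "ownership": 5}

    def log(self, t, actor, action):
        self.events.append((t, actor, action))

    def process_metrics(self):
        # Process-level metrics look at the trace, not the final answer.
        human_turns = sum(1 for _, actor, _ in self.events if actor == "human")
        return {"num_human_turns": human_turns, "total_events": len(self.events)}

session = InteractiveSession(task="crossword")
session.log(0.0, "human", "ask clue 3-down")
session.log(1.2, "lm", "suggest 'ORBIT'")
session.log(2.5, "human", "accept suggestion")
session.final_output = "ORBIT"
session.survey = {"enjoyment": 4, "ownership": 3}
print(session.process_metrics())  # {'num_human_turns': 2, 'total_events': 3}
```

    A non-interactive benchmark would score only `final_output`; the point of the abstract is that the trace and the survey carry signal the output alone does not.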

    The design and study of pedagogical paper recommendation

    For learners engaged in senior-level courses, tutors often want to pick articles as supplementary reading material for each week. Unlike researchers 'Googling' papers from the Internet, tutors making recommendations must consider the course syllabus and their assessment of learners along many dimensions; simply 'Googling' articles is far from enough. That is, a model of each individual learner, including their learning interests, knowledge, goals, etc., should be considered when recommending papers: the suitability of a paper for a learner should be computed by combining how interesting the paper is to the learner with how appropriate it is for helping the learner in general. This type of recommender is called a Pedagogical Paper Recommender.
    In this thesis, we propose a set of recommendation methods for a Pedagogical Paper Recommender and study the important issues surrounding it. Experimental studies confirm that making recommendations to learners in social learning environments is not the same as making recommendations to users in commercial environments such as Amazon.com. In learning environments, learners are willing to accept items that are not interesting yet meet their learning goals in some way; a learner's overall impression of a paper depends not solely on its interestingness, but also on other factors, such as the degree to which the paper helps meet their 'cognitive' goals.
    It is also observed that most of the recommendation methods are scalable. Although the degree of this scalability is still unclear, we conjecture that the methods maintain their recommendation accuracy for collections of up to 50 papers. The experiments conducted so far, and the suggestions made on the adoption of recommendation methods, are based on data collected during one semester of a course. Therefore, the generality of the results needs further validation before firmer conclusions can be drawn. These follow-up studies should ideally span more semesters of the same or related courses, with newly added papers, after which some open issues can be further investigated. Despite these weaknesses, this study achieves the research goals set out for the proposed pedagogical paper recommender, which, although intuitive, has been largely ignored in the research community. Finding a 'good' paper is not trivial: it is not simply a matter of the user accepting or rejecting a recommended item; rather, it is a multi-step process that typically entails the user navigating the paper collection, understanding the recommended items, seeing what others like or dislike, and making decisions. A future research goal, therefore, is to design different kinds of social navigation and to study their respective impacts on user behavior, and how, over time, user behavior feeds back to influence system performance.
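    The abstract's core claim, that suitability combines interest fit with fit to the learner's learning goals rather than interestingness alone, can be sketched as a weighted score. The features, weights, and field names below are illustrative assumptions, not the thesis's actual model.

```python
# Illustrative sketch of a pedagogical suitability score (all feature
# names and weights are invented for illustration, not the thesis's model):
# topical interest is combined with fit to the learner's cognitive goals.
def suitability(paper, learner, w_interest=0.4, w_goal=0.6):
    # Fraction of the paper's topics that match the learner's interests.
    interest_fit = len(paper["topics"] & learner["interests"]) / max(len(paper["topics"]), 1)
    # Crude pedagogical fit: full credit when the difficulty level matches.
    goal_fit = 1.0 if paper["level"] == learner["target_level"] else 0.5
    return w_interest * interest_fit + w_goal * goal_fit

paper = {"topics": {"collaborative filtering", "e-learning"}, "level": "intermediate"}
learner = {"interests": {"e-learning"}, "target_level": "intermediate"}
score = suitability(paper, learner)
print(round(score, 2))  # 0.8
```

    A paper that is only half-interesting but squarely on the learner's goal level still scores well here, which mirrors the observation that learners accept uninteresting papers that meet their learning goals.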

    NormDial: A Comparable Bilingual Synthetic Dialog Dataset for Modeling Social Norm Adherence and Violation

    Social norms fundamentally shape interpersonal communication. We present NormDial, a high-quality dyadic dialogue dataset with turn-by-turn annotations of social norm adherences and violations for Chinese and American cultures. Introducing the task of social norm observance detection, our dataset is synthetically generated in both Chinese and English using a human-in-the-loop pipeline by prompting large language models with a small collection of expert-annotated social norms. We show that our generated dialogues are of high quality through human evaluation and further evaluate the performance of existing large language models on this task. Our findings point towards new directions for understanding the nuances of social norms as they manifest in conversational contexts that span across languages and cultures.
    (Comment: EMNLP 2023 Main Conference, Short Paper; Data at https://github.com/Aochong-Li/NormDia)

    Should We Collaborate with AI to Conduct Literature Reviews? Changing Epistemic Values in a Flattening World

    In this paper, we revisit the issue of collaborating with artificial intelligence (AI) to conduct literature reviews and discuss whether this should be done and how it could be done. We also call for further reflection on the epistemic values at risk when using certain types of AI tools based on machine learning or generative AI at different stages of the review process, which often requires the scope to be redefined and fundamentally follows an iterative process. Although AI tools accelerate search and screening tasks, particularly when vast amounts of literature are involved, they may compromise quality, especially with respect to transparency and explainability. Expert systems are less likely to have a negative impact on these tasks. In a broader context, any AI method should preserve researchers' ability to critically select, analyze, and interpret the literature.

    ์ง€์‹ ๊ธฐ๋ฐ˜ ๋Œ€ํ™”์—์„œ์˜ ๋Œ€ํ™” ํŠน์„ฑ์„ ํ™œ์šฉํ•œ ์ง€์‹ ์„ ํƒ ๋ฐ ๋žญํ‚น ๋ฐฉ๋ฒ•

    Doctoral dissertation, Seoul National University, College of Engineering, Department of Electrical and Computer Engineering, August 2022. Advisor: Sang-goo Lee.
    A knowledge-grounded conversation (KGC) model aims to generate informative responses relevant to both the conversation history and external knowledge. One of the most important parts of a KGC model is finding the knowledge on which its responses are grounded. If the model selects inappropriate knowledge, it may produce responses that are irrelevant or lack knowledge. In this dissertation, we study methods of leveraging conversational characteristics to select or rank knowledge for knowledge-grounded conversation. In particular, this dissertation provides two novel methods: one focuses on the sequential structure of multi-turn conversation, and the other on utilizing the local context and topic of a long conversation. We first propose two knowledge selection strategies, one of which preserves sequential matching features while the other encodes the sequential nature of the conversation. Second, we propose a novel knowledge ranking model that composes an appropriate range of relevant documents by exploiting both the topic keywords and the local context of a conversation. In addition, we apply the knowledge ranking model to quote recommendation with a new quote recommendation framework that provides hard negative samples to the model. Our experimental results show that KGC models based on our proposed knowledge selection and ranking methods outperform competitive models in terms of groundedness and relevance.
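    The dual-matching idea in the abstract, ranking candidate knowledge by combining similarity to the local context with similarity to the conversation's topic keywords, can be sketched as follows. The word-overlap similarity, the interpolation weight `alpha`, and all example strings are assumptions for illustration, not the dissertation's actual model.

```python
# Minimal sketch of topic-aware dual matching for knowledge ranking
# (scoring function and alpha weight are invented for illustration):
# each passage is matched against both the local context and the topic.
def overlap(a, b):
    # Jaccard overlap of word sets, a crude stand-in for a learned matcher.
    a, b = set(a.lower().split()), set(b.lower().split())
    return len(a & b) / max(len(a | b), 1)

def rank_knowledge(passages, local_context, topic_keywords, alpha=0.5):
    topics = " ".join(topic_keywords)
    scored = [(alpha * overlap(p, local_context) + (1 - alpha) * overlap(p, topics), p)
              for p in passages]
    return [p for _, p in sorted(scored, reverse=True)]

passages = ["the moon orbits the earth", "pizza originated in italy"]
ranked = rank_knowledge(passages, "tell me about the moon", ["moon", "orbit"])
print(ranked[0])  # "the moon orbits the earth"
```

    In the dissertation's setting the matcher is learned rather than lexical, but the structure is the same: two relevance signals, local context and topic, are combined to decide which documents enter the grounding set.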