5 research outputs found

    Latent Jailbreak: A Test Suite for Evaluating Both Text Safety and Output Robustness of Large Language Models

    Full text link
    Considerable research efforts have been devoted to ensuring that large language models (LLMs) align with human values and generate safe text. However, an excessive focus on sensitivity to certain topics can compromise the model's robustness in following instructions, thereby impacting its overall performance in completing tasks. Previous benchmarks for jailbreaking LLMs have primarily focused on evaluating the safety of the models without considering their robustness. In this paper, we propose a benchmark that assesses both the safety and robustness of LLMs, emphasizing the need for a balanced approach. To comprehensively study text safety and output robustness, we introduce a latent jailbreak prompt dataset, each involving malicious instruction embedding. Specifically, we instruct the model to complete a regular task, such as translation, with the text to be translated containing malicious instructions. To further analyze safety and robustness, we design a hierarchical annotation framework. We present a systematic analysis of the safety and robustness of LLMs regarding the position of explicit normal instructions, word replacements (verbs in explicit normal instructions, target groups in malicious instructions, cue words for explicit normal instructions), and instruction replacements (different explicit normal instructions). Our results demonstrate that current LLMs not only prioritize certain instruction verbs but also exhibit varying jailbreak rates for different instruction verbs in explicit normal instructions. Code and data are available at https://github.com/qiuhuachuan/latent-jailbreak.Comment: Code and data are available at https://github.com/qiuhuachuan/latent-jailbrea

    Selecting the optimal dialogue response once for all from a panoramic view

    Full text link
    As an essential component of dialogue systems, response selection aims to pick out the optimal response among candidates to continue the dialogue. In existing studies, this task is usually regarded as a binary classification problem, where every candidate is ranked respectively for appropriateness. To improve its performance, we reformulate this task as a multiple-choice problem that allows the best selection to be made in one-shot inference. This new view inspires us to propose an architecture called Panoramic-encoder (Our work will be open-source for reproducibility and future research.) with a novel Candidates Attention Mechanism (CAM), which allows context-wise attention between responses and leads to fine-grained comparisons. Furthermore, we investigate and incorporate several techniques that have been proven effective for improving response selection. Experiments on three benchmarks show that our method pushes the state-of-the-art while achieving approximately 3X faster inference speed

    Understanding Client Reactions in Online Mental Health Counseling

    Full text link
    Communication success relies heavily on reading participants' reactions. Such feedback is especially important for mental health counselors, who must carefully consider the client's progress and adjust their approach accordingly. However, previous NLP research on counseling has mainly focused on studying counselors' intervention strategies rather than their clients' reactions to the intervention. This work aims to fill this gap by developing a theoretically grounded annotation framework that encompasses counselors' strategies and client reaction behaviors. The framework has been tested against a large-scale, high-quality text-based counseling dataset we collected over the past two years from an online welfare counseling platform. Our study shows how clients react to counselors' strategies, how such reactions affect the final counseling outcomes, and how counselors can adjust their strategies in response to these reactions. We also demonstrate that this study can help counselors automatically predict their clients' states.Comment: Accept to ACL 2023, oral. For code and data, see https://github.com/dll-wu/Client-Reac

    PsyBench: a balanced and in-depth Psychological Chinese Evaluation Benchmark for Foundation Models

    Full text link
    As Large Language Models (LLMs) are becoming prevalent in various fields, there is an urgent need for improved NLP benchmarks that encompass all the necessary knowledge of individual discipline. Many contemporary benchmarks for foundational models emphasize a broad range of subjects but often fall short in presenting all the critical subjects and encompassing necessary professional knowledge of them. This shortfall has led to skewed results, given that LLMs exhibit varying performance across different subjects and knowledge areas. To address this issue, we present psybench, the first comprehensive Chinese evaluation suite that covers all the necessary knowledge required for graduate entrance exams. psybench offers a deep evaluation of a model's strengths and weaknesses in psychology through multiple-choice questions. Our findings show significant differences in performance across different sections of a subject, highlighting the risk of skewed results when the knowledge in test sets is not balanced. Notably, only the ChatGPT model reaches an average accuracy above 70%70\%, indicating that there is still plenty of room for improvement. We expect that psybench will help to conduct thorough evaluations of base models' strengths and weaknesses and assist in practical application in the field of psychology
    corecore