35 research outputs found

    Evaluating NLG Evaluation Metrics: A Measurement Theory Perspective

    We address a fundamental challenge in Natural Language Generation (NLG) model evaluation: the design and validation of evaluation metrics. Recognizing the limitations of existing metrics and issues with human judgment, we propose using measurement theory, the foundation of test design, as a framework for conceptualizing and evaluating the validity and reliability of NLG evaluation metrics. This approach offers a systematic method for defining "good" metrics, developing robust metrics, and assessing metric performance. In this paper, we introduce core concepts in measurement theory in the context of NLG evaluation and key methods to evaluate the performance of NLG metrics. Through this framework, we aim to promote the design, evaluation, and interpretation of valid and reliable metrics, ultimately contributing to the advancement of robust and effective NLG models in real-world settings.

    Supporting Qualitative Analysis with Large Language Models: Combining Codebook with GPT-3 for Deductive Coding

    Qualitative analysis of textual content unpacks rich and valuable information by assigning labels to the data. However, this process is often labor-intensive, particularly when working with large datasets. While recent AI-based tools demonstrate utility, researchers may not have readily available AI resources and expertise, and the task-specific models behind those tools generalize poorly. In this study, we explored the use of large language models (LLMs) in supporting deductive coding, a major category of qualitative analysis where researchers use pre-determined codebooks to label the data into a fixed set of codes. Instead of training task-specific models, a pre-trained LLM could be used directly for various tasks without fine-tuning through prompt learning. Using a curiosity-driven questions coding task as a case study, we found that, by combining GPT-3 with expert-drafted codebooks, our proposed approach achieved fair to substantial agreement with expert-coded results. We lay out challenges and opportunities in using LLMs to support qualitative coding and beyond. Comment: 28th International Conference on Intelligent User Interfaces (IUI '23 Companion), March 27–31, 2023, Sydney, NSW, Australia
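
    The phrase "fair to substantial agreement" matches the conventional Landis and Koch labels for Cohen's kappa (0.21–0.40 fair, 0.41–0.60 moderate, 0.61–0.80 substantial). The abstract does not name the statistic, so treating it as Cohen's kappa is an assumption. A minimal sketch of how such agreement between expert codes and model codes could be computed follows; the function name and the example labels are hypothetical and are not data from the study.

```python
from collections import Counter


def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two raters who labelled the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("both raters must label the same non-empty item list")
    n = len(labels_a)
    # Observed agreement: share of items given the same code by both raters.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement: product of each rater's marginal code frequencies.
    freq_a = Counter(labels_a)
    freq_b = Counter(labels_b)
    expected = sum(freq_a[code] * freq_b[code] for code in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used one identical code throughout
    return (observed - expected) / (1.0 - expected)


# Hypothetical codes, for illustration only (not from the paper).
expert = ["why", "how", "what", "why", "how", "what", "why", "how"]
model = ["why", "how", "what", "how", "how", "what", "why", "what"]
kappa = cohen_kappa(expert, model)
print(f"kappa = {kappa:.3f}")
```

    Kappa corrects raw percent agreement for the agreement expected by chance, which is why it is commonly preferred over plain accuracy when a codebook has few, unevenly used codes.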

    ByteSized32: A Corpus and Challenge Task for Generating Task-Specific World Models Expressed as Text Games

    In this work we examine the ability of language models to generate explicit world models of scientific and common-sense reasoning tasks by framing this as a problem of generating text-based games. To support this, we introduce ByteSized32, a corpus of 32 highly-templated text games written in Python totaling 24k lines of code, each centered around a particular task, and paired with a set of 16 unseen text game specifications for evaluation. We propose a suite of automatic and manual metrics for assessing simulation validity, compliance with task specifications, playability, winnability, and alignment with the physical world. In a single-shot evaluation of GPT-4 on this simulation-as-code-generation task, we find it capable of producing runnable games in 27% of cases, highlighting the difficulty of this challenge task. We discuss areas of future improvement, including GPT-4's apparent capacity to perform well at simulating near canonical task solutions, with performance dropping off as simulations include distractors or deviate from canonical solutions in the action space. Comment: 10 pages

    Prevalence of insomnia symptoms and their associated factors in patients treated in outpatient clinics of four general hospitals in Guangzhou, China

    Background: Data on the prevalence of insomnia symptoms in medical outpatient clinics in China are lacking. This study examined the prevalence of insomnia symptoms and their socio-demographic correlates in patients treated at medical outpatient clinics affiliated with four general hospitals in Guangzhou, a large metropolis in southern China. Method: A total of 4399 patients were consecutively invited to participate in the study. Data on insomnia and its socio-demographic correlates were collected with standardized questionnaires. Results: The prevalence of any type of insomnia symptoms was 22.1% (95% confidence interval (CI): 20.9–23.3%); the prevalence of difficulty initiating sleep was 14.3%, difficulty maintaining sleep was 16.2%, and early morning awakening was 12.4%. Only 17.5% of the patients suffering from insomnia received sleeping pills. Multiple logistic regression analysis revealed that male gender, education level, rural residence, and being unemployed or retired were negatively associated with insomnia symptoms, while lacking health insurance, older age and more severe depressive symptoms were positively associated with insomnia symptoms. Conclusions: Insomnia symptoms are common in patients attending medical outpatient clinics in Guangzhou. Increasing awareness of sleep hygiene measures, regular screening and psychosocial and pharmacological interventions for insomnia are needed in China. Trial registration: ChiCTR-INR-16008066. Registered 8 March 2016
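
    The reported interval for the overall prevalence can be checked with a normal-approximation (Wald) confidence interval. The sketch below assumes that all 4399 invited patients form the denominator and that the authors used a Wald interval; the abstract states neither, so both are assumptions, and the function name is hypothetical.

```python
import math


def wald_ci(p_hat, n, z=1.96):
    """Normal-approximation confidence interval for a proportion."""
    se = math.sqrt(p_hat * (1.0 - p_hat) / n)  # standard error of p_hat
    return p_hat - z * se, p_hat + z * se


# Prevalence of any insomnia symptom from the abstract: 22.1%.
# n = 4399 is assumed to be the analysed sample size.
low, high = wald_ci(0.221, 4399)
print(f"95% CI: {low:.1%} to {high:.1%}")
```

    Under these assumptions the bounds round to 20.9% and 23.3%, which agrees with the interval given in the abstract.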

    If I Hear You Correctly: Building and Evaluating Interview Chatbots with Active Listening Skills

    Interview chatbots engage users in a text-based conversation to draw out their views and opinions. It is, however, challenging to build effective interview chatbots that can handle users' free-text responses to open-ended questions and deliver an engaging user experience. As a first step, we are investigating the feasibility and effectiveness of using publicly available, practical AI technologies to build effective interview chatbots. To demonstrate feasibility, we built a prototype scoped to equip interview chatbots with a subset of active listening skills: the abilities to comprehend a user's input and respond properly. To evaluate the effectiveness of our prototype, we compared the performance of interview chatbots with or without active listening skills on four common interview topics in a live evaluation with 206 users. Our work presents practical design implications for building effective interview chatbots, hybrid chatbot platforms, and empathetic chatbots beyond interview tasks. Comment: Working draft. To appear in the ACM CHI Conference on Human Factors in Computing Systems (CHI 2020)

    The Ninth Visual Object Tracking VOT2021 Challenge Results

    Accepted version. Peer reviewed

    What should I Ask: A Knowledge-driven Approach for Follow-up Questions Generation in Conversational Surveys

    Conversational surveys, where an agent asks open-ended questions through natural language interfaces, offer a new way to collect information from people. A good follow-up question in a conversational survey elicits high-quality information and delivers an engaging experience. However, generating high-quality follow-up questions on the fly is a non-trivial task. The agent needs to understand diverse and complex participant responses, adhere to the survey goal, and generate clear and coherent questions. In this study, we propose a knowledge-driven follow-up question generation framework. The framework combines a knowledge selection module, which identifies salient topics in participants' responses, with a generative model guided by the selected knowledge entity-relation pairs. To investigate the effectiveness of the proposed framework, we build a new dataset for open-domain follow-up question generation and present a new set of reference-free evaluation metrics based on the Gricean Maxims. Our experiments demonstrate that our framework outperforms a GPT-based baseline in both objective evaluation and human-expert evaluation.

    An HBase-Based Optimization Model for Distributed Medical Data Storage and Retrieval

    In medical services, the amount of data generated by medical devices is increasing explosively, and the requirements placed on access to medical data are rising as well. Although HBase-based medical data storage solutions exist, they cannot meet the need for fast location of, and diversified access to, medical data. To improve retrieval speed, the recognition model S-TCR and the dynamic management algorithm SL-TCR, both based on the behavioral characteristics of access, were proposed to identify frequently accessed hot data and dynamically manage the data storage medium so as to maximize system access performance. To improve key search performance, an optimized secondary index strategy was proposed to reduce I/O overhead and speed up searches on non-primary-key indexes. Comparative experiments were conducted on real medical data sets. The experimental results show that the optimized retrieval model can meet the needs of hot data access and diversified medical data retrieval.

    Insufficient Fruit and Vegetable Intake and Low Potassium Intake Aggravate Early Renal Damage in Children: A Longitudinal Study

    Insufficient fruit and vegetable intake (FVI) and low potassium intake are associated with many non-communicable diseases, but the association with early renal damage in children is uncertain. We aimed to identify the associations of early renal damage with insufficient FVI and daily potassium intake in a general pediatric population. We conducted four waves of urine assays based on our child cohort (PROC) study from October 2018 to November 2019 in Beijing, China. We investigated FVI and other lifestyle factors via questionnaire surveys and measured urinary potassium, β2-microglobulin (β2-MG), and microalbumin (MA) excretion to assess daily potassium intake and renal damage among 1914 primary school children. The prevalence of insufficient FVI (<4/d) was 48.6% (95% CI: 46.4%, 50.9%) and the estimated potassium intake at baseline was 1.63 ± 0.48 g/d. Short sleep duration, long screen time, lower estimated potassium intake, and higher β2-MG and MA excretion were significantly more frequent in the insufficient FVI group. We fitted linear mixed effects models and observed bivariate associations of urinary β2-MG and MA excretion with insufficient FVI (β = 0.012, 95% CI: 0.005, 0.020; β = 0.717, 95% CI: 0.075, 1.359) and with estimated potassium intake (β = −0.042, 95% CI: −0.052, −0.033; β = −1.778, 95% CI: −2.600, −0.956), respectively, after adjusting for age, sex, BMI, SBP, sleep duration, screen time and physical activity. In multivariate models, we observed that urinary β2-MG excretion increased with insufficient FVI (β = 0.011, 95% CI: 0.004, 0.018) and insufficient potassium intake (<1.5 g/d) (β = 0.031, 95% CI: 0.023, 0.038), and that urinary MA excretion increased with insufficient FVI (β = 0.658, 95% CI: 0.017, 1.299) and insufficient potassium intake (β = 1.185, 95% CI: 0.492, 1.878). To interpret and validate the findings, we visualized how renal damage associated with insufficient FVI differed across quartiles of potassium intake. Insufficient FVI and low potassium intake aggravate early renal damage in children, which underscores that healthy lifestyles, especially adequate FVI, should be advocated.