Evaluating NLG Evaluation Metrics: A Measurement Theory Perspective
We address the fundamental challenge in Natural Language Generation (NLG)
model evaluation: the design and validation of evaluation metrics. Recognizing
the limitations of existing metrics and issues with human judgment, we propose
using measurement theory, the foundation of test design, as a framework for
conceptualizing and evaluating the validity and reliability of NLG evaluation
metrics. This approach offers a systematic method for defining "good" metrics,
developing robust metrics, and assessing metric performance. In this paper, we
introduce core concepts in measurement theory in the context of NLG evaluation
and key methods to evaluate the performance of NLG metrics. Through this
framework, we aim to promote the design, evaluation, and interpretation of
valid and reliable metrics, ultimately contributing to the advancement of
robust and effective NLG models in real-world settings.
Supporting Qualitative Analysis with Large Language Models: Combining Codebook with GPT-3 for Deductive Coding
Qualitative analysis of textual contents unpacks rich and valuable
information by assigning labels to the data. However, this process is often
labor-intensive, particularly when working with large datasets. While recent
AI-based tools demonstrate utility, researchers may lack readily available
AI resources and expertise, and they are further constrained by the limited
generalizability of those task-specific models. In this study, we explored the
use of large language models (LLMs) in supporting deductive coding, a major
category of qualitative analysis where researchers use pre-determined codebooks
to label the data into a fixed set of codes. Instead of training task-specific
models, a pre-trained LLM could be used directly for various tasks without
fine-tuning through prompt learning. Using a curiosity-driven questions coding
task as a case study, we found that, by combining GPT-3 with expert-drafted
codebooks, our proposed approach achieved fair to substantial agreement with
expert-coded results. We lay out challenges and opportunities in using LLMs to
support qualitative coding and beyond.
Comment: 28th International Conference on Intelligent User Interfaces (IUI '23 Companion), March 27–31, 2023, Sydney, NSW, Australia
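The abstract reports "fair to substantial" agreement with expert-coded results but does not name the agreement statistic. That wording matches the conventional Landis and Koch bands for Cohen's kappa (fair 0.21-0.40, moderate 0.41-0.60, substantial 0.61-0.80), so the sketch below assumes kappa is the measure. The function name and the example labels are hypothetical and are not taken from the study.

```python
from collections import Counter


def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two raters who labelled the same items.

    Assumption: kappa is the agreement measure; the abstract does not say so.
    """
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("both raters must label the same non-empty item set")
    n = len(labels_a)

    # Observed agreement: share of items given the same code by both raters.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n

    # Chance agreement: for each code, the product of the two raters'
    # marginal frequencies, summed over codes.
    count_a = Counter(labels_a)
    count_b = Counter(labels_b)
    expected = sum(count_a[c] * count_b[c] for c in count_a) / (n * n)

    if expected == 1.0:  # both raters used one identical code throughout
        return 1.0
    return (observed - expected) / (1.0 - expected)


# Hypothetical codes for six questions (illustrative only).
expert = ["why", "how", "what", "why", "how", "what"]
model = ["why", "how", "what", "how", "how", "why"]
kappa = cohen_kappa(expert, model)
print(f"kappa = {kappa:.2f}")
```

With these made-up labels the two raters agree on four of six items, and the resulting kappa falls in the "moderate" band.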
ByteSized32: A Corpus and Challenge Task for Generating Task-Specific World Models Expressed as Text Games
In this work we examine the ability of language models to generate explicit
world models of scientific and common-sense reasoning tasks by framing this as
a problem of generating text-based games. To support this, we introduce
ByteSized32, a corpus of 32 highly-templated text games written in Python
totaling 24k lines of code, each centered around a particular task, and paired
with a set of 16 unseen text game specifications for evaluation. We propose a
suite of automatic and manual metrics for assessing simulation validity,
compliance with task specifications, playability, winnability, and alignment
with the physical world. In a single-shot evaluation of GPT-4 on this
simulation-as-code-generation task, we find it capable of producing runnable
games in 27% of cases, highlighting the difficulty of this challenge task. We
discuss areas of future improvement, including GPT-4's apparent capacity to
perform well at simulating near canonical task solutions, with performance
dropping off as simulations include distractors or deviate from canonical
solutions in the action space.
Comment: 10 pages
Prevalence of insomnia symptoms and their associated factors in patients treated in outpatient clinics of four general hospitals in Guangzhou, China
Background: Data on the prevalence of insomnia symptoms in medical outpatient clinics in China are lacking. This study examined the prevalence of insomnia symptoms and their socio-demographic correlates in patients treated at medical outpatient clinics affiliated with four general hospitals in Guangzhou, a large metropolis in southern China.
Method: A total of 4399 patients were consecutively invited to participate in the study. Data on insomnia and its socio-demographic correlates were collected with standardized questionnaires.
Results: The prevalence of any type of insomnia symptoms was 22.1% (95% confidence interval (CI): 20.9–23.3%); the prevalence of difficulty initiating sleep was 14.3%, difficulty maintaining sleep was 16.2%, and early morning awakening was 12.4%. Only 17.5% of the patients suffering from insomnia received sleeping pills. Multiple logistic regression analysis revealed that male gender, education level, rural residence, and being unemployed or retired were negatively associated with insomnia symptoms, while lacking health insurance, older age and more severe depressive symptoms were positively associated with insomnia symptoms.
Conclusions: Insomnia symptoms are common in patients attending medical outpatient clinics in Guangzhou. Increasing awareness of sleep hygiene measures, regular screening and psychosocial and pharmacological interventions for insomnia are needed in China.
Trial registration: ChiCTR-INR-16008066. Registered 8 March 2016
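The reported interval for the overall prevalence can be checked with a short calculation. This is a minimal sketch that assumes a normal-approximation (Wald) interval and assumes that all 4399 invited patients form the denominator; the abstract states neither the interval method nor the number of respondents, and the function name is hypothetical.

```python
import math


def wald_interval(p_hat, n, z=1.96):
    """Normal-approximation (Wald) confidence interval for a proportion.

    Assumption: the study may have used a different interval method.
    """
    standard_error = math.sqrt(p_hat * (1.0 - p_hat) / n)
    return p_hat - z * standard_error, p_hat + z * standard_error


# 22.1% prevalence; n = 4399 is assumed to be the analysed sample size.
low, high = wald_interval(0.221, 4399)
print(f"95% CI: {low:.1%} to {high:.1%}")
```

Under these assumptions the result rounds to 20.9% and 23.3%, which agrees with the interval given in the Results.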
If I Hear You Correctly: Building and Evaluating Interview Chatbots with Active Listening Skills
Interview chatbots engage users in a text-based conversation to draw out
their views and opinions. It is, however, challenging to build effective
interview chatbots that can handle user free-text responses to open-ended
questions and deliver engaging user experience. As the first step, we are
investigating the feasibility and effectiveness of using publicly available,
practical AI technologies to build effective interview chatbots. To demonstrate
feasibility, we built a prototype scoped to enable interview chatbots with a
subset of active listening skills - the abilities to comprehend a user's input
and respond properly. To evaluate the effectiveness of our prototype, we
compared the performance of interview chatbots with or without active listening
skills on four common interview topics in a live evaluation with 206 users. Our
work presents practical design implications for building effective interview
chatbots, hybrid chatbot platforms, and empathetic chatbots beyond interview
tasks.
Comment: Working draft. To appear in the ACM CHI Conference on Human Factors in Computing Systems (CHI 2020)
The Ninth Visual Object Tracking VOT2021 Challenge Results
Accepted version. Peer reviewed.
What should I Ask: A Knowledge-driven Approach for Follow-up Questions Generation in Conversational Surveys
Conversational surveys, where an agent asks open-ended questions through
natural language interfaces, offer a new way to collect information from
people. A good follow-up question in a conversational survey prompts
high-quality information and delivers engaging experiences. However, generating
high-quality follow-up questions on the fly is a non-trivial task. The agent
needs to understand the diverse and complex participant responses, adhere to
the survey goal, and generate clear and coherent questions. In this study, we
propose a knowledge-driven follow-up question generation framework. The
framework combines a knowledge selection module to identify salient topics in
participants' responses and a generative model guided by selected knowledge
entity-relation pairs. To investigate the effectiveness of the proposed
framework, we build a new dataset for open-domain follow-up question generation
and present a new set of reference-free evaluation metrics based on Gricean
Maxims. Our experiments demonstrate that our framework outperforms a GPT-based
baseline in both objective evaluation and human-expert evaluation.
An HBase-Based Optimization Model for Distributed Medical Data Storage and Retrieval
In medical services, the amount of data generated by medical devices is growing explosively, and the requirements for access to medical data are rising with it. Although HBase-based medical data storage solutions exist, they cannot meet the need for fast location of and diversified access to medical data. To improve retrieval speed, the recognition model S-TCR and the dynamic management algorithm SL-TCR, both based on access behavior characteristics, were proposed to identify frequently accessed hot data and to dynamically manage the data storage medium so as to maximize system access performance. To improve key search performance, an optimized secondary index strategy was proposed to reduce I/O overhead and optimize the search performance of non-primary key indexes. Comparative experiments were conducted on real medical data sets. The experimental results show that the optimized retrieval model can meet the needs of hot data access and diversified medical data retrieval.
Insufficient Fruit and Vegetable Intake and Low Potassium Intake Aggravate Early Renal Damage in Children: A Longitudinal Study
Insufficient fruit and vegetable intake (FVI) and low potassium intake are associated with many non-communicable diseases, but the association with early renal damage in children is uncertain. We aimed to identify the associations of early renal damage with insufficient FVI and daily potassium intake in a general pediatric population. We conducted four waves of urine assays based on our child cohort (PROC) study from October 2018 to November 2019 in Beijing, China. We investigated FVI and other lifestyle factors via questionnaire surveys and measured urinary potassium, β2-microglobulin (β2-MG), and microalbumin (MA) excretion to assess daily potassium intake and renal damage among 1914 primary school children. The prevalence of insufficient FVI (<4/d) was 48.6% (95% CI: 46.4%, 50.9%) and the estimated potassium intake at baseline was 1.63 ± 0.48 g/d. Short sleep duration, long screen time, lower estimated potassium intake, and higher β2-MG and MA excretion were significantly more frequent in the insufficient FVI group. We generated linear mixed effects models and observed the bivariate associations of urinary β2-MG and MA excretion with insufficient FVI (β = 0.012, 95% CI: 0.005, 0.020; β = 0.717, 95% CI: 0.075, 1.359), and estimated potassium intake (β = −0.042, 95% CI: −0.052, −0.033; β = −1.778, 95% CI: −2.600, −0.956), respectively, after adjusting for age, sex, BMI, SBP, sleep duration, screen time and physical activity. In multivariate models, we observed that urinary β2-MG excretion increased with insufficient FVI (β = 0.011, 95% CI: 0.004, 0.018) and insufficient potassium intake (<1.5 g/d) (β = 0.031, 95% CI: 0.023, 0.038), and urinary MA excretion increased with insufficient FVI (β = 0.658, 95% CI: 0.017, 1.299) and insufficient potassium intake (β = 1.185, 95% CI: 0.492, 1.878). We visualized different quartiles of potassium intake showing different renal damage with insufficient FVI for interpretation and validation of the findings.
Insufficient FVI and low potassium intake aggravate early renal damage in children, underscoring that healthy lifestyles, especially adequate FVI, should be advocated.