GLOBEM Dataset: Multi-Year Datasets for Longitudinal Human Behavior Modeling Generalization
Recent research has demonstrated the capability of behavior signals captured
by smartphones and wearables for longitudinal behavior modeling. However, there
is a lack of a comprehensive public dataset that serves as an open testbed for
fair comparison among algorithms. Moreover, prior studies mainly evaluate
algorithms using data from a single population within a short period, without
measuring the cross-dataset generalizability of these algorithms. We present
the first multi-year passive sensing datasets, containing over 700 user-years
and 497 unique users' data collected from mobile and wearable sensors, together
with a wide range of well-being metrics. Our datasets can support multiple
cross-dataset evaluations of behavior modeling algorithms' generalizability
across different users and years. As a starting point, we provide the benchmark
results of 18 algorithms on the task of depression detection. Our results
indicate that both prior depression detection algorithms and domain
generalization techniques show potential but need further research to achieve
adequate cross-dataset generalizability. We envision our multi-year datasets
can support the ML community in developing generalizable longitudinal behavior
modeling algorithms.

Comment: Thirty-sixth Conference on Neural Information Processing Systems, Datasets and Benchmarks Track.
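To make the cross-dataset evaluation protocol concrete, here is a minimal leave-one-dataset-out sketch in Python. This is an illustration only: the year list, the load_year loader, and the logistic-regression baseline are assumptions for the sketch, not the GLOBEM benchmark's actual interface or any of its 18 evaluated algorithms.

```python
# Hypothetical leave-one-dataset-out loop for measuring cross-year
# generalizability of a depression-detection model.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import balanced_accuracy_score

def load_year(year):
    """Placeholder loader: returns (behavior features, binary depression
    labels) for one data-collection year. Replace with a real loader."""
    rng = np.random.default_rng(year)
    X = rng.normal(size=(200, 32))    # e.g. weekly behavior features
    y = rng.integers(0, 2, size=200)  # binary depression labels
    return X, y

years = [2018, 2019, 2020, 2021]  # assumed data-collection years
for held_out in years:
    # Train on every other year, then test on the held-out year.
    train = [load_year(y) for y in years if y != held_out]
    X_train = np.concatenate([X for X, _ in train])
    y_train = np.concatenate([y for _, y in train])
    X_test, y_test = load_year(held_out)

    model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
    score = balanced_accuracy_score(y_test, model.predict(X_test))
    print(f"held-out year {held_out}: balanced accuracy {score:.3f}")
```

The gap between within-year and held-out-year scores in a loop like this is one way to quantify the cross-dataset generalizability the abstract refers to.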
From Classification to Clinical Insights: Towards Analyzing and Reasoning About Mobile and Behavioral Health Data With Large Language Models
Passively collected behavioral health data from ubiquitous sensors holds
significant promise to provide mental health professionals insights from
patients' daily lives; however, developing analysis tools to use this data in
clinical practice requires addressing challenges of generalization across
devices and weak or ambiguous correlations between the measured signals and an
individual's mental health. To address these challenges, we take a novel
approach that leverages large language models (LLMs) to synthesize clinically
useful insights from multi-sensor data. We develop chain-of-thought prompting
methods that use LLMs to generate reasoning about how trends in data such as
step count and sleep relate to conditions like depression and anxiety. We first
demonstrate binary depression classification with LLMs, achieving an accuracy of
61.1% that exceeds the state of the art. While this is not robust enough for clinical
use, it leads us to our key finding: even more impactful and valued than
classification is a new human-AI collaboration approach in which clinician
experts interactively query these tools and combine their domain expertise and
context about the patient with AI-generated reasoning to support clinical
decision-making. We find models like GPT-4 correctly reference numerical data
75% of the time, and clinician participants express strong interest in using
this approach to interpret self-tracking data.
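As a rough illustration of the chain-of-thought prompting the abstract describes, the sketch below formats daily step and sleep summaries into a prompt that asks a model to reason step by step before relating the trends to depression and anxiety. The prompt wording and the complete placeholder are assumptions for this sketch, not the paper's released prompts or a specific LLM API.

```python
# Hypothetical chain-of-thought prompt over multi-sensor summaries.
def build_cot_prompt(step_counts, sleep_hours):
    """Format daily sensor summaries and ask the model to reason step by
    step, citing the specific numbers it relies on."""
    data_block = "\n".join(
        f"day {i + 1}: steps={s}, sleep={h:.1f}h"
        for i, (s, h) in enumerate(zip(step_counts, sleep_hours))
    )
    return (
        "You are assisting a mental health clinician.\n"
        f"Patient's recent behavioral data:\n{data_block}\n\n"
        "Think step by step: first describe trends in activity and sleep, "
        "then explain how these trends may relate to depression or anxiety, "
        "citing the specific numbers you rely on."
    )

def complete(prompt: str) -> str:
    """Placeholder for an LLM call; wire this to a real chat-completion
    endpoint (e.g. GPT-4) in practice."""
    raise NotImplementedError

prompt = build_cot_prompt(
    step_counts=[8200, 7900, 4100, 3600, 2900, 3100, 2500],
    sleep_hours=[7.5, 7.2, 9.8, 10.4, 10.9, 11.2, 11.0],
)
print(prompt)  # inspect the prompt, then send it via complete(prompt)
```

Asking the model to cite the numbers it uses also makes its numerical references auditable, which is what the 75%-correct-reference figure in the abstract is measuring.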