Comparison of existing methods for algorithmic classification of dementia in the Health and Retirement Study

Abstract

Background: Dementia ascertainment is difficult and costly, hindering the use of large, representative studies such as the Health and Retirement Study (HRS) to monitor trends or disparities in dementia. To address this issue, multiple groups of researchers have developed algorithms to classify dementia status in HRS participants using data from HRS and the Aging, Demographics, and Memory Study (ADAMS), an HRS sub-study that systematically ascertained dementia status. However, the relative performance of these algorithms has not been systematically evaluated.

Objective: To compare the performance of five existing algorithms, overall and by sociodemographic subgroup.

Methods: We created two standardized datasets: (a) training data (N=786; ADAMS Wave A and corresponding HRS data, which were used previously to create the algorithms) and (b) validation data (N=530; ADAMS Waves B, C, and D and corresponding HRS data, which were not used previously to create the algorithms). In both, we used each algorithm to classify HRS participants as demented or not demented and compared the algorithmic diagnoses to the ADAMS diagnoses.

Results: In the training data, overall classification accuracy ranged from 80% to 87%, sensitivity from 53% to 90%, and specificity from 79% to 96% across the five algorithms. Although overall classification accuracy was similar in the validation data (range: 79% to 88%), sensitivity was much lower (range: 17% to 61%) and specificity higher (range: 82% to 98%) than in the training data. Classification accuracy was generally worse in non-Hispanic blacks (range: 68% to 85%) and Hispanics (range: 65% to 88%) than in non-Hispanic whites (range: 79% to 88%). Across datasets, sensitivity was generally higher for proxy respondents, while specificity (and overall accuracy) was higher for self-respondents.

Conclusions: The worse sensitivity in the validation dataset may indicate overfitting, or that the algorithms are better at identifying prevalent than incident dementia; differences in performance across algorithms suggest that the usefulness of each will vary with the user's purpose. Further planned work will evaluate algorithm performance in external validation datasets.
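
To make the reported metrics concrete, the following Python sketch shows one way to compute overall accuracy, sensitivity, and specificity by comparing binary algorithmic classifications against the ADAMS reference diagnoses. This is an illustrative example, not the authors' code; the function and variable names are hypothetical.

    def classification_metrics(algorithm_dx, adams_dx):
        """Compare algorithmic dementia classifications to reference diagnoses.

        algorithm_dx, adams_dx: equal-length sequences of 0/1 labels,
        where 1 = demented and 0 = not demented.
        Returns (accuracy, sensitivity, specificity).
        """
        # Tally the four cells of the 2x2 confusion matrix.
        tp = sum(1 for a, d in zip(algorithm_dx, adams_dx) if a == 1 and d == 1)
        tn = sum(1 for a, d in zip(algorithm_dx, adams_dx) if a == 0 and d == 0)
        fp = sum(1 for a, d in zip(algorithm_dx, adams_dx) if a == 1 and d == 0)
        fn = sum(1 for a, d in zip(algorithm_dx, adams_dx) if a == 0 and d == 1)

        accuracy = (tp + tn) / (tp + tn + fp + fn)                    # overall agreement
        sensitivity = tp / (tp + fn) if (tp + fn) else float("nan")   # true-positive rate among demented
        specificity = tn / (tn + fp) if (tn + fp) else float("nan")   # true-negative rate among non-demented
        return accuracy, sensitivity, specificity

Computing these metrics separately within sociodemographic strata (e.g., by race/ethnicity or by proxy versus self-respondent status) yields the subgroup comparisons described in the Results.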