Mental health is a significant and growing public health concern. As language
usage can be leveraged to obtain crucial insights into mental health
conditions, there is a need for large-scale, labeled, mental health-related
datasets of users who have been diagnosed with one or more such conditions.
In this paper, we investigate the creation of high-precision patterns to
identify self-reported diagnoses of nine different mental health conditions,
and obtain high-quality labeled data without the need for manual labeling. We
introduce the SMHD (Self-reported Mental Health Diagnoses) dataset and make it
available. SMHD is a novel, large dataset of social media posts from users with
one or more mental health conditions, along with matched control users. We
examine distinctions in users' language, as measured by linguistic and
psychological variables. We further explore text classification methods to
identify individuals with mental conditions through their language.

Comment: COLING 201