Mental distress detection and triage in forum posts: the LT3 CLPsych 2016 shared task system
This paper describes the contribution of LT3 to the CLPsych 2016 Shared Task on automatic triage of mental health forum posts. Our systems use multiclass Support Vector Machines (SVMs), cascaded binary SVMs, and ensembles with a rich feature set. The best systems obtain macro-averaged F-scores of 40% on the full task and 80% on the green versus alarming distinction. Multiclass SVMs with all features score best in terms of F-score, whereas feature filtering with bi-normal separation and classifier ensembling are found to improve recall of alarming posts.
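The bi-normal separation (BNS) filtering mentioned above can be sketched from its standard definition: the gap between the inverse-normal-CDF of a feature's true-positive rate and of its false-positive rate. This is a minimal illustration, not the LT3 code, and the counts below are made up.

```python
# Bi-normal separation (BNS) feature scoring: |F^-1(tpr) - F^-1(fpr)|,
# where F^-1 is the inverse standard normal CDF. Rates are clipped away
# from 0 and 1 to avoid infinite scores.
from statistics import NormalDist

def bns_score(tp, fp, pos, neg, eps=0.0005):
    """BNS score for a feature seen in tp of pos positive docs and fp of neg negative docs."""
    inv = NormalDist().inv_cdf
    tpr = min(max(tp / pos, eps), 1 - eps)
    fpr = min(max(fp / neg, eps), 1 - eps)
    return abs(inv(tpr) - inv(fpr))

# A term in 80 of 100 alarming posts but only 5 of 900 green posts scores
# high; a term equally frequent in both classes scores zero.
informative = bns_score(tp=80, fp=5, pos=100, neg=900)
uninformative = bns_score(tp=50, fp=450, pos=100, neg=900)
```

Ranking features by such a score and keeping the top-k is the filtering step; only the scoring function is shown here.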
Depression and Self-Harm Risk Assessment in Online Forums
Users suffering from mental health conditions often turn to online resources
for support, including specialized online support communities or general
communities such as Twitter and Reddit. In this work, we present a neural
framework for supporting and studying users in both types of communities. We
propose methods for identifying posts in support communities that may indicate
a risk of self-harm, and demonstrate that our approach outperforms strong
previously proposed methods for identifying such posts. Self-harm is closely
related to depression, which makes identifying depressed users on general
forums a crucial related task. We introduce a large-scale general forum dataset
("RSDD") consisting of users with self-reported depression diagnoses matched
with control users. We show how our method can be applied to effectively
identify depressed users from their use of language alone. We demonstrate that
our method outperforms strong baselines on this general forum dataset.
Comment: Expanded version of the EMNLP17 paper. Added sections 6.1, 6.2, 6.4, a FastText baseline, and CNN-
Predicting suicide risk from online postings in Reddit: the UGent-IDLab submission to the CLPsych 2019 Shared Task A
This paper describes IDLab's text classification systems submitted to Task A as part of the CLPsych 2019 shared task. The aim of this shared task was to develop automated systems that predict the degree of suicide risk of people based on their posts on Reddit. Bag-of-words features, emotion features and post-level predictions are used to derive user-level predictions. Linear models and ensembles of these models are used to predict final scores. We find that predicting fine-grained risk levels is much more difficult than flagging potentially at-risk users. Furthermore, we do not find clear added value from building richer ensembles compared to simple baselines, given the available training data and the nature of the prediction task.
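The post-to-user aggregation step described above can be sketched as follows. This is our illustration, not the UGent-IDLab code: each user's post-level risk probabilities are summarized into fixed-length features that a user-level linear model or a simple threshold can consume.

```python
# Derive user-level features from post-level classifier probabilities.
# The choice of summary statistics (mean, max, fraction above a cutoff)
# is an assumption for illustration, not the submission's exact recipe.
from statistics import mean

def user_features(post_scores):
    """Summarize one user's post-level risk probabilities."""
    return {
        "mean": mean(post_scores),
        "max": max(post_scores),
        "frac_high": sum(s > 0.5 for s in post_scores) / len(post_scores),
    }

# A user with one high-risk post among several low-risk ones:
feats = user_features([0.1, 0.2, 0.9, 0.4])
```

The "max" feature captures the intuition that a single severe post matters even when the average post is benign.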
Triaging Content Severity in Online Mental Health Forums
Mental health forums are online communities where people express their issues
and seek help from moderators and other users. In such forums, there are often
posts with severe content indicating that the user is in acute distress and
there is a risk of attempted self-harm. Moderators need to respond to these
severe posts in a timely manner to prevent potential self-harm. However, the
large volume of daily posted content makes it difficult for the moderators to
locate and respond to these critical posts. We present a framework for triaging
user content into four severity categories which are defined based on
indications of self-harm ideation. Our models are based on a feature-rich
classification framework which includes lexical, psycholinguistic, contextual
and topic modeling features. Our approaches improve the state of the art in
triaging the content severity in mental health forums by large margins (up to
17% improvement over the F-1 scores). Using the proposed model, we analyze the
mental state of users and we show that overall, long-term users of the forum
demonstrate a decreased severity of risk over time. Our analysis on the
interaction of the moderators with the users further indicates that without an
automatic way to identify critical content, it is indeed challenging for the
moderators to provide timely responses to the users in need.
Comment: Accepted for publication in the Journal of the Association for Information Science and Technology (2017).
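The feature-rich setup above amounts to concatenating several feature groups into one vector for a classifier. The extractors below are toy stand-ins (a token count and a crude first-person-pronoun rate), not the paper's actual lexical, psycholinguistic, contextual, or topic-model features.

```python
# Toy feature-group concatenation in the spirit of a feature-rich
# classification framework. Real systems would use LIWC-style lexicons,
# topic-model proportions, and contextual features instead.
def lexical(post):
    """A single lexical feature: token count."""
    return [len(post.split())]

def psycholing(post):
    """A crude psycholinguistic proxy: first-person pronoun rate."""
    toks = post.lower().split()
    return [sum(t in {"i", "me", "my"} for t in toks) / len(toks)]

def features(post):
    # Concatenate feature groups; contextual and topic features would
    # be appended here in the same way.
    return lexical(post) + psycholing(post)

v = features("I feel like no one hears me")
```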
Empirical Evaluation of Pre-trained Transformers for Human-Level NLP: The Role of Sample Size and Dimensionality
In human-level NLP tasks, such as predicting mental health, personality, or
demographics, the number of observations is often smaller than the standard
768+ hidden state sizes of each layer within modern transformer-based language
models, limiting the ability to effectively leverage transformers. Here, we
provide a systematic study on the role of dimension reduction methods
(principal components analysis, factorization techniques, or multi-layer
auto-encoders) as well as the dimensionality of embedding vectors and sample
sizes and their effect on predictive performance. We first find that fine-tuning
large models with a limited amount of data poses a significant difficulty which
can be overcome with a pre-trained dimension reduction regime. RoBERTa
consistently achieves top performance in human-level tasks, with PCA giving
benefit over other reduction methods in better handling users who write longer
texts. Finally, we observe that a majority of the tasks achieve results
comparable to the best performance with just a fraction of the embedding
dimensions.
Comment: 2021 Annual Conference of the North American Chapter of the Association for Computational Linguistics (NAACL-HLT).
SMHD: a large-scale resource for exploring online language usage for multiple mental health conditions
Mental health is a significant and growing public health concern. As language usage can be leveraged to obtain crucial insights into mental health conditions, there is a need for large-scale, labeled, mental health-related datasets of users who have been diagnosed with one or more of such conditions. In this paper, we investigate the creation of high-precision patterns to identify self-reported diagnoses of nine different mental health conditions, and obtain high-quality labeled
data without the need for manual labelling. We introduce the SMHD (Self-reported Mental Health Diagnoses) dataset and make it available. SMHD is a novel large dataset of social media posts from users with one or multiple mental health conditions along with matched control users. We examine distinctions in users’ language, as measured by linguistic and psychological variables. We further explore text classification methods to identify individuals with mental conditions
through their language.
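One illustrative high-precision pattern of the kind described above, matching explicit self-reports of a diagnosis. The regex and the condition list are hypothetical examples for illustration, not SMHD's actual patterns.

```python
# Match first-person diagnosis statements like "I was diagnosed with X".
# Requiring the first-person subject is what keeps precision high:
# third-party mentions ("my friend was diagnosed...") do not match.
import re

DIAG = re.compile(
    r"\bI (?:was|have been|am) diagnosed with (depression|anxiety|ptsd)\b",
    re.IGNORECASE,
)

m1 = DIAG.search("Last year I was diagnosed with depression.")
m2 = DIAG.search("My friend was diagnosed with depression.")  # not a self-report
```

In practice such patterns would be paired with exclusion rules (negation, quotations, hypotheticals) before labeling a user.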