The wealth of text data generated by social media has enabled new kinds of
analysis of emotions with language models. These models are often trained on
small and costly datasets of text annotations produced by readers who guess the
emotions expressed by others in social media posts. This limits the quality of
emotion identification methods, because training datasets are small and the
labels used in model development are noisy. We present LEIA, a model
for emotion identification in text that has been trained on a dataset of more
than 6 million posts with self-annotated emotion labels for happiness,
affection, sadness, anger, and fear. LEIA is based on a word masking method
that enhances the learning of emotion words during model pre-training. LEIA
achieves macro-F1 values of approximately 73 on three in-domain test datasets,
outperforming other supervised and unsupervised methods in a strong benchmark
showing that LEIA generalizes across posts, users, and time periods. We
further perform an out-of-domain evaluation on five datasets from
social media and other sources, showing LEIA's robust performance across media,
data collection methods, and annotation schemes. Our results show that LEIA
generalizes its classification of anger, happiness, and sadness beyond the
domain it was trained on. LEIA can be applied in future research to provide
better identification of emotions in text from the perspective of the writer.
The models produced for this article are publicly available at
https://huggingface.co/LEI
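
For readers unfamiliar with the macro-F1 metric reported above, the following is a minimal sketch of how it can be computed for the five emotion labels. The function name `macro_f1`, the `EMOTIONS` list, and the toy labels are illustrative assumptions and do not come from the LEIA code or data. The score is scaled to 0-100 to match the convention used in the abstract.

```python
from collections import Counter

# The five emotion classes named in the abstract.
EMOTIONS = ["happiness", "affection", "sadness", "anger", "fear"]


def macro_f1(y_true, y_pred, labels=EMOTIONS):
    """Unweighted mean of per-class F1 scores, scaled to 0-100.

    Hypothetical helper for illustration; not part of the LEIA release.
    """
    tp, fp, fn = Counter(), Counter(), Counter()
    for truth, pred in zip(y_true, y_pred):
        if truth == pred:
            tp[truth] += 1
        else:
            fp[pred] += 1   # predicted class gains a false positive
            fn[truth] += 1  # true class gains a false negative

    per_class = []
    for label in labels:
        denom = 2 * tp[label] + fp[label] + fn[label]
        # A class with no true or predicted instances scores 0 here.
        per_class.append(2 * tp[label] / denom if denom else 0.0)
    return 100.0 * sum(per_class) / len(per_class)


if __name__ == "__main__":
    # Toy labels, invented purely to exercise the function.
    truth = ["happiness", "sadness", "anger", "fear", "affection", "anger"]
    preds = ["happiness", "sadness", "anger", "sadness", "affection", "fear"]
    print(f"macro-F1: {macro_f1(truth, preds):.1f}")
```

Because every class contributes equally to the average, macro-F1 penalizes a model that does well on frequent emotions but poorly on rare ones, which is why it is a common choice for imbalanced emotion datasets.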