Preprocessing messages posted by dentists to an Internet mailing list: a report of methods developed for a study of clinical content

Abstract

Objectives: Mining social media artifacts requires substantial processing before content analyses. In this report, we describe our procedures for preprocessing 14,576 e-mail messages sent to a mailing list of several hundred dental professionals. Our goal was to transform the messages into a format useful for natural language processing (NLP) to enable subsequent discovery of clinical topics expressed in the corpus.

Methods: Preprocessing involved message capture, database creation and import, extraction of Multipurpose Internet Mail Extensions (MIME) parts, decoding of encoded text, de-identification, and cleaning. We also developed a Web-based tool to identify signals for noisy strings and sections, and to verify the effectiveness of customized noise filters. We tailored our cleaning strategies to delete text and images that would impede NLP and in-depth content analyses. Before applying the full set of filters to each message, we determined an effective filter order.

Results: Preprocessing messages improved the effectiveness of NLP by 38%. Sources of noise included personal information in the salutation, the farewell, and the signature block; names and places mentioned in the body of the text; threads with quoted text; advertisements; embedded or attached images; spam- and virus-scanning notifications; auto text parts; e-mail addresses; and Web links. We identified 53 patterns of noise and delivered a set of de-identified and cleaned messages to the NLP analyst.

Conclusion: Preprocessing electronic messages can markedly improve subsequent NLP to enable discovery of clinical topics.

Keywords: Electronic mail; data processing; natural language processing; dental informatics
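
As an illustration of the Methods above, the following is a minimal sketch of three of the steps (MIME extraction, decoding of encoded text, and ordered noise filtering) using only the Python standard library. It is not the authors' implementation: the function names, the filter names, and the four regular expressions are hypothetical stand-ins for the 53 noise patterns reported, and de-identification of names and places in the message body is not attempted here.

```python
import re
from email import message_from_string

# Hypothetical noise filters, applied in list order. These patterns are
# illustrative assumptions, not the 53 patterns identified in the study.
# Quoted text and the signature block are removed first so that later
# filters only see the author's own text.
NOISE_FILTERS = [
    ("quoted_text", re.compile(r"^>.*$", re.MULTILINE)),
    ("signature", re.compile(r"^--[ \t]*\n.*", re.MULTILINE | re.DOTALL)),
    ("email_address", re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")),
    ("web_link", re.compile(r"https?://\S+|www\.\S+")),
]


def extract_plain_text(raw: str) -> str:
    """Return the decoded text/plain parts of a raw e-mail message."""
    msg = message_from_string(raw)
    parts = []
    for part in msg.walk():
        # Skip containers, HTML alternatives, images, and other attachments.
        if part.get_content_type() != "text/plain":
            continue
        # decode=True undoes base64 or quoted-printable transfer encoding.
        payload = part.get_payload(decode=True)
        # Assumes the declared charset is one Python recognizes.
        charset = part.get_content_charset() or "utf-8"
        parts.append(payload.decode(charset, errors="replace"))
    return "\n".join(parts)


def clean(text: str) -> str:
    """Apply each noise filter in order, then normalize whitespace."""
    for _name, pattern in NOISE_FILTERS:
        text = pattern.sub(" ", text)
    # Collapse the whitespace left behind by removed spans.
    return re.sub(r"\s+", " ", text).strip()


def preprocess(raw: str) -> str:
    """Extract, decode, and clean one message."""
    return clean(extract_plain_text(raw))
```

Keeping the filters in an ordered list makes the filter order explicit and easy to rearrange, which mirrors the step of determining an effective order before running the full set on every message.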
