Tick parasitism classification from noisy medical records

Abstract

Much of the health information in the medical domain comes in the form of clinical narratives. The rich semantic information contained in these notes can be modeled to make inferences that assist medical practitioners in decision making, which is particularly important under time and resource constraints. However, building such assistive tools is difficult because misspellings, unsegmented words, and morphologically complex or rare medical terms are ubiquitous in these notes. These irregularities reduce the vocabulary coverage of commonly used pretrained distributed word representations, which are passed as input to the parametric models that make such predictions. This paper presents an ensemble architecture that combines in-domain and general word embeddings to overcome these challenges, and that achieves the best performance on a binary classification task among the baselines compared. We demonstrate our approach in the veterinary domain, on the task of identifying tick parasitism in small animals. The best model reaches 84.29% test accuracy, a modest improvement over models that use only pretrained embeddings not specifically trained for the medical sub-domain of interest.
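To make the coverage problem concrete, the following is a minimal sketch, not the architecture evaluated in this paper. It assumes two toy embedding tables (`general` and `in_domain`, with made-up vectors and dimensions) and a hypothetical four-token note, and it uses zero-filled concatenation as one simple way of combining the two sources. All names and values are illustrative assumptions.

```python
import numpy as np

# Toy embedding tables with made-up values (assumptions for illustration only).
general = {
    "dog": np.array([0.1, 0.2]),
    "tick": np.array([0.3, 0.1]),
    "found": np.array([0.0, 0.5]),
}
in_domain = {
    "tick": np.array([0.9, 0.4, 0.2]),
    "ixodes": np.array([0.8, 0.7, 0.1]),
    "tik": np.array([0.85, 0.35, 0.2]),  # a misspelling seen in the in-domain notes
}


def coverage(tokens, table):
    """Return the fraction of tokens that have a vector in the table."""
    if not tokens:
        return 0.0
    return sum(t in table for t in tokens) / len(tokens)


def combined_vector(token, general, in_domain, g_dim, d_dim):
    """Concatenate the general and in-domain vectors, zero-filling a missing one."""
    g = general.get(token, np.zeros(g_dim))
    d = in_domain.get(token, np.zeros(d_dim))
    return np.concatenate([g, d])


# Hypothetical tokenized clinical note containing a misspelling and a rare term.
note = "dog found tik ixodes".split()

cov_general = coverage(note, general)
cov_in_domain = coverage(note, in_domain)
cov_union = coverage(note, {**general, **in_domain})

# One row per token, ready to feed to a downstream classifier.
features = np.stack(
    [combined_vector(t, general, in_domain, g_dim=2, d_dim=3) for t in note]
)
```

In this toy setting each table alone covers half of the tokens, while the union covers all of them, which is the intuition behind combining the two embedding sources.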
