Many companies face growing data archives, which has led to an increasing focus on the automated classification of documents in corporate processes. Because of data protection guidelines, developing such systems with clear (unredacted) data is often difficult. One way to overcome this difficulty is to desensitize documents through redaction. This study therefore examines the impact of redaction on the document classification performance of a deep CNN model. It analyzes how classification performance deteriorates when the model is trained on unredacted documents and evaluated on redacted data (unredacted model), or trained on redacted data and applied to unredacted documents (redacted model). For the former condition, accuracy dropped by 2.56 percentage points; for the latter, by 2.08 percentage points. We also show that the loss in performance differed greatly between document classes and was influenced by each class's proportion of redacted area (unredacted model: r=0.31; redacted model: r=0.87). For the redacted model, the decrease in classification accuracy was additionally affected by the intra-class variability of the redacted area (r=0.74). From these results, recommendations for dealing with redacted data in document classification systems are derived.
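The per-class analysis summarized above can be illustrated with a minimal sketch. It assumes that r denotes the Pearson correlation coefficient and that it is computed across document classes between the proportion of redacted area and the loss in accuracy; the abstract does not state either detail. The function names and all numeric values below are hypothetical placeholders, not results from the study.

```python
from math import sqrt


def accuracy_drop_pp(acc_matched, acc_mismatched):
    """Loss in accuracy, in percentage points, between two accuracies given as fractions."""
    return (acc_matched - acc_mismatched) * 100.0


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equally long sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / sqrt(var_x * var_y)


# Hypothetical per-class values, for illustration only (NOT data from the study).
# Each tuple: (share of page area redacted, accuracy when train and test
# conditions match, accuracy when they differ).
toy_classes = {
    "class_a": (0.05, 0.93, 0.92),
    "class_b": (0.20, 0.90, 0.87),
    "class_c": (0.35, 0.88, 0.82),
    "class_d": (0.10, 0.95, 0.94),
}

redacted_share = [v[0] for v in toy_classes.values()]
drops = [accuracy_drop_pp(v[1], v[2]) for v in toy_classes.values()]

print(f"r = {pearson_r(redacted_share, drops):.2f}")
```

A value of r close to 1 would indicate that classes with a larger redacted area tend to lose more accuracy, which is the kind of relationship the reported coefficients describe.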