Government documents must be reviewed to identify any sensitive information
they may contain, before they can be released to the public. However,
traditional paper-based sensitivity review processes are not practical for reviewing
born-digital documents. Therefore, there is a timely need for automatic sensitivity
classification techniques, to assist the digital sensitivity review process.
However, sensitivity is typically a product of the relations between combinations
of terms, such as who said what about whom, therefore, automatic sensitivity
classification is a difficult task. Vector representations of terms, such as word
embeddings, have been shown to be effective at encoding latent term features
that preserve semantic relations between terms, which can also be beneficial to
sensitivity classification. In this work, we present a thorough evaluation of the
effectiveness of semantic word embedding features, along with term and grammatical
features, for sensitivity classification. On a test collection of government
documents containing real sensitivities, we show that extending text classification
with semantic features and additional term n-grams results in significant improvements
in classification effectiveness, correctly classifying 9.99% more sensitive
documents compared to the text classification baseline