Enhancing Sensitivity Classification with Semantic Features using Word Embeddings

Authors: B Fung
C Cortes
F Sebastiani
G McDonald
J Mitchell
Q McNemar
R Collobert
T Joachims
ZS Harris
Publication date: 1 January 2017
Publisher: Springer Science and Business Media LLC

Abstract

Government documents must be reviewed to identify any sensitive information they may contain before they can be released to the public. However, traditional paper-based sensitivity review processes are not practical for reviewing born-digital documents. There is therefore a timely need for automatic sensitivity classification techniques to assist the digital sensitivity review process. Automatic sensitivity classification is a difficult task because sensitivity is typically a product of the relations between combinations of terms, such as who said what about whom. Vector representations of terms, such as word embeddings, have been shown to be effective at encoding latent term features that preserve semantic relations between terms, which can also benefit sensitivity classification. In this work, we present a thorough evaluation of the effectiveness of semantic word embedding features, along with term and grammatical features, for sensitivity classification. On a test collection of government documents containing real sensitivities, we show that extending text classification with semantic features and additional term n-grams results in significant improvements in classification effectiveness, correctly classifying 9.99% more sensitive documents than the text classification baseline.
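
The feature combination described in the abstract (term n-grams extended with semantic word embedding features) can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not state how embeddings are aggregated or which classifier is used, so averaging the word vectors of a document is an assumption made here for illustration. The embedding table `TOY_EMBEDDINGS`, its values, and all function names are hypothetical.

```python
from collections import Counter
from typing import Dict, List, Sequence

# Hypothetical two-dimensional toy embeddings, invented for this sketch.
# A real system would load pretrained word vectors instead.
TOY_EMBEDDINGS: Dict[str, List[float]] = {
    "minister": [0.9, 0.1],
    "informant": [0.8, 0.3],
    "criticised": [0.2, 0.9],
    "weather": [0.0, 0.1],
}


def ngrams(tokens: Sequence[str], n: int) -> List[str]:
    """Return all contiguous n-grams of the token sequence as strings."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def term_features(tokens: Sequence[str], vocabulary: Sequence[str],
                  max_n: int = 2) -> List[float]:
    """Count how often each vocabulary n-gram (up to max_n) occurs."""
    counts: Counter = Counter()
    for n in range(1, max_n + 1):
        counts.update(ngrams(tokens, n))
    return [float(counts[term]) for term in vocabulary]


def embedding_features(tokens: Sequence[str],
                       embeddings: Dict[str, List[float]],
                       dim: int) -> List[float]:
    """Average the vectors of known tokens (assumed aggregation strategy)."""
    vectors = [embeddings[t] for t in tokens if t in embeddings]
    if not vectors:
        return [0.0] * dim  # no known word: fall back to a zero vector
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]


def document_features(text: str, vocabulary: Sequence[str],
                      embeddings: Dict[str, List[float]],
                      dim: int) -> List[float]:
    """Concatenate term n-gram counts with averaged embedding features."""
    tokens = text.lower().split()
    return (term_features(tokens, vocabulary)
            + embedding_features(tokens, embeddings, dim))


if __name__ == "__main__":
    vocab = ["minister", "minister criticised"]
    print(document_features("Minister criticised informant",
                            vocab, TOY_EMBEDDINGS, dim=2))
```

The resulting vector could be passed to any standard text classifier. Concatenation keeps the term features and the semantic features in separate dimensions, so a linear model can weight them independently.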

Available Versions

Crossref

info:doi/10.1007%2F978-3-319-5...

Last time updated on 05/06/2019

Enlighten

oai:eprints.gla.ac.uk:135030

Last time updated on 01/02/2017

Enlighten: Publications

oai:eprints.gla.ac.uk:135030

Last time updated on 09/04/2020