DOC: Deep Open Classification of Text Documents
Traditional supervised learning makes the closed-world assumption that the classes that appear in the test data must have appeared in training. This also applies to text learning or text classification. As learning is increasingly used in dynamic open environments, where some new/test documents may not belong to any of the training classes, identifying these novel documents during classification presents an important problem. This problem is called open-world classification or open classification. This paper proposes a novel deep learning based approach that dramatically outperforms existing state-of-the-art techniques.
Comment: accepted at EMNLP 2017
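The abstract does not spell out the mechanism, but a common realisation of open classification, and the direction the DOC paper builds on, is to replace the softmax output layer with per-class (1-vs-rest) sigmoids and reject a document as novel when no seen class scores above a threshold. A minimal sketch under that assumption; the function name and the hand-picked threshold are illustrative, not the paper's API:

```python
import numpy as np

def open_classify(probs: np.ndarray, threshold: float = 0.5):
    """1-vs-rest open classification: probs[i] is the sigmoid
    score for seen class i. If every score falls below the
    threshold, the document is rejected as a novel class."""
    best = int(np.argmax(probs))
    if probs[best] < threshold:
        return "REJECTED (novel class)"
    return best

# Illustrative scores for a 4-class model (made-up numbers).
print(open_classify(np.array([0.10, 0.08, 0.22, 0.15])))  # rejected as novel
print(open_classify(np.array([0.10, 0.91, 0.22, 0.15])))  # assigned class 1
```

DOC additionally tightens each per-class threshold rather than fixing one global value, but the reject-if-nothing-fires decision rule is the core idea.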
A Study of SVM Kernel Functions for Sensitivity Classification Ensembles with POS Sequences
Freedom of Information (FOI) laws legislate that government documents should be opened to the public. However, many government documents contain sensitive information, such as confidential information, that is exempt from release. Therefore, government documents must be sensitivity reviewed prior to release, to identify and close any sensitive information. With the adoption of born-digital documents, such as email, there is a need for automatic sensitivity classification to assist digital sensitivity review. SVM classifiers and Part-of-Speech sequences have separately been shown to be promising for sensitivity classification. However, sequence classification methodologies, and specifically SVM kernel functions, have not been fully investigated for sensitivity classification. Therefore, in this work, we present an evaluation of five SVM kernel functions for sensitivity classification using POS sequences. Moreover, we show that an ensemble classifier that combines POS sequence classification with text classification can significantly improve sensitivity classification effectiveness (+6.09% F2) compared with a text classification baseline, according to McNemar's significance test.
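As a concrete illustration of the kind of comparison the paper describes, the sketch below evaluates several SVM kernels on POS-tag n-gram features with scikit-learn. The documents, tags, and labels are toy placeholders, and the kernels shown are simply those scikit-learn ships with, not necessarily the paper's five:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.svm import SVC
from sklearn.model_selection import cross_val_score

# Toy stand-ins: each "document" is its POS-tag sequence, and the
# label marks whether the underlying text was judged sensitive.
pos_docs = [
    "NNP VBD NN IN NNP", "PRP VBZ JJ NNS",
    "NNP VBD IN NNP NN", "DT NN VBZ RB JJ",
] * 10
labels = [1, 0, 1, 0] * 10

# POS n-grams (here 1-3 grams over tags) as the sequence features.
vectorizer = CountVectorizer(ngram_range=(1, 3))
X = vectorizer.fit_transform(pos_docs)

# Compare kernel functions under 5-fold cross-validation.
for kernel in ["linear", "poly", "rbf", "sigmoid"]:
    scores = cross_val_score(SVC(kernel=kernel), X, labels, cv=5)
    print(f"{kernel:8s} mean accuracy: {scores.mean():.3f}")
```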
Enhancing Sensitivity Classification with Semantic Features using Word Embeddings
Government documents must be reviewed to identify any sensitive information they may contain before they can be released to the public. However, traditional paper-based sensitivity review processes are not practical for reviewing born-digital documents. Therefore, there is a timely need for automatic sensitivity classification techniques to assist the digital sensitivity review process. However, sensitivity is typically a product of the relations between combinations of terms, such as who said what about whom; therefore, automatic sensitivity classification is a difficult task. Vector representations of terms, such as word embeddings, have been shown to be effective at encoding latent term features that preserve semantic relations between terms, which can also be beneficial to sensitivity classification. In this work, we present a thorough evaluation of the effectiveness of semantic word embedding features, along with term and grammatical features, for sensitivity classification. On a test collection of government documents containing real sensitivities, we show that extending text classification with semantic features and additional term n-grams results in significant improvements in classification effectiveness, correctly classifying 9.99% more sensitive documents than the text classification baseline.
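The abstract describes extending term features with semantic embedding features. A common way to do this, and a plausible reading of the setup, is to concatenate TF-IDF n-gram vectors with each document's averaged word embedding. A minimal sketch under that assumption; the random embedding lookup and the two toy documents stand in for real pre-trained vectors and the paper's test collection:

```python
import numpy as np
from scipy.sparse import hstack, csr_matrix
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

docs = ["the minister wrote to the ambassador",
        "routine travel expenses report"]
labels = [1, 0]  # 1 = sensitive, 0 = not sensitive (toy labels)

# Term features: TF-IDF over unigrams and bigrams.
tfidf = TfidfVectorizer(ngram_range=(1, 2))
X_terms = tfidf.fit_transform(docs)

# Semantic features: mean of per-word embeddings. A random 50-d
# lookup stands in for real pre-trained vectors (e.g. word2vec).
rng = np.random.default_rng(0)
vocab = {w for d in docs for w in d.split()}
emb = {w: rng.standard_normal(50) for w in vocab}

def doc_embedding(doc):
    vecs = [emb[w] for w in doc.split() if w in emb]
    return np.mean(vecs, axis=0)

X_sem = csr_matrix(np.vstack([doc_embedding(d) for d in docs]))

# Concatenate both feature spaces and train a text classifier.
X = hstack([X_terms, X_sem])
clf = LogisticRegression().fit(X, labels)
```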
Cross Language Text Classification via Subspace Co-Regularized Multi-View Learning
In many multilingual text classification problems, the documents in different
languages often share the same set of categories. To reduce the labeling cost
of training a classification model for each individual language, it is
important to transfer the label knowledge gained from one language to another
language by conducting cross language classification. In this paper we develop
a novel subspace co-regularized multi-view learning method for cross language
text classification. This method is built on parallel corpora produced by
machine translation. It jointly minimizes the training error of each classifier
in each language while penalizing the distance between the subspace
representations of parallel documents. Our empirical study on a large set of
cross language text classification tasks shows the proposed method consistently
outperforms a number of inductive methods, domain adaptation methods, and
multi-view learning methods.
Comment: Appears in Proceedings of the 29th International Conference on Machine Learning (ICML 2012)
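In symbols, and with notation chosen here for illustration rather than taken from the paper, the joint objective described above has roughly the form:

```latex
\min_{f_1, f_2, U, V}\;
\sum_i \ell\!\left(f_1(x_i^{(1)}),\, y_i\right)
+ \sum_i \ell\!\left(f_2(x_i^{(2)}),\, y_i\right)
+ \lambda \sum_j \left\| U x_j^{(1)} - V x_j^{(2)} \right\|^2
```

Here f_1 and f_2 are the per-language classifiers, x_j^{(1)} and x_j^{(2)} are a machine-translated parallel document pair, U and V project each language into the shared subspace, and λ trades the training error off against the co-regularisation penalty on parallel representations.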
Graph Convolutional Networks for Text Classification
Text classification is an important and classical problem in natural language
processing. There have been a number of studies that applied convolutional
neural networks (convolution on regular grid, e.g., sequence) to
classification. However, only a limited number of studies have explored the
more flexible graph convolutional neural networks (convolution on non-grid,
e.g., arbitrary graph) for the task. In this work, we propose to use graph
convolutional networks for text classification. We build a single text graph
for a corpus based on word co-occurrence and document word relations, then
learn a Text Graph Convolutional Network (Text GCN) for the corpus. Our Text
GCN is initialized with one-hot representations for words and documents; it then jointly learns the embeddings for both words and documents, as supervised by
the known class labels for documents. Our experimental results on multiple
benchmark datasets demonstrate that a vanilla Text GCN without any external
word embeddings or knowledge outperforms state-of-the-art methods for text
classification. On the other hand, Text GCN also learns predictive word and
document embeddings. In addition, experimental results show that the improvement of Text GCN over state-of-the-art comparison methods becomes more prominent as we lower the percentage of training data, suggesting the robustness of Text GCN to limited training data in text classification.
Comment: Accepted by the 33rd AAAI Conference on Artificial Intelligence (AAAI 2019)
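The abstract names the two edge types in the text graph: word-word edges from co-occurrence and document-word edges. In the paper these are typically weighted by pointwise mutual information (PMI) and TF-IDF respectively; the sketch below builds such an adjacency matrix with scipy over a toy corpus, using whole documents as co-occurrence windows for brevity (the paper uses sliding windows), as an illustration rather than a faithful reimplementation:

```python
import numpy as np
from collections import Counter
from itertools import combinations
from scipy.sparse import lil_matrix
from sklearn.feature_extraction.text import TfidfVectorizer

docs = ["graph networks for text", "text classification with networks"]

# Document-word edges: TF-IDF weights.
tfidf = TfidfVectorizer()
dw = tfidf.fit_transform(docs)            # shape: (n_docs, n_words)
words = tfidf.get_feature_names_out()
n_docs, n_words = dw.shape
n = n_docs + n_words                      # one graph node per doc and word
idx = {w: i for i, w in enumerate(words)}

# Word co-occurrence counts (whole docs as windows here).
pair_counts, word_counts = Counter(), Counter()
for d in docs:
    toks = set(d.split())
    word_counts.update(toks)
    pair_counts.update(combinations(sorted(toks), 2))

# Assemble the single text graph: docs first, then words.
A = lil_matrix((n, n))
A[:n_docs, n_docs:] = dw.toarray()        # doc -> word edges (TF-IDF)
A[n_docs:, :n_docs] = dw.toarray().T      # word -> doc edges
for (w1, w2), c in pair_counts.items():
    pmi = np.log(c * len(docs) / (word_counts[w1] * word_counts[w2]))
    if pmi > 0:                           # keep only positive PMI edges
        i, j = n_docs + idx[w1], n_docs + idx[w2]
        A[i, j] = A[j, i] = pmi
A.setdiag(1.0)                            # self-loops, as in standard GCNs
```

A two-layer GCN over this adjacency, with one-hot node features and a softmax over the document nodes' known labels, then yields the Text GCN classifier the abstract describes.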
Czech Text Document Corpus v 2.0
This paper introduces "Czech Text Document Corpus v 2.0", a collection of
text documents for automatic document classification in the Czech language. It is
composed of the text documents provided by the Czech News Agency and is freely
available for research purposes at http://ctdc.kiv.zcu.cz/. This corpus was
created in order to facilitate a straightforward comparison of the document
classification approaches on Czech data. It is particularly dedicated to the evaluation of multi-label document classification approaches, because one document is usually labelled with more than one label. Besides the information
about the document classes, the corpus is also annotated at the morphological
layer. This paper further shows the results of selected state-of-the-art methods on this corpus, to allow an easy comparison with these approaches.
Comment: Accepted for LREC 2018