Weakly-Supervised Neural Text Classification
Deep neural networks are gaining popularity for the classic text
classification task, owing to their strong expressive power and reduced need
for feature engineering. Despite these advantages, neural text
classification models suffer from the lack of training data in many real-world
applications. Although many semi-supervised and weakly-supervised text
classification models exist, they cannot be easily applied to deep neural
models and support only limited types of supervision. In this paper, we
propose a weakly-supervised method that addresses the lack of training data in
neural text classification. Our method consists of two modules: (1) a
pseudo-document generator that leverages seed information to generate
pseudo-labeled documents for model pre-training, and (2) a self-training module
that bootstraps on real unlabeled data for model refinement. Our method has the
flexibility to handle different types of weak supervision and can be easily
integrated into existing deep neural models for text classification. We have
performed extensive experiments on three real-world datasets from different
domains. The results demonstrate that our proposed method achieves encouraging
performance without requiring excessive training data and significantly
outperforms baseline methods.
Comment: CIKM 2018 Full Paper
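The self-training module in the abstract above refines a model on its own predictions for unlabeled documents. The following is a minimal sketch of one common way to build such self-training targets, by squaring the current class probabilities and renormalizing; it is an illustration under that assumption, not necessarily the authors' exact procedure. The function name `sharpen_targets` and the toy inputs are hypothetical.

```python
def sharpen_targets(probs):
    """Turn soft predictions into sharper self-training targets.

    probs: list of rows, one per unlabeled document; each row holds the
    model's current class probabilities for that document.
    Assumes every class receives some probability mass overall.
    """
    n_classes = len(probs[0])
    # Soft class frequencies: total probability mass assigned to each class.
    class_mass = [sum(row[j] for row in probs) for j in range(n_classes)]
    targets = []
    for row in probs:
        # Squaring favors confident predictions; dividing by the class mass
        # keeps frequently predicted classes from dominating.
        weights = [row[j] ** 2 / class_mass[j] for j in range(n_classes)]
        total = sum(weights)
        targets.append([w / total for w in weights])
    return targets


# Toy predictions for two unlabeled documents and two classes.
predictions = [[0.6, 0.4], [0.5, 0.5]]
targets = sharpen_targets(predictions)
```

In a full loop, the model would be trained toward `targets`, new predictions would be computed, and the process would repeat until the predicted labels stop changing.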
Efficient Path Prediction for Semi-Supervised and Weakly Supervised Hierarchical Text Classification
Hierarchical text classification has many real-world applications. However,
labeling a large number of documents is costly. In practice, we can use
semi-supervised learning or weakly supervised learning (e.g., dataless
classification) to reduce the labeling cost. In this paper, we propose a path
cost-sensitive learning algorithm to utilize the structural information and
further make use of unlabeled and weakly-labeled data. We use a generative
model to leverage the large amount of unlabeled data and introduce path
constraints into the learning algorithm to incorporate the structural
information of the class hierarchy. The posterior probabilities of both
unlabeled and weakly labeled data can be incorporated with path-dependent
scores. Because we add a structure-sensitive cost to the learning algorithm to
keep the classification consistent with the class hierarchy, and do not
need to reconstruct the feature vectors for different structures, we can
significantly reduce the computational cost compared to structural output
learning. Experimental results on two hierarchical text classification
benchmarks show that our approach is both effective and efficient in
handling semi-supervised and weakly supervised hierarchical text
classification.
Comment: Accepted by 2019 World Wide Web Conference (WWW19)
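To make the idea of a path-dependent score concrete, the sketch below scores each leaf class by summing log-probabilities along its root-to-leaf path, so the chosen leaf is always consistent with the hierarchy. This is a simplified illustration, not the paper's learning algorithm; the hierarchy, the class names, and the probabilities are hypothetical.

```python
import math

# Hypothetical toy hierarchy: child -> parent (top-level classes map to None).
PARENT = {
    "science": None,
    "sports": None,
    "physics": "science",
    "biology": "science",
    "tennis": "sports",
}


def path_to_root(node, parent=PARENT):
    """Return the list of classes from the top level down to `node`."""
    path = []
    while node is not None:
        path.append(node)
        node = parent[node]
    return path[::-1]


def path_score(leaf, node_prob, parent=PARENT):
    """Sum of log-probabilities along the path; probabilities must be > 0."""
    return sum(math.log(node_prob[n]) for n in path_to_root(leaf, parent))


def best_leaf(leaves, node_prob):
    """Pick the leaf whose whole path is most probable."""
    return max(leaves, key=lambda leaf: path_score(leaf, node_prob))


# Assumed conditional probability of each class given its parent.
node_prob = {"science": 0.7, "sports": 0.3,
             "physics": 0.8, "biology": 0.2, "tennis": 1.0}
prediction = best_leaf(["physics", "biology", "tennis"], node_prob)
```

Here `tennis` has the highest local probability, but `physics` wins because its full path (0.7 × 0.8) outweighs the path through `sports` (0.3 × 1.0).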
The challenges of German archival document categorization on insufficient labeled data
Document exploration in archives is often challenging due to the lack of organization into topic-based categories. Moreover, archival records provide only short texts, which are often insufficient for capturing their semantics. This paper proposes and explores a dataless categorization approach that utilizes word embeddings and TF-IDF to categorize archival documents. Additionally, it introduces a visual approach built on top of the word embeddings to enhance the exploration of data. Preliminary results suggest that current vector representations alone do not provide enough external knowledge to solve this task.
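A dataless categorizer of the kind described above can be sketched as follows: each document becomes a TF-IDF-weighted sum of word vectors and is assigned to the category whose label vector is closest by cosine similarity. This is a minimal sketch under stated assumptions, not the paper's implementation; the two-dimensional `EMBEDDINGS` table, the English tokens, and the category labels are hypothetical stand-ins for pretrained embeddings and real archival records.

```python
import math
from collections import Counter

# Hypothetical 2-D word vectors; a real system would load pretrained embeddings.
EMBEDDINGS = {
    "church": (0.9, 0.1), "parish": (0.8, 0.2),
    "tax": (0.1, 0.9), "ledger": (0.2, 0.8),
    "religion": (1.0, 0.0), "finance": (0.0, 1.0),
}


def idf(docs):
    """Smoothed inverse document frequency for every word in the corpus."""
    n = len(docs)
    df = Counter(word for doc in docs for word in set(doc))
    return {word: math.log(n / df[word]) + 1.0 for word in df}


def doc_vector(doc, idf_weights):
    """TF-IDF-weighted sum of the vectors of the words that have embeddings."""
    tf = Counter(word for word in doc if word in EMBEDDINGS)
    vec = [0.0, 0.0]
    for word, count in tf.items():
        weight = count * idf_weights.get(word, 1.0)
        vec[0] += weight * EMBEDDINGS[word][0]
        vec[1] += weight * EMBEDDINGS[word][1]
    return vec


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.hypot(a[0], a[1]) * math.hypot(b[0], b[1])
    return dot / norm if norm else 0.0


def categorize(doc, labels, idf_weights):
    """Assign the category whose label vector is closest to the document."""
    vec = doc_vector(doc, idf_weights)
    return max(labels, key=lambda label: cosine(vec, EMBEDDINGS[label]))


docs = [["church", "parish", "register"], ["tax", "ledger"]]
weights = idf(docs)
assigned = [categorize(doc, ["religion", "finance"], weights) for doc in docs]
```

Words without an embedding (such as `register` here) are skipped, which illustrates why very short records can leave too little signal for this kind of approach.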
HiGitClass: Keyword-Driven Hierarchical Classification of GitHub Repositories
GitHub has become an important platform for code sharing and scientific
exchange. With the massive number of repositories available, there is a
pressing need for topic-based search. Even though the topic label functionality
has been introduced, the majority of GitHub repositories do not have any
labels, impeding the utility of search and topic-based analysis. This work
targets the automatic repository classification problem as keyword-driven
hierarchical classification. Specifically, users only need to provide a label
hierarchy with keywords as supervision. This setting is flexible,
adaptive to the users' needs, accounts for the different granularity of topic
labels and requires minimal human effort. We identify three key challenges of
this problem, namely (1) the presence of multi-modal signals; (2) supervision
scarcity and bias; (3) supervision format mismatch. To address these
challenges, we propose the HiGitClass framework, comprising three modules:
heterogeneous information network embedding; keyword enrichment; and topic modeling
and pseudo document generation. Experimental results on two GitHub repository
collections confirm that HiGitClass is superior to existing weakly-supervised
and dataless hierarchical classification methods, especially in its ability to
integrate both structured and unstructured data for repository classification.
Comment: 10 pages; Accepted to ICDM 2019; Some typos fixed