PTE: Predictive Text Embedding through Large-scale Heterogeneous Text Networks
Unsupervised text embedding methods, such as Skip-gram and Paragraph Vector,
have been attracting increasing attention due to their simplicity, scalability,
and effectiveness. However, compared to sophisticated deep learning
architectures such as convolutional neural networks, these methods usually
yield inferior results when applied to particular machine learning tasks. One
possible reason is that these text embedding methods learn the representation
of text in a fully unsupervised way, without leveraging the labeled information
available for the task. Although the low dimensional representations learned
are applicable to many different tasks, they are not particularly tuned for any
task. In this paper, we fill this gap by proposing a semi-supervised
representation learning method for text data, which we call the
predictive text embedding (PTE). Predictive text embedding utilizes
both labeled and unlabeled data to learn the embedding of text. The labeled
information and different levels of word co-occurrence information are first
represented as a large-scale heterogeneous text network, which is then embedded
into a low dimensional space through a principled and efficient algorithm. This
low dimensional embedding not only preserves the semantic closeness of words
and documents, but also has a strong predictive power for the particular task.
Compared to recent supervised approaches based on convolutional neural
networks, predictive text embedding achieves comparable or better effectiveness,
is much more efficient, and has fewer parameters to tune.
Comment: KDD 2015
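To make the network construction concrete, here is a minimal Python sketch of the three bipartite networks PTE embeds jointly (word-word, word-document, word-label). The function name build_pte_networks and the toy data are illustrative, not from the paper:

```python
from collections import Counter

def build_pte_networks(docs, labels, window=5):
    """Sketch: assemble the three bipartite edge lists PTE embeds jointly.
    docs: list of token lists; labels: parallel list (None where unlabeled)."""
    ww, wd, wl = Counter(), Counter(), Counter()
    for d, (tokens, label) in enumerate(zip(docs, labels)):
        for i, w in enumerate(tokens):
            wd[(w, d)] += 1                 # word-document edges
            if label is not None:
                wl[(w, label)] += 1         # word-label edges (the supervision)
            for u in tokens[max(0, i - window):i]:
                ww[(u, w)] += 1             # word-word co-occurrence edges
                ww[(w, u)] += 1
    return ww, wd, wl

# Toy usage: one labeled and one unlabeled document.
docs = [["cheap", "flights", "to", "rome"], ["stock", "market", "rally"]]
labels = ["travel", None]
ww, wd, wl = build_pte_networks(docs, labels, window=2)
```

Each bipartite network is then embedded with a LINE-style second-order objective, with word vectors shared across the three networks; the word-label edges are what inject the task supervision into the shared space.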
Weakly-Supervised Neural Text Classification
Deep neural networks are gaining increasing popularity for the classic text
classification task, due to their strong expressive power and less requirement
for feature engineering. Despite such attractiveness, neural text
classification models suffer from the lack of training data in many real-world
applications. Although many semi-supervised and weakly-supervised text
classification models exist, they cannot be easily applied to deep neural
models and meanwhile support limited supervision types. In this paper, we
propose a weakly-supervised method that addresses the lack of training data in
neural text classification. Our method consists of two modules: (1) a
pseudo-document generator that leverages seed information to generate
pseudo-labeled documents for model pre-training, and (2) a self-training module
that bootstraps on real unlabeled data for model refinement. Our method has the
flexibility to handle different types of weak supervision and can be easily
integrated into existing deep neural models for text classification. We have
performed extensive experiments on three real-world datasets from different
domains. The results demonstrate that our proposed method achieves strong
performance without requiring excessive training data and significantly
outperforms baseline methods.
Comment: CIKM 2018 Full Paper
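As a rough illustration of the self-training module, the sketch below bootstraps a classifier on unlabeled data, absorbing only high-confidence predictions each round. A logistic regression over TF-IDF features stands in for the paper's deep neural model, the pseudo-labeled documents are assumed to come from the generator, and all names and thresholds here are our assumptions:

```python
import numpy as np
from scipy.sparse import vstack
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

def self_train(pseudo_docs, pseudo_labels, unlabeled_docs,
               rounds=5, threshold=0.9):
    """Pre-train on pseudo-labeled documents, then bootstrap on real
    unlabeled data, keeping only high-confidence predictions."""
    vec = TfidfVectorizer().fit(pseudo_docs + unlabeled_docs)
    X_train, y_train = vec.transform(pseudo_docs), np.array(pseudo_labels)
    X_pool = vec.transform(unlabeled_docs)
    clf = LogisticRegression(max_iter=1000)
    for _ in range(rounds):
        clf.fit(X_train, y_train)                 # (re)train the model
        if X_pool.shape[0] == 0:
            break
        proba = clf.predict_proba(X_pool)
        keep = proba.max(axis=1) >= threshold     # confident predictions only
        if not keep.any():
            break
        X_train = vstack([X_train, X_pool[keep]])
        y_train = np.concatenate(
            [y_train, clf.classes_[proba[keep].argmax(axis=1)]])
        X_pool = X_pool[~keep]                    # shrink the unlabeled pool
    return clf, vec
```

The confidence threshold controls the usual self-training trade-off: lower values absorb more unlabeled data per round but risk reinforcing the model's own mistakes.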
Cross-Lingual Adaptation using Structural Correspondence Learning
Cross-lingual adaptation, a special case of domain adaptation, refers to the
transfer of classification knowledge between two languages. In this article we
describe an extension of Structural Correspondence Learning (SCL), a recently
proposed algorithm for domain adaptation, for cross-lingual adaptation. The
proposed method uses unlabeled documents from both languages, along with a word
translation oracle, to induce cross-lingual feature correspondences. From these
correspondences a cross-lingual representation is created that enables the
transfer of classification knowledge from the source to the target language.
The main advantages of this approach over alternatives are its resource
efficiency and task specificity.
We conduct experiments in the area of cross-language topic and sentiment
classification involving English as source language and German, French, and
Japanese as target languages. The results show a significant improvement of the
proposed method over a machine translation baseline, reducing the relative
error due to cross-lingual adaptation by an average of 30% (topic
classification) and 59% (sentiment classification). We further report on
empirical analyses that reveal insights into the use of unlabeled data, the
sensitivity with respect to important hyperparameters, and the nature of the
induced cross-lingual correspondences.
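The core of SCL can be sketched in a few lines of numpy: train one linear predictor per pivot feature (here, word pairs supplied by the translation oracle) from the non-pivot features on unlabeled documents of both languages, stack the weight vectors, and take an SVD to obtain the cross-lingual projection. Ridge-regularized least squares stands in for the modified Huber regression usually paired with SCL, and all names are illustrative:

```python
import numpy as np

def scl_projection(X, pivot_cols, k=50, ridge=1.0):
    """Sketch of the SCL core on a dense document-feature matrix X.
    pivot_cols index the pivot features (word pairs linked by the
    translation oracle); all other columns are non-pivot features."""
    nonpivot = np.setdiff1d(np.arange(X.shape[1]), pivot_cols)
    Xn = X[:, nonpivot]
    # One ridge-regularized least-squares predictor per pivot, a stand-in
    # for the modified Huber regression typically used with SCL.
    A = Xn.T @ Xn + ridge * np.eye(len(nonpivot))
    W = np.column_stack(
        [np.linalg.solve(A, Xn.T @ X[:, p]) for p in pivot_cols])
    U, _, _ = np.linalg.svd(W, full_matrices=False)
    theta = U[:, :k].T      # low-rank cross-lingual projection
    return theta, nonpivot

# Documents from either language are then represented by appending
# theta @ x[nonpivot] to the original feature vector x.
```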
Cross Language Text Classification via Subspace Co-Regularized Multi-View Learning
In many multilingual text classification problems, the documents in different
languages often share the same set of categories. To reduce the labeling cost
of training a classification model for each individual language, it is
important to transfer the label knowledge gained from one language to another
language by conducting cross language classification. In this paper we develop
a novel subspace co-regularized multi-view learning method for cross language
text classification. This method is built on parallel corpora produced by
machine translation. It jointly minimizes the training error of each classifier
in each language while penalizing the distance between the subspace
representations of parallel documents. Our empirical study on a large set of
cross language text classification tasks shows the proposed method consistently
outperforms a number of inductive methods, domain adaptation methods, and
multi-view learning methods.
Comment: Appears in Proceedings of the 29th International Conference on
Machine Learning (ICML 2012)
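In schematic form, the joint objective the abstract describes, written in our own notation rather than the paper's exact formulation, combines per-language training losses with a co-regularizer over parallel pairs:

```latex
\min_{\Theta_s,\,\Theta_t,\,w_s,\,w_t}\;
\sum_{i \in \mathcal{L}_s} \ell\big(w_s^{\top}\Theta_s x_i,\; y_i\big)
+ \sum_{j \in \mathcal{L}_t} \ell\big(w_t^{\top}\Theta_t x_j,\; y_j\big)
+ \lambda \sum_{(p,q) \in \mathcal{P}}
  \big\lVert \Theta_s x_p - \Theta_t x_q \big\rVert_2^2
```

Here Θ_s and Θ_t are the per-language subspace projections, w_s and w_t the classifiers, P the set of parallel document pairs produced by machine translation, and λ trades training accuracy against agreement of the subspace representations.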
Cross-lingual Distillation for Text Classification
Cross-lingual text classification (CLTC) is the task of classifying documents
written in different languages into the same taxonomy of categories. This paper
presents a novel approach to CLTC that builds on model distillation, which
adapts and extends a framework originally proposed for model compression. Using
soft probabilistic predictions for the documents in a label-rich language as
the (induced) supervisory labels in a parallel corpus of documents, we train
classifiers successfully for new languages in which labeled training data are
not available. An adversarial feature adaptation technique is also applied
during the model training to reduce distribution mismatch. We conducted
experiments on two benchmark CLTC datasets, treating English as the source
language and German, French, Japanese, and Chinese as the unlabeled target
languages. The proposed approach performed better than or comparably to other
state-of-the-art methods.
Comment: Accepted at ACL 2017; Code available at
https://github.com/xrc10/cross-distil
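A minimal numpy sketch of the distillation step: a teacher trained on the label-rich source language emits temperature-softened probabilities on the source side of the parallel corpus, and the target-language student is trained to match them on the aligned documents. The temperature value, the helper names, and the made-up logits are our assumptions, and the paper's adversarial feature adaptation is not shown:

```python
import numpy as np

def soften(logits, T=2.0):
    """Temperature-scaled softmax: the soft targets used for distillation."""
    z = logits / T
    z = z - z.max(axis=1, keepdims=True)          # numerical stability
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def soft_cross_entropy(student_logits, soft_targets, T=2.0):
    """Loss for the target-language student: match the teacher's
    softened distribution on the parallel documents."""
    p = soften(student_logits, T)
    return -(soft_targets * np.log(p + 1e-12)).sum(axis=1).mean()

# Teacher: a classifier trained on the label-rich source language, applied
# to the source side of the parallel corpus (logits are made up here).
teacher_logits = np.array([[2.0, 0.5, -1.0], [0.1, 1.5, 0.3]])
soft_targets = soften(teacher_logits)

# Student: the target-language model evaluated on the aligned documents.
student_logits = np.array([[1.5, 0.2, -0.5], [0.0, 1.0, 0.6]])
loss = soft_cross_entropy(student_logits, soft_targets)
```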