Bayesian nonparametric learning for complicated text mining
University of Technology Sydney, Faculty of Engineering and Information Technology. Text mining has gained ever-increasing attention from researchers in recent years because text is one of the most natural and accessible ways to express human knowledge and opinions, and is therefore believed to have a wide variety of application scenarios and potentially high commercial value. It is commonly accepted that Bayesian models with finite-dimensional probability distributions as building blocks, also known as parametric topic models, are effective tools for text mining. However, one problem with existing parametric topic models is that the number of hidden topics must be fixed in advance. Determining an appropriate number is very difficult, and sometimes unrealistic, for many real-world applications, and may lead to over-fitting or under-fitting. Bayesian nonparametric learning is a key approach to learning the number of mixtures in a mixture model (also called the model selection problem), and has emerged as an elegant way to handle a flexible number of topics. The core idea of Bayesian nonparametric models is to use stochastic processes as building blocks instead of traditional fixed-dimensional probability distributions. Even though Bayesian nonparametric learning has gained considerable research attention and undergone rapid development, its ability to handle complicated text mining tasks, such as document-word co-clustering, document network learning, and multi-label document learning, is still weak. Therefore, a gap remains between Bayesian nonparametric learning theory and complicated real-world text mining tasks.
To fill this gap, this research aims to develop a set of Bayesian nonparametric models to accomplish four selected complex text mining tasks. First, three Bayesian nonparametric sparse nonnegative matrix factorization models, based on two innovative dependent Indian buffet processes, are proposed for document-word co-clustering tasks. Second, a Dirichlet mixture probability measure strategy is proposed to link the topics from different layers, and is used to build a Bayesian nonparametric deep topic model for topic hierarchy learning. Third, the thesis develops a Bayesian nonparametric relational topic model for document network learning tasks via a subsampling Markov random field. Lastly, the thesis develops Bayesian nonparametric cooperative hierarchical structure models for multi-label document learning tasks based on two stochastic process operations: inheritance and cooperation. The findings of this research not only contribute to the development of Bayesian nonparametric learning theory, but also provide a set of effective tools for complicated text mining applications.
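The core nonparametric idea above, letting the data determine the number of topics instead of fixing it in advance, can be illustrated with a truncated stick-breaking construction of a Dirichlet process prior. This is a generic sketch for illustration, not code from the thesis; the truncation level and concentration parameter are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(0)

def stick_breaking(alpha, truncation):
    """Sample mixture weights from a (truncated) Dirichlet process prior.

    Each Beta-distributed stick fraction is multiplied by the length of the
    stick that remains. Components beyond the first few receive negligible
    mass, so the *effective* number of topics is inferred, not fixed.
    """
    fractions = rng.beta(1.0, alpha, size=truncation)
    remaining = np.concatenate([[1.0], np.cumprod(1.0 - fractions)[:-1]])
    return fractions * remaining  # mixture weights, summing to ~1

weights = stick_breaking(alpha=2.0, truncation=50)
# Count components that carry non-negligible probability mass.
effective_topics = int(np.sum(weights > 1e-3))
```

Larger values of `alpha` spread mass over more components (more topics); smaller values concentrate it on a few, which is how model selection becomes part of inference.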
Multi-task Learning of Pairwise Sequence Classification Tasks Over Disparate Label Spaces
We combine multi-task learning and semi-supervised learning by inducing a
joint embedding space between disparate label spaces and learning transfer
functions between label embeddings, enabling us to jointly leverage unlabelled
data and auxiliary, annotated datasets. We evaluate our approach on a variety
of sequence classification tasks with disparate label spaces. We outperform
strong single and multi-task baselines and achieve a new state-of-the-art for
topic-based sentiment analysis. Comment: To appear at NAACL 2018 (long paper)
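The transfer-function idea in the abstract above can be sketched in miniature: labels from two disparate label spaces are embedded in one joint space, and a learned map carries one task's label embeddings toward the other's. Everything here (dimensions, the linear least-squares map, the assumed label correspondences) is an illustrative simplification, not the paper's actual architecture.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical label embeddings: task A has 3 labels, task B has 5,
# both living in the same 8-dimensional joint embedding space.
emb_a = rng.normal(size=(3, 8))
emb_b = rng.normal(size=(5, 8))

# A transfer function maps task-A label embeddings toward task B's labels.
# Here it is a linear map fit by least squares on assumed related pairs.
pairs = [(0, 1), (1, 3), (2, 4)]               # assumed correspondences
src = np.stack([emb_a[i] for i, _ in pairs])
tgt = np.stack([emb_b[j] for _, j in pairs])
transfer, *_ = np.linalg.lstsq(src, tgt, rcond=None)

# Map a task-A label embedding across, then pick the nearest task-B label.
projected = emb_a[0] @ transfer
pred = int(np.argmax(emb_b @ projected))       # index of a task-B label
```

The appeal of such a map is that an unlabelled or auxiliary-task example scored in one label space can contribute a training signal in another, which is the leverage the abstract describes.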
Cross-Domain Labeled LDA for Cross-Domain Text Classification
Cross-domain text classification aims to build a classifier for a target
domain that leverages data from both the source and target domains. One
promising idea is to minimize the feature distribution differences between the
two domains. Most existing studies explicitly minimize such differences through
an exact alignment mechanism (aligning features by one-to-one feature
alignment, a projection matrix, etc.). Such exact alignment, however, restricts
a model's learning ability and further impairs its classification performance
when the semantic distributions of the domains are very different. To address
this problem, we propose a novel group alignment, which aligns the semantics at
the group level. In addition, to help the model learn better semantic groups
and the semantics within these groups, we also propose a partial supervision
for the model's learning in the source domain. To this end, we embed the group
alignment and partial supervision into a cross-domain topic model, and propose
Cross-Domain Labeled LDA (CDL-LDA). On the standard 20Newsgroups and Reuters
datasets, extensive quantitative (classification, perplexity, etc.) and
qualitative (topic detection) experiments are conducted to show the
effectiveness of the proposed group alignment and partial supervision.
Comment: ICDM 2018
A Survey of Methods for Addressing Class Imbalance in Deep-Learning Based Natural Language Processing
Many natural language processing (NLP) tasks are naturally imbalanced, as
some target categories occur much more frequently than others in the real
world. In such scenarios, current NLP models still tend to perform poorly on
less frequent classes. Addressing class imbalance in NLP is an active research
topic, yet, finding a good approach for a particular task and imbalance
scenario is difficult.
With this survey, the first overview of class imbalance in deep-learning-based
NLP, we provide guidance for NLP researchers and practitioners dealing
with imbalanced data. We first discuss various types of controlled and
real-world class imbalance. Our survey then covers approaches that have been
explicitly proposed for class-imbalanced NLP tasks or, originating in the
computer vision community, have been evaluated on them. We organize the methods
by whether they are based on sampling, data augmentation, choice of loss
function, staged learning, or model design. Finally, we discuss open problems
such as dealing with multi-label scenarios, and propose systematic benchmarking
and reporting in order to move forward on this problem as a community.
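Two of the method families the survey organizes, sampling and choice of loss function, can be sketched on a toy imbalanced label set. This is a generic illustration of the techniques, not code from the survey; the inverse-frequency weighting scheme and the 90/10 split are arbitrary choices.

```python
from collections import Counter
import random

random.seed(0)

# Toy imbalanced label distribution, as in many real-world NLP tasks.
labels = ["neg"] * 90 + ["pos"] * 10

# Family 1: class weights for a weighted loss function.
# Inverse-frequency weights give rare classes a larger say in the loss.
counts = Counter(labels)
n, k = len(labels), len(counts)
class_weights = {c: n / (k * cnt) for c, cnt in counts.items()}

# Family 2: random oversampling of the minority class.
majority = [i for i, y in enumerate(labels) if y == "neg"]
minority = [i for i, y in enumerate(labels) if y == "pos"]
extra = random.choices(minority, k=len(majority) - len(minority))
balanced_idx = majority + minority + extra   # 90 "neg" vs 90 "pos" indices
```

Which family works best depends on the task and imbalance scenario, which is exactly the selection difficulty the abstract points out.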