Pre-train, Interact, Fine-tune: A Novel Interaction Representation for Text Classification
Text representation can aid machines in understanding text. Previous work on text representation often focuses on the so-called forward implication: preceding words are taken as the context of later words when creating representations. This ignores the fact that the semantics of a text segment is a product of the mutual implication of words in the text: later words also contribute to the meaning of preceding words. We introduce the concept of interaction and propose a two-perspective interaction representation that encapsulates a local and a global interaction representation. Here, a local interaction representation is one computed among words with parent-child relationships in the syntactic tree, and a global interaction representation is one computed among all the words in a sentence. We combine the two interaction representations to develop a Hybrid Interaction Representation (HIR).
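The abstract does not give the exact formulation, but one way to realize the two interaction perspectives is attention-style interaction with and without a syntactic mask. Below is a minimal PyTorch sketch of that idea; the function names (interact, hybrid_interaction) and the concatenation used to combine the two views are illustrative assumptions, not the authors' method.

    import torch
    import torch.nn.functional as F

    def interact(h, mask=None):
        # h: (seq_len, dim) word representations; scores measure mutual implication
        scores = h @ h.t() / h.size(-1) ** 0.5       # (seq_len, seq_len)
        if mask is not None:
            # mask is a bool (seq_len, seq_len) tensor; it should include
            # self-connections so every row keeps at least one valid entry
            scores = scores.masked_fill(~mask, float("-inf"))
        return F.softmax(scores, dim=-1) @ h          # re-weighted representations

    def hybrid_interaction(h, parent_child_mask):
        local = interact(h, parent_child_mask)   # interact only along parent-child pairs
        global_rep = interact(h)                 # interact among all words in the sentence
        return torch.cat([local, global_rep], dim=-1)  # one way to combine the two views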
Inspired by existing feature-based and fine-tuning-based approaches to pre-trained language models, we integrate the advantages of both to propose the Pre-train, Interact, Fine-tune (PIF) architecture.
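As a similarly hedged sketch of how the three stages could fit together, the module below reuses hybrid_interaction from the sketch above; the class name, the frozen-vs-unfrozen choice for the encoder, and the mean-pooling readout are all assumptions, not the paper's specification.

    import torch.nn as nn

    class PIFClassifier(nn.Module):
        def __init__(self, encoder, dim, num_classes):
            super().__init__()
            self.encoder = encoder                       # Pre-train: reuse a pre-trained LM as feature extractor
            self.head = nn.Linear(2 * dim, num_classes)  # 2*dim: local + global views concatenated

        def forward(self, tokens, parent_child_mask):
            h = self.encoder(tokens)                     # (seq_len, dim) contextual features
            z = hybrid_interaction(h, parent_child_mask) # Interact: HIR as sketched above
            return self.head(z.mean(dim=0))              # Fine-tune: train the head (and optionally the encoder)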
We evaluate our proposed models on five widely used datasets for text classification. Our ensemble method outperforms state-of-the-art baselines, with improvements ranging from 2.03% to 3.15% in terms of error rate. In addition, we find that the improvement of PIF over most state-of-the-art methods is not affected by increasing text length.
Comment: 32 pages, 5 figures
PGLDA: enhancing the precision of topic modelling using Poisson Gamma (PG) and Latent Dirichlet Allocation (LDA) for text information retrieval
The Poisson document-length distribution has been used extensively in the past for modelling topics, with the expectation that its effect disappears once the model is fully specified. This practice often downplays the correlation between words and topics and reduces the precision of retrieved documents. Existing document models, such as the Latent Dirichlet Allocation (LDA) model, also do not accommodate semantic representations of words. Therefore, in this thesis, the Poisson-Gamma Latent Dirichlet Allocation (PGLDA) model for modelling word dependencies in topic modelling is introduced. PGLDA relaxes the word-independence assumption of the existing LDA model by introducing a Gamma distribution that captures the correlation between adjacent words in documents.
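The abstract does not spell out the PGLDA likelihood, but the core mechanism, a shared Gamma rate inducing correlation between adjacent Poisson counts, can be illustrated numerically. The following is a minimal sketch, not the thesis's exact model, assuming one latent Gamma rate per adjacent word pair; the parameter values are arbitrary.

    import numpy as np

    rng = np.random.default_rng(0)
    shape, scale, n = 2.0, 1.5, 100_000

    lam = rng.gamma(shape, scale, size=n)  # one latent Gamma rate shared by an adjacent word pair
    x = rng.poisson(lam)                   # count for word i
    y = rng.poisson(lam)                   # count for word i+1, drawn with the same rate

    # The shared rate makes the counts positively correlated, unlike
    # independent Poissons, whose correlation would be ~0.
    print(np.corrcoef(x, y)[0, 1])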
PGLDA is then hybridized with distributed representations of documents (Doc2Vec) and topics (Topic2Vec) to form a new model named PGLDA2Vec. Hybridization is achieved by averaging the Doc2Vec and Topic2Vec vectors to form new representation vectors, which are combined with the topics that have the largest estimated probability under PGLDA.
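A hedged sketch of that combination step, as described above: average the Doc2Vec and Topic2Vec vectors and attach the topic with the largest PGLDA-estimated probability. The function name, vector dimensions, and toy inputs are illustrative assumptions.

    import numpy as np

    def pglda2vec_repr(doc_vec, topic_vecs, topic_probs):
        top = int(np.argmax(topic_probs))        # topic with the largest estimated probability
        avg = (doc_vec + topic_vecs[top]) / 2.0  # average the two distributed representations
        return avg, top

    rng = np.random.default_rng(0)
    doc_vec = rng.random(50)                     # stand-in for a Doc2Vec document vector
    topic_vecs = rng.random((10, 50))            # stand-ins for Topic2Vec vectors of 10 topics
    probs = rng.dirichlet(np.ones(10))           # stand-in for PGLDA topic probabilities
    vec, topic = pglda2vec_repr(doc_vec, topic_vecs, probs)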
Model estimation for PGLDA and PGLDA2Vec combines a Laplace approximation of the PGLDA log-likelihood with the feed-forward neural network (FFN) training used by Doc2Vec and Topic2Vec. The proposed PGLDA and hybrid PGLDA2Vec models were assessed using precision, micro-F1 score, perplexity, and coherence score. Empirical results on three real-world datasets (20 Newsgroups, AG News, and Reuters) showed that the hybrid PGLDA2Vec model, with an average precision of 86.6% and an average F1 score of 96.3% across the three datasets, outperforms the other competing models reviewed.