27,827 research outputs found
G2T: A simple but versatile framework for topic modeling based on pretrained language model and community detection
It has been reported that clustering-based topic models, which cluster
high-quality sentence embeddings with an appropriate word selection method, can
generate better topics than generative probabilistic topic models. However,
these approaches suffer from the inability to select appropriate parameters and
incomplete models that overlook the quantitative relation between words with
topics and topics with text. To solve these issues, we propose graph to topic
(G2T), a simple but effective framework for topic modelling. The framework is
composed of four modules. First, document representation is acquired using
pretrained language models. Second, a semantic graph is constructed according
to the similarity between document representations. Third, communities in
document semantic graphs are identified, and the relationship between topics
and documents is quantified accordingly. Fourth, the word--topic distribution
is computed based on a variant of TFIDF. Automatic evaluation suggests that G2T
achieved state-of-the-art performance on both English and Chinese documents
with different lengths. Human judgements demonstrate that G2T can produce
topics with better interpretability and coverage than baselines. In addition,
G2T can not only determine the topic number automatically but also give the
probabilistic distribution of words in topics and topics in documents. Finally,
G2T is publicly available, and the distillation experiments provide instruction
on how it works
Improving Statistical Language Model Performance with Automatically Generated Word Hierarchies
An automatic word classification system has been designed which processes
word unigram and bigram frequency statistics extracted from a corpus of natural
language utterances. The system implements a binary top-down form of word
clustering which employs an average class mutual information metric. Resulting
classifications are hierarchical, allowing variable class granularity. Words
are represented as structural tags --- unique -bit numbers the most
significant bit-patterns of which incorporate class information. Access to a
structural tag immediately provides access to all classification levels for the
corresponding word. The classification system has successfully revealed some of
the structure of English, from the phonemic to the semantic level. The system
has been compared --- directly and indirectly --- with other recent word
classification systems. Class based interpolated language models have been
constructed to exploit the extra information supplied by the classifications
and some experiments have shown that the new models improve model performance.Comment: 17 Page Paper. Self-extracting PostScript Fil
Topic-based mixture language modelling
This paper describes an approach for constructing a mixture of language models based on simple statistical notions of semantics using probabilistic models developed for information retrieval. The approach encapsulates corpus-derived semantic information and is able to model varying styles of text. Using such information, the corpus texts are clustered in an unsupervised manner and a mixture of topic-specific language models is automatically created. The principal contribution of this work is to characterise the document space resulting from information retrieval techniques and to demonstrate the approach for mixture language modelling.
A comparison is made between manual and automatic clustering in order to elucidate how the global content information is expressed in the space. We also compare (in terms of association with manual clustering and language modelling accuracy) alternative term-weighting schemes and the effect of singular value decomposition dimension reduction (latent semantic analysis). Test set perplexity results using the British National Corpus indicate that the approach can improve the potential of statistical language modelling. Using an adaptive procedure, the conventional model may be tuned to track text data with a slight increase in computational cost
Inducing a Semantically Annotated Lexicon via EM-Based Clustering
We present a technique for automatic induction of slot annotations for
subcategorization frames, based on induction of hidden classes in the EM
framework of statistical estimation. The models are empirically evalutated by a
general decision test. Induction of slot labeling for subcategorization frames
is accomplished by a further application of EM, and applied experimentally on
frame observations derived from parsing large corpora. We outline an
interpretation of the learned representations as theoretical-linguistic
decompositional lexical entries.Comment: 8 pages, uses colacl.sty. Proceedings of the 37th Annual Meeting of
the ACL, 199
Machine Learning in Automated Text Categorization
The automated categorization (or classification) of texts into predefined
categories has witnessed a booming interest in the last ten years, due to the
increased availability of documents in digital form and the ensuing need to
organize them. In the research community the dominant approach to this problem
is based on machine learning techniques: a general inductive process
automatically builds a classifier by learning, from a set of preclassified
documents, the characteristics of the categories. The advantages of this
approach over the knowledge engineering approach (consisting in the manual
definition of a classifier by domain experts) are a very good effectiveness,
considerable savings in terms of expert manpower, and straightforward
portability to different domains. This survey discusses the main approaches to
text categorization that fall within the machine learning paradigm. We will
discuss in detail issues pertaining to three different problems, namely
document representation, classifier construction, and classifier evaluation.Comment: Accepted for publication on ACM Computing Survey
- …