Bibliographic Analysis on Research Publications using Authors, Categorical Labels and the Citation Network
Bibliographic analysis considers the author's research areas, the citation
network and the paper content among other things. In this paper, we combine
these three in a topic model that produces a bibliographic model of authors,
topics and documents, using a nonparametric extension of a combination of the
Poisson mixed-topic link model and the author-topic model. This gives rise to
the Citation Network Topic Model (CNTM). We propose a novel and efficient
inference algorithm for the CNTM to explore subsets of research publications
from CiteSeerX. The publication datasets are organised into three corpora,
totalling about 168k publications with about 62k authors. The queried
datasets are made available online. On three publicly available corpora, in
addition to the queried datasets, our proposed model demonstrates improved
performance in both model fitting and document clustering, compared to several
baselines. Moreover, our model allows extraction of additional useful knowledge
from the corpora, such as the visualisation of the author-topics network.
Additionally, we propose a simple method to incorporate supervision into topic
modelling to achieve further improvement on the clustering task.
Comment: Preprint for Journal Machine Learnin
The Extraction of Community Structures from Publication Networks to Support Ethnographic Observations of Field Differences in Scientific Communication
The scientific community of researchers in a research specialty is an
important unit of analysis for understanding the field-specific shaping of
scientific communication practices. These scientific communities are, however,
a challenging unit of analysis to capture and compare because they overlap,
have fuzzy boundaries, and evolve over time. We describe a network analytic
approach that reveals the complexities of these communities through examination
of their publication networks in combination with insights from ethnographic
field studies. We suggest that the structures revealed indicate overlapping
sub-communities within a research specialty and we provide evidence that they
differ in disciplinary orientation and research practices. By mapping the
community structures of scientific fields, we aim to increase confidence about
the domain of validity of ethnographic observations as well as of collaborative
patterns extracted from publication networks, thereby enabling the systematic
study of field differences. The network analytic methods presented include
methods to optimize the delineation of a bibliographic data set in order to
adequately represent a research specialty, and methods to extract community
structures from this data. We demonstrate the application of these methods in a
case study of two research specialties in the physical and chemical sciences.
Comment: Accepted for publication in JASIS
Nonparametric Bayesian Topic Modelling with Auxiliary Data
The intent of this dissertation in computer science is to study
topic models for text analytics. The first objective of this
dissertation is to incorporate auxiliary information present in
text corpora to improve topic modelling for natural language
processing (NLP) applications. The second objective of this
dissertation is to extend existing topic models to employ
state-of-the-art nonparametric Bayesian techniques for better
modelling of text data. In particular, this dissertation focusses
on:
- incorporating hashtags, mentions, emoticons, and target-opinion
dependency present in tweets, together with an external sentiment
lexicon, to perform opinion mining or sentiment analysis on
products and services;
- leveraging abstracts, titles, authors, keywords, categorical
labels, and the citation network to perform bibliographic
analysis on research publications, using a supervised or
semi-supervised topic model; and
- employing the hierarchical Pitman-Yor process (HPYP) and the
Gaussian process (GP) to jointly model text, hashtags, authors,
and the follower network in tweets for corpora exploration and
summarisation.
In addition, we provide a framework for implementing arbitrary
HPYP topic models to ease the development of our proposed topic
models, made possible by modularising the Pitman-Yor processes.
Through extensive experiments and qualitative assessment, we find
that topic models fit the data better as we utilise more
auxiliary information and employ the Bayesian nonparametric
method
- …
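The dissertation abstract above builds its models from Pitman-Yor processes. As a rough illustration only, the sketch below simulates the standard Chinese restaurant construction of a single Pitman-Yor process: a new customer joins existing table k with weight n_k - d and opens a new table with weight θ + dK, where d is the discount, θ the concentration, and K the current number of tables. The function name `sample_pitman_yor_crp` and its parameters are hypothetical and are not taken from the dissertation or its framework.

```python
import random


def sample_pitman_yor_crp(num_customers, discount, concentration, seed=0):
    """Seat customers via the Pitman-Yor Chinese restaurant process.

    Returns the list of table sizes. Illustrative sketch only; the name and
    signature are assumptions, not the dissertation's API.
    """
    # Standard validity conditions: 0 <= d < 1 and theta > -d.
    if not (0.0 <= discount < 1.0) or concentration <= -discount:
        raise ValueError("need 0 <= discount < 1 and concentration > -discount")

    rng = random.Random(seed)
    table_sizes = []
    for customer in range(num_customers):
        if customer == 0:
            # The first customer always opens the first table.
            table_sizes.append(1)
            continue
        # Existing table k has weight n_k - d.
        weights = [size - discount for size in table_sizes]
        # A new table has weight theta + d * K.
        weights.append(concentration + discount * len(table_sizes))
        choice = rng.choices(range(len(weights)), weights=weights)[0]
        if choice == len(table_sizes):
            table_sizes.append(1)
        else:
            table_sizes[choice] += 1
    return table_sizes


if __name__ == "__main__":
    sizes = sample_pitman_yor_crp(1000, discount=0.5, concentration=1.0)
    print(len(sizes), "tables; largest table has", max(sizes), "customers")
```

A larger discount tends to produce more small tables, the power-law behaviour that makes the process a common prior for word and topic frequencies. A hierarchical model such as the HPYP would chain several such restaurants, which this sketch does not attempt.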