8 research outputs found
Evaluating Similarity Metrics for Latent Twitter Topics
Topic modelling approaches such as LDA, when applied to a tweet corpus, can often generate a topic model containing redundant topics. To evaluate the quality of a topic model in terms of redundancy, topic similarity metrics can be applied to estimate the similarity among topics in a topic model. There are various topic similarity metrics in the literature, e.g. the Jensen-Shannon (JS) divergence-based metric. In this paper, we evaluate the performance of four distance/divergence-based topic similarity metrics and examine how they align with human judgements, including a newly proposed similarity metric that is based on computing word semantic similarity using word embeddings (WE). To obtain human judgements, we conduct a user study through crowdsourcing. Among various insights, our study shows that in general the cosine similarity (CS) and WE-based metrics perform better and appear to be complementary. However, we also find that the human assessors cannot easily distinguish between the distance/divergence-based and the semantic similarity-based metrics when identifying similar latent Twitter topics.
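Two of the metric families compared in the abstract can be sketched directly on topic-word probability vectors. This is a minimal illustration, not code from the study; the two example topics are invented.

```python
# Sketch of two topic similarity metrics discussed above, applied to
# topic-word probability vectors. The example topics are invented for
# illustration and are not taken from the study.
import numpy as np

def js_divergence(p, q):
    """Jensen-Shannon divergence (base-2) between two discrete distributions."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    m = 0.5 * (p + q)
    def kl(a, b):
        mask = a > 0  # 0 * log(0) is taken as 0
        return float(np.sum(a[mask] * np.log2(a[mask] / b[mask])))
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

def cosine_similarity(p, q):
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    return float(p @ q / (np.linalg.norm(p) * np.linalg.norm(q)))

# Two hypothetical topics over a four-word vocabulary.
topic_a = [0.5, 0.3, 0.1, 0.1]
topic_b = [0.4, 0.4, 0.1, 0.1]

print("JS divergence:", js_divergence(topic_a, topic_b))  # near 0 suggests redundancy
print("cosine similarity:", cosine_similarity(topic_a, topic_b))
```

A low JS divergence (or high cosine similarity) between two topics flags them as candidates for the kind of redundancy the paper evaluates.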
G2T: A simple but versatile framework for topic modeling based on pretrained language model and community detection
It has been reported that clustering-based topic models, which cluster high-quality sentence embeddings with an appropriate word selection method, can generate better topics than generative probabilistic topic models. However, these approaches suffer from an inability to select appropriate parameters and from incomplete modelling that overlooks the quantitative relations between words and topics and between topics and text. To solve these issues, we propose graph to topic (G2T), a simple but effective framework for topic modelling. The framework is composed of four modules. First, document representations are acquired using pretrained language models. Second, a semantic graph is constructed according to the similarity between document representations. Third, communities in the document semantic graph are identified, and the relationship between topics and documents is quantified accordingly. Fourth, the word-topic distribution is computed based on a variant of TF-IDF. Automatic evaluation suggests that G2T achieves state-of-the-art performance on both English and Chinese documents of different lengths. Human judgements demonstrate that G2T can produce topics with better interpretability and coverage than baselines. In addition, G2T can not only determine the number of topics automatically but also give the probabilistic distribution of words in topics and topics in documents. Finally, G2T is publicly available, and the distillation experiments provide instruction on how it works.
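The four modules can be sketched end to end in a toy pipeline. In this sketch, seeded random vectors stand in for pretrained sentence embeddings, and connected components of the thresholded similarity graph stand in for a proper community detection algorithm, so it illustrates the pipeline shape rather than G2T itself; all names and thresholds are illustrative.

```python
# Toy sketch of a G2T-shaped pipeline. Random vectors stand in for
# pretrained sentence embeddings; connected components stand in for
# community detection; a raw-count score stands in for the TF-IDF variant.
import numpy as np
from collections import Counter

docs = ["apple banana fruit", "banana fruit salad",
        "goal match football", "football match referee"]

# 1. Document representations (here: seeded random unit vectors).
rng = np.random.default_rng(0)
emb = rng.normal(size=(len(docs), 8))
emb /= np.linalg.norm(emb, axis=1, keepdims=True)

# 2. Semantic graph: connect documents whose cosine similarity passes a threshold.
sim = emb @ emb.T
adj = sim > 0.3
np.fill_diagonal(adj, False)

# 3. "Communities" = connected components (stand-in for community detection).
def components(adj):
    n, seen, comps = len(adj), set(), []
    for s in range(n):
        if s in seen:
            continue
        stack, comp = [s], []
        while stack:
            v = stack.pop()
            if v in seen:
                continue
            seen.add(v)
            comp.append(v)
            stack.extend(np.flatnonzero(adj[v]))
        comps.append(sorted(comp))
    return comps

topics = components(adj)

# 4. Word-topic weights: here simply word counts within each community.
for t, members in enumerate(topics):
    words = Counter(w for i in members for w in docs[i].split())
    print(t, words.most_common(3))
```

Swapping in real sentence embeddings and a modularity-based community detector turns this shape into the kind of clustering-based topic model the abstract describes.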
Francophone Hip-Hop Lyrics from the Perspective of the Digital Humanities
This paper contains an explorative analysis of French rap lyrics. The focus of the work is to answer three questions: What is the vocabulary used in French rap lyrics? What topics are present in French rap lyrics? Are media and technology a significant topic in French rap lyrics? The methods used to answer the research questions are as follows: literature research, data collection of French rap lyrics, statistical textual analysis, and algorithmic topic modelling. The findings confirm previous research on the subject by supporting claims that the level of non-standard language in French rap lyrics is not as high as myths suggested. In terms of topics, the topic modelling confirms that a variety of themes is present in French rap lyrics, including anti-systemic sentiments, struggle, censorship, and false information presented by the authorities.
A Topic Coverage Approach to Evaluation of Topic Models
Topic models are widely used unsupervised models of text capable of learning topics - weighted lists of words and documents - from large collections of text documents. When topic models are used for discovery of topics in text collections, a question that arises naturally is how well the model-induced topics correspond to topics of interest to the analyst. In this paper we revisit and extend a so far neglected approach to topic model evaluation based on measuring topic coverage - computationally matching model topics with a set of reference topics that models are expected to uncover. The approach is well suited for analyzing models' performance in topic discovery and for large-scale analysis of both topic models and measures of model quality. We propose new measures of coverage and evaluate, in a series of experiments, different types of topic models on two distinct text domains for which interest for topic discovery exists. The experiments include evaluation of model quality, analysis of coverage of distinct topic categories, and the analysis of the relationship between coverage and other methods of topic model evaluation. The contributions of the paper include new measures of coverage, insights into both topic models and other methods of model evaluation, and the datasets and code for facilitating future research of both topic coverage and other approaches to topic model evaluation.
The Palgrave Handbook of Digital Russia Studies
This open access handbook presents a multidisciplinary and multifaceted perspective on how the ‘digital’ is simultaneously changing Russia and the research methods scholars use to study Russia. It provides a critical update on how Russian society, politics, economy, and culture are reconfigured in the context of ubiquitous connectivity and accounts for the political and societal responses to digitalization. In addition, it answers practical and methodological questions in handling Russian data and a wide array of digital methods. The volume makes a timely intervention in our understanding of the changing field of Russian Studies and is an essential guide for scholars, advanced undergraduate and graduate students studying Russia today.
Computational Methods for Modelling and Analysing the Media Agenda Based on Machine Learning
This thesis focuses on computational methods for media agenda analysis based on topic models and methods of topic model evaluation. The goal of a media agenda analysis is gaining
insights into the structure and frequency of media topics. Such analyses are of interest to
social scientists studying news media, journalists, media analysts, and other commercial and
political actors. Computational methods for media agenda analysis enable automatic discovery
of topics in large corpora of news text and measuring of topics’ frequency. Data obtained by
such analyses provides insights into the type and structure of topics occurring in the media,
enables the analysis of topic cooccurrence, and the analysis of correlation between topics and other
variables such as text metadata and human perception of topic significance.
The goal of the research presented in the thesis is the development of efficient computational methods for the discovery of topics that constitute the media agenda and for measuring the frequencies of these topics. The proposed methods are based on topic models – a class
of unsupervised machine learning models widely used for exploratory analysis of topical text
structure. The research encompasses the development of applications of topic models for discovery of media topics and for measuring topics’ frequency, as well as the development of methods for improvement and facilitation of these applications. The improvement and facilitation methods encompass methods of topic model evaluation and software tools for working with topic
models. Methods of topic model evaluation can be used for selection of high-quality models
and for accelerating the process of topic discovery. Namely, topic models are a useful tool, but
due to the stochasticity of the model learning algorithms, the quality of learned topics varies. For
this reason the methods of topic model evaluation have the potential to increase the efficiency
of the methods based on topic models.
The media agenda consists of the set of topics discussed in the media, and the problem of media agenda analysis consists of two sub-tasks: discovery of the topics on the agenda and measuring the frequencies of these topics. The first contribution of the thesis is a method for media
agenda analysis based on topic models that builds upon previous approaches to the problem
and addresses their deficiencies. Three notable deficiencies are: usage of a single topic model
for topic discovery, the inability to define new topics that match the analyst’s interests,
and the lack of precise evaluation of methods for measuring topics’ frequency. In addition to
addressing the identified deficiencies, the method also systematizes the previous approaches
to the problem and is evaluated in two case studies of media agenda analysis. The proposed
method for media agenda analysis consists of three steps: a topic discovery step, a topic definition step, and a topic measurement step.
In order to achieve better topic coverage, the discovery step is based not on a single model
but on a set of topic models. The type and number of topic models used depends on available
model implementations and the time available for topic annotation, while the hyperparameter
defining the number of model topics depends on the desired generality of learned topics. Reasonable default settings for model construction are proposed based on existing agenda analysis studies, and an iterative procedure for tuning the number of topics is described. After the topic
models are constructed, topic discovery is performed by human inspection and interpretation
of the topics. Topic interpretation produces semantic topics (concepts) that are recorded in a
reference table of semantic topics that serves as a record of topics and as a tool for synchronization of human annotators. After all the model topics are inspected, annotators can optionally
perform the error correcting step of revising the semantic topics, as well as the step of building
a taxonomy of semantic topics. Topic discovery is supported with a graphical user interface
developed for topic inspection and annotation.
The step of topic definition is based on semantic topics obtained during topic discovery. The
purpose of topic definition is to define new semantic topics that closely match the analyst’s exact
interests. The possibility of defining new semantic topics is an important difference between the
proposed and the existing media agenda analysis approaches. Namely, the existing approaches
base the analysis only on model-produced topics, although there is no guarantee that these topics
will match the concepts of interest to the analyst. During topic definition, the analyst infers
definitions of new semantic topics based on previously discovered topics and describes these
topics with word lists. Discovered semantic topics that already closely match the concepts of
interest are used without modification.
During the step of topic measuring the frequencies of semantic topics obtained during the
discovery and definition steps are measured. Topic frequency is defined as the number of news
articles in which a topic occurs, and the measuring problem is cast as the problem of multi-label
classification in which each news article is tagged with one or more semantic topics. This
formulation allows for precise quantitative evaluation of methods for measuring topic frequency.
Two measuring methods are considered. The baseline is a supervised approach using binary relevance in combination with a linear-kernel SVM model. The second method is a
newly proposed weakly supervised approach, in which the measured semantic topics are first
described by sets of highly discriminative words, after which a new LDA model is constructed
in such a way that the topics of the model correspond to measured topics, which is achieved via
prior probabilities of model topics. The method for selecting words highly discriminative for
a semantic topic represents the main difference between the proposed and the previous weakly
supervised approaches. This method consists of inspecting, for each measured semantic topic,
closely related model topics, and selecting words highly discriminative for the topic by means of inspecting word-related documents and assessing their correspondence with the topic.
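The multi-label formulation above can be made concrete in a few lines. Here a trivial keyword matcher stands in for the SVM and weakly supervised LDA models of the thesis, and the topics, keywords, and articles are all invented; the point is the binary-relevance decomposition and the frequency count.

```python
# Sketch of the multi-label formulation: each article may carry several
# semantic topics, and a topic's frequency is the number of articles
# tagged with it. A keyword matcher stands in for the trained classifiers;
# all topics, keywords, and articles are invented for illustration.
from collections import Counter

topic_keywords = {
    "elections": {"vote", "ballot", "candidate"},
    "economy":   {"inflation", "budget", "tax"},
}

articles = [
    "the candidate promised a new budget and lower tax",
    "turnout was high and every ballot was counted",
    "inflation slowed last quarter",
]

def tag(article):
    words = set(article.split())
    # Binary relevance: one independent yes/no decision per topic.
    return [t for t, kw in topic_keywords.items() if words & kw]

labels = [tag(a) for a in articles]
frequency = Counter(t for ls in labels for t in ls)
print(labels)     # per-article topic tags
print(frequency)  # topic frequency = number of articles per topic
```

Replacing `tag` with a per-topic trained classifier gives exactly the binary-relevance baseline described above, and the same `Counter` yields the measured topic frequencies.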
The proposed three-step method for media agenda analysis is applied to two media agenda
analyses: the analysis of mainstream US political news and the analysis of mainstream Croatian
political news in the election period. The applications of the proposed method show that the
topic discovery step gives a good overview of the media agenda and leads to the discovery of
useful topics, and that the usage of more than one topic model leads to a more comprehensive
set of topics. The two analyses also demonstrate the necessity of the proposed topic definition
step – in the case of US news new sensible topics corresponding to issues are pinpointed during
this step, while in the case of Croatian election-related news the analysis is based entirely on
newly defined semantic topics that describe the pre- and post-election processes. Quantitative
evaluation of topic frequency measuring shows that the proposed weakly supervised approach
works better than the supervised SVM-based method since it achieves better or comparable
performance with less labeling effort. In contrast to the supervised method, the weakly supervised models have higher recall and work well for smaller topics. Qualitative evaluation of
measuring models confirms the quality of the proposed approach – measured topic frequency
correlates well with real-world events and the election-related conclusions based on measuring
models are in line with conclusions drawn from social-scientific studies.
Observations from two media agenda analysis studies and the analysis of collected topic data
underlined two problems related to methods of topic model evaluation. The first is the problem
of measuring topic quality – the studies both confirmed variations in topic quality and indicated
the inadequacy of existing word-based measures of topic coherence. The second is the problem
of topic coverage – while the data confirms the limited ability of a single topic model to cover
all the semantic topics, no methods for measuring topic coverage are available, so it is not
possible to identify the high-coverage models. These observations motivated the development
of new methods of topic model evaluation – document-based coherence measures and methods
for topic coverage analysis.
As described, the analysis of topics produced during the applications of topic discovery
confirmed variations in topics’ quality and underlined the need for better measures of topic
quality. The analysis also indicated that existing word-based measures of topic coherence are
inadequate for evaluating the quality of media topics, which are often characterized by semantically unrelated
word sets. Based on the observation that media topics can be successfully interpreted using
topic-related documents, a new class of document-based topic coherence measures is proposed.
The proposed measures calculate topic coherence in three steps: selection of topic-related
documents, document vectorization, and computation of the coherence score from document
vectors. Topic-related documents are selected using a simple model-independent strategy – a
fixed number of documents with top document-topic weights is selected. Two families of document vectorization methods are considered. The first family consists of two standard methods based on word and document frequency counts: probabilistic bag-of-words vectorization and tf-idf vectorization. Methods in the second family vectorize documents by aggregating
either CBOW or GloVe word embeddings. Three types of methods are considered for coherence score computation: distance-based methods that model coherence via mutual document distance, probability-based methods that model coherence as probabilistic compactness of document vectors, and graph-based methods that model coherence via connectivity of the document graph. The space of all the coherence measures is parametrized, and sensible parameter
values are defined to obtain a smaller set of several thousand measures. Then the selection and
evaluation of the coherence measures is performed, using model topics manually labeled with
document-based coherence scores and using the area under the ROC curve (AUC) as the performance criterion. The measures are partitioned into structural categories, and the best measure
from each category is selected using AUC on the development set as a criterion. These best
measures are then evaluated on two test sets containing English and Croatian news topics.
The evaluation of document-based coherence measures shows that the graph-based measures achieve the best results. Namely, the best approximators of human coherence scores are the graph-based measures that use frequency-based document vectorization, build sparse graphs of locally connected documents, and calculate coherence by aggregating a local connectivity score such as
closeness centrality. Quantitative evaluation of word-based measures confirms the observations
that word-based measures fail to approximate document-based coherence scores well, and qualitative evaluation of coherence measures indicates that document-based and word-based coherence measures complement each other and should be used in combination to obtain a more complete model of topic coherence.
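A minimal sketch of the best-performing family described above: bag-of-words vectorization, a sparse graph of locally connected documents, and coherence as aggregated closeness centrality. The documents, the neighbourhood size, and the use of raw counts are illustrative assumptions, not the thesis's exact configuration.

```python
# Sketch of a graph-based, document-based coherence score: vectorize a
# topic's top documents, connect each document to its nearest neighbours,
# and aggregate a local connectivity score (closeness centrality).
# Documents and parameter choices are illustrative assumptions.
import numpy as np
from collections import deque

def bow(texts):
    """L2-normalized bag-of-words vectors."""
    vocab = sorted({w for t in texts for w in t.split()})
    idx = {w: i for i, w in enumerate(vocab)}
    m = np.zeros((len(texts), len(vocab)))
    for r, t in enumerate(texts):
        for w in t.split():
            m[r, idx[w]] += 1
    return m / np.linalg.norm(m, axis=1, keepdims=True)

def knn_graph(vecs, k=2):
    """Sparse graph: each document linked to its k most similar neighbours."""
    sim = vecs @ vecs.T
    np.fill_diagonal(sim, -np.inf)
    adj = np.zeros(sim.shape, dtype=bool)
    for i, row in enumerate(sim):
        for j in np.argsort(row)[-k:]:
            adj[i, j] = adj[j, i] = True  # symmetrize
    return adj

def closeness(adj, s):
    """Closeness centrality of node s via BFS shortest paths."""
    dist, q = {s: 0}, deque([s])
    while q:
        v = q.popleft()
        for u in np.flatnonzero(adj[v]):
            if u not in dist:
                dist[u] = dist[v] + 1
                q.append(u)
    others = [d for n, d in dist.items() if n != s]
    return (len(others) / sum(others)) if others else 0.0

def doc_coherence(texts, k=2):
    adj = knn_graph(bow(texts), k)
    return float(np.mean([closeness(adj, s) for s in range(len(texts))]))

tight = ["tax budget economy", "budget tax deficit", "economy tax growth"]
print(doc_coherence(tight))
```

Semantically tight document sets yield a densely connected graph and a high aggregate centrality, matching the intuition that interpretable media topics have mutually related top documents.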
Motivated by the data from the topic discovery steps performed in two media agenda analyses
and by the obvious need to increase the number of topics discovered by a single topic model, the
problem of topic coverage is defined and solutions are proposed. This problem occurs in applications of topic models to any text domain, i.e., it is domain-independent and extends beyond
applications to media text. The problem of topic coverage consists of measuring how automatically learned model topics cover a set of reference topics – topical concepts defined by humans. Two basic aspects of the problem are the reference topics, which represent the concepts topic models are expected to cover, and the measures of topic coverage, which calculate a score measuring overlap between the model topics and the reference topics. Finally, the third aspect encompasses
evaluation of a set of topic models using a reference set and coverage measures.
The coverage experiments are conducted using two datasets that correspond to two separate
text domains – news media texts and biological texts. Each dataset contains a text corpus,
a set of reference topics, and a set of topic models. Reference topics consist of topics that
standard topic models are expected to be able to cover. These topics are constructed by human
inspection, selection, and modification of model-learned topics. Both sets of reference topics are representative of useful topics discovered during the process of exploratory text analysis.
Two approaches to measuring topic coverage are developed – an approach based on supervised approximation of topic matching and an unsupervised approach based on integrating coverage across a range of topic-matching criteria. The supervised approach is based on building
a classification model that approximates human intuition of topic matching. A binary classifier
is learned from a set of topic pairs annotated with matching scores. Four standard classification
models are considered: logistic regression, support vector machine, random forest, and multilayer perceptron. Topic pairs are represented as distances of topic-related word and document vectors using four distinct distance measures: cosine, Hellinger, L1, and L2. Model selection
and evaluation shows that the proposed method approximates human scores very well, and that
logistic regression is the best-performing model. The second proposed method for measuring
coverage uses a measure of topic distance and a distance threshold to approximate the equality
of a reference topic and a model topic. The threshold value is varied and for each threshold
coverage is calculated as a proportion of reference topics that are matched by at least one model
topic at a distance below the threshold. Varying the threshold results in a curve with threshold
values on the x-axis and coverage scores on the y-axis. The final coverage score is calculated
as the area under this curve. This unsupervised measure of coverage, dubbed area under the
coverage-distance curve, correlates very well with the supervised measures of coverage, while
the curve itself is a useful tool for visual analysis of topic coverage. This measure enables the
users to quickly perform coverage measurements on new domains, without the need to annotate
topic pairs in order to construct a supervised coverage measure.
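The area under the coverage-distance curve can be sketched directly from its definition above; the toy topic vectors and the choice of cosine distance here are illustrative assumptions.

```python
# Sketch of the unsupervised coverage measure: vary a distance threshold,
# compute the fraction of reference topics matched by at least one model
# topic within that threshold, and report the area under the resulting
# coverage-distance curve. Topics are toy vectors, not real data.
import numpy as np

def cosine_distance(p, q):
    return 1.0 - float(p @ q / (np.linalg.norm(p) * np.linalg.norm(q)))

def coverage_at(model_topics, reference_topics, threshold):
    """Fraction of reference topics matched by >= 1 model topic."""
    matched = sum(
        any(cosine_distance(r, m) <= threshold for m in model_topics)
        for r in reference_topics
    )
    return matched / len(reference_topics)

def area_under_cd_curve(model_topics, reference_topics, thresholds):
    cov = [coverage_at(model_topics, reference_topics, t) for t in thresholds]
    # Trapezoidal rule over the (threshold, coverage) curve.
    cov, thresholds = np.asarray(cov), np.asarray(thresholds)
    return float(np.sum((cov[1:] + cov[:-1]) * np.diff(thresholds)) / 2.0)

reference = [np.array([0.6, 0.3, 0.1]), np.array([0.1, 0.2, 0.7])]
model     = [np.array([0.5, 0.4, 0.1]), np.array([0.3, 0.3, 0.4])]
thresholds = np.linspace(0.0, 1.0, 101)
print("AUC of coverage-distance curve:", area_under_cd_curve(model, reference, thresholds))
```

Plotting coverage against the threshold gives the curve used for visual analysis; the scalar area summarizes it without requiring any annotated topic pairs.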
Using the proposed coverage measures and two sets of reference topics, coverage experiments in two distinct text domains are performed. The experiments consist of measuring the coverage
obtained by a set of topic models of distinct types constructed using different hyperparameters.
In addition to demonstrating application of coverage methods, the experiments show that the
NMF model has high coverage scores, is robust to domain change, and is able to discover topics at a high level of precision. The nonparametric model based on Pitman-Yor priors achieves the best coverage for news topics.
Two proposed methods of topic model evaluation – document-based coherence measures
and methods devised for solving the coverage problem – are applied in order to improve the
previously proposed topic-model-based method of media agenda analysis. The improvements
refer to the step of topic discovery and lead to quicker discovery of a larger number of concepts.
This is achieved by using more interpretable models with higher coverage, and by ordering model topics, before