9 research outputs found

    A text segmentation approach for automated annotation of online customer reviews, based on topic modeling

    Full text link
    Online customer review classification and analysis have been recognized as an important problem in many domains, such as business intelligence, marketing, and e-governance. To solve this problem, a variety of machine learning methods were developed over the past decade. Existing methods, however, either rely on human labeling or have high computing cost, or both. This makes them a poor fit for dealing with dynamic and ever-growing collections of short but semantically noisy customer review texts. In the present study, the problem of multi-topic online review clustering is addressed by generating high-quality bronze-standard labeled sets for training efficient classifier models. A novel unsupervised algorithm is developed to break reviews into sequential, semantically homogeneous segments. Segment data is then used to fine-tune a Latent Dirichlet Allocation (LDA) model obtained for the reviews, and to classify them along categories detected through topic modeling. After testing the segmentation algorithm on a benchmark text collection, it was successfully applied in a case study of tourism review classification. In all experiments conducted, the proposed approach produced results similar to or better than baseline methods. The paper critically discusses the main findings and paves the way for future work.
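
    A minimal sketch of the pipeline this abstract describes, under simplifying assumptions: a naive sentence split stands in for the paper's unsupervised segmentation algorithm, and scikit-learn's LDA stands in for the fine-tuned topic model; reviews are then labelled by their segments' dominant topic.

```python
# Hedged sketch: segment reviews, fit LDA on the segments, then label each
# review with its dominant topic.  Sentence splitting stands in for the
# paper's unsupervised segmentation algorithm (hypothetical simplification).
import re
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

reviews = [
    "The room was spotless and the bed was comfortable. Breakfast was cold though.",
    "Great location near the beach. Staff were rude at check-in.",
]

# 1) Break each review into sequential segments (here: naive sentence split).
segments, owner = [], []            # owner[i] = index of the review a segment came from
for r_idx, review in enumerate(reviews):
    for seg in re.split(r"(?<=[.!?])\s+", review):
        if seg.strip():
            segments.append(seg)
            owner.append(r_idx)

# 2) Fit a topic model on the segments rather than on whole reviews.
vec = CountVectorizer(stop_words="english")
X = vec.fit_transform(segments)
lda = LatentDirichletAllocation(n_components=4, random_state=0).fit(X)

# 3) Label each review by aggregating its segments' topic distributions.
seg_topics = lda.transform(X)
for r_idx, review in enumerate(reviews):
    mask = [i for i, o in enumerate(owner) if o == r_idx]
    topic = int(np.argmax(seg_topics[mask].mean(axis=0)))
    print(f"review {r_idx} -> topic {topic}")
```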

    Classifying spam emails using agglomerative hierarchical clustering and a topic-based approach

    Get PDF
    Spam emails are unsolicited, annoying and sometimes harmful messages which may contain malware, phishing or hoaxes. Unlike most studies that address the design of efficient anti-spam filters, we approach the spam email problem from a different and novel perspective. Focusing on the needs of cybersecurity units, we follow a topic-based approach for addressing the classification of spam email into multiple categories. We propose SPEMC-15K-E and SPEMC-15K-S, two novel datasets with approximately 15K emails each in English and Spanish, respectively, and we label them using agglomerative hierarchical clustering into 11 classes. We evaluate 16 pipelines, combining four text representation techniques (Term Frequency-Inverse Document Frequency (TF-IDF), Bag of Words, Word2Vec and BERT) and four classifiers (Support Vector Machine, Naïve Bayes, Random Forest and Logistic Regression). Experimental results show that the highest performance is achieved with TF-IDF and LR for the English dataset, with an F1 score of 0.953 and an accuracy of 94.6%, while for the Spanish dataset, TF-IDF with NB yields an F1 score of 0.945 and 98.5% accuracy. Regarding processing time, TF-IDF with LR leads to the fastest classification, processing an English and Spanish spam email in 2 ms and 2.2 ms on average, respectively.
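
    As a rough illustration of the best-reported English configuration (TF-IDF features with Logistic Regression), a scikit-learn sketch follows; the file name and column names are placeholders, not the published SPEMC-15K-E layout, and the reported scores should not be expected from it.

```python
# Hedged sketch of the best-performing English pipeline reported in the
# abstract (TF-IDF features + Logistic Regression).  The file name and
# column names are placeholders, not the published SPEMC-15K-E layout.
import pandas as pd
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import f1_score, accuracy_score

df = pd.read_csv("spemc_15k_e.csv")          # hypothetical export with 'text' and 'label' columns
X_train, X_test, y_train, y_test = train_test_split(
    df["text"], df["label"], test_size=0.2, stratify=df["label"], random_state=0)

pipe = Pipeline([
    ("tfidf", TfidfVectorizer(lowercase=True, ngram_range=(1, 1))),
    ("clf", LogisticRegression(max_iter=1000)),
])
pipe.fit(X_train, y_train)

pred = pipe.predict(X_test)
print("macro F1:", f1_score(y_test, pred, average="macro"))
print("accuracy:", accuracy_score(y_test, pred))
```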

    Large-scale fine-grained semantic indexing of biomedical literature based on weakly-supervised deep learning

    Full text link
    Semantic indexing of biomedical literature is usually done at the level of MeSH descriptors, representing topics of interest for the biomedical community. Several related but distinct biomedical concepts are often grouped together in a single coarse-grained descriptor and are treated as a single topic for semantic indexing. This study proposes a new method for the automated refinement of subject annotations at the level of concepts, investigating deep learning approaches. Lacking labelled data for this task, our method relies on weak supervision based on concept occurrence in the abstract of an article. The proposed approach is evaluated in an extended large-scale retrospective scenario, taking advantage of concepts that eventually become MeSH descriptors, for which annotations become available in MEDLINE/PubMed. The results suggest that concept occurrence is a strong heuristic for automated subject annotation refinement and can be further enhanced when combined with dictionary-based heuristics. In addition, such heuristics can be useful as weak supervision for developing deep learning models that can achieve further improvement in some cases. Comment: 48 pages, 5 figures, 9 tables, 1 algorithm.
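
    A minimal sketch of the weak-supervision heuristic the abstract names, i.e. labelling an article with a fine-grained concept when the concept or a synonym occurs in its abstract; the concept names and synonym lists below are illustrative placeholders rather than actual MeSH entries.

```python
# Hedged sketch: weakly label articles with fine-grained concepts when the
# concept (or a synonym) occurs in the abstract.  Concept/synonym lists are
# illustrative placeholders, not actual MeSH concepts.
import re

concept_synonyms = {
    "COVID-19": ["covid-19", "sars-cov-2 infection"],
    "SARS-CoV-2": ["sars-cov-2", "2019-ncov"],
}

def weak_labels(abstract: str) -> list[str]:
    """Return concepts whose surface forms occur in the abstract."""
    text = abstract.lower()
    hits = []
    for concept, synonyms in concept_synonyms.items():
        if any(re.search(r"\b" + re.escape(s) + r"\b", text) for s in synonyms):
            hits.append(concept)
    return hits

print(weak_labels("We analyse SARS-CoV-2 infection dynamics in COVID-19 patients."))
# -> ['COVID-19', 'SARS-CoV-2']
```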

    Enhancing extremist data classification through textual analysis

    Get PDF
    The high volume of extremist materials on the Internet has created the need for intelligence gathering via the Web and real-time monitoring of potential websites for evidence of extremist activities. However, manual classification of such content is difficult and time-consuming. In response to this challenge, the work reported here developed several classification frameworks. Each framework provides a basis of text representation before the data are fed into a machine learning algorithm. The bases of text representation are the Sentiment-rule method, Posit textual analysis with word-level features, and an extension of Posit analysis, known as Extended-Posit, which adopts character-level as well as word-level data. Identifying some gaps in the aforementioned techniques created avenues for further improvement, especially in handling larger datasets with better classification accuracy. Consequently, a novel basis of text representation, known as the Composite-based method, was developed. This is a computational framework that explores the combination of both sentiment and syntactic features of the textual contents of a Web page. Subsequently, these techniques are applied to a dataset that had been subjected to a manual classification process, and the resulting representations are fed into machine learning algorithms. This generates a measure of how well each page can be classified into its appropriate class. The classifiers considered include neural networks (RNN and MLP) and classical machine learning classifiers (such as J48, Random Forest and KNN). In addition, feature selection and model optimisation were evaluated to assess the cost of creating a machine learning model. Considering the results obtained from each framework, composite features are preferable to solely syntactic or sentiment features, offering improved classification accuracy when used with machine learning algorithms. Furthermore, the extension of Posit analysis to include both word- and character-level data outperformed word-level features alone when applied to the assembled textual data. Moreover, the Random Forest classifier outperformed the other classifiers explored. Taking cost into account, feature selection improves classification accuracy and saves more time than hyperparameter tuning (model optimisation).
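
    A minimal sketch of the composite idea, assuming a toy sentiment lexicon and simple surface counts as stand-ins for the thesis's Sentiment-rule and Posit features, fed to a Random Forest as in the reported experiments.

```python
# Hedged sketch of a "composite" representation: simple sentiment-lexicon
# counts concatenated with surface/syntactic-style counts, fed to a Random
# Forest.  The tiny lexicon, features, and labels are illustrative stand-ins
# for the thesis's Sentiment-rule and Posit features and annotated data.
import numpy as np
from sklearn.ensemble import RandomForestClassifier

POSITIVE = {"support", "peace", "help"}
NEGATIVE = {"attack", "destroy", "enemy"}

def composite_features(text: str) -> list[float]:
    tokens = text.lower().split()
    n = max(len(tokens), 1)
    return [
        sum(t in POSITIVE for t in tokens) / n,   # sentiment: positive-word ratio
        sum(t in NEGATIVE for t in tokens) / n,   # sentiment: negative-word ratio
        n,                                        # surface: token count
        sum(len(t) for t in tokens) / n,          # surface: average word length
        text.count("!"),                          # surface: exclamation marks
    ]

texts = ["we must attack and destroy the enemy now!",
         "join us to support peace and help the community"]
labels = [1, 0]                                   # toy labels: 1 = extremist-leaning, 0 = benign

X = np.array([composite_features(t) for t in texts])
clf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, labels)
print(clf.predict([composite_features("help us support peace")]))
```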

    Exploring a Modelling Method with Semantic Link Network and Resource Space Model

    Get PDF
    To model complex reality, it is necessary to develop a powerful semantic model. A rational approach is to integrate a relational view and a multi-dimensional view of reality. The Semantic Link Network (SLN) is a semantic model based on a relational view, and the Resource Space Model (RSM) is a multi-dimensional view for managing, sharing and specifying versatile resources with a universal resource observation. The motivation of this research consists of four aspects: (1) verify the roles of the Semantic Link Network and the Resource Space Model in effectively managing various types of resources, (2) demonstrate the advantages of the Resource Space Model and the Semantic Link Network, (3) uncover the rules through applications, and (4) generalize a methodology for modelling complex reality and managing various resources. The main contribution of this work consists of the following aspects: 1. A new text summarization method is proposed by segmenting a document into clauses based on semantic discourse relations and by ranking and extracting the informative clauses according to their relations and roles. The Resource Space Model benefits from using the semantic link network, ranking techniques and language characteristics. Compared with other summarization approaches, the proposed approach based on semantic relations achieves a higher recall score. Three implications are obtained from this research. 2. An SLN-based model for recommending research collaboration is proposed by extracting a semantic link network of different types of semantic nodes and different types of semantic links from scientific publications. Experiments on three data sets of scientific publications show that the model achieves good performance in predicting future collaborators. This research further unveils that different semantic links play different roles in representing texts. 3. A multi-dimensional method for managing software engineering processes is developed. Software engineering processes are mapped into multiple dimensions to support analysis, development and maintenance of software systems. The method can be used to uniformly classify and manage software methods and models through multiple dimensions so that software systems can be developed with appropriate methods. Interfaces for visualizing the Resource Space Model are developed to support the proposed method by keeping consistency among the interface, the structure of the model and faceted navigation.
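
    A rough sketch of graph-based extractive summarization in the spirit of contribution 1, with sentences standing in for discourse-segmented clauses and TextRank-style centrality standing in for the SLN-based ranking; it is not the thesis's method.

```python
# Hedged sketch of graph-based extractive summarization: split text into
# units (sentences stand in for discourse-segmented clauses), link similar
# units, rank them by centrality, and extract the top-ranked ones.  This is
# a generic TextRank-style stand-in, not the thesis's SLN-based ranking.
import re
import networkx as nx
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

text = ("The Semantic Link Network models relations between resources. "
        "The Resource Space Model organises resources along dimensions. "
        "Combining the two views supports management of versatile resources. "
        "The weather was pleasant during the conference.")

units = [u for u in re.split(r"(?<=[.!?])\s+", text) if u.strip()]
sim = cosine_similarity(TfidfVectorizer().fit_transform(units))

graph = nx.Graph()
for i in range(len(units)):
    for j in range(i + 1, len(units)):
        if sim[i, j] > 0.05:                      # link semantically related units
            graph.add_edge(i, j, weight=float(sim[i, j]))

scores = nx.pagerank(graph, weight="weight")      # rank units by centrality
top = sorted(scores, key=scores.get, reverse=True)[:2]
print(" ".join(units[i] for i in sorted(top)))    # extractive summary
```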

    Computational methods for modelling and analysing the media agenda based on machine learning (Računalni postupci za modeliranje i analizu medijske agende temeljeni na strojnome učenju)

    Get PDF
    This thesis deals with computational methods for media agenda analysis based on topic models and with methods for topic model evaluation. Media agenda analysis is carried out to gain insight into the structure and prevalence of media topics, which is of interest to social-science research, the media industry, and other commercial and political actors. Computational methods for media agenda analysis enable the automatic discovery of topics in large collections of texts and the measurement of their prevalence. These methods give the analyst insight into the topics present in the media and into the prevalence of topics across individual outlets and time periods, and they allow topic prevalence to be correlated with data such as human perception of topic importance. The goal of the research was the development of computational methods for exploratory analysis and measurement of the media agenda based on topic models, a class of machine learning models suited to analysing the topical structure of text. The research covers the development of procedures for applying topic models to the discovery of media topics and the measurement of their prevalence, as well as the development of computational tools for improving and carrying out these procedures. These tools include methods for topic model evaluation and software support for implementing the agenda analysis and model evaluation procedures. Applying the procedures to media texts quickly revealed the need for new topic model evaluation methods in order to increase the efficiency of model-based procedures, so a particular emphasis of the research was placed on the development and analysis of topic model evaluation methods. The first line of research investigated procedures for applying topic models to media agenda analysis. Based on a study of existing procedures, an improved procedure is proposed that consists of three steps: topic discovery, topic definition, and topic measurement. The proposed procedure removes observed shortcomings of earlier methods: the use of a single model for topic discovery, the inability to adapt and define new topics, and the lack of quantitative evaluation of the measurement methods. The procedure was applied in two media agenda analyses carried out on collections of American and Croatian political news. Observations and data from these analyses revealed the need for a measure of the interpretability of model topics and for a method for measuring how well a model covers a set of concepts. The second problem investigated was measuring the interpretability of model topics. The standard approach to this problem is to measure the semantic coherence of topics, and existing coherence measures are based on computing the coherence of the set of words associated with a topic. These measures proved unsuitable for transient media topics characterised by semantically unrelated words. A new class of media topic coherence measures based on topic-related documents is therefore proposed. Evaluation of the proposed measures on sets of English and Croatian media topics identified the best measure as one that computes coherence by aggregating the local connectivity of a document graph. A quantitative and qualitative comparison of the developed document-based coherence measures with existing word-based coherence measures revealed that the two types of measures are complementary. The third problem investigated is topic coverage, motivated by data from the application of the media agenda analysis procedure, which showed that a single topic model covers only a portion of all discovered concepts. The coverage problem extends beyond the domain of media texts and, despite its importance, previous research on it has been rudimentary. The coverage problem is considered in its generality and defined as the problem of measuring the overlap between the set of automatically learned model topics and a set of reference topics containing human-identified concepts. A method for constructing a set of reference topics is proposed, together with two coverage measurement methods based on computing topic matches. The proposed measures are evaluated on two heterogeneous data sets, one from the media domain and one from biology, and applied to the analysis of four different classes of standard topic models. The final step of the research was to improve the agenda analysis procedure on the basis of the proposed topic model evaluation methods and the experience gained from applying the procedure to the analysis of Croatian and American media. The main improvements concern the exploratory analysis step, i.e. topic discovery, and are based on the developed measures of topic coverage and document-based coherence; their aim is faster discovery of a larger number of concepts. The remaining improvements concern increasing the efficiency of interpreting model topics. In the course of the research, a number of problems related to the use, construction, storage, and retrieval of topic models and related resources were identified. These problems arise when implementing a graphical user interface for the procedure and when running the evaluation experiments. They were addressed systematically by designing a framework for building and managing resources in topic modelling. The architecture of the framework rests on four principles that together define a general and flexible method for building software for applying and evaluating topic models. A graphical user interface for exploratory analysis and for supporting the measurement of topic prevalence was also developed, as well as an application for building collections of media texts that gathers texts from a range of web sources over a longer period of time.
This thesis focuses on computational methods for media agenda analysis based on topic models and methods of topic model evaluation. The goal of a media agenda analysis is gaining insights into the structure and frequency of media topics. Such analyses are of interest for social scientists studying news media, journalists, media analysts, and other commercial and political actors. Computational methods for media agenda analysis enable automatic discovery of topics in large corpora of news text and measuring of topics' frequency. Data obtained by such analyses provides insights into the type and structure of topics occurring in the media, enables the analysis of topic cooccurrence, and analysis of correlation between topics and other variables such as text metadata and human perception of topic significance. The goal of the research presented in the thesis is the development of efficient computational methods for the discovery of topics that constitute the media agenda and methods for measuring the frequencies of these topics. The proposed methods are based on topic models – a class of unsupervised machine learning models widely used for exploratory analysis of topical text structure. The research encompasses the development of applications of topic models for discovery of media topics and for measuring topics' frequency, as well as the development of methods for improvement and facilitation of these applications.
The improvement and facilitation methods encompass methods of topic model evaluation and software tools for working with topic models. Methods of topic model evaluation can be used for selection of high-quality models and for accelerating the process of topic discovery. Namely, topic models are a useful tool, but due to the stochasticity of the model learning algorithms the quality of learned topics varies. For this reason the methods of topic model evaluation have the potential to increase the efficiency of the methods based on topic models. The media agenda consists of a set of topics discussed in the media, and the problem of media agenda analysis consists of two sub-tasks: discovery of the topics on the agenda and measuring the frequencies of these topics. The first contribution of the thesis is a method for media agenda analysis based on topic models that builds upon previous approaches to the problem and addresses their deficiencies. Three notable deficiencies are: the usage of a single topic model for topic discovery, the lack of a possibility to define new topics that match the analyst's interests, and the lack of precise evaluation of methods for measuring topics' frequency. In addition to addressing the identified deficiencies, the method also systematizes the previous approaches to the problem and is evaluated in two case studies of media agenda analysis. The proposed experimental method for media agenda analysis consists of three steps: the topic discovery, topic definition, and topic measuring steps. In order to achieve better topic coverage, the discovery step is based not on a single model but on a set of topic models. The type and number of topic models used depend on available model implementations and the time available for topic annotation, while the hyperparameter defining the number of model topics depends on the desired generality of learned topics. Reasonable default settings for model construction are proposed based on the existing agenda analysis studies, and an iterative procedure for tuning the number of topics is described. After the topic models are constructed, topic discovery is performed by human inspection and interpretation of the topics. Topic interpretation produces semantic topics (concepts) that are recorded in a reference table of semantic topics that serves as a record of topics and as a tool for synchronization of human annotators. After all the model topics are inspected, annotators can optionally perform the error-correcting step of revising the semantic topics, as well as the step of building a taxonomy of semantic topics. Topic discovery is supported with a graphical user interface developed for topic inspection and annotation. The step of topic definition is based on semantic topics obtained during topic discovery. The purpose of topic definition is to define new semantic topics that closely match the analyst's exact interests. The possibility of defining new semantic topics is an important difference between the proposed and the existing media agenda analysis approaches. Namely, the existing approaches base the analysis only on model-produced topics, although there is no guarantee that these topics will match the concepts of interest to the analyst. During topic definition, the analyst infers definitions of new semantic topics based on previously discovered topics and describes these topics with word lists. Discovered semantic topics that already closely match the concepts of interest are used without modification.
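
A minimal sketch of the discovery step as described, assuming scikit-learn's LDA and NMF as the available model implementations and two topic counts; top words per topic are printed for human inspection and annotation.

```python
# Hedged sketch of the discovery step: fit several topic models with
# different topic counts and list top words per topic for human inspection.
# LDA and NMF from scikit-learn stand in for the models used in the thesis.
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.decomposition import LatentDirichletAllocation, NMF

texts = fetch_20newsgroups(remove=("headers", "footers", "quotes")).data[:2000]

def top_words(model, feature_names, n=8):
    """Highest-weighted words for each topic of a fitted model."""
    return [[feature_names[i] for i in comp.argsort()[-n:][::-1]]
            for comp in model.components_]

for k in (10, 25):                                # vary the number of topics
    counts = CountVectorizer(max_df=0.5, min_df=5, stop_words="english")
    X_counts = counts.fit_transform(texts)
    lda = LatentDirichletAllocation(n_components=k, random_state=0).fit(X_counts)

    tfidf = TfidfVectorizer(max_df=0.5, min_df=5, stop_words="english")
    X_tfidf = tfidf.fit_transform(texts)
    nmf = NMF(n_components=k, init="nndsvd", random_state=0).fit(X_tfidf)

    print(f"--- LDA, k={k} ---")
    for t, words in enumerate(top_words(lda, counts.get_feature_names_out())):
        print(t, " ".join(words))
    print(f"--- NMF, k={k} ---")
    for t, words in enumerate(top_words(nmf, tfidf.get_feature_names_out())):
        print(t, " ".join(words))
```
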
During the step of topic measuring, the frequencies of semantic topics obtained during the discovery and definition steps are measured. Topic frequency is defined as the number of news articles in which a topic occurs, and the measuring problem is cast as a problem of multi-label classification in which each news article is tagged with one or more semantic topics. This formulation allows for precise quantitative evaluation of methods for measuring topic frequency. Two measuring methods are considered. The baseline is a supervised method using binary relevance in combination with a linear-kernel SVM model. The second method is a newly proposed weakly supervised approach, in which the measured semantic topics are first described by sets of highly discriminative words, after which a new LDA model is constructed in such a way that the topics of the model correspond to the measured topics, which is achieved via prior probabilities of model topics. The method for selecting words highly discriminative for a semantic topic represents the main difference between the proposed and the previous weakly supervised approaches. This method consists of inspecting, for each measured semantic topic, closely related model topics, and selecting words highly discriminative for the topic by inspecting word-related documents and assessing their correspondence with the topic. The proposed three-step method for media agenda analysis is applied to two media agenda analyses: the analysis of mainstream US political news and the analysis of mainstream Croatian political news in the election period. The applications of the proposed method show that the topic discovery step gives a good overview of the media agenda and leads to the discovery of useful topics, and that the usage of more than one topic model leads to a more comprehensive set of topics. The two analyses also demonstrate the necessity of the proposed topic definition step – in the case of US news, new sensible topics corresponding to issues are pinpointed during this step, while in the case of Croatian election-related news the analysis is based entirely on newly defined semantic topics that describe the pre- and post-election processes. Quantitative evaluation of topic frequency measuring shows that the proposed weakly supervised approach works better than the supervised SVM-based method since it achieves better or comparable performance with less labeling effort. In contrast to the supervised method, weakly supervised models have higher recall and work well for smaller topics. Qualitative evaluation of the measuring models confirms the quality of the proposed approach – measured topic frequency correlates well with real-world events, and the election-related conclusions based on the measuring models are in line with conclusions drawn from social-scientific studies. Observations from the two media agenda analysis studies and the analysis of collected topic data underlined two problems related to methods of topic model evaluation. The first is the problem of measuring topic quality – the studies both confirmed variations in topic quality and indicated the inadequacy of existing word-based measures of topic coherence. The second is the problem of topic coverage – while the data confirm the limited ability of a single topic model to cover all the semantic topics, no methods for measuring topic coverage are available, so it is not possible to identify high-coverage models.
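
A minimal sketch of the supervised baseline named above, binary relevance with a linear SVM (one binary classifier per semantic topic); the texts and labels are toy placeholders, and per-topic frequency is simply the count of positive predictions.

```python
# Hedged sketch of the supervised baseline: binary relevance with a linear
# SVM, i.e. one binary classifier per semantic topic.  Texts and labels are
# toy placeholders; topic frequency is the count of positive predictions.
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC
from sklearn.multiclass import OneVsRestClassifier
from sklearn.preprocessing import MultiLabelBinarizer

train_texts = ["parliament passed the new budget",
               "the striker scored twice in the derby",
               "the budget debate delayed the football broadcast"]
train_topics = [["politics"], ["sport"], ["politics", "sport"]]

mlb = MultiLabelBinarizer()
Y = mlb.fit_transform(train_topics)

# OneVsRestClassifier implements binary relevance: one LinearSVC per topic.
clf = make_pipeline(TfidfVectorizer(), OneVsRestClassifier(LinearSVC()))
clf.fit(train_texts, Y)

new_texts = ["election results announced", "cup final tonight"]
pred = clf.predict(new_texts)
freq = dict(zip(mlb.classes_, np.asarray(pred).sum(axis=0)))
print(freq)                                   # per-topic article counts
```
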
These observations motivated the development of new methods of topic model evaluation – document-based coherence measures and methods for topic coverage analysis. As described, the analysis of topics produced during the applications of topic discovery confirmed variations in topics' quality and underlined the need for better measures of topic quality. The analysis also indicated that existing word-based measures of topic coherence are inadequate for evaluating the quality of media topics, which are often characterized by semantically unrelated word sets. Based on the observation that media topics can be successfully interpreted using topic-related documents, a new class of document-based topic coherence measures is proposed. The proposed measures calculate topic coherence in three steps: selection of topic-related documents, document vectorization, and computation of the coherence score from document vectors. Topic-related documents are selected using a simple model-independent strategy – a fixed number of documents with top document-topic weights is selected. Two families of document vectorization methods are considered. The first family consists of two standard methods based on calculation of word and document frequencies: probabilistic bag-of-words vectorization and tf-idf vectorization. Methods in the second family vectorize documents by aggregating either CBOW or GloVe word embeddings. Three types of methods are considered for coherence score computation: distance-based methods that model coherence via mutual document distance, probability-based methods that model coherence as probabilistic compactness of document vectors, and graph-based methods that model coherence via connectivity of the document graph. The space of all the coherence measures is parametrized and sensible parameter values are defined to obtain a smaller set of several thousand measures. Then the selection and evaluation of the coherence measures is performed, using model topics manually labeled with document-based coherence scores and using the area under the ROC curve (AUC) as the performance criterion. The measures are partitioned into structural categories and the best measure from each category is selected using AUC on the development set as a criterion. These best measures are then evaluated on two test sets containing English and Croatian news topics. The evaluation of document-based coherence measures shows that the graph-based measures achieve the best results. Namely, the best approximators of human coherence scores are the graph-based measures that use frequency-based document vectorization, build sparse graphs of locally connected documents, and calculate coherence by aggregating a local connectivity score such as closeness centrality. Quantitative evaluation of word-based measures confirms the observation that word-based measures fail to approximate document-based coherence scores well, and qualitative evaluation of coherence measures indicates that document- and word-based coherence measures complement each other and should be used in combination to obtain a more complete model of topic coherence. Motivated by the data from the topic discovery steps performed in the two media agenda analyses and by the obvious need to increase the number of topics discovered by a single topic model, the problem of topic coverage is defined and solutions are proposed. This problem occurs in the application of topic models to any text domain, i.e., it is domain-independent and extends beyond applications to media text.
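
A minimal sketch of the best-performing family described above: take a topic's top-weighted documents, tf-idf-vectorize them, connect documents whose similarity exceeds a threshold, and average closeness centrality over the resulting graph; the threshold and document count are illustrative, not the thesis's tuned parameters.

```python
# Hedged sketch of a document-based, graph-based coherence measure: select
# the top documents for a topic, vectorize them with tf-idf, build a sparse
# graph of locally connected documents, and aggregate closeness centrality.
# The similarity threshold and number of documents are illustrative choices.
import numpy as np
import networkx as nx
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def document_coherence(topic_doc_weights, corpus, n_docs=10, sim_threshold=0.1):
    """Coherence of one topic, computed from its highest-weighted documents."""
    top_idx = np.argsort(topic_doc_weights)[::-1][:n_docs]
    docs = [corpus[i] for i in top_idx]

    vectors = TfidfVectorizer(stop_words="english").fit_transform(docs)
    sim = cosine_similarity(vectors)

    graph = nx.Graph()
    graph.add_nodes_from(range(len(docs)))
    for i in range(len(docs)):
        for j in range(i + 1, len(docs)):
            if sim[i, j] >= sim_threshold:        # keep only locally connected documents
                graph.add_edge(i, j)

    # Aggregate a local connectivity score over the document graph.
    centrality = nx.closeness_centrality(graph)
    return float(np.mean(list(centrality.values())))

corpus = ["budget vote in parliament", "parliament debates the budget",
          "football cup final tonight", "recipe for apple pie",
          "new budget proposal announced"]
weights = np.array([0.9, 0.8, 0.1, 0.05, 0.7])     # toy document-topic weights
print(document_coherence(weights, corpus, n_docs=3))
```
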
The problem of topic coverage consists of measuring how automatically learned model topics cover a set of reference topics – topical concepts defined by humans. Two basic aspects of the problem are the reference topics that represent the concepts topic models are expected to cover and the measures of topic coverage that calculate a score measuring the overlap between the model topics and the reference topics. Finally, the third aspect encompasses the evaluation of a set of topic models using a reference set and coverage measures. The coverage experiments are conducted using two datasets that correspond to two separate text domains – news media texts and biological texts. Each dataset contains a text corpus, a set of reference topics, and a set of topic models. Reference topics consist of topics that standard topic models are expected to be able to cover. These topics are constructed by human inspection, selection, and modification of model-learned topics. Both sets of reference topics are representative of useful topics discovered during the process of exploratory text analysis. Two approaches to measuring topic coverage are developed – an approach based on supervised approximation of topic matching and an unsupervised approach based on integrating coverage across a range of topic-matching criteria. The supervised approach is based on building a classification model that approximates human intuition of topic matching. A binary classifier is learned from a set of topic pairs annotated with matching scores. Four standard classification models are considered: logistic regression, support vector machine, random forest, and multi-layer perceptron. Topic pairs are represented as distances between topic-related word and document vectors using four distinct distance measures: cosine, Hellinger, L1, and L2. Model selection and evaluation show that the proposed method approximates human scores very well, and that logistic regression is the best-performing model. The second proposed method for measuring coverage uses a measure of topic distance and a distance threshold to approximate the equality of a reference topic and a model topic. The threshold value is varied, and for each threshold coverage is calculated as the proportion of reference topics that are matched by at least one model topic at a distance below the threshold. Varying the threshold results in a curve with threshold values on the x-axis and coverage scores on the y-axis. The final coverage score is calculated as the area under this curve. This unsupervised measure of coverage, dubbed the area under the coverage-distance curve, correlates very well with the supervised measures of coverage, while the curve itself is a useful tool for visual analysis of topic coverage. This measure enables users to quickly perform coverage measurements on new domains, without the need to annotate topic pairs in order to construct a supervised coverage measure. Using the proposed coverage measures and the two sets of reference topics, coverage experiments in two distinct text domains are performed. The experiments consist of measuring the coverage obtained by a set of topic models of distinct types constructed using different hyperparameters. In addition to demonstrating the application of the coverage methods, the experiments show that the NMF model has high coverage scores, is robust to domain change, and is able to discover topics at a high level of precision. The nonparametric model based on Pitman-Yor priors achieves the best coverage for news topics.
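
A minimal sketch of the unsupervised coverage measure (area under the coverage-distance curve), assuming topics represented as word distributions and cosine distance as the matching criterion; the toy topics below are random placeholders.

```python
# Hedged sketch of the unsupervised coverage measure described above (area
# under the coverage-distance curve).  Topics are represented as toy word
# distributions and cosine distance approximates topic matching.
import numpy as np
from scipy.spatial.distance import cdist

def coverage_at(model_topics, reference_topics, threshold):
    """Fraction of reference topics matched by >= 1 model topic within threshold."""
    dist = cdist(reference_topics, model_topics, metric="cosine")
    return float(np.mean(dist.min(axis=1) <= threshold))

def area_under_coverage_distance_curve(model_topics, reference_topics, steps=101):
    thresholds = np.linspace(0.0, 1.0, steps)
    coverages = np.array([coverage_at(model_topics, reference_topics, t)
                          for t in thresholds])
    # Trapezoidal integration of the coverage-distance curve.
    return float(np.sum((coverages[:-1] + coverages[1:]) / 2 * np.diff(thresholds)))

rng = np.random.default_rng(0)
reference = rng.dirichlet(np.ones(50), size=5)        # 5 reference topics over 50 words
model = np.vstack([reference[:3] + 0.01 * rng.random((3, 50)),   # 3 topics near references
                   rng.dirichlet(np.ones(50), size=7)])          # 7 unrelated topics
print(area_under_coverage_distance_curve(model, reference))
```
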
Two proposed methods of topic model evaluation – document-based coherence measures and methods devised for solving the coverage problem – are applied in order to improve the previously proposed topic-model-based method of media agenda analysis. The improvements refer to the step of topic discovery and lead to quicker discovery of a larger number of concepts. This is achieved by using more interpretable models with higher coverage, and by ordering model topics, before