4 research outputs found

Comparing LDA with pLSI as a Dimensionality Reduction Method in Document Clustering

Author: A.P. Dempster
D. Blei
F.-C. Hsu
K. Barnard
K. Nigam
K. Rose
M. Fattori
T. Griffiths
Publication venue: 'Springer Science and Business Media LLC'
Publication date: 01/01/2008

In this paper, we compare latent Dirichlet allocation (LDA) with probabilistic latent semantic indexing (pLSI) as a dimensionality reduction method and investigate their effectiveness in document clustering by using real-world document sets. For clustering of documents, we use a method based on a multinomial mixture, which is known as an efficient framework for text mining. Clustering results are evaluated by F-measure, i.e., the harmonic mean of precision and recall. We use Japanese and Korean Web articles for evaluation and regard the category assigned to each Web article as the ground truth for the evaluation of clustering results. Our experiment shows that dimensionality reduction via LDA and pLSI results in document clusters of almost the same quality as those obtained by using the original feature vectors. Therefore, we can reduce the vector dimension without degrading cluster quality. Further, both LDA and pLSI are more effective than random projection, the baseline method in our experiment. However, our experiment shows no meaningful difference between LDA and pLSI. This result suggests that LDA does not replace pLSI, at least for dimensionality reduction in document clustering.

The original publication is available at www.springerlink.com. Large-scale Knowledge Resources: Construction and Application - Third International Conference on Large-scale Knowledge Resources, Lkr 2008, Tokyo, Japan, March 3-5, 2008, Proceedings
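The F-measure mentioned in the abstract can be illustrated with a minimal Python sketch. It scores a single cluster against a single ground-truth category; the function name `f_measure` and the document IDs are hypothetical, and the abstract does not say how the authors aggregate scores across clusters, so that step is left out.

```python
def f_measure(cluster: set, category: set) -> float:
    """Harmonic mean of precision and recall for one cluster vs. one category."""
    overlap = len(cluster & category)
    if overlap == 0:
        # No shared documents: precision and recall are both zero.
        return 0.0
    precision = overlap / len(cluster)   # share of the cluster that is correct
    recall = overlap / len(category)     # share of the category that was found
    return 2 * precision * recall / (precision + recall)


# Hypothetical document IDs, for illustration only.
cluster = {1, 2, 3, 4}
category = {3, 4, 5, 6, 7, 8}
score = f_measure(cluster, category)
```

Here precision is 2/4 and recall is 2/6, so the harmonic mean sits closer to the smaller of the two values.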

Crossref

Nagasaki University's Academic Output SITE: NAOSITE

Institutional Repositories DataBase (IRDB)

Fast and modular regularized topic modelling

Author: Denis Kochedykov
Konstantin Vorontsov
Lev Golitsyn
Murat Apishev
Publication venue: 'Institute of Electrical and Electronics Engineers (IEEE)'
Publication date: 01/11/2017

Topic modelling is an area of text mining that has been actively developed in the last 15 years. A probabilistic topic model extracts a set of hidden topics from a collection of text documents. It defines each topic by a probability distribution over words and describes each document with a probability distribution over topics. In applications, there are often many requirements, such as problem-specific knowledge and additional data, to be taken into account. Therefore, it is natural for topic modelling to be considered a multiobjective optimization problem. However, historically, Bayesian learning became the most popular approach for topic modelling. In the Bayesian paradigm, all requirements are formalized in terms of a probabilistic generative process. This approach is not always convenient due to some limitations and technical difficulties. In this work, we develop a non-Bayesian multiobjective approach called the Additive Regularization of Topic Models (ARTM). It is based on regularized Maximum Likelihood Estimation (MLE), and we show that many of the well-known Bayesian topic models can be re-formulated in a much simpler way using the regularization point of view. We review some of the most important types of topic models: multimodal, multilingual, temporal, hierarchical, graph-based, and short-text. The ARTM framework enables easy combination of different types of models to create new models with the desired properties for applications. This modular “lego-style” technology for topic modelling is implemented in the open-source library BigARTM.
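The regularized maximum likelihood objective that the abstract refers to is commonly written as follows. The notation is an assumption taken from the usual ARTM presentation rather than from the abstract itself: n_{dw} is the count of word w in document d, \phi_{wt} = p(w | t), \theta_{td} = p(t | d), R_i are the regularizers and \tau_i their non-negative weights.

```latex
\[
  \sum_{d \in D} \sum_{w \in d} n_{dw} \ln \sum_{t \in T} \phi_{wt}\,\theta_{td}
  \;+\; \sum_{i=1}^{k} \tau_i \, R_i(\Phi, \Theta)
  \;\to\; \max_{\Phi,\,\Theta},
\]
\[
  \text{subject to}\quad
  \phi_{wt} \ge 0,\ \sum_{w \in W} \phi_{wt} = 1,
  \qquad
  \theta_{td} \ge 0,\ \sum_{t \in T} \theta_{td} = 1 .
\]
```

The first term is the log-likelihood of the collection. Each additional requirement enters as one more weighted term R_i, which is what makes the combination of models additive and modular.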

Directory of Open Access Journals

High performance latent dirichlet allocation for text mining

Author: Liu Zelong
Publication venue: Brunel University School of Engineering and Design PhD Theses
Publication date: 01/01/2013

This thesis was submitted for the degree of Doctor of Philosophy and awarded by Brunel University. Latent Dirichlet Allocation (LDA), a total probability generative model, is a three-tier Bayesian model. LDA computes the latent topic structure of the data and obtains the significant information of documents. However, traditional LDA has several limitations in practical applications. LDA cannot be directly used in classification because it is an unsupervised learning model. It needs to be embedded into appropriate classification algorithms. As a generative model, LDA normally generates latent topics in categories to which the target documents do not belong, producing a deviation in computation and reducing classification accuracy. The number of topics in LDA greatly influences the learning process of the model parameters. Noise samples in the training data also affect the final text classification result, and the quality of LDA-based classifiers depends to a great extent on the quality of the training samples. Although parallel LDA algorithms have been proposed to deal with huge amounts of data, balancing computing loads in a computer cluster poses another challenge. This thesis presents a text classification method which combines the LDA model and the Support Vector Machine (SVM) classification algorithm for improved classification accuracy when reducing the dimension of datasets. Based on Density-Based Spatial Clustering of Applications with Noise (DBSCAN), the algorithm automatically optimizes the number of topics to be selected, which reduces the number of iterations in computation. Furthermore, this thesis presents a noise data reduction scheme to process noise data. When the noise ratio is large in the training data set, the noise reduction scheme can always produce a high level of accuracy in classification.
Finally, the thesis parallelizes LDA using the MapReduce model, which is the de facto computing standard in supporting data-intensive applications. A genetic-algorithm-based load balancing algorithm is designed to balance the workloads among computers in a heterogeneous MapReduce cluster where the computers have a variety of computing resources in terms of CPU speed, memory space and hard disk space.
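As a rough illustration of the MapReduce pattern mentioned in the abstract, the Python sketch below merges per-shard topic-word counts, the kind of statistic a parallel LDA sampler typically needs to combine. It is not the thesis's algorithm: the function names `map_shard` and `reduce_counts`, the toy data, and the choice to aggregate (topic, word) counts are all assumptions, and the genetic load-balancing step is not shown.

```python
from collections import Counter
from functools import reduce


def map_shard(assignments):
    """Map step: count (topic, word) pairs within one shard of documents."""
    return Counter(assignments)


def reduce_counts(left, right):
    """Reduce step: merge two partial count tables into one."""
    return left + right


# Hypothetical topic assignments, split across two shards (e.g. two workers).
shards = [
    [(0, "topic"), (1, "cluster"), (0, "topic")],
    [(1, "cluster"), (0, "model")],
]

# Each shard is mapped independently; the partial results are then folded.
global_counts = reduce(reduce_counts, map(map_shard, shards), Counter())
```

Because the map step touches only its own shard, shards could be processed on different machines, and only the compact count tables need to be exchanged.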

Brunel University Research Archive

Is operational research in UK universities fit-for-purpose for the growing field of analytics?

Author: Michael J. Mortenson
Publication date: 01/01/2018

Over the last decade, considerable interest has been generated in the use of analytical methods in organisations. Along with this, many have reported a significant gap between organisational demand for analytically trained staff and the number of potential recruits qualified for such roles. This interest is of high relevance to the operational research discipline, both in terms of raising the profile of the field and in the teaching and training of graduates to fill these roles. What is less clear, however, is the extent to which operational research teaching in universities, or indeed teaching on the various courses labelled as "analytics", offers a curriculum that can prepare graduates for these roles. It is within this space that this research is positioned, specifically seeking to analyse the suitability of current provisions, limited to master's education in UK universities, and to make recommendations on how curricula may be developed. To do so, a mixed methods research design, in the pragmatic tradition, is presented. This includes a variety of research instruments. Firstly, a computational literature review is presented on analytics, assessing (amongst other things) the amount of research into analytics from a range of disciplines. Secondly, a historical analysis is performed of the literature regarding elements that can be seen as the precursors of analytics, such as management information systems, decision support systems and business intelligence. Thirdly, an analysis of job adverts is included, utilising an online topic model and correlation analyses. Fourthly, online materials from UK universities concerning relevant degrees are analysed using a bagged support vector classifier and a bespoke module analysis algorithm. Finally, interviews with both potential employers of graduates and academics involved in analytics courses are presented. The results of these separate analyses are synthesised and contrasted.
The outcome of this is an assessment of the current state of the market, some reflections on the role operational research may have, and a framework for the development of analytics curricula. The principal contribution of this work is practical: providing tangible recommendations on curricula design and development, as well as to the operational research community in general with respect to how it may react to the growth of analytics. Additional contributions are made with respect to methodology, with a novel, mixed-method approach employed, and to theory, with insights as to the nature of how trends develop in both the jobs market and in academia. It is hoped that the insights here may be of value to course designers seeking to react to similar trends in a wide range of disciplines and fields.

Loughborough University Institutional Repository