
4 research outputs found

An Innovative Aim for Collecting and Retrieving Documents from Web Domain Using SSARC (Spontaneous Sorting and Retrieving Clock) Algorithm

Authors: Dr A Vijaya, V Annapoorani
Publication date: 11/04/2020

ABSTRACT: This paper presents an algorithm for generating and grouping documents from the web. In recent years, due to the wide availability of large document collections and the need to operate on them effectively (for instance, to navigate, analyze, query, and summarize them), there has been an increased emphasis on developing efficient and effective clustering algorithms for large document collections. Our novel algorithm collects all the documents from the web, sorts them in alphabetical order, and stores them in a clockwise structure from which the documents related to the user's query can be easily retrieved. This novel algorithm is called the SSARC algorithm, where SSARC stands for "Spontaneous Sorting and Retrieving Clock". We propose the overall architecture and describe two innovative algorithms that produce a notable improvement over traditional clustering algorithms and form the basis for the query scrutinization and exploration of this algorithm.
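
The abstract gives no implementation details, so the following is only a minimal sketch of one possible reading of "sort alphabetically, store in a clockwise structure, retrieve by query". The class `ClockIndex`, its 26 letter slots, the prefix-matching rule, and the sample titles are all assumptions made for illustration and are not taken from the SSARC paper.

```python
import string
from bisect import insort

LETTERS = string.ascii_lowercase


class ClockIndex:
    """Hypothetical circular index: one slot per letter, like hours on a clock face."""

    def __init__(self):
        self.slots = {letter: [] for letter in LETTERS}

    def add(self, title):
        # Normalize the title and keep each slot sorted alphabetically on insert.
        norm = title.strip().lower()
        if norm and norm[0] in self.slots:
            insort(self.slots[norm[0]], norm)

    def retrieve(self, query):
        # Jump straight to the slot for the query's first letter,
        # then keep the titles that start with the query (assumed matching rule).
        q = query.strip().lower()
        if not q or q[0] not in self.slots:
            return []
        return [t for t in self.slots[q[0]] if t.startswith(q)]

    def clockwise_from(self, letter):
        # Walk all 26 slots once, starting at `letter` and wrapping around after 'z'.
        start = LETTERS.index(letter.lower())
        for step in range(len(LETTERS)):
            yield from self.slots[LETTERS[(start + step) % len(LETTERS)]]


index = ClockIndex()
for title in ["Topic models", "Clustering survey", "Text segmentation", "Web crawling"]:
    index.add(title)

matches = index.retrieve("t")
ordered = list(index.clockwise_from("t"))
```

In this sketch a query only touches one slot, which is the kind of shortcut a clock-like layout might offer over scanning the whole collection; whether the paper's structure works this way is not stated in the abstract.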

CiteSeerX

Topic Modeling for Segment-based Documents

Authors: Andrea Tagarelli, George Karypis, Giovanni Ponti
Publication date: 03/04/2020

Abstract. Statistical topic models have traditionally assumed that a document is an indivisible unit for the generative process, which may not be appropriate for documents that are relatively long and show an explicit multi-topic structure. In this paper we describe a generative model that exploits a given decomposition of documents into smaller, topically cohesive text units, or segments. The key idea is to introduce a new variable in the generative process to model the document segments, so that word generation is related not only to the topics but also to the segments. Moreover, the topic latent variable is directly associated with the segments, rather than with the document as a whole. Experimental results have shown the significance of the proposed model and its better support for the document clustering task compared to other existing generative models.
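
To make the idea of a segment-level generative process concrete, here is a minimal sampling sketch under one assumed factorization: pick a segment given the document, pick a topic given the segment, then pick a word given both the topic and the segment. The abstract does not specify the exact dependencies, so this ordering, the function `sample_word`, and every probability table below are hypothetical toy values, not the authors' model or data.

```python
import random

# Toy conditional distributions (all values are illustrative assumptions).
P_SEG_GIVEN_DOC = {"d1": {"s1": 0.5, "s2": 0.5}}
P_TOPIC_GIVEN_SEG = {
    "s1": {"sports": 0.9, "politics": 0.1},
    "s2": {"sports": 0.2, "politics": 0.8},
}
P_WORD_GIVEN_TOPIC_SEG = {
    ("sports", "s1"): {"goal": 0.7, "match": 0.3},
    ("sports", "s2"): {"match": 0.6, "team": 0.4},
    ("politics", "s1"): {"vote": 0.5, "law": 0.5},
    ("politics", "s2"): {"vote": 0.8, "senate": 0.2},
}


def draw(dist, rng):
    """Sample one key from a {outcome: probability} dictionary."""
    outcomes = list(dist.keys())
    weights = list(dist.values())
    return rng.choices(outcomes, weights=weights, k=1)[0]


def sample_word(doc, rng):
    """Generate one (segment, topic, word) triple for `doc`."""
    segment = draw(P_SEG_GIVEN_DOC[doc], rng)                  # segment depends on the document
    topic = draw(P_TOPIC_GIVEN_SEG[segment], rng)              # topic is tied to the segment
    word = draw(P_WORD_GIVEN_TOPIC_SEG[(topic, segment)], rng) # word depends on topic and segment
    return segment, topic, word


rng = random.Random(0)
samples = [sample_word("d1", rng) for _ in range(5)]
```

The contrast with a document-level topic model is that the topic distribution here is indexed by segment, so two segments of the same document can favor different topics.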

CiteSeerX

A Statistical Model for Topic Segmentation and Clustering

Authors: D. Beeferman, D.M. Blei, L. Pevzner, M. Steyvers, M.A. Hearst, M.M. Shafiei, W.L. Buntine
Publication venue: Springer Science and Business Media LLC
Publication date: 01/01/2008

Crossref