250 research outputs found

    Cluster Based Term Weighting Model for Web Document Clustering

    The term weight is based on the frequency with which the term appears in that document. A term weighting scheme measures the importance of a term with respect to a document and a collection: a term with a higher weight is more important than a term with a lower weight. A document ranking model uses these term weights to find the rank of a document in a collection. We propose a cluster-based term weighting model based on the TF-IDF model. This model updates the inter-cluster and intra-cluster frequency components, using the generated clusters as a reference to improve the retrieval of relevant documents. These inter-cluster and intra-cluster frequency components are used to weight the importance of a term in addition to the term and document frequency components.
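The idea of scaling TF-IDF by cluster frequencies can be sketched as follows. This is a minimal illustrative implementation, not the authors' exact formulation: the function names, the `(1 + intra) / (1 + inter)` scaling, and the toy data are assumptions.

```python
# Hedged sketch of cluster-based TF-IDF: terms frequent inside the
# document's own cluster but rare in other clusters get a higher weight.
import math

def cluster_tfidf(term, doc, docs, clusters):
    # classic components: term frequency and inverse document frequency
    tf = doc.count(term)
    df = sum(1 for d in docs if term in d)
    idf = math.log(len(docs) / df) if df else 0.0
    # locate the cluster containing this document
    own = next(c for c in clusters if doc in c)
    # intra-cluster frequency: share of the own cluster containing the term
    intra = sum(1 for d in own if term in d) / len(own)
    # inter-cluster frequency: share of all other documents containing it
    others = [d for c in clusters if c is not own for d in c]
    inter = (sum(1 for d in others if term in d) / len(others)) if others else 0.0
    # illustrative scaling: reward intra-cluster, penalize inter-cluster spread
    return tf * idf * (1 + intra) / (1 + inter)

docs = [["web", "cluster", "term"], ["cluster", "weight"], ["car", "engine"]]
clusters = [[docs[0], docs[1]], [docs[2]]]
w = cluster_tfidf("cluster", docs[0], docs, clusters)
```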

    CLUSTER-BASED TERM WEIGHTING AND DOCUMENT RANKING MODELS

    A term weighting scheme measures the importance of a term in a collection. A document ranking model uses these term weights to find the rank or score of a document in a collection. We present a series of cluster-based term weighting and document ranking models based on the TF-IDF and Okapi BM25 models. These models update the inter-cluster and intra-cluster frequency components based on the generated clusters. The inter-cluster and intra-cluster frequency components are used to weight the importance of a term in addition to the term and document frequency components. In this thesis, we show that these models outperform the TF-IDF and Okapi BM25 models in document clustering and ranking.
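For reference, the Okapi BM25 baseline that the thesis extends can be sketched as below; the cluster-based variants would additionally scale each term's weight by intra-/inter-cluster frequency, in a form specific to the thesis. The parameter defaults and toy corpus are illustrative assumptions.

```python
# Hedged sketch of the standard Okapi BM25 document scoring function.
import math

def bm25(query_terms, doc, docs, k1=1.2, b=0.75):
    N = len(docs)
    avgdl = sum(len(d) for d in docs) / N  # average document length
    score = 0.0
    for t in query_terms:
        df = sum(1 for d in docs if t in d)
        # smoothed IDF (the +1 keeps it non-negative for common terms)
        idf = math.log((N - df + 0.5) / (df + 0.5) + 1)
        tf = doc.count(t)
        # term-frequency saturation with document-length normalization
        score += idf * tf * (k1 + 1) / (tf + k1 * (1 - b + b * len(doc) / avgdl))
    return score

docs = [["term", "weight", "model"], ["rank", "model"], ["cluster", "rank"]]
s = bm25(["rank", "model"], docs[1], docs)
```

A document matching both query terms scores above one matching only a single term, as expected.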

    Pre Processing Techniques for Arabic Documents Clustering

    Clustering of text documents is an important technique for document retrieval. It aims to organize documents into meaningful groups or clusters. Text preprocessing plays a major role in enhancing the clustering of Arabic documents. This research examines and compares text preprocessing techniques for Arabic document clustering. It studies the effectiveness of term pruning, term weighting using TF-IDF, morphological analysis techniques (root-based stemming, light stemming, and raw text), and normalization. The experimental work examined the effect of the clustering algorithm by comparing the most widely used partitional algorithm, K-means, with another partitional algorithm, Expectation Maximization (EM). The Euclidean and Manhattan distance functions were also compared to determine which produces the better clustering results. Results were obtained by evaluating the clustered documents under many combinations of preprocessing techniques. The experiments show that document clustering can be enhanced by applying TF-IDF term weighting and term pruning with a small minimum term frequency. Among the morphological analysis techniques, light stemming is found to be more appropriate than root-based stemming and raw text. Normalization also improves the clustering of Arabic documents and enhances the evaluation results.
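The preprocessing steps compared in the study can be sketched as below. The normalization rules, affix lists, and pruning threshold are simplified illustrative assumptions, not the paper's exact rules.

```python
# Hedged sketch of Arabic preprocessing: normalization, light stemming,
# and term pruning (rules are simplified for illustration).
import re
from collections import Counter

def normalize(token):
    # unify common Arabic letter variants (alef forms, taa marbuta, yaa)
    token = re.sub("[إأآا]", "ا", token)
    token = token.replace("ة", "ه").replace("ى", "ي")
    return token

def light_stem(token):
    # strip a few frequent prefixes/suffixes instead of extracting the root
    for pre in ("وال", "بال", "ال"):
        if token.startswith(pre) and len(token) > len(pre) + 2:
            token = token[len(pre):]
            break
    for suf in ("ها", "ات", "ون", "ين"):
        if token.endswith(suf) and len(token) > len(suf) + 2:
            token = token[:-len(suf)]
            break
    return token

def prune(docs, min_tf=2):
    # term pruning: drop terms whose corpus frequency is below min_tf
    freq = Counter(t for d in docs for t in d)
    return [[t for t in d if freq[t] >= min_tf] for d in docs]
```

Light stemming keeps more of the surface form than root extraction, which is consistent with the paper's finding that it preserves meaning better for clustering.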

    Open Directory Project based universal taxonomy for Personalization of Online (Re)sources

    Content personalization relies on the ability to classify content into (predefined) thematic units or information domains. Content nodes in a single thematic unit are related to a greater or lesser extent. An existing connection between two available content nodes assumes that the user will be interested in both resources (but not necessarily to the same extent). Such a connection (and its value) can be established through the process of automatic content classification and labeling. One approach to classifying content nodes is to use a predefined classification taxonomy. With such a taxonomy it is possible to automatically classify and label existing content nodes, as well as to create additional descriptors for future use in content personalization and recommendation systems. For these purposes, existing web directories can be used to create a universal, purely content-based classification taxonomy. This work analyzes the Open Directory Project (ODP) web directory and proposes a novel use of its structure and content as the basis for such a classification taxonomy. The goal of a unified classification taxonomy is to allow content personalization from heterogeneous sources. We focus on the overall quality of ODP as the basis for such a classification taxonomy and on the use of its hierarchical structure for automatic labeling. Due to the structure of data in ODP, different grouping schemes are devised and tested to find the optimal content and structure combination for the proposed classification taxonomy and the automatic labeling processes. The results provide an in-depth analysis of ODP and of ODP-based content classification and automatic labeling models. Although the use of ODP is well documented, its overall quality as the basis for a universal classification taxonomy has not been evaluated to date.
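The automatic labeling step described above can be sketched as a nearest-category match: a content node is assigned the taxonomy category whose term profile it overlaps most. The category names follow ODP's top levels, but the term profiles and the overlap measure are tiny illustrative stand-ins, not the work's actual models.

```python
# Hedged sketch of taxonomy-based automatic labeling by term overlap.
taxonomy = {
    "Computers": {"software", "web", "programming"},
    "Science": {"physics", "biology", "research"},
    "Sports": {"football", "tennis", "league"},
}

def label(node_terms, taxonomy):
    # score each category by the number of shared terms, pick the best
    scores = {cat: len(node_terms & profile) for cat, profile in taxonomy.items()}
    return max(scores, key=scores.get)

cat = label({"web", "programming", "tennis"}, taxonomy)
```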

    Exploiting Ontology Recommendation Using Text Categorization Approach

    The Semantic Web is considered the backbone of Web 3.0, and ontologies are an integral part of it. Although an increase in ontologies across different domains is reported, owing to benefits that include handling data heterogeneity, automated information analysis, and reusability, finding an appropriate ontology for a user requirement remains a cumbersome task because of the time and effort required, the need for context awareness, and computational complexity. To overcome these issues, an ontology recommendation framework is proposed. The proposed framework employs text categorization and unsupervised learning techniques. Its benefits are twofold: 1) ontology organization according to the opinions of domain experts, and 2) ontology recommendation with respect to user requirements. Moreover, an evaluation model is also proposed to assess the effectiveness of the framework in terms of ontology organization and recommendation. The main consequences of the proposed framework are that 1) the ontologies of a corpus can be organized effectively, 2) no manual effort or time is required to select an appropriate ontology, 3) computational complexity is limited to the unsupervised learning step, and 4) because no context awareness is required, the framework can be applied to any corpus or online library of ontologies.
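The recommendation step can be sketched as matching a user requirement against ontology descriptions by cosine similarity over bag-of-words vectors. The ontology names and descriptions are illustrative assumptions, and the corpus-organization (unsupervised clustering) step is omitted for brevity.

```python
# Hedged sketch of requirement-to-ontology matching by cosine similarity.
import math
from collections import Counter

def cosine(a, b):
    # cosine similarity between two term lists as bag-of-words vectors
    va, vb = Counter(a), Counter(b)
    dot = sum(va[t] * vb[t] for t in va)
    na = math.sqrt(sum(v * v for v in va.values()))
    nb = math.sqrt(sum(v * v for v in vb.values()))
    return dot / (na * nb) if na and nb else 0.0

def recommend(requirement, ontologies):
    # rank ontology descriptions by similarity to the user requirement
    return max(ontologies, key=lambda name: cosine(requirement, ontologies[name]))

ontologies = {
    "pizza.owl": ["pizza", "topping", "food"],
    "travel.owl": ["hotel", "trip", "destination"],
}
best = recommend(["food", "pizza", "menu"], ontologies)
```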