250 research outputs found

    Cluster Based Term Weighting Model for Web Document Clustering

    The term weight is based on the frequency with which the term appears in that document. A term weighting scheme measures the importance of a term with respect to a document and a collection: a term with a higher weight is more important than a term with a lower weight. A document ranking model uses these term weights to find the rank of a document in a collection. We propose a cluster-based term weighting model based on the TF-IDF model. This model updates the inter-cluster and intra-cluster frequency components, using the generated clusters as a reference to improve the retrieval of relevant documents. These inter-cluster and intra-cluster frequency components are used to weight the importance of a term in addition to the term and document frequency components.
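The idea of scaling TF-IDF by cluster frequencies can be sketched as follows. This is a minimal illustrative implementation, not the authors' exact formulation: the function names, the `(1 + intra) / (1 + inter)` scaling, and the toy data are assumptions.

```python
# Hedged sketch of cluster-based TF-IDF: terms frequent inside the
# document's own cluster but rare in other clusters get a higher weight.
import math

def cluster_tfidf(term, doc, docs, clusters):
    # classic components: term frequency and inverse document frequency
    tf = doc.count(term)
    df = sum(1 for d in docs if term in d)
    idf = math.log(len(docs) / df) if df else 0.0
    # locate the cluster containing this document
    own = next(c for c in clusters if doc in c)
    # intra-cluster frequency: share of the own cluster containing the term
    intra = sum(1 for d in own if term in d) / len(own)
    # inter-cluster frequency: share of all other documents containing it
    others = [d for c in clusters if c is not own for d in c]
    inter = (sum(1 for d in others if term in d) / len(others)) if others else 0.0
    # illustrative scaling: reward intra-cluster, penalize inter-cluster spread
    return tf * idf * (1 + intra) / (1 + inter)

docs = [["web", "cluster", "term"], ["cluster", "weight"], ["car", "engine"]]
clusters = [[docs[0], docs[1]], [docs[2]]]
w = cluster_tfidf("cluster", docs[0], docs, clusters)
```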

    CLUSTER-BASED TERM WEIGHTING AND DOCUMENT RANKING MODELS

    A term weighting scheme measures the importance of a term in a collection. A document ranking model uses these term weights to find the rank or score of a document in a collection. We present a series of cluster-based term weighting and document ranking models based on the TF-IDF and Okapi BM25 models. These models update the inter-cluster and intra-cluster frequency components based on the generated clusters. The inter-cluster and intra-cluster frequency components are used to weight the importance of a term in addition to the term and document frequency components. In this thesis, we show that these models outperform the TF-IDF and Okapi BM25 models in document clustering and ranking.
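For reference, the Okapi BM25 baseline that the thesis extends can be sketched as below; the cluster-based variants would additionally scale each term's weight by intra-/inter-cluster frequency, in a form specific to the thesis. The parameter defaults and toy corpus are illustrative assumptions.

```python
# Hedged sketch of the standard Okapi BM25 document scoring function.
import math

def bm25(query_terms, doc, docs, k1=1.2, b=0.75):
    N = len(docs)
    avgdl = sum(len(d) for d in docs) / N  # average document length
    score = 0.0
    for t in query_terms:
        df = sum(1 for d in docs if t in d)
        # smoothed IDF (the +1 keeps it non-negative for common terms)
        idf = math.log((N - df + 0.5) / (df + 0.5) + 1)
        tf = doc.count(t)
        # term-frequency saturation with document-length normalization
        score += idf * tf * (k1 + 1) / (tf + k1 * (1 - b + b * len(doc) / avgdl))
    return score

docs = [["term", "weight", "model"], ["rank", "model"], ["cluster", "rank"]]
s = bm25(["rank", "model"], docs[1], docs)
```

A document matching both query terms scores above one matching only a single term, as expected.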

    Pre Processing Techniques for Arabic Documents Clustering

    Clustering of text documents is an important technique for document retrieval. It aims to organize documents into meaningful groups or clusters. Text preprocessing plays a major role in enhancing the clustering of Arabic documents. This research examines and compares text preprocessing techniques for Arabic document clustering. It studies the effectiveness of term pruning, term weighting using TF-IDF, morphological analysis techniques (root-based stemming, light stemming, and raw text), and normalization. The experimental work examined the effect of the clustering algorithm by comparing the most widely used partitional algorithm, K-means, with another partitional algorithm, Expectation Maximization (EM). The Euclidean and Manhattan distance functions were also compared to determine which produces the better clustering results. Results were obtained by evaluating the clustered documents under many combinations of preprocessing techniques. The experiments show that document clustering can be enhanced by applying TF-IDF term weighting and term pruning with a small minimum term frequency. Among the morphological analysis techniques, light stemming is found to be more appropriate than root-based stemming and raw text. Normalization also improves the clustering of Arabic documents and enhances the evaluation results.
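The preprocessing steps compared in the study can be sketched as below. The normalization rules, affix lists, and pruning threshold are simplified illustrative assumptions, not the paper's exact rules.

```python
# Hedged sketch of Arabic preprocessing: normalization, light stemming,
# and term pruning (rules are simplified for illustration).
import re
from collections import Counter

def normalize(token):
    # unify common Arabic letter variants (alef forms, taa marbuta, yaa)
    token = re.sub("[إأآا]", "ا", token)
    token = token.replace("ة", "ه").replace("ى", "ي")
    return token

def light_stem(token):
    # strip a few frequent prefixes/suffixes instead of extracting the root
    for pre in ("وال", "بال", "ال"):
        if token.startswith(pre) and len(token) > len(pre) + 2:
            token = token[len(pre):]
            break
    for suf in ("ها", "ات", "ون", "ين"):
        if token.endswith(suf) and len(token) > len(suf) + 2:
            token = token[:-len(suf)]
            break
    return token

def prune(docs, min_tf=2):
    # term pruning: drop terms whose corpus frequency is below min_tf
    freq = Counter(t for d in docs for t in d)
    return [[t for t in d if freq[t] >= min_tf] for d in docs]
```

Light stemming keeps more of the surface form than root extraction, which is consistent with the paper's finding that it preserves meaning better for clustering.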

    Open Directory Project based universal taxonomy for Personalization of Online (Re)sources

    Content personalization relies on the ability to classify content into (predefined) thematic units or information domains. Content nodes in a single thematic unit are related to a greater or lesser extent. An existing connection between two available content nodes assumes that the user will be interested in both resources (but not necessarily to the same extent). Such a connection (and its value) can be established through the process of automatic content classification and labeling. One approach to classifying content nodes is to use a predefined classification taxonomy. With such a taxonomy it is possible to automatically classify and label existing content nodes, as well as to create additional descriptors for future use in content personalization and recommendation systems. For these purposes, existing web directories can be used to create a universal, purely content-based classification taxonomy. This work analyzes the Open Directory Project (ODP) web directory and proposes a novel use of its structure and content as the basis for such a classification taxonomy. The goal of a unified classification taxonomy is to allow content personalization from heterogeneous sources. We focus on the overall quality of ODP as the basis for such a classification taxonomy and on the use of its hierarchical structure for automatic labeling. Due to the structure of data in ODP, different grouping schemes are devised and tested to find the optimal content and structure combination for the proposed classification taxonomy and the automatic labeling processes. The results provide an in-depth analysis of ODP and of ODP-based content classification and automatic labeling models. Although the use of ODP is well documented, its overall quality as the basis for a universal classification taxonomy has not been evaluated to date.
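The automatic labeling step described above can be sketched as a nearest-category match: a content node is assigned the taxonomy category whose term profile it overlaps most. The category names follow ODP's top levels, but the term profiles and the overlap measure are tiny illustrative stand-ins, not the work's actual models.

```python
# Hedged sketch of taxonomy-based automatic labeling by term overlap.
taxonomy = {
    "Computers": {"software", "web", "programming"},
    "Science": {"physics", "biology", "research"},
    "Sports": {"football", "tennis", "league"},
}

def label(node_terms, taxonomy):
    # score each category by the number of shared terms, pick the best
    scores = {cat: len(node_terms & profile) for cat, profile in taxonomy.items()}
    return max(scores, key=scores.get)

cat = label({"web", "programming", "tennis"}, taxonomy)
```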

    Exploiting Ontology Recommendation Using Text Categorization Approach

    The Semantic Web is considered the backbone of Web 3.0, and ontologies are an integral part of it. Although an increase in ontologies across different domains is reported, owing to benefits that include handling data heterogeneity, automated information analysis, and reusability, finding an appropriate ontology for a user requirement remains a cumbersome task because of the time and effort required, the need for context awareness, and computational complexity. To overcome these issues, an ontology recommendation framework is proposed. The proposed framework employs text categorization and unsupervised learning techniques. Its benefits are twofold: 1) ontology organization according to the opinions of domain experts, and 2) ontology recommendation with respect to user requirements. Moreover, an evaluation model is also proposed to assess the effectiveness of the framework in terms of ontology organization and recommendation. The main consequences of the proposed framework are that 1) the ontologies of a corpus can be organized effectively, 2) no manual effort or time is required to select an appropriate ontology, 3) computational complexity is limited to the unsupervised learning step, and 4) because no context awareness is required, the framework can be applied to any corpus or online library of ontologies.
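The recommendation step can be sketched as matching a user requirement against ontology descriptions by cosine similarity over bag-of-words vectors. The ontology names and descriptions are illustrative assumptions, and the corpus-organization (unsupervised clustering) step is omitted for brevity.

```python
# Hedged sketch of requirement-to-ontology matching by cosine similarity.
import math
from collections import Counter

def cosine(a, b):
    # cosine similarity between two term lists as bag-of-words vectors
    va, vb = Counter(a), Counter(b)
    dot = sum(va[t] * vb[t] for t in va)
    na = math.sqrt(sum(v * v for v in va.values()))
    nb = math.sqrt(sum(v * v for v in vb.values()))
    return dot / (na * nb) if na and nb else 0.0

def recommend(requirement, ontologies):
    # rank ontology descriptions by similarity to the user requirement
    return max(ontologies, key=lambda name: cosine(requirement, ontologies[name]))

ontologies = {
    "pizza.owl": ["pizza", "topping", "food"],
    "travel.owl": ["hotel", "trip", "destination"],
}
best = recommend(["food", "pizza", "menu"], ontologies)
```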