20,731 research outputs found

    Web Document Clustering Using Document Index Graph

    Document clustering is an important tool for many Information Retrieval (IR) tasks. The huge increase in the amount of information on the web poses new challenges for clustering, both in the underlying data model and in the nature of the clustering algorithm. Document clustering techniques mostly rely on single-term analysis of the document data set. More informative features, such as phrases, are important for achieving more accurate document clustering. The first part of the paper therefore presents a phrase-based model, the Document Index Graph (DIG), which allows incremental phrase-based encoding of documents and efficient phrase matching. It emphasizes the effectiveness of a phrase-based similarity measure over traditional single-term-based similarities. In the second part, a Document Index Graph based Clustering (DIGBC) algorithm is proposed to enhance the DIG model for incremental and soft clustering. This algorithm incrementally clusters documents based on the proposed cluster-document similarity measure and allows a document to be assigned to more than one cluster. The DIGBC algorithm is more efficient than existing clustering algorithms such as single pass, K-NN, and Hierarchical Agglomerative Clustering (HAC).
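The incremental phrase matching mentioned in this abstract can be pictured with a toy word-adjacency index. This is only a minimal sketch under stated assumptions: the class name `PhraseIndex`, whitespace tokenization, and the restriction to two-word phrases are all hypothetical simplifications, not the DIG structure defined in the paper.

```python
from collections import defaultdict


class PhraseIndex:
    """Toy index: each edge (w1, w2) of consecutive words maps to the documents containing it."""

    def __init__(self):
        self.edges = defaultdict(set)

    def add_document(self, doc_id, sentences):
        """Index a document; return the word pairs it shares with each earlier document."""
        new_edges = set()
        for sentence in sentences:
            words = sentence.lower().split()
            new_edges.update(zip(words, words[1:]))

        # Look up matches before registering the new document,
        # so a document is never reported as matching itself.
        shared = defaultdict(set)
        for edge in new_edges:
            for other in self.edges[edge]:
                shared[other].add(edge)

        for edge in new_edges:
            self.edges[edge].add(doc_id)
        return dict(shared)
```

Because matches are found while a document is being added, no pass over the previously indexed documents is needed, which is the sense in which the encoding is incremental.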

    Clustering Indonesian-Language News Documents Using Document Index Graph

    ABSTRACT: Electronic news is currently the most popular and interactive information medium, and it is growing quickly as a result. This is evidenced by the growing number of company and personal sites, which means more information and data. This rapid increase is also driven by internet use, which has grown compared with earlier years. As a consequence, the volume of information is increasing exponentially. This abundance of data should bring correspondingly many benefits. Clustering is one method for grouping documents by finding the relationships between them. Currently, most clustering methods, for example the Vector Space Model, rely only on term-based similarity and ignore other aspects such as phrase similarity. This final project clusters documents with the Document Index Graph method, which combines two document similarities: term-based similarity and phrase-based similarity. The method is tested on a sample of Indonesian-language news articles from web-based mass media. Choosing an appropriate similarity blend factor and similarity threshold improves cluster quality. The clustering results are evaluated using precision and recall. Keywords: clustering, Document, Index, Graph, similarity.
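The role of the similarity blend factor can be illustrated with a short sketch. Everything here is an assumption made for illustration: the function names, the use of word bigrams as a stand-in for phrases, the Jaccard overlap, and the linear blend are hypothetical choices, not the formulas used in the thesis.

```python
import math
from collections import Counter


def term_similarity(a, b):
    """Cosine similarity between the term-frequency vectors of two token lists."""
    ca, cb = Counter(a), Counter(b)
    dot = sum(ca[t] * cb[t] for t in ca.keys() & cb.keys())
    norm_a = math.sqrt(sum(v * v for v in ca.values()))
    norm_b = math.sqrt(sum(v * v for v in cb.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def phrase_similarity(a, b):
    """Jaccard overlap of word bigrams (an assumed stand-in for shared phrases)."""
    pa, pb = set(zip(a, a[1:])), set(zip(b, b[1:]))
    union = pa | pb
    return len(pa & pb) / len(union) if union else 0.0


def blended_similarity(a, b, blend=0.5):
    """Weight phrase similarity by `blend` and term similarity by `1 - blend`."""
    return blend * phrase_similarity(a, b) + (1 - blend) * term_similarity(a, b)
```

With two documents that contain the same words in a different order, the term-based score is at its maximum while the phrase-based score is lower, so the blend factor controls how much word order counts. A similarity threshold would then decide whether the blended score is high enough to place a document in a cluster.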

    Abstractive Multi-Document Summarization via Phrase Selection and Merging

    We propose an abstraction-based multi-document summarization framework that can construct new sentences by exploring syntactic units finer-grained than sentences, namely noun and verb phrases. Unlike existing abstraction-based approaches, our method first constructs a pool of concepts and facts represented by phrases from the input documents. New sentences are then generated by selecting and merging informative phrases so as to maximize the salience of the phrases while satisfying the sentence construction constraints. We employ integer linear optimization to conduct phrase selection and merging simultaneously in order to achieve a globally optimal solution for a summary. Experimental results on the benchmark data set TAC 2011 show that our framework outperforms the state-of-the-art models under the automated pyramid evaluation metric and achieves reasonably good results on manual linguistic quality evaluation. Comment: 11 pages, 1 figure, accepted as a full paper at ACL 201

    EXPLOITING N-GRAM IMPORTANCE AND ADDITIONAL KNOWLEDGE BASED ON WIKIPEDIA FOR IMPROVEMENTS IN GAAC BASED DOCUMENT CLUSTERING

    This paper provides a solution to the issue: “How can we use Wikipedia based concepts in document clustering with lesser human involvement, accompanied by effective improvements in result?” In the devised system, we propose a method to exploit the importance of N-grams in a document and use Wikipedia-based additional knowledge for GAAC based document clustering. The importance of an N-gram in a document depends on several features including, but not limited to, its frequency, the position of its occurrence in a sentence, and the position in the document of the sentence in which it occurs. First, we introduce a new similarity measure that takes the weighted N-gram importance into account when calculating similarity during document clustering. As a result, the chances of topical similarity in clustering are improved. Second, we use Wikipedia as an additional knowledge base, both to remove noisy entries from the extracted N-grams and to reduce the information gap between conceptually related N-grams that do not match owing to differences in writing scheme or strategies. Our experimental results on the publicly available text dataset clearly show that our devised system achieves a significant improvement in performance over bag-of-words based state-of-the-art systems in this area.

    Using noun phrases extraction for the improvement of hybrid clustering with text- and citation-based components. The example of “Information Systems Research”

    The hybrid clustering approach combining lexical and link-based similarities suffered for a long time from the different properties of the underlying networks. We propose a method based on noun phrase extraction using natural language processing to improve the measurement of the lexical component. Term shingles of different lengths are created from each of the extracted noun phrases. Hybrid networks are built based on a weighted combination of the two types of similarities with seven different weights. We conclude that removing all single-term shingles provides the best results in terms of computational feasibility, comparability with bibliographic coupling, and also in a community detection application.
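As a rough illustration of creating term shingles from a noun phrase, the sketch below enumerates every contiguous word sequence of a phrase. The function name `shingles`, the `min_len` parameter, and the lowercase whitespace tokenization are assumptions for illustration; the paper's own extraction pipeline is not described in this abstract.

```python
def shingles(phrase, min_len=1):
    """Return all contiguous word sequences of the phrase with at least `min_len` words."""
    tokens = phrase.lower().split()
    return {
        " ".join(tokens[start:start + length])
        for length in range(min_len, len(tokens) + 1)
        for start in range(len(tokens) - length + 1)
    }


# Setting min_len=2 drops all single-term shingles.
multi_term_only = shingles("Information Systems Research", min_len=2)
```

Here `min_len=2` corresponds to the configuration the abstract reports as giving the best results, where single-term shingles are removed.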