31,461 research outputs found

    Klasterisasi Dokumen Berita berbahasa Indonesia menggunakan Information-Theoretic Co-Clustering (News Document Clustering in Indonesia Language Using Information-Theoretic Co-Clustering)

    ABSTRACT: Text documents are growing rapidly on the internet, in digital libraries, and in articles. Information on the internet is very useful to users, especially news articles in the form of text documents. So many news articles are available online that it is hard for users to find the ones they want. News articles therefore need to be categorized by the information they contain, so that each article can be assigned to a particular topic. Document/article clustering is one method for mining the information contained in documents/articles. Clustering groups related documents, or documents whose information is similar, into the same cluster. Documents are high-dimensional and large in volume, so a method is needed that can handle both. Most clustering algorithms focus on one-way clustering, for example clustering documents by their word distributions or clustering words by their document distributions. Co-clustering was therefore developed to cluster both dimensions of the table simultaneously, which reduces dimensionality effectively and efficiently. Keywords: co-clustering, information theory, mutual information
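
The abstract above does not state the objective it optimizes. As a rough illustration only, the sketch below assumes the common information-theoretic formulation in which a pair of clusterings (one over rows, one over columns of a word-document table) is scored by how much mutual information it loses relative to the unclustered table. The toy counts, the cluster assignments, and the function names are all hypothetical, and no search over clusterings is performed.

```python
import numpy as np

def mutual_information(joint):
    """Mutual information (in nats) of a 2-D table of non-negative counts or probabilities."""
    joint = np.asarray(joint, dtype=float)
    joint = joint / joint.sum()                  # normalize to a joint distribution
    px = joint.sum(axis=1, keepdims=True)        # row marginal, shape (n_rows, 1)
    py = joint.sum(axis=0, keepdims=True)        # column marginal, shape (1, n_cols)
    mask = joint > 0                             # skip zero cells (0 * log 0 = 0)
    return float(np.sum(joint[mask] * np.log(joint[mask] / (px @ py)[mask])))

def coclustered_joint(joint, row_clusters, col_clusters):
    """Aggregate a joint table into a (row-cluster x column-cluster) table."""
    joint = np.asarray(joint, dtype=float)
    joint = joint / joint.sum()
    row_clusters = np.asarray(row_clusters)
    col_clusters = np.asarray(col_clusters)
    R = np.eye(row_clusters.max() + 1)[row_clusters]   # one-hot rows, (n_rows, k)
    C = np.eye(col_clusters.max() + 1)[col_clusters]   # one-hot columns, (n_cols, l)
    return R.T @ joint @ C

# Hypothetical toy table: rows are words, columns are documents.
counts = np.array([
    [5, 4, 0, 0],
    [4, 5, 0, 0],
    [0, 0, 3, 6],
    [0, 0, 6, 3],
])

full_mi = mutual_information(counts)

# A co-clustering that follows the block structure of the table ...
good_mi = mutual_information(coclustered_joint(counts, [0, 0, 1, 1], [0, 0, 1, 1]))
# ... and one that cuts across it.
bad_mi = mutual_information(coclustered_joint(counts, [0, 1, 0, 1], [0, 1, 0, 1]))

print("loss (block-aligned):", full_mi - good_mi)
print("loss (misaligned):   ", full_mi - bad_mi)
```

Under this assumed objective, the block-aligned co-clustering keeps most of the mutual information between words and documents, while the misaligned one loses almost all of it.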

    A PAC-Bayesian Analysis of Graph Clustering and Pairwise Clustering

    We formulate weighted graph clustering as a prediction problem: given a subset of edge weights, we analyze the ability of graph clustering to predict the remaining edge weights. This formulation enables practical and theoretical comparison of different approaches to graph clustering, as well as comparison of graph clustering with other possible ways to model the graph. We adapt the PAC-Bayesian analysis of co-clustering (Seldin and Tishby, 2008; Seldin, 2009) to derive a PAC-Bayesian generalization bound for graph clustering. The bound shows that graph clustering should optimize a trade-off between empirical data fit and the mutual information that clusters preserve on the graph nodes. A similar trade-off derived from information-theoretic considerations was already shown to produce state-of-the-art results in practice (Slonim et al., 2005; Yom-Tov and Slonim, 2009). This paper supports the empirical evidence by providing a better theoretical foundation, suggesting formal generalization guarantees, and offering a more accurate way to deal with finite sample issues. We derive a bound minimization algorithm and show that it provides good results in real-life problems and that the derived PAC-Bayesian bound is reasonably tight.
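
The prediction framing in this abstract can be illustrated with a small sketch: hide some edge weights, predict each hidden weight from the observed weights between the two endpoint clusters, and measure the held-out error. This is a deliberately simplified stand-in, not the paper's algorithm. It uses hard cluster assignments and a plain block-mean predictor, and it omits the mutual-information term of the PAC-Bayesian trade-off. The graph, the held-out edges, and all function names are hypothetical.

```python
import numpy as np

def block_mean_predictions(weights, observed, clusters):
    """Predict every edge weight as the mean observed weight between its endpoint clusters."""
    clusters = np.asarray(clusters)
    Z = np.eye(clusters.max() + 1)[clusters]           # one-hot assignments, (n, k)
    sums = Z.T @ (weights * observed) @ Z              # observed weight mass per cluster pair
    counts = Z.T @ observed.astype(float) @ Z          # observed entries per cluster pair
    global_mean = (weights * observed).sum() / observed.sum()
    # Fall back to the global mean for cluster pairs with no observed edge.
    block_means = np.where(counts > 0, sums / np.maximum(counts, 1), global_mean)
    return Z @ block_means @ Z.T                       # (n, n) matrix of predictions

def held_out_mse(weights, observed, held_out, clusters):
    """Mean squared error of the block-mean predictor on the hidden edges."""
    pred = block_mean_predictions(weights, observed, clusters)
    return float(np.mean((pred[held_out] - weights[held_out]) ** 2))

# Hypothetical graph: two communities, weight 1.0 inside and 0.1 across.
n = 6
true_clusters = np.array([0, 0, 0, 1, 1, 1])
same = true_clusters[:, None] == true_clusters[None, :]
weights = np.where(same, 1.0, 0.1)
np.fill_diagonal(weights, 0.0)

# Hide a few edges symmetrically; everything else off the diagonal is observed.
held_out = np.zeros((n, n), dtype=bool)
for i, j in [(0, 1), (3, 4), (0, 3), (2, 5)]:
    held_out[i, j] = held_out[j, i] = True
observed = ~held_out & ~np.eye(n, dtype=bool)

good_err = held_out_mse(weights, observed, held_out, [0, 0, 0, 1, 1, 1])
bad_err = held_out_mse(weights, observed, held_out, [0, 1, 0, 1, 0, 1])
print("held-out MSE, community-aligned clustering:", good_err)
print("held-out MSE, misaligned clustering:       ", bad_err)
```

Comparing clusterings by held-out error in this way is what makes different graph clustering approaches directly comparable under the prediction formulation.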

    A PAC-Bayesian Analysis of Co-clustering, Graph Clustering, and Pairwise Clustering

    We briefly review the PAC-Bayesian analysis of co-clustering (Seldin and Tishby, 2008, 2009, 2010), which provided generalization guarantees and regularization terms absent from preceding formulations of this problem and achieved state-of-the-art prediction results on the MovieLens collaborative filtering task. Inspired by this analysis, we formulate weighted graph clustering as a prediction problem: given a subset of edge weights, we analyze the ability of graph clustering to predict the remaining edge weights. This formulation enables practical and theoretical comparison of different approaches to graph clustering, as well as comparison of graph clustering with other possible ways to model the graph. Following the lines of Seldin and Tishby (2010), we derive PAC-Bayesian generalization bounds for graph clustering. The bounds show that graph clustering should optimize a trade-off between empirical data fit and the mutual information that clusters preserve on the graph nodes. A similar trade-off derived from information-theoretic considerations was already shown to produce state-of-the-art results in practice (Slonim et al., 2005; Yom-Tov and Slonim, 2009). This paper supports the empirical evidence by providing a better theoretical foundation, suggesting formal generalization guarantees, and offering a more accurate way to deal with finite sample issues.

    Probabilistic Clustering Using Maximal Matrix Norm Couplings

    In this paper, we present a local information-theoretic approach to explicitly learn probabilistic clustering of a discrete random variable. Our formulation yields a convex maximization problem for which it is NP-hard to find the global optimum. To solve this optimization problem algorithmically, we propose two relaxations that are solved via gradient ascent and alternating maximization. Experiments on the MSR Sentence Completion Challenge, MovieLens 100K, and Reuters21578 datasets demonstrate that our approach is competitive with existing techniques and worthy of further investigation. Comment: Presented at the 56th Annual Allerton Conference on Communication, Control, and Computing, 201