Design and Implementation of a Multi-Document Text Summarization System
As the number of digital documents grows rapidly, users need a system that can summarize text automatically. This study proposes a multi-document text summarization design based on a clustering approach and sentence selection. Sentence clustering uses Latent Semantic Indexing (LSI) and Similarity Based Histogram Clustering (SHC): LSI computes the degree of similarity between pairs of sentences, and SHC groups the sentences into clusters. Sentence selection uses Sentences Information Density (SID), a selection method based on a positional text graph. The combination of these methods can produce a multi-document summary with high coverage, diversity, and coherence.
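The SHC step described above can be sketched as follows. This is a minimal illustration that scores pairwise similarity with plain cosine on raw term counts; the paper computes similarities in the LSI-reduced space, which requires an SVD omitted here. The similarity and histogram-ratio thresholds are illustrative assumptions, not values from the paper.

```python
import math
import re
from collections import Counter

def tf_vector(sentence):
    """Term-frequency vector of a sentence (lowercased word counts)."""
    return Counter(re.findall(r"\w+", sentence.lower()))

def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    num = sum(a[t] * b[t] for t in set(a) & set(b))
    den = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return num / den if den else 0.0

def histogram_ratio(vectors, sim_threshold=0.3):
    """Fraction of pairwise similarities inside a cluster at or above the threshold."""
    pairs = [(i, j) for i in range(len(vectors)) for j in range(i + 1, len(vectors))]
    if not pairs:
        return 1.0
    hits = sum(1 for i, j in pairs if cosine(vectors[i], vectors[j]) >= sim_threshold)
    return hits / len(pairs)

def shc_cluster(sentences, sim_threshold=0.3, ratio_threshold=0.5):
    """Incremental SHC: a sentence joins the first cluster whose similarity
    histogram stays coherent after adding it; otherwise it opens a new cluster."""
    clusters, members = [], []
    for idx, s in enumerate(sentences):
        v = tf_vector(s)
        for c, m in zip(clusters, members):
            if histogram_ratio(c + [v], sim_threshold) >= ratio_threshold:
                c.append(v)
                m.append(idx)
                break
        else:
            clusters.append([v])
            members.append([idx])
    return members
```

For example, `shc_cluster(["the cat sat on the mat", "the cat lay on the mat", "stock prices rose today", "stock prices fell today"])` groups the two cat sentences and the two stock sentences into separate clusters.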
Analysis of Kawanku Magazine Titles Using K-Means Clustering with a Simulated Big Data Concept on a Hadoop Multi Node Cluster
Abstract
Readers of e-magazines such as Kawanku magazine are growing rapidly, so Big Data techniques are needed to manage the e-magazine data on the server. In addition, each article must be categorized into one of the seven title categories of Kawanku magazine, which calls for processing, grouping, and relating the text data using text mining. Combining text mining with Big Data offers an efficient and reliable way to store data on effective infrastructure. Text categorization with K-Means clustering remains adequate even on large data, because its results are highly accurate. The tests performed show that varying the amount of data did not affect execution time, because the differences between the dataset sizes used were not large.
Keywords: text mining, k-means, hadoop, big data, clustering, multi node cluster
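The K-Means categorization of titles can be sketched as below. This is a minimal, deterministic illustration on bag-of-words count vectors; the sample titles are invented placeholders, not data from the study, and a real pipeline on Hadoop would distribute the assignment and update steps across nodes.

```python
import re

def vectorize(titles):
    """Bag-of-words count vectors over a shared, sorted vocabulary."""
    vocab = sorted({w for t in titles for w in re.findall(r"\w+", t.lower())})
    index = {w: i for i, w in enumerate(vocab)}
    vecs = []
    for t in titles:
        v = [0.0] * len(vocab)
        for w in re.findall(r"\w+", t.lower()):
            v[index[w]] += 1.0
        vecs.append(v)
    return vecs

def kmeans(vecs, k, iters=20):
    """Plain Lloyd's algorithm with deterministic seeding on the first k points."""
    centroids = [list(v) for v in vecs[:k]]
    assign = [0] * len(vecs)
    for _ in range(iters):
        # assignment step: nearest centroid by squared Euclidean distance
        for i, v in enumerate(vecs):
            assign[i] = min(range(k),
                            key=lambda c: sum((a - b) ** 2 for a, b in zip(v, centroids[c])))
        # update step: recompute each centroid as the mean of its members
        for c in range(k):
            mem = [vecs[i] for i in range(len(vecs)) if assign[i] == c]
            if mem:
                centroids[c] = [sum(col) / len(mem) for col in zip(*mem)]
    return assign

# hypothetical titles, two topics
titles = ["tips fashion remaja", "resep kue coklat", "fashion style remaja", "resep kue keju"]
labels = kmeans(vectorize(titles), 2)
```

Here the fashion titles and the recipe titles end up in two different clusters.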
CLUSTER-BASED TERM WEIGHTING IN OPTIMIZING COVERAGE, DIVERSITY AND COHERENCE FOR MULTI-DOCUMENT SUMMARIZATION
A good summary can be obtained by optimizing coverage, diversity, and coherence. Sometimes, however, the sub-topics contained in the documents are not extracted well, so not every sub-topic is represented in the summarization result. This paper proposes a new method of cluster-based term weighting in optimizing coverage, diversity, and coherence for multi-document summarization. The optimization method used is self-adaptive differential evolution (SaDE), with additional term weighting based on the clusters formed by Similarity Based Histogram Clustering (SHC). SHC clusters the sentences so that every sub-topic in the documents can be represented in the summary, and SaDE searches for the candidate summary with the highest coverage, diversity, and coherence. Experiments were run on 15 topics of the Text Analysis Conference (TAC) 2008 dataset. The results show that the proposed method produces summaries with ROUGE-1 of 0.6704, ROUGE-2 of 0.2051, ROUGE-L of 0.6271, and ROUGE-SU of 0.3951.
Keywords: multi-document summarization, similarity based histogram clustering, coverage, diversity, coherence
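The evolutionary search for a summary can be sketched as a much-simplified differential evolution loop: candidate summaries are binary selection masks over sentences, decoded from continuous vectors, and scored by a toy coverage-plus-diversity objective. The fixed F and CR below are a simplification of SaDE (which self-adapts them), the fitness function and its 0.1 length penalty are illustrative assumptions, and cluster-based term weighting is omitted.

```python
import random

def fitness(mask, sent_terms, vocab):
    """Toy objective: vocabulary coverage plus non-redundancy, minus a length penalty."""
    chosen = [t for m, t in zip(mask, sent_terms) if m]
    if not chosen:
        return 0.0
    covered = set().union(*chosen)
    coverage = len(covered) / len(vocab)
    diversity = len(covered) / sum(len(t) for t in chosen)  # 1.0 when no term repeats
    return coverage + diversity - 0.1 * len(chosen)

def de_select(sent_terms, pop=20, gens=80, F=0.5, CR=0.9, seed=0):
    """DE/rand/1 over continuous vectors; components > 0.5 mean 'select the sentence'."""
    rng = random.Random(seed)
    n = len(sent_terms)
    vocab = set().union(*sent_terms)
    decode = lambda x: [xi > 0.5 for xi in x]
    X = [[rng.random() for _ in range(n)] for _ in range(pop)]
    for _ in range(gens):
        for i in range(pop):
            a, b, c = rng.sample([j for j in range(pop) if j != i], 3)
            trial = [X[a][d] + F * (X[b][d] - X[c][d]) if rng.random() < CR else X[i][d]
                     for d in range(n)]
            # greedy selection: keep the trial only if it is at least as fit
            if fitness(decode(trial), sent_terms, vocab) >= fitness(decode(X[i]), sent_terms, vocab):
                X[i] = trial
    best = max(X, key=lambda x: fitness(decode(x), sent_terms, vocab))
    return [i for i, m in enumerate(decode(best)) if m]
```

On four sentences where two are near-duplicates, the search settles on three sentences that cover the whole vocabulary without redundancy.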
Web information search and sharing :
Degree system: new; Report number: Kō 2735; Degree type: Doctor of Philosophy (Human Sciences); Date conferred: 2009/3/15; Waseda University degree number: Shin 493
Cooperative Clustering Model and Its Applications
Data clustering plays an important role in many disciplines, including data mining, machine learning, bioinformatics, pattern recognition, and other fields where there is a need to learn the inherent grouping structure of data in an unsupervised manner. There are many clustering approaches proposed in the literature, with different quality/complexity tradeoffs. Each clustering algorithm works on its own domain space, with no optimum solution for all datasets of different properties, sizes, structures, and distributions. Challenges in data clustering include identifying the proper number of clusters, scalability of the clustering approach, robustness to noise, tackling distributed datasets, and handling clusters of different configurations. This thesis addresses some of these challenges through cooperation between multiple clustering approaches.
We introduce a Cooperative Clustering (CC) model that involves multiple clustering techniques; the goal of the cooperative model is to increase the homogeneity of objects within clusters through cooperation, by developing two data structures: a cooperative contingency graph and a histogram representation of pair-wise similarities. The two data structures are designed to find the matching sub-clusters between different clusterings and to obtain the final set of cooperative clusters through a merging process. Obtaining the co-occurring objects from the different clusterings enables the cooperative model to group objects based on multiple agreement between the invoked clustering techniques. In addition, merging this set of sub-clusters using histograms introduces a new way of grouping objects into more homogeneous clusters. The cooperative model is consistent, reusable, and scalable in terms of the number of adopted clustering approaches.
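The agreement step behind the contingency graph can be sketched as follows: objects that every input clustering placed together form one sub-cluster, i.e. a cell of the intersection of the partitions. This is only the sub-cluster extraction; the subsequent histogram-based merging into final cooperative clusters is omitted.

```python
from collections import defaultdict

def cooperative_subclusters(*clusterings):
    """Intersect several clusterings (given as label lists over the same objects).

    Objects sharing the same tuple of labels across all clusterings were grouped
    together by every technique, so they form one agreed sub-cluster.
    """
    cells = defaultdict(list)
    for obj in range(len(clusterings[0])):
        key = tuple(labels[obj] for labels in clusterings)
        cells[key].append(obj)
    return sorted(cells.values())
```

For instance, if k-means labels five objects `[0, 0, 0, 1, 1]` and Bisecting k-means labels them `[0, 0, 1, 1, 1]`, the agreed sub-clusters are `[[0, 1], [2], [3, 4]]`: object 2 is split out because the two clusterings disagree on it.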
In order to deal with noisy data, a novel Cooperative Clustering Outliers Detection (CCOD) algorithm is implemented, applying the cooperation methodology for better detection of outliers in data. The new detection approach is designed in four phases: (1) global non-cooperative clustering, (2) cooperative clustering, (3) possible outliers detection, and finally (4) candidate outliers detection. The detection of outliers is established in a bottom-up scenario.
The thesis also addresses cooperative clustering in distributed Peer-to-Peer (P2P) networks. Mining large and inherently distributed datasets poses many challenges, one of which is the extraction of a global model as a global summary of the clustering solutions generated from all nodes, for the purpose of interpreting the clustering quality of the distributed dataset as if it were located at one node. We developed a distributed cooperative model and architecture that work on a two-tier super-peer P2P network. The model is called Distributed Cooperative Clustering in Super-peer P2P Networks (DCCP2P). This model aims at producing one clustering solution across the whole network. It specifically addresses scalability of network size, and consequently the distributed clustering complexity, by modeling the distributed clustering problem as two layers of peer neighborhoods and super-peers. Summarization of the global distributed clusters is achieved through a distributed version of the cooperative clustering model.
Three clustering algorithms, k-means (KM), Bisecting k-means (BKM), and Partitioning Around Medoids (PAM), are invoked in the cooperative model. Results on various gene expression and text document datasets with different properties, configurations, and degrees of outliers reveal that: (i) the cooperative clustering model achieves significant improvement in the quality of the clustering solutions compared to that of the non-cooperative individual approaches; (ii) the cooperative detection algorithm discovers the nonconforming objects in data with better accuracy than contemporary approaches; and (iii) the distributed cooperative model attains the same or even better quality than the centralized approach and achieves a decent speedup as the number of nodes increases. The distributed model offers a high degree of flexibility, scalability, and interpretability for large distributed repositories. Achieving the same results using current methodologies requires pooling the data first at one central location, which is sometimes not feasible.
An Update Algorithm for Restricted Random Walk Clusters
This book presents the dynamic extension of the Restricted Random Walk Cluster Algorithm by Schöll and Schöll-Paschinger. The dynamic variant makes it possible to quickly integrate changes in the underlying object set or the similarity matrix into the clusters; the results are indistinguishable from a renewed execution of the original algorithm on the updated data set.
Language-independent pre-processing of large document bases for text classification
Text classification is a well-known topic in the research of knowledge discovery in databases. Algorithms for text classification generally involve two stages. The first is concerned with identification of textual features (i.e. words and/or phrases) that may be relevant to the classification process. The second is concerned with classification rule mining and categorisation of "unseen" textual data. The first stage is the subject of this thesis and often involves an analysis of text that is both language-specific (and possibly domain-specific) and that may also be computationally costly, especially when dealing with large datasets. Existing approaches to this stage are therefore not generally applicable to all languages. In this thesis, we examine a number of alternative keyword selection methods and phrase generation strategies, coupled with two potential significant word list construction mechanisms and two final significant word selection mechanisms, to identify those words and/or phrases in a given textual dataset that are expected to distinguish between classes by simple, language-independent statistical properties. We present experimental results, using common (large) textual datasets presented in two distinct languages, to show that the proposed approaches can produce good performance with respect to both classification accuracy and processing efficiency. In other words, the study presented in this thesis demonstrates the possibility of efficiently solving the traditional text classification problem in a language-independent (and domain-independent) manner.
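The idea of picking class-distinguishing keywords by purely statistical, language-independent properties can be sketched as follows. The scoring rule (the gap between a word's per-class frequency rates) is one simple assumption standing in for the thesis's selection mechanisms; it uses only counts, with no stemming, stop lists, or other language-specific processing.

```python
import re
from collections import Counter

def keyword_scores(docs, labels):
    """Score each word by how unevenly it is distributed across classes.

    A large gap between the word's highest and lowest per-class frequency rate
    suggests it discriminates between classes; the method needs only counts.
    """
    classes = sorted(set(labels))
    counts = {c: Counter() for c in classes}
    for text, lab in zip(docs, labels):
        counts[lab].update(re.findall(r"\w+", text.lower()))
    totals = {c: sum(counts[c].values()) for c in classes}
    vocab = set().union(*(counts[c] for c in classes))
    return {w: max(counts[c][w] / totals[c] for c in classes)
               - min(counts[c][w] / totals[c] for c in classes)
            for w in vocab}
```

On a toy corpus with "sport" and "finance" documents, words concentrated in one class ("goal", "loan") score higher than words that appear less distinctively, regardless of the language of the text.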