
    Document Clustering using Self-Organizing Maps

    Cluster analysis of textual documents is a common technique for better filtering, navigation, understanding and comprehension of large document collections. Document clustering is an autonomous method that separates a large heterogeneous document collection into smaller, more homogeneous sub-collections called clusters. A self-organizing map (SOM) is a type of artificial neural network (ANN) that can autonomously organize a high-dimensional feature space into low-dimensional projections called maps. It is considered a good method for clustering, since both tasks require unsupervised processing. In this paper, we propose a multi-layer, multi-feature SOM for clustering documents. The paper implements a SOM with four layers, containing lexical terms, phrases and sequences in the bottom layers and combining all of them at the top layer. The documents are processed to extract these features to feed the SOM. The internal weights and interconnections between the layers' features (neurons) settle automatically through iterations with a small learning rate to discover the actual clusters. We have performed an extensive set of experiments on standard text mining datasets such as NEWS20, Reuters and WebKB, with F-Measure and Purity as evaluation measures. The evaluation gives encouraging results and outperforms some existing approaches. We conclude that a SOM with multiple features (lexical terms, phrases and sequences) and multiple layers can be very effective in producing high-quality clusters on large document collections.
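
    As a concrete illustration of the clustering mechanism the abstract describes, the sketch below trains a minimal single-layer SOM on document feature vectors with plain NumPy. It is not the paper's four-layer, multi-feature architecture; the grid size, decay schedules and learning rate are illustrative assumptions.

```python
import numpy as np

def train_som(docs, grid_h=8, grid_w=8, epochs=50, lr0=0.05, sigma0=2.0):
    """Train a rectangular SOM on document feature vectors (rows of `docs`)."""
    n_docs, n_feats = docs.shape
    rng = np.random.default_rng(0)
    weights = rng.random((grid_h, grid_w, n_feats))          # codebook vectors
    grid = np.stack(np.meshgrid(np.arange(grid_h), np.arange(grid_w),
                                indexing="ij"), axis=-1)     # node coordinates

    for epoch in range(epochs):
        lr = lr0 * np.exp(-epoch / epochs)                   # decaying learning rate
        sigma = sigma0 * np.exp(-epoch / epochs)             # shrinking neighborhood
        for x in docs[rng.permutation(n_docs)]:
            # best-matching unit: node whose weight vector is closest to x
            dists = np.linalg.norm(weights - x, axis=-1)
            bmu = np.unravel_index(dists.argmin(), dists.shape)
            # pull the BMU and its grid neighbors toward the document vector
            grid_d2 = ((grid - np.array(bmu)) ** 2).sum(axis=-1)
            h = np.exp(-grid_d2 / (2 * sigma ** 2))[..., None]
            weights += lr * h * (x - weights)
    return weights
```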

    Clustering Documents with Maximal Substrings

    This paper provides experimental results showing that maximal substrings can be used as elementary building blocks of documents in place of the words extracted by a state-of-the-art supervised word extraction method. Maximal substrings are defined as substrings whose number of occurrences becomes smaller when even one character is appended to their head or tail. The main feature of maximal substrings is that they can be extracted quite efficiently in an unsupervised manner. We extract maximal substrings from a document set and represent each document as a bag of maximal substrings. We also obtain a bag-of-words representation by applying a state-of-the-art supervised word extraction method to the same document set. We then apply the same document clustering method to both representations and compare the quality of the two clustering results. We adopt a Bayesian document clustering based on Dirichlet compound multinomials to avoid overfitting. Our experiment shows that the clustering quality achieved with maximal substrings is good enough to use them in place of the words extracted by supervised word extraction.
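
    A minimal sketch of the definition above, assuming a naive quadratic enumeration rather than the linear-time suffix-array/BWT extraction such work typically relies on: a substring is kept as maximal only if every one-character extension of its head or tail occurs strictly fewer times.

```python
from collections import Counter

def maximal_substrings(text, min_len=2, max_len=10, min_count=2):
    """Naive maximal-substring extraction for illustration only.
    A substring is maximal when appending any single character to its
    head or tail strictly reduces its number of occurrences."""
    n = len(text)
    counts = Counter(text[i:i + k]
                     for k in range(min_len, max_len + 2)   # +1 covers extensions
                     for i in range(n - k + 1))
    alphabet = set(text)
    result = {}
    for s, c in counts.items():
        if len(s) > max_len or c < min_count:
            continue
        left = max((counts.get(ch + s, 0) for ch in alphabet), default=0)
        right = max((counts.get(s + ch, 0) for ch in alphabet), default=0)
        if left < c and right < c:
            result[s] = c
    return result

# bag-of-maximal-substrings representation of a toy "document"
print(maximal_substrings("abracadabra abracadabra"))
```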

    Text mining techniques for patent analysis.

    Patent documents contain important research results. However, they are lengthy and rich in technical terminology, so analyzing them takes a great deal of human effort. Automatic tools for assisting patent engineers or decision makers in patent analysis are in great demand. This paper describes a series of text mining techniques that conform to the analytical process used by patent analysts. These techniques include text segmentation, summary extraction, feature selection, term association, cluster generation, topic identification, and information mapping. Both efficiency and effectiveness are considered in the design of these techniques. Some important features of the proposed methodology include a rigorous approach to verifying the usefulness of segment extracts as document surrogates, a corpus- and dictionary-free algorithm for keyphrase extraction, an efficient co-word analysis method that can be applied to a large volume of patents, and an automatic procedure for creating generic cluster titles for ease of result interpretation. An evaluation of these techniques was conducted. The results confirm that the machine-generated summaries preserve more important content words for classification than some other sections. To demonstrate feasibility, the proposed methodology was applied to a real-world patent set for domain analysis and mapping, which shows that our approach is more effective than existing classification systems. The attempt in this paper to automate the whole process not only helps create final patent maps for topic analyses, but also facilitates or improves other patent analysis tasks such as patent classification, organization, knowledge sharing, and prior art searches.
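
    Of the listed steps, co-word analysis is the most self-contained to sketch. The toy below scores keyphrase associations with the equivalence index E(i,j) = c_ij^2 / (c_i * c_j), a measure common in co-word studies; whether the paper uses this exact formula is an assumption, and the keyphrases are hypothetical.

```python
from collections import Counter
from itertools import combinations

def coword_matrix(docs_keyphrases):
    """Co-word analysis sketch: association strength between keyphrases
    via the equivalence index E(i,j) = c_ij^2 / (c_i * c_j)."""
    term_freq = Counter()
    pair_freq = Counter()
    for phrases in docs_keyphrases:
        uniq = sorted(set(phrases))        # count each phrase once per patent
        term_freq.update(uniq)
        pair_freq.update(combinations(uniq, 2))
    return {(a, b): pair_freq[(a, b)] ** 2 / (term_freq[a] * term_freq[b])
            for (a, b) in pair_freq}

docs = [["neural network", "patent map", "clustering"],
        ["patent map", "clustering"],
        ["neural network", "clustering"]]
for pair, strength in sorted(coword_matrix(docs).items(), key=lambda kv: -kv[1]):
    print(pair, round(strength, 3))
```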

    Augmenting Translation Lexica by Learning Generalised Translation Patterns

    Bilingual lexicons improve the quality of parallel corpora alignment, of newly extracted translation pairs, of machine translation, and of cross-language information retrieval, among other applications. In this regard, the first problem addressed in this thesis pertains to the classification of translations automatically extracted from parallel corpora, that is, collections of sentence pairs that are translations of each other. The second problem concerns machine learning of bilingual morphology, with applications to the first problem and to the generation of out-of-vocabulary translations. With respect to translation classification, two separate classifiers, for multi-word and word-to-word translations, are trained on previously extracted translation pairs manually classified as correct or incorrect. Several cues are useful for distinguishing adequate multi-word candidates from inadequate ones, such as the lack or presence of parallelism, spurious terms at translation ends (for example determiners or coordinating conjunctions), orthographic similarity between translations, and the occurrence and co-occurrence frequency of the translation pairs. Morphological coverage reflecting stem and suffix agreement is explored as a key feature in classifying word-to-word translations. Given that the evaluation of extracted translation equivalents depends heavily on the human evaluator, an automated filter that separates appropriate from inappropriate translation pairs prior to human evaluation greatly reduces this work, saving time and progressively improving alignment and extraction quality. It can also be applied to filtering translation tables used for training machine translation engines, and to detecting bad translation choices made by translation engines, thus enabling significant productivity gains in the post-editing of machine translations. An important attribute of a translation lexicon is the coverage it provides. Learning suffixes and suffixation operations from the lexicon or corpus of a language is an extensively researched approach to handling out-of-vocabulary terms. However, beyond mere words or word forms, translations and their variants are a powerful source of information for automatic structural analysis, which is explored from the perspective of improving word-to-word translation coverage and constitutes the second part of this thesis. In this context, as a phase prior to suggesting out-of-vocabulary bilingual lexicon entries, an approach is proposed to automatically induce segmentation and learn bilingual morph-like units by identifying and pairing word stems and suffixes, using a bilingual corpus of translations automatically extracted from aligned parallel corpora and then manually validated or automatically classified. A minimally supervised technique is proposed to enable bilingual morphology learning for language pairs whose bilingual lexicons are highly defective as regards word-to-word translations representing inflectional diversity.
Apart from the applications mentioned above, in the classification of machine-extracted translations and in the generation of out-of-vocabulary translations, the learned bilingual morph units may also have a great impact on establishing correspondences between sub-word constituents in word-to-multi-word and multi-word-to-multi-word translations, and on compression, full-text indexing and retrieval applications.
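
    A rough sketch of the stem/suffix pairing idea, under the simplifying assumption that a sufficiently long shared prefix of two source (and two target) words acts as a stem; the thesis's actual induction procedure is more elaborate, and the English-Portuguese entries below are hypothetical.

```python
from itertools import combinations

def common_prefix(a, b):
    """Longest common prefix of two strings."""
    i = 0
    while i < min(len(a), len(b)) and a[i] == b[i]:
        i += 1
    return a[:i]

def bilingual_suffix_pairs(lexicon, min_stem=3):
    """When two word-to-word translation pairs share source and target
    stems, record the leftover endings as candidate bilingual suffixes."""
    pairs = set()
    for (s1, t1), (s2, t2) in combinations(lexicon, 2):
        s_stem = common_prefix(s1, s2)
        t_stem = common_prefix(t1, t2)
        if len(s_stem) >= min_stem and len(t_stem) >= min_stem:
            pairs.add(((s1[len(s_stem):], t1[len(t_stem):]),
                       (s2[len(s_stem):], t2[len(t_stem):])))
    return pairs

lexicon = [("translated", "traduzido"), ("translating", "traduzindo"),
           ("translation", "tradução")]
for p in bilingual_suffix_pairs(lexicon):
    print(p)   # e.g. (('ed', 'do'), ('ing', 'ndo'))
```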

    On Two Web IR Boosting Tools: Clustering and Ranking

    This thesis investigates several research problems which arise in modern Web Information Retrieval (WebIR). The Holy Grail of modern WebIR is to find a way to organize and rank results so that the most "relevant" ones come first. The first breakthrough technique was the exploitation of the link structure of the Web graph to rank result pages, using the well-known HITS and PageRank algorithms. These link-analysis approaches have been improved and extended, but they still seem insufficient to provide a satisfying search experience. In a number of situations a flat list of search results is not enough, and users may want search results grouped on the fly into folders of similar topics. In addition, the folders should be annotated with meaningful labels for rapid identification of the desired group of results. In other situations, users may have different search goals even when they express them with the same query. In this case the search results should be personalized according to the users' online activities. To address this need, we discuss the algorithmic ideas behind SnakeT, a hierarchical clustering meta-search engine that personalizes searches according to the clusters selected by users on the fly. There are also situations where users want access to fresh information. In these cases, traditional link analysis may not be suitable, since a recently produced piece of information may not have had enough time to attract many in-links. To address this need, we discuss the algorithmic and numerical ideas behind a new ranking algorithm suitable for fresh types of information, such as news articles or blogs. When link analysis suffices to produce good-quality search results, the huge amount of Web information calls for fast ranking methodologies. We discuss numerical methodologies for accelerating the eigenvector-like computation commonly used by link analysis. An important result of this thesis is to show how these predominant issues of Web Information Retrieval can be addressed using clustering and ranking methodologies. We demonstrate that clustering and ranking have a mutual reinforcement property which has not yet been studied intensively, and that this property can be exploited to boost the precision of both methodologies.
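
    The "eigenvector-like computation" mentioned here is, in PageRank's case, a power iteration on the Google matrix. A minimal sketch follows; the damping factor and the tiny graph are illustrative, and the thesis's acceleration techniques are not reproduced.

```python
import numpy as np

def pagerank(adj, damping=0.85, tol=1e-10, max_iter=200):
    """Power-iteration PageRank: repeatedly apply the Google matrix to a
    rank vector until it converges to the dominant eigenvector.
    adj[i][j] = 1 means page i links to page j."""
    n = adj.shape[0]
    out_deg = adj.sum(axis=1)
    # dangling pages (no out-links) distribute their rank uniformly
    transition = np.where(out_deg[:, None] > 0,
                          adj / np.maximum(out_deg[:, None], 1),
                          1.0 / n)
    rank = np.full(n, 1.0 / n)
    for _ in range(max_iter):
        new_rank = damping * rank @ transition + (1 - damping) / n
        if np.abs(new_rank - rank).sum() < tol:
            break
        rank = new_rank
    return rank

# tiny 4-page web graph
adj = np.array([[0, 1, 1, 0],
                [0, 0, 1, 0],
                [1, 0, 0, 0],
                [0, 0, 1, 0]], dtype=float)
print(pagerank(adj).round(4))
```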

    A SOM-based document clustering using frequent max substrings for non-segmented texts

    This paper proposes a non-segmented document clustering method using a self-organizing map (SOM) and the frequent max substring technique to improve the efficiency of information retrieval. SOMs have been widely used for document clustering and are successful in many applications. However, when applied to non-segmented documents, the challenge is to identify interesting patterns efficiently. There are two main phases in the proposed method: a preprocessing phase and a clustering phase. In the preprocessing phase, the frequent max substring technique is first applied to discover the patterns of interest, called frequent max substrings, which are long and frequent substrings rather than individual words, from the non-segmented texts. These discovered patterns are then used as indexing terms. The indexing terms, together with their numbers of occurrences, form a document vector. In the clustering phase, the SOM is used to generate the document cluster map from the feature vectors of frequent max substrings. To demonstrate the proposed technique, experimental studies and comparison results on clustering Thai text documents, which consist of non-segmented texts, are presented in this paper. The results show that the proposed technique can be used for Thai texts. The document cluster map generated with the method can be used to find relevant documents more efficiently.
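
    The indexing step can be sketched independently of substring discovery: once the frequent max substrings are known, each document vector is just their occurrence counts, obtainable by raw substring matching with no word segmentation. The length-normalization choice below is an assumption; the resulting matrix could feed a SOM like the one sketched earlier in this list.

```python
import numpy as np

def substring_count(text, term):
    """Count (possibly overlapping) occurrences of an index term in
    unsegmented text -- no tokenizer needed, just substring matching."""
    count = start = 0
    while (pos := text.find(term, start)) != -1:
        count += 1
        start = pos + 1
    return count

def document_vectors(docs, index_terms):
    """Build the SOM input: one row per document, one column per
    discovered frequent max substring, entries = occurrence counts."""
    X = np.array([[substring_count(d, t) for t in index_terms]
                  for d in docs], dtype=float)
    norms = np.linalg.norm(X, axis=1, keepdims=True)
    return X / np.maximum(norms, 1e-12)   # length-normalize before the SOM
```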

    Mapping the Distribution of Basic Education Quality Using the Self-Organizing Maps Method

    One of the government's programs to accelerate development is equalizing improvements in the quality of education across all regions of Indonesia. One stage of this program is mapping education quality through schools. Mapping education quality through schools is expected to give education providers a picture of the actual quality of education in the field. Such a mapping is expected to yield evaluations, policies, recommendations, and planning programs useful for subsequent improvements in education quality. At present, the mapping is still done in a conventional way, so a method is needed that can process the data for mapping quickly, effectively, and efficiently. This study applies the Self-Organizing Maps (SOM) clustering method to group and map school quality scores based on six National Education Standards. The assessment data used are the scores for the graduate competency standard, the content standard, the process standard, the assessment standard, the educators and education personnel standard, and the management standard. The mapping process begins with data normalization; the normalized data are then used as input to the method. The results of this study show that, on average, the quality of primary school education falls into the highest expected category of the National Education Standards, since five of the six clusters excel in that category, while the category of quality merely meeting the National Education Standards appears in the graduate competency standard parameter. Testing the clustering analysis with the Davies-Bouldin Index (DBI) validity measure shows that the clustering over the six National Education Standards variables is good.
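
    A minimal sketch of the evaluation pipeline, with scikit-learn's KMeans standing in for the SOM used in the study and synthetic scores standing in for the real data; only the normalize, cluster, and score-with-DBI flow mirrors the study.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler
from sklearn.cluster import KMeans
from sklearn.metrics import davies_bouldin_score

# synthetic stand-in for school scores on the six National Education
# Standards (graduate competency, content, process, assessment,
# educators, management) -- illustrative data, not the study's dataset
rng = np.random.default_rng(42)
scores = rng.uniform(3.0, 7.0, size=(200, 6))

X = MinMaxScaler().fit_transform(scores)       # normalization step
labels = KMeans(n_clusters=6, n_init=10, random_state=0).fit_predict(X)

# lower Davies-Bouldin Index = tighter, better-separated clusters
print("DBI:", davies_bouldin_score(X, labels))
```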

    Automatic Video-based Analysis of Human Motion
