74 research outputs found

Text mining with the WEBSOM

Author: Lagus Krista
Publication venue: Teknillinen korkeakoulu
Publication date: 11/12/2000

The emerging field of text mining applies methods from data mining and exploratory data analysis to analyzing text collections and to conveying information to the user in an intuitive manner. Visual, map-like displays provide a powerful and fast medium for portraying information about large collections of text. Relationships between text items and collections, such as similarity, clusters, gaps and outliers, can be communicated naturally using spatial relationships, shading, and colors. In the WEBSOM method the self-organizing map (SOM) algorithm is used to automatically organize very large and high-dimensional collections of text documents onto two-dimensional map displays. The map forms a document landscape where similar documents appear close to each other at points of the regular map grid. The landscape can be labeled with automatically identified descriptive words that convey properties of each area and also act as landmarks during exploration. With the help of an HTML-based interactive tool the ordered landscape can be used in browsing the document collection and in performing searches on the map. An organized map offers an overview of an unknown document collection, helping the user familiarize herself with the domain. Map displays that are already familiar can be used as visual frames of reference for conveying properties of unknown text items. Static, thematically arranged document landscapes provide meaningful backgrounds for dynamic visualizations of, for example, time-related properties of the data. Search results can be visualized in the context of related documents. Experiments on document collections of various sizes, text types, and languages show that the WEBSOM method is scalable and generally applicable. Preliminary results in a text retrieval experiment indicate that, even when the additional value provided by the visualization is disregarded, the document maps perform at least comparably with more conventional retrieval methods. Reviewed.
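The abstract above describes placing similar documents close together on a regular two-dimensional grid with the SOM algorithm. The following is a minimal sketch of generic SOM training on small document vectors, not the WEBSOM implementation itself. The function names (`train_som`, `best_matching_unit`), the linear decay schedules, the Gaussian neighborhood, and all parameter values are assumptions chosen for illustration.

```python
import math
import random


def best_matching_unit(grid, x):
    """Return the grid coordinate whose model vector is closest to x."""
    return min(
        grid,
        key=lambda node: sum((a - b) ** 2 for a, b in zip(grid[node], x)),
    )


def train_som(vectors, rows, cols, epochs=50, seed=0):
    """Train a tiny self-organizing map on equal-length document vectors.

    Hypothetical helper: the schedules below are illustrative assumptions.
    """
    rng = random.Random(seed)
    dim = len(vectors[0])
    # One randomly initialised model vector per node of the regular grid.
    grid = {
        (r, c): [rng.random() for _ in range(dim)]
        for r in range(rows)
        for c in range(cols)
    }
    total_steps = epochs * len(vectors)
    step = 0
    for _ in range(epochs):
        order = list(vectors)
        rng.shuffle(order)
        for x in order:
            frac = step / total_steps
            lr = 0.5 * (1.0 - frac)  # learning rate shrinks over time
            radius = max(rows, cols) / 2.0 * (1.0 - frac) + 0.5
            bmu = best_matching_unit(grid, x)
            for node, w in grid.items():
                # Nodes near the winner on the grid are pulled harder.
                d2 = (node[0] - bmu[0]) ** 2 + (node[1] - bmu[1]) ** 2
                h = math.exp(-d2 / (2.0 * radius * radius))
                for i in range(dim):
                    w[i] += lr * h * (x[i] - w[i])
            step += 1
    return grid
```

After training, `best_matching_unit(grid, doc_vector)` gives the map position of a document, which is the kind of lookup a map-based browsing or search interface would need.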

Aaltodoc Publication Archive

Dimensionality Reduction of very large document collections by Semantic Mapping

Author: Corrêa Renato Fernandes
Ludermir Teresa Bernarda
Publication venue: Technische Fakultät, Arbeitsgruppen der Informatik
Publication date: 31/12/2007

This paper describes improvements to Semantic Mapping, a feature extraction method useful for reducing the dimensionality of vectors that represent documents in large text collections. The method may be viewed as a specialization of Random Mapping, the method proposed in the WEBSOM project. Semantic Mapping, Random Mapping and Principal Component Analysis (PCA) are applied to the categorization of document collections using Self-Organizing Maps (SOM). Semantic Mapping generated document representations as good as those of PCA and much better than those of Random Mapping.
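As a rough illustration of the Random Mapping baseline mentioned above, the sketch below projects a high-dimensional term-count vector onto fewer dimensions using a random matrix with unit-length columns. It is a generic random projection under stated assumptions, not the paper's code and not Semantic Mapping: the function names, the Gaussian entries, and the column normalization are illustrative choices.

```python
import math
import random


def make_random_matrix(input_dim, output_dim, seed=0):
    """Build one random unit-length column per input term (assumed scheme)."""
    rng = random.Random(seed)
    columns = []
    for _ in range(input_dim):
        col = [rng.gauss(0.0, 1.0) for _ in range(output_dim)]
        norm = math.sqrt(sum(v * v for v in col))
        columns.append([v / norm for v in col])
    return columns


def random_map(doc_vector, columns):
    """Project a term-weight vector: sum of term weights times term columns."""
    output_dim = len(columns[0])
    projected = [0.0] * output_dim
    for weight, col in zip(doc_vector, columns):
        if weight:  # skip absent terms; document vectors are mostly zeros
            for i in range(output_dim):
                projected[i] += weight * col[i]
    return projected


# Usage: a 6-term vocabulary reduced to 3 dimensions.
columns = make_random_matrix(input_dim=6, output_dim=3, seed=42)
reduced = random_map([0, 2, 0, 1, 0, 0], columns)
```

The reduced vectors can then be fed to a SOM in place of the original term vectors, which is the setting in which the abstract compares the three reduction methods.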

BieColl - Bielefeld Electronic Collections

BieColl - Bielefeld eCollections

Towards improving WEBSOM with multi-word expressions

Author: Alves Stefan Eduard Raposo
Publication venue: Faculdade de Ciências e Tecnologia
Publication date: 01/01/2013

Dissertation submitted for the degree of Master in Informatics Engineering. Large quantities of free-text documents are usually rich in information and cover several topics. However, because such collections are very large, searching and filtering the data is a laborious task. A large text collection covers a set of topics, where each topic is associated with a group of documents. This thesis presents a method for building a document map of the core contents covered in the collection. WEBSOM is an approach that combines document encoding methods and Self-Organising Maps (SOM) to generate a document map. However, this methodology has a weakness in the document encoding method because it uses single words to characterise documents. Single words tend to be ambiguous and semantically vague, so some documents can be incorrectly related. This thesis proposes a new document encoding method to improve the WEBSOM approach by using multi-word expressions (MWEs) to describe documents. Previous research and ongoing experiments encourage us to use MWEs to characterise documents because these are semantically more accurate than single words and more descriptive.

Repositório da Universidade Nova de Lisboa

Word Sense Disambiguation with THESSOM

Author: Linden Krister
Publication date: 01/09/2003

Helsingin yliopiston digitaalinen arkisto

How to improve robustness in Kohonen maps and display additional information in Factorial Analysis: application to text mining

Author: Bourgeois Nicolas
Cottrell Marie
Déruelle Benjamin
Lamassé Stéphane
Letrémy Patrick
Publication venue: Elsevier BV
Publication date: 23/07/2014

This article is an extended version of a paper presented at the WSOM'2012 conference [1]. We present a combination of factorial projections, the SOM algorithm and graph techniques applied to a text mining problem. The corpus contains 8 medieval manuscripts which were used to teach arithmetic techniques to merchants. Among the techniques for Data Analysis, those used for Lexicometry (such as Factorial Analysis) highlight the discrepancies between manuscripts. The reason for this is that they focus on the deviation from independence between words and manuscripts. Still, we also want to discover and characterize the vocabulary common to the whole corpus. Using the properties of stochastic Kohonen maps, which define neighborhood between inputs in a non-deterministic way, we highlight the words which seem to play a special role in the vocabulary. We call them fickle and use them to improve both Kohonen map robustness and the significance of FCA visualization. Finally, we use graph algorithms to exploit this fickleness for the classification of words.

arXiv.org e-Print Archive

HAL-Paris1

Evaluation of Linguistic Features for Word Sense Disambiguation with Self-Organized Document Maps

Author: Linden Krister
Publication date: 01/11/2004

Word sense disambiguation automatically determines the appropriate senses of a word in context. We have previously shown that self-organized document maps have properties similar to a large-scale semantic structure that is useful for word sense disambiguation. This work evaluates the impact of different linguistic features on self-organized document maps for word sense disambiguation. The features evaluated are various qualitative features, e.g. part-of-speech and syntactic labels, and quantitative features, e.g. cut-off levels for word frequency. It is shown that linguistic features help make contextual information explicit. If the training corpus is large, even contextually weak features, such as base forms, will act in concert to produce sense distinctions in a statistically significant way. However, the most important features are syntactic dependency relations and base forms annotated with part of speech or syntactic labels. We achieve 62.9%±0.73% correct results on the fine-grained lexical task of the English SENSEVAL-2 data. On the 96.7% of the test cases which need no back-off to the most frequent sense we achieve 65.7% correct results. Peer reviewed.

Helsingin yliopiston digitaalinen arkisto

Combining SOMs and Ontologies for Effective Web Site Mining

Author: Constantin Halatsis
Dimitris Petrilis
Publication venue: IntechOpen
Publication date: 21/01/2011

IntechOpen

Self-Organized Ordering of Terms and Documents in NSF Awards Data

Author: Honkela Timo
Klami Mikaela
Publication venue: Technische Fakultät, Arbeitsgruppen der Informatik
Publication date: 31/12/2007

We present the results of an analysis of a text corpus of 129,000 abstracts of NSF-sponsored basic research projects between 1990 and 2003. The methods used in the analysis include term extraction based on a reference corpus and an entropy measure, and the Self-Organizing Map algorithm for the formation of a term map and a document map. Methodologically, the basic approach is based on earlier developments, such as word category maps and the WEBSOM method, but at the level of detail, we report several new aspects and quantitative comparison results between methodological variants in this article. The data covers quite a large proportion of US-based scientific research during recent years. The analysis results indicate the basic patterns discernible in the data, both at the level of the awards and in the terminology used in them.

BieColl - Bielefeld Electronic Collections

BieColl - Bielefeld eCollections

Self-Organizing Word Map for Context-Based Document Classification

Author: Tambouratzis George
Tsimboukakis Nikolaos
Publication venue: Technische Fakultät, Arbeitsgruppen der Informatik
Publication date: 31/12/2007

In this paper, a novel SOM-based system for document organization is presented. The purpose of the system is the classification of a document collection in terms of document content. The system possesses a two-level hybrid connectionist architecture that comprises (i) an automatically created word map using a SOM, which functions as a feature extraction module, and (ii) a supervised MLP-based classifier, which provides the final classification result. The experiments, which have been performed on Modern Greek text documents, indicate that the proposed system effectively separates the different types of text.

BieColl - Bielefeld Electronic Collections

BieColl - Bielefeld eCollections

Incremental document map formation: multi-stage approach

Author: A. Kłopotek Mieczysław
Ciesielski Krzysztof
Czerski Dariusz
Dramiński Michał
T. Wierzchoń Sławomir
Publication venue: Uniwersytetu Marii Curie-Sklodowskiej w Lublinie
Publication date: 01/01/2006

The paper presents a methodology for incremental map formation in a multi-stage process of a search engine with a map-based user interface. The architecture of the experimental system allows for comparative evaluation of different constituent technologies for various stages of the process. The quality of the map generation process has been investigated based on a number of clustering and classification measures. Some conclusions concerning the impact of various technological solutions on map quality are presented.

Biblioteka Nauki - repozytorium artykułów

University of Maria Curie-Skłodowska (UMCS): Scientific e-Journals / Uniwersytet Marii Curie-Skłodowskiej: e-czasopisma naukowe