Search CORE

24,166 research outputs found

Incorporating semantic and syntactic information into document representation for document clustering

Author: Wang Yong
Publication venue: Scholars Junction
Publication date: 18/07/2005
Field of study

Document clustering is a widely used strategy for information retrieval and text data mining. In traditional document clustering systems, documents are represented as a bag of independent words. In this project, we propose to enrich the representation of a document by incorporating semantic information and syntactic information. Semantic analysis and syntactic analysis are performed on the raw text to identify this information. A detailed survey of current research in natural language processing, syntactic analysis, and semantic analysis is provided. Our experimental results demonstrate that incorporating semantic information and syntactic information can improve the performance of our document clustering system for most of our data sets. A statistically significant improvement can be achieved when we combine both syntactic and semantic information. Our experimental results using compound words show that using only compound words does not improve the clustering performance for our data sets. When the compound words are combined with original single words, the combined feature set gets slightly better performance for most data sets. But this improvement is not statistically significant. In order to select the best clustering algorithm for our document clustering system, a comparison of several widely used clustering algorithms is performed. Although the bisecting K-means method has advantages when working with large datasets, a traditional hierarchical clustering algorithm still achieves the best performance for our small datasets

Mississippi State University Libraries ETD database

Scholars Junction - Mississippi State University Institutional Repository

Real-time unsupervised classification of web documents

Author: Constant Mathieu
Sigogne Anthony
Publication venue: 'Institute of Electrical and Electronics Engineers (IEEE)'
Publication date: 12/10/2009
Field of study

International audienceThis paper adresses the problem of clustering dynamic collections of web documents. We show an iterative algorithm based on a fine-grained keyword extraction (simple, compound words and proper nouns). Each new document inserted in the collection is either assigned to an existing class containing documents of the same topic, or assigned to a new class. After each step, when necessary, classes are refined using statistical techniques. The implementation of this algorithm was successfully integrated in an application used for Information Intelligence

Crossref

HAL-Ecole des Ponts ParisTech

HAL - UPEC / UPEM

Clustering Documents with Maximal Substrings

Author: D. Blei
D. Zhang
K. Nigam
M. Abouelhoda
T. Chumwatana
T. Kasai
Y. Li
Publication venue: 'Springer Science and Business Media LLC'
Publication date: 01/01/2012
Field of study

This paper provides experimental results showing that we can use maximal substrings as elementary building blocks of documents in place of the words extracted by a current state-of-the-art supervised word extraction. Maximal substrings are defined as the substrings each giving a smaller number of occurrences even by appending only one character to its head or tail. The main feature of maximal substrings is that they can be extracted quite efficiently in an unsupervised manner. We extract maximal substrings from a document set and represent each document as a bag of maximal substrings. We also obtain a bag of words representation by using a state-of-the-art supervised word extraction over the same document set. We then apply the same document clustering method to both representations and obtain two clustering results for a comparison of their quality. We adopt a Bayesian document clustering based on Dirichlet compound multinomials for avoiding overfitting. Our experiment shows that the clustering quality achieved with maximal substrings is acceptable enough to use them in place of the words extracted by a supervised word extraction

Crossref

Nagasaki University's Academic Output SITE: NAOSITE

Institutional Repositories DataBase (IRDB)

Nagasaki university's Academic Output SITE

Matching Queries to Frequently Asked Questions: Search Functionality for the MRSA Web-Portal

Author: Akker Rieks op den
Tigelaar Almer S.
Verhoeven Fenne
Publication venue: Centre for Telematics and Information Technology, University of Twente
Publication date: 01/01/2009
Field of study

As part of the long-term EUREGIO MRSA-net project a system was developed which enables health care workers and the general public to quickly find answers to their questions regarding the MRSA pathogen. This paper focuses on how these questions can be answered using Information Retrieval (IR) and Natural Language Processing (NLP) techniques on a Frequently-Asked-Questions-style (FAQ) database

University of Twente Research Information