Background: Document classification is a widespread problem with many applications, from organizing search engine snippets to spam filtering. We previously described Textpresso, a text-mining system for biological literature, which marks up full text according to a shallow ontology that includes terms of biological interest. This project investigates document classification in the context of biological literature, making use of the Textpresso markup of a corpus of Caenorhabditis elegans literature.



Results: We present a two-step text categorization algorithm to classify a corpus of C. elegans papers. Our classification method first uses a support vector machine-trained classifier, followed by a novel, phrase-based clustering algorithm. This clustering step autonomously creates cluster labels that are descriptive and understandable by humans. On a standard test set (Reuters-21578), this clustering engine outperformed previously published results (F-value of 0.55 vs. 0.49), while producing cluster descriptions that appear more useful. A web interface allows researchers to quickly navigate through the hierarchy and look for documents that belong to a specific concept.
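For readers unfamiliar with the F-value quoted above, the following is a minimal sketch of one common way to score a clustering against reference categories. It assumes the class-weighted, best-match F-measure often used in document clustering work; the abstract does not state which exact definition the authors applied, and the function names `f1` and `clustering_f_measure` are hypothetical.

```python
from collections import Counter


def f1(precision, recall):
    """Harmonic mean of precision and recall (0.0 when both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def clustering_f_measure(true_labels, cluster_ids):
    """Class-weighted best-match F-measure of a clustering.

    Assumption: for each reference class, take the best F1 over all
    clusters, then average those values weighted by class size.
    """
    n = len(true_labels)
    if n == 0:
        return 0.0
    class_sizes = Counter(true_labels)
    cluster_sizes = Counter(cluster_ids)
    # Number of documents shared by each (class, cluster) pair.
    overlap = Counter(zip(true_labels, cluster_ids))

    total = 0.0
    for cls, n_cls in class_sizes.items():
        best = 0.0
        for clu, n_clu in cluster_sizes.items():
            n_both = overlap[(cls, clu)]
            if n_both == 0:
                continue
            precision = n_both / n_clu  # share of the cluster in this class
            recall = n_both / n_cls     # share of the class in this cluster
            best = max(best, f1(precision, recall))
        total += (n_cls / n) * best
    return total


if __name__ == "__main__":
    # Toy data: four documents, two reference classes, two clusters.
    labels = ["a", "a", "b", "b"]
    clusters = [0, 0, 0, 1]
    print(round(clustering_f_measure(labels, clusters), 4))
```

Under this definition a clustering that reproduces the reference categories exactly scores 1.0, and a score falls as clusters mix classes or split them.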



Conclusions: We have demonstrated a simple method to classify biological documents that embodies an improvement over current methods. While the classification results are currently optimized for Caenorhabditis elegans papers by human-created rules, the classification engine can be adapted to different types of documents. We have demonstrated this by presenting a web interface that allows researchers to quickly navigate through the hierarchy and look for documents that belong to a specific concept.

Chen, David

Müller, Hans-Michael

Sternberg, Paul W.

English

PubMed

Directory of Open Access Journals

BMC Bioinformatics

Automatic document classification of biological literature

Springer - Publisher Connector

Caltech Authors - Main

http://authors.library.caltech.edu/4376/1/CHEbmcbioinf06.pdf
