Automatic Learning of the Morphology of Medical Language using Information Compression

Johnson, Stephen B.; Mollah, Shamim Ara

Authors: Stephen B. Johnson, Shamim Ara Mollah
Publication date: 1 January 2003
Publisher: American Medical Informatics Association

Abstract

Conversion of free-text strings in a natural language to a standard representation (codes) is an important recurring problem in biomedical informatics. Determining the content of a string involves identifying its meaningful constituents (morphemes). One current method of identifying these constituents is to look them up in a preexisting table (lexicon). However, manual construction of lexicons and grammars in complex domains such as biomedicine is extremely laborious. As an alternative to the lexico-grammatical approach, we introduce a segmentation algorithm that automatically learns lexical and structural preferences from corpora via information compression. The method is based on the Minimum Description Length (MDL) principle from classic information theory.
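To make the MDL idea concrete, the following is a minimal sketch of a two-part description length: the bits needed to spell out a morpheme lexicon plus the bits needed to encode the corpus with that lexicon. It is an illustration under stated assumptions, not the authors' algorithm. The function name `description_length`, the fixed cost of 5 bits per character, and the four toy words are all hypothetical choices made for this example.

```python
import math
from collections import Counter


def description_length(segmented_corpus, bits_per_char=5.0):
    """Return a two-part MDL cost in bits for one candidate segmentation.

    segmented_corpus: list of words, each word a list of morpheme strings.
    bits_per_char: assumed flat cost of writing one character in the lexicon.
    """
    # Count how often each morpheme is used across the whole corpus.
    counts = Counter(m for word in segmented_corpus for m in word)
    total = sum(counts.values())

    # Part 1: cost of the lexicon. Each distinct morpheme is spelled out
    # once, with one extra symbol assumed as an end-of-entry marker.
    lexicon_bits = sum((len(m) + 1) * bits_per_char for m in counts)

    # Part 2: cost of the corpus given the lexicon. Each morpheme token
    # costs -log2 of its relative frequency (an idealized code length).
    corpus_bits = -sum(c * math.log2(c / total) for c in counts.values())

    return lexicon_bits + corpus_bits


# Hypothetical toy corpus: the same four words, unsegmented and segmented.
unsegmented = [["gastritis"], ["hepatitis"], ["gastric"], ["hepatic"]]
segmented = [["gastr", "itis"], ["hepat", "itis"], ["gastr", "ic"], ["hepat", "ic"]]

print(description_length(unsegmented))
print(description_length(segmented))
```

Under these assumptions the unsegmented corpus costs 188 bits and the segmented one 116 bits, so the segmentation that reuses shared stems and suffixes compresses better. A learner built on this principle would search over candidate segmentations for the one with the lowest total cost.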

Available Versions

CiteSeerX

oai:CiteSeerX.psu:10.1.1.658.5...

Last time updated on 29/10/2017