The paper discusses, at the lexical level, the integration of heuristic solutions into a lexicon-based and rule-governed system for the automatic analysis of unrestricted Portuguese text. In particular, a morphology-based analytic approach to lexical heuristics is presented and evaluated. The tagger involved uses a 50,000-entry base form lexicon as well as lexica of prefixes, suffixes and inflexion endings to assign part-of-speech and other morphological tags to every wordform in the text, with recall rates between 99.6% and 99.7%. Multiple readings are subsequently disambiguated by grammatical rules formulated in the Constraint Grammar formalism. On the next level of analysis, tags for syntactic form and function alternatives are mapped onto the wordforms and disambiguated in a similar way. In spite of using a highly differentiated tag set, the parser yields correctness rates, on running unrestricted and unknown text, of over 99% for morphology/PoS and 97-98% for syntax. After compilation, the system runs at about 200 words/sec on a 200 MHz Pentium-based Linux system when using all levels. Morphological and PoS disambiguation alone approaches 2000 words/sec. A test site with a variety of applications (parsing, corpus searches, interactive grammar teaching and experimental MT) has been established a
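The two-stage idea described above (assign all possible readings from a lexicon, fall back on affix-based heuristics for unknown wordforms, then discard readings by contextual rules) can be illustrated with a minimal Python sketch. This is not the system's implementation: the toy lexicon, the suffix table, the tag names, the noun fallback and the single rule are all hypothetical, and the rule only imitates the spirit of a Constraint Grammar REMOVE rule.

```python
from typing import Dict, List, Set, Tuple

# Hypothetical toy lexicon (the real system uses a 50,000-entry base form lexicon).
LEXICON: Dict[str, Set[str]] = {
    "uma": {"DET"},
    "casa": {"N", "V"},   # noun 'house' or a form of the verb 'casar'
    "canta": {"V"},
}

# Hypothetical suffix heuristics for wordforms that are not in the lexicon.
SUFFIX_GUESSES: List[Tuple[str, Set[str]]] = [
    ("mente", {"ADV"}),
    ("ção", {"N"}),
    ("ar", {"V"}),
]


def analyse(word: str) -> Set[str]:
    """Return all candidate tags for a wordform (lexicon first, then heuristics)."""
    w = word.lower()
    if w in LEXICON:
        return set(LEXICON[w])
    for suffix, tags in SUFFIX_GUESSES:
        if w.endswith(suffix) and len(w) > len(suffix):
            return set(tags)
    return {"N"}  # assumed fallback: treat an unanalysable word as a noun


def disambiguate(cohorts: List[Set[str]]) -> List[Set[str]]:
    """Apply one illustrative contextual rule: drop a V reading directly
    after an unambiguous DET, but never remove a word's last reading."""
    result = [set(c) for c in cohorts]
    for i in range(1, len(result)):
        if result[i - 1] == {"DET"} and "V" in result[i] and len(result[i]) > 1:
            result[i].discard("V")
    return result


def tag(sentence: str) -> List[Tuple[str, Set[str]]]:
    """Tag a whitespace-tokenised sentence with the remaining readings."""
    words = sentence.split()
    return list(zip(words, disambiguate([analyse(w) for w in words])))
```

For instance, `tag("uma casa")` keeps only the noun reading of "casa", while `analyse("rapidamente")` is resolved by the suffix heuristic alone. A real rule set would contain many such constraints, applied with much richer context conditions.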