Introduction to the special issue on annotated corpora
Annotated corpora are increasingly important for linguistic scholarship, science and technology. This special issue briefly surveys the development of the field and points to challenges within the current framework of annotation using analytical categories, as well as challenges to the framework itself. It presents three articles: one concerning the evaluation of annotation quality, and two concerning French treebanks. Of the latter, one deals with the oldest treebank project for French, the French Treebank, and the other concerns the conversion of French corpora into the cross-lingual framework of Universal Dependencies, thus offering an illustration of the history of treebank development worldwide.
Durham - a word sense disambiguation system
Ever since the 1950s, when Machine Translation first began to be developed, word sense disambiguation (WSD) has been considered a problem for developers. More recently, all NLP tasks that are sensitive to lexical semantics potentially benefit from WSD, although to what extent is largely unknown. The thesis presents a novel approach to the task of WSD on a large scale. In particular, a novel knowledge source named contextual information is presented. This knowledge source adopts a sub-symbolic training mechanism to learn information from the context of a sentence that can aid disambiguation. The system also takes advantage of frequency information, and these two knowledge sources are combined. The system is trained and tested on SEMCOR. A novel disambiguation algorithm is also developed. The algorithm must tackle the problem of the large number of possible sense combinations in a sentence, and it aims to strike an appropriate balance between accuracy and efficiency. This is done by directing the search at the word level. The performance achieved on SEMCOR is reported and an analysis of the various components of the system is performed. The results achieved on this test data are pleasing, but are difficult to compare with most of the other work carried out in the field. For this reason the system took part in the SENSEVAL evaluation, which provided an excellent opportunity to compare WSD systems extensively. SENSEVAL is a small-scale WSD evaluation using the HECTOR lexicon. Despite this, few adaptations to the system were required. The performance of the system on the SENSEVAL task is reported and has also been presented in [Hawkins, 2000].
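As a rough illustration of the idea of combining frequency information with a contextual score and searching word by word rather than over every sense combination, a minimal Python sketch might look like the following. It is not the thesis's algorithm: the function `disambiguate`, the `weight` parameter, the sense labels and the toy `context_score` are all hypothetical, and the learned sub-symbolic contextual knowledge source is replaced here by a hand-written stand-in.

```python
from typing import Callable, Dict, List


def disambiguate(
    words: List[str],
    senses: Dict[str, List[str]],
    freq: Dict[str, int],
    context_score: Callable[[str, List[str]], float],
    weight: float = 0.5,
) -> Dict[str, str]:
    """Choose one sense per word, one word at a time.

    Working at the word level avoids enumerating every sense
    combination in the sentence, which grows multiplicatively.
    All names and the linear weighting are illustrative assumptions.
    """
    chosen: Dict[str, str] = {}
    for word in words:
        candidates = senses.get(word, [])
        if not candidates:
            continue  # word not in the (hypothetical) sense inventory
        # Normalise raw sense counts into relative frequencies.
        total = sum(freq.get(s, 0) for s in candidates) or 1

        def combined(sense: str) -> float:
            relative_freq = freq.get(sense, 0) / total
            return weight * relative_freq + (1 - weight) * context_score(sense, words)

        chosen[word] = max(candidates, key=combined)
    return chosen


# Toy data, invented purely for this example.
SENSES = {"bank": ["bank%finance", "bank%shore"], "river": ["river%stream"]}
FREQ = {"bank%finance": 8, "bank%shore": 2, "river%stream": 5}


def toy_context_score(sense: str, sentence: List[str]) -> float:
    """Stand-in for a learned contextual model: favours the shore sense near 'river'."""
    return 1.0 if sense == "bank%shore" and "river" in sentence else 0.0


with_context = disambiguate(["river", "bank"], SENSES, FREQ, toy_context_score)
without_context = disambiguate(["bank"], SENSES, FREQ, toy_context_score)
```

With the toy data, frequency alone favours the financial sense of "bank", while the presence of "river" tips the combined score toward the shore sense. A real system would need a trained contextual model and a sense inventory such as the one used to annotate SEMCOR.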