3 research outputs found

    Learn-filter-apply-forget. Mixed approaches to named entity recognition

    Full text link
    We have explored and implemented different approaches to named entity recognition in German, a difficult task in this language since both regular nouns and proper names are capitalized. Our goal is to identify and recognise per- son names, geographical names and company names in a computer magazine corpus. Our geographical name classifier works with precompiled lists but our company name classifier learns the names from the corpus. For the recognition of person names we work with a precompiled list of first names and the program learns the last names. For this classifier we suggest setting an activation value for the last name and subsequently depriming the value until “forgetting” the name. Our evaluation results show that our mixed approaches are as good as the recall and precision values reported for English. It is shown that a carefully tuned cascade of name classifiers can even distinguish between different interpretations of a name token within the same document

    Corpus-adaptive Named Entity Recognition

    Get PDF
    Named Entity Recognition (NER) is an important step towards the automatic analysis of natural language and is needed for a series of natural language applications. The task of NER requires the recognition and classification of proper names and other unique identifiers according to a predefined category system, e.g. the “traditional” categories PERSON, ORGANIZATION (companies, associations) and LOCATION. While most of the previous work deals with the recognition of these traditional categories within English newspaper texts, the approach presented in this thesis is beyond that scope. The approach is particularly motivated by NER which is more challenging than the classical task, such as German, or the identification of biomedical entities within scientific texts. Additionally, the approach addresses the ease-of-development and maintainability of NER-services by emphasizing the need for “corpus-adaptive” systems, with “corpus-adaptivity” describing whether a system can be easily adapted to new tasks and to new text corpora. In order to implement such a corpus-adaptive system, three design guidelines are proposed: (i) the consequent use of machine-learning techniques instead of manually created linguistic rules; (ii) a strict data-oriented modelling of the phenomena instead of a generalization based on intellectual categories; (iii) the usage of automatically extracted knowledge about Named Entities, gained by analysing large amounts of raw texts. A prototype was implemented according to these guidelines and its evaluation shows the feasibility of the approach. The system originally developed for a German newspaper corpus could easily be adapted and applied to the extraction of biomedical entities within scientific abstracts written in English and therefore gave proof of the corpus-adaptivity of the approach. Despite the limited resources in comparison with other state-of-the-art systems, the prototype scored competitive results for some of the categories
    corecore