Semi-Supervised Named Entity Recognition: Learning to Recognize 100 Entity Types with Little Supervision

By David Nadeau

Abstract

Named Entity Recognition (NER) aims to extract and to classify rigid designators in text such as proper names, biological species, and temporal expressions. There has been growing interest in this field of research since the early 1990s. In this thesis, we document a trend moving away from handcrafted rules, and towards machine learning approaches. Still, recent machine learning approaches have a problem with annotated data availability, which is a serious shortcoming in building and maintaining large-scale NER systems.

In this thesis, we present an NER system built with very little supervision. Human supervision is indeed limited to listing a few examples of each named entity (NE) type. First, we introduce a proof-of-concept semi-supervised system that can recognize four NE types. Then, we expand its capacities by improving key technologies, and we apply the system to an entire hierarchy comprised of 100 NE types.

Our work makes the following contributions: the creation of a proof-of-concept semi-supervised NER system; the demonstration of an innovative noise filtering technique for generating NE lists; the validation of a strategy for learning disambiguation rules using automatically identified, unambiguous NEs; and finally, the development of an acronym detection algorithm, thus solving a rare but very difficult problem in alias resolution.

We believe semi-supervised learning techniques are about to break new ground in the machine learning community. In this thesis, we show that limited supervision can build complete NER systems. On standard evaluation corpora, we report performances that compare to baseline supervised systems in the task of annotating NEs in texts.

Topics: Language, Machine Learning, Artificial Intelligence
Year: 2007
OAI identifier: oai:cogprints.org:5859
Full text:
  • http://cogprints.org/5859/1/Th...
  • http://cogprints.org/5859/

