5 research outputs found

    Automatic rule learning exploiting morphological features for named entity recognition in Turkish

    Named entity recognition (NER) is one of the basic tasks in automatic extraction of information from natural language texts. In this paper, we describe an automatic rule learning method that exploits different features of the input text to identify the named entities it contains. Moreover, we explore the use of morphological features for extracting named entities from Turkish texts. We believe that the developed system can also be used for other agglutinative languages. The paper also provides a comprehensive overview of the field by reviewing the NER research literature. We conducted our experiments on the TurkIE dataset, a corpus of articles collected from different Turkish newspapers. Our method achieved an average F-score of 91.08% on the dataset. The results of the comparative experiments demonstrate that the developed technique is successfully applicable to the task of automatic NER and that exploiting morphological features can significantly improve NER for Turkish, an agglutinative language. © The Author(s) 2011

    Semi-Supervised Named Entity Recognition: Learning to Recognize 100 Entity Types with Little Supervision

    Named Entity Recognition (NER) aims to extract and to classify rigid designators in text such as proper names, biological species, and temporal expressions. There has been growing interest in this field of research since the early 1990s. In this thesis, we document a trend moving away from handcrafted rules, and towards machine learning approaches. Still, recent machine learning approaches have a problem with annotated data availability, which is a serious shortcoming in building and maintaining large-scale NER systems. In this thesis, we present an NER system built with very little supervision. Human supervision is indeed limited to listing a few examples of each named entity (NE) type. First, we introduce a proof-of-concept semi-supervised system that can recognize four NE types. Then, we expand its capacities by improving key technologies, and we apply the system to an entire hierarchy comprised of 100 NE types. Our work makes the following contributions: the creation of a proof-of-concept semi-supervised NER system; the demonstration of an innovative noise filtering technique for generating NE lists; the validation of a strategy for learning disambiguation rules using automatically identified, unambiguous NEs; and finally, the development of an acronym detection algorithm, thus solving a rare but very difficult problem in alias resolution. We believe semi-supervised learning techniques are about to break new ground in the machine learning community. In this thesis, we show that limited supervision can build complete NER systems. On standard evaluation corpora, we report performances that compare to baseline supervised systems in the task of annotating NEs in texts.

    Diseño y generación semi-automática de patrones adaptables para el reconocimiento de entidades

    The task of Named Entity Recognition (NER) facilitates information management and is useful in other areas such as Semantic Annotation, Question Answering, Ontology Population and Opinion Mining. According to the results of some evaluation forums, however, NER may be considered a solved task. This dissertation examines these evaluations and shows that the field seems stuck on the recognition of typical entities, for which annotated resources are usually available. This contrasts with the current diversity of entity types and domains of application. The main contribution of this work is the design of an entity recognition method that is more consistent with the lack of annotated corpora for any required type of entity and in any domain. The designed method integrates the following aspects: Transparency: readable patterns with a high level of standardization. Flexibility: the possibility of incorporating different types of features capable of describing entities or their context. Power: recognition of different language structures within documents. Cost: use of a small set of entities as initial seeds and active learning techniques to guide the user through the annotation process. Effectiveness: competitive effectiveness rates compared to the state of the art in terms of precision and recall. The method is evaluated with two public annotated corpora with different types of entities, and compared with related works found in the scientific literature.

    Cluster Analysis and Classification of Named Entities

    This paper presents a statistics-based and language independent unsupervised approach for clustering possible named entities. We describe and motivate the features and statistical filters used by our clustering process. Using the Model-Based Clustering Analysis software we obtained different clusters of named entities. The method was applied to Bulgarian and English. For some clusters, precision is close to 100%; this helps human validation and saves time. Other clusters still need further refinement. Based on the obtained clusters, it is possible to classify new named entities.