11 research outputs found

    Introduction to the CoNLL-2003 Shared Task: Language-Independent Named Entity Recognition

    We describe the CoNLL-2003 shared task: language-independent named entity recognition. We give background information on the data sets (English and German) and the evaluation method, present a general overview of the systems that have taken part in the task and discuss their performance.

    Low-rank regularization for high-dimensional sparse conjunctive feature spaces in information extraction

    Version with two sections removed, owing to publisher rights.
    One of the challenges in Natural Language Processing (NLP) is the unstructured nature of texts, in which useful information is not easily identifiable. Information Extraction (IE) aims to alleviate this by enabling automatic extraction of structured information from such text sources. The resulting structured information facilitates easier querying, organizing, and analyzing of data from texts. In this thesis, we are interested in two IE-related tasks: (i) named entity classification and (ii) template filling. Specifically, this thesis examines the problem of learning classifiers of text spans and explores its application to extracting named entities and template slot-fillers. In general, our goal is to construct a method to learn classifiers that: (i) require less supervision, (ii) work well with high-dimensional sparse feature spaces and (iii) are able to classify unseen items (i.e. named entities/slot-fillers not observed in training data). The key idea of our contribution is the utilization of unseen conjunctive features. A conjunctive feature is a combination of features from different feature sets. For example, to classify a phrase, one might have one feature set for the context and another set for the phrase itself. When learning a classifier, only a fraction of these conjunctive features will be observed in the training set, leaving the rest (i.e. unseen features) unusable for predicting items at test time. We hypothesize that utilizing such unseen conjunctions is useful to address all of the aspects of the goal. We develop a general regularization framework specifically designed for sparse conjunctive feature spaces. Our strategy is based on employing tensors to represent the conjunctive feature space, and forcing the model to induce low-dimensional embeddings of the feature vectors via low-rank regularization on the tensor parameters. Such a compressed representation will help prediction by generalizing to novel examples where most of the conjunctions will be unseen in the training set. We conduct experiments on learning named entity classifiers and template filling, focusing on extracting unseen items. We show that when learning classifiers under minimal supervision, our approach is more effective in controlling model capacity than standard techniques for linear classification.
    Postprint (published version)
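The idea of conjunctive features with low-rank parameters can be illustrated with a small sketch. It is not the thesis's implementation: it assumes only two feature sets (phrase and context), so the parameter tensor reduces to a matrix `W`, and it assumes the nuclear norm as the low-rank penalty, which is a common convex surrogate for rank but is not named in the abstract. All function and variable names are hypothetical.

```python
import numpy as np

def conjunctive_features(phrase_feats, context_feats):
    """One conjunctive feature per (phrase feature, context feature) pair."""
    return np.outer(phrase_feats, context_feats)

def full_score(W, phrase_feats, context_feats):
    """Linear score over all conjunctions: sum_ij W[i, j] * phrase[i] * context[j]."""
    return float(np.sum(W * conjunctive_features(phrase_feats, context_feats)))

def low_rank_score(U, V, phrase_feats, context_feats):
    """Same score when W = U @ V.T: a dot product of two k-dimensional embeddings."""
    return float((U.T @ phrase_feats) @ (V.T @ context_feats))

def nuclear_norm(W):
    """Sum of singular values of W (assumed penalty; a convex surrogate for rank)."""
    return float(np.linalg.norm(W, "nuc"))

# Toy dimensions (illustrative only).
rng = np.random.default_rng(0)
n_phrase, n_context, rank = 6, 5, 2
U = rng.normal(size=(n_phrase, rank))   # embeds phrase features
V = rng.normal(size=(n_context, rank))  # embeds context features
W = U @ V.T                             # weight for every conjunction, rank <= 2

# Binary indicator features for one candidate phrase and its context.
phrase = np.zeros(n_phrase)
phrase[[1, 4]] = 1.0
context = np.zeros(n_context)
context[[0, 3]] = 1.0

print(full_score(W, phrase, context))
print(low_rank_score(U, V, phrase, context))
print(nuclear_norm(W))
```

Because every entry of `W` is determined by the shared factors `U` and `V`, a conjunction that never occurred in training still receives a weight derived from the embeddings of its two component features, which is the generalization effect the abstract describes.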

    Memory-Based Named Entity Recognition using Unannotated Data

    We used the memory-based learner Timbl (Daelemans et al., 2002) to find names in English and German newspaper text. A first system used only the training data and a number of gazetteers. The results show that gazetteers are not beneficial in the English case, while they are for the German data. Type-token generalization was applied, but it also reduced performance.

    Domain-specific question answering system : an application to the construction sector

    Master's thesis digitized by the Direction des bibliothèques de l'Université de Montréal.

    Le repérage automatique des entités nommées dans la langue arabe : vers la création d'un système à base de règles

    Master's thesis digitized by the Division de la gestion de documents et des archives de l'Université de Montréal.

    Unsupervised relation extraction for e-learning applications

    In this modern era many educational institutes and business organisations are adopting the e-Learning approach as it provides an effective method for educating and testing their students and staff. The continuous development in the area of information technology and increasing use of the internet has resulted in a huge global market and rapid growth for e-Learning. Multiple Choice Tests (MCTs) are a popular form of assessment and are quite frequently used by many e-Learning applications as they are well adapted to assessing factual, conceptual and procedural information. In this thesis, we present an alternative to the lengthy and time-consuming activity of developing MCTs by proposing a Natural Language Processing (NLP) based approach that relies on semantic relations extracted using Information Extraction to automatically generate MCTs. Information Extraction (IE) is an NLP field used to recognise the most important entities present in a text, and the relations between those concepts, regardless of their surface realisations. In IE, text is processed at a semantic level that allows the partial representation of the meaning of a sentence to be produced. IE has two major subtasks: Named Entity Recognition (NER) and Relation Extraction (RE). In this work, we present two unsupervised RE approaches (surface-based and dependency-based). The aim of both approaches is to identify the most important semantic relations in a document without assigning explicit labels to them in order to ensure broad coverage, unrestricted to predefined types of relations. In the surface-based approach, we examined different surface pattern types, each implementing different assumptions about the linguistic expression of semantic relations between named entities while in the dependency-based approach we explored how dependency relations based on dependency trees can be helpful in extracting relations between named entities. 
Our findings indicate that the presented approaches are capable of achieving high precision rates. Our experiments make use of traditional, manually compiled corpora along with similar corpora automatically collected from the Web. We found that an automatically collected web corpus is still unable to ensure the same level of topic relevance as attained in manually compiled traditional corpora. Comparison between the surface-based and the dependency-based approaches revealed that the dependency-based approach performs better. Our research enabled us to automatically generate questions regarding the important concepts present in a domain by relying on unsupervised relation extraction approaches, as extracted semantic relations allow us to identify key information in a sentence. The extracted patterns (semantic relations) are then automatically transformed into questions. In the surface-based approach, questions are automatically generated from sentences matched by the extracted surface-based semantic patterns, relying on a certain set of rules. Conversely, in the dependency-based approach, questions are automatically generated by traversing the dependency tree of the extracted sentence matched by the dependency-based semantic patterns. The MCQ systems produced from these surface-based and dependency-based semantic patterns were extrinsically evaluated by two domain experts in terms of question and distractor readability, usefulness of semantic relations, relevance, acceptability of questions and distractors, and overall MCQ usability. The evaluation results revealed that the MCQ system based on dependency-based semantic relations performed better than the surface-based one. A major outcome of this work is an integrated system for MCQ generation that has been evaluated by potential end users.
    EThOS - Electronic Theses Online Service (GB, United Kingdom)

    Corpus-adaptive Named Entity Recognition

    Named Entity Recognition (NER) is an important step towards the automatic analysis of natural language and is needed for a range of natural language applications. The task of NER requires the recognition and classification of proper names and other unique identifiers according to a predefined category system, e.g. the “traditional” categories PERSON, ORGANIZATION (companies, associations) and LOCATION. While most previous work deals with the recognition of these traditional categories within English newspaper texts, the approach presented in this thesis goes beyond that scope. The approach is particularly motivated by NER settings that are more challenging than the classical task, such as German text or the identification of biomedical entities within scientific texts. Additionally, the approach addresses the ease of development and maintainability of NER services by emphasizing the need for “corpus-adaptive” systems, with “corpus-adaptivity” describing whether a system can be easily adapted to new tasks and to new text corpora. In order to implement such a corpus-adaptive system, three design guidelines are proposed: (i) the consistent use of machine-learning techniques instead of manually created linguistic rules; (ii) a strictly data-oriented modelling of the phenomena instead of a generalization based on intellectual categories; (iii) the use of automatically extracted knowledge about Named Entities, gained by analysing large amounts of raw text. A prototype was implemented according to these guidelines and its evaluation shows the feasibility of the approach. The system, originally developed for a German newspaper corpus, could easily be adapted and applied to the extraction of biomedical entities within scientific abstracts written in English, which demonstrated the corpus-adaptivity of the approach. Despite its limited resources in comparison with other state-of-the-art systems, the prototype achieved competitive results for some of the categories.