16 research outputs found

    Determining the Unithood of Word Sequences using Mutual Information and Independence Measure

    Full text link
    Most works related to unithood were conducted as part of a larger effort for the determination of termhood. Consequently, the number of independent research that study the notion of unithood and produce dedicated techniques for measuring unithood is extremely small. We propose a new approach, independent of any influences of termhood, that provides dedicated measures to gather linguistic evidence from parsed text and statistical evidence from Google search engine for the measurement of unithood. Our evaluations revealed a precision and recall of 98.68% and 91.82% respectively with an accuracy at 95.42% in measuring the unithood of 1005 test cases.Comment: More information is available at http://explorer.csse.uwa.edu.au/reference

    Modeling Noisiness to Recognize Named Entities using Multitask Neural Networks on Social Media

    Full text link
    Recognizing named entities in a document is a key task in many NLP applications. Although current state-of-the-art approaches to this task reach a high performance on clean text (e.g. newswire genres), those algorithms dramatically degrade when they are moved to noisy environments such as social media domains. We present two systems that address the challenges of processing social media data using character-level phonetics and phonology, word embeddings, and Part-of-Speech tags as features. The first model is a multitask end-to-end Bidirectional Long Short-Term Memory (BLSTM)-Conditional Random Field (CRF) network whose output layer contains two CRF classifiers. The second model uses a multitask BLSTM network as feature extractor that transfers the learning to a CRF classifier for the final prediction. Our systems outperform the current F1 scores of the state of the art on the Workshop on Noisy User-generated Text 2017 dataset by 2.45% and 3.69%, establishing a more suitable approach for social media environments.Comment: NAACL 201

    Szófaji kódok és névelemek együttes osztályozása

    Get PDF
    Jelen munkánkban egy, a szófaji kódok és a névelemek meghatározására szolgáló gépi tanulási modellt mutatunk be. Az általános véletlen mezkön alapuló módszer segítségével több címkesorozat együttesen tanulható, valamint az osztályozás során a címkesorozatok legjobb kombinációját együttesen keressük. A magyarlanc szófaji elemz és az SZTENER névelem-felismer jellemzkészletét használva olyan rendszert építettünk, amely a címkék együttes osztályozásának segítségével felülmúlta a kiindulási rendszereket az általunk használt teszthalmazon. A névelem-felismer F-mértékben mért teljesítménye 87,75-rl 89,87-re, a szófaji címkéz pontossága 97,11%-ról 97,99%-ra nőtt, úgy, hogy a kódok meghatározásának más minségi tényezi is javultak

    SVMs for the temporal expression chunking problem

    Get PDF
    This technical report provides a description of the Temporal Expression Chunking problem, which consists in finding the mentions of temporal expressions in free text documents, and reports on the results obtained using SVM classifiers to tackle this task. We first define the task in detail, then present a brief survey of relevant literature, state our approach and finally present the results of several experiments. These include a comparison of the correct recognition rate achieved using different polynomial kernel degrees, sets of features, and sizes of the context window. We use 5-fold cross-validation on the ACE 2005 corpus for training and for testing.Postprint (published version

    Named Entity Recognition in Turkish with Bayesian Learning and Hybrid Approaches

    Get PDF
    Named entity recognition is one of the significant textual information extraction tasks. In this paper, we present two approaches for named entity recognition on Turkish texts. The first is a Bayesian learning approach which is trained on a considerably limited training set. The second approach comprises two hybrid systems based on joint utilization of this Bayesian learning approach and a previously proposed rule-based named entity recognizer. All of the proposed three approaches achieve promising performance rates. This paper is significant as it reports the first use of the Bayesian approach for the task of named entity recognition on Turkish texts for which especially practical approaches are still insufficient

    Cross-sentence contexts in Named Entity Recognition with BERT

    Get PDF
    Named entity recognition (NER) is a task under the broader scope of Natural Language Processing (NLP). The computational task of NER is often cast as a sequence classification task where the goal is to label each word (or token) in the input sequence with a class from a predefined set of classes. The development of deep transfer learning methodologies in recent years has greatly influenced both NLP and NER. There have been improvements in the performance of NER models but at the same time the use of cross-sentence context, the sentences around the sentence of interest, has diminished in NER methods. Many of the current methods use inputs that consist of only one sentence of text at a time. It is nevertheless clear that useful information for NER is often found also elsewhere in text. Recent self-attention models like BERT can both capture long-distance relationships in input and represent inputs consisting of several sentences. This creates opportunities for making use of cross-sentence information in NLP tasks. This thesis presents a systematic study exploring the use of cross-sentence information for NER using BERT models in five languages. The study shows that adding context as additional sentences to BERT input systematically increases NER performance. Adding multiple sentences in input samples also allows the study of predictions for the sentences in different contexts. A straightforward method of Contextual Majority Voting (CMV) is proposed to combine these different predictions. The study demonstrates that using CMV increases NER performance even further. Evaluation of the proposed methods on established datasets, including the Conference on Computational Natural Language Learning CoNLL'02 and CoNLL'03 NER benchmarks, demonstrates that the proposed approach can improve on the state-of-the-art NER results for English, Dutch, and Finnish, achieves the best reported BERT-based results for German, and is on par with other BERT-based approaches for Spanish. The methods implemented for this work are published under open licenses

    Named entity recognition using hundreds of thousands of features

    No full text

    Low-rank regularization for high-dimensional sparse conjunctive feature spaces in information extraction

    Get PDF
    Versió amb dues seccions retallades, per drets de l'editorOne of the challenges in Natural Language Processing (NLP) is the unstructured nature of texts, in which useful information is not easily identifiable. Information Extraction (IE) aims to alleviate it by enabling automatic extraction of structured information from such text sources. The resulting structured information will facilitate easier querying, organizing, and analyzing of data from texts. In this thesis, we are interested in two IE related tasks: (i) named entity classification and (ii) template filling. Specifically, this thesis examines the problem of learning classifiers of text spans and explore its application for extracting named entities and template slot-fillers. In general, our goal is to construct a method to learn classifiers that: (i) require less supervision, (ii) work well with high-dimensional sparse feature spaces and (iii) are able to classify unseen items (i.e. named entities/slot-fillers not observed in training data). The key idea of our contribution is the utilization of unseen conjunctive features. A conjunctive feature is a combination of features from different feature sets. For example, to classify a phrase, one might have one feature set for the context and another set for the phrase itself. When learning a classifier, only a factor of these conjunctive features will be observed in the training set, leaving the rest (i.e. unseen features) unusable for predicting items in test time. We hypothesize that utilizing such unseen conjunctions is useful to address all of the aspects of the goal. We develop a general regularization framework specifically designed for sparse conjunctive feature spaces. Our strategy is based on employing tensors to represent the conjunctive feature space, and forcing the model to induce low-dimensional embeddings of the feature vectors via low-rank regularization on the tensor parameters. Such compressed representation will help prediction by generalizing to novel examples where most of the conjunctions will be unseen in the training set. We conduct experiments on learning named entity classifiers and template filling, focusing on extracting unseen items. We show that when learning classifiers under minimal supervision, our approach is more effective in controlling model capacity than standard techniques for linear classification.Uno de los retos en Procesamiento del Lenguaje Natural (NLP, del inglés Natural Language Processing) es la naturaleza no estructurada del texto, que hace que la información útil y relevante no sea fácilmente identificable. Los métodos de Extracción de Información (IE, del inglés Information Extraction) afrontan este problema mediante la extracción automática de información estructurada de dichos textos. La estructura resultante facilita la búsqueda, la organización y el análisis datos textuales. Esta tesis se centra en dos tareas relacionadas dentro de IE: (i) clasificación de entidades nombradas (NEC, del inglés Named Entity Classification), y (ii) rellenado de plantillas (en inglés, template filling). Concretamente, esta tesis estudia el problema de aprender clasificadores de secuencias textuales y explora su aplicación a la extracción de entidades nombradas y de valores para campos de plantillas. El objetivo general es desarrollar un método para aprender clasificadores que: (i) requieran poca supervisión; (ii) funcionen bien en espacios de características de alta dimensión y dispersión; y (iii) sean capaces de clasificar elementos nunca vistos (por ejemplo entidades o valores de campos que no hayan sido vistos en fase de entrenamiento). La idea principal de nuestra contribución es la utilización de características conjuntivas que no aparecen en el conjunto de entrenamiento. Una característica conjuntiva es una conjunción de características elementales. Por ejemplo, para clasificar la mención de una entidad en una oración, se utilizan características de la mención, del contexto de ésta, y a su vez conjunciones de los dos grupos de características. Cuando se aprende un clasificador en un conjunto de entrenamiento concreto, sólo se observará una fracción de estas características conjuntivas, dejando el resto (es decir, características no vistas) sin ser utilizado para predecir elementos en fase de evaluación y explotación del modelo. Nuestra hipótesis es que la utilización de estas conjunciones nunca vistas pueden ser potencialmente muy útiles, especialmente para reconocer entidades nuevas. Desarrollamos un marco de regularización general específicamente diseñado para espacios de características conjuntivas dispersas. Nuestra estrategia se basa en utilizar tensores para representar el espacio de características conjuntivas y obligar al modelo a inducir "embeddings" de baja dimensión de los vectores de características vía regularización de bajo rango en los parámetros de tensor. Dicha representación comprimida ayudará a la predicción, generalizando a nuevos ejemplos donde la mayoría de las conjunciones no han sido vistas durante la fase de entrenamiento. Presentamos experimentos sobre el aprendizaje de clasificadores de entidades nombradas, y clasificadores de valores en campos de plantillas, centrándonos en la extracción de elementos no vistos. Demostramos que al aprender los clasificadores bajo mínima supervisión, nuestro enfoque es más efectivo en el control de la capacidad del modelo que las técnicas estándar para la clasificación linealPostprint (published version
    corecore