141 research outputs found

    On the Use of Parsing for Named Entity Recognition

    Get PDF
    [Abstract] Parsing is a core natural language processing technique that can be used to obtain the structure underlying sentences in human languages. Named entity recognition (NER) is the task of identifying the entities that appear in a text. NER is a challenging natural language processing task that is essential to extract knowledge from texts in multiple domains, ranging from financial to medical. It is intuitive that the structure of a text can be helpful to determine whether or not a certain portion of it is an entity and if so, to establish its concrete limits. However, parsing has been a relatively little-used technique in NER systems, since most of them have chosen to consider shallow approaches to deal with text. In this work, we study the characteristics of NER, a task that is far from being solved despite its long history; we analyze the latest advances in parsing that make its use advisable in NER settings; we review the different approaches to NER that make use of syntactic information; and we propose a new way of using parsing in NER based on casting parsing itself as a sequence labeling task.Xunta de Galicia; ED431C 2020/11Xunta de Galicia; ED431G 2019/01This work has been funded by MINECO, AEI and FEDER of UE through the ANSWER-ASAP project (TIN2017-85160-C2-1-R); and by Xunta de Galicia through a Competitive Reference Group grant (ED431C 2020/11). CITIC, as Research Center of the Galician University System, is funded by the Consellería de Educación, Universidade e Formación Profesional of the Xunta de Galicia through the European Regional Development Fund (ERDF/FEDER) with 80%, the Galicia ERDF 2014-20 Operational Programme, and the remaining 20% from the Secretaría Xeral de Universidades (Ref. ED431G 2019/01). Carlos Gómez-Rodríguez has also received funding from the European Research Council (ERC), under the European Union’s Horizon 2020 research and innovation programme (FASTPARSE, Grant No. 714150)

    Improving Neural Sequence Labelling Using Additional Linguistic Information

    Get PDF
    Sequence Labelling is the task of mapping sequential data from one domain to another domain. As we can interpret language as a sequence of words, sequence labelling is very common in the field of Natural Language Processing (NLP). In NLP, some fundamental sequence labelling tasks are Parts-of-Speech Tagging, Named Entity Recognition, Chunking, etc. Moreover, many NLP tasks can be modeled as sequence labelling or sequence to sequence labelling such as machine translation, information retrieval and question answering. An extensive amount of research has already been performed on sequence labelling. Most of the current high performing models are neural network models. These Deep Learning based models are outperforming traditional machine learning techniques by using abstract high dimensional feature representations of the input data. In this thesis, we propose a new neural sequence model which uses several additional types of linguistic information to improve the model performance. The convergence rate of the proposed model is significantly less than similar models. Moreover, our model obtains state of the art results on the benchmark datasets of POS, NER, and chunking

    PersoNER: Persian named-entity recognition

    Full text link
    © 1963-2018 ACL. Named-Entity Recognition (NER) is still a challenging task for languages with low digital resources. The main difficulties arise from the scarcity of annotated corpora and the consequent problematic training of an effective NER pipeline. To abridge this gap, in this paper we target the Persian language that is spoken by a population of over a hundred million people world-wide. We first present and provide ArmanPerosNERCorpus, the first manually-annotated Persian NER corpus. Then, we introduce PersoNER, an NER pipeline for Persian that leverages a word embedding and a sequential max-margin classifier. The experimental results show that the proposed approach is capable of achieving interesting MUC7 and CoNNL scores while outperforming two alternatives based on a CRF and a recurrent neural network

    Learning Chinese language structures with multiple views

    Get PDF
    Motivated by the inadequacy of single view approaches in many areas in NLP, we study multi-view Chinese language processing, including word segmentation, part-of-speech (POS) tagging, syntactic parsing and semantic role labeling (SRL), in this thesis. We consider three situations of multiple views in statistical NLP: (1) Heterogeneous computational models have been designed for a given problem; (2) Heterogeneous annotation data is available to train systems; (3) Supervised and unsupervised machine learning techniques are applicable. First, we comparatively analyze successful single view approaches for Chinese lexical, syntactic and semantic processing. Our analysis highlights the diversity between heterogenous systems built on different views, and motivates us to improve the state-of-the-art by combining or integrating heterogeneous approaches. Second, we study the annotation ensemble problem, i.e. learning from multiple data sets under different annotation standards. We propose a series of generalized stacking models to effectively utilize heterogeneous labeled data to reduce approximation errors for word segmentation and parsing. Finally, we are concerned with bridging the gap between unsupervised and supervised learning paradigms. We introduce feature induction solutions that harvest useful linguistic knowledge from large-scale unlabeled data and effectively use them as new features to enhance discriminative learning based systems. For word segmentation, we present a comparative study of word-based and character-based approaches. Inspired by the diversity of the two views, we design a novel stacked sub-word tagging model for joint word segmentation and POS tagging, which is robust to integrate different models, even models trained on heterogeneous annotations. To benefit from unsupervised word segmentation, we derive expressive string knowledge from unlabeled data which significantly enhances a strong supervised segmenter. For POS tagging, we introduce two linguistically motivated improvements: (1) combining syntax-free sequential tagging and syntax-based chart parsing results to better capture syntagmatic lexical relations and (2) integrating word clusters acquired from unlabeled data to better capture paradigmatic lexical relations. For syntactic parsing, we present a comparative analysis for generative PCFG-LA constituency parsing and discriminative graph-based dependency parsing. To benefit from the diversity of parsing in different formalisms, we implement a previously introduced stacking method and propose a novel Bagging model to combine complementary strengths of grammar-free and grammar-based models. In addition to the study on the syntactic formalism, we also propose a reranking model to explore heterogenous treebanks that are labeled under different annotation scheme. Finally, we continue our efforts on combining strengths of supervised and unsupervised learning, and evaluate the impact of word clustering on different syntactic processing tasks. Our work on SRL focus on improving the full parsing method with linguistically rich features and a chunking strategy. Furthermore, we developed a partial parsing based semantic chunking method, which has complementary strengths to the full parsing based method. Based on our work, Zhuang and Zong (2010) successfully improve the state-of-the-art by combining full and partial parsing based SRL systems.Motiviert durch die Unzulänglichkeit der Ansätze mit dem einzigen Ansicht in vielen Bereichen in NLP, untersuchen wir Chinesische Sprache Verarbeitung mit mehrfachen Ansichten, einschließlich Wortsegmentierung, Part-of-Speech (POS)-Tagging und syntaktische Parsing und die Kennzeichnung der semantische Rolle (SRL) in dieser Arbeit . Wir betrachten drei Situationen von mehreren Ansichten in der statistischen NLP: (1) Heterogene computergestützte Modelle sind für ein gegebenes Problem entwurft, (2) Heterogene Annotationsdaten sind verfügbar, um die Systeme zu trainieren, (3) überwachten und unüberwachten Methoden des maschinellen Lernens sind zur Verfügung gestellt. Erstens, wir analysieren vergleichsweise erfolgreiche Ansätze mit einzigen Ansicht für chinesische lexikalische, syntaktische und semantische Verarbeitung. Unsere Analyse zeigt die Unterschiede zwischen den heterogenen Systemen, die auf verschiedenen Ansichten gebaut werden, und motiviert uns, die state-of-the-Art durch die Kombination oder Integration heterogener Ansätze zu verbessern. Zweitens, untersuchen wir die Annotation Ensemble Problem, d.h. das Lernen aus mehreren Datensätzen unter verschiedenen Annotation Standards. Wir schlagen eine Reihe allgemeiner Stapeln Modelle, um eine effektive Nutzung heterogener Daten zu beschriften, und um Approximationsfehler für Wort Segmentierung und Analyse zu reduzieren. Schließlich sind wir besorgt mit der Überbrückung der Kluft zwischen unüberwachten und überwachten Lernens Paradigmen. Wir führen Induktion Feature-Lösungen, die nützliche Sprachkenntnisse von großflächigen unmarkierter Daten ernte, und die effektiv nutzen als neue Features, um die unterscheidenden Lernen basierten Systemen zu verbessern. Für die Wortsegmentierung, präsentieren wir eine vergleichende Studie der Wort-basierte und Charakter-basierten Ansätzen. Inspiriert von der Vielfalt der beiden Ansichten, entwerfen wir eine neuartige gestapelt Sub-Wort-Tagging-Modell für gemeinsame Wort-Segmentierung und POS-Tagging, die robust ist, um verschiedene Modelle zu integrieren, auch Modelle auf heterogenen Annotationen geschult. Um den unbeaufsichtigten Wortsegmentierung zu profitieren, leiten wir ausdrucksstarke Zeichenfolge Wissen von unmarkierten Daten. Diese Methode hat eine überwachte Methode erheblich verbessert. Für POS-Tagging, führen wir zwei linguistisch motiviert Verbesserungen: (1) die Kombination von Syntaxfreie sequentielle Tagging und Syntaxbasierten Grafik-Parsing-Ergebnisse, um syntagmatische lexikalische Beziehungen besser zu erfassen (2) die Integration von Wortclusteren von nicht markierte Daten, um die paradigmatische lexikalische Beziehungen besser zu erfassen. Für syntaktische Parsing präsentieren wir eine vergleichenbare Analyse für generative PCFG-LA Wahlkreis Parsing und diskriminierende Graphen-basierte Abhängigkeit Parsing. Um aus der Vielfalt der Parsen in unterschiedlichen Formalismen zu profitieren, setzen wir eine zuvor eingeführte Stacking-Methode und schlagen eine neuartige Schrumpfbeutel-Modell vor, um die ergänzenden Stärken der Grammatik und Grammatik-free-basierte Modelle zu kombinieren. Neben dem syntaktischen Formalismus, wir schlagen auch ein Modell, um heterogene reranking Baumbanken, die unter verschiedenen Annotationsschema beschriftet sind zu erkunden. Schließlich setzen wir unsere Bemühungen auf die Bündelung von Stärken des überwachten und unüberwachten Lernen, und bewerten wir die Auswirkungen der Wort-Clustering auf verschiedene syntaktische Verarbeitung Aufgaben. Unsere Arbeit an SRL ist konzentriert auf die Verbesserung der vollen Parsingsmethode mit linguistischen umfangreichen Funktionen und einer Chunkingstrategie. Weiterhin entwickelten wir eine semantische Chunkingmethode basiert auf dem partiellen Parsing, die die komplementäre Stärken gegen die die Methode basiert auf dem vollen Parsing hat. Basiert auf unserer Arbeit, Zhuang und Zong (2010) hat den aktuelle Stand erfolgreich verbessert durch die Kombination von voll-und partielle-Parsing basierte SRL Systeme

    Discriminative Reranking for Spoken Language Understanding

    Full text link

    Aplicação de técnicas de Clustering ao contexto da Tomada de Decisão em Grupo

    Get PDF
    Nowadays, decisions made by executives and managers are primarily made in a group. Therefore, group decision-making is a process where a group of people called participants work together to analyze a set of variables, considering and evaluating a set of alternatives to select one or more solutions. There are many problems associated with group decision-making, namely when the participants cannot meet for any reason, ranging from schedule incompatibility to being in different countries with different time zones. To support this process, Group Decision Support Systems (GDSS) evolved to what today we call web-based GDSS. In GDSS, argumentation is ideal since it makes it easier to use justifications and explanations in interactions between decision-makers so they can sustain their opinions. Aspect Based Sentiment Analysis (ABSA) is a subfield of Argument Mining closely related to Natural Language Processing. It intends to classify opinions at the aspect level and identify the elements of an opinion. Applying ABSA techniques to Group Decision Making Context results in the automatic identification of alternatives and criteria, for example. This automatic identification is essential to reduce the time decision-makers take to step themselves up on Group Decision Support Systems and offer them various insights and knowledge on the discussion they are participants. One of these insights can be arguments getting used by the decision-makers about an alternative. Therefore, this dissertation proposes a methodology that uses an unsupervised technique, Clustering, and aims to segment the participants of a discussion based on arguments used so it can produce knowledge from the current information in the GDSS. This methodology can be hosted in a web service that follows a micro-service architecture and utilizes Data Preprocessing and Intra-sentence Segmentation in addition to Clustering to achieve the objectives of the dissertation. Word Embedding is needed when we apply clustering techniques to natural language text to transform the natural language text into vectors usable by the clustering techniques. In addition to Word Embedding, Dimensionality Reduction techniques were tested to improve the results. Maintaining the same Preprocessing steps and varying the chosen Clustering techniques, Word Embedders, and Dimensionality Reduction techniques came up with the best approach. This approach consisted of the KMeans++ clustering technique, using SBERT as the word embedder with UMAP dimensionality reduction, reducing the number of dimensions to 2. This experiment achieved a Silhouette Score of 0.63 with 8 clusters on the baseball dataset, which wielded good cluster results based on their manual review and Wordclouds. The same approach obtained a Silhouette Score of 0.59 with 16 clusters on the car brand dataset, which we used as an approach validation dataset.Atualmente, as decisões tomadas por gestores e executivos são maioritariamente realizadas em grupo. Sendo assim, a tomada de decisão em grupo é um processo no qual um grupo de pessoas denominadas de participantes, atuam em conjunto, analisando um conjunto de variáveis, considerando e avaliando um conjunto de alternativas com o objetivo de selecionar uma ou mais soluções. Existem muitos problemas associados ao processo de tomada de decisão, principalmente quando os participantes não têm possibilidades de se reunirem (Exs.: Os participantes encontramse em diferentes locais, os países onde estão têm fusos horários diferentes, incompatibilidades de agenda, etc.). Para suportar este processo de tomada de decisão, os Sistemas de Apoio à Tomada de Decisão em Grupo (SADG) evoluíram para o que hoje se chamam de Sistemas de Apoio à Tomada de Decisão em Grupo baseados na Web. Num SADG, argumentação é ideal pois facilita a utilização de justificações e explicações nas interações entre decisores para que possam suster as suas opiniões. Aspect Based Sentiment Analysis (ABSA) é uma área de Argument Mining correlacionada com o Processamento de Linguagem Natural. Esta área pretende classificar opiniões ao nível do aspeto da frase e identificar os elementos de uma opinião. Aplicando técnicas de ABSA à Tomada de Decisão em Grupo resulta na identificação automática de alternativas e critérios por exemplo. Esta identificação automática é essencial para reduzir o tempo que os decisores gastam a customizarem-se no SADG e oferece aos mesmos conhecimento e entendimentos sobre a discussão ao qual participam. Um destes entendimentos pode ser os argumentos a serem usados pelos decisores sobre uma alternativa. Assim, esta dissertação propõe uma metodologia que utiliza uma técnica não-supervisionada, Clustering, com o objetivo de segmentar os participantes de uma discussão com base nos argumentos usados pelos mesmos de modo a produzir conhecimento com a informação atual no SADG. Esta metodologia pode ser colocada num serviço web que segue a arquitetura micro serviços e utiliza Preprocessamento de Dados e Segmentação Intra Frase em conjunto com o Clustering para atingir os objetivos desta dissertação. Word Embedding também é necessário para aplicar técnicas de Clustering a texto em linguagem natural para transformar o texto em vetores que possam ser usados pelas técnicas de Clustering. Também Técnicas de Redução de Dimensionalidade também foram testadas de modo a melhorar os resultados. Mantendo os passos de Preprocessamento e variando as técnicas de Clustering, Word Embedder e as técnicas de Redução de Dimensionalidade de modo a encontrar a melhor abordagem. Essa abordagem consiste na utilização da técnica de Clustering KMeans++ com o SBERT como Word Embedder e UMAP como a técnica de redução de dimensionalidade, reduzindo as dimensões iniciais para duas. Esta experiência obteve um Silhouette Score de 0.63 com 8 clusters no dataset de baseball, que resultou em bons resultados de cluster com base na sua revisão manual e visualização dos WordClouds. A mesma abordagem obteve um Silhouette Score de 0.59 com 16 clusters no dataset das marcas de carros, ao qual usamos esse dataset com validação de abordagem

    Bootstrapping Named Entity Annotation by Means of Active Machine Learning: A Method for Creating Corpora

    Get PDF
    This thesis describes the development and in-depth empirical investigation of a method, called BootMark, for bootstrapping the marking up of named entities in textual documents. The reason for working with documents, as opposed to for instance sentences or phrases, is that the BootMark method is concerned with the creation of corpora. The claim made in the thesis is that BootMark requires a human annotator to manually annotate fewer documents in order to produce a named entity recognizer with a given performance, than would be needed if the documents forming the basis for the recognizer were randomly drawn from the same corpus. The intention is then to use the created named en- tity recognizer as a pre-tagger and thus eventually turn the manual annotation process into one in which the annotator reviews system-suggested annotations rather than creating new ones from scratch. The BootMark method consists of three phases: (1) Manual annotation of a set of documents; (2) Bootstrapping – active machine learning for the purpose of selecting which document to an- notate next; (3) The remaining unannotated documents of the original corpus are marked up using pre-tagging with revision. Five emerging issues are identified, described and empirically investigated in the thesis. Their common denominator is that they all depend on the real- ization of the named entity recognition task, and as such, require the context of a practical setting in order to be properly addressed. The emerging issues are related to: (1) the characteristics of the named entity recognition task and the base learners used in conjunction with it; (2) the constitution of the set of documents annotated by the human annotator in phase one in order to start the bootstrapping process; (3) the active selection of the documents to annotate in phase two; (4) the monitoring and termination of the active learning carried out in phase two, including a new intrinsic stopping criterion for committee-based active learning; and (5) the applicability of the named entity recognizer created during phase two as a pre-tagger in phase three. The outcomes of the empirical investigations concerning the emerging is- sues support the claim made in the thesis. The results also suggest that while the recognizer produced in phases one and two is as useful for pre-tagging as a recognizer created from randomly selected documents, the applicability of the recognizer as a pre-tagger is best investigated by conducting a user study involving real annotators working on a real named entity recognition task
    corecore