13 research outputs found

    A survey on recent advances in named entity recognition

    Full text link
    Named Entity Recognition seeks to extract substrings within a text that name real-world objects and to determine their type (for example, whether they refer to persons or organizations). In this survey, we first present an overview of recent popular approaches, but we also look at graph- and transformer- based methods including Large Language Models (LLMs) that have not had much coverage in other surveys. Second, we focus on methods designed for datasets with scarce annotations. Third, we evaluate the performance of the main NER implementations on a variety of datasets with differing characteristics (as regards their domain, their size, and their number of classes). We thus provide a deep comparison of algorithms that are never considered together. Our experiments shed some light on how the characteristics of datasets affect the behavior of the methods that we compare.Comment: 30 page

    HunFlair2 in a cross-corpus evaluation of biomedical named entity recognition and normalization tools

    Full text link
    With the exponential growth of the life science literature, biomedical text mining (BTM) has become an essential technology for accelerating the extraction of insights from publications. Identifying named entities (e.g., diseases, drugs, or genes) in texts and their linkage to reference knowledge bases are crucial steps in BTM pipelines to enable information aggregation from different documents. However, tools for these two steps are rarely applied in the same context in which they were developed. Instead, they are applied in the wild, i.e., on application-dependent text collections different from those used for the tools' training, varying, e.g., in focus, genre, style, and text type. This raises the question of whether the reported performance of BTM tools can be trusted for downstream applications. Here, we report on the results of a carefully designed cross-corpus benchmark for named entity extraction, where tools were applied systematically to corpora not used during their training. Based on a survey of 28 published systems, we selected five for an in-depth analysis on three publicly available corpora encompassing four different entity types. Comparison between tools results in a mixed picture and shows that, in a cross-corpus setting, the performance is significantly lower than the one reported in an in-corpus setting. HunFlair2 showed the best performance on average, being closely followed by PubTator. Our results indicate that users of BTM tools should expect diminishing performances when applying them in the wild compared to original publications and show that further research is necessary to make BTM tools more robust

    SinNer@Clef-Hipe2020 : Sinful adaptation of SotA models for Named Entity Recognition in French and German

    Get PDF
    International audienceIn this article we present the approaches developed by the Sorbonne-INRIA for NER (SinNer) team for the CLEF-HIPE 2020 challenge on Named Entity Processing on old newspapers. The challenge proposed various tasks for three languages, among them we focused on Named Entity Recognition in French and German texts. The best system we proposed ranked third for these two languages, it uses FastText em-beddings and Elmo language models (FrELMo and German ELMo). We show that combining several word representations enhances the quality of the results for all NE types and that the segmentation in sentences has an important impact on the results

    Robust input representations for low-resource information extraction

    Get PDF
    Recent advances in the field of natural language processing were achieved with deep learning models. This led to a wide range of new research questions concerning the stability of such large-scale systems and their applicability beyond well-studied tasks and datasets, such as information extraction in non-standard domains and languages, in particular, in low-resource environments. In this work, we address these challenges and make important contributions across fields such as representation learning and transfer learning by proposing novel model architectures and training strategies to overcome existing limitations, including a lack of training resources, domain mismatches and language barriers. In particular, we propose solutions to close the domain gap between representation models by, e.g., domain-adaptive pre-training or our novel meta-embedding architecture for creating a joint representations of multiple embedding methods. Our broad set of experiments demonstrates state-of-the-art performance of our methods for various sequence tagging and classification tasks and highlight their robustness in challenging low-resource settings across languages and domains.Die jĂŒngsten Fortschritte auf dem Gebiet der Verarbeitung natĂŒrlicher Sprache wurden mit Deep-Learning-Modellen erzielt. Dies fĂŒhrte zu einer Vielzahl neuer Forschungsfragen bezĂŒglich der StabilitĂ€t solcher großen Systeme und ihrer Anwendbarkeit ĂŒber gut untersuchte Aufgaben und DatensĂ€tze hinaus, wie z. B. die Informationsextraktion fĂŒr Nicht-Standardsprachen, aber auch TextdomĂ€nen und Aufgaben, fĂŒr die selbst im Englischen nur wenige Trainingsdaten zur VerfĂŒgung stehen. In dieser Arbeit gehen wir auf diese Herausforderungen ein und leisten wichtige BeitrĂ€ge in Bereichen wie ReprĂ€sentationslernen und Transferlernen, indem wir neuartige Modellarchitekturen und Trainingsstrategien vorschlagen, um bestehende BeschrĂ€nkungen zu ĂŒberwinden, darunter fehlende Trainingsressourcen, ungesehene DomĂ€nen und Sprachbarrieren. Insbesondere schlagen wir Lösungen vor, um die DomĂ€nenlĂŒcke zwischen ReprĂ€sentationsmodellen zu schließen, z.B. durch domĂ€nenadaptives Vortrainieren oder unsere neuartige Meta-Embedding-Architektur zur Erstellung einer gemeinsamen ReprĂ€sentation mehrerer Embeddingmethoden. Unsere umfassende Evaluierung demonstriert die LeistungsfĂ€higkeit unserer Methoden fĂŒr verschiedene Klassifizierungsaufgaben auf Word und Satzebene und unterstreicht ihre Robustheit in anspruchsvollen, ressourcenarmen Umgebungen in verschiedenen Sprachen und DomĂ€nen

    Entities with quantities : extraction, search, and ranking

    Get PDF
    Quantities are more than numeric values. They denote measures of the world’s entities such as heights of buildings, running times of athletes, energy efficiency of car models or energy production of power plants, all expressed in numbers with associated units. Entity-centric search and question answering (QA) are well supported by modern search engines. However, they do not work well when the queries involve quantity filters, such as searching for athletes who ran 200m under 20 seconds or companies with quarterly revenue above $2 Billion. State-of-the-art systems fail to understand the quantities, including the condition (less than, above, etc.), the unit of interest (seconds, dollar, etc.), and the context of the quantity (200m race, quarterly revenue, etc.). QA systems based on structured knowledge bases (KBs) also fail as quantities are poorly covered by state-of-the-art KBs. In this dissertation, we developed new methods to advance the state-of-the-art on quantity knowledge extraction and search.Zahlen sind mehr als nur numerische Werte. Sie beschreiben Maße von EntitĂ€ten wie die Höhe von GebĂ€uden, die Laufzeit von Sportlern, die Energieeffizienz von Automodellen oder die Energieerzeugung von Kraftwerken - jeweils ausgedrĂŒckt durch Zahlen mit zugehörigen Einheiten. EntitĂ€tszentriete Anfragen und direktes Question-Answering werden von Suchmaschinen hĂ€ufig gut unterstĂŒtzt. Sie funktionieren jedoch nicht gut, wenn die Fragen Zahlenfilter beinhalten, wie z. B. die Suche nach Sportlern, die 200m unter 20 Sekunden gelaufen sind, oder nach Unternehmen mit einem Quartalsumsatz von ĂŒber 2 Milliarden US-Dollar. Selbst moderne Systeme schaffen es nicht, QuantitĂ€ten, einschließlich der genannten Bedingungen (weniger als, ĂŒber, etc.), der Maßeinheiten (Sekunden, Dollar, etc.) und des Kontexts (200-Meter-Rennen, Quartalsumsatz usw.), zu verstehen. Auch QA-Systeme, die auf strukturierten Wissensbanken (“Knowledge Bases”, KBs) aufgebaut sind, versagen, da quantitative Eigenschaften von modernen KBs kaum erfasst werden. In dieser Dissertation werden neue Methoden entwickelt, um den Stand der Technik zur Wissensextraktion und -suche von QuantitĂ€ten voranzutreiben. Unsere HauptbeitrĂ€ge sind die folgenden: ‱ ZunĂ€chst prĂ€sentieren wir Qsearch [Ho et al., 2019, Ho et al., 2020] – ein System, das mit erweiterten Fragen mit QuantitĂ€tsfiltern umgehen kann, indem es Hinweise verwendet, die sowohl in der Frage als auch in den Textquellen vorhanden sind. Qsearch umfasst zwei HauptbeitrĂ€ge. Der erste Beitrag ist ein tiefes neuronales Netzwerkmodell, das fĂŒr die Extraktion quantitĂ€tszentrierter Tupel aus Textquellen entwickelt wurde. Der zweite Beitrag ist ein neuartiges Query-Matching-Modell zum Finden und zur Reihung passender Tupel. ‱ Zweitens, um beim Vorgang heterogene Tabellen einzubinden, stellen wir QuTE [Ho et al., 2021a, Ho et al., 2021b] vor – ein System zum Extrahieren von QuantitĂ€tsinformationen aus Webquellen, insbesondere Ad-hoc Webtabellen in HTML-Seiten. Der Beitrag von QuTE umfasst eine Methode zur VerknĂŒpfung von QuantitĂ€ts- und EntitĂ€tsspalten, fĂŒr die externe Textquellen genutzt werden. Zur Beantwortung von Fragen kontextualisieren wir die extrahierten EntitĂ€ts-QuantitĂ€ts-Paare mit informativen Hinweisen aus der Tabelle und stellen eine neue Methode zur Konsolidierung und verbesserteer Reihung von Antwortkandidaten durch Inter-Fakten-Konsistenz vor. ‱ Drittens stellen wir QL [Ho et al., 2022] vor – eine Recall-orientierte Methode zur Anreicherung von Knowledge Bases (KBs) mit quantitativen Fakten. Moderne KBs wie Wikidata oder YAGO decken viele EntitĂ€ten und ihre relevanten Informationen ab, ĂŒbersehen aber oft wichtige quantitative Eigenschaften. QL ist frage-gesteuert und basiert auf iterativem Lernen mit zwei HauptbeitrĂ€gen, um die KB-Abdeckung zu verbessern. Der erste Beitrag ist eine Methode zur Expansion von Fragen, um einen grĂ¶ĂŸeren Pool an Faktenkandidaten zu erfassen. Der zweite Beitrag ist eine Technik zur Selbstkonsistenz durch BerĂŒcksichtigung der Werteverteilungen von QuantitĂ€ten

    Knowledge mining over scientific literature and technical documentation

    Full text link
    Abstract This dissertation focuses on the extraction of information implicitly encoded in domain descriptions (technical terminology and related items) and its usage within a restricted-domain question answering system (QA). Since different variants of the same term can be used to refer to the same domain entity, it is necessary to recognize all possible forms of a given term and structure them, so that they can be used in the question answering process. The knowledge about domain descriptions and their mutual relations is leveraged in an extension to an existing QA system, aimed at the technical maintenance manual of a well-known commercial aircraft. The original version of the QA system did not make use of domain descriptions, which are the novelty introduced by the present work. The explicit treatment of domain descriptions provided considerable gains in terms of efficiency, in particular in the process of analysis of the background document collection. Similar techniques were later applied to another domain (biomedical scientific literature), focusing in particular on protein- protein interactions. This dissertation describes in particular: (1) the extraction of domain specific lexical items which refer to entities of the domain; (2) the detection of relationships (like synonymy and hyponymy) among such items, and their organization into a conceptual structure; (3) their usage within a domain restricted question answering system, in order to facilitate the correct identification of relevant answers to a query; (4) the adaptation of the system to another domain, and extension of the basic hypothesis to tasks other than question answering. Zusammenfassung Das Thema dieser Dissertation ist die Extraktion von Information, welche implizit in technischen Terminologien und Ă€hnlichen Ressourcen enthalten ist, sowie ihre Anwendung in einem Antwortextraktionssystem (AE). Da verschiedene Varianten desselben Terms verwendet werden können, um auf den gleichen Begriff zu verweisen, ist die Erkennung und Strukturierung aller möglichen Formen Voraussetzung fĂŒr den Einsatz in einem AE-System. Die Kenntnisse ĂŒber Terme und deren Relationen werden in einem AE System angewandt, welches auf dem Wartungshandbuch eines bekannten Verkehrsflugzeug fokussiert. Die ursprĂŒngliche Version des Systems hatte keine explizite Behandlung von Terminologie. Die explizite Behandlung von Terminologie lieferte eine beachtliche Verbesserung der Effizienz des Systems, insbesondere was die Analyse der zugrundeliegenden Dokumentensammlung betrifft. Ähnliche Methodologien wurden spĂ€ter auf einer anderen DomĂ€ne angewandt (biomedizinische Literatur), mit einen besonderen Fokus auf Interaktionen zwischen Proteinen. Diese Dissertation beschreibt insbesondere: (1) die Extraktion der Terminologie (2) die Identifikation der Relationen zwischen Termen (wie z.B. Synonymie und Hyponymie) (3) deren Verwendung in einen AE System (4) die Portierung des Systems auf eine andere DomĂ€ne

    Representing relational knowledge with language models

    Get PDF
    Relational knowledge is the ability to recognize the relationship between instances, and it has an important role in human understanding a concept or commonsense reasoning. We, humans, structure our knowledge by understanding individual instances together with the relationship among them, which enables us to further expand the knowledge. Nevertheless, modelling relational knowledge with computational models is a long-standing challenge in Natural Language Processing (NLP). The main difficulty at acquiring relational knowledge arises from the generalization capability. For pre-trained Language Model (LM), in spite of the huge impact made in NLP, relational knowledge remains understudied. In fact, GPT-3 (Brown et al., 2020), one of the largest LM at the time being with 175 billions of parameters, has shown to perform worse than a traditional statistical baseline in an analogy benchmark. Our initial results hinted at the type of relational knowledge encoded in some of the LMs. However, we found out that such knowledge can be hardly extracted with a carefully designed method tuned on a task specific validation set. According to such finding, we proposed a method (RelBERT) for distilling relational knowledge via LM fine-tuning. This method successfully retrieves flexible relation embeddings that achieve State-of-The-Art (SoTA) in various analogy benchmarks. Moreover, it exhibits a high generalization ability to be able to handle relation types that are not included in the training data. Finally, we propose a new task of modelling graded relation in named entities, which reveals some limitations of recent SoTA LMs as well as RelBERT, suggesting future research direction to model relational knowledge in the current LM era, especially when it comes to named entities
    corecore