
    In no uncertain terms: a dataset for monolingual and multilingual automatic term extraction from comparable corpora

    Automatic term extraction is a productive field of research within natural language processing, but it still faces significant obstacles regarding datasets and evaluation, which require manual term annotation. This is an arduous task, made even more difficult by the lack of a clear distinction between terms and general language, which results in low inter-annotator agreement. There is a great need for well-documented, manually validated datasets, especially in the rising field of multilingual term extraction from comparable corpora, which presents a unique new set of challenges. In this paper, a new approach is presented for both monolingual and multilingual term annotation in comparable corpora. The detailed guidelines with different term labels, the domain- and language-independent methodology and the large volumes annotated in three different languages and four different domains make this a rich resource. The resulting datasets are not just suited for evaluation purposes but can also serve as a general source of information about terms and even as training data for supervised methods. Moreover, the gold standard for multilingual term extraction from comparable corpora contains information about term variants and translation equivalents, which allows an in-depth, nuanced evaluation.

    Robust input representations for low-resource information extraction

    Recent advances in the field of natural language processing were achieved with deep learning models. This has led to a wide range of new research questions concerning the stability of such large-scale systems and their applicability beyond well-studied tasks and datasets, such as information extraction in non-standard domains and languages, and in particular in low-resource environments. In this work, we address these challenges and make important contributions across fields such as representation learning and transfer learning by proposing novel model architectures and training strategies to overcome existing limitations, including a lack of training resources, domain mismatches and language barriers. In particular, we propose solutions to close the domain gap between representation models, e.g., through domain-adaptive pre-training or our novel meta-embedding architecture for creating a joint representation of multiple embedding methods. Our broad set of experiments demonstrates state-of-the-art performance of our methods on various sequence tagging and classification tasks and highlights their robustness in challenging low-resource settings across languages and domains.
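    The meta-embedding idea above can be illustrated with a minimal sketch: several pre-computed embeddings are projected into a shared space and combined with a per-token attention over the sources. The dimensions, model choices and combination scheme below are illustrative assumptions, not the architecture proposed in the thesis.

```python
# Illustrative sketch only: a simple attention-based meta-embedding layer that
# projects several pre-computed embeddings into a shared space and combines them.
# Dimensions and the combination scheme are assumptions, not the thesis architecture.
import torch
import torch.nn as nn

class MetaEmbedding(nn.Module):
    def __init__(self, source_dims, joint_dim=256):
        super().__init__()
        # One projection per source embedding model (e.g. fastText, BERT, domain model).
        self.projections = nn.ModuleList([nn.Linear(d, joint_dim) for d in source_dims])
        self.attention = nn.Linear(joint_dim, 1)

    def forward(self, sources):
        # sources: list of tensors, each of shape (batch, seq_len, source_dim_i)
        projected = torch.stack([proj(x) for proj, x in zip(self.projections, sources)], dim=2)
        # Attention weights over the embedding sources, computed per token.
        weights = torch.softmax(self.attention(projected), dim=2)
        return (weights * projected).sum(dim=2)  # (batch, seq_len, joint_dim)

if __name__ == "__main__":
    meta = MetaEmbedding(source_dims=[300, 768])   # e.g. fastText + BERT vectors
    fasttext = torch.randn(2, 10, 300)             # dummy batch of precomputed embeddings
    bert = torch.randn(2, 10, 768)
    joint = meta([fasttext, bert])
    print(joint.shape)                             # torch.Size([2, 10, 256])
```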

    FRASIMED: a Clinical French Annotated Resource Produced through Crosslingual BERT-Based Annotation Projection

    Natural language processing (NLP) applications such as named entity recognition (NER) for low-resource corpora do not benefit from recent advances in the development of large language models (LLMs), since there is still a need for larger annotated datasets. This research article introduces a methodology for generating translated versions of annotated datasets through crosslingual annotation projection. Leveraging a language-agnostic BERT-based approach, it is an efficient way to enlarge low-resource corpora with little human effort, using only already available open data resources. Quantitative and qualitative evaluations are often lacking when it comes to assessing the quality and effectiveness of semi-automatic data generation strategies; the evaluation of our crosslingual annotation projection approach showed both its effectiveness and the high accuracy of the resulting dataset. As a practical application of this methodology, we present the creation of the French Annotated Resource with Semantic Information for Medical Entities Detection (FRASIMED), an annotated corpus comprising 2,051 synthetic clinical cases in French. The corpus is now available for researchers and practitioners to develop and refine French NLP applications in the clinical field (https://zenodo.org/record/8355629), making it the largest open annotated corpus with linked medical concepts in French.
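    As a rough illustration of crosslingual annotation projection with a language-agnostic BERT encoder, the sketch below embeds source and target words with a multilingual model, aligns each labelled source word to its most similar target word, and copies the label. The model name, greedy alignment and label scheme are assumptions for illustration, not the FRASIMED pipeline itself.

```python
# Sketch of similarity-based crosslingual annotation projection with a multilingual
# BERT encoder. This is a simplified illustration of the general technique, not the
# FRASIMED implementation; the model name and greedy alignment are assumptions.
import torch
from transformers import AutoModel, AutoTokenizer

MODEL = "bert-base-multilingual-cased"  # any multilingual encoder would do
tokenizer = AutoTokenizer.from_pretrained(MODEL)
model = AutoModel.from_pretrained(MODEL)
model.eval()

def word_embeddings(words):
    """Contextual embedding per word (mean over its subword pieces)."""
    enc = tokenizer(words, is_split_into_words=True, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**enc).last_hidden_state[0]      # (num_subwords, dim)
    ids = enc.word_ids()
    vecs = []
    for w in range(len(words)):
        piece_idx = [i for i, wid in enumerate(ids) if wid == w]
        vecs.append(hidden[piece_idx].mean(dim=0))
    return torch.stack(vecs)

def project_labels(src_words, src_labels, tgt_words):
    """Copy each labelled source word's tag to its most similar target word."""
    src_vecs = torch.nn.functional.normalize(word_embeddings(src_words), dim=-1)
    tgt_vecs = torch.nn.functional.normalize(word_embeddings(tgt_words), dim=-1)
    sim = src_vecs @ tgt_vecs.T                          # cosine similarities
    tgt_labels = ["O"] * len(tgt_words)
    for i, label in enumerate(src_labels):
        if label != "O":
            tgt_labels[int(sim[i].argmax())] = label
    return tgt_labels

# Toy example: project an English DISEASE annotation onto a French translation.
src = ["The", "patient", "has", "diabetes", "."]
lab = ["O", "O", "O", "B-DISEASE", "O"]
tgt = ["Le", "patient", "a", "du", "diabète", "."]
print(project_labels(src, lab, tgt))
```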

    D-TERMINE: data-driven term extraction methodologies investigated

    Automatic term extraction is a task in the field of natural language processing that aims to automatically identify terminology in collections of specialised, domain-specific texts. Terminology is defined as domain-specific vocabulary and consists of both single-word terms (e.g., corpus in the field of linguistics, referring to a large collection of texts) and multi-word terms (e.g., automatic term extraction). Terminology is a crucial part of specialised communication since terms can concisely express very specific and essential information. Therefore, quickly and automatically identifying terms is useful in a wide range of contexts. Automatic term extraction can be used by language professionals to find which terms are used in a domain and how, based on a relevant corpus. It is also useful for other tasks in natural language processing, including machine translation. One of the main difficulties with term extraction, both manual and automatic, is the vague boundary between general language and terminology. When different people identify terms in the same text, it will invariably produce different results. Consequently, creating manually annotated datasets for term extraction is a costly, time- and effort-consuming task. This can hinder research on automatic term extraction, which requires gold standard data for evaluation, preferably even in multiple languages and domains, since terms are language- and domain-dependent. Moreover, supervised machine learning methodologies rely on annotated training data to automatically deduce the characteristics of terms, so this knowledge can be used to detect terms in other corpora as well.
    Consequently, the first part of this PhD project was dedicated to the construction and validation of a new dataset for automatic term extraction, called ACTER – Annotated Corpora for Term Extraction Research. Terms and Named Entities were manually identified with four different labels in twelve specialised corpora. The dataset contains corpora in three languages and four domains, leading to a total of more than 100k annotations, made over almost 600k tokens. It was made publicly available during a shared task we organised, in which five international teams competed to automatically extract terms from the same test data. This illustrated how ACTER can contribute towards advancing the state of the art. It also revealed that there is still a lot of room for improvement, with moderate scores even for the best teams.
    Therefore, the second part of this dissertation was devoted to researching how supervised machine learning techniques might contribute. The traditional, hybrid approach to automatic term extraction relies on a combination of linguistic and statistical clues to detect terms. An initial list of unique candidate terms is extracted based on linguistic information (e.g., part-of-speech patterns) and this list is filtered based on statistical metrics that use frequencies to measure whether a candidate term might be relevant. The result is a ranked list of candidate terms. HAMLET – Hybrid, Adaptable Machine Learning Approach to Extract Terminology – was developed based on this traditional approach and applies machine learning to efficiently combine more information than could be used with a rule-based approach. This makes HAMLET less susceptible to typical issues like low recall on rare terms. While domain and language have a large impact on results, robust performance was reached even without domain-specific training data, and HAMLET compared favourably to a state-of-the-art rule-based system.
    Building on these findings, the third and final part of the project was dedicated to investigating methodologies that are even further removed from the traditional approach. Instead of starting from an initial list of unique candidate terms, potential terms were labelled immediately in the running text, in their original context. Two sequential labelling approaches were developed, evaluated and compared: a feature-based conditional random fields classifier, and a recurrent neural network with word embeddings. The latter outperformed the feature-based approach and was compared to HAMLET as well, obtaining comparable and even better results. In conclusion, this research resulted in an extensive, reusable dataset and three distinct new methodologies for automatic term extraction. The elaborate evaluations went beyond reporting scores and revealed the strengths and weaknesses of the different approaches. This identified challenges for future research, since some terms, especially ambiguous ones, remain problematic for all systems. However, overall, results were promising and the approaches were complementary, revealing great potential for new methodologies that combine multiple strategies.
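    The traditional hybrid pipeline described above (part-of-speech patterns to generate candidate terms, frequency-based metrics to rank them) can be sketched as follows. The pattern set, the domain-vs-reference termhood ratio and the toy corpora are illustrative assumptions, not HAMLET or the ACTER tooling.

```python
# Minimal sketch of the traditional hybrid pipeline: extract candidate terms with
# part-of-speech patterns, then rank them with a frequency-based termhood score
# (here a simple domain-vs-reference ratio). Input is assumed to be already
# POS-tagged; the pattern set and scoring metric are illustrative choices.
from collections import Counter
import math

# Allowed candidate patterns: single nouns, ADJ+NOUN and NOUN+NOUN bigrams.
PATTERNS = {("NOUN",), ("ADJ", "NOUN"), ("NOUN", "NOUN")}

def extract_candidates(tagged_sentences):
    """tagged_sentences: list of [(token, pos), ...] -> Counter of candidate terms."""
    counts = Counter()
    for sent in tagged_sentences:
        for n in (1, 2):
            for i in range(len(sent) - n + 1):
                window = sent[i:i + n]
                if tuple(pos for _, pos in window) in PATTERNS:
                    counts[" ".join(tok.lower() for tok, _ in window)] += 1
    return counts

def rank_by_termhood(domain_counts, reference_counts, smoothing=1.0):
    """Score candidates by how much more frequent they are in the domain corpus."""
    dom_total = sum(domain_counts.values())
    ref_total = sum(reference_counts.values()) or 1
    scores = {}
    for term, freq in domain_counts.items():
        rel_dom = freq / dom_total
        rel_ref = (reference_counts.get(term, 0) + smoothing) / (ref_total + smoothing)
        scores[term] = math.log(rel_dom / rel_ref)
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

# Toy corpora (already POS-tagged).
domain = [[("automatic", "ADJ"), ("term", "NOUN"), ("extraction", "NOUN"),
           ("uses", "VERB"), ("comparable", "ADJ"), ("corpora", "NOUN")]]
reference = [[("the", "DET"), ("weather", "NOUN"), ("is", "VERB"), ("nice", "ADJ")]]
ranking = rank_by_termhood(extract_candidates(domain), extract_candidates(reference))
print(ranking[:5])
```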

    Incorporating Ontological Information in Biomedical Entity Linking of Phrases in Clinical Text

    Biomedical Entity Linking (BEL) is the task of mapping spans of text within biomedical documents to normalized, unique identifiers within an ontology. Translational application of BEL on clinical notes has enormous potential for augmenting discretely captured data in electronic health records, but the existing paradigm for evaluating BEL systems developed in academia is not well aligned with real-world use cases. In this work, we demonstrate a proof of concept for incorporating ontological similarity into the training and evaluation of BEL systems to begin to rectify this misalignment. This thesis has two primary components: 1) a comprehensive literature review and 2) a methodology section proposing novel BEL techniques that contribute to scientific progress in the field. In the literature review component, I survey the progression of BEL from its inception in the late 1980s to present-day state-of-the-art systems, provide a comprehensive list of datasets available for training BEL systems, reference shared tasks focused on BEL, and outline the technical components that comprise BEL systems. In the methodology component, I describe my experiments incorporating ontological information into training a BERT encoder for entity linking.
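    The core idea of incorporating ontological similarity into BEL evaluation can be sketched as giving partial credit that decays with graph distance between the predicted and gold concepts, instead of exact-match accuracy. The toy ontology, identifiers and 1/(1+distance) scoring below are assumptions for illustration only.

```python
# Hedged sketch of ontological similarity in BEL evaluation: instead of scoring a
# predicted concept 1 only when it exactly matches the gold concept, give partial
# credit that decays with graph distance in the ontology. The toy ontology,
# identifiers and the 1/(1+distance) scoring are illustrative assumptions.
from collections import deque

# Toy ontology as child -> parent is-a edges (a tiny made-up fragment).
PARENTS = {
    "diabetes_mellitus_type_2": "diabetes_mellitus",
    "diabetes_mellitus_type_1": "diabetes_mellitus",
    "diabetes_mellitus": "endocrine_disorder",
    "endocrine_disorder": "disease",
}

def undirected_distance(a, b):
    """Shortest path length between two concepts, treating is-a edges as undirected."""
    edges = {}
    for child, parent in PARENTS.items():
        edges.setdefault(child, set()).add(parent)
        edges.setdefault(parent, set()).add(child)
    seen, queue = {a}, deque([(a, 0)])
    while queue:
        node, dist = queue.popleft()
        if node == b:
            return dist
        for nxt in edges.get(node, ()):
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, dist + 1))
    return float("inf")

def ontological_score(predicted, gold):
    """1.0 for an exact match, decaying towards 0 as the concepts get further apart."""
    return 1.0 / (1.0 + undirected_distance(predicted, gold))

# A type-2 prediction for a type-1 gold label is wrong, but less wrong than "disease".
print(ontological_score("diabetes_mellitus_type_2", "diabetes_mellitus_type_1"))  # ~0.33
print(ontological_score("disease", "diabetes_mellitus_type_1"))                   # 0.25
```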

    Improving the translation environment for professional translators

    When using computer-aided translation systems in a typical, professional translation workflow, there are several stages at which there is room for improvement. The SCATE (Smart Computer-Aided Translation Environment) project investigated several of these aspects, both from a human-computer interaction point of view and from a purely technological side. This paper describes the SCATE research with respect to improved fuzzy matching, parallel treebanks, the integration of translation memories with machine translation, quality estimation, terminology extraction from comparable texts, the use of speech recognition in the translation process, and human-computer interaction and interface design for the professional translation environment. For each of these topics, we describe the experiments we performed and the conclusions drawn, providing an overview of the highlights of the entire SCATE project.
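    As an indication of what fuzzy matching against a translation memory involves, the sketch below scores stored source segments against a new sentence and returns matches above a threshold. The token-level difflib ratio, the threshold and the toy memory are assumptions, not the improved fuzzy matching developed in SCATE.

```python
# Illustrative sketch of fuzzy matching for translation memory retrieval: score each
# stored source segment against the new sentence and return matches above a threshold.
# The similarity measure (difflib ratio over tokens) and the threshold are assumptions.
from difflib import SequenceMatcher

translation_memory = [
    ("Press the power button to start the device.",
     "Druk op de aan/uit-knop om het apparaat te starten."),
    ("Press the reset button to restart the device.",
     "Druk op de resetknop om het apparaat opnieuw te starten."),
]

def fuzzy_matches(new_sentence, memory, threshold=0.7):
    """Return (score, source, target) for TM entries similar to the new sentence."""
    hits = []
    for source, target in memory:
        score = SequenceMatcher(None, new_sentence.lower().split(),
                                source.lower().split()).ratio()
        if score >= threshold:
            hits.append((round(score, 2), source, target))
    return sorted(hits, reverse=True)

print(fuzzy_matches("Press the power button to stop the device.", translation_memory))
```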

    Named Entity Recognition and Linking in a Multilingual Biomedical Setting

    Master's thesis, Bioinformatics and Computational Biology, Universidade de Lisboa, Faculdade de Ciências, 2021. Information analysis is an essential process for all researchers and physicians. However, the amount of biomedical literature currently available and the format in which it is found make this process difficult. It is therefore essential to apply text mining tools to automatically obtain information from these documents. The problem is that most of these tools are not designed to deal with non-English languages, which is critical in the biomedical literature, since many of these documents are written in the authors' native language. Although several shared tasks have been organised in which text mining tools were developed for Spanish, the same has not happened for Portuguese. However, due to the lexical similarity between the two languages, it is possible to hypothesize that the tools for the two languages may be similar and that annotations can be transferred between Portuguese and Spanish. To contribute to the development of text mining tools for Portuguese and Spanish, this dissertation presents the ICERL (Iberian Cancer-related Entity Recognition and Linking) system, a NERL (Named Entity Recognition and Linking) system that uses deep learning and is composed of two similar pipelines, one for each language, together with the ICR (Iberian Cancer-related) parallel corpus. Both tools are focused on the oncology domain. The application of the ICERL system to the ICR corpus resulted in 3,999 annotations in Spanish and 3,287 in Portuguese. The similarities between the annotations in the two languages and the F1-score of 0.858 obtained when comparing the Portuguese annotations with the Spanish ones confirm the hypothesis initially presented.
    Researchers and physicians disseminate their findings through a variety of documents, such as books, articles, patents and other types of publications. To stay up to date in their area of interest, researchers need to analyse these documents quickly and effectively: the more efficient this stage is, the better the results obtained, and the faster it is, the more time can be devoted to other parts of their work. However, the pace at which these documents are published and the fact that their text is expressed in natural language make this task difficult. Applying text mining tools to extract information therefore becomes essential. Text mining tools are composed of several steps, such as Named Entity Recognition (NER) and Named Entity Linking (NEL). NER corresponds to identifying an entity in the text, while NEL consists of linking entities to a knowledge base. State-of-the-art NER systems are deep learning methods that typically use the BiLSTM-CRF architecture, whereas state-of-the-art NEL systems use not only deep learning methods but also graph-based methods. Most of the text mining systems currently available are designed only for English, which is problematic because biomedical literature is often written in the authors' native language.
    To address this problem, shared tasks have emerged to develop text mining systems for languages other than English. One of the main focuses of these shared tasks has been Spanish, the second language with the largest number of native speakers in the world and one with a large number of biomedical publications available. One example of a shared task for Spanish is CANTEMIST, whose goal is the identification of entities in the oncology domain and their linking to the Clasificación Internacional de Enfermedades para Oncología (CIE-O). Portuguese, on the other hand, has received little attention in these shared tasks. Because Portuguese and Spanish both derive from Latin, there is a high lexical similarity between the two languages (89%). It is therefore possible to assume that solutions developed for Spanish can be adapted or reused for Portuguese, and that annotations can be transferred between the two languages. The goal of this work is thus to build tools that validate this hypothesis: the ICERL (Iberian Cancer-related Entity Recognition and Linking) system and the ICR (Iberian Cancer-related) corpus. ICERL is a bilingual Portuguese-Spanish NERL (Named Entity Recognition and Linking) system, while ICR is a parallel corpus for the same languages; both are designed for the oncology domain. The first step in developing ICERL was the creation of a Spanish NERL pipeline specific to the oncology domain, based on the work of the LasigeBioTM team in the CANTEMIST shared task. The LasigeBioTM approach uses the Flair framework for the NER task and the Personalized PageRank (PPR) algorithm for the NEL task. Flair is a tool that allows different embeddings (vector representations of words) from different models to be combined into a single representation for NER. PPR is a variation of the PageRank algorithm, which is used to rank the importance of web pages. PageRank operates over a graph: originally, each node represented a web page and the edges between nodes represented hyperlinks between pages, and the algorithm estimates the coherence, that is, the relevance, of each node. In the NEL setting, the graph is composed of candidates for the entities of interest. The LasigeBioTM team used Flair to train embeddings on Spanish documents from PubMed; these embeddings were integrated into a NER model trained on the training and development sets of the CANTEMIST corpus. The trained model was then applied to the CANTEMIST test set to obtain annotation files with the recognised entities. Candidates for the NEL task were then retrieved for the recognised entities from three databases: CIE-O, the Health Sciences Descriptors (DeCS) and the International Classification of Diseases (ICD). A graph was built from these candidates, the candidates were ranked with the PPR algorithm, and the best candidate was chosen to link each entity. This pipeline was then refined by adding new embeddings, extending the training of the NER model and fixing errors in the code of the NEL component.
    Although these changes led to a significant improvement in NEL performance (F-measure from 0.0061 to 0.665), the same did not happen for NER (F-measure from 0.741 to 0.754). The final version of the ICERL system consists of a Portuguese pipeline and the pipeline that was tested on the CANTEMIST corpus, with a slight difference in the NEL task: instead of choosing a single candidate for each entity, a list of candidates from CIE-O and DeCS is selected, while the Portuguese pipeline selects candidates from DeCS and the International Classification of Diseases (CID). This difference is due to the method used to evaluate the performance of the ICERL system and to avoid restricting the system to a single candidate and a single vocabulary. To build the Portuguese pipeline, three NER models were tested, and the best approach turned out to be the combination of a model similar to the one used in the Spanish pipeline with the BioBERTpt model. Given the high lexical similarity between the two languages, the hypothesis of using the same pipeline for both languages was also tested; however, using the NLPStatTest software it was concluded that a language-specific pipeline yields a 58 percent improvement in F-measure for the Portuguese texts. The ICR corpus consists of 1,555 documents per language taken from SciELO. Since the Spanish pipeline was trained on files from the CANTEMIST corpus, documents from SciELO and PubMed also had to be collected to train the Portuguese pipeline. The ICERL system was applied to the ICR corpus and evaluated by comparing the Portuguese annotations with the Spanish ones; this was possible because the Spanish pipeline could be evaluated on the CANTEMIST corpus, where it achieved results close to the state of the art. Applying ICERL to the ICR corpus produced 3,999 annotations in Spanish (216 of them unique) and 3,287 in Portuguese (171 of them unique), with cancer being the most frequent entity in both languages. Together with these similarities in the annotations, the F-measure of 0.858 obtained in the evaluation supports the conclusion that annotations can be transferred between the two languages and that similar text mining tools can be used for both.
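    The Personalized PageRank step of the NEL pipeline described above can be sketched as follows: build a graph of linking candidates, run PPR, and keep the highest-scoring candidate. The toy graph, candidate identifiers and damping factor are illustrative assumptions, not the ICERL implementation.

```python
# Sketch of Personalized PageRank (PPR) over a small candidate graph, as used in NEL:
# build a graph of linking candidates, run PPR, and keep the highest-scoring candidate.
# Graph, candidate identifiers and damping value are toy assumptions for illustration.
import numpy as np

def personalized_pagerank(adjacency, restart, damping=0.85, iterations=100):
    """adjacency: (n, n) edge matrix; restart: personalization distribution."""
    out_degree = adjacency.sum(axis=1, keepdims=True)
    transition = np.divide(adjacency, out_degree,
                           out=np.zeros_like(adjacency), where=out_degree > 0)
    scores = np.full(adjacency.shape[0], 1.0 / adjacency.shape[0])
    for _ in range(iterations):
        scores = damping * transition.T @ scores + (1 - damping) * restart
    return scores

# Candidates retrieved for recognised entities (made-up codes, e.g. CIE-O / DeCS-like);
# edges connect candidates that co-occur or are ontologically related.
candidates = ["C50_breast", "8500/3_ductal_carcinoma", "C34_lung", "8070/3_squamous"]
adjacency = np.array([
    [0, 1, 0, 0],
    [1, 0, 0, 1],
    [0, 0, 0, 1],
    [0, 1, 1, 0],
], dtype=float)
restart = np.full(len(candidates), 1.0 / len(candidates))  # uniform personalization
scores = personalized_pagerank(adjacency, restart)
best = candidates[int(scores.argmax())]
print(dict(zip(candidates, scores.round(3))), "->", best)
```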

    Biomedical Term Extraction: NLP Techniques in Computational Medicine

    Artificial Intelligence (AI), and its branch Natural Language Processing (NLP) in particular, are main contributors to recent advances in classifying documentation and extracting information from assorted fields, Medicine being one that has gathered a lot of attention due to the amount of information generated in public professional journals and other means of communication within the medical profession. The typical information extraction task from technical texts is performed via an automatic term recognition extractor. Automatic Term Recognition (ATR) from technical texts is applied to identify key concepts for information retrieval and, secondarily, for machine translation. Term recognition depends on the subject domain and the lexical patterns of a given language, in our case Spanish, Arabic and Japanese. In this article, we present the methods and techniques for creating a biomedical corpus of validated terms, together with several tools for optimal exploitation of the information contained in the corpus. This paper also shows how these techniques and tools have been used in a prototype.

    Overview of BioASQ 2021-MESINESP track. Evaluation of advance hierarchical classification techniques for scientific literature, patents and clinical trials

    CLEF 2021 – Conference and Labs of the Evaluation Forum, September 21–24, 2021, Bucharest, Romania. There is a pressing need to exploit recent advances in natural language processing technologies, in particular language models and deep learning approaches, to enable improved retrieval, classification and, ultimately, access to information contained in multiple, heterogeneous types of documents. This is particularly true for the field of biomedicine and clinical research, where medical experts and scientists need to carry out complex search queries against a variety of document collections, including literature, patents, clinical trials or other kinds of content such as EHRs. Indexing documents with the structured controlled vocabularies used for semantic search engines and query expansion is a critical task for enabling sophisticated user queries and even cross-language retrieval. Due to the complexity of the medical domain and the use of very large hierarchical indexing terminologies, implementing efficient automatic systems to aid manual indexing is extremely difficult. This paper provides a summary of the results of the MESINESP task on medical semantic indexing in Spanish (BioASQ/CLEF 2021 Challenge). MESINESP was carried out in direct collaboration with literature content databases and medical indexing experts using the DeCS vocabulary, a resource similar to MeSH terms. Seven participating teams used advanced technologies, including extreme multi-label classification and deep language models, to solve this challenge, which can be viewed as a multi-label classification problem. As MESINESP resources, we have released a Gold Standard collection of 243,000 documents with a total of 2,179 manual annotations, divided into train, development and test subsets covering literature, patents and clinical trial summaries under a cross-genre training and data labeling scenario. Manual indexing of the evaluation subsets was carried out by three independent experts using a specially developed indexing interface called ASIT. Additionally, we have published a collection of large-scale automatic semantic annotations of these documents, based on NER systems, with mentions of drugs/medications (170,000), symptoms (137,000), diseases (840,000) and clinical procedures (415,000). Finally, the paper summarises the technologies used by the participating teams.
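    Framing semantic indexing as multi-label classification, as the participating systems did, can be sketched with a pretrained encoder and a sigmoid output per DeCS code trained with binary cross-entropy. The model name, label-space size and threshold below are placeholders, not the systems evaluated in MESINESP.

```python
# Minimal sketch of semantic indexing as multi-label classification: encode a document
# with a pretrained multilingual encoder and predict a probability for every DeCS code
# with a sigmoid head trained with binary cross-entropy. Model name, number of codes
# and threshold are placeholders, not the systems used in MESINESP.
import torch
import torch.nn as nn
from transformers import AutoModel, AutoTokenizer

MODEL = "bert-base-multilingual-cased"   # placeholder encoder
NUM_CODES = 5000                         # placeholder size of the DeCS label space

class DecsIndexer(nn.Module):
    def __init__(self):
        super().__init__()
        self.encoder = AutoModel.from_pretrained(MODEL)
        self.classifier = nn.Linear(self.encoder.config.hidden_size, NUM_CODES)

    def forward(self, **inputs):
        hidden = self.encoder(**inputs).last_hidden_state[:, 0]  # [CLS] representation
        return self.classifier(hidden)                           # one logit per DeCS code

tokenizer = AutoTokenizer.from_pretrained(MODEL)
model = DecsIndexer()
loss_fn = nn.BCEWithLogitsLoss()          # independent binary decision per code

batch = tokenizer(["Ensayo clínico sobre diabetes tipo 2 ..."],
                  truncation=True, padding=True, return_tensors="pt")
targets = torch.zeros(1, NUM_CODES)       # gold DeCS codes as a multi-hot vector
targets[0, 42] = 1.0                      # toy example: document indexed with code 42

logits = model(**batch)
loss = loss_fn(logits, targets)           # training signal
predicted = (torch.sigmoid(logits) > 0.5).nonzero()  # codes assigned at threshold 0.5
```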