
    Deep Test to Transformers Architecture in Named Entity Recognition

    Named Entity Recognition is a Natural Language Processing task that aims to extract and classify named entities such as "Queen of England". Depending on the objective of the extraction, the entities can be classified with different labels. These labels are usually Person, Organization, and Location, but they can be extended to include sub-entities such as cars or countries, or can be entirely different, for example when the classification is biological and the entities are Genes or Viruses. These entities are extracted from raw text, which may be a well-structured scientific document or an internet post, and may be written in any language. These constraints make it very challenging to create a language-independent model. As a result, most authors have focused on English documents, since English is the most explored language and the one with the most labeled data; producing such data requires a significant amount of human effort. More recently, approaches have focused on Transformer-architecture models, which may take days to train and consume millions of labeled entities. My approach is a statistical one, which means it is language-independent while still requiring considerable computational power. The model combines multiple techniques, such as Bag of Words, Stemming, and Word2Vec, to compute its features. It is then compared with two transformer-based models which, although they share a similar architecture, have significant differences. The three models are tested on multiple datasets, each with its own challenges, to conduct an in-depth study of each model's strengths and weaknesses. After an extensive evaluation process, the three models achieved performance above 90% on datasets with a high number of samples. The biggest challenge was the datasets with less data, where the Pipeline achieved better performance than the transformer-based models.
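    To make the feature combination concrete, here is a minimal sketch (an illustration, not the thesis' exact Pipeline) of how stemmed bag-of-words context features and Word2Vec vectors can be joined for token-level NER classification. The toy sentences, BIO labels, and the choice of scikit-learn, gensim, and nltk are assumptions for the sake of the example.

```python
# Minimal sketch: stemmed bag-of-words window features + Word2Vec vectors
# fed to a linear classifier for token-level NER. Toy data only.
import numpy as np
from nltk.stem import PorterStemmer
from gensim.models import Word2Vec
from sklearn.feature_extraction import DictVectorizer
from sklearn.linear_model import LogisticRegression

sentences = [["the", "queen", "of", "england", "visited", "lisbon"],
             ["acme", "corp", "opened", "an", "office", "in", "porto"]]
labels = [["O", "B-PER", "I-PER", "I-PER", "O", "B-LOC"],
          ["B-ORG", "I-ORG", "O", "O", "O", "O", "B-LOC"]]

stemmer = PorterStemmer()
w2v = Word2Vec(sentences, vector_size=50, window=2, min_count=1, seed=0)

def token_features(sent, i):
    """Stemmed bag-of-words style features over a small context window."""
    return {
        "stem": stemmer.stem(sent[i]),
        "prev_stem": stemmer.stem(sent[i - 1]) if i > 0 else "<s>",
        "next_stem": stemmer.stem(sent[i + 1]) if i + 1 < len(sent) else "</s>",
        "is_title": float(sent[i][0].isupper()),
    }

dicts, vectors, y = [], [], []
for sent, tags in zip(sentences, labels):
    for i, tag in enumerate(tags):
        dicts.append(token_features(sent, i))
        vectors.append(w2v.wv[sent[i]])          # dense Word2Vec feature
        y.append(tag)

vec = DictVectorizer(sparse=False)
X = np.hstack([vec.fit_transform(dicts), np.array(vectors)])  # sparse-style + dense features

clf = LogisticRegression(max_iter=1000).fit(X, y)
print(clf.predict(X[:6]))  # sanity check on the first sentence's tokens
```

    In a real setting the classifier would be trained and evaluated on held-out annotated sentences rather than the training tokens shown here; the point is only how heterogeneous features can be concatenated into one representation.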

    Negation and Speculation in NLP: A Survey, Corpora, Methods, and Applications

    Negation and speculation are universal linguistic phenomena that affect the performance of Natural Language Processing (NLP) applications, such as those for opinion mining and information retrieval, especially in biomedical data. In this article, we review the corpora annotated with negation and speculation in various natural languages and domains. Furthermore, we discuss the ongoing research into recent rule-based, supervised, and transfer-learning techniques for the detection of negated and speculative content. Many English corpora for various domains are now annotated with negation and speculation; moreover, the availability of annotated corpora in other languages has started to increase. However, this growth is insufficient to address these important phenomena in languages with limited resources. The use of cross-lingual models and translation from well-resourced languages are acceptable alternatives. We also highlight the lack of consistent annotation guidelines and the shortcomings of the existing techniques, and suggest alternatives that may speed up progress in this research direction. Adding more syntactic features may alleviate the limitations of the existing techniques, such as cue ambiguity and detecting discontinuous scopes. In some NLP applications, including a system that is negation- and speculation-aware improves performance, yet this aspect is still not addressed or is not considered an essential step.
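    As a concrete illustration of the rule-based family of techniques surveyed here, the sketch below marks negated tokens using a small cue list and a punctuation-bounded scope rule. The cue and terminator lists are simplified assumptions for illustration, not a reproduction of any specific system such as NegEx.

```python
# Minimal sketch of rule-based negation scope detection (toy cue lists).
import re

NEGATION_CUES = {"no", "not", "without", "denies", "denied", "absence"}
SCOPE_TERMINATORS = {"but", "however", "except"}

def mark_negated(sentence: str) -> list[tuple[str, bool]]:
    """Return (token, is_negated) pairs; a scope runs from a cue to the
    next terminator word or punctuation mark."""
    tokens = re.findall(r"\w+|[.,;]", sentence.lower())
    marked, in_scope = [], False
    for tok in tokens:
        if tok in NEGATION_CUES:
            in_scope = True
            marked.append((tok, False))      # the cue itself is not negated
            continue
        if tok in SCOPE_TERMINATORS or tok in {".", ",", ";"}:
            in_scope = False
        marked.append((tok, in_scope))
    return marked

print(mark_negated("Patient denies chest pain, but reports shortness of breath."))
```

    Real systems add pre- and post-cue windows, speculation cues, and syntactic constraints; this toy version only shows why discontinuous scopes and ambiguous cues are hard for purely lexical rules.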

    Automated Coding of Under-Studied Medical Concept Domains: Linking Physical Activity Reports to the International Classification of Functioning, Disability, and Health

    Linking clinical narratives to standardized vocabularies and coding systems is a key component of unlocking the information in medical text for analysis. However, many domains of medical concepts lack well-developed terminologies that can support effective coding of medical text. We present a framework for developing natural language processing (NLP) technologies for automated coding of under-studied types of medical information, and demonstrate its applicability via a case study on physical mobility function. Mobility is a component of many health measures, from post-acute care and surgical outcomes to chronic frailty and disability, and is coded in the International Classification of Functioning, Disability, and Health (ICF). However, mobility and other types of functional activity remain under-studied in medical informatics, and neither the ICF nor commonly-used medical terminologies capture functional status terminology in practice. We investigated two data-driven paradigms, classification and candidate selection, to link narrative observations of mobility to standardized ICF codes, using a dataset of clinical narratives from physical therapy encounters. Recent advances in language modeling and word embedding were used as features for established machine learning models and a novel deep learning approach, achieving a macro F-1 score of 84% on linking mobility activity reports to ICF codes. Both classification and candidate selection approaches present distinct strengths for automated coding in under-studied domains, and we highlight that the combination of (i) a small annotated data set; (ii) expert definitions of codes of interest; and (iii) a representative text corpus is sufficient to produce high-performing automated coding systems. This study has implications for the ongoing growth of NLP tools for a variety of specialized applications in clinical care and research.
    Comment: Updated final version, published in Frontiers in Digital Health, https://doi.org/10.3389/fdgth.2021.620828. 34 pages (23 text + 11 references); 9 figures, 2 tables.
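    A minimal sketch of the candidate-selection paradigm described above, assuming TF-IDF cosine similarity as a stand-in for the paper's embedding features. The ICF code descriptions and the example mobility report are paraphrased placeholders, not the authors' data or model.

```python
# Minimal sketch: rank candidate ICF codes by similarity between a
# mobility observation and short code descriptions (illustrative only).
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

icf_candidates = {
    "d410": "changing basic body position, such as getting up from a chair",
    "d450": "walking short or long distances on different surfaces",
    "d465": "moving around using equipment such as a walker or wheelchair",
}

report = "Patient ambulates 50 feet with a rolling walker, requires contact guard."

vectorizer = TfidfVectorizer()
matrix = vectorizer.fit_transform([report] + list(icf_candidates.values()))
scores = cosine_similarity(matrix[0], matrix[1:]).ravel()

# Highest-scoring code is the selected candidate.
for code, score in sorted(zip(icf_candidates, scores), key=lambda x: -x[1]):
    print(f"{code}: {score:.2f}")
```

    Swapping the TF-IDF vectors for contextual embeddings and learning a scoring function over report-code pairs is the natural extension this line of work pursues; the ranking structure stays the same.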