Deep Test to Transformers Architecture in Named Entity Recognition
Named Entity Recognition is a Natural Language Processing task that aims to extract
and classify named entities such as "Queen of England". Depending on the objective of
the extraction, the entities can be classified with different labels. These labels are usually
Person, Organization, and Location, but they can be extended to include sub-entities such
as cars, countries, etc., or be entirely different, as when the scope of the classification is
biological and the entities are Genes or Viruses. These entities are extracted from raw
text, which may be a well-structured scientific document or an internet post, written in
any language. These constraints make it a considerable challenge to build a domain-independent
model. Consequently, most authors have focused on English documents, since English is the
most explored language and has the most labeled data, data which requires a significant
amount of human resources to produce. More recently, approaches have focused on
Transformer-architecture models, which may take days to train and consume millions of
labeled entities.
My approach is a statistical one, which means it will be language-independent while
still requiring considerable computation power. This model will combine multiple
techniques, such as Bag of Words, Stemming, and Word2Vec, to compute its features.
It will then be compared with two transformer-based models that, although architecturally
similar, have significant differences. The three models will be tested on multiple
datasets, each with its own challenges, to conduct a deep study of each model's strengths
and weaknesses.
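The feature pipeline named above (Bag of Words, stemming, and Word2Vec embeddings) can be sketched roughly as follows. This is a minimal stand-alone illustration, not the thesis's actual pipeline: the suffix-stripping stemmer, the toy documents, and the tiny two-dimensional embedding table are all placeholder assumptions standing in for a real stemmer and trained Word2Vec vectors.

```python
from collections import Counter

# Toy documents standing in for the thesis's datasets.
docs = [
    "The Queen of England visited the hospital",
    "England played against Spain in London",
]

def stem(token):
    # Naive suffix-stripping stemmer, a stand-in for a real
    # algorithm such as Porter's.
    for suffix in ("ing", "ed", "es", "s"):
        if token.endswith(suffix) and len(token) > len(suffix) + 2:
            return token[: -len(suffix)]
    return token

def tokenize(text):
    return [stem(t.lower()) for t in text.split()]

# Bag of Words: map each document to counts over a stemmed vocabulary.
vocab = sorted({t for d in docs for t in tokenize(d)})

def bow_vector(text):
    counts = Counter(tokenize(text))
    return [counts[t] for t in vocab]

# Word2Vec would supply dense vectors per token; this hand-made
# lookup table is a placeholder for trained embeddings.
embeddings = {"queen": [0.9, 0.1], "england": [0.5, 0.5]}

def embed(text):
    # Average the embeddings of the tokens we have vectors for.
    vecs = [embeddings[t] for t in tokenize(text) if t in embeddings]
    if not vecs:
        return [0.0, 0.0]
    return [sum(c) / len(vecs) for c in zip(*vecs)]

# Final feature vector: sparse counts concatenated with dense embeddings.
features = bow_vector(docs[0]) + embed(docs[0])
```

Concatenating the sparse count vector with an averaged dense embedding is one common way to let a single statistical classifier see both kinds of signal at once.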
After a thorough evaluation process, the three models achieved performances of over 90%
on datasets with a high number of samples. The biggest challenge was the datasets with
less data, where the Pipeline achieved better performance than the transformer-based
models.
Negation and Speculation in NLP: A Survey, Corpora, Methods, and Applications
Negation and speculation are universal linguistic phenomena that affect the performance of Natural Language Processing (NLP) applications, such as those for opinion mining and information retrieval, especially in biomedical data. In this article, we review the corpora annotated with negation and speculation in various natural languages and domains. Furthermore, we discuss the ongoing research into recent rule-based, supervised, and transfer learning techniques for the detection of negating and speculative content. Many English corpora for various domains are now annotated with negation and speculation; moreover, the availability of annotated corpora in other languages has started to increase. However, this growth is insufficient to address these important phenomena in languages with limited resources. The use of cross-lingual models and translation to well-known languages are acceptable alternatives. We also highlight the lack of consistent annotation guidelines and the shortcomings of the existing techniques, and suggest alternatives that may speed up progress in this research direction. Adding more syntactic features may alleviate the limitations of the existing techniques, such as cue ambiguity and the detection of discontinuous scopes. In some NLP applications, inclusion of a system that is negation- and speculation-aware improves performance, yet this aspect is still not addressed or considered an essential step.
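The rule-based techniques the survey covers can be illustrated with a minimal sketch in the spirit of NegEx: a small cue lexicon plus a punctuation-bounded scope rule. The cue lists and the scope heuristic below are illustrative assumptions, far smaller and cruder than any corpus-derived lexicon or annotation guideline.

```python
import re

# Illustrative cue lists; real systems use much larger,
# corpus-derived vocabularies.
NEGATION_CUES = {"no", "not", "denies", "without"}
SPECULATION_CUES = {"may", "possible", "suggests"}

def mark_scopes(sentence):
    """Label each token as negated ('neg') or speculated ('spec') if it
    follows a cue, closing the scope at the next punctuation mark
    (a common simplification of NegEx's windowed scope rule)."""
    tokens = re.findall(r"\w+|[.,;]", sentence.lower())
    labels, state = [], None
    for tok in tokens:
        if tok in ".,;":
            state = None              # punctuation closes the scope
            labels.append((tok, None))
        elif tok in NEGATION_CUES:
            state = "neg"             # open a negation scope
            labels.append((tok, "cue"))
        elif tok in SPECULATION_CUES:
            state = "spec"            # open a speculation scope
            labels.append((tok, "cue"))
        else:
            labels.append((tok, state))
    return labels

result = mark_scopes("Patient denies chest pain, reports possible fever.")
```

Even this toy version shows the two failure modes the survey highlights: a cue word used non-negatingly is mislabeled (cue ambiguity), and a scope interrupted by an intervening clause is cut short (discontinuous scopes).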
Automated Coding of Under-Studied Medical Concept Domains: Linking Physical Activity Reports to the International Classification of Functioning, Disability, and Health
Linking clinical narratives to standardized vocabularies and coding systems
is a key component of unlocking the information in medical text for analysis.
However, many domains of medical concepts lack well-developed terminologies
that can support effective coding of medical text. We present a framework for
developing natural language processing (NLP) technologies for automated coding
of under-studied types of medical information, and demonstrate its
applicability via a case study on physical mobility function. Mobility is a
component of many health measures, from post-acute care and surgical outcomes
to chronic frailty and disability, and is coded in the International
Classification of Functioning, Disability, and Health (ICF). However, mobility
and other types of functional activity remain under-studied in medical
informatics, and neither the ICF nor commonly-used medical terminologies
capture functional status terminology in practice. We investigated two
data-driven paradigms, classification and candidate selection, to link
narrative observations of mobility to standardized ICF codes, using a dataset
of clinical narratives from physical therapy encounters. Recent advances in
language modeling and word embedding were used as features for established
machine learning models and a novel deep learning approach, achieving a macro
F-1 score of 84% on linking mobility activity reports to ICF codes. Both
classification and candidate selection approaches present distinct strengths
for automated coding in under-studied domains, and we highlight that the
combination of (i) a small annotated data set; (ii) expert definitions of codes
of interest; and (iii) a representative text corpus is sufficient to produce
high-performing automated coding systems. This study has implications for the
ongoing growth of NLP tools for a variety of specialized applications in
clinical care and research.

Comment: Updated final version, published in Frontiers in Digital Health,
https://doi.org/10.3389/fdgth.2021.620828. 34 pages (23 text + 11
references); 9 figures, 2 tables.
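The candidate-selection paradigm described in the abstract can be sketched in a deliberately simplified form: rank every code definition against a narrative report and pick the best match. The bag-of-words cosine similarity and the paraphrased ICF definitions below are assumptions for illustration only; the paper itself uses word embeddings and language-model features, and the official ICF text differs.

```python
import math
from collections import Counter

# ICF mobility codes with paraphrased (not official) definitions.
code_definitions = {
    "d410": "changing basic body position such as sitting or standing",
    "d450": "walking short or long distances on different surfaces",
    "d455": "moving around by climbing running jumping or swimming",
}

def bow(text):
    # Bag-of-words counts over whitespace tokens.
    return Counter(text.lower().split())

def cosine(a, b):
    # Cosine similarity between two sparse count vectors.
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def select_code(report):
    # Candidate selection: score the report against every code
    # definition and return the highest-ranked code.
    report_vec = bow(report)
    return max(code_definitions,
               key=lambda c: cosine(report_vec, bow(code_definitions[c])))

best = select_code("patient walking long distances independently")
```

Unlike a fixed classifier head, this ranking formulation needs no retraining when a new code definition is added, which is why candidate selection suits under-studied domains with small annotated sets.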