40,360 research outputs found

    Named Entity Recognition in Spanish Biomedical Literature: Short Review and Bert Model

    Get PDF
    Named Entity Recognition (NER) is the rst step for knowledge acquisition when we deal with an unknown corpus of texts. Having received these entities, we have an opportunity to form parameters space and to solve problems of text mining as concept normalization, speech recognition, etc. The recent advances in NER are related to the technology of word embeddings, which transforms text to the form being effective for Deep Learning. In the paper, we show how NER detects pharmacological substances, compounds, and proteins in the dataset obtained from the Spanish Clinical Case Corpus (SPACCC). To achieve this goal, we use contextualized word embeddings based on BERT language representation, which shows better results than the standard word embeddings

    Analyzing transfer learning impact in biomedical cross lingual named entity recognition and normalization

    Get PDF
    Background The volume of biomedical literature and clinical data is growing at an exponential rate. Therefore, efficient access to data described in unstructured biomedical texts is a crucial task for the biomedical industry and research. Named Entity Recognition (NER) is the first step for information and knowledge acquisition when we deal with unstructured texts. Recent NER approaches use contextualized word representations as input for a downstream classification task. However, distributed word vectors (embeddings) are very limited in Spanish and even more for the biomedical domain. Methods In this work, we develop several biomedical Spanish word representations, and we introduce two Deep Learning approaches for pharmaceutical, chemical, and other biomedical entities recognition in Spanish clinical case texts and biomedical texts, one based on a Bi-STM-CRF model and the other on a BERT-based architecture. Results Several Spanish biomedical embeddigns together with the two deep learning models were evaluated on the PharmaCoNER and CORD-19 datasets. The PharmaCoNER dataset is composed of a set of Spanish clinical cases annotated with drugs, chemical compounds and pharmacological substances; our extended Bi-LSTM-CRF model obtains an F-score of 85.24% on entity identification and classification and the BERT model obtains an F-score of 88.80% . For the entity normalization task, the extended Bi-LSTM-CRF model achieves an F-score of 72.85% and the BERT model achieves 79.97%. The CORD-19 dataset consists of scholarly articles written in English annotated with biomedical concepts such as disorder, species, chemical or drugs, gene and protein, enzyme and anatomy. Bi-LSTM-CRF model and BERT model obtain an F-measure of 78.23% and 78.86% on entity identification and classification, respectively on the CORD-19 dataset. Conclusion These results prove that deep learning models with in-domain knowledge learned from large-scale datasets highly improve named entity recognition performance. Moreover, contextualized representations help to understand complexities and ambiguity inherent to biomedical texts. Embeddings based on word, concepts, senses, etc. other than those for English are required to improve NER tasks in other languages.This work was partially supported by the Research Program of the Ministry of Economy and Competitiveness - Government of Spain, (DeepEMR project TIN2017-87548-C2-1-R)

    Spanish named entity recognition in the biomedical domain

    Get PDF
    Named Entity Recognition in the clinical domain and in languages different from English has the difficulty of the absence of complete dictionaries, the informality of texts, the polysemy of terms, the lack of accordance in the boundaries of an entity, the scarcity of corpora and of other resources available. We present a Named Entity Recognition method for poorly resourced languages. The method was tested with Spanish radiology reports and compared with a conditional random fields system.Peer ReviewedPostprint (author's final draft

    Improving the Performance of a Named Entity Extractor by Applying a Stacking Scheme

    Get PDF
    In this paper we investigate the way of improving the performance of a Named Entity Extraction (NEE) system by applying machine learning techniques and corpus transformation. The main resources used in our experiments are the publicly available tagger TnT and a corpus of Spanish texts in which named entities occurrences are tagged with BIO tags. We split the NEE task into two subtasks 1) Named Entity Recognition (NER) that involves the identification of the group of words that make up the name of an entity and 2) Named Entity Classification (NEC) that determines the category of a named entity. We have focused our work on the improvement of the NER task, generating four different taggers with the same training corpus and combining them using a stacking scheme. We improve the baseline of the NER task (Fβ=1 value of 81.84) up to a value of 88.37. When a NEC module is added to the NER system the performance of the whole NEE task is also improved. A value of 70.47 is achieved from a baseline of 66.07

    Flabase: towards the creation of a flamenco music knowledge base

    Get PDF
    Online information about flamenco music is scattered overdifferent sites and knowledge bases. Unfortunately, thereis no common repository that indexes all these data. Inthis work, information related to flamenco music is gath-ered from general knowledge bases (e.g., Wikipedia, DB-pedia), music encyclopedias (e.g., MusicBrainz), and spe-cialized flamenco websites, and is then integrated into anew knowledge base called FlaBase. As resources fromdifferent data sources do not share common identifiers, aprocess of pair-wise entity resolution has been performed.FlaBase contains information about 1,174 artists, 76pa-los(flamenco genres), 2,913 albums, 14,078 tracks, and771 Andalusian locations. It is freely available in RDF andJSON formats. In addition, a method for entity recognitionand disambiguation for FlaBase has been created. The sys-tem can recognize and disambiguate FlaBase entity refer-ences in Spanish texts with an f-measure value of 0.77. Weapplied it to biographical texts present in Flabase. By usingthe extracted information, the knowledge base is populatedwith relevant information and a semantic graph is createdconnecting the entities of FlaBase. Artists relevance is thencomputed over the graph and evaluated according to a fla-menco expert criteria. Accuracy of results shows a highdegree of quality and completeness of the knowledge base
    corecore