
    Comparison of the Influence of Different Normalization Methods on Tweet Sentiment Analysis in the Serbian language

    Given the growing need to quickly process texts and extract information from data for various purposes, correct normalization that contributes to better and faster processing is of great importance. This paper presents a comparison of different methods of short-text (tweet) normalization, illustrated on the task of text sentiment analysis. The results of applying the different normalizations are presented, taking into account both time complexity and the classification accuracy of the sentiment algorithm. It is shown that cutting tokens to n-grams yields better or similar results compared to language-dependent normalizations. Once time complexity is taken into account, it is concluded that this language-independent normalization gives optimal results for the classification of short informal texts.
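The language-independent normalization the abstract describes amounts to truncating each token to its first n characters. A minimal sketch of that idea, with the function names and the choice of n = 4 as illustrative assumptions rather than the paper's actual settings:

```python
# Sketch of language-independent normalization by cutting each token
# to its first n characters (in place of language-dependent stemming
# or lemmatization). n = 4 is an assumed, illustrative value.

def cut_to_ngram(token: str, n: int = 4) -> str:
    """Truncate a token to its first n characters."""
    return token[:n]

def normalize_tweet(text: str, n: int = 4) -> list:
    """Lowercase, split on whitespace, and cut each token to n chars."""
    return [cut_to_ngram(tok, n) for tok in text.lower().split()]

print(normalize_tweet("Normalizacija tvitova ubrzava obradu"))
# ['norm', 'tvit', 'ubrz', 'obra']
```

The appeal of this scheme is that it needs no dictionaries or morphological rules, which matters for a morphologically rich language like Serbian.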

    Zero- and Few-Shot Machine Learning for Named Entity Recognition in Biomedical Texts

    Named entity recognition (NER) is an NLP task that involves identifying and classifying named entities in text. Token classification is a crucial subtask of NER that assigns labels to individual tokens within a text, indicating the named entity category to which they belong. Fine-tuning large language models (LLMs) on labeled domain datasets has emerged as a powerful technique for improving NER performance. By training a pretrained LLM such as BERT on domain-specific labeled data, the model learns to recognize named entities specific to that domain with high accuracy. This approach has been applied to a wide range of domains, including the biomedical one, and has demonstrated significant improvements in NER accuracy. Still, fine-tuning pre-trained LLMs requires large amounts of data, and labeling is a time-consuming and expensive process that requires expert domain knowledge. Moreover, domains with an open set of classes pose difficulties for traditional machine learning approaches, since the number of classes to predict needs to be pre-defined. Our solution to these two problems is based on a data transformation that factorizes the initial multi-class classification problem into a binary one, and on applying a cross-encoder-based BERT architecture for zero- and few-shot learning. To create our dataset, we transformed six widely used biomedical datasets, containing entities such as genes, drugs, diseases, adverse events, and chemicals, into a uniform format. This transformation enabled us to merge the datasets into a single cohesive dataset with 26 different named entity classes. We then fine-tuned two pre-trained language models, BioBERT and PubMedBERT, for the NER task in zero- and few-shot settings. The results of the experiment on 9 classes in the zero-shot mode are promising for semantically similar classes and improve significantly after providing only a few supporting examples for almost all classes. The best results were obtained using a fine-tuned PubMedBERT model, with average F1 scores of 35.44%, 50.10%, 69.94%, and 79.51% for zero-shot, one-shot, 10-shot, and 100-shot NER, respectively.
Book of abstracts: 4th Belgrade Bioinformatics Conference, June 19-23, 202
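The factorization of multi-class token classification into a binary one can be illustrated by turning each labeled sentence into one binary example per candidate class, pairing the class name with the sentence so a cross-encoder can score each token as belonging to that class or not. The function and field names below are assumptions for illustration, not the authors' code:

```python
# Illustrative factorization: one multi-class NER example becomes
# several binary examples, one per candidate entity class. The class
# name would be fed to a cross-encoder as the first text segment and
# the sentence tokens as the second.

def to_binary_examples(tokens, labels, all_classes):
    """Factorize one multi-class example into per-class binary ones."""
    examples = []
    for cls in all_classes:
        examples.append({
            "class_name": cls,
            "tokens": tokens,
            "binary_labels": [1 if lab == cls else 0 for lab in labels],
        })
    return examples

ex = to_binary_examples(
    ["Aspirin", "treats", "headache"],
    ["DRUG", "O", "DISEASE"],
    ["DRUG", "DISEASE"],
)
print(ex[0]["binary_labels"])  # [1, 0, 0]
print(ex[1]["binary_labels"])  # [0, 0, 1]
```

Because the class name is part of the input rather than a fixed output dimension, new classes can be queried at inference time without retraining, which is what enables the zero-shot setting.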

    Uncovering the Reasons Behind COVID-19 Vaccine Hesitancy in Serbia: Sentiment-Based Topic Modeling

    Background: Since the first COVID-19 vaccine appeared, there has been a growing tendency to automatically determine public attitudes toward it. In particular, it was important to find the reasons for vaccine hesitancy, since it was directly correlated with pandemic protraction. Natural language processing (NLP) and public health researchers have turned to social media (eg, Twitter, Reddit, and Facebook) for user-created content from which they can gauge public opinion on vaccination. To automatically process such content, they use a number of NLP techniques, most notably topic modeling. Topic modeling enables the automatic uncovering and grouping of hidden topics in the text. When applied to content that expresses a negative sentiment toward vaccination, it can give direct insight into the reasons for vaccine hesitancy.
    Objective: This study applies NLP methods to classify vaccination-related tweets by sentiment polarity and uncover the reasons for vaccine hesitancy among the negative tweets in the Serbian language.
    Methods: To study the attitudes and beliefs behind vaccine hesitancy, we collected 2 batches of tweets that mention some aspects of COVID-19 vaccination. The first batch of 8817 tweets was manually annotated as either relevant or irrelevant regarding the COVID-19 vaccination sentiment, and then the relevant tweets were annotated as positive, negative, or neutral. We used the annotated tweets to train a sequential bidirectional encoder representations from transformers (BERT)-based classifier for 2 tweet classification tasks to augment this initial data set. The first classifier distinguished between relevant and irrelevant tweets. The second classifier took the relevant tweets and classified them as negative, positive, or neutral. This sequential classifier was used to annotate the second batch of tweets. The combined data sets resulted in 3286 tweets with a negative sentiment: 1770 (53.9%) from the manually annotated data set and 1516 (46.1%) as a result of automatic classification. Topic modeling methods (latent Dirichlet allocation [LDA] and nonnegative matrix factorization [NMF]) were applied to the 3286 preprocessed tweets to detect the reasons for vaccine hesitancy.
    Results: The relevance classifier achieved F-scores of 0.91 and 0.96 for relevant and irrelevant tweets, respectively. The sentiment polarity classifier achieved F-scores of 0.87, 0.85, and 0.85 for negative, neutral, and positive sentiments, respectively. By summarizing the topics obtained in both models, we extracted 5 main groups of reasons for vaccine hesitancy: concern over vaccine side effects, concern over vaccine effectiveness, concern over insufficiently tested vaccines, mistrust of authorities, and conspiracy theories.
    Conclusions: This paper presents a combination of NLP methods applied to find the reasons for vaccine hesitancy in Serbia. Given these reasons, it is now possible to better understand the concerns of people regarding the vaccination process.

    A transformer-based method for zero and few-shot biomedical named entity recognition

    Supervised named entity recognition (NER) in the biomedical domain depends on large sets of texts annotated with the given named entities, whose creation can be time-consuming and expensive. Furthermore, extracting new entities often requires additional annotation tasks and retraining of the model. To address these challenges, this paper proposes a transformer-based method for zero- and few-shot NER in the biomedical domain. The method transforms the task of multi-class token classification into binary token classification (a token either contains the searched entity or does not) and relies on pre-training over a large collection of datasets and biomedical entities, from which the method can learn semantic relations between the given and potential classes. With a fine-tuned PubMedBERT model, we achieved average F1 scores of 35.44% for zero-shot NER, 50.10% for one-shot NER, 69.94% for 10-shot NER, and 79.51% for 100-shot NER on 9 diverse evaluated biomedical entities. The results demonstrate the effectiveness of the proposed method for recognizing new entities with limited examples, with results comparable to or better than state-of-the-art zero- and few-shot NER methods.
    Comment: Collaboration between Bayer Pharma R&D and the Serbian Institute for Artificial Intelligence Research and Development
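The few-shot settings reported above (one-shot, 10-shot, 100-shot) imply assembling a fixed number of support examples per entity class before fine-tuning. A minimal sketch of that sampling step, with all data and field names as illustrative assumptions:

```python
# Sketch of building a k-shot support set: pick k labeled examples
# per entity class from a labeled pool. Field names are assumptions.
import random

def k_shot_support(pool, k, seed=0):
    """Sample up to k examples per entity class from the pool."""
    rng = random.Random(seed)
    by_class = {}
    for ex in pool:
        by_class.setdefault(ex["entity_class"], []).append(ex)
    return {cls: rng.sample(exs, min(k, len(exs)))
            for cls, exs in by_class.items()}

pool = (
    [{"entity_class": "DRUG", "text": f"drug example {i}"} for i in range(5)]
    + [{"entity_class": "DISEASE", "text": f"disease example {i}"} for i in range(5)]
)
support = k_shot_support(pool, k=1)
print({cls: len(exs) for cls, exs in support.items()})
# {'DRUG': 1, 'DISEASE': 1}
```

Fixing the random seed keeps the support set reproducible across runs, which matters when comparing k-shot scores across models.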