    CoNTACT : a Dutch COVID-19 adapted BERT for vaccine hesitancy and argumentation detection

    We present CoNTACT: a Dutch language model adapted to the domain of COVID-19 tweets. The model was developed by continuing the pre-training of RobBERT (Delobelle et al., 2020) on 2.8M Dutch COVID-19-related tweets posted in 2021. To measure the effect of the adaptation, CoNTACT and RobBERT were compared on two tasks: (1) binary vaccine hesitancy detection and (2) detection of arguments for vaccine hesitancy. For both tasks, not only Twitter but also Facebook data was used to assess cross-genre performance. CoNTACT showed statistically significant gains over RobBERT in all experiments for task 1. For task 2, we observed substantial improvements in virtually all classes in all experiments. An error analysis indicated that the domain adaptation yielded better representations of domain-specific terminology, causing CoNTACT to make more accurate classification decisions.
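Continued pre-training of this kind reuses the masked-language-modelling objective on in-domain text. As a rough illustration of the masking step only (not the authors' code; the token ids, special-token set, vocabulary size, and probability split below are invented for the sketch):

```python
import random

# Hypothetical MLM masking step for continued pre-training on domain tweets.
MASK_ID, VOCAB_SIZE, SPECIAL_IDS = 4, 30000, {0, 1, 2, 3}
IGNORE_INDEX = -100  # positions that do not contribute to the MLM loss

def mask_tokens(ids, mask_prob=0.15, rng=random):
    """Return (masked input ids, MLM labels) for one token sequence."""
    masked, labels = list(ids), []
    for i, tok in enumerate(ids):
        if tok not in SPECIAL_IDS and rng.random() < mask_prob:
            labels.append(tok)  # the model must predict the original token here
            roll = rng.random()
            if roll < 0.8:
                masked[i] = MASK_ID                       # 80%: replace with [MASK]
            elif roll < 0.9:
                masked[i] = rng.randrange(5, VOCAB_SIZE)  # 10%: random token
            # remaining 10%: keep the token unchanged
        else:
            labels.append(IGNORE_INDEX)
    return masked, labels

masked, labels = mask_tokens([1, 101, 102, 103, 2])
```

Only positions with a label other than `IGNORE_INDEX` enter the loss, so the model learns domain vocabulary from the unmasked context around them.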

    Vaccinpraat : monitoring vaccine skepticism in Dutch Twitter and Facebook comments

    We present an online tool – “Vaccinpraat” – that monitors messages expressing skepticism towards COVID-19 vaccination on Dutch-language Twitter and Facebook. The tool provides live updates, statistics and qualitative insights into opinions about vaccines and arguments used to justify antivaccination opinions. An annotation task was set up to create training data for a model that determines the vaccine stance of a message and another model that detects arguments for antivaccination opinions. For the binary vaccine skepticism detection task (vaccine-skeptic vs. non-skeptic), our model obtained F1-scores of 0.77 and 0.69 for Twitter and Facebook, respectively. Experiments on argument detection showed that this multilabel task is more challenging than stance classification, with F1-scores ranging from 0.23 to 0.68 depending on the argument class, suggesting that more research in this area is needed. Additionally, we process the content of messages related to vaccines by applying named entity recognition, fine-grained emotion analysis, and author profiling techniques. Users of the tool can consult monthly reports in PDF format and request data with model predictions. The tool is available at https://vaccinpraat.uantwerpen.be
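The reported stance scores are F1 over the vaccine-skeptic class. A minimal sketch of that computation (the gold and predicted labels below are invented examples, not data from the tool):

```python
# F1 for a binary stance task: 1 = vaccine-skeptic, 0 = non-skeptic.
def binary_f1(gold, pred, positive=1):
    tp = sum(g == p == positive for g, p in zip(gold, pred))
    fp = sum(g != positive and p == positive for g, p in zip(gold, pred))
    fn = sum(g == positive and p != positive for g, p in zip(gold, pred))
    if tp == 0:
        return 0.0
    precision, recall = tp / (tp + fp), tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

gold = [1, 1, 1, 0, 0, 1]
pred = [1, 1, 0, 0, 1, 1]
score = binary_f1(gold, pred)  # 0.75
```

Because F1 balances precision and recall on the positive class, it is a stricter summary than plain accuracy when skeptic messages are the minority.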

    Predicting COVID-19 symptoms from free text in medical records using Artificial Intelligence : feasibility study

    BACKGROUND: Electronic medical records have opened opportunities to analyze clinical practice at large scale. Structured registries and coding procedures such as the International Classification of Primary Care have further improved these procedures. However, a large part of the information about the state of the patient and the doctor's observations is still entered in free-text fields. The main function of those fields is to record the doctor's line of thought, to remind the doctor and his or her colleagues of follow-up actions, and to provide accountability for clinical decisions. These fields contain rich information that can be complementary to the coded fields, but until now they have hardly been used for analysis. OBJECTIVE: This study aims to develop a prediction model that converts the free-text information on COVID-19–related symptoms in out-of-hours care electronic medical records into symptom-based data that can be analyzed at large scale. METHODS: The design was a feasibility study in which we examined the content of the raw data, the steps and methods for modelling, and the precision and accuracy of the models. A prediction model for 27 preidentified COVID-19–relevant symptoms was developed for a data set derived from the database of primary-care out-of-hours consultations in Flanders. A multiclass, multilabel categorization classifier was developed. We tested two approaches: (1) a classical machine learning–based text categorization approach, Binary Relevance, and (2) a deep neural network approach with BERTje, including a domain-adapted version. Ethical approval was acquired through the Institutional Review Board of the Institute of Tropical Medicine and the ethics committee of the University Hospital of Antwerpen (ref 20/50/693). RESULTS: The sample set comprised 3957 fields. After cleaning, 2313 could be used for the experiments. Of these, 85% (n=1966) were used to train the model and 15% (n=347) for testing. The standard BERTje model performed best, reaching a weighted F1 score of 0.70 and an exact match ratio (accuracy) of 0.38, the proportion of instances for which the model identified all correct codes. The other models achieved respectable results as well, with weighted F1 scores ranging from 0.59 to 0.70. The Binary Relevance method performed best on the data without a frequency threshold. Among individual codes, the domain-adapted version of BERTje performed better on several of the less common objective codes, whereas standard BERTje reached higher F1 scores for the rarest labels in particular and for most other codes in general. CONCLUSIONS: The artificial intelligence model BERTje can reliably predict COVID-19–related information from the free-text fields of medical records generated in primary care settings. This feasibility study invites researchers to examine further possibilities for using primary care routine data.
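Binary Relevance decomposes a multilabel task into one independent binary classifier per label, and the exact match ratio then demands that every label of a field be predicted correctly at once. A minimal sketch of both ideas; the symptom labels, keyword rules, and example texts are invented for illustration, not taken from the study:

```python
# Hypothetical per-label keyword rules standing in for trained classifiers.
LABEL_KEYWORDS = {
    "cough": ["cough", "hoest"],
    "fever": ["fever", "koorts"],
    "fatigue": ["tired", "moe"],
}

def train_binary_relevance(keywords):
    # One independent binary decision function per label.
    def make_clf(words):
        return lambda text: any(w in text.lower() for w in words)
    return {label: make_clf(words) for label, words in keywords.items()}

def predict(models, text):
    return {label for label, clf in models.items() if clf(text)}

def exact_match_ratio(gold_sets, pred_sets):
    # Strict multilabel accuracy: the full label set must be correct.
    hits = sum(g == p for g, p in zip(gold_sets, pred_sets))
    return hits / len(gold_sets)

models = train_binary_relevance(LABEL_KEYWORDS)
pred = predict(models, "patient reports koorts and hoest")  # {"cough", "fever"}
```

Because a single missed or spurious label makes the whole field count as wrong, an exact match ratio of 0.38 can coexist with a much higher per-label weighted F1.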

    Same data, different analysts: variation in effect sizes due to analytical decisions in ecology and evolutionary biology

    Gould E, Fraser H, Parker T, et al. Same data, different analysts: variation in effect sizes due to analytical decisions in ecology and evolutionary biology. 2023.
    Although variation in effect sizes and predicted values among studies of similar phenomena is inevitable, such variation far exceeds what might be produced by sampling error alone. One possible explanation for variation among results is differences among researchers in the decisions they make regarding statistical analyses. A growing array of studies has explored this analytical variability in different (mostly social science) fields and has found substantial variability among results, despite analysts having the same data and research question. We implemented an analogous study in ecology and evolutionary biology, fields in which there has been no empirical exploration of the variation in effect sizes or model predictions generated by the analytical decisions of different researchers. We used two unpublished datasets, one from evolutionary ecology (blue tit, Cyanistes caeruleus; comparing sibling number and nestling growth) and one from conservation ecology (Eucalyptus; comparing grass cover and tree seedling recruitment), and the project leaders recruited 174 analyst teams, comprising 246 analysts, to investigate the answers to prespecified research questions. Analyses conducted by these teams yielded 141 usable effects for the blue tit dataset and 85 usable effects for the Eucalyptus dataset. We found substantial heterogeneity among results for both datasets, although the patterns of variation differed between them. For the blue tit analyses, the average effect was convincingly negative, with less growth for nestlings living with more siblings, but there was near-continuous variation in effect size from large negative effects to effects near zero, and even effects crossing the traditional threshold of statistical significance in the opposite direction. In contrast, the average relationship between grass cover and Eucalyptus seedling number was only slightly negative and not convincingly different from zero, and most effects ranged from weakly negative to weakly positive, with about a third of effects crossing the traditional threshold of significance in one direction or the other. However, there were also several striking outliers in the Eucalyptus dataset, with effects far from zero. For both datasets, we found substantial variation in the variable selection and random effects structures among analyses, as well as in the ratings of the analytical methods by peer reviewers, but we found no strong relationship between any of these and deviation from the meta-analytic mean. In other words, analyses with results far from the mean were no more or less likely to have dissimilar variable sets, use random effects in their models, or receive poor peer reviews than analyses with results close to the mean. The existence of substantial variability among analysis outcomes raises important questions about how ecologists and evolutionary biologists should interpret published results, and how they should conduct analyses in the future.
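Heterogeneity of this kind is commonly quantified with an inverse-variance weighted mean across teams, Cochran's Q, and the I² statistic (the share of variation beyond sampling error). A minimal sketch of those quantities; the effect sizes and standard errors below are invented, not values from the study:

```python
# Fixed-effect weighted mean plus Cochran's Q and I^2 for a set of
# per-team effect sizes with their standard errors.
def heterogeneity(effects, ses):
    weights = [1.0 / se ** 2 for se in ses]          # inverse-variance weights
    mean = sum(w * y for w, y in zip(weights, effects)) / sum(weights)
    q = sum(w * (y - mean) ** 2 for w, y in zip(weights, effects))
    df = len(effects) - 1
    # I^2: fraction of observed variation not explained by sampling error.
    i2 = max(0.0, (q - df) / q) if q > 0 else 0.0
    return mean, q, i2

effects = [-0.5, -0.3, -0.4, 0.1]   # one standardized effect per analyst team
ses = [0.1, 0.1, 0.1, 0.1]
mean, q, i2 = heterogeneity(effects, ses)
```

With these invented numbers the weighted mean is clearly negative while one team's effect crosses zero, and I² is high, mirroring the "same data, different conclusions" pattern the study documents.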