7 research outputs found

    Internship and Thesis Research on Digital Humanities Centre Projects

    No full text
    As part of my three-month internship at the DH Commons in 2020, I worked on both the DH Commons and the Digital Humanities webpages in close collaboration with Laura Ulens (my fellow intern) and Merisa Martinez, our supervisor and the founder of the Commons. We started from scratch, with a limited team, no physical space, no budget and no reputation to fall back on. Additionally, the COVID-19 pandemic threw us a curveball and led to us having to coordinate all our activities over Skype …

    Vaccinpraat : monitoring vaccine skepticism in Dutch Twitter and Facebook comments

    No full text
    We present an online tool – “Vaccinpraat” – that monitors messages expressing skepticism towards COVID-19 vaccination on Dutch-language Twitter and Facebook. The tool provides live updates, statistics and qualitative insights into opinions about vaccines and arguments used to justify antivaccination opinions. An annotation task was set up to create training data for a model that determines the vaccine stance of a message and another model that detects arguments for antivaccination opinions. For the binary vaccine skepticism detection task (vaccine-skeptic vs. non-skeptic), our model obtained F1-scores of 0.77 and 0.69 for Twitter and Facebook, respectively. Experiments on argument detection showed that this multilabel task is more challenging than stance classification, with F1-scores ranging from 0.23 to 0.68 depending on the argument class, suggesting that more research in this area is needed. Additionally, we process the content of messages related to vaccines by applying named entity recognition, fine-grained emotion analysis, and author profiling techniques. Users of the tool can consult monthly reports in PDF format and request data with model predictions. The tool is available at https://vaccinpraat.uantwerpen.be
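    The abstract reports F1-scores of 0.77 (Twitter) and 0.69 (Facebook) for the binary skepticism task. As an illustrative sketch only (the Vaccinpraat data and models are not reproduced here, and the sample labels below are invented), this is how the F1-score for such a binary vaccine-stance classifier is computed:

```python
# Illustrative sketch: F1 for a binary stance task (vaccine-skeptic vs.
# non-skeptic), the metric reported for the Vaccinpraat models.
# The gold/pred labels below are made-up examples, not project data.

def f1_score(gold, pred, positive="skeptic"):
    """F1 for one positive class: harmonic mean of precision and recall."""
    tp = sum(1 for g, p in zip(gold, pred) if g == p == positive)
    fp = sum(1 for g, p in zip(gold, pred) if g != positive and p == positive)
    fn = sum(1 for g, p in zip(gold, pred) if g == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    return 2 * precision * recall / denom if denom else 0.0

gold = ["skeptic", "skeptic", "non-skeptic", "non-skeptic", "skeptic"]
pred = ["skeptic", "non-skeptic", "non-skeptic", "skeptic", "skeptic"]
print(round(f1_score(gold, pred), 2))  # → 0.67
```

    For the multilabel argument-detection task the abstract describes, the same per-class computation would be repeated for each argument class, which is why scores there range widely (0.23 to 0.68) rather than collapsing to one number.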

    CLS INFRA D8.1 Report of the tools for the basic Natural Language Processing (NLP) tasks in the CLS context

    No full text
    This report lists and describes a selection of Natural Language Processing (NLP) tools which are considered to form a corpus-enrichment and NLP toolchain for common CLS research tasks. The tools were selected to be:
    • safely positioned in their life cycle, i.e. state-of-the-art and mature as well as continuously maintained, or in development and promised as CLS INFRA Deliverables by March 2025
    • as multilingual as possible (beyond English and several major European languages)
    • as interoperable as possible with other tools and texts in other languages
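    The report surveys existing tools rather than prescribing code, but the interoperability criterion above amounts to tools being composable into a chain. As a minimal sketch (the step functions and document shape here are invented for illustration, not taken from the report), a corpus-enrichment toolchain can be modelled as interchangeable steps that each annotate a document:

```python
# Illustrative sketch of a composable corpus-enrichment toolchain:
# each step takes an annotated document (a dict) and returns it with
# one more layer of annotation. Step names and the document shape are
# assumptions for illustration, not the report's specification.
from typing import Callable

Step = Callable[[dict], dict]

def tokenize(doc: dict) -> dict:
    doc["tokens"] = doc["text"].split()  # naive whitespace tokenizer
    return doc

def lowercase(doc: dict) -> dict:
    doc["tokens"] = [t.lower() for t in doc["tokens"]]
    return doc

def run_pipeline(doc: dict, steps: list[Step]) -> dict:
    for step in steps:
        doc = step(doc)
    return doc

doc = run_pipeline({"text": "Zij las Max Havelaar"}, [tokenize, lowercase])
print(doc["tokens"])  # → ['zij', 'las', 'max', 'havelaar']
```

    Because every step shares one interface, a mature tool (e.g. a lemmatizer from an existing NLP suite) can be swapped in for a naive step without touching the rest of the chain, which is the practical meaning of interoperability in such a toolchain.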

    Evaluating the performance and usability of a Tesseract-based OCR workflow on French-Dutch bilingual historical sources

    No full text
    The study of texts using a qualitative approach remains the dominant modus operandi in humanities research (D. Nguyen et al., 2020). While most humanities researchers emphasize the critical examination of texts, digital research methodologies are gradually being adopted as complementary options (Levenberg et al., 2018). These computational practices allow researchers to process, aggregate and analyze large quantities of texts. Analytical techniques can help humanities scholars uncover principles and patterns that were previously hidden or identify salient sources for further qualitative research (Bod, 2013; Aiello & Simeone, 2019). However, to support these and more advanced use cases such as Natural Language Processing (NLP), sources must be digitized and transformed into a machine-readable format through Optical Character Recognition (OCR) (Lopresti, 2009). Although OCR software is frequently used to convert analogue sources into digital texts, off-the-shelf OCR tools are usually less adapted to historical sources, leading to errors in text transcription (Martínek et al., 2020; Nguyen et al., 2021; Smith & Cordell, 2018). Another disadvantage of these models is that they are very susceptible to noise, resulting in relatively low text detection accuracy. Methods of digital text analysis have the potential to further expand the field of humanities (Blevins & Robichaud, 2011; Kuhn, 2019; Nguyen et al., 2021). However, as OCR quality has a profound impact on these methods, it is important that OCR-generated text is as accurate as possible to avoid bias (Traub et al., 2015; Strien et al., 2020). Adapting OCR systems to distinct historical sources is not only expensive and time-consuming, but the technical knowledge required to (re)train OCR models is often perceived as a hurdle by humanists (Nguyen et al., 2021; Smith & Cordell, 2018). 
Consequently, research efforts are often geared towards improving the output of off-the-shelf OCR tools through a process of error analysis and post-correction (Nguyen et al., 2019). These efforts have resulted in streamlined, domain-specific OCR workflows including OCR4all, eScriptorium and OCR-D (Reul et al., 2019; Kiessling et al., 2019; Neudecker et al., 2019). Despite these efforts, OCR workflows for non-English and multilingual texts remain limited (Strien et al., 2020; Reynaert et al., 2020). In this short paper we present an OCR workflow that offers a user-friendly solution for bilingual historical texts. We test it on a corpus of art exhibition catalogs printed between 1792 and 1914. These texts date from the 19th and early 20th centuries, a period marked by a major expansion of the printed word, a context that makes OCR highly meaningful, as manually processing these texts would be very laborious (Taunton, 2014). The corpus consists of catalogs that record the works present at specific exhibitions, the so-called salontentoonstellingen, which were held from 1792 to 1914 in Antwerp, Ghent and Brussels. The catalogs are bilingual French-Dutch printed texts.
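    Evaluating such a workflow typically means comparing OCR output against a ground-truth transcription. As an illustrative sketch (the sample strings below are invented, not drawn from the catalog corpus, and the paper's own evaluation setup may differ), a standard metric is the character error rate (CER): the edit distance between output and ground truth, divided by the length of the ground truth.

```python
# Illustrative sketch: character error rate (CER), a common metric for
# evaluating OCR output against a ground-truth transcription.
# The sample strings are invented, not from the catalog corpus.

def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance via dynamic programming (two-row variant)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution
        prev = cur
    return prev[-1]

def cer(truth: str, ocr_output: str) -> float:
    """Character error rate: edits needed per ground-truth character."""
    return edit_distance(truth, ocr_output) / len(truth)

truth = "Exposition des Beaux-Arts"
ocr_out = "Exposltion des Beaux-Arts"   # one substituted character
print(round(cer(truth, ocr_out), 3))    # → 0.04
```

    For the OCR step itself, Tesseract can be pointed at both catalog languages at once via its multi-language option (e.g. `tesseract page.png out -l fra+nld`), after which CER against hand-transcribed pages indicates how well the combined model handles the bilingual layout.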

    Computational Literary Studies Infrastructure (CLSINFRA): a H2020 Research Infrastructure Project that aids to connect researchers, data, and methods

    No full text
    The aim of this poster is to provide an overview of the principal objectives of the newly started H2020 Computational Literary Studies Infrastructure (CLS INFRA) project (https://www.clsinfra.io). CLS INFRA is an infrastructure project that works to develop and bring together resources of high-quality data, tools and knowledge to aid new approaches to studying literature in the digital age. Conducting computational literary studies involves a number of challenges and opportunities, from working with multilingual material to bringing together distributed information. At present, the landscape of literary data is diverse and fragmented. Even though many resources are currently available in digital libraries, archives, repositories, websites or catalogues, a lack of standardisation in how they are constructed and accessed limits the extent to which they are reusable (Ciotti 2014). The CLS INFRA project aims to federate these resources, together with the tools needed to interrogate them, and to widen their base of users, in the spirit of the FAIR and CARE principles (Wilkinson et al. 2016). The resulting improvements will benefit researchers by bridging gaps between greater- and lesser-resourced communities in computational literary studies and beyond, ultimately offering opportunities to create new research and insight into our shared and varied European cultural heritage. Rather than building entirely new resources for literary studies, the project is committed to exploiting and connecting already-existing efforts and initiatives, in order to acknowledge and utilize the immense human labour that has already been undertaken. Therefore, the project builds on recently compiled high-quality literary corpora, such as DraCor and ELTeC (Fischer et al. 2019, Burnard et al. 2021, Schöch et al. in press), integrates existing tools for text analysis, e.g. TXM, stylo and multilingual NLP pipelines (Heiden 2010, Eder et al. 2016), and takes advantage of deep integration with two other infrastructural projects, namely the CLARIN and DARIAH ERICs. 
Consequently, the project aims to build a coherent ecosystem that fosters the technical and intellectual findability and accessibility of relevant data. The ecosystem consists of (1) resources, i.e. text collections for drama, poetry and prose in several languages, (2) tools, (3) methodological and theoretical considerations, (4) a network of CLS scholars based at different European institutions, (5) a system of short-term research stays for both early-career researchers and seasoned scholars, (6) a repository for training materials, as well as (7) an efficient dissemination strategy. This is achieved through a collaboration between the participating institutions: Institute of Polish Language at the Polish Academy of Sciences, Poland; University of Potsdam, Germany; Austrian Academy of Sciences, Austria; National University of Distance Education, Spain; École Normale Supérieure de Lyon, France; Humboldt University of Berlin, Germany; Charles University, Czech Republic; Digital Research Infrastructure for the Arts and Humanities, France; Ghent Centre for Digital Humanities, Ghent University, Belgium; Belgrade Centre for Digital Humanities, Serbia; Huygens Institute for the History of the Netherlands (Royal Netherlands Academy of Arts and Sciences), Netherlands; Trier Center for Digital Humanities, Trier University, Germany; Moore Institute, National University of Ireland Galway, Ireland.

    CLS Infra Computational Literary Studies Infrastructure

    No full text
    Computational Literary Studies Infrastructure, funded by the Horizon 2020 grant scheme, is a four-year, pan-European project that aims to unify the diverse landscape of computational text analysis, in terms of available texts, tools, methods, practices and so forth, within its growing international user community. The project started in February 2021, meaning that it has been underway for just over a year. In our poster we discuss the various deliverables and activities that have come out of the CLS INFRA project in its first year, to give an idea of its impact in practice.