
    Dutch Historical Spelling Normalization for Parsing and Coreference Resolution

    Non-canonical language can be handled in an NLP pipeline using normalization of the input (e.g., MoNoise; van der Goot & van Noord, CLINjournal 2017) or domain adaptation of the pipeline (e.g., Hupkes & Bod, LREC 2016); we focus on the former. MoNoise shows that normalization is effective for social media language. We consider a different domain: Dutch literature from Project Gutenberg. We work with 9 fragments that make up the OpenBoek corpus (van den Berg et al., CLIN 2021). The fragments consist of 10,000+ tokens from texts first published 1860-1920, both translated and originally Dutch.

    MoNoise consists of several modules: a lookup table, automatic spelling correction (aspell), and word embeddings; we aim to explore these techniques on our data in future work. Here we report results of a rule-based approach implemented with a sed script (i.e., regular expressions) for normalizing frequently occurring non-standard spellings.

    The output consists of instructions to the Alpino parser (van Noord, TALN 2006) to treat words with non-canonical orthography as if they occur with modern spelling. The advantage of this approach is that the resulting parse trees contain the original tokens, and existing annotation layers (such as coreference) do not have to be re-aligned. Consider the following sentence from Couperus, Eline Vere (ch. I, § II):

    18-1|- Is het [ @alt zo zoo ] goed ? vroeg zij met bevende stem , [ @alt ene eene ] , van te voren bestudeerde poze aannemende .

    Here [ @alt zo zoo ] indicates that the original token zoo should be treated as zo. Besides doubled vowels, other frequent spelling normalizations are de/den, zei/zeide, and mensen/menschen. When multiple alternatives are given the parser considers the input as a lattice and uses the sequence of tokens that generates the most likely parse.

    Parse trees for the above sentence show that the automatic spelling normalization is not perfect (the correct normalization of eene is een with POS lid rather than ene), but it does lead to a correct bracketing of the NP eene … poze. Furthermore, it turns out that a comma is missing after bestudeerde in the Project Gutenberg etext we use (EBook-No. 19563); the DBNL version of this text (coup002elin01_01) does have this comma, which underscores the importance of professionally edited critical editions.

    We will perform an intrinsic evaluation of our spelling normalization pipeline with manually corrected texts and report F1 scores (Reynaert, LREC 2008). We also perform an extrinsic evaluation of downstream tasks: part-of-speech tagging, mention detection, and coreference resolution. Scores for the latter two tasks on Multatuli, Max Havelaar:

                 mentions                lea                     CoNLL   pron
                 recall  prec    f1      recall  prec    f1              acc
    original     89.96   81.29   85.40   54.80   47.07   50.64   65.76   55.00
    normalized   90.18   82.22   86.02   54.82   45.96   50.00   65.48   54.20

    The mention score is improved, which makes sense given that parsing of NPs seems to improve after spelling normalization, but there is a decrease in the coreference metrics, which warrants further investigation.
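
The rule-based annotation step described above can be illustrated with a minimal Python sketch. It is not the authors' sed script: the rule list, the function names (`normalize_token`, `annotate`), and the handling of capitalization are assumptions made for illustration only. The sketch rewrites a tokenized sentence so that each token matched by a rule is wrapped as `[ @alt modern original ]`, the bracket form shown in the Eline Vere example, and it leaves the sentence key (e.g. `18-1|`) aside. The `eene` rule deliberately reproduces the imperfect output discussed in the abstract (ene rather than een).

```python
import re

# Hypothetical rules, loosely based on the normalizations named in the
# abstract (doubled vowels, de/den, zei/zeide, mensen/menschen).
# These are illustrative assumptions, not the rules of the original script.
RULES = [
    (re.compile(r"^zoo$"), "zo"),
    (re.compile(r"^eene$"), "ene"),   # mirrors the imperfect output in the example
    (re.compile(r"^den$"), "de"),
    (re.compile(r"^zeide$"), "zei"),
    (re.compile(r"schen$"), "sen"),   # e.g. menschen -> mensen
]


def normalize_token(token):
    """Return a modern spelling for token, or None if no rule applies."""
    lowered = token.lower()
    for pattern, replacement in RULES:
        modern = pattern.sub(replacement, lowered)
        if modern != lowered:
            # Keep an initial capital if the original token had one.
            if token[:1].isupper():
                modern = modern[:1].upper() + modern[1:]
            return modern
    return None


def annotate(sentence):
    """Wrap normalized tokens as '[ @alt modern original ]'; keep the rest."""
    out = []
    for token in sentence.split():
        modern = normalize_token(token)
        out.append(f"[ @alt {modern} {token} ]" if modern else token)
    return " ".join(out)


if __name__ == "__main__":
    print(annotate("- Is het zoo goed ? vroeg zij met bevende stem , eene , "
                   "van te voren bestudeerde poze aannemende ."))
```

Because the original token stays inside the brackets, token offsets in existing annotation layers are unaffected, which is the advantage the abstract attributes to this approach.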

    Attenuated AMPA Receptor Expression Allows Glioblastoma Cell Survival in Glutamate-Rich Environment

    Background: Glioblastoma multiforme (GBM) cells secrete large amounts of glutamate that can trigger AMPA-type glutamate receptors (AMPARs). This commonly results in Na+ and Ca2+-permeability and thereby in excitotoxic cell death of the surrounding neurons. Here we investigated how the GBM cells themselves survive in a glutamate-rich environment. Methods and Findings: In silico analysis of published reports shows down-regulation of all ionotropic glutamate receptors in GBM as compared to normal brain. In vitro, in all GBM samples tested, mRNA expression of AMPAR subunits GluR1, 2 and 4 was relatively low compared to adult and fetal total brain mRNA and adult cerebellum mRNA. These findings were in line with primary GBM samples, in which protein expression patterns were down-regulated as compared to the normal tissue. Furthermore, mislocalized expression of these receptors was found. Sequence analysis of GluR2 RNA in primary and established GBM cell lines showed that the GluR2 subunit was partly unedited. Conclusions: Together with the lack of functional effect of AMPAR inhibition by NBQX, our results suggest that down-regulation and afunctionality of AMPARs enable GBM cells to survive in a high-glutamate environment without undergoing excitotoxic cell death themselves. It can be speculated that specific AMPA receptor inhibitors may protect normal neurons against the high-glutamate microenvironment of GBM tumors.

    Contemporary research in minoritized and diaspora languages of Europe

    This volume provides a collection of research reports on multilingualism and language contact, ranging from Romance to Germanic, Greco and Slavic languages in situations of contact and diaspora. Most of the contributions are empirically oriented studies presenting first-hand data based on original fieldwork, and a few focus directly on the methodological issues in such research. Owing to the multifaceted nature of contact and diaspora phenomena (e.g. the intrinsic transnational essence of contact and diaspora, and the associated interplay between majority and minoritized languages and multilingual practices in different contact settings, contact-induced language change, and issues relating to convergence), the disciplinary scope is broad and includes ethnography, qualitative and quantitative sociolinguistics, formal linguistics, descriptive linguistics, contact linguistics, historical linguistics, and language acquisition. Case studies are drawn from Italo-Romance varieties in the Americas, Spanish-Nahuatl contact, Castellano Andino, Greko/Griko in Southern Italy, Yiddish in Anglophone communities, Frisian in the Netherlands, Wymysiöryś in Poland, Sorbian in Germany, and Pomeranian and Zeelandic Flemish in Brazil.
