Search CORE

3 research outputs found

A Collaborative Ecosystem for Digital Coptic Studies

Author: Schroeder Caroline T.
Zeldes Amir
Publication venue: 'Centre pour la Communication Scientifique Directe (CCSD)'
Publication date: 01/09/2020
Field of study

Scholarship on underresourced languages bring with them a variety of challenges which make access to the full spectrum of source materials and their evaluation difficult. For Coptic in particular, large scale analyses and any kind of quantitative work become difficult due to the fragmentation of manuscripts, the highly fusional nature of an incorporational morphology, and the complications of dealing with influences from Hellenistic era Greek, among other concerns. Many of these challenges, however, can be addressed using Digital Humanities tools and standards. In this paper, we outline some of the latest developments in Coptic Scriptorium, a DH project dedicated to bringing Coptic resources online in uniform, machine readable, and openly available formats. Collaborative web-based tools create online 'virtual departments' in which scholars dispersed sparsely across the globe can collaborate, and natural language processing tools counterbalance the scarcity of trained editors by enabling machine processing of Coptic text to produce searchable, annotated corpora.Comment: 9 pages; paper presented at the Stanford University CESTA Workshop "Collecting, Preserving and Disseminating Endangered Cultural Heritage for New Understandings Through Multilingual Approaches

arXiv.org e-Print Archive

Crossref

Episciences.org

Directory of Open Access Journals

Language Modelling with Pixels

Author: Bugliarello Emanuele
de Lhoneux Miryam
Elliott Desmond
Lotz Jonas F.
Rust Phillip
Salesky Elizabeth
Publication venue
Publication date: 14/07/2022
Field of study

Language models are defined over a finite set of inputs, which creates a vocabulary bottleneck when we attempt to scale the number of supported languages. Tackling this bottleneck results in a trade-off between what can be represented in the embedding matrix and computational issues in the output layer. This paper introduces PIXEL, the Pixel-based Encoder of Language, which suffers from neither of these issues. PIXEL is a pretrained language model that renders text as images, making it possible to transfer representations across languages based on orthographic similarity or the co-activation of pixels. PIXEL is trained to reconstruct the pixels of masked patches, instead of predicting a distribution over tokens. We pretrain the 86M parameter PIXEL model on the same English data as BERT and evaluate on syntactic and semantic tasks in typologically diverse languages, including various non-Latin scripts. We find that PIXEL substantially outperforms BERT on syntactic and semantic processing tasks on scripts that are not found in the pretraining data, but PIXEL is slightly weaker than BERT when working with Latin scripts. Furthermore, we find that PIXEL is more robust to noisy text inputs than BERT, further confirming the benefits of modelling language with pixels.Comment: work in progres

arXiv.org e-Print Archive