A Collaborative Ecosystem for Digital Coptic Studies
Scholarship on underresourced languages brings with it a variety of
challenges which make access to the full spectrum of source materials and their
evaluation difficult. For Coptic in particular, large-scale analyses and any
kind of quantitative work become difficult due to the fragmentation of
manuscripts, the highly fusional nature of an incorporational morphology, and
the complications of dealing with influences from Hellenistic-era Greek, among
other concerns. Many of these challenges, however, can be addressed using
Digital Humanities tools and standards. In this paper, we outline some of the
latest developments in Coptic Scriptorium, a DH project dedicated to bringing
Coptic resources online in uniform, machine-readable, and openly available
formats. Collaborative web-based tools create online 'virtual departments' in
which scholars dispersed sparsely across the globe can collaborate, and natural
language processing tools counterbalance the scarcity of trained editors by
enabling machine processing of Coptic text to produce searchable, annotated
corpora.
Comment: 9 pages; paper presented at the Stanford University CESTA Workshop
"Collecting, Preserving and Disseminating Endangered Cultural Heritage for
New Understandings Through Multilingual Approaches"
Language Modelling with Pixels
Language models are defined over a finite set of inputs, which creates a
vocabulary bottleneck when we attempt to scale the number of supported
languages. Tackling this bottleneck results in a trade-off between what can be
represented in the embedding matrix and computational issues in the output
layer. This paper introduces PIXEL, the Pixel-based Encoder of Language, which
suffers from neither of these issues. PIXEL is a pretrained language model that
renders text as images, making it possible to transfer representations across
languages based on orthographic similarity or the co-activation of pixels.
PIXEL is trained to reconstruct the pixels of masked patches, instead of
predicting a distribution over tokens. We pretrain the 86M parameter PIXEL
model on the same English data as BERT and evaluate on syntactic and semantic
tasks in typologically diverse languages, including various non-Latin scripts.
We find that PIXEL substantially outperforms BERT on syntactic and semantic
processing tasks on scripts that are not found in the pretraining data, but
PIXEL is slightly weaker than BERT when working with Latin scripts.
Furthermore, we find that PIXEL is more robust to noisy text inputs than BERT,
further confirming the benefits of modelling language with pixels.
Comment: work in progress
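The masked-patch objective described in the PIXEL abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not state the patch size, the masking ratio, or the loss function, so the 16-pixel patches, the 25% masking ratio, and the mean squared error used here are assumptions, as are all function names. A synthetic checkerboard stands in for a rendered line of text, and a "model" that always predicts blank patches stands in for the encoder.

```python
import random

PATCH = 16          # assumed patch side length in pixels (not given in the abstract)
MASK_RATIO = 0.25   # assumed fraction of patches to mask (not given in the abstract)


def split_into_patches(image, patch=PATCH):
    """Cut a grayscale image (list of pixel rows, `patch` rows tall) into square patches."""
    if len(image) != patch or any(len(row) != len(image[0]) for row in image):
        raise ValueError("image must be a rectangle exactly `patch` rows tall")
    width = len(image[0])
    if width == 0 or width % patch != 0:
        raise ValueError("image width must be a positive multiple of `patch`")
    return [[row[x:x + patch] for row in image] for x in range(0, width, patch)]


def choose_masked(num_patches, ratio=MASK_RATIO, rng=random):
    """Pick the indices of the patches to hide from the model."""
    count = max(1, round(num_patches * ratio))
    return sorted(rng.sample(range(num_patches), count))


def masked_mse(targets, predictions, masked):
    """Mean squared pixel error, computed over the masked patches only."""
    errors = [
        (t - p) ** 2
        for i in masked
        for t_row, p_row in zip(targets[i], predictions[i])
        for t, p in zip(t_row, p_row)
    ]
    return sum(errors) / len(errors)


# Toy stand-in for a rendered line of text: 16 rows by 64 columns of 0/1 pixels.
image = [[(x + y) % 2 for x in range(64)] for y in range(PATCH)]
patches = split_into_patches(image)
masked = choose_masked(len(patches), rng=random.Random(0))

# A trivial "model" that predicts blank (all-zero) patches everywhere.
blank = [[[0] * PATCH for _ in range(PATCH)] for _ in patches]
loss = masked_mse(patches, blank, masked)
print(f"{len(patches)} patches, masked indices {masked}, loss {loss}")
```

The point of the sketch is that the training target is a grid of pixel values, not a distribution over a vocabulary, so no embedding matrix or output softmax over tokens is involved.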