Words, Subwords, and Morphemes: What Really Matters in the Surprisal-Reading Time Relationship?
An important assumption that comes with using LLMs on psycholinguistic data has gone unverified: LLM-based predictions rest on subword tokenization, not on decomposition of words into morphemes. Does that matter? We test this carefully by comparing surprisal estimates derived from orthographic, morphological, and BPE tokenization against reading time data. Our results replicate previous findings and provide evidence that, in the aggregate, predictions using BPE tokenization do not suffer relative to morphological and orthographic segmentation. However, a finer-grained analysis points to potential issues with relying on BPE-based tokenization, while also yielding promising results for morphologically aware surprisal estimates and suggesting a new method for evaluating morphological prediction.
Comment: Accepted to Findings of EMNLP 2023; 10 pages, 5 figures
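The comparison in this abstract hinges on surprisal computed over subword pieces. As a minimal sketch of how such word-level estimates are typically derived (this is not the paper's own pipeline; the GPT-2 model, the Hugging Face transformers API, and the Ġ-based word-grouping heuristic are illustrative assumptions), a word's surprisal under a BPE model is the sum of its pieces' surprisals, by the chain rule:

```python
# Sketch: word-level surprisal from BPE subword pieces under a causal LM.
# By the chain rule, P(word | context) = prod_i P(piece_i | context, earlier pieces),
# so word surprisal = sum of subword surprisals.
# Model and grouping heuristic are assumptions, not the paper's setup.
import math
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")
model.eval()

def word_surprisals(sentence: str):
    ids = tokenizer(sentence, return_tensors="pt")["input_ids"]
    with torch.no_grad():
        logits = model(ids).logits
    # Token i is predicted from position i-1; convert nats to bits.
    log_probs = torch.log_softmax(logits[0, :-1], dim=-1)
    tok_surprisal = -log_probs.gather(1, ids[0, 1:, None]).squeeze(1) / math.log(2)
    # Group pieces into words: GPT-2 marks word-initial pieces with "Ġ".
    # The first token has no preceding context, so it is skipped.
    pieces = tokenizer.convert_ids_to_tokens(ids[0].tolist())[1:]
    words, totals = [], []
    for piece, s in zip(pieces, tok_surprisal.tolist()):
        if piece.startswith("Ġ") or not words:
            words.append(piece.lstrip("Ġ"))
            totals.append(s)
        else:
            words[-1] += piece  # continuation piece: extend current word
            totals[-1] += s     # and accumulate its surprisal
    return list(zip(words, totals))

print(word_surprisals("The psycholinguist measured reading times."))
```

Summing over pieces is what makes word-level surprisal comparable across segmentation schemes: orthographic, morphological, and BPE tokenizations each yield one number per word that can be regressed against reading times.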
Contextualized Word Embeddings Capture Human-Like Relations Between English Word Senses
CogALex 2020 Submission
Telephone: Evaluating Language Models with Serial Reproduction
Repository for "Evaluating Models of Robust Word Recognition with Serial Reproduction"