We present a setup for training, evaluating and interpreting neural language
models that uses artificial, language-like data. The data is generated using a
massive probabilistic grammar (based on state-split PCFGs) that is itself
derived from a large natural language corpus, yet gives us complete
control over the generative process. We describe and release both grammar and
corpus, and test for the naturalness of our generated data. This approach
allows us to define closed-form expressions to efficiently compute exact lower
bounds on obtainable perplexity using both causal and masked language
modelling. Our results show striking differences between neural language
modelling architectures and training objectives in how closely they
approximate the lower bound on perplexity. Our approach also allows us to
directly compare learned representations to symbolic rules in the underlying
source. We experiment with various techniques for interpreting model behaviour
and learning dynamics. With access to the underlying true source, our results
show striking differences in learning dynamics and outcomes between different
classes of words.

Comment: EMNLP Findings 202