Recent years have seen the development of descriptor generation based on
representation learning of extremely diverse molecules, especially methods that
apply natural language processing (NLP) models to SMILES, a textual
representation of molecular structure. However, little research has examined
how these models understand chemical structure. To address this, we
investigated the relationship between the learning progress of SMILES and
chemical structure using a representative NLP model, the Transformer. The
results suggest that while the Transformer learns partial structures of
molecules quickly, it requires extended training to understand overall
structures. Consistent with this, the accuracy of molecular property prediction
using descriptors generated at different training steps remained similar from
the beginning to the end of training. Furthermore, we found that the
Transformer requires particularly long training to learn chirality and
sometimes stagnates at low translation accuracy because it misinterprets
enantiomers. These findings are expected to deepen the understanding of NLP models
in chemistry.
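
To make the setup concrete, the following is a minimal sketch, not the authors'
implementation, of the kind of pipeline the abstract describes: a PyTorch
Transformer trained on a SMILES-to-SMILES translation objective (for example,
randomized-to-canonical SMILES), whose mean-pooled encoder states serve as
molecular descriptors for downstream property prediction. The class name,
tokenization, pooling choice, and all hyperparameters are illustrative
assumptions.

import torch
import torch.nn as nn

class SmilesDescriptorModel(nn.Module):
    """Transformer for SMILES translation; the encoder output doubles as a descriptor."""
    def __init__(self, vocab_size, d_model=256, max_len=128):
        super().__init__()
        self.tok = nn.Embedding(vocab_size, d_model)
        self.pos = nn.Embedding(max_len, d_model)  # learned positional embedding (an assumption)
        self.transformer = nn.Transformer(d_model=d_model, nhead=8,
                                          num_encoder_layers=4,
                                          num_decoder_layers=4,
                                          batch_first=True)
        self.out = nn.Linear(d_model, vocab_size)

    def _embed(self, tokens):
        pos = torch.arange(tokens.size(1), device=tokens.device)
        return self.tok(tokens) + self.pos(pos)

    def encode(self, src):
        # Mean-pool the encoder states into one fixed-length descriptor per molecule.
        return self.transformer.encoder(self._embed(src)).mean(dim=1)

    def forward(self, src, tgt):
        memory = self.transformer.encoder(self._embed(src))
        mask = self.transformer.generate_square_subsequent_mask(tgt.size(1))
        hidden = self.transformer.decoder(self._embed(tgt), memory, tgt_mask=mask)
        return self.out(hidden)

model = SmilesDescriptorModel(vocab_size=64)
src = torch.randint(0, 64, (2, 30))  # e.g. tokenized randomized SMILES (input)
tgt = torch.randint(0, 64, (2, 30))  # e.g. tokenized canonical SMILES (target)
logits = model(src, tgt)             # translation objective used during training
descriptors = model.encode(src)      # shape (2, 256): inputs to property models

Descriptors extracted this way at different checkpoints could then be compared
on property-prediction tasks, as the abstract describes.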