Generative molecular design has moved from proof-of-concept to real-world
applicability, as marked by the recent surge in papers reporting
experimental validation. Key challenges in explainability and sample efficiency
present opportunities to enhance generative design to directly optimize
expensive high-fidelity oracles and provide actionable insights to domain
experts. Here, we propose Beam Enumeration to exhaustively enumerate the most
probable sub-sequences from language-based molecular generative models and show
that molecular substructures can be extracted. When coupled with reinforcement
learning, extracted substructures become meaningful, providing a source of
explainability and improving sample efficiency through self-conditioned
generation. Beam Enumeration is generally applicable to any language-based
molecular generative model and notably further improves the performance of the
recently reported Augmented Memory algorithm, which achieved a new
state of the art on the Practical Molecular Optimization benchmark for sample
efficiency. The combined algorithm generates more high-reward molecules, and
generates them faster, given a fixed oracle budget. Beam Enumeration is the
first method to jointly address explainability and sample efficiency for
molecular design.
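
To make the enumeration idea concrete, the following is a minimal sketch of one possible reading of "exhaustively enumerate the most probable sub-sequences": at each step, every partial sequence is extended with its top-k next tokens and nothing is pruned, so k^n sub-sequences remain after n steps. The toy probability table, the function names (toy_next_token_probs, enumerate_subsequences), and the parameters top_k and n_steps are all hypothetical and chosen for illustration; this is not the authors' implementation, and a real language-based generative model would replace the lookup table.

```python
import math
from typing import Callable, Dict, List, Tuple

Prefix = Tuple[str, ...]


def toy_next_token_probs(prefix: Prefix) -> Dict[str, float]:
    """Hypothetical stand-in for a trained model.

    Returns next-token probabilities that depend only on the last token.
    """
    table = {
        None: {"C": 0.6, "N": 0.3, "O": 0.1},
        "C": {"C": 0.5, "N": 0.2, "O": 0.3},
        "N": {"C": 0.7, "N": 0.1, "O": 0.2},
        "O": {"C": 0.8, "N": 0.15, "O": 0.05},
    }
    last = prefix[-1] if prefix else None
    return table[last]


def enumerate_subsequences(
    next_probs: Callable[[Prefix], Dict[str, float]],
    top_k: int,
    n_steps: int,
) -> List[Tuple[Prefix, float]]:
    """Extend every partial sequence with its top_k tokens for n_steps steps.

    No partial sequence is discarded (assumption: "exhaustive" means no
    pruning), so the result holds top_k ** n_steps sub-sequences, each
    paired with its log-probability, sorted from most to least probable.
    """
    beams: List[Tuple[Prefix, float]] = [((), 0.0)]
    for _ in range(n_steps):
        expanded: List[Tuple[Prefix, float]] = []
        for prefix, logp in beams:
            probs = next_probs(prefix)
            # Keep only the top_k most probable next tokens for this prefix.
            best = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)[:top_k]
            for token, p in best:
                expanded.append((prefix + (token,), logp + math.log(p)))
        beams = expanded
    return sorted(beams, key=lambda b: b[1], reverse=True)


results = enumerate_subsequences(toy_next_token_probs, top_k=2, n_steps=3)
for tokens, logp in results:
    print("".join(tokens), round(math.exp(logp), 4))
```

In this sketch the enumerated token strings stand in for partial molecular strings. Extracting chemically valid substructures from them, and feeding those back into generation, would require additional steps that are not shown here.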