Free energy calculations are rapidly becoming indispensable in
structure-enabled drug discovery programs. As new methods, force fields, and
implementations are developed, assessing their expected accuracy on real-world
systems (benchmarking) becomes critical: it tells users what accuracy to expect
when these methods are applied within their domain of applicability, and it
gives developers a way to assess the expected impact of new methodologies.
These assessments require construction of a benchmark: a set of
well-prepared, high-quality systems with corresponding experimental
measurements designed to ensure the resulting calculations provide a realistic
assessment of expected performance when these methods are deployed within their
domains of applicability. To date, the community has not adopted a common
standardized benchmark, and existing benchmark reports suffer from myriad
issues, including poor data quality, limited statistical power, and
statistically deficient analyses, all of which can conspire to produce
benchmarks that are poorly predictive of real-world performance. Here, we
address these issues by presenting guidelines for (1) curating experimental
data to develop meaningful benchmark sets, (2) preparing benchmark inputs
according to best practices to facilitate widespread adoption, and (3) analyzing
the resulting predictions to enable statistically meaningful comparisons
among methods and force fields.