There are many ways to express similar things in text, which makes evaluating
natural language generation (NLG) systems difficult. Compounding this
difficulty is the need to assess varying quality criteria depending on the
deployment setting. While the landscape of NLG evaluation has been well-mapped,
practitioners' goals, assumptions, and constraints -- which inform decisions
about what, when, and how to evaluate -- are often partially or implicitly
stated, or not stated at all. Combining a formative semi-structured interview
study of NLG practitioners (N=18) with a survey study of a broader sample of
practitioners (N=61), we surface goals, community practices, assumptions, and
constraints that shape NLG evaluations, examining their implications and how
they embody ethical considerations.