Statistical significance testing plays an important role when drawing
conclusions from experimental results in NLP papers. In particular, it is a
valuable tool for establishing that one algorithm is superior to
another. This appendix complements the guide for testing statistical
significance in NLP presented in \cite{dror2018hitchhiker} by proposing valid
statistical tests for the common tasks and evaluation measures in the field.