Instruction-tuned Large Language Models (It-LLMs) have exhibited
outstanding abilities to reason about the cognitive states, intentions, and
reactions of all people involved, helping humans guide and comprehend
day-to-day social interactions effectively. In fact, several multiple-choice
question (MCQ) benchmarks have been proposed to construct solid assessments of
the models' abilities. However, earlier works have demonstrated the presence of
an inherent "order bias" in It-LLMs, posing challenges to appropriate
evaluation. In this paper, we investigate It-LLMs' resilience to
a series of probing tests using four MCQ benchmarks. By introducing adversarial
examples, we show a significant performance gap, mainly when varying the order
of the choices, which reveals a selection bias and calls the models' reasoning
abilities into question. Having observed a correlation between first positions
and model
choices due to positional bias, we hypothesize the presence of structural
heuristics in the decision-making process of the It-LLMs, strengthened by
including significant examples in few-shot scenarios. Finally, by using the
Chain-of-Thought (CoT) technique, we elicit reasoning from the model and
mitigate the bias, obtaining more robust models.
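
The order-variation probe described above can be illustrated with a minimal
sketch. It is not the authors' evaluation code: the item format, the model
interface, the toy always_first model, and the example question are all
hypothetical stand-ins chosen for illustration. The sketch reorders the choices
of each item, tracks where the correct answer lands, and reports accuracy per
answer position together with how often each position is selected.

```python
from collections import Counter
from itertools import permutations
from typing import Callable, Sequence

# Hypothetical item format: (question, options, index of the correct option).
Item = tuple[str, Sequence[str], int]
# Hypothetical model interface: returns the index of the chosen option.
Model = Callable[[str, Sequence[str]], int]


def reorder(item: Item, order: Sequence[int]) -> Item:
    """Rearrange the options; order[k] is the original index shown at slot k."""
    question, options, gold = item
    return question, [options[i] for i in order], list(order).index(gold)


def probe_order_bias(
    model: Model, items: Sequence[Item]
) -> tuple[dict[int, float], Counter]:
    """Evaluate every permutation of each item's options.

    Returns accuracy grouped by the slot holding the correct answer, and a
    count of how often the model selected each slot.
    """
    correct: Counter = Counter()
    total: Counter = Counter()
    picked: Counter = Counter()
    for item in items:
        for order in permutations(range(len(item[1]))):
            question, options, gold = reorder(item, order)
            choice = model(question, options)
            picked[choice] += 1
            total[gold] += 1
            correct[gold] += int(choice == gold)
    accuracy = {slot: correct[slot] / total[slot] for slot in sorted(total)}
    return accuracy, picked


def always_first(question: str, options: Sequence[str]) -> int:
    """Toy stand-in for a fully position-biased model (assumption)."""
    return 0


# Hypothetical example item, not taken from any benchmark.
items = [
    ("Why did Alex apologize?",
     ["He felt guilty", "He was hungry", "He was late"],
     0),
]
accuracy, picked = probe_order_bias(always_first, items)
print(accuracy)  # {0: 1.0, 1: 0.0, 2: 0.0}
print(picked)    # Counter({0: 6})
```

For a position-insensitive model, accuracy would be roughly flat across slots;
a gap such as the one printed here is the kind of signal the order-variation
probe is meant to expose. Enumerating all permutations is only practical for a
handful of choices, so a sampled subset of orderings may be preferable on
larger items.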