Graph neural networks (GNNs) have recently emerged as a promising learning
paradigm in learning graph-structured data and have demonstrated wide success
across various domains such as recommendation systems, social networks, and
electronic design automation (EDA). Like other deep learning (DL) methods, GNNs
are being deployed in sophisticated modern hardware systems, as well as
dedicated accelerators. However, despite the popularity of GNNs and the recent
efforts of bringing GNNs to hardware, the fault tolerance and resilience of
GNNs have generally been overlooked. Inspired by the inherent algorithmic
resilience of DL methods, this paper conducts, for the first time, a
large-scale and empirical study of GNN resilience, aiming to understand the
relationship between hardware faults and GNN accuracy. By developing a
customized fault injection tool on top of PyTorch, we perform extensive fault
injection experiments on various GNN models and application datasets. We
observe that the error resilience of GNN models varies by orders of magnitude
with respect to different models and application datasets. Further, we explore
a low-cost error mitigation mechanism for GNN to enhance its resilience. This
GNN resilience study aims to open up new directions and opportunities for
future GNN accelerator design and architectural optimization