Deep learning (DL) has become pervasive in a wide spectrum of modern software
systems and applications. The rich features of these DL-based software
applications (i.e., DL software) usually rely on powerful DL models. To train
powerful DL models on large datasets efficiently, it has become common
practice for developers to parallelize and distribute the computation and
memory across multiple devices during training, a process known as
distributed training. However, existing efforts in the software engineering
(SE) research community mainly focus on issues in the general process of
training DL models. By contrast, to the best of our knowledge, the issues that
developers encounter in distributed training have not been well studied.
Given the growing importance of distributed training in the current practice of
developing DL software, this paper fills this knowledge gap and presents the
first comprehensive study of developers' issues in distributed training. To
this end, we extract and analyze 1,054 real-world developers' issues in
distributed training from Stack Overflow and GitHub, two commonly used data
sources for studying software issues. We construct a fine-grained taxonomy of
30 categories of fault symptoms and summarize common fix patterns for the
different symptoms. Based on these results, we suggest
actionable implications and research avenues that can facilitate the future
development of distributed training.