This paper provides a theoretical understanding of the Deep Q-Network (DQN) with ε-greedy exploration in deep reinforcement learning. Despite the tremendous empirical success of DQN, its theoretical characterization remains underexplored. First, the exploration strategy is either impractical or ignored in existing analyses. Second, in contrast to
conventional Q-learning algorithms, DQN employs a target network and experience replay to obtain an unbiased estimate of the mean-squared Bellman error (MSBE) used to train the Q-network.
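For concreteness, a standard form of this objective (notation introduced here for illustration, not quoted from the paper: $\theta$ the Q-network parameters, $\theta^-$ the target-network parameters, $\gamma$ the discount factor, $\mathcal{D}$ the replay buffer) is

$$\operatorname{MSBE}(\theta) = \mathbb{E}_{(s,a,r,s') \sim \mathcal{D}}\Big[\big(r + \gamma \max_{a'} Q_{\theta^-}(s',a') - Q_\theta(s,a)\big)^2\Big].$$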
However, existing theoretical analyses of DQNs either lack convergence guarantees or bypass the technical challenges by deploying a significantly overparameterized neural network, which is not computationally efficient. This paper provides the first
theoretical convergence and sample complexity analysis of the practical setting
of DQNs with an ε-greedy policy. We prove that an iterative procedure with decaying ε converges to the optimal Q-value function geometrically.
Moreover, a higher level of ε enlarges the region of convergence but slows down the convergence, while the opposite holds for a lower level of ε. Experiments justify our theoretical insights on DQNs.
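As a minimal illustration of the exploration scheme analyzed here (a sketch only; the geometric decay rate and floor are assumptions, not the paper's algorithm), ε-greedy action selection with decaying ε can be written as:

```python
import numpy as np

def epsilon_greedy_action(q_values, epsilon, rng):
    """With probability epsilon pick a uniformly random action; otherwise act greedily."""
    if rng.random() < epsilon:
        return int(rng.integers(len(q_values)))
    return int(np.argmax(q_values))

# Illustrative decaying schedule (decay rate and floor are assumed values):
# a larger epsilon explores more, enlarging the region of convergence but
# slowing convergence; decaying epsilon trades off the two over training.
rng = np.random.default_rng(0)
epsilon, decay, eps_floor = 1.0, 0.995, 0.05
for t in range(1_000):
    q_values = np.zeros(4)  # placeholder Q-estimates for a 4-action toy task
    action = epsilon_greedy_action(q_values, epsilon, rng)
    # ... environment step, replay-buffer insert, and Q-network update go here ...
    epsilon = max(eps_floor, epsilon * decay)
```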