The explosive growth of dynamic and heterogeneous data traffic poses great
challenges to 5G and beyond mobile networks. To enhance the network capacity
and reliability, we propose a learning-based dynamic time-frequency division
duplexing (D-TFDD) scheme that adaptively allocates the uplink and downlink
time-frequency resources of base stations (BSs) to meet the asymmetric and
heterogeneous traffic demands while alleviating the inter-cell interference. We
formulate the problem as a decentralized partially observable Markov decision
process (Dec-POMDP) that maximizes the long-term expected sum rate under the
users' packet dropping ratio constraints. In order to jointly optimize the
global resources in a decentralized manner, we propose a federated
reinforcement learning (RL) algorithm named federated Wolpertinger deep
deterministic policy gradient (FWDDPG) algorithm. Each BS decides its local
time-frequency configuration through an RL algorithm, and global training is
achieved by exchanging local RL models with neighboring BSs under a
decentralized federated learning framework. Specifically, to handle the
large-scale
discrete action space of each BS, we adopt a DDPG-based algorithm to generate
actions in a continuous space, and then utilize the Wolpertinger policy to
reduce the error of mapping continuous actions back to the discrete action
space.
Simulation results demonstrate the superiority of our proposed algorithm over
benchmark algorithms with respect to the system sum rate.
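As an illustrative sketch only (not the paper's implementation), the Wolpertinger action-selection step described above can be summarized as follows: the DDPG actor emits a proto-action in continuous space, the k nearest valid discrete actions are retrieved, and the critic's Q-values pick among those candidates. All names, shapes, and the toy critic below are assumptions for illustration.

```python
import numpy as np

def wolpertinger_select(proto_action, discrete_actions, q_fn, k=3):
    """Hypothetical Wolpertinger selection sketch.

    proto_action: (d,) continuous output of the actor.
    discrete_actions: (N, d) embeddings of the valid discrete actions
        (e.g., candidate uplink/downlink time-frequency configurations).
    q_fn: assumed critic, mapping an action embedding to a scalar Q-value.
    Returns the index of the chosen discrete action.
    """
    # k-nearest-neighbor lookup in the action embedding space
    dists = np.linalg.norm(discrete_actions - proto_action, axis=1)
    nearest = np.argsort(dists)[:k]
    # refine with the critic: argmax Q over the k nearest candidates
    q_vals = np.array([q_fn(discrete_actions[i]) for i in nearest])
    return int(nearest[np.argmax(q_vals)])

# Toy example: four 1-D actions {0, 1, 2, 3}; a critic that favors
# larger actions breaks the tie among the retrieved neighbors.
actions = np.arange(4, dtype=float).reshape(-1, 1)
idx = wolpertinger_select(np.array([1.4]), actions, q_fn=lambda a: a[0], k=2)
```

The nearest-neighbor retrieval bounds the mapping error of rounding the proto-action, while the critic refinement avoids committing to a geometrically close but low-value action.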