We introduce online kernel-based least-squares policy iteration (LSPI), which combines features of online LSPI and offline kernel-based LSPI. The knowledge gradient is used as the exploration policy in both online LSPI and online kernel-based LSPI, so that their performance can be compared on two discrete Markov decision problems. Automatic feature selection in online kernel-based LSPI, which results from kernel sparsification based on approximate linear dependency, improves performance compared to online LSPI.
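
To illustrate how kernel sparsification based on approximate linear dependency (ALD) can select features automatically, the following is a minimal sketch and not the implementation evaluated here. A sample joins the dictionary only if its feature-space image cannot be approximated, within a threshold, by a linear combination of the current dictionary elements. The Gaussian kernel, the threshold value `nu`, the helper names, and the toy one-dimensional samples are all assumptions made for illustration.

```python
import math

def gaussian_kernel(x, y, sigma=1.0):
    """Gaussian (RBF) kernel; the kernel choice and sigma are assumptions."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-sq_dist / (2.0 * sigma ** 2))

def solve(A, b):
    """Solve A c = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for j in range(col, n + 1):
                M[r][j] -= factor * M[col][j]
    c = [0.0] * n
    for i in reversed(range(n)):  # back substitution
        residual = M[i][n] - sum(M[i][j] * c[j] for j in range(i + 1, n))
        c[i] = residual / M[i][i]
    return c

def ald_residual(dictionary, x, kernel):
    """Squared feature-space distance from x to the span of the dictionary."""
    if not dictionary:
        return kernel(x, x)
    K = [[kernel(a, b) for b in dictionary] for a in dictionary]
    k_vec = [kernel(d, x) for d in dictionary]
    c = solve(K, k_vec)  # best linear-combination coefficients
    return kernel(x, x) - sum(ki * ci for ki, ci in zip(k_vec, c))

def build_dictionary(samples, kernel=gaussian_kernel, nu=0.1):
    """Keep a sample only if it fails the ALD test (residual above nu)."""
    dictionary = []
    for x in samples:
        if ald_residual(dictionary, x, kernel) > nu:
            dictionary.append(x)
    return dictionary

# Hypothetical one-dimensional state samples; near-duplicates are discarded.
samples = [(0.0,), (0.01,), (1.0,), (1.02,), (3.0,)]
dictionary = build_dictionary(samples)
```

In a sketch like this, the retained dictionary elements define the basis functions: a state-action sample would be represented by its kernel values against each dictionary element, so the number of features is determined by the data and the threshold rather than fixed by hand. A smaller `nu` keeps more samples and yields a richer but larger feature set.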