Scene information in high-resolution remote sensing images is important for image interpretation and for understanding the real world. Traditional scene classification methods often rely on hand-crafted low- and mid-level features, but high-resolution images contain rich information and complex scene configurations that require high-level features to represent. This paper proposes a joint saliency and multi-layer convolutional neural network (CNN) method. First, saliency sampling is used to obtain meaningful patches that contain the dominant image information. Second, these patches are fed as training samples to the CNN, which yields feature representations at different levels. Finally, the multi-layer features are passed to a support vector machine (SVM) for image classification. Experiments on two high-resolution scene image datasets show that saliency sampling effectively captures the main targets, weakens the influence of unrelated targets, and reduces data redundancy, and that the CNN automatically learns high-level features. Compared with existing methods, the proposed method effectively improves classification accuracy.
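The saliency sampling step can be illustrated with a minimal sketch. It is not the paper's implementation: the saliency measure (gradient magnitude), the function names `saliency_map` and `sample_salient_patches`, and the parameters `patch_size`, `n_patches`, and `seed` are all assumptions made for illustration. The sketch only shows the general idea of drawing patches with probability proportional to saliency; the CNN training and SVM classification stages are omitted.

```python
import numpy as np


def saliency_map(image):
    """Toy saliency proxy (an assumption): gradient magnitude, normalised to sum to 1."""
    gy, gx = np.gradient(image.astype(float))
    mag = np.hypot(gx, gy) + 1e-12  # small constant avoids an all-zero map
    return mag / mag.sum()


def sample_salient_patches(image, patch_size=8, n_patches=16, seed=0):
    """Draw square patches whose centres are sampled in proportion to saliency.

    Returns the patches (n_patches, patch_size, patch_size) and the
    top-left corner (row, col) of each patch.
    """
    rng = np.random.default_rng(seed)
    h, w = image.shape
    half = patch_size // 2
    sal = saliency_map(image)

    # Number of valid top-left corners so every patch lies inside the image.
    n_rows, n_cols = h - patch_size + 1, w - patch_size + 1

    # Saliency at the centre pixel of each candidate patch.
    centre_sal = sal[half:half + n_rows, half:half + n_cols]
    probs = (centre_sal / centre_sal.sum()).ravel()

    # Sample candidate indices with probability proportional to saliency.
    idx = rng.choice(probs.size, size=n_patches, p=probs)
    rows, cols = np.unravel_index(idx, (n_rows, n_cols))

    patches = np.stack(
        [image[r:r + patch_size, c:c + patch_size] for r, c in zip(rows, cols)]
    )
    return patches, np.stack([rows, cols], axis=1)
```

In a full pipeline of the kind described, the sampled patches would serve as CNN training inputs, and features taken from several CNN layers would be concatenated before SVM training. Those stages depend on architecture and library choices that the abstract does not specify.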