Detecting multiple food items in a single image is a challenging task. We propose a novel method that detects food items and their locations in an image with minimal supervision. During training, we generate candidate object regions for each image and extract their CNN features. We then perform region mining by submodular optimization to select discriminative regions for each class. Using these mined regions, we train a binary SVM classifier for each class and further refine the classifiers with hard negative mining. At test time, we compute a score for each proposed region, select regions using non-maximum suppression, and output their locations and predicted class names. Our experiments show promising results, with an average precision of 83.78% on the test dataset. Because no ground-truth bounding boxes are needed during training, our food detection method can be readily extended to larger datasets.
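To illustrate the test-time selection step, the following is a minimal sketch of greedy non-maximum suppression over scored region proposals for one class. It is not the exact procedure used in this work: the function names, the `(x1, y1, x2, y2)` box format, and the IoU threshold of 0.3 are assumptions made for the example.

```python
from typing import List, Tuple

# Assumed box format: (x1, y1, x2, y2) with x2 > x1 and y2 > y1.
Box = Tuple[float, float, float, float]


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def non_max_suppression(
    boxes: List[Box],
    scores: List[float],
    iou_threshold: float = 0.3,  # assumed value, not taken from the paper
) -> List[int]:
    """Return indices of kept boxes, highest score first.

    A box is kept only if its IoU with every previously kept
    (higher-scoring) box does not exceed iou_threshold.
    """
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep: List[int] = []
    for i in order:
        if all(iou(boxes[i], boxes[k]) <= iou_threshold for k in keep):
            keep.append(i)
    return keep


# Hypothetical per-class SVM scores for three region proposals.
proposals = [(0, 0, 10, 10), (1, 1, 11, 11), (20, 20, 30, 30)]
svm_scores = [0.9, 0.8, 0.7]
kept = non_max_suppression(proposals, svm_scores)
```

In this usage example the second proposal overlaps heavily with the first and has a lower score, so it is suppressed, while the third proposal, which does not overlap either of the others, is kept. In a multi-class setting, the same routine would typically be run once per class on that class's classifier scores.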