The complexity of visual stimuli plays an important role in many cognitive
phenomena, including attention, engagement, memorability, time perception and
aesthetic evaluation. Despite its importance, complexity is poorly understood
and, ironically, previous models of image complexity have been quite complex.
There have been many attempts to find handcrafted features that explain
complexity, but these features are usually dataset-specific and hence fail to
generalise. On the other hand, more recent work has employed deep neural
networks to predict complexity, but these models remain difficult to interpret
and do not guide a theoretical understanding of the problem. Here we propose to
model complexity using segment-based representations of images. We use
state-of-the-art segmentation models, SAM and FC-CLIP, to quantify the number
of segments at multiple granularities and the number of classes in an image,
respectively. We find that complexity is well explained by a simple linear
model with these two features across six diverse image sets of naturalistic
scene and art images. This suggests that the complexity of images can be
surprisingly simple.
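
As a rough illustration of the two-feature linear model described above, the
sketch below fits complexity ratings to a segment count and a class count by
ordinary least squares. It assumes the counts have already been extracted
(running SAM and FC-CLIP is not shown), the function names are hypothetical,
the numbers are made-up toy values rather than results, and any feature
transformation or fitting procedure used in the actual work may differ.

```python
import numpy as np

def fit_linear_complexity(num_segments, num_classes, ratings):
    """Least-squares fit of: rating ~ w0 + w1 * num_segments + w2 * num_classes."""
    # Design matrix: intercept column plus the two features.
    X = np.column_stack([np.ones(len(ratings)), num_segments, num_classes])
    y = np.asarray(ratings, dtype=float)
    weights, *_ = np.linalg.lstsq(X, y, rcond=None)
    return weights

def predict_complexity(weights, num_segments, num_classes):
    """Apply fitted weights to new (segment count, class count) pairs."""
    X = np.column_stack([np.ones(len(num_segments)), num_segments, num_classes])
    return X @ weights

# Toy, made-up inputs (NOT data from the paper): per-image counts.
segments = np.array([5.0, 12.0, 30.0, 48.0, 75.0])
classes = np.array([2.0, 3.0, 6.0, 9.0, 8.0])

# Synthetic ratings built from known weights, so the fit can be checked.
ratings = 1.0 + 0.5 * segments + 2.0 * classes

weights = fit_linear_complexity(segments, classes, ratings)
predicted = predict_complexity(weights, segments, classes)
```

Because the toy ratings are generated from known weights, the fit recovers
them; with real human ratings the quantity of interest would be how closely
the predictions track the ratings.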