Open-vocabulary semantic segmentation is a challenging task that requires
segmenting novel object categories at inference time. Recent works explore
vision-language pre-training to handle this task, but rely on assumptions that
are unrealistic in practice, where textual category names are often low-quality.
For example, this paradigm assumes that new textual categories will be
accurately and completely specified, and that they exist in the pre-training lexicon.
However, these assumptions often fail: brief or incomplete names are ambiguous,
new words may be absent from the pre-training lexicon, and some categories are
difficult for users to describe. To address these issues, this
work proposes a novel decomposition-aggregation framework, inspired by human
cognition in understanding new concepts. Specifically, in the decomposition
stage, we decouple class names into diverse attribute descriptions to enrich
semantic contexts. Two attribute construction strategies are designed: using
large language models for common categories, and manual labelling
for human-invented categories. In the aggregation stage, we group diverse
attributes into an integrated global description, to form a discriminative
classifier that distinguishes the target object from others. A hierarchical
aggregation scheme is further designed to achieve multi-level alignment and deep
fusion between vision and text. The final result is obtained by computing the
embedding similarity between aggregated attributes and images. To evaluate the
effectiveness of the framework, we annotate three datasets with attribute descriptions, and
conduct extensive experiments and ablation studies. The results show the
superior performance of attribute decomposition-aggregation.
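
The decomposition-aggregation pipeline described above can be illustrated with a minimal sketch. It is not the paper's implementation: mean pooling stands in for the learned hierarchical aggregation, the two-dimensional vectors are made-up toy embeddings rather than outputs of a real text or image encoder, and the names `aggregate` and `segment` are hypothetical. It only shows the general shape of the idea: several attribute embeddings per class are merged into one classifier vector, and each pixel takes the class whose vector is most similar to its own embedding.

```python
import math


def normalize(v):
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v] if n > 0 else list(v)


def aggregate(attribute_embeddings):
    """Merge several attribute embeddings into one class embedding.

    ASSUMPTION: plain mean pooling, used here only as a stand-in for the
    learned hierarchical aggregation described in the abstract.
    """
    normed = [normalize(a) for a in attribute_embeddings]
    dim = len(normed[0])
    mean = [sum(a[i] for a in normed) / len(normed) for i in range(dim)]
    return normalize(mean)


def cosine(u, v):
    """Cosine similarity between two vectors."""
    return sum(a * b for a, b in zip(normalize(u), normalize(v)))


def segment(pixel_embeddings, class_attributes):
    """Label each pixel with the class whose aggregated embedding is closest."""
    names = list(class_attributes)
    classifiers = [aggregate(class_attributes[n]) for n in names]
    return [
        [
            names[max(range(len(names)), key=lambda k: cosine(p, classifiers[k]))]
            for p in row
        ]
        for row in pixel_embeddings
    ]


# Toy data (hypothetical): two attribute embeddings per class, one row of two pixels.
class_attributes = {
    "cat": [[1.0, 0.1], [0.9, 0.0]],
    "sky": [[0.0, 1.0], [0.1, 0.9]],
}
pixels = [[[1.0, 0.2], [0.1, 1.0]]]

print(segment(pixels, class_attributes))  # → [['cat', 'sky']]
```

In a realistic setting the attribute embeddings would come from a vision-language text encoder and the pixel embeddings from the matching image encoder; those components are omitted here.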