Recently, there has been an increasing focus on audio-text cross-modal
learning. However, most existing audio-text datasets contain only simple
descriptions of sound events, which offer little advantage over plain
classification labels. In this paper, we
first analyze the detailed information that human descriptions of audio may
contain beyond sound event labels. Based on the analysis, we propose an
automatic pipeline for curating audio-text pairs with rich details. Leveraging
the property that sounds can be mixed and concatenated in the time domain, we
simulate audio mixtures while controlling details in four aspects: temporal
relationship, loudness, speaker identity, and occurrence number. The corresponding
details are transformed into captions by large language models. Audio-text
pairs with detailed text descriptions are thereby obtained. We validate
the effectiveness of our pipeline with a small amount of simulated data,
demonstrating that the simulated data enables models to learn detailed audio
captioning.
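
As a rough illustration of the simulation step described above, the sketch below mixes two toy sound events at controlled onsets and loudness levels and records the corresponding structured details that a caption-writing LLM could consume. The event waveforms, sample rate, and field names are illustrative assumptions, not the paper's actual implementation.

```python
# Minimal sketch (assumed, not the authors' code): simulate a two-event audio
# mixture with controlled onset (temporal relationship) and gain (loudness),
# then emit the structured detail record to be rewritten as a caption.
import numpy as np

SR = 16000  # assumed sample rate

def place_event(mixture, event, onset_sec, gain_db):
    """Add one event into the mixture at a given onset and loudness."""
    gain = 10 ** (gain_db / 20)
    start = int(onset_sec * SR)
    end = min(start + len(event), len(mixture))
    mixture[start:end] += gain * event[: end - start]
    return mixture

# Two toy "events": a 1 s tone and a 0.5 s noise burst (stand-ins for real clips).
dog_bark = 0.3 * np.sin(2 * np.pi * 440 * np.arange(SR) / SR)
door_slam = 0.5 * np.random.randn(SR // 2)

mixture = np.zeros(4 * SR)
mixture = place_event(mixture, dog_bark, onset_sec=0.5, gain_db=0.0)     # earlier, louder
mixture = place_event(mixture, door_slam, onset_sec=2.0, gain_db=-10.0)  # later, quieter

# Structured details paired with the mixture; an LLM would turn this into a
# caption such as "A dog barks, then a door slams more quietly."
details = {
    "events": [
        {"label": "dog bark", "onset": 0.5, "loudness_db": 0.0, "count": 1},
        {"label": "door slam", "onset": 2.0, "loudness_db": -10.0, "count": 1},
    ],
    "temporal_relation": "dog bark before door slam",
}
print(details)
```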