Analyzing the layout of a document to identify headers, sections, tables,
figures, etc. is critical to understanding its content. Deep-learning-based
approaches for detecting the layout structure of document images have been
promising. However, these methods require a large number of annotated examples
during training, which are both expensive and time-consuming to obtain. We
describe here a synthetic document generator that automatically produces
realistic documents with labels for spatial positions, extents and categories
of the layout elements. The proposed generative process treats every physical
component of a document as a random variable and models their intrinsic
dependencies using a Bayesian Network graph. Our hierarchical formulation using
stochastic templates allows parameter sharing between documents to retain
broad themes, while its distributional characteristics produce visually
unique samples, thereby capturing complex and diverse layouts. We empirically
illustrate that a deep layout detection model trained purely on the synthetic
documents can match the performance of a model that uses real documents.
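
The hierarchical generative idea can be illustrated with a minimal sketch of ancestral sampling. This is not the proposed generator: the variable names, the category list, the page dimensions, and every distribution below are assumptions chosen only for illustration. Template-level variables are drawn once and shared across documents, while element-level variables are drawn per document conditioned on their parents, so each sample comes with labels for category, position, and extent.

```python
import random

# Hypothetical category set; the actual label set is not specified here.
CATEGORIES = ["header", "section", "table", "figure"]


def sample_template(rng):
    """Draw template-level variables shared by all documents of one theme."""
    return {
        "n_columns": rng.choice([1, 2]),
        # Relative (unnormalized) category preferences for this template.
        "weights": [rng.uniform(0.5, 2.0) for _ in CATEGORIES],
        # Assumed page geometry in points (US Letter) and a fixed margin.
        "page_w": 612.0,
        "page_h": 792.0,
        "margin": 36.0,
    }


def sample_document(template, rng):
    """Ancestral sampling: each variable is drawn given its sampled parents."""
    margin = template["margin"]
    col_w = (template["page_w"] - 2 * margin) / template["n_columns"]
    elements = []
    for col in range(template["n_columns"]):
        y = margin
        while True:
            # Category depends on the template's shared weights.
            category = rng.choices(CATEGORIES, weights=template["weights"])[0]
            # Height depends on the category (child conditioned on parent).
            lo, hi = (20.0, 40.0) if category == "header" else (60.0, 200.0)
            height = rng.uniform(lo, hi)
            if y + height > template["page_h"] - margin:
                break  # column is full
            # Label: category plus bounding box as (x, y, width, height).
            elements.append({
                "category": category,
                "bbox": (margin + col * col_w, y, col_w, height),
            })
            y += height + rng.uniform(5.0, 15.0)  # random vertical gap
    return elements


if __name__ == "__main__":
    rng = random.Random(0)
    template = sample_template(rng)
    for element in sample_document(template, rng):
        print(element)
```

Calling `sample_document` repeatedly with the same template yields documents that share a column count and category preferences but differ in the order, size, and placement of elements, which mirrors the split between shared themes and per-document variation described above.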