Without access to large compute clusters, building random forests on large
datasets is still a challenging problem. This is, in particular, the case if
fully-grown trees are desired. We propose a simple yet effective framework that
allows to efficiently construct ensembles of huge trees for hundreds of
millions or even billions of training instances using a cheap desktop computer
with commodity hardware. The basic idea is to consider a multi-level
construction scheme, which builds top trees for small random subsets of the
available data and which subsequently distributes all training instances to the
top trees' leaves for further processing. While being conceptually simple, the
overall efficiency crucially depends on the particular implementation of the
different phases. The practical merits of our approach are demonstrated using
dense datasets with hundreds of millions of training instances.Comment: 9 pages, 9 Figure