We present a numerical scheme geared towards high-performance computation of
wall-bounded turbulent flows. The number of all-to-all communications is
reduced to only six instances by using a two-dimensional (pencil) domain
decomposition and by exploiting the favourable scaling of the CFL time-step
constraint compared with the diffusive time-step constraint. As the CFL
condition is more restrictive at high driving, implicit time integration of the
viscous terms in the wall-parallel directions is no longer required. This
avoids the communication of non-local information to a process for the
computation of implicit derivatives in these directions. We explain in detail
the numerical scheme used for the integration of the equations, and the
underlying parallelization. The code is shown to have very good strong and weak
scaling to at least 64K cores.
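
As a rough illustration of the time-step argument, the following minimal sketch compares an advective (CFL) limit, dt ~ dx / U, with an explicit diffusive limit, dt ~ dx^2 / nu, on a uniform grid. It is not taken from the code described here: the function name `timestep_limits`, the order-one safety constants, and the assumption that the grid spacing follows the Kolmogorov estimate dx ~ L Re^(-3/4) are all hypothetical choices made for the example.

```python
def timestep_limits(reynolds, length=1.0, velocity=1.0, c_cfl=1.0, c_visc=1.0):
    """Return (dt_cfl, dt_visc) for a uniform grid.

    Assumptions (illustrative only):
      - grid spacing follows the Kolmogorov estimate dx = L * Re**(-3/4)
      - safety constants c_cfl and c_visc are set to 1
    """
    nu = velocity * length / reynolds      # kinematic viscosity from Re = U L / nu
    dx = length * reynolds ** -0.75        # assumed grid spacing
    dt_cfl = c_cfl * dx / velocity         # advective (CFL) limit
    dt_visc = c_visc * dx ** 2 / nu        # explicit diffusive limit
    return dt_cfl, dt_visc


for re in (1e2, 1e4, 1e6):
    dt_cfl, dt_visc = timestep_limits(re)
    print(f"Re={re:.0e}  dt_cfl={dt_cfl:.3e}  dt_visc={dt_visc:.3e}  "
          f"ratio={dt_visc / dt_cfl:.2f}")
```

Under these assumptions the ratio dt_visc / dt_cfl equals the grid Reynolds number U dx / nu = Re^(1/4), which grows with the driving. The CFL limit therefore becomes the binding constraint, and treating the viscous terms explicitly on such a grid costs no additional time-step reduction.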