The Biham-Middleton-Levine (BML) traffic model is a simple two-dimensional,
discrete Cellular Automaton (CA) that has been used to study self-organization
and phase transitions arising in traffic flows. From the computational point of
view, the BML model exhibits the usual features of discrete CA, where the state
of the automaton are updated according to simple rules that depend on the state
of each cell and its neighbors. In this paper we study the impact of various
optimizations for speeding up CA computations by using the BML model as a case
study. In particular, we describe and analyze the impact of several parallel
implementations that rely on CPU features, such as multiple cores or SIMD
instructions, and on GPUs. Experimental evaluation provides quantitative
measures of the payoff of each technique in terms of speedup with respect to a
plain serial implementation. Our findings show that the performance gap between
CPU and GPU implementations of the BML traffic model can be reduced by clever
exploitation of all CPU features