An architecture with 100 cores that uses a relatively accessible programming model should in principle be competitive with all but the best cases for GPGPU, provided each core is reasonably fast. If we stick with a proven model, a shared-memory multiprocessor with a fast interconnect is amenable to a wide range of programming problems, including highly parallel problems, multitasking workloads and moderately parallel problems (possibly using a subset of the CPUs). A reasonably large on-chip SRAM to mask the latency of going off chip will also be necessary to sustain most workloads (Machanick, Salverda & Pompe, 1998). Even if we add vector instructions, the design can be kept relatively simple, making it possible to scale to this number of cores within the transistor budget of today's high-end GPUs, which surpassed 7 billion transistors in 2012 (NVIDIA, 2012; Chen, 2013).
Communication is an issue with anything but a small number of processors. Beyond about 64 cores, uniform-latency interconnects become impractical; something closer to a traditional network with variable latency becomes a better design compromise (Sewell et al., 2012). A network-on-chip (NoC) (Hemani et al., 2000; Goossens, Dielissen & Radulescu, 2005; Pande, Grecu, Jones, Ivanov & Saleh, 2005; Ogras & Marculescu, 2013; Ginosar & Chatha, 2014) can scale to the required number of processors (Bjerregaard & Mahadevan, 2006).
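To make the variable-latency point concrete, the following minimal sketch counts router-to-router hops on a 2D mesh, one common NoC topology. The mesh shape, the 10 x 10 size and dimension-ordered (XY) routing are assumptions made purely for illustration; the design discussed here does not fix a topology, and the function names are hypothetical.

```python
from itertools import product


def mesh_hops(src, dst):
    """Hop count between two (x, y) routers on a 2D mesh.

    Assumes dimension-ordered (XY) routing, so the hop count is the
    Manhattan distance between the two routers.
    """
    return abs(src[0] - dst[0]) + abs(src[1] - dst[1])


def hop_stats(width, height):
    """Return (min, mean, max) hop count over all ordered pairs of distinct routers."""
    nodes = list(product(range(width), range(height)))
    hops = [mesh_hops(a, b) for a in nodes for b in nodes if a != b]
    return min(hops), sum(hops) / len(hops), max(hops)


if __name__ == "__main__":
    # Hypothetical 100-core layout: one core per router on a 10 x 10 mesh.
    lowest, mean, highest = hop_stats(10, 10)
    print(f"hops: min={lowest}, mean={mean:.2f}, max={highest}")
```

Under these assumptions the distance between two cores ranges from 1 to 18 hops, with a mean of 20/3 (about 6.7), which illustrates why latency cannot be treated as uniform at this scale.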
Intel has explored part of the design space with the Larrabee architecture, which was based on multiple in-order multiple-issue Pentium cores with limited extensions to support graphics (Seiler et al., 2008). A design using a simpler RISC instruction set without the complexity of multiple issue would make it possible to implement more cores with the same transistor count. Intel abandoned the Larrabee strategy and has more recently introduced the Xeon Phi multicore coprocessor, which features a specialist instruction set including vector modes to support high-performance computing (Heinecke et al., 2013). The Phi is based on Intel Atom (a low-power variant of the x86 architecture) cores with a vector unit added to each. Unlike the Phi, the design proposed here is also intended to implement a graphics pipeline.
Intel's latest Knights Landing version of the Phi features up to 72 cores, indicating that a design of the scale contemplated here is feasible (Gardner, 2014).
What does a GPU pipeline do? In its original form it was a static sequence of stages; recent designs are more programmable. The major stages are (Luebke & Humphreys, 2007):
• input: usually geometric primitives, supplied through an API such as OpenGL, that provide vertices, which the pipeline assembles into triangles
• model transformations: produce a stream of triangles in a unified coordinate system
• lighting: the triangles are coloured based on the lighting of the scene; this stage requires vector computations
• camera simulation: the GPU projects the scene onto the film plane of a virtual camera, producing a stream of triangles in screen coordinates; vector computation is again needed here
• rasterization: the screen pixels that each triangle overlaps are calculated; this is a highly parallel stage since each pixel can be handled independently
• texturing: images called textures are added to the near-final pixel colouring; this is also a highly parallel step and has a very regular memory access pattern
• hidden surfaces: pixels obscured by others have to be discarded, using a depth buffer that records how close a pixel is to the viewer and hence whether it can overwrite another pixel in the same spot on the screen
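The rasterization and hidden-surface stages listed above can be sketched in a few lines of scalar code. The following is a minimal illustration only, not the method of any particular GPU or of the proposed design: the function names, the edge-function coverage test and the convention that a smaller z is closer to the viewer are assumptions. Each pixel is tested independently of every other, which is the property that makes these stages highly parallel.

```python
def make_buffers(width, height, background=None):
    """Create a colour buffer and a depth buffer initialised to 'infinitely far'."""
    colour_buf = [[background] * width for _ in range(height)]
    depth_buf = [[float("inf")] * width for _ in range(height)]
    return colour_buf, depth_buf


def edge(a, b, p):
    """Twice the signed area of triangle (a, b, p), using x and y only."""
    return (b[0] - a[0]) * (p[1] - a[1]) - (b[1] - a[1]) * (p[0] - a[0])


def rasterize(triangle, colour, colour_buf, depth_buf):
    """Draw one flat-coloured triangle given as three (x, y, z) screen-space vertices.

    Assumption: smaller z means closer to the viewer.
    """
    v0, v1, v2 = triangle
    area = edge(v0, v1, v2)
    if area == 0:
        return  # degenerate triangle covers no pixels
    for y, row in enumerate(depth_buf):
        for x in range(len(row)):
            p = (x + 0.5, y + 0.5)  # sample at the pixel centre
            # Barycentric weights; dividing by area makes the test
            # independent of the triangle's winding order.
            w0 = edge(v1, v2, p) / area
            w1 = edge(v2, v0, p) / area
            w2 = edge(v0, v1, p) / area
            if w0 < 0 or w1 < 0 or w2 < 0:
                continue  # pixel centre lies outside the triangle
            z = w0 * v0[2] + w1 * v1[2] + w2 * v2[2]  # interpolated depth
            if z < depth_buf[y][x]:  # depth test: keep only the nearest surface
                depth_buf[y][x] = z
                colour_buf[y][x] = colour


if __name__ == "__main__":
    colour_buf, depth_buf = make_buffers(4, 4, background=".")
    rasterize(((0, 0, 0.5), (8, 0, 0.5), (0, 8, 0.5)), "F", colour_buf, depth_buf)
    rasterize(((0, 0, 0.2), (2, 0, 0.2), (0, 2, 0.2)), "N", colour_buf, depth_buf)
    for row in colour_buf:
        print("".join(row))
```

Because the body of the inner loop depends only on the triangle and on one pixel's own buffer entries, the loop iterations could in principle be distributed across conventional cores or vector lanes; whether that can be done within latency targets is the question posed below.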
The main research question to be answered is whether the proposed design can implement a competitive graphics pipeline, with roughly the same component count as a GPU. If the graphics pipeline can be implemented with modes of parallelism no more exotic than a large number of conventional cores possibly with vector units, GPGPU becomes truly general-purpose. The challenge is to implement the highly parallel stages of the graphics pipeline and the aspects that lend themselves to specialist memory without using features that are difficult to apply to general programming. Further, if we can get this right, the new design can leverage the key advantage of a GPU: the fact that it is a highly substitutable part in a large market. By contrast with the Xeon Phi (which only targets the compute-intensive market), provided the design can gain a significant foothold in the graphics market, it will achieve economies of scale that will make it viable for smaller niches like supercomputers.
Narrowing the research question to testing the viability of a graphics pipeline with such a design avoids some of the harder questions, such as implementing a memory hierarchy for general workloads. The value in starting with the graphics pipeline is that it simplifies simulation studies: rather than simulating a workload with millions or billions of instruction executions spread over on the order of 100 cores, the simulation study only needs to show that the graphics pipeline can be implemented with typical operations within the required latency targets.
Simulation is possible with existing research simulators, such as Gem5, which is capable of simulating a network with accurate timing (Binkert et al., 2011).
Finally, to show viability, enough of the logic needs to be designed to show that the proposed design is competitive in terms of component count with a comparable GPU. Part of this can be done by estimating CPU component count from previous designs of similar complexity, such as early RISC designs. For example, the MIPS R4000 is a single-issue design with a full 64-bit instruction set, and it required only 1.2 million transistors for the CPU and first-level cache (Mirapuri, Woodacre & Vasseghi, 1992).
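As a rough indication of what such an estimate looks like, the sketch below multiplies the R4000 figure by 100 cores and compares the result with the 7 billion transistor budget cited earlier for high-end GPUs. This is back-of-envelope arithmetic under stated assumptions, not a design estimate: it takes the R4000 count as a stand-in for one core, and it deliberately excludes the vector units, the large on-chip SRAM and the interconnect, all of which would consume part of the remaining budget. The function name is hypothetical.

```python
# Figures taken from the text; everything else is an illustrative assumption.
R4000_TRANSISTORS = 1_200_000        # CPU plus first-level cache
GPU_BUDGET = 7_000_000_000           # high-end GPU transistor count
CORES = 100


def core_budget(cores, per_core, budget):
    """Return (transistors used by cores, transistors remaining, fraction used)."""
    used = cores * per_core
    return used, budget - used, used / budget


if __name__ == "__main__":
    used, remaining, fraction = core_budget(CORES, R4000_TRANSISTORS, GPU_BUDGET)
    print(f"cores: {used:,} transistors ({fraction:.1%} of budget)")
    print(f"left for vector units, SRAM and interconnect: {remaining:,}")
```

On these assumptions, 100 R4000-class cores account for 120 million transistors, under 2% of the budget, so the viability question turns largely on the cost of the components the sketch leaves out.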
ACKNOWLEDGMENTS
I would like to thank anonymous SAICSIT 2015 reviewers who rejected a version of this paper but made insightful comments on how to take the project forward, and the SACJ reviewers who provided further helpful comments. This work was undertaken in the Distributed Multimedia CoE at Rhodes University, with financial support from Telkom SA, Tellabs, Genband, Easttel, Bright Ideas 39, THRIP and NRF SA (TP13070820716). The author acknowledges that opinions, findings and conclusions or recommendations expressed here are those of the author and that none of the above mentioned sponsors accept liability whatsoever in this regard.
