The Ultimate DataFlow for Ultimate SuperComputers-on-a-Chips by Milutinovic, Veljko et al.
September 15, 2020 
 
The Ultimate DataFlow for Ultimate SuperComputers-on-a-Chips 
Veljko Milutinovic and Milos Kotlar, CSCI-B-490/649, Indiana University, Bloomington, Indiana, USA  
Ivan Ratkovic, Esperanto Technologies, Belgrade, Serbia and San Francisco, California, USA 
Nenad Korolija and Miljan Djordjevic, University of Belgrade, Serbia 
Kristy Yoshimoto and Erik Klem, UROC, Indiana University, Bloomington, Indiana, USA 
Mateo Valero, BSC, Barcelona, Spain   
This article starts from the assumption that near-future 100-billion-transistor
SuperComputers-on-a-Chip will include N big multi-core processors, 1000N small many-core
processors, a TPU-like fixed-structure systolic array accelerator for the most frequently
used Machine Learning algorithms needed in bandwidth-bound applications, and a
flexible-structure reprogrammable accelerator for less frequently used Machine Learning
algorithms needed in latency-critical applications. Appropriate interfaces to memory and
standard I/O, as well as to the Internet and external accelerators, are also necessary, as
depicted in Figure 1. Future SuperComputers-on-a-Chip should also include effective
interfaces to specific external accelerators based on Quantum, Optical, Molecular, and
Biological paradigms, but these issues are outside the scope of this article. Also, the
number of processors in Figure 1 could be increased further if appropriate techniques are
used, such as cache injection and cache splitting [15, 16]. Finally, a higher speed could
be achieved with a more advanced technology, such as GaAs [13, 14]. Figure 1 is further
explained with data in Table 1. 
 
Figure 1: Generic structure of a future SuperComputer-on-a-Chip with 100 Billion Transistors. 
 
 
 
 
Chip Hardware Type                       Estimated Transistor Count
One Manycore with Memory                 3.29 million
4000 Manycores with Memory               11 800 million [17]
One Multicore with Memory                1 billion [18]
4 Multicore with Memory                  4 billion
One Systolic Array                       <1 billion [19]
One Reprogrammable Ultimate Dataflow     <69 billion [20]
Interface to I/O with external Memory    <100 million
Interface to External Accelerators       <100 million
TOTAL                                    <100 billion
Table 1. Current efforts place about 30 billion transistors on a chip. This article advocates 
that, for future 100 billion transistor chips, the most effective resources to include are those based on the 
dataflow principle. For some important applications, such resources bring significant speedups that 
would fully justify incorporating the additional 70 billion transistors. The speedups could realistically 
range from about 10x to about 100x; the explanations follow in the rest of this article. 
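The budget in Table 1 can be checked with a few lines of arithmetic. The sketch below is only an illustration: the dictionary keys and the function name are hypothetical, entries printed as "<x" are treated as upper bounds, and the per-unit rows (one manycore, one multicore) are not added because they are already contained in the 4000-manycore and 4-multicore rows.

```python
# Transistor budget from Table 1, in billions of transistors.
# Entries printed as "<x" are treated as upper bounds, so the sum is an upper bound too.
BUDGET_BILLIONS = {
    "4000 manycores with memory": 11.8,           # 11 800 million, as printed [17]
    "4 multicores with memory": 4.0,
    "systolic array": 1.0,                        # "<1 billion"
    "reprogrammable Ultimate DataFlow": 69.0,     # "<69 billion"
    "interface to I/O and external memory": 0.1,  # "<100 million"
    "interface to external accelerators": 0.1,    # "<100 million"
}


def total_budget(budget):
    """Sum all entries of a budget given in billions of transistors."""
    return sum(budget.values())


total = total_budget(BUDGET_BILLIONS)
dataflow_share = BUDGET_BILLIONS["reprogrammable Ultimate DataFlow"] / total

# Alternative reading: multiply the per-manycore estimate (3.29 million) by 4000
# instead of using the printed 11 800 million.
manycores_from_unit = 4000 * 3.29e-3  # in billions
alt_total = total - BUDGET_BILLIONS["4000 manycores with memory"] + manycores_from_unit

print(f"upper bound on total: {total:.2f} billion")
print(f"Ultimate DataFlow share of that bound: {dataflow_share:.0%}")
print(f"total with 4000 x 3.29 million manycores: {alt_total:.2f} billion")
```

Under either reading of the manycore row, the sum stays below the 100 billion limit, and the reprogrammable dataflow fabric accounts for most of the budget.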
Since the first three structures (multi-cores, many-cores, and TPU) are well elaborated in 
the open literature, this article focuses only on the fourth type of architecture and 
elaborates on an idea referred to as the Ultimate DataFlow, which offers specific 
advantages but requires a technology more advanced than today’s FPGAs.  
In addition, some of the most effective power reduction techniques are not applicable to 
FPGAs, which further motivates research into new approaches for mapping algorithms 
onto reconfigurable architectures. Consequently, the novel approach, referred to as 
Ultimate DataFlow, is described next.  
Introduction to Ultimate DataFlow 
Architectures like the Google TPU are extremely effective for the most frequent Tensor 
Calculus and related algorithms to which they are tuned, but in many important 
applications these algorithms account for only about 50% of the run time. In these 
applications, the remaining 50% or so of the run time is spent in a huge number of other 
algorithms, so their architectural support requires a much more flexible and fully 
reconfigurable architecture. This article sheds light on the newly proposed concept, 
Ultimate DataFlow for BigData, which offers flexibility and reconfigurability in 
DeepAnalytics (DA) and MachineLearning (ML).  
Some of the problems in DA and ML are bandwidth-bound, while others are latency-bound. 
For many applications, the bandwidth-bound problems can be solved successfully using 
FPGA-based DataFlow systems. The most critical latency-bound problems need a different, 
in-memory computing technology. The Ultimate DataFlow implies elements of internal analog 
processing, which bring potentials that are first presented and then elaborated. 
Potentials of Ultimate DataFlow  
The Ultimate DataFlow approach offers an effective solution for latency-bound problems, with the 
following improvement potentials over FPGA-based solutions:  
(A) Up to about 2000x in speedup,  
(B) Up to about 200x in transistor count,  
(C) Up to about 80x (20x4) in power reduction, and  
(D) Up to about 2x in data precision.  
With the above in mind, this position article covers the issues related to the potentials of the 
concept, using the programming model utilized in numerous FPGA-based DataFlow engines. 
The existing DataFlow approaches are still far from the ideal Ultimate DataFlow, but for 
specific Machine Learning and BigData applications they do achieve considerable speedups 
over ControlFlow machines. Consequently, the drive for new technology-supported 
architectural solutions is not too strong these days. However, for research missions, like 
those around Mubadala and IMEC, or Esperanto and IPSI, innovative solutions are badly needed.  
What is good about the existing DataFlow approaches, however, is that their DataFlow 
programming models are directly applicable to the Ultimate DataFlow, so continuity with 
existing experience and already developed software products could be maintained and 
improved. 
For precision, the ratio of 2x was quoted above because the approach could benefit from approximate 
computing, due to its data format flexibility, as explained later. It is also well suited to bfloat16, a 
possible new standard for tensor applications. 
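As a small illustration of what a reduced-precision format costs and saves, the sketch below converts a value to bfloat16, which keeps the sign bit, the 8 exponent bits, and the top 7 mantissa bits of an IEEE 754 32-bit float. The function name is hypothetical, and plain truncation is used for simplicity; real implementations often round to nearest instead.

```python
import struct


def to_bfloat16(x: float) -> float:
    """Truncate x to bfloat16 precision and return the result as a Python float.

    bfloat16 is the upper 16 bits of an IEEE 754 binary32 value, so the
    conversion here simply zeroes the lower 16 bits (round toward zero).
    """
    bits32 = struct.unpack(">I", struct.pack(">f", x))[0]
    bits16 = bits32 >> 16  # sign (1 bit), exponent (8 bits), mantissa (7 bits)
    return struct.unpack(">f", struct.pack(">I", bits16 << 16))[0]


value = 3.14159
approx = to_bfloat16(value)
relative_error = abs(approx - value) / abs(value)
print(approx, relative_error)
```

Each value takes half the storage of a 32-bit float while keeping the same exponent range; the price is a relative error that, with truncation, stays below 2^-7.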
Elaborations of Potentials of Ultimate DataFlow 
For power, the ratio of 80x was quoted for two reasons. First, ControlFlow machines like those from 
Intel or NVIDIA operate at frequencies of up to 4 GHz or even higher, while current FPGAs operate at 
about 200 MHz, which gives a ratio of about 20x in dissipation. Second, one could get an 
additional 4x from operating at a 2x lower voltage. Both factors together result in 
a total improvement ratio of 80x. 
For transistor count, the ratio of 200x was quoted for the following reason: if one looks at an 
Intel microprocessor floor plan, one finds that only about 0.5% of the area is dedicated to 
Arithmetic and Logic, that is, about 1/200 of the chip, which gives the 200x ratio quoted above. 
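The power and transistor-count ratios above can be reproduced with a short calculation. The sketch assumes the usual first-order model of dynamic CMOS power, proportional to V^2 * f at fixed switched capacitance; that model and the function names are assumptions of this illustration, not something the text states explicitly.

```python
def power_ratio(f_control_hz: float, f_dataflow_hz: float, voltage_scale: float) -> float:
    """Ratio of dynamic power under P ~ V^2 * f (switched capacitance held equal).

    voltage_scale is how many times lower the dataflow supply voltage is.
    """
    frequency_factor = f_control_hz / f_dataflow_hz
    voltage_factor = voltage_scale ** 2
    return frequency_factor * voltage_factor


def transistor_ratio(useful_area_fraction: float) -> float:
    """How many times smaller a design is if only the useful fraction is kept."""
    return 1.0 / useful_area_fraction


# 4 GHz ControlFlow clock vs. 200 MHz FPGA clock, at a 2x lower voltage.
power = power_ratio(4e9, 200e6, 2.0)
# About 0.5% of the floor plan dedicated to arithmetic and logic.
transistors = transistor_ratio(0.005)
print(power, transistors)
```

The frequency factor alone gives 20x, the voltage factor contributes 4x, and the 0.5% area share corresponds to the 200x figure.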
For speedup, the frequently quoted numbers are: (a) 20x, the lowest speedup in 
recent publications of the authors of this article; (b) 200x, the highest speedup ever reported by 
the same authors; (c) 2000x, quoted for a reason that has nothing to do with existing 
DataFlow implementations but a lot to do with Ultimate DataFlow, as elaborated later; and (d) 
even 20000x, which could be hoped for in some applications, as explained next. 
In Ultimate DataFlow, the speedup depends predominantly on the contribution of loops to the 
overall execution time: 
• If loops contribute more than 99.95% of the overall run time, then one can hope for a 
speedup of 2000x. 
 
• If one looks at some of the applications on the list of current DataFlow successes in 
Machine Learning for BigData, one finds that in many cases the contribution of loops 
was well over 99.995%, which is why the potentials of Ultimate DataFlow could reach even 
20000x. 
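The two thresholds above match the upper bound given by Amdahl's law when only the loops are accelerated. The sketch below is one way to read the numbers, under the assumption that the non-loop part of the program runs at its original speed; the function name is hypothetical.

```python
def amdahl_speedup(loop_fraction: float, loop_speedup: float = float("inf")) -> float:
    """Overall speedup when only the loop part of the run time is accelerated.

    loop_fraction: share of the original run time spent in loops (0 <= p <= 1).
    loop_speedup:  acceleration of the loop part; infinity gives the upper bound.
    """
    if not 0.0 <= loop_fraction <= 1.0:
        raise ValueError("loop_fraction must be between 0 and 1")
    remaining = (1.0 - loop_fraction) + loop_fraction / loop_speedup
    return float("inf") if remaining == 0.0 else 1.0 / remaining


for p in (0.9995, 0.99995):
    print(f"loops = {p:.3%} of run time -> bound = {amdahl_speedup(p):.0f}x")
```

With 99.95% of the run time in loops the bound is about 2000x, and with 99.995% it is about 20000x; any finite acceleration of the loops gives less, so these figures are ceilings rather than predictions.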
 
Explanations of the Ultimate DataFlow Concept 
The Ultimate DataFlow, as a concept, is built on the following two premises (each one with 4 
sub-premises): 
1. Compiler does the following: 
 
a) Separates effectively spatial and temporal data, to satisfy the requirements of the 
Nobel Laureate Ilya Prigogine, since that action lowers the entropy of a computer 
system, meaning that the rest of the compiler could do a much better optimization job 
(lower entropy brings more order into the optimization process and consequently 
better optimization opportunities). 
b) Maps the execution graph in the way that makes sure that edges are of the minimal 
length, which brings consistency with the observations of Nobel Laureate Richard 
Feynman, related to trade-offs between speed and power. 
c) Enables one to go to a lower precision for what is not of ultimate importance, and 
consequently to save on resources that could be reinvested into what is of ultimate 
importance, following the approximate computing wisdom of Nobel Laureate Daniel 
Kahneman. 
d) Enables one to trade between latency and precision, which, in latency-tolerant 
applications, brings more precision with fewer resources, and in latency-intolerant 
applications, brings less latency in exchange for a lower precision, thus following the 
wisdom of Nobel Laureate Tim Hunt, and analogies with his findings related to birth, 
life, reproduction, and death of cells. 
 
None of the FPGA-based dataflow compilers, as far as we know, does any of the above. 
 
2. Hardware consists of the following: 
 
a) An analog DataPath of the honeycomb structure, to which one could effectively map 
the execution graphs corresponding to loops. Analog functional units could leverage 
low-precision computation. 
b) A DataPath clocked at a much lower frequency, and hopefully not clocked at all, if the 
analog path is not unacceptably long, so it is literally the voltage difference between 
input and output, that moves data through the execution graph. 
c) A digital memory alongside the DataPath, so that computing parameters could 
be kept non-volatile, and temporary results could be stored more effectively. 
d) A much faster I/O connecting the host and the dataflow than is available today. 
 
Unfortunately, FPGAs offer none of the above today! Consequently, the FPGA technology is 
today only the least bad solution on the road to the ultimate goal! 
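To make premise 1(b) and the honeycomb DataPath of premise 2(a) more concrete, the sketch below places a tiny execution graph on hexagonal cells and searches for the placement with the smallest total edge length. Everything here is a hypothetical illustration: the axial (q, r) cell coordinates, the function names, the example graph, and the exhaustive search, which is feasible only for very small graphs and is not the compiler algorithm envisioned by the authors.

```python
from itertools import permutations


def hex_distance(a, b):
    """Number of cell steps between two honeycomb cells in axial (q, r) coordinates."""
    dq, dr = a[0] - b[0], a[1] - b[1]
    return (abs(dq) + abs(dr) + abs(dq + dr)) // 2


def total_edge_length(edges, placement):
    """Sum of hex distances over all edges of the execution graph."""
    return sum(hex_distance(placement[u], placement[v]) for u, v in edges)


def best_placement(nodes, edges, cells):
    """Try every assignment of nodes to distinct cells and keep the shortest one."""
    best_cost, best_map = None, None
    for chosen in permutations(cells, len(nodes)):
        placement = dict(zip(nodes, chosen))
        cost = total_edge_length(edges, placement)
        if best_cost is None or cost < best_cost:
            best_cost, best_map = cost, placement
    return best_cost, best_map


# Execution graph of a loop body computing a*b + c*d: two multipliers feed one adder.
nodes = ["mul1", "mul2", "add"]
edges = [("mul1", "add"), ("mul2", "add")]
cells = [(0, 0), (1, 0), (0, 1), (2, 0), (-1, 0)]  # available honeycomb cells

cost, placement = best_placement(nodes, edges, cells)
print(cost, placement)
```

In this toy case the best placement puts the adder next to both multipliers, so every edge has the minimal length of one cell step.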
 
In conclusion, the benefits of the Ultimate DataFlow approach will become fully achievable only 
once the semiconductor and the compiler technologies become capable of supporting the 
two sets of requirements specified above. References leading to the above conclusion span 
four decades of the research of one of the co-authors [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]. The future is in 
in-memory analog AI accelerators, as explained in the recent effort of [11]. Another viewpoint on 
the related issues can be found in [12]. 
 
Experiences in Education and Research 
About 4000 students worldwide have used the dataflow machine at the Mathematical Institute of 
the Serbian Academy of Sciences (https://maxeler.mi.sanu.ac.rs/), and these students come from 
universities like: MIT, Harvard, Princeton, Yale, Columbia, NYU, Purdue, Indiana University in 
Bloomington, University of Michigan in Ann Arbor, Ohio State, Georgia Tech, CMU, FIU, FAU, etc. 
(in the USA), ETH, EPFL (in Switzerland), UNIWIE, TUWIEN (in Austria), Karlsruhe, Heidelberg 
(in Germany), Manchester, Bristol, Cambridge, Oxford (in England), and, of course, from the 
leading schools of Belgrade: ETF, MATF, FON, FFH. They attended hands-on workshops or 
classes for one, two, three, or six credits.   
As for research efforts with students, they were asked to compare a real ControlFlow MultiCore, 
a real ControlFlow ManyCore, a real FPGA-based DataFlow, and a theoretical Ultimate DataFlow 
machine based on an analog Sea-of-Gates architecture. Especially intensive was the 
student-oriented research effort at Indiana University, since early 2016, through courses on 
DataFlow SuperComputing (for BigData) and Software Engineering Management (with Creativity 
Methods), plus through the undergraduate research effort called UROC.  
Two UROC students have contributed significantly to programming that demonstrates the 
potentials of Ultimate DataFlow. Students in Siena, Salerno, Barcelona, and Valencia contributed 
to the development of related concepts. 
The Belgrade University graduate and undergraduate students helped determine the best 
distribution of transistors over resources, for a possible effort based on a 100 billion transistor 
chip. 
 
References 
[1] Milutinovic, V., et al., 
Guide to DataFlow SuperComputing, 
Springer, 
2015 (one textbook, part I) 
and 2017 (two textbooks, parts II and III). 
 
[2] Hurson, A., Milutinovic, V., editors, 
Advances in Computers: DataFlow, 
Elsevier, 2015 (one SCI textbook) 
and 2017 (two SCI textbooks). 
 
[3] Trifunovic, N., Milutinovic, V., et al., 
"The AppGallery.Maxeler.com for BigData SuperComputing," 
Journal of Big Data, Springer, 2016. 
 
[4] Trifunovic, N., Milutinovic, V., et al., 
"Paradigm Shift in SuperComputing: DataFlow vs ControlFlow," 
Journal of Big Data, 2015. 
 
[5] Milutinovic, V., 
"The HoneyComb Architecture," 
Proceedings of the IEEE, 1989. 
 
[6] Milutinovic, V., et al., 
"Splitting Spatial and Temporal Localities for Entropy Minimization," 
Tutorial of the IEEE ISCA, 1995. 
 
[7] Jovanovic, Z., Milutinovic, V., 
"FPGA Accelerator for Floating-Point Matrix Multiplication," 
The IET Computers and Digital Techniques Premium Award for 2014, 
IET (formerly IEE), Volume 6, Issue 4, 
2012 (pp. 249-256). 
 
 
[8] Milutinovic, V., 
"A Comparison of Suboptimal Detection Algorithms 
(Suboptimal Algorithms for Data Analytics)," 
Proceedings of the IEE (now IET), 1988. 
 
[9] Flynn, M., Mencer, O., Milutinovic, V., et al., 
Moving from PetaFlops to PetaData, 
Communications of the ACM, 
May 2013. 
 
[10] Trobec, R., Vasiljevic, R., Tomasevic, M., Milutinovic, V., et al., 
"Interconnection Networks for PetaComputing," 
ACM Computing Surveys, 
November 2016. 
[11] Cosemans, S., et al, “Towards 10000TOPS/W DNN Inference with Analog in-Memory Computing,” 
The 2019 IEEE International Electron Devices Meeting, IEDM, San Francisco, California, December 2019. 
[12] Reuther, A., “Survey and Benchmarking of Machine Learning Accelerators,” MIT Lincoln Labs, arXiv, 
August 2019.  
[13] Fortes, J.A., Milutinovic, V., Dick, R.J., Helbig, W.A., Moyers, W.D.,  
“A High-Level Systolic Architecture for GaAs,”  
Proceedings of the 19th Hawaii International Conference on System Sciences, pp. 253-258. 
[14] Milutinovic, V., Fura, D., Helbig, W., “Introduction to GaAs Microprocessor Architecture,”  
IEEE Computer, 1986, pp.30-42. 
[15] Milenkovic, A., Milutinovic, V.,  
“Cache Injection: A Novel Technique for Tolerating Memory Latency in Bus-Based Systems,”  
Proceedings of the European Conference on Parallel Processing, 2000, pp. 558-566. 
[16] Milutinovic, V., “The Split Temporal/Spatial Cache: Initial Performance Analysis,”  
Proceedings of the SCIzzL-5, March 1996, pp. 63-69. 
[17] https://www.techpowerup.com/gpu-specs/geforce-gtx-1080-ti.c2877 
[18] https://hackaday.com/2019/08/21/largest-chip-ever-holds-1-2-trillion-transistors/#:~:text=The%20chip%20has%20400%2C000%20cores,8.5%20inches%20on%20each%20side. 
[19] https://link.springer.com/article/10.1007/BF00932064 
[20] https://m.eet.com/media/1081699/XIL.PDF 
 
Acknowledgements: 
The authors are thankful to Lars Zetterberg of KTH, Henry Markram of EPFL, Roberto Giorgi of the 
University of Siena, and Anton Kos of the University of Ljubljana, for their eye-opening discussions 
related to the topic of ICT. 
