Optimizing Data Intensive Flows for Networks on Chips by Zhang, Junwei et al.
arXiv:1812.07183v2 [cs.DC] 1 Jan 2019
OPTIMIZING DATA INTENSIVE FLOWS FOR NETWORKS ON CHIPS
Junwei Zhang
Uber Technologies Inc
1191 2nd Ave 1200, Seattle, WA 98101
junwei.zhang@stonybrook.edu
Yang Liu
Uber Technologies Inc
555 Market Street, San Francisco, CA, USA
yangliu89415@gmail.com
Li Shi
Snap Inc
Ocean Front Walk, Venice, CA
lishi.pub@gmail.com
Thomas G. Robertazzi
IEEE fellow
Department of Electrical and Computer Engineering
Stony Brook University
100 Nicolls Rd, Stony Brook, NY 11794
thomas.robertazzi@stonybrook.edu
SpringSim-ANSS, 2019 April 29-May 2, Tucson, AZ, USA; ©2019 Society for Modeling & Simulation International (SCS)
ABSTRACT
Data flow analysis and optimization are considered for homogeneous rectangular mesh networks. We propose
a flow matrix equation that allows a closed form characterization of the minimal time solution and of
speedup, and gives a simple method to determine when and how much load to distribute to processors.
We also give a rigorous mathematical proof that the optimal solution of the flow matrix equation exists
and is unique. The methodology introduced here is applicable to many interconnection networks
and switching protocols (as examples, we examine toroidal networks and hypercube networks in this
paper). An important application is improving chip area and chip scalability for networks on chips processing
divisible style loads.
Keywords: Divisible Load Theory, Network on Chip (NOC) Interconnection Network, Mesh, Toroidal
Networks, Hypercube Networks, Performance analysis, Data Intensive Load.
1 INTRODUCTION
1.1 Background
Networks on chips (NOC) represent the smallest networks that have been implemented to date
(Robertazzi 2017). A popular choice for the interconnection network on such chips is the
rectangular mesh. It is straightforward to implement and is a natural choice for a planar chip layout. Data
to be processed can be inserted into the chip at one or more so-called “injection points”, that is, node(s)
in the mesh that forward the data to other nodes. Beyond NOCs, injecting data into a parallel
processor’s interconnection network has been done for some time, for instance in IBM’s BlueGene machines
(Krevat, Castaños, and Moreira 2002).
LastName1, LastName2, and LastNameLastAuthor
This paper seeks to determine, for a single injection point on a homogeneous rectangular mesh, how
to optimally assign load to different processors/links in a known timed pattern so as to process a load of data
in a minimal amount of time (i.e. minimize makespan). We present an optimal technique for single point
injection in homogeneous meshes that is no more complex than solving a system of linear equations.
The methodology presented here can be applied to a variety of interconnection networks
and switching/scheduling protocols besides those directly covered in this paper. As examples, toroidal and
hypercube networks are also considered. A companion paper examines this problem with
multiple sources of load (Zhang 2018).
Crucial to our success is the use of divisible load scheduling theory
(Bharadwaj, Ghose, and Robertazzi 2003) (Bharadwaj, Ghose, Mani, and Robertazzi 1996). Developed
over the past few decades, it assumes load is a continuous variable that can be arbitrarily
partitioned among processors and links in a network. Use is made of divisible load scheduling’s
optimality principle (Bharadwaj, Ghose, Mani, and Robertazzi 1996) (Sohn and Robertazzi 1996),
which says makespan is minimized when one forces all processors to stop at the same time
(intuitively, otherwise one could transfer load from busy to idle processors to achieve a better
solution). This leads to a series
of chained linear flow and processing equations that can be solved by linear equation techniques, often
yielding recursive and even closed form solutions for quantities such as makespan and speedup.
In this paper, the use of virtual cut-through switching (Kermani and Kleinrock 1979) and a modified version
of store and forward switching is investigated. These are two of the many switching protocols to which the
methodology described here applies. In the virtual cut-through environment, a node can begin relaying the first
part of a message (packet) along a transmission path as soon as it starts to arrive at the node; that is, it does
not have to wait to receive the entire message before it can begin forwarding the message. In pure store and
forward switching, messages must be completely received before being forwarded.
More specifically, first, an equivalent processor (and makespan, speedup and processor load fractions) is
found for a 2×2 homogeneous mesh network, which can be generalized to a homogeneous 2×n mesh
network. After that, we analyze the more general case of a homogeneous m×n mesh network and obtain a
general closed-form matrix representation yielding a processor with equivalent processor speed, makespan,
speedup and processor/link load allocation. Different single data injection point positions, such as the corner,
boundary and inner grid, are also discussed. In addition, a rigorous mathematical proof of the existence and
uniqueness of the flow matrix solution is presented.
In summary, this work proposes a quantitative flow matrix model that tells one how to assign data fractions
to each processor in a homogeneous mesh in a makespan optimal manner. The complexity of
the technique is no more than that of solving a system of linear equations. This work has relevance to mesh
interconnection networks used in parallel processing in general and to meshes used in Networks on Chips
in particular. An important application is improving chip area and chip scalability for networks on chips
processing divisible style loads.
1.2 Related Work
In path breaking work in the 1990’s, Drozdowski and others created models and largely
recursive solutions for single source distribution in 2D (Błażewicz and Drozdowski 1996)
and 3D meshes (Drozdowski and Głazek 1999) (Głazek 2003), toroidal meshes
(Błażewicz, Drozdowski, Guinand, and Trystram 1999) and hypercubes (Błażewicz and Drozdowski 1995).
For 2D meshes (Błażewicz and Drozdowski 1996) recursive solutions and closed form asymptotic results
were found. This was extended to 3D meshes with recursive solutions for load fractions (Głazek 2003)
(Drozdowski and Głazek 1999). Recursive solutions for toroidal networks and hypercubes were also found
(Błażewicz, Drozdowski, Guinand, and Trystram 1999) (Błażewicz and Drozdowski 1995). The hypercube
work included a closed form expression for speedup in terms of a fundamental load fraction assignment.
1.3 Our Contribution
This work is distinct from earlier work in providing matrix based solutions (created through induction) for
2D meshes, toroidal networks and hypercubes. Different injection point locations in finite 2D meshes
(corner, boundary and center) are also considered. Extensive simulation results based on this modeling are
presented in (Zhang 2018).
2 FLOW MATRIX MODEL
2.1 Definitions and Assumptions
Definition 1. Equivalence Computation
Equivalence computation is a technique that replaces a cluster of processors with a single
processor of equivalent processing capability.
The following assumptions are used throughout the paper:
• Virtual cut-through switching (Kermani and Kleinrock 1979) and store and forward switching are
used to transmit the assigned workload between processors.
– Under virtual cut-through switching, a node can relay the beginning bits of a message (packet)
before the entire message is received.
– Under store and forward switching, a message must be completely received by a node before it
can be relayed to the next node along its transmission path.
• For simplicity, return communication is not considered.
• The communication delays are taken into consideration.
• The time taken by computation and the time taken by communication are assumed to be linear
functions of the data size.
• The network environment is homogeneous, that is, all the processors have the same computation
capacity. The link speeds between any two unit cores are identical.
• The number of outgoing ports in each processor is limited.
• Single Path Communication : data transfer between two nodes follows a single path.
The optimization objective is as follows:
• Equivalence computation: the objective is to partition and schedule the
workload among the processors so as to obtain the minimum makespan (finish time).
The minimum time solution is obtained by forcing all processors in the network to stop processing
simultaneously. Intuitively, this is because otherwise the solution could be improved by transferring load
from busy processors to idle ones (Bharadwaj, Ghose, Mani, and Robertazzi 1996) (Sohn and Robertazzi 1996).
Processor equivalence is discussed in (Robertazzi 1993) and (Liu, Zhao, and Li 2007). Figure 1 shows an example mesh network.
Figure 1: An m*n mesh network (m = 5, n = 5)
2.2 Notations
The following notations and definitions are utilized:
• Pi: The ith processor. 0≤ i ≤ m∗n−1.
• L: The work load.
• Di: The minimum number of hops from the processor Pi to the data load injection site L.
• α0: The load fraction assigned to the root processor.
• αi: The load fraction assigned to the ith processor.
• α̂i: The load fraction assigned to each processor on the ith layer, i ∈ 0, ..., (k−1).
• ωi: The inverse computing speed on the ith processor.
• ωeq: The inverse computing speed on an equivalent node collapsed from a cluster of processors.
• r: The rank of the flow matrix.
• zi: The inverse link speed on the ith link.
• Tcp: Computing intensity constant. The entire load is processed in time ωiTcp seconds on the ith
processor.
• Tcm: Communication intensity constant. The entire load is transmitted in time ziTcm seconds over
the ith link.
• T̂f: The finish time of the whole processor network. Here T̂f is equal to ωeqTcp.
• Tf: The finish time for the entire divisible load solved on the root processor. Here Tf is equal to
1×ω0Tcp, that is ω0Tcp.
• Tf,i: The finish time for the ith processor, i ∈ 0, ..., (m∗n−1).
• $\sigma = \frac{z T_{cm}}{\omega T_{cp}}$: The ratio of communication time to computation time for a given load, 0 < σ < 1
(Bharadwaj, Ghose, Mani, and Robertazzi 1996) (Hung and Robertazzi 2004).
• $\sum_{i=0}^{m*n-1} \alpha_i = 1$
• $\text{Speedup} = \frac{T_f}{\hat{T}_f} = \frac{\omega T_{cp}}{\alpha_0 \omega T_{cp}} = \frac{1}{\alpha_0}$
3 VIRTUAL CUT-THROUGH SWITCHING SCENARIO
In the virtual cut-through environment, a node can begin relaying the first part of a message (packet) along
a transmission path as soon as it starts to arrive at the node; that is, it does not have to wait to receive the
entire message before it can begin forwarding the message.
First we consider the 2∗2 mesh network, which can be generalized to a 2∗n mesh network. We then
analyze an m∗n mesh network and obtain a general closed-form matrix representation. Finally, we give a key
methodology to address this type of question. In addition, different single data injection positions, such as
the corner, boundary and inner grid, are also discussed.
3.1 Data Injection on The Corner Processor
3.1.1 2*2 Mesh Network
The load L is assigned on the corner processor P0 (figure 2). The whole load is processed by four processors
P0, P1, P2, P3 together.
Figure 2: The 2*2 mesh network and the root processor is P0
Processors P0, P1 and P2 start to process their respective load fractions at the same time. This includes P1
and P2, as load is relayed to them in virtual cut-through mode at t = 0. Because we assume a homogeneous
network (in processing speed and communication speed), α1 = α2, and P1 and P2 stop processing at the same
time. Processor P3 starts to work when α1 and α2 complete transmission. That is, links 0−1 and
0−2 are first occupied transmitting load to processors 1 and 2, respectively, and transmission to processor 3
begins only when that is finished.
According to divisible load theory (Bharadwaj, Ghose, and Robertazzi 2003), we obtain the timing
diagram in figure 3.
In this Gantt-like timing diagram, communication appears above each axis and computation appears
below each axis. We assume that all processors stop computing at the same time in order to minimize
the makespan (Sohn and Robertazzi 1996).
Based on the timing diagram, we obtain a group of linear equations for the workload fraction αi assigned
to each processor:
Figure 3: The timing diagram for 2*2 mesh network with virtual cut-through and the root processor is P0


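\begin{align}
\alpha_0 \omega T_{cp} &= T_{f,m} \tag{1}\\
\alpha_1 \omega T_{cp} &= T_{f,m} \tag{2}\\
\alpha_2 \omega T_{cp} &= T_{f,m} \tag{3}\\
\alpha_1 z T_{cm} + \alpha_3 \omega T_{cp} &= T_{f,m} \tag{4}\\
\alpha_0 + \alpha_1 + \alpha_2 + \alpha_3 &= 1 \tag{5}\\
\sigma &= \frac{z T_{cm}}{\omega T_{cp}} \tag{6}\\
0 < \sigma &< 1 \tag{7}\\
0 < \alpha_0 &\le 1 \tag{8}\\
0 \le \alpha_1, \alpha_2, \alpha_3 &< 1 \tag{9}
\end{align}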
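The equal-finish-time condition behind equations (1)-(5) can be checked numerically. The following is a minimal sketch, assuming time is measured in units of ωTcp (so that zTcm equals σ) and using the closed-form fractions of equations (12)-(14); the function name `finish_times_2x2` is introduced here for illustration only.

```python
def finish_times_2x2(alpha0, alpha1, alpha2, alpha3, sigma):
    """Finish time of each processor, in units of omega*Tcp (so z*Tcm == sigma)."""
    t0 = alpha0                   # equation (1): P0 computes from t = 0
    t1 = alpha1                   # equation (2): P1 computes while its load streams in
    t2 = alpha2                   # equation (3): P2 likewise
    t3 = alpha1 * sigma + alpha3  # equation (4): P3 starts after alpha1 is transmitted
    return t0, t1, t2, t3

sigma = 0.25                      # illustrative value, any 0 < sigma < 1 works
a0 = a1 = a2 = 1.0 / (4.0 - sigma)
a3 = (1.0 - sigma) / (4.0 - sigma)

times = finish_times_2x2(a0, a1, a2, a3, sigma)
total = a0 + a1 + a2 + a3         # equation (5): fractions must sum to 1
print(times, total)
```

All four finish times agree and the fractions sum to one, which is what the optimality principle requires.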
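The group of equations can be written in matrix form:
\[
\begin{bmatrix}
1 & 2 & 1 \\
1 & -1 & 0 \\
0 & \sigma - 1 & 1
\end{bmatrix}
\times
\begin{bmatrix} \alpha_0 \\ \alpha_1 \\ \alpha_3 \end{bmatrix}
=
\begin{bmatrix} 1 \\ 0 \\ 0 \end{bmatrix}
\tag{10}
\]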
This matrix equation is written as A×α = b, where A is called the flow matrix. Because of symmetry α1 = α2,
so α2 is not listed in the matrix equation.
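Finally, the explicit solution is:
\begin{align}
\sigma &= \frac{z T_{cm}}{\omega T_{cp}} \tag{11}\\
\alpha_0 &= \frac{1}{4 - \sigma} \tag{12}\\
\alpha_1 &= \frac{1}{4 - \sigma} \tag{13}\\
\alpha_3 &= \frac{1 - \sigma}{4 - \sigma} \tag{14}
\end{align}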
The simulation result is illustrated in figure 4.
[Plot "Virtual Cut-through Different Level Data Fraction vs σ": horizontal axis σ from 0 to 1; vertical axis Data Fraction from 0 to 0.35; curves α0, α1, α3]
Figure 4: 2*2 mesh network. α0, α1, α2, α3 value curve
In figure 4, the three processors P0, P1 and P2 have the same data fraction, so the curves of α0 and α1
coincide. The figure shows that as σ grows, α3 drops. In other words, as communication becomes slower
relative to computation, less load is assigned to P3, and it becomes more economical to keep the load
on P0, P1 and P2 rather than distribute it to other processors. Thus for slow communication (σ close to 1),
α0 = α1 = α2 ≈ 1/3.
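The two limiting cases can be read off equations (12)-(14) directly. The short calculation below is only a worked check of the statement about slow and fast communication, not an additional result:

```latex
\begin{align*}
\sigma \to 1: &\quad \alpha_0 = \alpha_1 = \alpha_2 = \frac{1}{4-\sigma} \to \frac{1}{3},
  \qquad \alpha_3 = \frac{1-\sigma}{4-\sigma} \to 0, \\
\sigma \to 0: &\quad \alpha_0 = \alpha_1 = \alpha_2 = \frac{1}{4-\sigma} \to \frac{1}{4},
  \qquad \alpha_3 = \frac{1-\sigma}{4-\sigma} \to \frac{1}{4}.
\end{align*}
```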
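The inverse speed of the single equivalent processor that can replace the original network is ωeq, where
\[
\hat{T}_f = 1 \cdot \omega_{eq} T_{cp}, \qquad \omega_{eq} = \alpha_0 \omega
\]
\[
\text{Speedup} = \frac{T_f}{\hat{T}_f} = \frac{\omega T_{cp}}{\alpha_0 \omega T_{cp}} = \frac{1}{\alpha_0} = 4 - \sigma
\]
For fast communication (σ ≈ 0), the speedup approaches 4.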
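As a cross-check of the closed form, the flow matrix equation (10) can be solved numerically. This is a minimal sketch assuming NumPy is available; the helper name `solve_2x2` and the value σ = 0.2 are illustrative choices, not part of the original analysis.

```python
import numpy as np

def solve_2x2(sigma):
    """Solve the 2*2 corner-injection flow matrix equation A @ alpha = b."""
    # Unknowns are (alpha0, alpha1, alpha3); alpha2 equals alpha1 by symmetry.
    A = np.array([
        [1.0, 2.0, 1.0],          # alpha0 + 2*alpha1 + alpha3 = 1
        [1.0, -1.0, 0.0],         # alpha0 - alpha1 = 0
        [0.0, sigma - 1.0, 1.0],  # (sigma - 1)*alpha1 + alpha3 = 0
    ])
    b = np.array([1.0, 0.0, 0.0])
    return np.linalg.solve(A, b)

sigma = 0.2
alpha0, alpha1, alpha3 = solve_2x2(sigma)
speedup = 1.0 / alpha0            # Speedup = 1 / alpha0
print(alpha0, alpha1, alpha3, speedup)
```

The numerical fractions agree with equations (12)-(14), and the speedup agrees with 4 − σ.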
3.1.2 2*n Mesh Network
The 2∗n homogeneous mesh network (figure 5) processes load L, which originates at P0.
Load is distributed from P0 to P1 and P2 via virtual cut-through. After P1 and P2 finish receiving load over
links 0−1 and 0−2, the links are used to forward load to P3 and P4, and so on.
Similarly to the analysis of figure 3, the timing diagram for figure 5 is shown in figure 6.
Figure 5: 2*n (n = 10) mesh network and the workload happens on P0
Figure 6: The timing diagram for 2*10 mesh network and the data injection happens on P0 for virtual
cut-through
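The equations are:
\begin{align}
\alpha_0 \omega T_{cp} &= T_{f,m} \tag{15}\\
\alpha_1 \omega T_{cp} &= T_{f,m} \tag{16}\\
\alpha_2 \omega T_{cp} &= T_{f,m} \tag{17}\\
\alpha_1 z T_{cm} + \alpha_3 \omega T_{cp} &= T_{f,m} \tag{18}\\
\alpha_2 z T_{cm} + \alpha_4 \omega T_{cp} &= T_{f,m} \tag{19}\\
(\alpha_1 + \alpha_3) z T_{cm} + \alpha_5 \omega T_{cp} &= T_{f,m} \tag{20}\\
&\;\;\vdots \tag{21}\\
(\alpha_1 + \alpha_3 + \cdots + \alpha_{2n-3}) z T_{cm} + \alpha_{2n-1} \omega T_{cp} &= T_{f,m} \tag{22}\\
\alpha_0 + \cdots + \alpha_{2n-1} &= 1 \tag{23}\\
\sigma &= \frac{z T_{cm}}{\omega T_{cp}} \tag{24}\\
0 < \sigma &< 1 \tag{25}\\
0 < \alpha_0 &\le 1 \tag{26}\\
0 \le \alpha_1, \alpha_2, \ldots, \alpha_{2n-1} &< 1 \tag{27}
\end{align}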
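The flow matrix equation is:
\[
\begin{bmatrix}
1 & 2 & 2 & 2 & \cdots & 2 & 1 \\
1 & -1 & 0 & 0 & \cdots & 0 & 0 \\
0 & \sigma-1 & 1 & 0 & \cdots & 0 & 0 \\
0 & \sigma-1 & \sigma & 1 & \cdots & 0 & 0 \\
\vdots & \vdots & \vdots & \vdots & \ddots & \vdots & \vdots \\
0 & \sigma-1 & \sigma & \sigma & \cdots & 1 & 0 \\
0 & \sigma-1 & \sigma & \sigma & \cdots & \sigma & 1
\end{bmatrix}
\times
\begin{bmatrix}
\alpha_0 \\ \alpha_1 \\ \alpha_3 \\ \alpha_5 \\ \vdots \\ \alpha_{2n-3} \\ \alpha_{2n-1}
\end{bmatrix}
=
\begin{bmatrix}
1 \\ 0 \\ 0 \\ 0 \\ \vdots \\ 0 \\ 0
\end{bmatrix}
\tag{28}
\]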
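The row pattern of equation (28) is regular enough to generate programmatically. The following is a sketch under the assumption that NumPy is available and that n ≥ 2; the function name `flow_matrix_2xn` and the values n = 10, σ = 0.2 are chosen here for illustration.

```python
import numpy as np

def flow_matrix_2xn(n, sigma):
    """Flow matrix of equation (28) for a 2*n mesh with corner injection."""
    if n < 2:
        raise ValueError("this sketch assumes n >= 2")
    size = n + 1                  # one unknown per hop distance 0..n
    A = np.zeros((size, size))
    # Row 0: number of processors at each hop distance (1, 2, ..., 2, 1).
    A[0, :] = 2.0
    A[0, 0] = 1.0
    A[0, -1] = 1.0
    # Row 1: alpha0 - alpha1 = 0.
    A[1, 0] = 1.0
    A[1, 1] = -1.0
    # Remaining rows: (sigma - 1) in column 1, sigma up to the diagonal, 1 on it.
    for k in range(2, size):
        A[k, 1] = sigma - 1.0
        A[k, 2:k] = sigma
        A[k, k] = 1.0
    return A

n, sigma = 10, 0.2
A = flow_matrix_2xn(n, sigma)
b = np.zeros(n + 1)
b[0] = 1.0
alpha = np.linalg.solve(A, b)     # (alpha0, alpha1, alpha3, ..., alpha_{2n-1})
print(alpha)
```

For n = 2 the generated matrix reduces to the 3×3 matrix of equation (10), which gives a quick sanity check of the construction.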
According to Cramer’s rule, the explicit solution of this system of equations is:
\[
\alpha_i = \left| \frac{\det A^{\star}_i}{\det A} \right| \tag{29}
\]
where A⋆i is the matrix formed by replacing the i-th column of A by the column vector b.
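Specifically,
\[
A^{\star}_0 =
\begin{bmatrix}
1 & 2 & 2 & 2 & \cdots & 2 & 1 \\
0 & -1 & 0 & 0 & \cdots & 0 & 0 \\
0 & \sigma-1 & 1 & 0 & \cdots & 0 & 0 \\
0 & \sigma-1 & \sigma & 1 & \cdots & 0 & 0 \\
\vdots & \vdots & \vdots & \vdots & \ddots & \vdots & \vdots \\
0 & \sigma-1 & \sigma & \sigma & \cdots & 1 & 0 \\
0 & \sigma-1 & \sigma & \sigma & \cdots & \sigma & 1
\end{bmatrix}
\tag{30}
\]
\[
\alpha_0 = \left| \frac{\det A^{\star}_0}{\det A} \right|, \qquad \det A^{\star}_0 = -1
\]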
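The equivalent inverse processing speed is:
\[
\hat{T}_f = 1 \cdot \omega_{eq} T_{cp}, \qquad \omega_{eq} = \alpha_0 \omega
\]
Finally, the speedup is:
\[
\text{Speedup} = \frac{T_f}{\hat{T}_f} = \frac{\omega T_{cp}}{\alpha_0 \omega T_{cp}} = \frac{1}{\alpha_0} = \left| -\det A \right| .
\]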
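Cramer's rule and the determinant form of the speedup can be checked on a small instance. The sketch below assumes NumPy and uses a hard-coded 2∗3 mesh (n = 3) with σ = 0.5; the helper `cramer` is a generic illustration suited to small systems only. It computes the signed ratio det A⋆i / det A, which here coincides with the absolute value in equation (29) because the load fractions are non-negative.

```python
import numpy as np

def cramer(A, b):
    """Solve A x = b by Cramer's rule (reasonable for small systems only)."""
    det_A = np.linalg.det(A)
    x = np.empty(len(b))
    for i in range(len(b)):
        A_i = A.copy()
        A_i[:, i] = b             # replace the i-th column of A by b
        x[i] = np.linalg.det(A_i) / det_A
    return x

sigma = 0.5
# Flow matrix of a 2*3 mesh, corner injection; unknowns (alpha0, alpha1, alpha3, alpha5).
A = np.array([
    [1.0, 2.0, 2.0, 1.0],
    [1.0, -1.0, 0.0, 0.0],
    [0.0, sigma - 1.0, 1.0, 0.0],
    [0.0, sigma - 1.0, sigma, 1.0],
])
b = np.array([1.0, 0.0, 0.0, 0.0])

alpha = cramer(A, b)
A_0 = A.copy()
A_0[:, 0] = b
det_A0 = np.linalg.det(A_0)       # the text states this equals -1
speedup = abs(np.linalg.det(A))   # Speedup = |-det A|
print(alpha, det_A0, speedup)
```

In practice a direct solver such as `np.linalg.solve` is preferable for larger meshes; Cramer's rule is used here only because it mirrors equation (29).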
Further, we prove that det A ≠ 0.
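Let C be the submatrix of A obtained by deleting its first row and first column:
\[
C =
\begin{bmatrix}
-1 & 0 & 0 & \cdots & 0 & 0 \\
\sigma-1 & 1 & 0 & \cdots & 0 & 0 \\
\sigma-1 & \sigma & 1 & \cdots & 0 & 0 \\
\vdots & \vdots & \vdots & \ddots & \vdots & \vdots \\
\sigma-1 & \sigma & \sigma & \cdots & 1 & 0 \\
\sigma-1 & \sigma & \sigma & \cdots & \sigma & 1
\end{bmatrix}
\tag{31}
\]
C is a lower triangular matrix whose diagonal elements are nonzero. So C is non-degenerate, that is, its
columns are linearly independent.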
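After a series of column reduction and row reduction operations, we get
\begin{align*}
A &=
\begin{bmatrix}
1 & 2 & 2 & 2 & \cdots & 2 & 1 \\
1 & -1 & 0 & 0 & \cdots & 0 & 0 \\
0 & \sigma-1 & 1 & 0 & \cdots & 0 & 0 \\
0 & \sigma-1 & \sigma & 1 & \cdots & 0 & 0 \\
\vdots & \vdots & \vdots & \vdots & \ddots & \vdots & \vdots \\
0 & \sigma-1 & \sigma & \sigma & \cdots & 1 & 0 \\
0 & \sigma-1 & \sigma & \sigma & \cdots & \sigma & 1
\end{bmatrix} \\
&\xrightarrow{\text{Column Reduction}}
\begin{bmatrix}
1 & 0 & 0 & 0 & \cdots & 0 & 0 \\
1 & -3 & -2 & -2 & \cdots & -2 & -1 \\
0 & \sigma-1 & 1 & 0 & \cdots & 0 & 0 \\
0 & \sigma-1 & \sigma & 1 & \cdots & 0 & 0 \\
\vdots & \vdots & \vdots & \vdots & \ddots & \vdots & \vdots \\
0 & \sigma-1 & \sigma & \sigma & \cdots & 1 & 0 \\
0 & \sigma-1 & \sigma & \sigma & \cdots & \sigma & 1
\end{bmatrix} \\
&\xrightarrow{\text{Row Reduction}}
\begin{bmatrix}
1 & 0 & 0 & 0 & \cdots & 0 & 0 \\
0 & -3 & -2 & -2 & \cdots & -2 & -1 \\
0 & \sigma-1 & 1 & 0 & \cdots & 0 & 0 \\
0 & \sigma-1 & \sigma & 1 & \cdots & 0 & 0 \\
\vdots & \vdots & \vdots & \vdots & \ddots & \vdots & \vdots \\
0 & \sigma-1 & \sigma & \sigma & \cdots & 1 & 0 \\
0 & \sigma-1 & \sigma & \sigma & \cdots & \sigma & 1
\end{bmatrix}
\end{align*}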
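Consider the matrix Ĉ
\[
\hat{C} =
\begin{bmatrix}
-3 & -2 & -2 & \cdots & -2 & -1 \\
\sigma-1 & 1 & 0 & \cdots & 0 & 0 \\
\sigma-1 & \sigma & 1 & \cdots & 0 & 0 \\
\vdots & \vdots & \vdots & \ddots & \vdots & \vdots \\
\sigma-1 & \sigma & \sigma & \cdots & 1 & 0 \\
\sigma-1 & \sigma & \sigma & \cdots & \sigma & 1
\end{bmatrix}
\tag{32}
\]
whose columns are still linearly independent. Considering 0 < σ < 1, the flow matrix has full rank, so det A ≠ 0.
This proof can be generalized to the m×n case.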
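One concrete way to see why the system has exactly one solution is to subtract consecutive rows of (28). This is offered as a supplementary sketch, not as part of the original proof. The symbols x_k are notation introduced only for this sketch: x_0 = α0, and x_k is the load fraction of a processor at hop distance k, so x_1 = α1, x_2 = α3, ..., x_n = α_{2n−1}.

```latex
\begin{align*}
\text{row 2:}\quad & x_0 - x_1 = 0 \;\Rightarrow\; x_1 = x_0, \\
\text{row 3:}\quad & (\sigma - 1)x_1 + x_2 = 0 \;\Rightarrow\; x_2 = (1-\sigma)\,x_1, \\
\text{row }(k+2) - \text{row }(k+1):\quad & \sigma x_k + x_{k+1} - x_k = 0
  \;\Rightarrow\; x_{k+1} = (1-\sigma)\,x_k, \quad 2 \le k \le n-1, \\
\text{hence}\quad & x_k = (1-\sigma)^{k-1} x_0, \quad 1 \le k \le n, \\
\text{row 1:}\quad & x_0 \Bigl( 1 + 2\sum_{k=1}^{n-1} (1-\sigma)^{k-1} + (1-\sigma)^{n-1} \Bigr) = 1 .
\end{align*}
```

For 0 < σ < 1 the bracket is strictly positive, so x_0 is determined uniquely and is positive, and every x_k follows from it. The system therefore has exactly one solution, which for a square matrix is equivalent to det A ≠ 0. For n = 2 the bracket equals 4 − σ, in agreement with equation (12).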
3.1.3 m*n Mesh Network
Consider a general m∗n mesh network, such as those in figure 7 and figure 1.
Figure 7: 3*8 mesh network. The data injection position is P0
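Utilizing the previous methodology, we obtain the flow matrix equation for figure 7, where σ⋆ = σ − 1:
\[
\begin{bmatrix}
1 & 2 & 3 & 3 & 3 & 3 & 3 & 3 & 2 & 1 \\
1 & -1 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 \\
0 & \sigma^{\star} & 1 & 0 & 0 & 0 & 0 & 0 & 0 & 0 \\
0 & \sigma^{\star} & \sigma & 1 & 0 & 0 & 0 & 0 & 0 & 0 \\
0 & \sigma^{\star} & \sigma & \sigma & 1 & 0 & 0 & 0 & 0 & 0 \\
0 & \sigma^{\star} & \sigma & \sigma & \sigma & 1 & 0 & 0 & 0 & 0 \\
0 & \sigma^{\star} & \sigma & \sigma & \sigma & \sigma & 1 & 0 & 0 & 0 \\
0 & \sigma^{\star} & \sigma & \sigma & \sigma & \sigma & \sigma & 1 & 0 & 0 \\
0 & \sigma^{\star} & \sigma & \sigma & \sigma & \sigma & \sigma & \sigma & 1 & 0 \\
0 & \sigma^{\star} & \sigma & \sigma & \sigma & \sigma & \sigma & \sigma & \sigma & 1
\end{bmatrix}
\times
\begin{bmatrix}
\alpha_0 \\ \alpha_1 \\ \alpha_3 \\ \alpha_6 \\ \alpha_9 \\ \alpha_{12} \\ \alpha_{15} \\ \alpha_{18} \\ \alpha_{21} \\ \alpha_{23}
\end{bmatrix}
=
\begin{bmatrix}
1 \\ 0 \\ 0 \\ 0 \\ 0 \\ 0 \\ 0 \\ 0 \\ 0 \\ 0
\end{bmatrix}
\tag{33}
\]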
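Also, the flow matrix equation for figure 1 (with σ⋆ = σ − 1) is:
\[
\begin{bmatrix}
1 & 2 & 3 & 4 & 5 & 4 & 3 & 2 & 1 \\
1 & -1 & 0 & 0 & 0 & 0 & 0 & 0 & 0 \\
0 & \sigma^{\star} & 1 & 0 & 0 & 0 & 0 & 0 & 0 \\
0 & \sigma^{\star} & \sigma & 1 & 0 & 0 & 0 & 0 & 0 \\
0 & \sigma^{\star} & \sigma & \sigma & 1 & 0 & 0 & 0 & 0 \\
0 & \sigma^{\star} & \sigma & \sigma & \sigma & 1 & 0 & 0 & 0 \\
0 & \sigma^{\star} & \sigma & \sigma & \sigma & \sigma & 1 & 0 & 0 \\
0 & \sigma^{\star} & \sigma & \sigma & \sigma & \sigma & \sigma & 1 & 0 \\
0 & \sigma^{\star} & \sigma & \sigma & \sigma & \sigma & \sigma & \sigma & 1
\end{bmatrix}
\times
\begin{bmatrix}
\alpha_0 \\ \alpha_1 \\ \alpha_3 \\ \alpha_6 \\ \alpha_{10} \\ \alpha_{15} \\ \alpha_{19} \\ \alpha_{22} \\ \alpha_{24}
\end{bmatrix}
=
\begin{bmatrix}
1 \\ 0 \\ 0 \\ 0 \\ 0 \\ 0 \\ 0 \\ 0 \\ 0
\end{bmatrix}
\tag{34}
\]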
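A similar method proves that det A ≠ 0. The equivalent inverse processing speed is:
\[
\hat{T}_f = 1 \cdot \omega_{eq} T_{cp}, \qquad \omega_{eq} = \alpha_0 \omega
\]
so the speedup is:
\[
\text{Speedup} = \frac{T_f}{\hat{T}_f} = \frac{\omega T_{cp}}{\alpha_0 \omega T_{cp}} = \frac{1}{\alpha_0} = \left| -\det A \right| .
\]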
The first row of the flow matrix gives the number of cores at each hop distance Di.
For example, there is 1 core at hop distance 0 (D0) from load site L, there are 2 cores at hop distance 1
(D1) from load site L, there are 3 cores at hop distance 2 (D2) from load site L, and so on.
The number of rows equals the number of distinct processor data fractions.
From these cases we find a crucial fact:
if Di = Dj, then αi = αj, for 0 ≤ i, j ≤ m∗n−1
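Because processors at the same hop distance receive the same fraction, the flow matrix for corner injection can be built from the layer counts alone. The following is a sketch assuming NumPy, a mesh with at least two processors, and the row pattern of equations (33) and (34); the function names `layer_counts`, `flow_matrix` and `solve_mesh` are introduced here for illustration.

```python
import numpy as np

def layer_counts(m, n):
    """Number of processors at each hop distance from the corner P0 of an m*n mesh."""
    counts = np.zeros(m + n - 1, dtype=int)
    for r in range(m):
        for c in range(n):
            counts[r + c] += 1    # hop distance from the corner is r + c
    return counts

def flow_matrix(counts, sigma):
    """Flow matrix with the row pattern of equations (33) and (34)."""
    k = len(counts)
    if k < 2:
        raise ValueError("this sketch assumes at least two hop distances")
    A = np.zeros((k, k))
    A[0, :] = counts              # first row: cores at each hop distance
    A[1, 0], A[1, 1] = 1.0, -1.0  # alpha0 - alpha1 = 0
    for i in range(2, k):
        A[i, 1] = sigma - 1.0     # sigma* in the text
        A[i, 2:i] = sigma
        A[i, i] = 1.0
    return A

def solve_mesh(m, n, sigma):
    """Return layer counts, per-layer load fractions and speedup 1/alpha0."""
    counts = layer_counts(m, n)
    A = flow_matrix(counts, sigma)
    b = np.zeros(len(counts))
    b[0] = 1.0
    alpha = np.linalg.solve(A, b) # one load fraction per hop distance
    return counts, alpha, 1.0 / alpha[0]

counts, alpha, speedup = solve_mesh(3, 8, 0.3)
print(counts, speedup)
```

Calling `layer_counts(3, 8)` and `layer_counts(5, 5)` reproduces the first rows of equations (33) and (34), respectively.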
4 OTHER NETWORKS
4.1 Mesh Networks
5 CONCLUSION
In this work a significant problem is addressed: optimal single source load distribution in mesh, toroidal
and hypercube networks. This is done by way of example for virtual cut-through switching and a modified
version of store and forward switching. However, the approach outlined here is applicable to a wide variety
of switching/load distribution strategies and architectural parameters. We propose a flow matrix equation
method to characterize the nature of the minimal time solution and a simple method to determine when and
how much load to distribute to processors. This work demonstrates that mathematical modeling and tractable
solutions can be an aid to designing and evaluating scheduling strategies in parallel systems. Parallel systems
will be with us for some time, so this and related work is likely to be of enduring value.
ACKNOWLEDGMENTS
“This work has been submitted to the IEEE for possible publication. Copyright may be transferred without
notice, after which this version may no longer be accessible.”
References
Bharadwaj, V., D. Ghose, V. Mani, and T. G. Robertazzi. 1996. Scheduling divisible loads in parallel and distributed systems, Volume 8. John Wiley & Sons.
Bharadwaj, V., D. Ghose, and T. G. Robertazzi. 2003. “Divisible load theory: A new paradigm for load scheduling in distributed systems”. Cluster Computing vol. 6 (1), pp. 7–17.
Błażewicz, J., and M. Drozdowski. 1995. “Scheduling divisible jobs on hypercubes”. Parallel Computing vol. 21 (12), pp. 1945–1956.
Błażewicz, J., and M. Drozdowski. 1996. “The performance limits of a two dimensional network of load-sharing processors”. Foundations of Computing and Decision Sciences vol. 21 (1), pp. 3–15.
Błażewicz, J., M. Drozdowski, F. Guinand, and D. Trystram. 1999. “Scheduling a divisible task in a two-dimensional toroidal mesh”. Discrete Applied Mathematics vol. 94 (1-3), pp. 35–50.
Drozdowski, M., and W. Głazek. 1999. “Scheduling divisible loads in a three-dimensional mesh of processors”. Parallel Computing vol. 25 (4), pp. 381–404.
Głazek, W. 2003. “A multistage load distribution strategy for three-dimensional meshes”. Cluster Computing vol. 6 (1), pp. 31–39.
Hung, J. T., and T. G. Robertazzi. 2004. “Switching in sequential tree networks”. IEEE Transactions on Aerospace and Electronic Systems vol. 40 (3), pp. 968–982.
Kermani, P., and L. Kleinrock. 1979. “Virtual cut-through: A new computer communication switching technique”. Computer Networks (1976) vol. 3 (4), pp. 267–286.
Krevat, E., J. G. Castaños, and J. E. Moreira. 2002. “Job scheduling for the BlueGene/L system”. In Workshop on Job Scheduling Strategies for Parallel Processing, pp. 38–54. Springer.
Liu, X., H. Zhao, and X. Li. 2007. “Scheduling Divisible Workloads from Multiple Sources in Linear Daisy Chain Networks”. Proceedings of the International Conference on Parallel and Distributed Processing Techniques and Applications vol. 2, June 25-28, 2007.
Robertazzi, T. G. 1993. “Processor equivalence for daisy chain load sharing processors”. IEEE Transactions on Aerospace and Electronic Systems vol. 29 (4), pp. 1216–1221.
Robertazzi, T. G. 2017. Introduction to Computer Networking. Springer Science.
Sohn, J., and T. G. Robertazzi. 1996. “Optimal divisible job load sharing for bus networks”. IEEE Transactions on Aerospace and Electronic Systems vol. 32 (1), pp. 34–40.
Zhang, J. 2018. “Data Distribution Equivalence for Data Intensive Interconnection Networks”. Ph.D. thesis, Stony Brook University.
AUTHOR BIOGRAPHIES
JUNWEI ZHANG received the PhD degree from the Applied Mathematics and Statistics Department of
Stony Brook University in 2018. His research interests include parallel computing optimization,
computational geometry and applied machine learning. His email address is junwei.zhang@stonybrook.edu.
YANG LIU received his PhD degree from the Department of Electrical and Computer Engineering at Stony
Brook University, Stony Brook, NY, in 2017. Previously, he received his B.E. degree from the Department
of Electrical and Computer Engineering at the University of Electronic Science and Technology of China,
Chengdu, China, in 2011. His research interests are in the area of distributed/parallel computing,
networking, and load balancing algorithms. He is currently working on divisible load theory and heterogeneous
system applications. His email address is yangliu89415@gmail.com.
LI SHI received his Ph.D. degree from the Department of Electrical and Computer Engineering at Stony Brook
University, Stony Brook, NY, in 2016. Previously, he received his B.E. degree in electrical and computer
engineering from Shanghai Jiao Tong University, Shanghai, China, in 2010. He is working at Snap Inc,
Venice, CA. His email address is lishi.pub@gmail.com.
THOMAS ROBERTAZZI is a Professor of Electrical and Computer Engineering at Stony Brook
University. He is an IEEE Fellow. He received the PhD from Princeton University and the B.E.E. from the Cooper
Union. He has published extensively in areas such as scheduling, performance evaluation and networking.
His email address is thomas.Robertazzi@stonybrook.edu.
