The increasing complexity of the software/hardware stack of modern supercomputers makes understanding the performance of the modern massive-scale codes difficult. Distributed graph algorithms (DGAs) are at the forefront of that complexity, pushing the envelope with their massive irregularity and data dependency. We analyze the existing body of research on DGAs to assess how technical contributions are linked to experimental performance results in the field. We distinguish algorithm-level contributions related to graph problems from runtime-level concerns related to communication, scheduling, and other low-level features necessary to make distributed algorithms work. We show that the runtime is an integral part of DGAs' experimental results. We argue that a DGA can only be fully understood as a combination of these two aspects and that detailed reporting of runtime details must become an integral part of scientific standard in the field if results are to be truly understandable and interpretable. Based on our analysis of the field, we provide a template for reporting the runtime details of DGA results, and we further motivate the importance of these details by discussing in detail how seemingly minor runtime changes can make or break a DGA.
INTRODUCTION
Large, irregular applications are gaining recognition as the future challenge in parallel computing. This is reflected by the Graph500 benchmark [22], the subject of which is the prototypical irregular problem of graph traversal. Graph traversal is a basic building block of many other graph algorithms. The current Graph500 benchmark is based on breadth-first search (BFS) with a proposal to extend the benchmark with single-source shortest paths (SSSP). In this paper, we concentrate on BFS and SSSP for the same reasons, i.e., as representatives of a class of irregular graph problems.
Research on distributed graph algorithms (DGA) is an emerging and active field. New algorithms, new approaches to distribute the data, and new performance results appear at most major distributed computing conferences. The Graph500 benchmark bears witness to the progress with the best results progressing from seven GTEPS (billions of traversed edges per second) in 2010 to 23 TTEPS in 2014. Many new algorithmic techniques have been developed, e.g., direction optimization [4, 5], pruning [9], k-level asynchronous algorithm [17], hybrid algorithms [9], and distributed control [31]. A practitioner faces a multitude of published approaches, which are often vague on low-level details of implementations.
However, for DGAs, low-level implementation details profoundly affect performance. This is because DGAs exhibit little locality, rarely require any significant computation per memory access, and result in a high rate of communication of small messages. Thus, unlike regular algorithms that are built on top of well-understood regular communication and memory access, graph algorithms interact with the entire software and hardware stack in a complex way due to the data-driven, fine-grained, irregular nature of their tasks. Each piece of the stack, designed independently, from the algorithm level through the transport layer to the hardware layer and the topology of the physical network, interacts within the system. This makes designing DGAs an experimental endeavor, and this state of affairs will only be exacerbated as we move towards exascale computing. It should be noted that the complexity of the interactions between high-level algorithm and low-level runtime is not unknown to practitioners. However, this knowledge is implicit, fragmented, and often sidelined in the presentation of new techniques. Notably, Checconi and Petrini [11], who achieve the top results in the Graph500 benchmark in part due to direct access to the SPI (System Programming Interface) low-level primitives, provide an outstanding analysis of their evolving implementation, including a three-year timeline of changing conclusions and understanding. Unfortunately, this manner of reporting is not typical for the field.
Table 1: Runtime-level aspects of DGAs. These aspects need to be disclosed when reporting results; for those that are quantifiable, numerical values should be reported for each experiment.
We contend that the advancements in the field are difficult to generalize and reconcile because the information reported is commonly incomplete. The low-level details of implementations are often vague or missing. Yet, these can have important impact.
In this paper we propose a template of runtime features, presented in Table 1, to aid authors in reporting their work. We arrive at this template by combining our own experience in co-designing runtime systems and DGAs with findings from a review of the existing literature. For completeness, we also list algorithm-level aspects in Table 2. We invite others to extend and revise our suggestions, and to make this a community effort. Widespread adoption of our recommendation will enable transferability of lessons learned across the field, metastudies of the interaction of runtime and algorithmic concerns to potentially derive abstract models, and, with deepened, systematic understanding, improvement of performance.
Table 2: Algorithm-level aspects of DGAs. These aspects are usually disclosed in the literature; we present them here for completeness.
Contributions:
An analysis of the field (Sec. 2), identifying, classifying, and discussing two levels of distributed graph algorithms: (i) Runtime-level aspects (Sec. 2.1) that authors do not explicitly consider a part of the algorithm but that play a crucial role in the overall performance; and (ii) Algorithm-level aspects (Sec. 2.2) that authors identify as the main algorithmic contributions of their research;
A template of runtime features (Sec. 2, Table 1 ) for authors to consider when describing their research;
Filled-out "report cards" (Table 3) for Distributed Control [31] and BlueGene/Q BFS [11], which constitute examples of how to use our proposed template; and
Demonstration of sensitivity of algorithm performance to runtime parameters (Sec. 4) where we show how seemingly minor changes may have significant effect.
ANALYSIS OF DGAs
In this section, we analyze and describe a sample of the existing research on distributed BFS and SSSP problems. Our motivation is two-fold. First, we want to get an overall feel for how complete (or incomplete) the information about the runtime part of the DGA stack presented in the literature is. Second, we aim to identify potential aspects of DGAs beyond those stemming from our own work.
Figure 1: Overview of the runtime stack components.
algorithm agnostic. For this reason, our analysis of runtime concerns is applicable to any DGAs. The purpose of our effort is to construct a blueprint for a more holistic treatment of DGAs. Tables 1 and 2 summarize the runtime-level and algorithm-level aspects of DGAs discussed in the remainder of the section. Moreover, Table 1 serves as a template for reporting runtime aspects of DGAs, and we use it as such to describe three DGA runtimes in Table 3.
Runtime-Level Aspects of DGAs
A DGA runtime (Fig. 1) has two major distinguishable components working in concert: a transport (Sec. 2.1.1) and a scheduler (Sec. 2.1.3). The basic component of a transport is a bit transport, a system-provided implementation of an interface for actually sending bits over the wire. Bit transport can support different protocols for exchanging messages, such as eager or rendezvous protocols. The bit transport API may support levels of thread safety, impacting both the use and the internal working of the transport (hence the shared part in the figure). Request tracking is the mechanism for making and keeping track of communication requests. Building on bit transport, the runtime must use the request tracking mechanism in some way, according to thread safety, to initiate and complete requests. We call this process progression. The choice of communication paradigm has a reverberating impact both on the rest of the transport layer and on the DGA itself. For example, collectives require more memory for communication results than point-to-point communication, and the progression mechanism has to be designed with fewer but more heavyweight requests in mind. Transport may provide logical topologies that add routing on top of physical and job layout topologies to improve performance (e.g., message reductions in logical topology) or to reduce memory requirements (e.g., fewer coalescing buffers). Finally, the transport may employ various optimizations such as message coalescing.
The scheduler part of the runtime schedules worker threads or tasks. One of the responsibilities of the local scheduler is to ensure transport progression. Local scheduler can also be responsible for checking termination of an algorithm.
In the following discussion, we categorize important runtime-level parameters based on the review of existing literature for DGAs. Table 1 summarizes the set of low-level runtime parameters.
Transport
The transport layer is the part of the stack responsible for sending and receiving bits. Important properties of transport include how message buffers are handled, which entity manages them, and how frequently they need to be managed. The runtime needs to take several decisions regarding these.
The choice of communication paradigm can have a notable impact on the performance of DGAs. Each paradigm imposes different tradeoffs in terms of memory constraints, synchronization overhead, and network latency. The collectives paradigm is used when large, low-overhead stages of all-to-all communication are needed; the point-to-point paradigm allows for finer overlap between computation and communication at the expense of code complexity; and active messages are a refinement of point-to-point communication that adds an implicit execution of handlers on remote objects. Finally, the one-sided paradigm provides remote memory operations (GET, PUT, etc.), which are very efficient but require a remote memory management protocol.
For example, collectives are the base of BLAS approaches and level-synchronous approaches. However, Checconi and Petrini [11] show how using lightweight point-to-point communication may lead to improvements in traditionally synchronous approaches. They compare their point-to-point implementation, which uses BlueGene/Q's low-level System Programming Interface (SPI), to an MPI implementation using collectives. They note the large memory footprint required for collective buffers, which forces them to decrease the scale of the problem per node. Furthermore, collectives do not allow for easy interleaving of computation with communication. Active messages are based on point-to-point communication, and they display similar communication performance characteristics.
Request Tracking (RT) refers to how communication requests are made: a request is scheduled and completed separately (asynchronous), a request is made and completed at the same time (locally synchronous), or a request is made and the requester waits until it has been completely processed (remotely synchronous). As an example, remotely synchronous request tracking in MPI (e.g., via MPI_Ssend) guarantees a small number of messages on the network, but it hinders parallelization. On the other hand, using locally synchronous RT (e.g., MPI_Send) makes it easy to reuse buffers and may allow parallelism if the eager protocol is used. Finally, asynchronous RT uses interfaces such as MPI_Isend/MPI_Irecv to start requests along with interfaces such as MPI_Testsome to check for their completion, maximizing overlap (if the MPI implementation supports it) between computation and communication at the cost of more complex progression and request management.
Completing a round trip through transport requires bookkeeping, performing bit moving, and delivering the results of completed requests; we call that progression. Progression influences the timeliness and efficiency of transport delivery, and a wrong progression model can render an algorithm infeasible (Sec. 4.3). Asynchronous progression is performed periodically and is scheduled through dedicated resources such as system or user threads. For example, Cray MPI provides an option for starting progression pthreads that perform internal MPI progress in parallel with the algorithm threads. Progression through user threads is scheduled by the runtime explicitly, and, for example, calls MPI repeatedly to generate progress. In contrast, synchronous progression is initiated periodically from the runtime. In explicit progress, the algorithm can choose, bypassing the runtime scheduler (if any), when to call progress, enabling optimizations at the cost of added code complexity. For example, in [31], we employed explicit polling in our Distributed Control (DC) algorithm for SSSP, but observed a decrease in performance. In a task-based system, network progress can be scheduled as a lightweight task. For example, AM++ [28] implements network polling, buffer flushing, checking for termination, and executing pending handlers for received messages as tasks, on equal footing with algorithm tasks that run message handlers. HPX-5 [18] executed network progress in a similar fashion, but the more recent versions switched to explicitly initiating progress in the main scheduler loop, giving the runtime more control over when progression is executed. Most authors do not discuss progression and request tracking explicitly, but the choices made for these parameters may have a profound effect on performance (cf. Sec. 4.3).
Bit transport is the lowest-level network interface used by upper levels to deliver bits from one location to another. In efficient BlueGene/Q implementations [12, 9, 11, 26], the System Programming Interface (SPI) communication layer serves as a bit transport (as described above). The majority of implementations in the literature use the Message Passing Interface (MPI) for their bit transport. SPI is a direct interface to hardware queues, while MPI is a complex framework with extra functionality and semantics. Direct interfaces such as SPI may yield more efficient communication, but they are less portable or not portable at all, and may require more implementation effort. The third type of bit transport is based on the remote method invocation (RMI) technique and is used in approaches based on STAPL [16, 17], a generic parallel library for graph and other data structures and algorithms. STAPL uses the ARMI (Adaptive Remote Method Invocation) active-message communication library, based on RMI. ARMI supports automatic message coalescing but does not provide routing or message reductions natively. The HPX-5 runtime supports two types of bit transport: one is based on MPI, the other on the Photon [19] RDMA middleware library. Photon is based on RDMA put and get with completion, where requests are completed asynchronously, and their completions are written to ledgers that can be read by higher-level runtimes. HPX-5, AM++, and ARMI can use different bit transport backends, making the interface boundary between the bit transport and the runtime very clear.
Bit transport may employ different protocols. For example, MPI point-to-point communication may support eager protocols for small messages and rendezvous protocols for larger transfers, sending messages without or with, respectively, round-trip communication [3]. The choice of protocols may have a detrimental impact on algorithms (e.g., Sec. 4.2), and it may be difficult to control explicitly. For example, most MPI implementations provide extensive configuration options, but these options are not standardized and often can only be fully utilized by experts.
A number of runtime-level optimization techniques have been proposed in the literature to reduce communication overhead and maximize throughput. AM++ provides message reduction (caching). Pearce et al. [25] used tree-based broadcast, reduction, and filtering for communication involving high-degree vertices. Panitanarak and Madduri [23] used local lookup arrays to track the tentative distance of every vertex, thus avoiding sending duplicate requests.
Increasing message coalescing (cf., Sec. 4.2) buffer size increases the rate at which small messages can be sent over a network at the cost of latency. Checconi and Petrini [11] use coalescing to pack together all the edges that would be sent to each destination separately and queue them in an intermediate buffer. Pearce et al. [25, 24] combined coalescing with routing to reduce dense communication.
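The coalescing tradeoff described above can be illustrated with a minimal sketch. It assumes a hypothetical `CoalescingSender` class and a `transmit` callback standing in for the bit transport; neither name is an AM++ or MPI interface. Messages are appended to a per-destination buffer, a full buffer is shipped as one larger message, and partially filled buffers must be flushed explicitly.

```python
from collections import defaultdict


class CoalescingSender:
    """Toy per-destination message coalescing (hypothetical, not a real runtime API)."""

    def __init__(self, capacity, transmit):
        self.capacity = capacity          # messages per coalescing buffer
        self.transmit = transmit          # callback: transmit(dest, list_of_messages)
        self.buffers = defaultdict(list)  # one buffer per destination rank

    def send(self, dest, message):
        buf = self.buffers[dest]
        buf.append(message)
        if len(buf) >= self.capacity:     # buffer full: ship it as one large message
            self.flush(dest)

    def flush(self, dest):
        buf = self.buffers.pop(dest, [])
        if buf:
            self.transmit(dest, buf)

    def flush_all(self):
        # Partially filled buffers must eventually be sent, or messages are stranded.
        for dest in list(self.buffers):
            self.flush(dest)


wire = []  # records every network-level message as (dest, payload)
sender = CoalescingSender(capacity=4, transmit=lambda d, msgs: wire.append((d, msgs)))
for i in range(10):
    sender.send(dest=i % 2, message=(i, float(i)))
sender.flush_all()
print(len(wire), "network messages carried 10 application messages")
```

A larger `capacity` reduces the number of network-level messages but delays every message that waits in a partially filled buffer, which is the latency cost mentioned in the text.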
Message routing constructs a logical topology to add intermediate targets for messages. Pearce et al. [24] implemented routing through a synthetic network to mimic the BG/P 3D torus interconnect topology. In a follow-up paper [25], the authors additionally embedded the delegate tree as a means for further communication reduction. AM++ [27] supports software routing and provides two predefined strategies: rook routing and hypercube routing. Rook routing reduces the number of communicating buffers to O(√p) [14]. Yoo et al. [30] used ring communication in their optimized collective implementation and adjusted the diameter of the ring to achieve better performance.
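To make the O(√p) buffer count concrete, the following sketch shows one plausible reading of rook routing: ranks are laid out on a √p × √p logical grid, and a message first moves along the sender's row and then along the destination's column. The function names and the row-major rank layout are assumptions for illustration and are not taken from the AM++ implementation.

```python
import math


def rook_route(src, dst, p):
    """Return the hops from src to dst on a sqrt(p) x sqrt(p) logical grid."""
    side = math.isqrt(p)
    assert side * side == p, "this sketch assumes p is a perfect square"
    src_row, dst_col = src // side, dst % side
    via = src_row * side + dst_col   # same row as src, same column as dst
    path = [src]
    if via != src:
        path.append(via)
    if dst != path[-1]:
        path.append(dst)
    return path


def next_hops(rank, p):
    """All ranks that `rank` ever sends to directly under this routing scheme."""
    hops = set()
    for src in range(p):
        for dst in range(p):
            path = rook_route(src, dst, p)
            for a, b in zip(path, path[1:]):
                if a == rank:
                    hops.add(b)
    return hops


print(rook_route(0, 15, 16))    # route across a 4 x 4 grid
print(len(next_hops(5, 16)))    # direct peers of rank 5
```

Under these assumptions each rank only needs buffers for the other ranks in its row and column, that is, 2(√p − 1) buffers instead of p − 1, at the price of an extra hop for most messages.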
Thread safety. A message-passing framework can support different levels of thread safety. For MPI, there are four levels in total [2]: single, funneled, serialized, and multiple. We used multiple as the threading level together with an asynchronous progress thread in our AM++ DC [31] implementation.
Network Topology
Computing resources are organized in several specific physical topologies: 3D torus, dragonfly, 5D torus, and so on. The physical topology impacts the efficiency of communication in a graph computation. For example, Cray MPI provides an all-to-all implementation that is optimized for Aries and Gemini systems. In another example, Checconi and Petrini [10] map parts of graph adjacency matrices onto the Blue Gene/Q 5D torus topology in such a way that neighboring parts of the matrix are also neighbors in the physical topology.
The logical topology is the layout of the data in physical topology. Buluç et al. [8] , for example, found that processor grid skewness, i.e., the distribution and the shape of the blocks of an adjacency matrix, had significant impact on their results: the "tall skinny" grids (more blocks across the Y dimension of the matrix) performed faster, and "short fat" grids (more blocks across the X dimension of the matrix) performed worse than square grids.
The job scheduler for computing resources allocates nodes based on a scheduling policy, resulting in a job topology. Bhatele et al. [6] showed that job topology may have an impact on performance due to the distances among allocated nodes or due to contention on the shared network.
Local Scheduling
Depending on the node-level threading mechanism, thread scheduling policies, and synchronization primitives, tasks associated with a DGA can execute in different order with varying frequencies. For example, in an attempt to quickly spread good work, a message can be sent with priority and put the message handler in front of the task queue [31] . Supporting data structures, for example bitmaps in sync mode and global queue in async mode in [29] , can be another way to implement local scheduling. Below, we discuss several thread-granularity and scheduling-related factors.
Threads (worker threads) can be used for intra-node threading. Buluç and Madduri [7] and Buluç et al. [8] used MPI for inter-node processing and GNU OpenMP for intra-node threading. Zalewski et al. [31] used a combination of MPI and pthreads. HPX-5 uses suspendable lightweight threads with their own stacks and with cheap thread transfer.
Lightweight threads or tasks, implemented on top of kernel threads, can be scheduled differently. Task management mechanisms achieve load balancing by mechanisms such as work stealing and FIFO/LIFO schedulers.
Algorithm-Level Aspects of DGAs
Table 2 summarizes the set of parameters we identified as the algorithm-level aspects of DGAs, which we divide into four categories. The approach category is about the main algorithmic choices, the algorithmic considerations category covers the main aspects of the approach, and the categories of graph representation and data structures cover the data structures that are used.
OUR TEMPLATE IN PRACTICE
In this section we show how our proposed template can be applied in practice by comparing three different runtimes: AM++, HPX-5, and the IBM BlueGene/Q implementation. The characteristics of the runtimes that we consider are independent of any particular application. Our recommendation is to consider DGAs holistically, encompassing runtime and algorithmic concerns. Specifically, we apply the template to the Distributed Control (DC) algorithm [31] for solving the SSSP problem, implemented in the AM++ and HPX-5 runtimes, and to BlueGene/Q BFS [11]. The BlueGene/Q BFS is a rare case where the authors disclosed enough information so that we can fill out a "report card" based on our template (Table 3). We choose DC because it is particularly well suited as a subject of inquiry into the interplay of runtime and algorithmic concerns, as will be evident from the next section, devoted to experimental results.
The Runtime Report Card
The "report card" in Table 3 enumerates the runtime features of each of the two DGAs according to the template in Table 1 . We summarize the runtimes in a table form for clarity, but we do not advocate that scientific publications use that exact format. The important task is to ensure that all relevant aspects of the runtime are adequately covered. Furthermore, every runtime feature must list relevant associated quantities that are necessary for complete interpretability of experimental results. For example, in presence of coalescing, performance results cannot be interpreted if the size of coalescing buffers is not given.
Algorithmic Concerns
The goal of DC is to remove the overhead of synchronization and global data structures by using only thread-local priority queues to select best work, obtaining an approximation of the global ordering. Pseudo code for a DC algorithm for SSSP is given in Algorithm 1. The algorithm consists of 3 parts: the work loop that processes tasks from the local priority queue, the message handler that receives tasks from other workers, and the relax function that updates distances and generates new work. The work loop in DC is preceded by initialization of the distance map and by relaxing the source vertex (Lines 2-4). In the loop, the work on the graph is performed by removing a task from the thread-local priority queue in every iteration and then relaxing the vertex targeted by the task (Lines 7-14). Vertex relaxation checks whether the distance is better than the distance already in the distance map, and it sends a relax message (task) to all the neighbors with the new distance. Relax handler receives the messages sent from the relax function, and its only purpose is to insert the incoming tasks into the thread-local priority queue. When a handler finishes executing, it is counted as finished in termination detection. Note that there is no synchronization barrier in the algorithm. The work loop in the algorithm description is an abstraction of a more general work scheduling mechanism. In our AM++ implementation of DC, work loop is very similar to the loop in the figure, but it also includes some heuristics to improve performance and additional code for explicit management of termination. In our HPX-5 implementation, the work loop is provided as a more general priority scheduling mechanism in the runtime, where the runtime promises that a best effort priority will be applied when scheduling work marked as priority work.
[Algorithm 1: DC algorithm for SSSP. Surviving line of the relax procedure: ∀ v_n ∈ neighbors(G, v) : send(v_n, d_v + weight(v, v_n))]
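The three parts described above (work loop, message handler, relax function) can be sketched as a sequential simulation. This is an illustrative sketch under stated assumptions, not the AM++ or HPX-5 implementation: the function name `dc_sssp`, the modulo vertex distribution in `owner`, the round-robin interleaving of workers, and the use of empty queues in place of real termination detection are all assumptions.

```python
import heapq


def dc_sssp(graph, source, num_workers=2):
    """Sequential simulation of DC SSSP; graph maps vertex -> [(neighbor, weight)]."""
    dist = {v: float("inf") for v in graph}
    queues = [[] for _ in range(num_workers)]   # one local priority queue per worker

    def owner(v):
        return v % num_workers                  # assumed 1D vertex distribution

    def send(v, d):
        # Stand-in for an active message; the handler only enqueues the task.
        heapq.heappush(queues[owner(v)], (d, v))

    def relax(v, d):
        if d < dist[v]:                         # better distance found
            dist[v] = d
            for vn, w in graph[v]:
                send(vn, d + w)

    relax(source, 0)
    # No barrier: each worker repeatedly takes its locally best task.
    while any(queues):                          # stands in for termination detection
        for q in queues:
            if q:
                d, v = heapq.heappop(q)
                relax(v, d)
    return dist


example_graph = {0: [(1, 4), (2, 1)], 1: [(3, 1)], 2: [(1, 2), (3, 5)], 3: []}
print(dc_sssp(example_graph, source=0))
```

Because each worker orders only its own tasks, some tasks are processed with distances that are later improved; the result is still correct because every improvement is propagated again, which is the extra work that DC trades for the absence of synchronization.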
RUNTIME PARAMETERS OF DGA PERFORMANCE
DC is particularly sensitive to the runtime characteristics
Table 3: Runtime report cards.
Request Tracking: Sends and receives are scheduled with asynchronous MPI interfaces. The number of receive requests is kept constant, and send requests are created on demand with a flow-control mechanism to cap the number of outstanding requests. Required: number of receive buffers, flow control limit | Sends are locally synchronous (through SPI) with one buffer per destination (maximum one outstanding send per destination). Reception is asynchronous through polling of SPI counters.
Progression: HPX-5 invokes progress explicitly in the scheduler loop. Progress is invoked by a worker thread if there is no local work left, and with the priority scheduler for DC, progression is invoked periodically based on flow control feedback. Progression for MPI is serialized between workers. Sends are processed first from a send queue, then receives are processed and reused as they complete. For Photon, progression can be run by multiple threads (Photon is thread safe
Network Topology
Physical Topology: Star (central switch). | 3D torus Gemini and Dragonfly Aries. | BlueGene/Q 5D torus.
Logical Topology: Global Address Space (PGAS or AGAS). | None. | None.
Job Topology: Inconsequential because of star topology. | Unknown (execution time averaged from the same batch execution). | Unknown. Largest runs show significant variability.
Local Scheduling
Threading: Pthreads, lightweight user threads. | Pthreads. | Heavyweight worker threads.
Task Management: LIFO and priority queues of parcels, which represent undone work or suspended threads. | FIFO queue with every coalesced buffer represented as a task. | None, no lightweight tasks.
Termination: SKR termination detection (activity counts periodically reduced). | SKR termination detection (activity counts periodically reduced). | Unknown. "Termination check" is mentioned at least once.
[19] otherwise. On Cutter, we also vary the interconnect between InfiniPath and Mellanox. All experiments were run on Graph500 graphs. All execution times are reported by taking the average of executing multiple problem instances (over the same batch job execution).
Effect of Bit Transport and Interconnect
and ISIR transports. Our one node performance is taken with networking turned on; DC performs much worse than ∆-stepping, but it quickly improves with scale. ∆-stepping does not show good scaling behaviour altogether. While experimenting with ISIR transport, we have tried different limits for the number of MPI_Isend calls that HPX-5 spawns concurrently. Figure 3 shows how varying the send limit changes the performance of ∆-stepping algorithm for one of the scales. Table 4 shows the optimal send-limits for ∆-stepping algorithm with ISIR transport. This experiment illustrates how a runtime-level bit transport parameter can make an impact on the performance of an algorithm.
Effect of Coalescing Size on Transport Protocol
To increase bandwidth utilization, AM++ performs message coalescing, combining multiple messages sent to the same destination into a single, larger message. Messages are appended to per-destination buffers. To handle partially filled buffers, a periodic check is performed to check for activity. In the case of DC SSSP, a single message consists of a tuple of a destination vertex and distance, 12 bytes in total. With such small messages, coalescing has great impact on the performance, but finding the optimal size is difficult.
We investigated the impact of coalescing in Graph500 scale 31 graphs when running DC SSSP with max edge weight of 100 (Figs. 4a and 5). Figure 4a shows the large impact of a small change in the coalescing size, which is measured by the number of SSSP messages per coalescing buffer. Changing the coalescing size by less than 2% causes over 50% increase in the run time. This unexpected effect is caused by the specifics of Cray MPI protocols. At the smaller coalescing size, full message buffers fit into the rendezvous R0 protocol, which sends messages of up to 512K using one RDMA GET, while the larger buffers hit the R1 protocol, which sends chunks of 512K using RDMA PUT operations. At the size of 44,000, the bulk of the message fits into the first 512K buffer, and the small remainder requires another RDMA PUT, causing overheads. The sizes 43,000 and 86,000 fill out 1 and 2 buffers, respectively, achieving similar performance. The larger size, 86,000, results in better scaling properties. We ran a more extensive suite of benchmarks on Edison. Figure 5 shows the coalescing buffer size experiments on Edison. The results are similar, with periodic increases in the minimum run time as protocol buffers mismatch the coalescing buffers. The maximum run times signify the worst run time as bit-transport parameters other than coalescing are adjusted. The results show that adjusting other parameters is less and less important as the coalescing buffer size increases. Figure 4b shows the effects of coalescing on a DC BFS, which is SSSP with maximum weight of 1. Surprisingly, increasing the coalescing size impacts performance negatively. We suspect that with smaller weights the possibility of reward from optimistic parallelism in DC decreases, and the added latency of coalescing has a much larger effect than with larger weights. All three cases shown in Figs. 4a, 4b and 5 show that adjusting the coalescing size is important, and the optimal value is not static. Rather, it depends on algorithmic concerns such as reward from optimistic parallelism.
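The mismatch between coalescing buffers and protocol chunks can be checked with simple arithmetic. The sketch below uses the 12-byte message size and the 512K chunk size stated in the text; it assumes that 512K means 512 KiB and that a buffer carries no header, both of which are simplifications, and the function name `chunk_usage` is hypothetical.

```python
MESSAGE_BYTES = 12            # (destination vertex, distance) tuple, from the text
CHUNK_BYTES = 512 * 1024      # 512K protocol chunk, assumed to mean 512 KiB


def chunk_usage(messages_per_buffer):
    """Return (number of 512K chunks, bytes in the last chunk) for a full buffer."""
    total = messages_per_buffer * MESSAGE_BYTES
    full, remainder = divmod(total, CHUNK_BYTES)
    if remainder == 0:
        return full, CHUNK_BYTES
    return full + 1, remainder


for size in (43_000, 44_000, 86_000):
    chunks, last = chunk_usage(size)
    print(f"{size:>6} messages -> {chunks} chunk(s), {last} bytes in the last chunk")
```

Under these assumptions a buffer of 43,000 messages (516,000 bytes) fits in one chunk, while 44,000 messages (528,000 bytes) spill 3,712 bytes into a second, nearly empty chunk, and 86,000 messages fill two chunks almost completely. This matches the pattern reported for Fig. 4a.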
Transport Progress
At first, when we experimented with DC on Big Red 2 with AM++, we found that it was performing worse than the ∆-stepping algorithm [21]. This raised a concern that the DC approach may not be practical. We suspected that message latencies might be the culprit; so, upon researching MPT, we decided to experiment with asynchronous progress, which uses separate threads to perform progress in certain situations. Despite Cray's warning at the time that thread-multiple progress required for asynchronous progress "is not considered a high-performance implementation," we observed significant gains for DC, shown in Fig. 6. We ran the experiment on Graph500 scale 31 optimal strong scaling results. Without asynchronous progress, performance decreased with the increased number of nodes (with an unexplained anomaly at 112 nodes). (Note that all our experiments are averaged; thus, large anomalies are indicative of unexpected circumstances.) With the asynchronous progress thread, the performance of DC improved more than tenfold with growing node counts, entirely changing the viability of the approach. This dramatic effect illustrates how deeply an algorithm interacts with the runtime and how a gap in parameter space may lead to incorrect conclusions about DGA approaches. Interestingly, we did not observe a similar effect on Edison, where two different asynchronous progress modes and the standard progress mode perform similarly. We are unable to explain why this is so. This unpredictability and the difficulty of analyzing the performance show how important it is to document the specific runtime in which DGAs are executed.
Distributed Control Progress Heuristics
In addition to transport layer progress, AM++ performs its own internal progress when AM++ interfaces are called. Since AM++ DC is built around a loop that empties the local priority data structure (Lines 5-16 in Algorithm 1), it must occasionally, with some frequency, call into the appropriate AM++ interfaces that perform progress. This frequency is controlled by two parameters: the end-epoch test frequency (EE) and the eager progress limit (EL). EE controls how many iterations of the DC loop run before AM++ progress is invoked. The eager limit is a threshold of outstanding DC tasks below which AM++ progress is performed during every iteration of the DC loop. Figure 7 shows the effects of progress parameters, using performance data averaged over multiple runs while varying orthogonal parameters and choosing the best performing variant, which isolates the effects of the EE and EL parameters. Edison shows a significant sensitivity to the EE parameter. Smaller values are better, with 22 being the best of the ones tested. This suggests that latency may be a limiting factor on Edison. On Big Red 2, the results of varying the EE parameter are less pronounced, but the average of multiple experiments that we show here still suggests some sensitivity with the optimal value similar to that on Edison. Altogether, the results show that the performance of DC depends on the progress model.
Buffering and Work Efficiency
The goal of coalescing in AM++ is to decrease overhead by sending as many full coalescing buffers as possible. Partially filled buffers are sent only when no more messages are being inserted. Figure 8 shows DC results on Edison for a coalescing buffer size of 100,000. We found that the best predictor of performance is the number of partial buffers (fewer is better), followed by the number of full buffers (more is better). Partial buffers indicate periods with a lack of work, which in turn indicates that the local priority queues are being depleted more often, decreasing overall performance. AM++ was originally optimized for algorithms like BFS and ∆-stepping, which benefit from eager optimization of communication overhead and are not sensitive to work imbalance. Our example shows that optimizing a runtime for a seemingly worthy goal can negatively impact algorithms that have other needs not anticipated by the runtime developers.
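The distinction between full and partial buffers can be made concrete with a small Python sketch of a coalescing buffer that counts both kinds of sends. The class `CoalescingBuffer`, its `send` callback, and the `flush` method are hypothetical names chosen for illustration; they are not AM++ interfaces.

```python
class CoalescingBuffer:
    """Batch messages and count full versus partial sends (illustrative only)."""

    def __init__(self, capacity, send):
        self.capacity = capacity
        self.send = send            # hypothetical callback that ships one batch
        self.pending = []
        self.full_sends = 0
        self.partial_sends = 0

    def insert(self, message):
        self.pending.append(message)
        if len(self.pending) == self.capacity:
            self._ship()
            self.full_sends += 1

    def flush(self):
        """Called when the sender has no more messages to insert."""
        if self.pending:
            self._ship()
            self.partial_sends += 1

    def _ship(self):
        self.send(list(self.pending))
        self.pending.clear()
```

A sender that inserts ten messages into a buffer of capacity four and flushes once produces two full sends and one partial send. A sender that runs out of work after every second message, and therefore flushes each time, produces five partial sends and no full ones, which is the pattern associated above with depleted priority queues.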
Work vs. Overhead
The performance of an algorithm depends on the amount of work it performs and on the overhead that this work incurs in a given runtime. Figure 9 shows work statistics comprising useful work (vertex distance was updated), useless work (distance was not updated), rejected work (distance updated but neighbors are not visited), and invalidated work (useful work overwritten by a better distance) for DC and our implementation of ∆-stepping in AM++ on a scale 31 graph. In the most efficient configurations of both algorithms, DC always executes more work than ∆-stepping. Despite consistently performing 10%-25% more work, DC performs better in all tests at scale (a 3-6 times speedup). This shows that synchronization and uneven distribution of work have an important effect on the performance of DGAs. Although one can attempt to mitigate work imbalance with algorithmic techniques, the cost of synchronization is hard to control and eliminate. In this regard, the underlying runtime can have a significant impact. The more an algorithm depends on keeping global information about the runtime (e.g., for load balancing), the higher the cost of the synchronization necessary to maintain that information. In Figure 9, we count a task as rejected when the vertex distance it delivers is higher than what is already recorded and, consequently, the task is not inserted into the priority queue of DC or a bucket of ∆-stepping. Invalidated tasks are similar to rejected tasks, but their distance expires while they wait in the priority queue.
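The rejected and invalidated categories can be illustrated with a sequential Python sketch of a priority-queue SSSP that counts tasks by outcome. This is a simplification under stated assumptions, not the measurement code behind Figure 9: the function `sssp_with_work_counts` and the adjacency-list format are hypothetical, a task whose distance equals the recorded one is also treated as rejected, and the useless-work category is not modeled.

```python
import heapq
from collections import Counter

def sssp_with_work_counts(graph, source):
    """graph maps each vertex to a list of (neighbor, weight) pairs."""
    dist = {v: float("inf") for v in graph}
    counts = Counter()
    heap = []

    def deliver(v, d):
        # Rejected: the delivered distance does not improve the recorded one,
        # so the task never enters the priority queue.
        if d >= dist[v]:
            counts["rejected"] += 1
            return
        dist[v] = d
        heapq.heappush(heap, (d, v))

    deliver(source, 0)
    while heap:
        d, v = heapq.heappop(heap)
        # Invalidated: a better distance arrived while the task was queued.
        if d > dist[v]:
            counts["invalidated"] += 1
            continue
        counts["useful"] += 1
        for w, weight in graph[v]:
            deliver(w, d + weight)
    return dist, counts
```

On the four-vertex graph `{"a": [("b", 4), ("c", 1)], "b": [("d", 1)], "c": [("b", 1), ("d", 5)], "d": [("a", 1)]}` with source `"a"`, the sketch counts four useful, two invalidated, and one rejected task. In a distributed setting the proportions depend on message ordering and therefore on the runtime.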
CONCLUSIONS
We demonstrate that the algorithm-level parts of DGAs that are reported as major contributions do not constitute a complete description of a DGA. A DGA consists of two equally important layers: the algorithm-level aspects and the runtime-level aspects, which respectively represent the top and the bottom of the software/hardware stack. Based on an analysis of a representative sample of DGAs, we further subdivide the layers into categories. We propose a template for reporting research design and results, and we demonstrate how to use it. Altogether, the goal is to make research results in DGAs more accessible, general, and congruent.
Our Tables 1 and 2 serve as a map for reporting the design features and the related quantities relevant for interpretability of experiments. Some runtime aspects may remain "buried in the stack", their impact unknown (e.g., the effects of job placement as in Sec. 2.1.2 are not usually investigated), and some may not be relevant in a given situation. A complete "report card" helps one understand which parts of the parameter space are covered and which are not. Our reporting template helps both consumers and authors of research, the former to understand and the latter to present contributions.
Our analysis and guidelines are a first step in unifying the field. We posit that the DGA research community should collectively develop a set of standards expected from top-notch research, acknowledging that DGAs exhibit particularly strong interaction with the software/hardware stack due to their irregularity. Thus, we appeal to the wider community to help develop standards for more explicit incorporation of runtime interactions in future research results and to collaborate on a continuously updated consensus on what constitutes the runtime of a DGA.
ACKNOWLEDGMENTS
This material is based upon work supported by the National Science Foundation under Grant Nos. 1111888 and 1319520, by the Department of Energy, National Nuclear Security Administration, under award number DE-NA0002377, and in part by Lilly Endowment, Inc.
