Abstract
This paper introduces a compiler optimization strategy for Software-Programmable Laboratories-on-a-Chip (SP-LoCs), which miniaturize and automate a wide variety of benchtop laboratory experiments. The compiler targets a specific class of SP-LoCs that manipulate discrete liquid droplets on a 2D grid, with cyber-physical feedback provided by integrated sensors and/or video monitoring equipment. The optimization strategy employed here aims to reduce the overhead of transporting fluids between operations, and explores tradeoffs between the latency and resource requirements of mixing operations: allocating more space for mixing shortens mixing time, but reduces the amount of spatial parallelism available to other operations. The compiler is empirically evaluated using a cycle-accurate simulator that mimics the behavior of the target SP-LoC. Our results show that a coalescing strategy, inspired by graph coloring register allocation, effectively reduces droplet transport latencies while speeding up the compiler and reducing its memory footprint. For biochemical reactions that are dominated by mixing operations, we observe a linear correlation between a preliminary result using a default mixing operation resource allocation and the percentage decrease in execution time that is achieved via resizing. 
Introduction
The past 20 years have witnessed the development of programmable, integrated micro-scale machines called laboratories-on-a-chip (LoCs), which can automate and miniaturize a number of laboratory functions that were previously performed by hand at the benchtop scale. While the majority of LoCs in use today are application-specific, single-use, and disposable, software-programmable (and reusable) LoCs (SP-LoCs) are also available. At present, SP-LoCs are programmed at a level of abstraction akin to machine code, i.e., by specifying a sequence of actuation and deactuation operations for each programmable element. Moreover, many "cyber-physical" SP-LoCs feature integrated sensors, which provide feedback to the software controlling them in real time. Thus, language and compiler support can help SP-LoCs gain traction and grow a user base. Prior work has made progress toward compiling high-level languages for execution on SP-LoCs, but has either been limited in scope to a single basic block, or altogether missed optimization opportunities that can speed up compilation and dramatically decrease execution time. This paper describes solutions to these shortcomings in the form of an optimizing compiler targeting a class of SP-LoCs called Digital Microfluidic Biochips (DMFBs). The optimization strategy crosses basic block boundaries by modeling placement of microfluidic operations on a reconfigurable processing array as a problem that generalizes graph coalescing, similar to graph coloring register allocation in traditional compilers. Fluid transport operations can be eliminated through a coalescing mechanism; when coalescing is not possible, fluid transport lengths can be reduced by incorporating knowledge of transport operations into placement.
The compiler also adjusts the size of mixing operations to improve performance: prior work has shown that allocating more space to each mixing operation reduces its latency [65]; however, doing so reduces the spatial parallelism available to other concurrently scheduled operations. The compiler accounts for all of the aforementioned information, yielding a clear and concise problem formulation that can be solved using either exact or heuristic means.
The paper is organized as follows: § 2 provides an overview of the SP-LoC technology that we target in this paper; § 3 presents the compiler and emphasizes the optimization problems that must be solved, along with their interactions. Sections 4 and 5 respectively present our implementation and simulation-based empirical evaluation, including comparison to prior work. Section 6 summarizes related work on DMFB compilation to put the contribution of this paper in context. Lastly, § 7 concludes the paper and outlines directions for future work.
Background

Language Design for SP-LoCs
An assay is a laboratory procedure that aims to assess the activity of a target entity, called the analyte; as an overgeneralization, we use the term assay to represent a biochemical "algorithm" that will execute on an SP-LoC. Ideally, the (bio-)chemist of the future will specify an assay using an appropriately designed domain-specific programming language (DSL). A compiler or interpreter will translate the specification into an executable format that will run on the SP-LoC. A number of domain-specific programming languages have been proposed for SP-LoCs [4, 5, 17, 18, 64, 83, 84, 86] ; while most of these languages are tied to specific SP-LoC technologies, any DSL compatible with DMFBs (see § 2.2) could be used as a front-end to the compiler presented here.
Digital Microfluidic Biochips (DMFBs)
The compiler described in this paper targets a class of SP-LoCs called Digital Microfluidic Biochips (DMFBs), which manipulate discrete droplets of fluid using electrostatic actuation [49, 60]. DMFBs exploit a physical phenomenon called electrowetting, shown in Fig. 1a: an electrostatic potential applied to a droplet at rest modifies its shape and angle of contact with the surface; droplet transport can then be achieved by activating and deactivating adjacent electrodes in sequence, as shown in Fig. 1b. An optional top "ground electrode" reduces the voltage required to move a droplet and improves the fidelity of on-chip operations.
Figure 1. (a) The electrowetting effect: applying an electrostatic potential to a droplet modifies its contact angle [49, 60]. (b) A droplet transport is achieved by activating and deactivating electrodes in sequence.
A DMFB is a 2D electrode array (Fig. 2a) which supports an instruction set consisting of five operations: store, transport, mix, merge, and split (Fig. 2b) [1, 26, 32, 59, 62, 68]. An "executable program" is a sequence of electrode activations supplied by a host PC or microcontroller. A compiler translates a text-based assay specification into an executable program [18, 64]. A DMFB is "reconfigurable" in the sense that each operation can be performed anywhere on the electrode array and any given electrode may contribute to different operations at different points in time during execution. A typical DMFB will integrate non-reconfigurable resources such as I/O reservoirs on its perimeter, as well as heaters [53] or optical detectors [51, 52, 78, 85] into the array itself. All five basic operations can be performed at the same location as a heater (when off) or a detector; however, heating and detection cannot be performed at arbitrary locations on-chip. Thus, a compiler must know the precise location of all I/O pads on the device perimeter and both the location and function of all other integrated components; these impose constraints that the compiler's code generator must satisfy.
Integration of sensors [1, 8, 16, 23, 43, 45, 46, 56, 61, 69, 73-76, 82] and online video monitoring [2, 3, 33, 36-39, 47, 54, 55, 66, 93] allows a CPU controlling a DMFB to obtain online feedback regarding the state of the assay during execution. At the language design level, this provides control flow: arbitrary computations can be performed on acquired sensory data, including predicates that resolve conditions at runtime [18, 31]. The compiler must ensure that all droplets are routed to the same location at the start of each basic block, regardless of which control paths are taken [18].
Figure 3. Overview of our DMFB compiler. The front-end compiles an assay specification to a CFG (not shown). The back-end converts the CFG to an executable format. The "Interference Graph", "coalescing", and "rescheduling" arrows are the novel aspects of this paper.
Mixing Modules
The latency of mixing two fluids depends on the number of electrodes that have been allocated to perform the mixing and also the routing path that the droplet takes within the mixer [65] (see Table 1). While larger mixers yield lower latency, they reduce the availability of spatial parallelism on-chip. The compiler described here includes a feedback loop that adjusts the size of different mixing operations in order to optimize performance.
Compiler
Overview
The assay is specified in a domain-specific language such as BioCoder [18] or BioScript [64] that seamlessly interleaves fluidic operations with computation. The target is a cyber-physical DMFB (Fig. 2) which provides sensory feedback to the runtime software that manages the device. This enables the programmer to specify assays featuring arbitrary control flow: the assay obtains sensory feedback from the device and performs computations on the acquired data; the result of the computation can be used as a condition which determines which fluidic operations to execute next. Our input language supports function calls, but does not support unbounded recursion. The compiler's preprocessor inlines all function calls, which converts the assay to one procedure. The input language restricts all fluidic variables to be scalars; it does not support fluidic arrays. We hope to relax these assumptions in the future.
Figure 4a depicts an assay specified in the BioScript language [64]. We first convert the assay to a hybrid computational-fluidic intermediate representation (IR) [18], as shown in Fig. 4b. This IR represents the assay as a Control Flow Graph (CFG). Next, we convert both fluidic and computational variables to Static Single Information (SSI) Form [6, 10, 77]; each basic block is represented as a hybrid fluidic/data dependence graph. Figures 4c and 4d respectively show the BioScript specification and hybrid IR converted to SSI Form: in this case, π- and ϕ-functions are inserted for one fluidic variable. Figure 3 outlines the subsequent steps of the compiler's back-end. The following subsections discuss each step in greater detail.
Scheduling
The first step is to schedule assay operations. Each basic block is scheduled individually. The scheduler ensures that each operation starts and finishes within the basic block containing it to ensure atomicity. Referring back to Table 1 , the scheduler assumes 2 × 2 mixers with 9.95s latencies; this assumption is later relaxed during Rescheduling ( § 3.5.1). O'Neal et al. [63] present the problem formulation and survey many scheduling heuristics that have been published to date.
The compiler infers droplet storage operations from the schedule and inserts them into the IR. The IR treats storage as an explicit operation that uses (and consumes) its input and defines a new output droplet. This may necessitate the insertion of additional π- and ϕ-functions to maintain SSI Form, as shown in Figs. 5a and 5b. This representation enables the placer (§ 3.5) to treat droplet storage the same as all other scheduled assay operations.
Figure 5. Our scheduler adds implicit store operations (a) and updates SSI form to generate a schedule (b) that captures the linear def-use chain that SSI form provides.
The scheduler enforces resource constraints that conservatively over-approximate placement. To simplify the discussion, we omit resource constraints involving I/O operations. The scheduler partitions the DMFB into N modules (Fig. 6). At any point in the schedule, a reconfigurable module can perform one mix, split, or merge operation, or can store up to k droplets, depending on its size. Any module that features an integrated heater or sensor can perform a heating or sensing operation as well; let the number of such modules be N_heat and N_sense respectively. Let r_j(p) be the number of operations of type j ∈ {mix, split, merge, store, heat, sense} scheduled at program point p.
Figure 6. A DMFB partitioned into a 2×2 array of modules exposed to the scheduler: one module has a heater and one has a sensor.
A legal schedule must satisfy the following constraints for each program point p:
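As a minimal sketch of how a scheduler could check resource constraints of this kind at a single program point, the Python below assumes one plausible form: every mix, split, merge, heat, and sense operation occupies one module, stored droplets pack k to a module, the total must not exceed N, and the heat and sense counts must not exceed N_heat and N_sense. This form, the function name `is_legal_point`, and the `counts` dictionary are assumptions made for illustration; they are not the paper's Eqs. (1) to (3).

```python
import math

def is_legal_point(counts, n_modules, n_heat, n_sense, k):
    """Check the assumed resource constraints at one program point.

    counts maps an operation type to r_j(p), the number of operations
    of that type scheduled at program point p (hypothetical layout).
    """
    reconfig = counts.get("mix", 0) + counts.get("split", 0) + counts.get("merge", 0)
    # Assumption: up to k stored droplets share one module.
    store_modules = math.ceil(counts.get("store", 0) / k)
    heat = counts.get("heat", 0)
    sense = counts.get("sense", 0)
    total = reconfig + store_modules + heat + sense
    return total <= n_modules and heat <= n_heat and sense <= n_sense
```

For the 2×2 module array of Fig. 6 (N = 4, one heater, one sensor), a point with one mix, two stored droplets (k = 2), and one heat operation would pass this check, while a point with two heat operations would not.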
Scheduling failures may occur and are unavoidable in the general case, even if the problem is solved optimally. If scheduling fails, the only option is to switch to a larger DMFB target, or rewrite the assay. During compilation, switching to larger and faster mixers (Table 1) increases the likelihood of failure, which is one reason why we default to the smallest, slowest mixer for the initial scheduling step. Failures due to module sizes are addressed in § 3.5.1.
Interference Graph
Definitions and Properties
Let G = (V, E, A) be the interference graph [11-13]: V is the set of assay operations, E is the set of interference edges, and A is the set of affinity edges that represent fluid transfers between operations. Let adj[o_i] and aff[o_i] denote the sets of interference and affinity neighbors of o_i.
Each vertex is labeled with a type, denoted type[o_i] ∈ {mix, split, merge, store, heat, sense}. As shorthand, and albeit a slight abuse of notation, we define a meta-type, reconfig, as the union of types mix, merge, split, and store.
The sets of interference and affinity neighbors of type t are respectively denoted adj_t[o_i] and aff_t[o_i].
Construction
An interference edge (o_i, o_j) ∈ E is placed between two operations o_i and o_j whose lifetimes overlap. Affinity edges arise from fluidic dependencies in the IR, including those arising between fluidic variables used and defined by the ϕ- and π-functions inserted during SSI construction [18, 64]. An affinity edge (o_k, o_l) ∈ A indicates that a droplet must be transported between the locations where o_k and o_l are placed. The transport operation can be eliminated if o_k and o_l are placed at the same location.
Affinity edges can only be inserted between "compatible" operations. For example, a mix operation is compatible with a heat operation because a mix can be scheduled on a DMFB module that includes an integrated heater (presumably turned off). On the other hand, heat and sense operations are incompatible: to date, no DMFB device has integrated a heater and sensor at the same on-chip location.
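The construction rules above can be sketched in a few lines of Python. The sketch assumes each operation carries a half-open lifetime [start, end) and a type, and that fluidic dependencies are given as (producer, consumer) pairs; the function name `build_interference_graph` and this data layout are hypothetical. It records only the heat/sense incompatibility named in the text and omits the resource gadget and I/O handling.

```python
from itertools import combinations

def build_interference_graph(ops, transfers):
    """Build interference edges E and affinity edges A.

    ops: {name: (start, end, type)} with half-open lifetimes [start, end).
    transfers: iterable of (producer, consumer) fluidic dependencies.
    """
    # Only incompatibility stated in the text: heat vs. sense.
    incompatible = {frozenset(("heat", "sense"))}

    interference = set()
    for a, b in combinations(sorted(ops), 2):
        (s1, e1, _), (s2, e2, _) = ops[a], ops[b]
        if s1 < e2 and s2 < e1:  # lifetimes overlap
            interference.add(frozenset((a, b)))

    affinity = set()
    for u, v in transfers:
        # Keep an affinity edge only between compatible operation types.
        if frozenset((ops[u][2], ops[v][2])) not in incompatible:
            affinity.add(frozenset((u, v)))
    return interference, affinity
```

In this sketch, a transfer from a heat operation to a sense operation yields no affinity edge, so the corresponding droplet transport can never be coalesced away.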
The interference graph includes a complete multipartite gadget (Fig. 7b) to make resource-related incompatibilities explicit. I/O operations bound to the same reservoir cannot interfere, while I/O operations bound to different reservoirs explicitly interfere. Without loss of generality, a sensing operation cannot be bound to a region of a DMFB that features an integrated heater, and vice-versa.
Figure 8a shows the interference graph corresponding to the assay in Fig. 4 after scheduling and storage insertion, and assuming that the target DMFB has at least two heaters. Instructions 1, 2, 3, 9, and 13 are statically bound to I/O reservoirs. Operation 5 (heat c) overlaps with operations 4, 6, and 7; droplet c is stored after operation 7 (detect d) completes. To conserve space, the interference graph omits the interference edges that belong to the gadget, i.e., the resource interferences between operation 7 (detect d) and the three heat operations (4, 5, and 12). Fluidic dependencies result in affinity edges:
Coalescing
Coalescing merges non-interfering affinity-related vertices in the interference graph to ensure that the corresponding operations are placed at the same on-chip location: this eliminates the need to transport droplets, which can reduce the burden on placement and routing (§§ 3.5 and 3.6), two NP-complete problems. Coalescing is implemented as an affinity edge contraction operation [11, 24, 40, 44]: given an affinity edge (o_i, o_j) ∈ A whose endpoints do not interfere, o_i and o_j are merged into a single vertex o_ij.
Figure 8a shows the interference graph derived from the scheduled CFG shown in Fig. 5b; Figs. 8b and 8c show two possible coalescing outcomes. In this example, Fig. 8c has coalesced more affinity edges than Fig. 8b. This, in turn, reduces the workload of the placer (§ 3.5) and router (§ 3.6) downstream. Coalescing here differs from register allocation in one key respect. Consider the example shown in Fig. 9a: when
, which may result in extended routes (Fig. 9b ). We instead maintain the affinity, allowing routes to be optimized by placing operations near each other (Fig. 9c ).
When reconfigurable operations of different dimensions are coalesced, the coalesced vertex is given the minimum rectangular dimensions that can accommodate its constituents (see Fig. 10a). A coalesced vertex takes the most restrictive type among {reconfig, heat, sense} of its constituents, as shown in Fig. 10b.
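The two rules for a coalesced vertex can be illustrated with a short Python sketch. It assumes a fixed module orientation, so the minimum accommodating rectangle is the component-wise maximum of the two footprints, and it ranks heat and sense as more restrictive than reconfig. The dictionary layout, the `RESTRICTIVENESS` table, and the function name `coalesce` are hypothetical.

```python
# Assumed ranking: reconfig is least restrictive; heat and sense are
# more restrictive and mutually incompatible.
RESTRICTIVENESS = {"reconfig": 0, "heat": 1, "sense": 1}

def coalesce(v1, v2):
    """Merge two non-interfering, affinity-related vertices.

    Each vertex is a dict: {"ops": set, "dims": (w, h), "type": str}.
    """
    t1, t2 = v1["type"], v2["type"]
    if {t1, t2} == {"heat", "sense"}:
        raise ValueError("heat and sense operations are incompatible")
    merged_type = t1 if RESTRICTIVENESS[t1] >= RESTRICTIVENESS[t2] else t2
    # Smallest rectangle covering both footprints (no rotation assumed).
    dims = (max(v1["dims"][0], v2["dims"][0]),
            max(v1["dims"][1], v2["dims"][1]))
    return {"ops": v1["ops"] | v2["ops"], "dims": dims, "type": merged_type}
```

Under these assumptions, coalescing a 2 × 3 reconfigurable operation with a 4 × 2 heat operation gives a 4 × 3 vertex of type heat.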
Next, we describe two important subroutines, followed by a description of two coalescing heuristics adapted for our constraints. In the discussion that follows, we talk about interference graph "vertices" rather than assay operations.
Simplification is a subroutine commonly used during register allocation, which we here adapt for our purposes. Any vertex that trivially satisfies the scheduling resource constraints above, but is not affinity-adjacent to any other vertices can be removed from the graph: the rationale is that a legal placement for the simplified vertex can always be found regardless of where all of its neighboring vertices are placed. Removing simplified vertices from the graph creates opportunities for new coalescing while also rendering other vertices simplifiable. Following repeated rounds of simplification, all vertices in the remaining graph can be placed. Simplified vertices can then be placed by processing them in reverse order of their removal.
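A minimal Python sketch of simplification follows. It assumes, by analogy with the degree-less-than-K test of graph coloring register allocation, that a vertex "trivially satisfies" the resource constraints when it has fewer than N interference neighbors; that test, the adjacency-dictionary layout, and the name `simplify` are assumptions of the sketch and may be looser or stricter than the paper's criterion.

```python
def simplify(adj, aff, n_modules):
    """Repeatedly remove simplifiable vertices from the graph.

    adj: {vertex: set of interference neighbors} (mutated in place).
    aff: {vertex: set of affinity neighbors}.
    Returns the removal stack; placing vertices in reverse order of
    removal corresponds to popping this stack.
    """
    stack = []
    changed = True
    while changed:
        changed = False
        for v in sorted(adj):
            # Assumed test: low interference degree and no affinity edges.
            if len(adj[v]) < n_modules and not aff.get(v):
                for u in adj.pop(v):
                    adj[u].discard(v)
                stack.append(v)
                changed = True
                break
    return stack
```

Vertices that still have affinity neighbors remain in the graph, where they are candidates for coalescing or freezing.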
Conservative Coalescing. Coalescing is conservative if the coalesced vertex o_ij and its interference neighbors satisfy the scheduler's resource constraints (Eqs. (1) to (3)), i.e.:
Coalescing Strategy
Iterated Coalescing, depicted in Fig. 11, is adapted from iterated register coalescing [24], but without spilling. The iterated coalescer simplifies the interference graph until it is not possible to do so any further. It then applies conservative coalescing; if coalescing occurs, further simplification is performed; otherwise, a low-degree vertex with at least one incident affinity edge is "frozen", i.e., the coalescer gives up hope of coalescing its incident affinity edges, thereby allowing the vertex to be simplified. Iterated coalescing terminates when all vertices have been removed via simplification. The graph is then rebuilt and passed to the placer. Conservatism is guaranteed by the observation that the initial interference graph, simplification process, and conservative coalescing strategy ensure that the scheduler's resource constraints are satisfied at each step of the heuristic.
Figure 10. The rectangular dimensions of a coalesced vertex are the minimum dimensions that can accommodate its constituent parts (a); a coalesced vertex takes on the type of its most restrictive module (b).
Figure 11. Phase ordering of Iterated Coalescing [24].
Placement
The placer determines the location on-chip where each assay operation will execute. A legal placement satisfies the constraint that operations o i and o j are placed at nonoverlapping positions for each interference edge (o i , o j ) ∈ E. Our compiler implements two distinct placement strategies that have been published elsewhere: Virtual Topology with Left-Edge Binder (VT-LEB) [28] and Keep All Maximal Empty Rectangles (KAMER) [7] . Prior work implemented these heuristics in a manner similar to linear scan register allocation [67] . Starting with a scheduled basic block, the placer scans each program point in sequential order: operations scheduled to complete at the previous time-step are removed from the current placement, and operations scheduled to begin at the subsequent time-step are added to the placement. Our compiler uses modified versions of VT-LEB and KAMER to perform placement on a coalesced interference graph rather than a scheduled CFG; vertices are processed one-by-one in a worklist sorted by the earliest time step.
When coalescing is performed, affinity relationships between interfering vertices may still exist, indicating exactly which vertices should be placed near each other; hence, after placing o_i, if aff[o_i] ≠ ∅, we recursively process affinity neighbors prior to returning to the sorted order (see Fig. 9c).
Let adj < [o i ] be the set of o i 's interference neighbors that precede o i in the computed order. Placement proceeds in a greedy fashion: operation o i can be placed at any position that does not overlap the position(s) where operations in adj < [o i ] have been placed. All vertices that have been coalesced with/into o i are placed at the same location. The resulting placement is guaranteed to be legal as it ensures that o i 's position never overlaps that of any vertices in adj[o i ]. VT-LEB guarantees that a legal placement can be found because it ensures that all placement decisions adhere to scheduling resource constraints. Further details regarding the placement heuristics are available in the supplemental materials.
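The greedy rule can be sketched in Python under strong simplifying assumptions: the chip is modeled as a list of equally sized module slots identified by (x, y) coordinates, module dimensions are ignored, and a vertex prefers the free slot closest (in Manhattan distance) to its already placed affinity neighbors. The sketch is neither VT-LEB nor KAMER, and it does not reorder the worklist for affinity neighbors; the function name `greedy_place` and its arguments are hypothetical.

```python
def greedy_place(order, adj, aff, slots):
    """Greedily assign each vertex a slot unused by placed neighbors.

    order: vertices sorted by earliest time step.
    adj / aff: {vertex: set of interference / affinity neighbors}.
    slots: list of (x, y) module positions.
    """
    placement = {}
    for v in order:
        # Slots held by interference neighbors that precede v in the order.
        taken = {placement[u] for u in adj.get(v, ()) if u in placement}
        free = [s for s in slots if s not in taken]
        if not free:
            raise RuntimeError(f"no legal position for {v}")
        near = [placement[u] for u in aff.get(v, ()) if u in placement]

        def cost(s):
            # Total Manhattan distance to placed affinity neighbors.
            return sum(abs(s[0] - p[0]) + abs(s[1] - p[1]) for p in near)

        placement[v] = min(free, key=cost)
    return placement
```

Because every vertex avoids the slots of its previously placed interference neighbors, no interference edge ends up with both endpoints on the same slot, which mirrors the legality argument above.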
Mix Operation Resizing and Rescheduling
The rescheduling loop in Fig. 3 enables the compiler to adjust the size of mixing operations (Table 1) to reduce assay execution time. The availability of space to accommodate larger mixing operations is not known until placement; on the other hand, the benefits of adjusting the latency of a mixing operation cannot be ascertained without rescheduling, and the updated schedule may change which fluidic variable live ranges overlap, thereby rendering the interference graph invalid. This observation necessitates the rescheduling loop.
The compiler uses a local search, which converges to a locally optimal solution, to adjust mixing operation sizes. When placing an interference graph, the first mixing operation or coalesced vertex o_i that contains at least one mix operation invokes Algorithm 1 to select an appropriate mixer size. The heuristic relies on two subroutines:
1. MaxParallel applies Dilworth's Theorem [20] to compute the width, i.e., the maximum number of operations that could be scheduled concurrently, of the basic block that contains o_i; if o_i contains multiple coalesced vertices, MaxParallel returns the maximum width among all of the basic blocks containing them.
2. CanFit computes the number of mixing modules of size s that can fit on a given DMFB architecture. Referring back to Fig. 6, CanFit is effectively the same subroutine that a scheduler would use to determine the resource constraints of the target chip.
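One standard way to compute the width of a dependence graph uses the constructive form of Dilworth's Theorem: the size of the largest antichain equals the minimum number of chains that cover the partial order, which is n minus a maximum bipartite matching over the transitive closure. The Python sketch below follows that approach; whether the compiler's MaxParallel is implemented this way is not stated, so the function `max_parallel` and its inputs (vertices numbered 0 to n-1, a list of dependence edges) are assumptions.

```python
def max_parallel(n, edges):
    """Width (largest antichain) of a DAG with vertices 0..n-1."""
    succ = {u: set() for u in range(n)}
    for u, v in edges:
        succ[u].add(v)

    # Transitive closure: reach[u] holds every vertex reachable from u.
    reach = {}

    def closure(u):
        if u not in reach:
            result = set()
            for v in succ[u]:
                result.add(v)
                result |= closure(v)
            reach[u] = result
        return reach[u]

    for u in range(n):
        closure(u)

    # Maximum bipartite matching (augmenting paths) on the closure.
    match = {}  # right-side vertex -> matched left-side vertex

    def augment(u, seen):
        for v in reach[u]:
            if v not in seen:
                seen.add(v)
                if v not in match or augment(match[v], seen):
                    match[v] = u
                    return True
        return False

    matched = sum(augment(u, set()) for u in range(n))
    return n - matched  # minimum chain cover = width, by Dilworth
```

For a diamond-shaped dependence graph (0 → 1, 0 → 2, 1 → 3, 2 → 3), the two middle operations are independent, and the function returns a width of 2.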
The heuristic first checks whether modules of o_i's scheduled size can fit MaxParallel operations (as computed by CanFit). If more parallelism is available than what is currently scheduled, it checks whether smaller modules can fit more operations than the currently scheduled size, and continues until it finds a module size for which CanFit reaches MaxParallel operations.
The heuristic will increase o_i's size in two cases:
When the size of a mixing operation is updated, the sizes of any other mixing operations that are coalesced with it are updated as well. If a mixing operation is updated during placement, its latency is scaled as per Eq. (7) and the compiler loops back to scheduling:
$$t' = t \cdot \frac{\mathit{latency}_{new}}{\mathit{latency}_{old}} \tag{7}$$
For example, if the compiler changes the module assigned to a 10 second mix operation from 2 × 3 to 2 × 4, then the compiler computes the new latency as t' = 10 · 2.9/6.1 ≈ 4.754 seconds. The compiler rounds the new latency up to the next millisecond (here, 4.755 seconds). The compiler exits the loop and continues on to droplet routing when either (1) module sizes are not updated during placement, so rescheduling is unnecessary, or (2) an iteration of the rescheduling loop fails during scheduling or placement. In the case of (2), we revert to the last legal schedule and placement found. The interested reader can find an example of module resizing in the supplemental materials.
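The rescaling of Eq. (7), including the round-up to the next millisecond, can be checked with a few lines of Python. The latency table holds only the three mixer latencies quoted in this text (9.95 s for 2 × 2, 6.1 s for 2 × 3, 2.9 s for 2 × 4); the names `MIX_LATENCY_S` and `rescale_latency` are hypothetical.

```python
import math

# Mixer latencies in seconds, as quoted in the text for these sizes.
MIX_LATENCY_S = {(2, 2): 9.95, (2, 3): 6.1, (2, 4): 2.9}

def rescale_latency(t, old_size, new_size, table=MIX_LATENCY_S):
    """Scale a mix latency t (seconds) per Eq. (7), rounded up to 1 ms."""
    scaled = t * table[new_size] / table[old_size]
    # Round to 1e-6 ms first so floating-point noise cannot add a spurious ms.
    millis = math.ceil(round(scaled * 1000, 6))
    return millis / 1000

print(rescale_latency(10, (2, 3), (2, 4)))  # 4.755
```

The printed value matches the worked example: 10 · 2.9/6.1 ≈ 4.7541 seconds, rounded up to 4.755 seconds.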
Droplet Routing
Once a legal placement solution is obtained, each droplet must be routed from its source to its destination; many papers published in the past 15 years have described routing algorithms, and in principle any can be used [9, 15, 34, 41, 42, 71, 72, 81, 91] . The most advanced routers also integrate washing operations to eliminate cross-contamination [35, 89, 92] . The only additional requirement is that droplet routes must be inserted at basic block boundaries; our compiler implements these routes as part of SSI elimination [18] .
Implementation
Our compiler targets an open-source cycle-accurate DMFB simulator [27, 29]; we modified a back-end that can statically compile CFGs [18], and rely on the simulator to report execution time. We used a collection of benchmarks specified using the BioScript language, which is compatible with the framework's static compilation model [64]. Our compiler uses List Scheduling [28, 80], the VT-LEB [28] and KAMER [7] placers, and a greedy, yet effective, droplet router [28, 71].
Algorithm 1
    current ← o_i.size
    choice ← current
    max ← MaxParallel(b)
    currNum ← CanFit(current)
    updated ← False
    if max > currNum then
        chosenNum ← currNum
        smaller ← current
        while smaller ≠ smallest do
            smaller ← decrease(smaller)
            check ← CanFit(smaller)
            if check > chosenNum then
                choice ← smaller
                updated ← True
                chosenNum ← check
            if chosenNum ≥ max then break
    if updated = False then
        larger ← increase(current)
        check ← CanFit(larger)
        while check = currNum OR check ≥ max do
            choice ← larger
            if larger = largest then break
            larger ← increase(choice)
            check ← CanFit(larger)
    o_i.size ← choice
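The size-selection heuristic of Algorithm 1 can be sketched as runnable Python. The sketch assumes the candidate mixer sizes are supplied as a list ordered from smallest to largest, that `can_fit` is a caller-provided function returning how many modules of a size fit on the chip, and that the grow step sits outside the shrink branch; the function name `select_mixer_size`, the ordering of sizes, and the capacity numbers in the usage note are all hypothetical.

```python
def select_mixer_size(current, sizes, can_fit, max_parallel):
    """Pick a mixer size following the structure of Algorithm 1.

    sizes: candidate sizes ordered smallest to largest (assumed order).
    can_fit(size) -> number of modules of that size that fit on the chip.
    max_parallel: width of the basic block(s) containing the operation.
    """
    idx = sizes.index(current)
    choice = current
    curr_num = can_fit(current)
    updated = False

    if max_parallel > curr_num:
        # More parallelism is available: try progressively smaller modules.
        chosen_num = curr_num
        i = idx
        while i > 0:
            i -= 1
            check = can_fit(sizes[i])
            if check > chosen_num:
                choice, chosen_num, updated = sizes[i], check, True
            if chosen_num >= max_parallel:
                break

    if not updated:
        # Grow while capacity is unchanged or still covers max_parallel.
        i = idx
        while i + 1 < len(sizes):
            check = can_fit(sizes[i + 1])
            if not (check == curr_num or check >= max_parallel):
                break
            i += 1
            choice = sizes[i]
    return choice
```

With made-up capacities of 6, 4, 4, and 2 modules for sizes 1×4, 2×2, 2×3, and 2×4, a 2×2 mixer in a block of width 2 grows to 2×4, while the same mixer in a block of width 6 shrinks to 1×4.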
Interference graph construction and coalescing are performed after scheduling (Fig. 3). Coalescing is abstracted away from placement so that any existing placement heuristic could be modified easily to operate on a coalesced interference graph. While rescheduling is abstracted away from placement, the resizing operations, by necessity, must be performed during placement, which necessitates a substantial revamp of the heuristic. Our current implementation is only compatible with placers that place operations one-at-a-time in a greedy fashion. A further discussion of the necessary modifications is available in the supplemental materials.
Evaluation
Even though we support physical chips, the expense associated with their use is prohibitive for evaluation; hence, we evaluate our compiler through simulation-based empirical studies on known real-world assays specified for execution on DMFBs. Specifically, we aim to evaluate the impact of coalescing and mix operation resizing on compilation and assay execution time. All reported averages use the geometric mean over the ratios of each benchmark to avoid providing too much weight to longer- or shorter-running benchmarks [22].
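The averaging scheme can be made concrete with a short Python sketch that computes the geometric mean of per-benchmark ratios. The function name `geomean_ratio` and the example numbers are hypothetical and are not measurements from this paper.

```python
import math

def geomean_ratio(baseline, optimized):
    """Geometric mean of optimized/baseline ratios, one per benchmark."""
    ratios = [o / b for b, o in zip(baseline, optimized)]
    return math.exp(sum(math.log(r) for r in ratios) / len(ratios))

# Made-up times: one benchmark halves, one doubles; the mean ratio is 1.
print(geomean_ratio([100.0, 10.0], [50.0, 20.0]))
```

Because each benchmark contributes only its ratio, a long-running assay carries the same weight as a short one, which is the property the text relies on.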
Experimental Setup
Experiments were performed on a machine with a 2.7 GHz Intel® Core™ i7 processor and 8 GB of RAM, running macOS®. We compare directly against two previously published compilers [18, 64] using an identical 15 × 19 DMFB architecture. We also report results on 15 × 15, 12 × 12, and 8 × 8 DMFBs to evaluate the impact of our mix operation resizing heuristic.
Benchmarks
Our evaluation uses a set of DMFB benchmarks that were previously used to evaluate the two compilers that we use as a baseline for comparison. Ref. [18] specified them using a variant of the BioCoder language, which is now deprecated; Ref. [64], as well as this work, uses the replacement BioScript language; a detailed summary of the benchmarks is given in the supplemental materials of [64].
Baseline Compilers
The DMFB compilers we compare against do not employ coalescing or mix operation resizing: Ref. [18] compiles a CFG one basic block at a time using the standard VT-LEB algorithm for placement [28], eschewing optimizations across basic block boundaries. Ref. [64] employs the NSGA-II [19] metaheuristic for placement. The NSGA-II placer attempts to maximize the number of affinity-adjacent operations that are placed at the same location, and to place affinity-adjacent operations that interfere near one another in order to reduce droplet routing paths, but it does not employ coalescing. The runtime of NSGA-II depends on a complex set of parameter values; to get good results, it needs to run much longer than a greedy heuristic such as VT-LEB or KAMER.
Results and Analysis
Table 2 compares simulated assay execution times previously reported for the two baseline compilers [18, 64] to three configurations of the compiler presented here: VT-LEB placement plus coalescing (VC), KAMER placement plus coalescing (KC), and KAMER placement plus both coalescing and mix operation resizing (KCR). On average, VC, KC, and KCR reduce assay execution time by 1.1%, 1.2%, and 25.0% respectively. These results are not surprising, as assay execution time is known to be dominated by schedule latency, not droplet routing time [81]: coalescing aims to reduce droplet routing overhead, while mix operation resizing can lead to shorter schedules. We observed that convergence typically occurs after 2 iterations of rescheduling when resizing is enabled.
The improvements reported for VC and KC over Ref. [64] indicate situations where coalescing turns out to be more effective than the NSGA-II placer; however, NSGA-II may discover different (and possibly better) solutions if the random number seed and other configuration parameters are varied.
Future work may extend the NSGA-II placer to utilize a coalesced interference graph; the amount of work required to extend the NSGA-II placer with resizing capabilities (which would entail re-scheduling and re-placing at every perturbation) is prohibitive, so we did not evaluate this option.
The compiler described in Ref. [18] utilizes the same placer as VC, sans coalescing. Adding coalescing capabilities yielded marginal improvements, due to the fact that droplet routing does not dominate total assay execution time.
Mix operation resizing had a more profound impact on total assay execution time than coalescing. Furthermore, Fig. 12 depicts an observed linear correlation between the amount of mixing time that an assay specifies and the percentage decrease in execution time that we expect to achieve via resizing across DMFBs of varying size. At the smallest size, 8 × 8 (Fig. 12d), resizing allows us to compile several assays that failed to compile successfully without this optimization turned on. Through inspection, we determined that our resizing heuristic was able to provide the minimum required parallelism for these assays by using a 1 × 4 module size; the default 2 × 2 mixer did not provide enough room for a legal schedule.
Table 3 provides details on how coalescing impacts the placer's workload and droplet routing time. On average, coalescing reduces the number of operations that are placed by 77%; this, in turn, reduces the amount of work that needs to be done during both placement and routing. In terms of overall performance impact, the VC and KC placers reduced droplet routing times by 9.4% and 8.6% compared to the baseline.
Related Work
The majority of work on DMFB compilation targets devices that do not feature sensory feedback or control flow; as such, the scope of compilation was limited to programs consisting of a single basic block. Discrete formulations of the various compilation stages of scheduling [21, 30, 50, 63, 70, 80], placement [14, 28, 48, 57, 58, 79, 87, 88, 90], and droplet routing [9, 15, 34, 41, 42, 71, 72, 81, 91] were explored, along with wash-droplet techniques [35, 89, 92] that eliminate contamination on the surface of the chip. The compiler described here is a general framework and could implement any of these algorithms.
Early work on DMFB compilation featuring control flow targeted online error detection and recovery for the single basic-block compilation model described above [2, 3, 33, 36-39, 47, 54, 55, 66, 93]. With appropriate extensions to handle CFGs, these techniques could be integrated into the runtime system that executes assays compiled using the techniques described here on a DMFB; it is beyond the scope of this work to design and evaluate such techniques. This work builds directly on two prior papers that described techniques for DMFB compilers. The first [18] introduced the hybrid computational-fluidic IR used in this paper, and demonstrated how to compile a CFG: each basic block could be compiled individually, with additional droplet routes inserted at control flow edges. These routes ensure that each basic block begins with its incoming droplets at the same position regardless of which control path is taken leading into that basic block. A subsequent paper [64] introduced the BioScript language (which we use here) and represents the first attempt to optimize placement at the granularity of a CFG, as opposed to individual basic blocks; this provided the ability to optimize the additional droplet routes inserted by the earlier compiler [18]. Placement in that work relied on an iterative improvement metaheuristic, which ran slowly but generated locally optimal solutions. The contributions of this paper are threefold: (1) coalescing as a placement strategy; (2) faster-running heuristics that can handle placement at CFG granularity; and (3) mixing operation resizing, which has a much greater impact on performance than coalescing.
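The role of the routes inserted at control flow edges can be illustrated with a small sketch. It assumes, hypothetically, that each basic block records where every live droplet sits when the block exits (`exit_pos`) and where the block expects it on entry (`entry_pos`); the function `edge_routes`, the block names, and the coordinates are illustrative only, and Manhattan distance is used as a lower bound on route length rather than as an actual router.

```python
def edge_routes(cfg_edges, exit_pos, entry_pos):
    """For each CFG edge, list the droplets that must be moved.

    cfg_edges -- list of (src_block, dst_block) pairs
    exit_pos  -- exit_pos[block][droplet] = (x, y) when the block finishes
    entry_pos -- entry_pos[block][droplet] = (x, y) expected when it starts
    Returns {edge: {droplet: (start, target, manhattan_distance)}};
    droplets that are already in place are omitted.
    """
    routes = {}
    for src, dst in cfg_edges:
        moves = {}
        for droplet, target in entry_pos[dst].items():
            start = exit_pos[src][droplet]
            dist = abs(start[0] - target[0]) + abs(start[1] - target[1])
            if dist:  # a route is needed only if the positions differ
                moves[droplet] = (start, target, dist)
        routes[(src, dst)] = moves
    return routes


# Hypothetical loop: A -> B, B -> B (back edge), B -> C, carrying one droplet "d".
cfg_edges = [("A", "B"), ("B", "B"), ("B", "C")]
exit_pos = {"A": {"d": (2, 3)}, "B": {"d": (6, 3)}}
entry_pos = {"B": {"d": (2, 3)}, "C": {"d": (6, 3)}}

routes = edge_routes(cfg_edges, exit_pos, entry_pos)
total = sum(m[2] for moves in routes.values() for m in moves.values())
print(routes[("B", "B")], "total extra distance:", total)
```

Here only the back edge requires a route, because block B must always find "d" at the same entry position no matter which predecessor ran last. Placement at CFG granularity tries to choose positions so that as many of these edge routes as possible have zero length.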
Another approach, which is orthogonal to what we propose here, is to interpret assays online, rather than compile them offline [31, 86]. The interpreter JIT-compiles each basic block in an on-demand fashion, emphasizing compilation speed over solution quality. To the best of our knowledge, prior work has not attempted to JIT-compile an assay at the granularity of the CFG; any such approach could build on the techniques used here, noting that the runtime overhead of mix operation resizing may be prohibitive. Further, there is a complex interplay between coalescing and module resizing, as resizing may affect interferences across the CFG during rescheduling; hence, the combination of these optimizations is not well suited for online compilation.
Conclusion and Future Work
This paper described the framework of an optimizing compiler for DMFBs. The key innovations were twofold: a formulation of the placement problem for CFGs that shares many principal similarities with register allocation [12, 13], which enabled the adaptation of register coalescing techniques [11, 24] to eliminate otherwise spurious droplet routes, and a mix operation resizing step to reduce schedule latency. While there is certainly room to investigate more effective heuristics that solve the various problems within the compiler, we believe that the general back-end framework presented here represents the correct way to model the constituent optimization problems that must be solved, along with their interactions. Moreover, we believe that the most important topics for future investigation start at the programming language level, for example, determining how to support function calls, fluidic arrays, and fluidic SIMD operations. Additionally, there is a need to port BioScript (and/or other similar languages) to a variety of SP-LoC targets in addition to DMFBs.
