As advancements in CMOS technology trend toward ever increasing core counts in chip multiprocessors for high-performance embedded computing, the discrepancy between on- and off-chip communication bandwidth continues to widen due to the power and spatial constraints of electronic off-chip signaling. Silicon photonics-based communication offers many advantages over electronics for network-on-chip design, namely power consumption that is effectively agnostic to distance traveled at the chip- and board-scale, even across chip boundaries. In this work we develop a design for a photonic network-on-chip with integrated DRAM I/O interfaces and compare its performance to similar electronic solutions using a detailed network-on-chip simulation. When used in a circuit-switched network, silicon nanophotonic switches offer higher bandwidth density and low power transmission, adding up to over 10× better performance and 3-5× lower power over the baseline for projective transform, matrix multiply, and Fast Fourier Transform (FFT), all key algorithms in embedded real-time signal and image processing.
I. INTRODUCTION
Many important classes of applications including personal mobile devices, image processing, avionics, and defense applications such as aerial surveillance require the design of high-performance embedded systems. These systems are characterized by a combination of real-time performance requirements, the need for fast streaming access to memory, and very stringent energy constraints [12], [46], [50]. While commodity general purpose processors offer a cheap and customizable solution, they typically do not meet the power and performance requirements for the systems in question. For this reason, specialized chip multiprocessors (CMPs) are used.
As the number of cores in CMPs scales to provide greater on-chip computational power, communication becomes an increasing contributor to power and performance. The gap between the available off-chip bandwidth and that which is required to appropriately feed the processors continues to widen under current memory access architectures. For many high-performance embedded computing applications, the bandwidth available for both on- and off-chip communications can play a vital role in efficient execution due to the use of data-parallel or data-centric algorithms.
§ This work is sponsored by Defense Advanced Research Projects Agency (DARPA) under Air Force contract FA8721-05-C-0002, DARPA MTO under grant ARL-W911NF-08-1-0127, the NSF (Award #: 0811012), and the FCRP Interconnect Focus Center (IFC). Opinions, interpretations, conclusions and recommendations are those of the author and are not necessarily endorsed by the United States Government.
Unfortunately, current electronic memory access architectures have the following characteristics that will impede performance scaling and energy efficiency for applications that require large memory bandwidths:
• Distance-Dependent. Electronic I/O wires must often be path-length matched to reduce clock skew. In addition, limitations on the length of these wires constrain board layout and scalability [19].
• Low I/O Density. Electronic I/O wires are predicted to have pitches on the order of 80 microns [18]. Increasing the available off-chip communication bandwidth will become difficult while staying within manageable pin counts.
• Low I/O Frequencies. Driving long I/O wires requires lower frequencies, currently up to 1600 MT/s with the most recent DDR3 implementation [32].
Recent advances in silicon nanophotonic devices and integration have made it possible to consider optical transmission on the chip- and board-scale [7], [28]. Microprocessor I/O signaling can directly benefit from photonics in the following ways:
• Distance-Independent. Optical transmission of data can be made agnostic to distance at the chip- and board-scale; photonic energy dissipation is effectively not a function of distance.
• Data-rate Transparent. Most photonic devices, including switches as well as on- and off-chip waveguides, are not bitrate-dependent, providing a natural bandwidth match between compute cores and the memory subsystem.
• High Bandwidth Density. Waveguides crossing the chip boundary can have a similar pitch to that of electronics [41], which makes the bandwidth density of nanophotonics using wavelength division multiplexing (WDM) orders of magnitude higher than electronic wires.
Though photonics can offer significant physical-layer advantages, constructing a memory access architecture to realize them requires significant design space exploration. Trade-offs exist in the selection of specific components, architectures, and protocols. Our approach to this problem employs a single circuit-switched photonic network-on-chip (NoC) design, enabling both core-to-core and core-to-DRAM communication, both of which are necessary for efficiently implementing programming models such as PGAS [4].
In this work, we study the problem of designing a NoC architecture for an embedded computing platform that supports both on-chip communication and off-chip memory access in a power-efficient way. In particular, we propose the adoption of circuit-switched NoC architectures that rely on a simple mechanism to switch circuit paths off-chip to exchange data with the DRAM memory modules. While this method is presented independently of the particular transmission technology, we show the advantages offered by an implementation based on photonic communication over an electronic one.
We simulate this memory access architecture on a 256-core chip with a concentrated 64-node network using detailed traces of computation kernels widely used in signal and image processing high-performance embedded applications, specifically the projective transformation, matrix multiply, and Fast Fourier Transform (FFT). This work accomplishes the first complete detailed simulation of a nanophotonic NoC with physically accurate photonic device models coupled with cycle-accurate DRAM device and control models. These simulations are used to determine the benefits of circuit-switching and silicon photonic technology in CMP memory access performance.
II. PACKET-SWITCHED MEMORY ACCESS
Packet-switched NoCs use router buffers to store and forward small packets through the network, where a packet is a small number of flits (flow control units). Typically, purely electronic store-and-forward routers use multiple physical buffers to implement virtual channels, alleviating head-of-line blocking under congestion. An illustration of a pipelined router can be seen in Figure 1. If a core-to-DRAM or core-to-core application-level message is larger than the physical buffers themselves, or larger than the flow control mechanism can reasonably sustain without deadlock, it must be broken into several smaller packets.
The structure of a packet-switched NoC has important implications on how memory accesses are performed. Typically, multiple on-chip memory controllers distributed around the periphery of a CMP service requests from all the cores. If a memory controller receives packets from different cores (different messages), it must then schedule memory transactions with potentially disparate addresses. Indeed, the memory controller depends on this paradigm to optimize the utilization of the data and control buses using rank and bank concurrency.
Figure 2 (a) shows the basic protocol of a single memory transaction. The row address is latched into the DRAM chip with the row address select (RAS) signal for the row access time (tRAS) until the decoded row is driven into the sense amps. After the row-column delay time (tRCD), the column address then selects the starting point in the array, using the column address select (CAS) signal. A write enable (WE) signal determines whether the I/O circuitry is accepting data from the bus or pushing data onto it. Data is then read or written after the column-access latency (tCL), incrementing the initial column address in a burst. Once the transaction is complete, depending on the control policy, the row can be closed and must be precharged (PRE) for a time tPRE.
Figure 2 (b) shows how a contemporary DRAM memory controller schedules transactions concurrently across banks, chips, and ranks to maximize performance and hide the access latency. There exist different control policies to manage queued transactions for lower latency and higher throughput, both dynamic in the memory controller (e.g. page mode), and static at compile-time [29] . The burst length is usually fixed in this configuration, matching the on-chip cache-line size. Allowing a variable burst length would introduce significant complexity to the scheduling mechanism.
Typical DRAM subsystems implemented this way have been effective for providing short latencies for small, random accesses, as required by contemporary cache miss access patterns. However, providing the increasing bandwidth required by future embedded applications will come at the cost of power consumption in the on-chip interconnect, due partially to the relationship of the amount of network buffering to performance.
III. MEMORY ACCESS FOR EMBEDDED COMPUTING
Embedded processors are devices typically found in mobile or extreme environments, and their design is commonly driven by the needs of the application in question. They frequently require specialized hardware or software, or commonly both, to efficiently meet their performance, power, and reliability requirements. Because of this, a hardware/software co-design approach is generally taken [31].
Of key consideration to this work are embedded applications that involve signal and image processing (SIP). These applications typically require the aggregation and processing of many data points collected from various locations over a period of time, originating from sensors or other continuous data streams. A typical example of this is a camera or other sensor placed on an unmanned air vehicle (UAV). Applications in this domain require significant computing power in the form of high bandwidth data access and streaming processing capabilities. In addition, they must achieve this using a low power budget.
In these applications, data is typically placed in contiguous blocks of an embedded computing system's memory space around a central CMP via direct memory access (DMA) or a similar mechanism by incoming data streams. The memory access system outlined in this section proposes to make use of the fact that these contiguous blocks of data can be accessed using long burst lengths. The application can exhibit very dynamic communication patterns between individual cores and banks of memory, all while making use of efficient memory access circuits.
A. Circuit-Switched Memory Access
In a circuit-switched network, a control network provides a mechanism for setting up and tearing down energy-efficient high-bandwidth end-to-end circuit paths. If a network node wishes to send data to another node, a PATH-SETUP message is sent to reserve the network resources needed to allocate the path. A PATH-BLOCKED message is returned to the node if some part of the path is currently reserved by another circuit. A PATH-ACK message is returned if the path setup successfully reaches the end node. After data is transmitted along the data plane, a PATH-TEARDOWN message is sent from the sending node to release network resources for other paths.
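The four control messages can be illustrated with a minimal Python sketch. It is not the simulator's implementation: the class `CircuitNetwork`, its method names, and the representation of a path as a list of node identifiers are hypothetical, and the sketch assumes that a blocked setup releases any links it had already reserved.

```python
from dataclasses import dataclass, field


@dataclass
class CircuitNetwork:
    """Hypothetical bookkeeping of which directed links each circuit holds."""
    reserved: dict = field(default_factory=dict)  # (node, node) link -> circuit id

    def path_setup(self, circuit_id, path):
        """Reserve links hop by hop, as a PATH-SETUP message would.

        Returns "PATH-ACK" if every link was free. Otherwise releases the
        links taken so far (an assumption) and returns "PATH-BLOCKED".
        """
        taken = []
        for link in zip(path, path[1:]):
            if link in self.reserved:
                for held in taken:
                    del self.reserved[held]
                return "PATH-BLOCKED"
            self.reserved[link] = circuit_id
            taken.append(link)
        return "PATH-ACK"

    def path_teardown(self, circuit_id):
        """Release every link held by the circuit, as PATH-TEARDOWN would."""
        held = [link for link, owner in self.reserved.items() if owner == circuit_id]
        for link in held:
            del self.reserved[link]


net = CircuitNetwork()
first = net.path_setup("A", [0, 1, 2, 3])    # all links free
second = net.path_setup("B", [5, 1, 2, 6])   # link (1, 2) is held by circuit A
net.path_teardown("A")
third = net.path_setup("B", [5, 1, 2, 6])    # succeeds once A has released its path
```

Here the second request fails only because it shares one link with an existing circuit, which is the condition that PATH-BLOCKED reports.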
This method effectively relaxes the relationship between router buffer size, a large contributor (> 30%) to NoC power [21], and performance because router buffers do not become directly congested as communication demands grow. Figure 3 shows the router architecture for a circuit-switched NoC. The control network uses smaller buffers and channels to transmit the small control messages, which reduces the total amount of buffering (and thus power) in the network. Because the higher-bandwidth data plane is circuit switched end-to-end, it suffers from higher latency due to the circuit-path setup overhead, which must be amortized through a combination of larger messages and well-scheduled or time-division multiplexed communication patterns.
Aside from the power savings advantage, we can also considerably decrease the complexity of the memory controller through circuit-switching. We propose to allow a circuit-switched on-chip network to directly access memory modules, giving a single core exclusive access to a memory module for the duration of the transaction it requested. Access overhead is amortized using increased burst lengths as shown in Figure 2 (c). The memory controller complexity can be greatly reduced because a memory module must sustain only one transaction at a time. The key difference is that each transaction is an entire message using long burst lengths, as opposed to small packets that must be properly scheduled. In addition, variable burst lengths are inherently supported without introducing additional complexity.
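The amortization argument can be made concrete with a first-order model: if each transaction pays a fixed overhead before data streams at full rate, the fraction of time spent moving data is data time divided by data time plus overhead. The sketch below is illustrative only; the function name and the example values (50 ns of overhead, 10 GB/s of bandwidth, 64 B and 4 KiB messages) are assumptions and are not taken from the simulated configurations.

```python
def transfer_efficiency(message_bytes, bandwidth_gbytes_per_s, overhead_ns):
    """Fraction of a transaction's duration spent transferring data.

    Assumes a fixed per-transaction overhead (path setup plus DRAM access
    latencies) followed by streaming at the full link bandwidth.
    """
    data_ns = message_bytes / bandwidth_gbytes_per_s  # bytes / (GB/s) gives ns
    return data_ns / (data_ns + overhead_ns)


# Hypothetical values chosen only to show the trend.
short_burst = transfer_efficiency(64, 10.0, 50.0)     # cache-line-sized message
long_burst = transfer_efficiency(4096, 10.0, 50.0)    # long contiguous block
```

Under these assumed numbers the short transfer spends most of its time on overhead, while the long burst spends most of its time moving data, which is the effect that long burst lengths are meant to exploit.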
To facilitate switching on-chip circuit paths off chip to memory modules, we place memory access points (MAPs) around the periphery of the chip connected to the network. These MAPs, shown in Figure 4 , contain a memory controller that can service memory transactions and use the NoC to allow end-to-end communication between cores and DRAM modules. Figure 5 shows the logic behind this control.
Read transactions are first sent as small control messages to the memory controller. If another transaction is currently in progress at the MAP, this request is then queued up. Once a read is started, it first sets up the data switch for communication from the memory controller to the memory module (for DRAM commands) and from memory module back to the core (for returning read data). A circuit-path is then established back to the core via the NoC path-setup mechanism. The memory controller can then issue row and column access commands, allowing the memory module to freely send data back to the core. The memory controller is responsible for knowing the access time of the read, so that it can issue a PATH-TEARDOWN at the correct time (labeled 1 in Figure 5 ), which completes the transaction.
Writes begin with a core setting up a circuit-path to a MAP. By virtue of a PATH-SETUP message successfully arriving at the MAP, the core will have gained exclusive access to it. A write that arrives at a MAP that is servicing a read is returned to the core as a blocked path (labeled 2) instead of being queued, which releases network resources for other transactions (including the read path setup that may be under way). The memory controller then sets up the data switch from memory controller to memory, which allows the transmission of DRAM row/col access commands. The data switch is then set from core to memory module, and a PATH-ACK is sent back to the core, completing the path setup. Upon receiving the path acknowledgment, the core then begins transmitting write data directly to the memory module. The memory controller considers the transaction finished when it receives a PATH-TEARDOWN from the core (labeled 3). In this way, any core in the network can establish a direct, end-to-end circuit path with any memory module.
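The read and write handling described above can be summarized as a small arbitration sketch. It is a simplification under stated assumptions, not the control logic of Figure 5: the class and method names are hypothetical, data-switch configuration and DRAM commands are omitted, and a write that finds the MAP busy for any reason is treated as blocked.

```python
from collections import deque


class MemoryAccessPoint:
    """Hypothetical MAP arbiter: one transaction at a time, reads queue, writes block."""

    def __init__(self):
        self.current = None        # (kind, core) of the transaction in progress
        self.read_queue = deque()  # cores whose read requests are waiting

    def request_read(self, core):
        """A read arrives as a small control message and is queued if the MAP is busy."""
        if self.current is not None:
            self.read_queue.append(core)
            return "QUEUED"
        self.current = ("read", core)
        return "READ-STARTED"

    def request_write(self, core):
        """A write arrives as a PATH-SETUP and is blocked, not queued, if the MAP is busy."""
        if self.current is not None:
            return "PATH-BLOCKED"
        self.current = ("write", core)
        return "PATH-ACK"

    def teardown(self):
        """PATH-TEARDOWN ends the transaction; the next queued read, if any, starts."""
        self.current = None
        if self.read_queue:
            self.current = ("read", self.read_queue.popleft())
        return self.current


m = MemoryAccessPoint()
r1 = m.request_read(3)     # MAP idle, read starts
r2 = m.request_read(7)     # MAP busy, read is queued
w1 = m.request_write(9)    # MAP busy, write path is blocked
nxt = m.teardown()         # first read ends, queued read from core 7 starts
idle = m.teardown()        # nothing left to service
w2 = m.request_write(9)    # retry now succeeds
```

The asymmetry between queued reads and blocked writes in this sketch mirrors the starvation risk for writes that the text goes on to mention.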
Livelock is avoided by using random backoff for path-setup requests. However, starvation for a core is possible, especially for writes in the presence of many reads. We leave the impact and effectiveness of the memory access mechanism on power and performance to Section V. Addressing memory access starvation through both network design and software/programming models remains a topic for future work.
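A minimal sketch of random backoff follows. The text does not specify the backoff distribution or retry limit, so the uniform window `max_wait`, the bound `max_attempts`, and the function name are all assumptions made for illustration.

```python
import random


def setup_with_backoff(try_setup, max_attempts=8, max_wait=16, rng=None):
    """Retry a path setup, waiting a random number of cycles after each block.

    try_setup is any callable returning True on PATH-ACK and False on
    PATH-BLOCKED. Returns (attempts_used, cycles_waited), with attempts_used
    set to None if every attempt was blocked.
    """
    rng = rng or random.Random()
    waited = 0
    for attempt in range(1, max_attempts + 1):
        if try_setup():
            return attempt, waited
        # Assumed policy: uniform random wait so colliding requesters desynchronize.
        waited += rng.randint(1, max_wait)
    return None, waited


# Example: a path that is blocked twice and then becomes free.
outcomes = iter([False, False, True])
attempts, cycles = setup_with_backoff(lambda: next(outcomes), rng=random.Random(0))
```

Randomizing the wait keeps two requesters that blocked each other from retrying in lockstep, which is how backoff avoids livelock; it does not by itself prevent the starvation noted above.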
B. Silicon Nanophotonic Technology
Circuit-switching photonic networks can be achieved using active broadband ring resonators whose diameter is manufactured such that its resonant modes directly align with all of the wavelengths injected into the nearby waveguide. For example, a 200 μm ring will have a wavelength channel spacing of 50 GHz. The ring resonator can be configured to be used as a photonic switching element (PSE), as shown in Figure 6. By electrically injecting carriers into the ring, the entire resonant profile is shifted, effectively creating a spatial switch between the ports of the device [27]. This process is analogous to setting the control signals of an electronic crossbar.
Given the operation of a single PSE, we can then construct higher order switches, and ultimately entire networks. Using ring-resonator devices in this way opens the possibility to explore different network topologies in much the same way as packet-switched electronic networks [36] . Different numbers and configurations of ring switches yield different amounts of energy, different path-blocking characteristics, as well as varying insertion loss.
We assume off-chip photonic signaling is achieved through lateral coupling [1], [30], where the optically encoded data is brought in and out of the chip through inverse-taper optical mode converters which expand the on-chip optical cross section to match the cross section of the external guiding medium. This method is employed due to its lower insertion loss compared to vertical coupling [39], [14]. Waveguide pitch at the chip edge can easily be on the order of 60 μm interfacing to off-chip arrayed waveguides [41] or optical fiber. This photonic I/O pitch remains well below that of current electrical I/O pitch (e.g. 190 μm in the Sun UltraSPARC T2 [43]), illustrating the potential for vastly higher bandwidth density that is offered by using photonic waveguides when using WDM.
As shown in Figure 4 , the MAP controls a switch that establishes circuit paths between individual memory modules and the network. The photonic version of this switch is illustrated in Figure 7 , which uses broadband ring-resonators to allow access to multiple memory modules controlled by the same memory controller. Modulators convert electronic DRAM commands from the memory controller to the optical domain. Additional waveguides can be added to incorporate an arbitrary number of memory modules into one MAP, as shown in Figure 7 with three bidirectional memory module connections.
C. Circuit-Accessed Memory Module
Our proposed circuit-switched memory access architecture requires slightly different usage of DRAM modules. Figure  8 (Figure 8 (c)) is responsible for demultiplexing the single optical channel into the address and data bus much in the same way as Rambus RDRAM memory technology [38] , using the simple control flowchart shown in Figure 5 . This shift from electrical to photonic technology presents significant advantages for the physical design and implementation of off-chip signaling. One advantage is that the P-CAMM can be locally clocked, as shown, performing serialization and deserialization on the I/O bitrate, and synchronizing it to the DRAM clock rate. Coding or clock transmission can be used to recover the clock in the transceiver, and matched to the local DRAM clock after deserialization. Local clocking and the elimination of long printed circuit board (PCB) traces that the DRAM chips drove allow the P-CAMM to sustain higher clock frequencies than contemporary DRAM modules.
Although the P-CAMM shown in Figure 8 (a) retains the contemporary SDRAM DIMM form factor, this is not required due to the alleviated pinning requirements. The memory module can then be designed for larger, smaller, or denser configurations of DRAM chips. Furthermore, the memory module can be placed arbitrarily distant from the processor using low-loss optical fiber without incurring any additional power or optical loss. The added latency is also minimal, at 4.9 ns/m [11]. Additionally, the driver and receiver banks use much less power for photonics using ring-resonator based modulators and SiGe detectors than for off-chip electronic I/O wires [7].
IV. EXPERIMENTAL SETUP
The main goal of this work is to evaluate how silicon photonic technology and circuit-switching affect power efficiency in transporting data to and from off-chip DRAM. We perform this analysis by investigating different network configurations using PhoenixSim, a simulation environment for physical-layer analysis of chip-scale photonic interconnection networks [6] .
A. On-chip Network Architectures
The 2D mesh topology has some attractive characteristics including a modular design, short interconnecting wires, and simple X-Y routing. For these reasons, the 2D mesh has been used in some of the first industry instantiations of tiled manycore networks-on-chip [16], [44]. The mesh also provides a simple and effective means of connecting peripheral memory access points at the ends of rows and columns, utilizing router ports that would have otherwise gone unused or required specialized routing.
We consider three different network architectures: Electronic packet-switched (Emesh), Electronic circuit-switched (EmeshCS), and Photonic circuit-switched (PmeshCS). All three use an 8×8 2D mesh topology to connect the grid of 64 network nodes with DRAM access points on the periphery. An abstract illustration of this setup is shown in Figure 9 .
The Emesh and EmeshCS use the routers shown in Figures  1 and 3 , respectively, to construct the on-chip 8×8 mesh. They also use integrated concentration [24] of 4 cores per network gateway, for a total core count of 256.
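The node count and X-Y routing mentioned above can be sketched in a few lines. The mapping of consecutive core identifiers to the same gateway and the row-major node numbering are assumptions made for illustration; the text specifies only the 8×8 mesh, the concentration factor of 4, and X-Y routing.

```python
MESH_DIM = 8          # 8x8 mesh of network nodes
CORES_PER_NODE = 4    # concentration factor


def node_of_core(core_id):
    """Assumed mapping: four consecutive core ids share one network gateway."""
    return core_id // CORES_PER_NODE


def xy_route(src, dst, dim=MESH_DIM):
    """Dimension-ordered route: travel along X first, then along Y.

    Nodes are numbered row-major (an assumption); the result is the list of
    (x, y) coordinates visited, including source and destination.
    """
    x, y = src % dim, src // dim
    dst_x, dst_y = dst % dim, dst // dim
    path = [(x, y)]
    while x != dst_x:
        x += 1 if dst_x > x else -1
        path.append((x, y))
    while y != dst_y:
        y += 1 if dst_y > y else -1
        path.append((x, y))
    return path


total_cores = MESH_DIM * MESH_DIM * CORES_PER_NODE
corner_to_corner = xy_route(0, 63)
short_hop = xy_route(0, 10)
```

Because every route finishes its X traversal before turning into Y, the set of allowed turns is restricted, which is the property that makes dimension-ordered routing simple to implement on a mesh.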
Starting from the electronic circuit-switched mesh, we replace the electronic data plane with nanophotonic waveguides and switches to achieve a hybrid photonic circuit-switched network. External concentration [24] is used because of the relative difficulty of designing high-radix photonic switches, and to reduce the number of modulator/detector banks. Designs of 4×4 photonic switches in the context of networks have been explored in [5], but because a mesh router requires 5 ports (4 directions + processor core), we must reconsider the design of the photonic switches to minimize power and insertion loss. Figure 10 introduces two new designs for the photonic 5-port ring resonator-based broadband data switch used in the circuit-switching router for the PmeshCS, designated as PS-1 and PS-2. We designed the PS-1 starting with an optimized 4×4 switch [5], and adding the modulator and detector banks between lanes. As a result, the switch has a small number of rings and low insertion loss, but exhibits blocking when certain ports are being used (e.g. when the detector bank is being used, the east-bound port is blocked). We designed the PS-2 switch from a full ring-matrix crossbar switch, taking out rings to account for no U-turns being allowed, and routing waveguides to eliminate terminations. The PS-2 switch uses more rings and has larger insertion loss, but is fully nonblocking. Because it is not obvious how the two switch designs will affect the network as a whole, we will consider separate photonic mesh instantiations using each switch.
Fig. 10. Two designs for a 5-port photonic switch for the PmeshCS
B. Simulation Environment
The PhoenixSim simulation environment allows us to capture physical-layer details, such as physical dimensions and layout, of both electronic and nanophotonic devices to accurately execute various traffic models. We describe the relevant modeling and parameters below.
Photonic Devices. Modeling of optical components is built on a detailed physical-layer library that has been characterized and validated through the physical measurement of fabricated devices. The modeled components are fabricated in silicon at the nano-scale, and include modulators, photodetectors, waveguides (straight, bending, crossing), filters, and PSEs. The behavior of these devices is characterized and modeled at runtime by attributes such as insertion loss, crosstalk, delay, and power dissipation. Tables I and II show some of the most important optical parameters used.
Photonic Network Physical Layer Analysis. The number of available wavelengths is obtained through an insertion loss analysis, a key tool in our simulation environment [6]. Figure 11 shows the relationship between network insertion loss and the number of wavelengths that can be used. Two constraints must be met in order to achieve reliable optical communication. Equation 1 states that the total injected power at the first modulator must be below the threshold at which nonlinear effects are induced, thus corrupting the data (or introducing significantly more optical loss). A reasonable value for P NT is around 10-20 mW [26]. Equation 2 states that the power received at the detectors must be greater than the detector sensitivity (usually about -20 dBm) to reliably distinguish between zeros and ones. To ensure this, every wavelength must inject at least enough power to overcome the worst-case optical loss through the network. From these relationships, we can see that the number of wavelengths that can be used in a network relies mainly on the worst-case insertion loss through it.
The two photonic switches that we consider here, labeled PS-1 and PS-2, have different insertion loss characteristics. We determine the worst case network-level insertion loss using each of the switches in the photonic mesh, and find that it equates to 13.5 dB and 18.41 dB for the PS-1 and PS-2, respectively. This means that the Pmesh can safely use approximately 128 wavelengths for the PS-1, and 45 for the PS-2. Despite the PS-1 having 2× more bandwidth than the PS-2, its blocking conditions may yield a lower total bandwidth for the network.
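The two power constraints can be turned into a rough wavelength estimate. The sketch below is one first-order reading of them and is not the simulator's insertion loss analysis: it assumes that every wavelength is injected at exactly the power needed to arrive at the detector sensitivity after the worst-case loss, and that the sum over all wavelengths must stay below the nonlinear threshold. The function names are hypothetical, and the sketch is not expected to reproduce the exact wavelength counts quoted above, which depend on device parameters not modeled here.

```python
import math


def mw_to_dbm(power_mw):
    """Convert milliwatts to dBm."""
    return 10.0 * math.log10(power_mw)


def max_wavelengths(p_nt_mw, sensitivity_dbm, worst_loss_db):
    """Upper bound on the wavelength count under a simple optical power budget.

    Budget (dB) = nonlinear threshold (dBm) - detector sensitivity (dBm).
    Whatever remains after the worst-case insertion loss can be divided
    among wavelengths: 10*log10(N) <= budget - loss.
    """
    margin_db = mw_to_dbm(p_nt_mw) - sensitivity_dbm - worst_loss_db
    if margin_db < 0:
        return 0
    return math.floor(10.0 ** (margin_db / 10.0))


# Illustrative inputs: thresholds in the range given in the text, -20 dBm sensitivity.
example = max_wavelengths(10.0, -20.0, 10.0)
low_loss = max_wavelengths(20.0, -20.0, 13.5)    # loss value quoted for PS-1
high_loss = max_wavelengths(20.0, -20.0, 18.41)  # loss value quoted for PS-2
```

Since the insertion loss enters the exponent, each additional 3 dB of worst-case loss roughly halves the number of usable wavelengths in this model, which is consistent with the lower-loss switch supporting substantially more wavelengths.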
Table footnotes:
* Dynamic energy calculation based on carrier density, 50-μm ring, 320×250-nm waveguide, 75% exposure, 1-V bias.
† Based on switching energy, including photon lifetime for re-injection.
‡ Same as *, for a 3 μm ring modulator.
§ Based on experimental measurements in [49]. Calculated for half a 10 GHz clock cycle, with 50% probability of a 1-bit.
¶ Conservative approximation assuming femto-farad class receiverless SiGe detector with C < 1 fF.
Same value as used in [20]. Average of 20 degrees thermal tuning required.
** From [51]
†† Projections based on [13]
‡‡ From [25]
Simulation Parameters. The parameters for all networks have been chosen for power-efficient configurations, typically the most important concern for embedded systems. We consider the key limiting factor for our embedded system design to be ideal I/O bandwidth. For photonics, I/O bandwidth (which is the same as on-chip bandwidth due to bit-rate transparent devices) is limited by insertion loss as described above.
The electronic networks, however, are limited by pin count. Electronic off-chip signaling bandwidth is limited by packaging constraints at a total of 1792 I/O pins (64 pins per MAP), which is more than 2× that of today's CMPs (TILE64 [3] ). Note that even though a real chip would require a significant number of additional I/O ports, we assume that all of these 1792 pins are dedicated to DRAM access. This places the total number of pins well over 4000, assuming a 50% total I/O-to-power/ground ratio. According to ITRS [18] , attaining this pin count will require solutions to significant packaging challenges. Table III shows the more important simulation parameters that will be used for simulations in Section V. For each network, we work backwards from the I/O bandwidth available across the chip boundary to the on-chip and DRAM parameters. We assume all cores run at 2.5 GHz.
The Emesh uses conventional DRAM bidirectional signalling with 2 DRAM channels for increased access concurrency running at 1.6 GT/s, using a conventional 8 arrays per chip and 8 chips per DIMM. The Emesh network runs at 1.6 GHz to match this bandwidth. Our router model implements a fully pipelined router which can issue two grant requests per cycle (for different outputs) and uses dimension-ordered routing for deadlock avoidance, and bubble flow control [37] for congestion management. One virtual channel (VC) is used for writes and core-to-core communication, and a separate VC is used for read responses for reduced read latency. For power dissipation modeling, the ORION 2.0 electronic router model [21] is integrated into the simulator, which provides detailed technology node-specific modeling of router components such as buffers, crossbars, arbiters, clock tree, and wires. The technology point is specified as 32 nm, and the V DD and V th ORION parameters are set according to frequency (lower voltage, higher threshold for lower frequencies). The ORION model also calculates the area of these components, which is used to determine the lengths of interconnecting wires. Off-chip electronic I/O wires and transceivers are modeled as using 1 pJ/bit, a reasonable projection based on [18], [33]. The EmeshCS uses high speed (10 Gb/s) bidirectional differential pairs for I/O signalling, requiring serialization and deserialization (SerDes) at the chip edge to match the 2.5 GHz data plane. The path-setup electronic control plane runs at a slower 1.0 GHz to save power. The photonic networks use the exact same control plane as the EmeshCS, and the same 2.5 Gb/s bitrate per wavelength to avoid significant SerDes power consumption at the network gateways. SerDes power is modeled using ORION flip-flop models as shift registers running at the higher clock rate, bandwidth matching both sides with parallel wires.
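As a back-of-the-envelope illustration of these rates, the sketch below computes the peak raw bandwidth of one WDM waveguide from the wavelength counts quoted earlier for the PS-1 and PS-2 switches, and the serialization ratio at the EmeshCS chip edge. The function names are hypothetical, and the figures ignore the clock channel, locking overhead, and any coding, so they are upper bounds rather than simulated results.

```python
def wdm_bandwidth_gbps(wavelengths, bitrate_gbps=2.5):
    """Peak raw bandwidth of one waveguide: wavelengths times per-wavelength bitrate."""
    return wavelengths * bitrate_gbps


def serdes_ratio(io_gbps, data_plane_ghz):
    """Bits carried by one serial I/O lane per data-plane clock cycle.

    Assumes one bit per wire per cycle on the on-chip data plane.
    """
    return io_gbps / data_plane_ghz


ps1_peak = wdm_bandwidth_gbps(128)      # wavelength count quoted for PS-1
ps2_peak = wdm_bandwidth_gbps(45)       # wavelength count quoted for PS-2
emeshcs_ratio = serdes_ratio(10.0, 2.5)
```

Under these assumptions each 10 Gb/s differential pair needs a 4:1 serializer at the chip edge, whereas the photonic networks run each wavelength at the data-plane rate and avoid that conversion at the gateways.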
For all three circuit-switched configurations, we increase the number of DRAM arrays per chip by decreasing the row and column count to be able to continuously feed the I/O. A bit-rate clock is sent with the data on a separate channel to lock on to the data at the receiver, and we allocate 16 clock cycles of overhead for each transmission for locking.
DRAM Modeling. The cycle-accurate simulation of the DRAM memory subsystem along with the network on chip for the Emesh is accomplished by integrating DRAMsim [48] into our simulator. The Emesh behaves like a typical contemporary system in that the packetization of messages required by the packet-switched network yields small memory transaction sizes, analogous to today's cachelines. Therefore, a DRAM model which is based on typical DDR SDRAM components and control policies that might be seen in real systems, such as DRAMsim, is appropriate for this configuration.
The two circuit-switched networks, however, exhibit different memory access behavior than a packet-switched version, thus enabling a simplification of the memory control logic. For this reason, we use our own model for DRAM components and control. This model cycle-accurately enforces all timing constraints of real DRAM chips, including row access time, row-column delay, column access latency, and precharge time. Because access to the memory modules is arbitrated by the on-chip path-setup mechanism, only one transaction must be sustained by a MAP, which greatly simplifies the control logic as previously discussed.
We base our model parameters around a Micron 1-Gb DDR3 chip [32], with (tRCD-tRP-tCL) chosen as (12.5-12.5-12.5) ns. To normalize the three different network architectures for experiment, we assign them the same amount of similarly-configured DDR3 DRAM around the periphery.
V. EMBEDDED APPLICATION SIMULATION

A. Evaluation Framework
We evaluate the proposed network architectures using the Mapping and Optimization Runtime Environment (MORE) application modeling framework to collect traces from the execution of high-performance embedded signal and image processing applications.
The MORE system, based on pMapper [45] , is designed to project a user program written in Matlab onto a distributed or parallel architecture and provide performance results and analysis. The MORE framework translates application code into a dependency-based instruction trace, which captures the individual operations performed as well as their interdependencies. By creating an instruction trace interface for PhoenixSim, we were able to accurately model the execution of applications on the proposed architectures.
MORE consists of the following primary components:
• The program analysis component is responsible for converting the user program, taken as input, into a parse graph, a description of the high-level operations and their dependences on one another.
• The data mapping component is responsible for distributing the data of each variable specified in the user code across the processors in the architecture.
• The operations analysis component is responsible for taking the parse graph and data maps and forming the dependency graph, a description of the low-level operations and their dependences on one another.
PhoenixSim then reads the dependency graphs produced by MORE, generating computation and communication events. Combining PhoenixSim with MORE in this way allows us to characterize photonic networks on the physical level by generating traffic which exactly describes the communication, memory access, and computation of the given application.
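To make the idea of replaying a dependency graph concrete, the sketch below orders operations so that each one is issued only after all of its dependences. The graph layout (a dictionary with `kind` and `deps` fields) and the function name are hypothetical; the actual trace format exchanged between MORE and PhoenixSim is not specified here.

```python
from collections import deque


def schedule_events(graph):
    """Return (op_id, kind) pairs in an order that respects dependences.

    graph: {op_id: {"kind": str, "deps": [op_id, ...]}}  (hypothetical format)
    Uses Kahn's algorithm for topological ordering.
    """
    # Number of unfinished dependences for each operation.
    pending = {op: len(node["deps"]) for op, node in graph.items()}
    # Reverse edges: which operations are waiting on a given operation.
    users = {op: [] for op in graph}
    for op, node in graph.items():
        for dep in node["deps"]:
            users[dep].append(op)

    ready = deque(sorted(op for op, count in pending.items() if count == 0))
    order = []
    while ready:
        op = ready.popleft()
        order.append((op, graph[op]["kind"]))
        for user in users[op]:
            pending[user] -= 1
            if pending[user] == 0:
                ready.append(user)

    if len(order) != len(graph):
        raise ValueError("dependency graph contains a cycle")
    return order
```

A simulator would issue each returned operation as a computation, communication, or memory event once its predecessors complete; here only the ordering step is shown.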
Three applications are considered: projective transform, matrix multiply, and fast Fourier transform (FFT). Results for power usage, performance (GOPS), and efficiency (GOPS/W) improvement are provided for each.
Projective Transform. When registering multiple images taken from various aerial surveillance platforms, it is frequently advantageous to change the perspective of these images so that they are all registered from a common angle and orientation (typically straight down with north being at the top of the image). In order to do this, a process known as projective transform is used [22] .
Projective transform takes as input a two-dimensional image M as well as a transformation matrix t that expresses the transformational component between the angle and orientation of the image presented and the desired image. The projective transform algorithm outputs M', or the image M after projection through t. To populate a pixel p' in M', its x and y positions are back-projected through t to get their relative position p in M. This position likely does not fall directly on a pixel in M, but rather somewhere between a set of four pixels. Using the distance from p to each of its corners as well as the corner values themselves, the value for p' can be obtained.
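The back-projection and four-corner weighting described above can be sketched as follows. This is a minimal illustration, assuming that the supplied 3×3 matrix (called `t_inv` here) maps output pixel coordinates to input coordinates and that the corner weighting is standard bilinear interpolation; the names and the `fill` value for out-of-range pixels are assumptions, not details taken from the implementation.

```python
import math


def back_project(t_inv, x, y):
    """Apply a 3x3 homogeneous transform to (x, y) and dehomogenize."""
    xh = t_inv[0][0] * x + t_inv[0][1] * y + t_inv[0][2]
    yh = t_inv[1][0] * x + t_inv[1][1] * y + t_inv[1][2]
    w = t_inv[2][0] * x + t_inv[2][1] * y + t_inv[2][2]
    return xh / w, yh / w


def bilinear_sample(image, x, y):
    """Interpolate image[row][col] at column x, row y; None if outside."""
    height, width = len(image), len(image[0])
    if not (0 <= x <= width - 1 and 0 <= y <= height - 1):
        return None
    x0, y0 = int(math.floor(x)), int(math.floor(y))
    x1, y1 = min(x0 + 1, width - 1), min(y0 + 1, height - 1)
    fx, fy = x - x0, y - y0  # fractional distance from the top-left corner
    top = image[y0][x0] * (1 - fx) + image[y0][x1] * fx
    bottom = image[y1][x0] * (1 - fx) + image[y1][x1] * fx
    return top * (1 - fy) + bottom * fy


def projective_transform(image, t_inv, out_height, out_width, fill=0.0):
    """Populate every output pixel by back-projecting it into the input."""
    out = []
    for y in range(out_height):
        row = []
        for x in range(out_width):
            src_x, src_y = back_project(t_inv, x, y)
            value = bilinear_sample(image, src_x, src_y)
            row.append(fill if value is None else value)
        out.append(row)
    return out
```

Because each output pixel reads up to four input pixels at a position determined by the matrix, the memory access pattern depends on the transformation, which is why different matrices induce different data movement.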
MORE allows us to retain identical image and projection sizes while still inducing data movement in the projection process, and to investigate various transformation matrices. For this experiment, we consider this application on various image sizes where the image orientation is rotated by ninety degrees.
Matrix Multiply. Matrix multiplication is a common operation in signal and image processing, where it can be used in filtering as well as to control hue, saturation and contrast in an image. It is a natural candidate for consideration on our architecture, given that multiple data points need to be accessed and then summed to form a single entry in the result.
While various algorithms for matrix multiplication can be considered for matrices of any dimension, we shall focus our analysis on an inner product algorithm over square matrices. Here, in an N × N matrix, each entry is generated by first multiplying together two vectors of size N (corresponding to a row and a column), and then summing the entries in the resulting vector to form a single entry in the result.
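A minimal sequential sketch of the inner product formulation is shown below, assuming square matrices stored as lists of rows. The function name is illustrative, and the sketch omits the data distribution across cores that a parallel version would require.

```python
def inner_product_matmul(a, b):
    """Multiply two N x N matrices using the inner product formulation."""
    n = len(a)
    result = [[0.0] * n for _ in range(n)]
    for i in range(n):          # row of a
        for j in range(n):      # column of b
            # Element-wise product of row i of a and column j of b ...
            products = [a[i][k] * b[k][j] for k in range(n)]
            # ... summed to form a single entry of the result.
            result[i][j] = sum(products)
    return result
```

Each of the N × N entries is computed independently of the others, which is what makes this formulation straightforward to partition across processors.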
The inner product algorithm requires time proportional to N^3. While the best known algorithm for matrix multiply is O(N^2.376), the constants in the algorithm make it infeasible for all but the largest of matrices. Even Strassen's algorithm [42], with a bound of O(N^2.807), is frequently considered too cumbersome and awkward to implement, especially in a parallel environment. Though more computationally expensive, the inner product algorithm also lends itself more naturally to a parallel implementation, making it our algorithm of choice.
Fast Fourier Transform. Computing the Fast Fourier Transform (FFT) of a set of data points is an essential algorithm which underlies many signal processing and scientific applications. In addition to the widespread use of the FFT, the inherent data parallelism that can be exploited in its computation makes it a good match for measuring the performance of networks-on-chip. A typical way the FFT is computed in parallel, and which is employed in our execution model, is the Cooley-Tukey method [10]. The communication patterns and computation stages for 8 nodes are shown in Figure 12. We run the FFT where each core begins with 2^10, 2^12, 2^14, 2^16, and 2^18 samples, and average the results.
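The pairwise exchanges of a radix-2 butterfly across nodes can be sketched as below. This assumes a power-of-two node count and a stage order that starts with the most distant partner; the ordering and node numbering used in Figure 12 may differ, and the function name is illustrative.

```python
def butterfly_partners(num_nodes):
    """List, for each stage, the node each node exchanges data with.

    Returns a list of stages; stages[s][node] is that node's partner
    in stage s. There are log2(num_nodes) stages.
    """
    if num_nodes < 1 or num_nodes & (num_nodes - 1):
        raise ValueError("num_nodes must be a power of two")
    stages = []
    distance = num_nodes // 2
    while distance >= 1:
        # Partners differ in exactly one bit of the node index.
        stages.append([node ^ distance for node in range(num_nodes)])
        distance //= 2
    return stages
```

For 8 nodes this yields three stages, with partners at distance 4, then 2, then 1, so every node communicates in every stage and the traffic alternates between long and short paths.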
B. Simulation Results
Table IV shows the averaged results for the different network configurations across the three applications: network-related power, total system performance (GOPS), and total system efficiency (GOPS/W), normalized to the Emesh for comparison. In all cases, the circuit-switched networks achieve considerable improvements in both performance and power over the Emesh.
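For reference, an efficiency figure normalized to a baseline can be computed as in the sketch below. The function and parameter names are illustrative, and no values from Table IV are assumed.

```python
def normalized_efficiency(gops, watts, baseline_gops, baseline_watts):
    """Return (GOPS/W of a configuration) / (GOPS/W of the baseline)."""
    if watts <= 0 or baseline_watts <= 0 or baseline_gops <= 0:
        raise ValueError("power and baseline performance must be positive")
    efficiency = gops / watts
    baseline_efficiency = baseline_gops / baseline_watts
    return efficiency / baseline_efficiency
```

A result greater than 1 means the configuration delivers more operations per unit of energy than the baseline, whether through higher performance, lower power, or both.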
For the Projective Transform and Matrix Multiply, the EmeshCS consumes some additional power to achieve considerable gains in performance. The photonic networks also perform significantly better than the Emesh, though at much lower power than the EmeshCS. The PS-2 generally consumes less power for two reasons: it has fewer modulators (and therefore less bandwidth), and it uses non-blocking switches, which reduce path-setup blocking and retries on the electronic control plane. The FFT exhibits different communication and memory access behavior than the other applications, and its gains are less pronounced, though PS-2 still achieves an order-of-magnitude improvement in efficiency.
The breakdown of power consumption for the various network components is shown in Figure 13 for the Projective Transform, one of the more network-active applications. We can see that the Emesh power consists mostly of buffer, crossbar, and clock power. EmeshCS alleviates buffer power as intended, but at the cost of crossbar and wire power in the higher-frequency data plane. Finally, the photonic networks achieve drastically lower power through distance-independent, efficient modulation and detection.
VI. RELATED WORK
Networks-on-chip have entered the computer architecture arena to enable core-to-core and core-to-DRAM communication on contemporary processors. The Tilera TILE-Gx processors [44] and Intel Polaris [16] are examples of real packet-switched NoC implementations with up to 100 and 80 cores, respectively. The Cell BE [9] uses a circuit-switched network to connect heterogeneous cores and a single memory controller.
Next-generation NoC designs using silicon nanophotonic technology have also been proposed. The Corona network is an example of a network that uses optical arbitration via a wavelength-routed token ring to reserve access to a full serpentine crossbar made from redundant waveguides, modulators, and detectors [47]. Similarly, wavelength-routed bus-based architectures have been proposed which take advantage of WDM for arbitration [23], [34]. Batten et al. proposed a wavelength-selective routed architecture for off-chip communications which takes advantage of WDM to dedicate wavelengths to different DRAM banks, forming a large wavelength-tuned ring-resonator matrix as a central crossbar [2] on which source nodes transmit on the specific wavelength that is received by a single destination. Hadke et al. proposed OCDIMM, a WDM-based optical interconnect for FBDIMM memory banks, which uses wavelength-routing to achieve a memory system that scales while sustaining low latencies [15]. Phastlane was designed for a cache-coherent CMP, enabling snoop-broadcasts and cacheline transfers in the optical domain [8]. Finally, on-chip hybrid electronically circuit-switched photonic networks have been proposed by Shacham et al. [40] and Petracca et al. [35], and further investigated by Hendry et al. [17] and Chan et al. [5].
The main contribution of this work over previous work is to explore circuit-switching as a memory access method in the context of a nanophotonic-enabled interconnect, using the same network resources which enable core-to-core communication. Uniquely, our simulation framework incorporates physically-accurate photonic device models, detailed electronic component models, and cycle-accurate DRAM device and control models into a full system simulation.
VII. CONCLUSION
By incorporating cycle-accurate DRAM control and device models into a network simulator with detailed physically-accurate models of both photonic and electronic components, we are able to investigate circuit-switched memory access in an embedded high-performance CMP computing node design. We run three signal and image processing applications on different network implementations normalized to topology, pin constraints, total memory, and CMOS technology to characterize the different networks with respect to bandwidth and latency. Accessing memory using a circuit-switched network was found to increase performance through long burst lengths and decrease power by eliminating performance-dependent buffers. Silicon nanophotonic technology adds to these benefits with low-energy transmission and higher bandwidth density which will enable future scaling. Additional benefits include reduced memory controller complexity, dramatically lower pin counts, and relaxed memory module and compute board design constraints, all of which are beneficial to the embedded computing world.
