How Much Can D2D Communication Reduce Content Delivery Latency in Fog Networks with Edge Caching?
A Fog-Radio Access Network (F-RAN) is studied in which cache-enabled Edge
Nodes (ENs) with dedicated fronthaul connections to the cloud aim at delivering
contents to mobile users. Using an information-theoretic approach, this work
tackles the problem of quantifying the potential latency reduction that can be
obtained by enabling Device-to-Device (D2D) communication over out-of-band
broadcast links. Following prior work, the Normalized Delivery Time (NDT) --- a
metric that captures the high signal-to-noise ratio worst-case latency --- is
adopted as the performance criterion of interest. Joint edge caching, downlink
transmission, and D2D communication policies based on compress-and-forward are
proposed that are shown to be information-theoretically optimal to within a
constant multiplicative factor of two for all values of the problem parameters,
and to achieve the minimum NDT for a number of special cases. The analysis
provides insights on the role of D2D cooperation in improving the delivery
latency. Comment: Submitted to the IEEE Transactions on Communications
The status of US Teraflops-scale projects
The current status of United States projects pursuing Teraflops-scale
computing resources for lattice field theory is discussed. Two projects are in
existence at this time: the Multidisciplinary Teraflops Project, incorporating
the physicists of the QCD Teraflops Collaboration, and a smaller project,
centered at Columbia, involving the design and construction of a 0.8 Teraflops
computer primarily for QCD. Comment: Contribution to Lattice 94. 7 pages. LaTeX source followed by
compressed, uuencoded PostScript file of the complete paper. Individual
figures available from [email protected]
A 90 nm CMOS 16 Gb/s Transceiver for Optical Interconnects
Interconnect architectures which leverage high-bandwidth optical channels offer a promising solution to address the increasing chip-to-chip I/O bandwidth demands. This paper describes a dense, high-speed, and low-power CMOS optical interconnect transceiver architecture. Vertical-cavity surface-emitting laser (VCSEL) data rate is extended for a given average current and corresponding reliability level with a four-tap current-summing FIR transmitter. A low-voltage integrating and double-sampling optical receiver front-end provides adequate sensitivity in a power-efficient manner by avoiding linear high-gain elements common in conventional transimpedance-amplifier (TIA) receivers. Clock recovery is performed with a dual-loop architecture which employs baud-rate phase detection and feedback interpolation to achieve reduced power consumption, while high-precision phase spacing is ensured at both the transmitter and receiver through adjustable delay clock buffers. A prototype chip fabricated in 1 V 90 nm CMOS achieves 16 Gb/s operation while consuming 129 mW and occupying 0.105 mm^2.
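The power and data-rate figures quoted above imply an energy cost per transmitted bit. The following is a minimal sketch of that arithmetic only; the function name is hypothetical, and the abstract itself does not state an energy-per-bit figure.

```python
def energy_per_bit_pj(power_mw: float, rate_gbps: float) -> float:
    """Return energy per bit in picojoules.

    mW / (Gb/s) = 1e-3 W / 1e9 b/s = 1e-12 J/b, i.e. pJ/bit,
    so the two unit prefixes cancel into picojoules directly.
    """
    return power_mw / rate_gbps


# Figures taken from the abstract: 129 mW at 16 Gb/s.
print(energy_per_bit_pj(129.0, 16.0))
```

This treats the whole 129 mW as spent on the 16 Gb/s link, which is an assumption; the abstract does not break the power down by block.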
Stencils and problem partitionings: Their influence on the performance of multiple processor systems
Given a discretization stencil, partitioning the problem domain is an important first step for the efficient solution of partial differential equations on multiple processor systems. Partitions are derived that minimize interprocessor communication when the number of processors is known a priori and each domain partition is assigned to a different processor. This partitioning technique uses the stencil structure to select appropriate partition shapes. For square problem domains, it is shown that non-standard partitions (e.g., hexagons) are frequently preferable to the standard square partitions for a variety of commonly used stencils. This investigation is concluded with a formalization of the relationship between partition shape, stencil structure, and architecture, allowing selection of optimal partitions for a variety of parallel systems
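To illustrate why partition shape affects interprocessor communication, the sketch below compares only the two most familiar shapes, strips and squares, for a 5-point stencil. It is not the paper's method and does not model the hexagonal partitions mentioned above. The function names are hypothetical, and the assumptions are an n-by-n grid, a perfect-square processor count p whose root divides n, and an interior partition that exchanges one layer of grid points across each of its edges.

```python
import math


def strip_halo_points(n: int, p: int) -> int:
    """Points an interior strip (n wide, n/p tall) exchanges per step."""
    if n % p != 0:
        raise ValueError("p must divide n")
    return 2 * n  # two long edges, each n points


def square_halo_points(n: int, p: int) -> int:
    """Points an interior square block exchanges per step."""
    root = math.isqrt(p)
    if root * root != p or n % root != 0:
        raise ValueError("p must be a perfect square whose root divides n")
    side = n // root
    return 4 * side  # four edges, each `side` points


n, p = 1024, 16
print(strip_halo_points(n, p), square_halo_points(n, p))
```

Under these assumptions the square block exchanges fewer points than the strip once p exceeds 4, which is the kind of shape dependence the abstract formalizes for general stencils.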
NaNet: a Low-Latency, Real-Time, Multi-Standard Network Interface Card with GPUDirect Features
While the GPGPU paradigm is widely recognized as an effective approach to
high performance computing, its adoption in low-latency, real-time systems is
still in its early stages.
Although GPUs typically show deterministic behaviour in terms of latency in
executing computational kernels as soon as data is available in their internal
memories, assessment of real-time features of a standard GPGPU system needs
careful characterization of all subsystems along the data stream path.
The networking subsystem turns out to be the most critical one in terms of
absolute value and fluctuations of its response latency.
Our envisioned solution to this issue is NaNet, an FPGA-based PCIe Network
Interface Card (NIC) design featuring a configurable and extensible set of
network channels with direct access through GPUDirect to NVIDIA Fermi/Kepler
GPU memories.
The NaNet design currently supports both standard - GbE (1000BASE-T) and 10GbE
(10GBASE-R) - and custom - 34 Gbps APElink and 2.5 Gbps deterministic latency
KM3link - channels, but its modularity allows for a straightforward inclusion
of other link technologies.
To avoid host OS intervention on data stream and remove a possible source of
jitter, the design includes a network/transport layer offload module with
cycle-accurate, upper-bound latency, supporting UDP, KM3link Time Division
Multiplexing and APElink protocols.
After a description of the NaNet architecture and its latency/bandwidth
characterization for all supported links, two real-world use cases will be
presented: the GPU-based low-level trigger for the RICH detector in the NA62
experiment at CERN and the on-/off-shore data link for the KM3 underwater neutrino
telescope.
Serialized Asynchronous Links for NoC
This paper proposes an asynchronous serialized link for NoC that can achieve the same level of performance, in terms of flits per second, as a synchronous link, but with fewer wires in the point-to-point switch links and reduced power consumption. This is achieved by performing serialization in the asynchronous domain rather than the synchronous one, which removes the need for global clocking on the serial links. Transistor-level simulations using 0.12 µm foundry models show that it is possible to achieve the same level of performance as a synchronous link but with a 75% reduction in wires and a 65% reduction in power for a 300 MFlit/s link with 8 buffers and a switch clock speed of 300 MHz. Furthermore, the paper presents the design requirements arising from interfacing the switches of a synchronous NoC with asynchronous serial links.
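As a rough illustration of what a 75% wire reduction means for per-wire throughput, the sketch below works through the arithmetic. The 32-bit flit width and the one-data-bit-per-wire model are assumptions made here for illustration, not figures from the abstract; the model also ignores handshake and control wires, which an asynchronous link would need.

```python
def serialized_link(flit_bits: int, wire_reduction_pct: int, flits_per_s: int):
    """Return (serial wire count, bits per wire per flit, bits/s per wire).

    Assumes the parallel link uses one wire per flit bit (hypothetical).
    """
    serial_wires = flit_bits * (100 - wire_reduction_pct) // 100
    if serial_wires == 0 or flit_bits % serial_wires != 0:
        raise ValueError("flit width must divide evenly over the serial wires")
    bits_per_wire = flit_bits // serial_wires
    return serial_wires, bits_per_wire, flits_per_s * bits_per_wire


# 75% reduction and 300 MFlit/s are from the abstract; 32 bits is assumed.
print(serialized_link(32, 75, 300_000_000))
```

Under these assumptions each remaining wire carries four bits per flit, so it has to signal four times faster than a parallel wire to sustain the same flit rate.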