RHINO: reconfigurable hardware interface for computation and radio
Field-programmable gate arrays, or FPGAs, provide an attractive computing platform for software-defined radio applications. Their reconfigurable nature allows many digital signal processing (DSP) algorithms to be highly parallelised within the FPGA fabric, while their customisable I/O interfaces allow simple interfacing to analogue-to-digital converters (ADCs) and digital-to-analogue converters (DACs). However, FPGA boards that deliver sufficient performance to be useful in real-world applications are generally expensive. Rhino is an FPGA-based hardware processing platform that primarily supports software-defined radio applications. The final cost estimate for a complete Rhino system is under $1700, cheaper than similar FPGA boards that deliver much lower performance.
Understanding and Improving the Latency of DRAM-Based Memory Systems
Over the past two decades, the storage capacity and access bandwidth of main
memory have improved tremendously, by 128x and 20x, respectively. These
improvements are mainly due to the continuous technology scaling of DRAM
(dynamic random-access memory), which has been used as the physical substrate
for main memory. In stark contrast with capacity and bandwidth, DRAM latency
has remained almost constant, reducing by only 1.3x in the same time frame.
Therefore, long DRAM latency continues to be a critical performance bottleneck
in modern systems. Increasing core counts, and the emergence of increasingly
more data-intensive and latency-critical applications further stress the
importance of providing low-latency memory access.
In this dissertation, we identify three main problems that contribute
significantly to long latency of DRAM accesses. To address these problems, we
present a series of new techniques. Our new techniques significantly improve
both system performance and energy efficiency. We also examine the critical
relationship between supply voltage and latency in modern DRAM chips and
develop new mechanisms that exploit this voltage-latency trade-off to improve
energy efficiency.
The key conclusion of this dissertation is that augmenting DRAM architecture
with simple and low-cost features, and developing a better understanding of
manufactured DRAM chips together lead to significant memory latency reduction
as well as energy efficiency improvement. We hope and believe that the proposed
architectural techniques and the detailed experimental data and observations on
real commodity DRAM chips presented in this dissertation will enable
development of other new mechanisms to improve the performance, energy
efficiency, or reliability of future memory systems.
Comment: PhD Dissertation
Energy Efficient High Port Count Optical Switches
The advance of internet applications, such as video streaming, big data and cloud computing, is reshaping the telecommunication and internet industries. Bandwidth demands in datacentres have been boosted by these emerging data-hungry internet applications. Regarding inter- and intra-datacentre communications, fine-grained data need to be exchanged across a large shared memory space.
Large-scale high-speed optical switches tend to use a rearrangeably non-blocking architecture, as this limits the number of switching elements required. However, this comes at the expense of more sophisticated route selection within the switch and some form of time-slotted protocol. The looping algorithm is the classical routing algorithm for setting up paths in rearrangeably non-blocking switches. It dates from the era of electronic switches, in which all links are equivalent. It therefore cannot account for the loss differences between optical paths that arise from differing waveguide lengths and differing numbers of crossings and bends, which leads to sub-optimal performance.
We therefore propose an advanced path-selection algorithm, based on the looping algorithm, that minimises the path-dependent loss. It explores all possible set-ups for a given connection assignment and selects the optimal one. It ensures that no individual path suffers excessive loss, thereby improving the overall performance of the switch. The performance of the proposed algorithm has been assessed by modelling switches using the VPI simulator. An 8×8 Clos-tree switch demonstrates a 2.7 dB decrease in loss and a 1.9 dB improvement in IPDR with a 1.5 dB penalty for the worst case. An 8×8 dilated Beneš switch shows more than 4 dB loss reduction for the lossiest path and a 1.4 dB IPDR improvement for a 1 dB power penalty. The improved algorithm can be run once for each switch design and its output stored in a compact lookup table, enabling rapid switch reconfiguration.
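The selection step described above can be illustrated with a minimal sketch. It assumes that the candidate set-ups for each connection assignment have already been enumerated and that a per-path loss estimate is available for each one. The names `worst_path_loss`, `select_setup` and `build_lookup_table`, and all loss values, are hypothetical and chosen only for illustration. This is not the thesis's implementation.

```python
def worst_path_loss(setup):
    """Loss (in dB) of the lossiest path in one candidate set-up."""
    return max(setup["path_losses_db"])


def select_setup(candidates):
    """Pick the candidate whose lossiest path has the smallest loss (min-max)."""
    return min(candidates, key=worst_path_loss)


def build_lookup_table(setups_by_assignment):
    """Precompute the chosen set-up name for every connection assignment."""
    return {
        assignment: select_setup(candidates)["name"]
        for assignment, candidates in setups_by_assignment.items()
    }


# Hypothetical data: one connection assignment (input i -> output perm[i])
# with three candidate set-ups and made-up per-path losses in dB.
setups_by_assignment = {
    (1, 0, 3, 2): [
        {"name": "setup_a", "path_losses_db": [3.1, 7.9, 4.0, 5.2]},
        {"name": "setup_b", "path_losses_db": [5.0, 5.4, 5.1, 5.3]},
        {"name": "setup_c", "path_losses_db": [2.8, 6.6, 6.1, 3.9]},
    ],
}

lookup_table = build_lookup_table(setups_by_assignment)
```

Here `setup_b` is selected because its lossiest path (5.4 dB) is lower than the lossiest path of the other two candidates. Once the table is built, reconfiguration reduces to a dictionary lookup.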
Microelectromechanical systems (MEMS) based optical switches have been fabricated with over 1,000 ports which meet the port count requirements in data centre networks. However, the reconfiguration speed of the MEMS switches is limited to the millisecond to microsecond timescale, which is not sufficient for packet switching in datacentres. Opto-electronic devices, such as Mach-Zehnder Interferometers (MZIs) and semiconductor optical amplifiers (SOAs) with nanosecond response time show the potential to fulfil the requirements of packet switching. However, the scalability of MZI switches is inherently limited by insertion loss and accumulated crosstalk, while the scalability of SOA switches is restricted by accumulated noise and distortion.
We have therefore proposed a dilated Beneš hybrid MZI-SOA design, where MZIs are implemented as 1×2 or 2×1 low-loss switching elements, minimising crosstalk by using a single input, and where short SOAs are included as gain or absorption units, offering either loss compensation or crosstalk suppression while adding only minimal noise and distortion. A 4×4 device has been fabricated and exhibits a mere 1.3 dB loss, an extinction ratio of 47 dB, and more than 13 dB IPDR for a 0.5 dB power penalty. When operating at 10 Gb/s per port, 6 pJ/bit energy consumption is demonstrated, a 20% reduction compared with SOA-based switches. The switch tolerates a broad range of control current: the power penalty stays below 0.2 dB over a 5 mA bias current range for 8 dB IPDR, and below 0.5 dB over a 12 mA range for 10 dB IPDR. The excellent crosstalk and power penalty performance demonstrated by this chip enables the scalability of this hybrid approach. The performance of a 16×16 port dilated Beneš hybrid switch is experimentally assessed by cascading 4×4 switch chips, demonstrating an IPDR of 15 dB at a 1 dB penalty with a 0.6 dB power penalty floor. For switches with port counts larger than 16×16, the power penalty performance has been analysed with physical layer simulations fitted with state-of-the-art data. We assess the feasibility of three potential topologies, with different architectural optimisations: dilated Beneš, Beneš and Clos-Beneš. Quantitative analysis for switches with up to 2048 ports is presented, achieving a 1.15 dB penalty for a BER of 10⁻³, compatible with soft-decision forward error correction.
Cambridge Overseas Trust; China Scholarship Council
Photonic Interconnects Beyond High Bandwidth
The extraordinary growth of parallelism in high-performance computing requires efficient data communication for scaling compute performance. High-performance computing systems have been using photonic links for communication of large bandwidth-distance product during the last decade. Photonic interconnection networks, however, should not be a wire-for-wire replacement based on conventional electrical counterparts. Features of photonics beyond high bandwidth, such as transparent bandwidth steering, can implement important functionalities needed by applications. In another aspect, application characteristics can be exploited to design better photonic interconnects. Therefore, this thesis explores codesign opportunities at the intersection between photonic interconnect architectures and high-performance computing applications. The key accomplishments of this thesis, ranging from system level to node level, are as follows.
Chapter 2 presents a system-level architecture that leverages photonic switching to enable a reconfigurable interconnect. The architecture, called Flexfly, reconfigures the inter-group level of the widely-used Dragonfly topology using information about the application’s communication pattern. It can steal additional direct bandwidth for communication-intensive group pairs. Simulations with applications such as GTC, Nekbone and LULESH show up to 1.8x speedup over Dragonfly paired with UGAL routing, along with halved hop count and latency for cross-group messages. To demonstrate the effectiveness of our approach, we built a 32-node Flexfly prototype using a silicon photonic switch connecting four groups and demonstrated 820 ns interconnect reconfiguration time. This is the first demonstration of silicon photonic switching and bandwidth steering in a high-performance computing cluster.
Chapter 3 extends photonic switching to the node level and presents a reconfigurable silicon photonic memory interconnect for many-core architectures. The interconnect targets important memory access issues, such as network-on-chip hot-spots and non-uniform memory access. Integrated with the processor through 2.5D/3D stacking, a fast-tunable silicon photonic memory tunnel can transparently direct traffic from any off-chip memory to any on-chip interface – thus alleviating the hot-spot and non-uniform access effects. We demonstrated the operation of our proposed architecture using a tunable laser, a 4-port silicon photonic switch (four wavelength-routed memory channels) and a 4x4 mesh network-on-chip synthesized on an FPGA. The emulated system achieves a 15-ns channel switching time. Simulations based on a 12-core 4-memory model show that for such switching speeds the interconnect system can realize a 2x speedup for the STREAM benchmark in the hot-spot scenario and a reduction of execution time for data-intensive applications such as 3D stencil and K-means clustering by 23% and 17%, respectively.
Chapter 4 explores application-level characteristics that can be exploited to hide photonic path setup delays. In view of the frequent reuse of optical circuits by many applications, we proposed a circuit-cached scheme that amortizes the setup overhead by maximizing circuit reuse. In order to improve circuit “hit” rates, we developed a reuse-distance based replacement policy called “Farthest Next Use”. We further investigated the tradeoffs between the realized hit rate and energy consumption. Finally, we experimentally demonstrated the feasibility of the proposed concept using silicon photonic devices in an FPGA-controlled network testbed.
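A replacement policy of this kind can be sketched as follows. The abstract does not say how next-use distances are obtained, so this sketch assumes the full future communication trace is known in advance (an oracle, as in Belady's classical algorithm). The function name `simulate_circuit_cache` and the example trace are hypothetical.

```python
def simulate_circuit_cache(trace, capacity):
    """Count circuit hits when evicting the circuit whose next use is farthest away.

    trace    -- sequence of destination identifiers, one per message
    capacity -- maximum number of optical circuits held at once
    Assumption: the whole trace is known up front (oracle knowledge).
    """
    cache = set()
    hits = 0
    for i, dest in enumerate(trace):
        if dest in cache:
            hits += 1  # circuit already set up: no setup delay
            continue

        if len(cache) >= capacity:
            def next_use(circuit):
                # Index of the next message that reuses this circuit,
                # or infinity if it is never used again.
                for j in range(i + 1, len(trace)):
                    if trace[j] == circuit:
                        return j
                return float("inf")

            # Evict the cached circuit that will be needed latest (or never).
            cache.remove(max(cache, key=next_use))

        cache.add(dest)  # circuit miss: set up a new circuit
    return hits


# Hypothetical trace of message destinations.
trace = ["A", "B", "C", "A", "B", "D", "A", "C"]
hits_with_2 = simulate_circuit_cache(trace, capacity=2)
hits_with_3 = simulate_circuit_cache(trace, capacity=3)
```

On this trace the sketch yields 2 hits with two circuits and 4 hits with three. Holding more circuits costs more energy, which is one simple view of the hit-rate versus energy tradeoff mentioned above.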
Chapter 5 proceeds to develop an application-guided circuit-prefetch scheme. By learning temporal locality and communication patterns from upper-layer applications, the scheme not only caches a set of circuits for reuse, but also proactively prefetches circuits based on predictions. We applied this technique to communication patterns from a spectrum of science and engineering applications. The results show that setup delays due to circuit misses are significantly reduced, showing how the proposed technique can improve circuit switching in photonic interconnects.
Evaluating Techniques for Wireless Interconnected 3D Processor Arrays
This thesis investigates the viability of a wireless interconnect network for a highly parallel computer. Its main theme is to project the performance of a wireless network used to connect the processors in a parallel machine of such design. It also investigates the new design opportunities that a wireless interconnect network can offer for parallel computing.
A simulation environment is designed and implemented to carry out the tests. The results show that if the available radio spectrum is shared effectively between the building blocks of the parallel machine, there is a good chance of achieving high processor utilisation. Several factors play a major role in the performance of such a machine, among them the size of the machine, the size of the problem, and the communication and computation capabilities of each element. These factors set a limit on the number of nodes that can be engaged in some classes of tasks. The results also show promising potential for extending the idea to new architectural opportunities, which are discussed at the end of this thesis.
To build a real machine of this type the architects would need to solve a number of challenging problems including heat dissipation, delivering electric power and chip/board design; however, these issues are not part of this thesis and will be tackled in future
10Gbps Length adaptive on-chip RF serial link for Network on Chips and Multiprocessor chips applications
Atomic Transfer for Distributed Systems
Building applications and information systems increasingly means dealing with concurrency and faults stemming from distribution of system components. Atomic transactions are a well-known method for transferring the responsibility for handling concurrency and faults from developers to the software's execution environment, but incur considerable execution overhead. This dissertation investigates methods that shift some of the burden of concurrency control into the network layer, to reduce response times and increase throughput. It anticipates future programmable network devices, enabling customized high-performance network protocols.
We propose Atomic Transfer (AT), a distributed algorithm to prevent race conditions due to messages crossing on a path of network switches. Switches check request messages for conflicts with response messages traveling in the opposite direction. Conflicting requests are dropped, relieving the request's receiving host of the need to detect and handle the conflict. AT is designed to perform well under high data contention, as concurrency control effort is balanced across a network instead of being handled by the contended endpoint hosts themselves.
We use AT as the basis for a new optimistic transactional cache consistency algorithm, supporting execution of atomic applications caching shared data. We then present a scalable refinement, allowing hierarchical consistent caches with predictable performance despite high data update rates.
We give detailed I/O Automata models of our algorithms along with correctness proofs. We begin with a simplified model, assuming static network paths and no message loss, and then refine it to support dynamic network paths and safe handling of message loss.
We present a trie-based data structure for accelerating conflict-checking on switches, with benchmarks suggesting the feasibility of our approach from a performance standpoint.
A Study on Multichannel Receivers with Improved Data-Lane Expandability and Loop Linearity
Thesis (Ph.D.) -- Seoul National University Graduate School: Department of Electrical and Computer Engineering, February 2013. 정덕균.
Two types of serial data communication receivers that adopt a multichannel architecture for a high aggregate I/O bandwidth are presented. Two techniques for collaboration and sharing among channels are proposed to enhance the loop-linearity and channel-expandability of multichannel receivers, respectively.
The first proposed receiver employs a collaborative timing recovery scheme which relies on the sharing of all outputs of phase detectors (PDs) among channels to extract common information about the timing and multilevel signaling architecture of PAM-4. The shared timing information is processed by a common global loop filter and is used to update the phase of the voltage-controlled oscillator with better rejection of per-channel noise. In addition to collaborative timing recovery, a simple linearization technique for binary PDs is proposed. The technique realizes a high-rate oversampling PD while the hardware cost is equivalent to that of a conventional 2x-oversampling clock and data recovery circuit. The first receiver exploiting the collaborative timing recovery architecture is designed using 45-nm CMOS technology. A single data lane occupies a 0.195-mm² area and consumes a relatively low 17.9 mW at 6 Gb/s at 1.0 V. Therefore, the power efficiency is 2.98 mW/Gb/s. The simulated jitter is about 0.034 UI RMS given an input jitter value of 0.03 UI RMS, while the relatively constant loop bandwidth with the PD linearization technique is about 7.3 MHz regardless of the data-stream noise.
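The quoted power efficiency follows directly from the reported lane power and data rate. As a quick check of the arithmetic (the pJ/bit form is a unit identity added here for convenience, not a figure from the abstract):

```latex
\[
\frac{17.9\ \text{mW}}{6\ \text{Gb/s}}
  \approx 2.98\ \text{mW/(Gb/s)}
  = 2.98\ \text{pJ/bit}
\]
```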
Unlike the first receiver, the second proposed multichannel receiver was designed to reduce the hardware complexity of each lane. The receiver employs shared calibration logic among channels and yet achieves superior channel expandability with slim data lanes. A shared global calibration control, which is used in a forwarded clock receiver based on a multiphase delay-locked loop, accomplishes skew calibration, equalizer adaptation, and the phase lock of all channels during a calibration period, resulting in reduced hardware overhead and less area required by each data lane. The second forwarded clock receiver is designed in 90-nm CMOS technology. It achieves error-free eye openings of more than 0.5 UI across 9–28 inch Nelco 4000-6 microstrips at 4–7 Gb/s and more than 0.42 UI at data rates of up to 9 Gb/s. The data lane occupies only 0.152 mm² and consumes 69.8 mW, while the rest of the receiver occupies 0.297 mm² and consumes 56 mW at a data rate of 7 Gb/s and a supply voltage of 1.35 V.
1. Introduction
1.1 Motivations
1.2 Thesis Organization
2. Previous Receivers for Serial-Data Communications
2.1 Classification of the Links
2.2 Clocking architecture of transceivers
2.3 Components of receiver
2.3.1 Channel loss
2.3.2 Equalizer
2.3.3 Clock and data recovery circuit
2.3.3.1. Basic architecture
2.3.3.2. Phase detector
2.3.3.2.1. Linear phase detector
2.3.3.2.2. Binary phase detector
2.3.3.3. Frequency detector
2.3.3.4. Charge pump
2.3.3.5. Voltage controlled oscillator and delay-line
2.3.4 Loop dynamics of PLL
2.3.5 Loop dynamics of DLL
3. The Proposed PLL-Based Receiver with Loop Linearization Technique
3.1 Introduction
3.2 Motivation
3.3 Overview of binary phase detection
3.4 The proposed BBPD linearization technique
3.4.1 Architecture of the proposed PLL-based receiver
3.4.2 Linearization technique of binary phase detection
3.4.3 Rotational pattern of sampling phase offset
3.5 PD gain analysis and optimization
3.6 Loop Dynamics of the 2nd-order CDR
3.7 Verification with the time-accurate behavioral simulation
3.8 Summary
4. The Proposed DLL-Based Receiver with Forwarded-Clock
4.1 Introduction
4.2 Motivation
4.3 Design consideration
4.4 Architecture of the proposed forwarded-clock receiver
4.5 Circuit description
4.5.1 Analog multi-phase DLL
4.5.2 Dual-input interpolating delay cells
4.5.3 Dedicated half-rate data samplers
4.5.4 Cherry-Hooper continuous-time linear equalizer
4.5.5 Equalizer adaptation and phase-lock scheme
4.6 Measurement results
5. Conclusion
6. Bibliography
Real-Time Sensor Networks and Systems for the Industrial IoT
The Industrial Internet of Things (Industrial IoT, IIoT) has emerged as the core construct behind the various cyber-physical systems constituting a principal dimension of the fourth Industrial Revolution. While initially born as the concept behind specific industrial applications of generic IoT technologies, for the optimization of operational efficiency in automation and control, it quickly enabled the full convergence of Operational Technologies (OT) and Information Technologies (IT). The IIoT has now surpassed the traditional borders of automation and control functions in the process and manufacturing industry, shifting towards a wider domain of functions and industries, embraced under the dominant global initiatives and architectural frameworks of Industry 4.0 (or Industrie 4.0) in Germany, Industrial Internet in the US, Society 5.0 in Japan, and Made-in-China 2025 in China. As real-time embedded systems are quickly achieving ubiquity in everyday life and in industrial environments, and many processes already depend on real-time cyber-physical systems and embedded sensors, the integration of IoT with cognitive computing and real-time data exchange is essential for real-time analytics and realization of digital twins in smart environments and services under the various frameworks’ provisions. In this context, real-time sensor networks and systems for the Industrial IoT encompass multiple technologies and raise significant design, optimization, integration and exploitation challenges. The ten articles in this Special Issue describe advances in real-time sensor networks and systems that are significant enablers of the Industrial IoT paradigm. In the relevant landscape, the domain of wireless networking technologies is centrally positioned, as expected.