Search CORE

28 research outputs found

Discrete Fractional Clock Generation for Systems-on-FPGA

Author: Köhler Steffen
Preußer Thomas B.
Publication venue: Technische Universität Dresden
Publication date: 14/11/2012
Field of study

This article describes an inexpensive way of clock generation for FPGA-based circuit cores, which reduces the number of external clock sources and eases synchronization problems. We introduce a modified version of the BRESENHAM line drawing algorithm and use it outside its original application domain for the rational division of clocks. An optimized hardware design for BRESENHAM-based clock division is presented and the quality of its output is evaluated. The optimal initialization conditions in terms of phase shift and jitter are identified and formally proven. Finally, the complexity characteristics of a generic synthesizable VHDL design based on this algorithm are examined and verified by synthesis examples. Special attention is paid to implementation results in conjunction with different FPGA families

Technische Universität Dresden: Qucosa

Background of the Analysis of a Fully-Scalable Digital Fractional Clock Divider

Author: Preußer Thomas B.
Publication venue: Technische Universität Dresden
Publication date: 14/11/2012
Field of study

It was previously shown that the BRESENHAM algorithm is well-suited for digital fractional clock generation. Specifically, it proved to be the optimal approximation of a desired clock in terms of the switching edges provided by an available reference clock. Moreover, some synthesis results for hardwired dividers on Altera FPGAs showed that this technique for clock division achieves a high performance often at or close to the maximum frequency supported by the devices for moderate bit widths of up to 16 bits. This paper extends the investigations on the clock division by the BRESENHAM algorithm. It draws out the limits encountered by the existing implementation for both FPGA and VLSI realizations. A rather unconventional adoption of the carry-save representation combined with a soft-threshold comparison is proposed to circumvent these limitations. The resulting design is described and evaluated. Mathematically appealing results on the quality of the approximation achieved by this approach are presented. The underlying proofs and technical details are provided in the appendix

Technische Universität Dresden: Qucosa

Putting Queens in Carry Chains

Author: Nägel Bernd
Preußer Thomas B.
Spallek Rainer G.
Publication venue: Technische Universität Dresden
Publication date: 14/11/2012
Field of study

This paper describes an FPGA implementation of a solution-counting solver for the N-Queens Puzzle. The proposed algorithmic mapping utilizes the fast carrychain logic found on modern FPGA architectures in order to achieve a regular and efficient design. From an initial full chessboard mapping, several optimization strategies are explored. Also, the infrastructure is described, which we have constructed for the computation of the currently unknown solution count of the 26- Queens Puzzle. Finally, we compare the performance of our used concrete FPGA device mappings also in contrast to general-purpose CPUs

Technische Universität Dresden: Qucosa

The SHAP Microarchitecture and Java Virtual Machine

Author: Preußer Thomas B.
Reichel Peter
Zabel Martin
Publication venue: Technische Universität Dresden
Publication date: 14/11/2012
Field of study

This report presents the SHAP platform consisting of its microarchitecture and its implementation of the Java Virtual Machine (JVM). Like quite a few other embedded implementations of the Java platform, the SHAP microarchitecture relies on an instruction set architecture based on Java bytecode. Unlike them, it, however, features a design with well-encapsulated components autonomously managing their duties on rather high abstraction levels. Thus, permanent runtime duties are transferred from the central computing core to concurrently working components so that it can actually spent a larger fraction of time executing application code. The degree of parallelity between the application and the runtime implementation is increased. Currently, the stack and heap management including the automatic garbage collection are implemented this way. After detailing the design of the microarchitecture, the SHAP implementation of the Java Virtual Machine is described. A major focus is laid on the presentation of the layout and the use of the runtime data structures representing the various language abstractions provided by Java. Also, the boot sequence starting the JVM is described

Technische Universität Dresden: Qucosa

An Embedded Garbage Collection Module with Support for Multiple Mutators and Weak References

Author: Preußer Thomas B.
Reichel Peter
Spallek Rainer G.
Publication venue: Technische Universität Dresden
Publication date: 14/11/2012
Field of study

This report details the design of a garbage collection (GC) module, which introduces modern GC features to the domain of embedded implementations. The described design supports weak references and feeds reference queues. Its architecture allows multiple concurrent application cores operating as mutators on the shared memory managed by the GC module. The garbage collection is exact and fully concurrent so as to enable the uninterrupted computational progress of the mutators. It combines a distributed root marking with a centralized heap scan of the managed memory. It features a novel mark-and-copy GC strategy on a segmented memory, which thereby overcomes both the tremendous space overhead of two-space copying and the compaction race of mark-and-compact approaches. The proposed GC architecture has been practically implemented and proven using the embedded bytecode processor SHAP as a sample testbed. The synthesis results for settings up to three SHAP mutator cores are given and online functional measurements are presented. Basic performance dependencies on the system configuration are evaluated

Technische Universität Dresden: Qucosa

MicroRec: Efficient Recommendation Inference by Hardware and Data Structure Solutions

Author: Alonso Gustavo
Feng Liang
He Zhenhao
Jiang Wenqi
Li Yong
Liu Tongxuan
Preußer Thomas B.
Zeng Kai
Zhang Ce
Zhang Jiansong
Zhang Shuai
Zhou Jingren
Publication venue
Publication date: 19/02/2021
Field of study

Deep neural networks are widely used in personalized recommendation systems. Unlike regular DNN inference workloads, recommendation inference is memory-bound due to the many random memory accesses needed to lookup the embedding tables. The inference is also heavily constrained in terms of latency because producing a recommendation for a user must be done in about tens of milliseconds. In this paper, we propose MicroRec, a high-performance inference engine for recommendation systems. MicroRec accelerates recommendation inference by (1) redesigning the data structures involved in the embeddings to reduce the number of lookups needed and (2) taking advantage of the availability of High-Bandwidth Memory (HBM) in FPGA accelerators to tackle the latency by enabling parallel lookups. We have implemented the resulting design on an FPGA board including the embedding lookup step as well as the complete inference process. Compared to the optimized CPU baseline (16 vCPU, AVX2-enabled), MicroRec achieves 13.8~14.7x speedup on embedding lookup alone and 2.5$~5.4x speedup for the entire recommendation inference in terms of throughput. As for latency, CPU-based engines needs milliseconds for inferring a recommendation while MicroRec only takes microseconds, a significant advantage in real-time recommendation systems.Comment: Accepted by MLSys'21 (the 4th Conference on Machine Learning and Systems

arXiv.org e-Print Archive

Repository for Publications and Research Data