Separation logic for high-level synthesis
High-level synthesis (HLS) promises a significant shortening of the digital hardware design cycle by raising the abstraction level of the design entry to high-level languages such as C/C++. However, applications using dynamic, pointer-based data structures remain difficult to implement well, yet such constructs are widely used in software. Automated optimisations that exploit the memory bandwidth of dedicated hardware by distributing the application data over separate on-chip memories and parallelising the implementation are often ineffective in the presence of dynamic data structures, due to the lack of an automated analysis that disambiguates pointer-based memory accesses. This thesis takes a step towards closing this gap. We explore recent advances in separation logic, a rigorous mathematical framework that enables formal reasoning about the memory accesses of heap-manipulating programs. We develop a static analysis that automatically splits heap-allocated data structures into provably disjoint regions. Our algorithm focuses on dynamic data structures accessed in loops and is accompanied by automated source-to-source transformations which enable loop parallelisation and physical memory partitioning by off-the-shelf HLS tools.
We then extend the scope of our technique to pointer-based, memory-intensive implementations that require access to an off-chip memory. The extended HLS design aid generates parallel on-chip multi-cache architectures. It uses the proven disjointness of memory accesses to serve non-overlapping memory regions with private caches. It also identifies regions which remain shared after parallelisation and serves these with parallel caches backed by a coherency mechanism and synchronisation, resulting in automatically specialised memory systems. We show up to 15x acceleration from heap partitioning, parallelisation and the insertion of the custom cache system in demonstrably practical applications.
Tigris: Architecture and Algorithms for 3D Perception in Point Clouds
Machine perception applications are increasingly moving toward manipulating
and processing 3D point clouds. This paper focuses on point cloud registration,
a key primitive of 3D data processing widely used in high-level tasks such as
odometry, simultaneous localization and mapping, and 3D reconstruction. As
these applications are routinely deployed in energy-constrained environments,
real-time and energy-efficient point cloud registration is critical.
We present Tigris, an algorithm-architecture co-designed system specialized
for point cloud registration. Through an extensive exploration of the
registration pipeline design space, we find that, while different design points
make vastly different trade-offs between accuracy and performance, KD-tree
search is a common performance bottleneck, and thus is an ideal candidate for
architectural specialization. While KD-tree search is inherently sequential, we
propose an acceleration-amenable data structure and search algorithm that
exposes different forms of parallelism of KD-tree search in the context of
point cloud registration. The co-designed accelerator systematically exploits
the parallelism while incorporating a set of architectural techniques that
further improve the accelerator efficiency. Overall, Tigris achieves
77.2× speedup and 7.4× power reduction in KD-tree search over an
RTX 2080 Ti GPU, which translates to a 41.7% registration performance
improvement and a 3.0× power reduction.
Comment: Published at MICRO-52 (52nd IEEE/ACM International Symposium on
Microarchitecture); Tiancheng Xu and Boyuan Tian are co-primary authors
Building Professionally-Based Communities of Learning among Faculty, Students, and Practitioners
Residential and non-residential “communities of learning” have been used within institutions of higher education as formal methods to enhance interactions among individuals that ultimately help learning. Typically, these communities have included student-to-student and faculty-to-student interactions within residential living areas, teams in a core of courses, or teams of students within a course. If students are to develop into leaders within their respective disciplines, an additional component that should be integrated into communities of learning is practitioners. The objectives of our paper are to describe: 1) communities of learning and why they should be established for all students to enhance learning, 2) how to integrate a community of learning into its respective community of practice, 3) models of communities of learning and their characteristics, and 4) what roles natural resource practitioners, faculty, and students can play in developing and maintaining non-residential communities of learning to meet academic and professional objectives. Ultimately, the integration of faculty, students, and practitioners for developing and maintaining learning communities will help create an educational culture that produces life-long learners and leaders in natural resources.
Separation Logic-Assisted Code Transformations for Efficient High-Level Synthesis
Abstract—The capabilities of modern FPGAs permit the mapping of increasingly complex applications into reconfigurable hardware. High-level synthesis (HLS) promises a significant shortening of the FPGA design cycle by raising the abstraction level of the design entry to high-level languages such as C/C++. Applications using dynamic, pointer-based data structures and dynamic memory allocation, however, remain difficult to implement well, yet such constructs are widely used in software. Automated optimizations that aim to leverage the increased memory bandwidth of FPGAs by distributing the application data over separate banks of on-chip memory are often ineffective in the presence of dynamic data structures, due to the lack of an automated analysis of pointer-based memory accesses. In this work, we take a step towards closing this gap. We present a static analysis for pointer-manipulating programs which automatically splits heap-allocated data structures into disjoint, independent regions. The analysis leverages recent advances in separation logic, a theoretical framework for reasoning about heap-allocated data which has been successfully applied in recent software verification tools. Our algorithm focuses on dynamic data structures accessed in loops and is accompanied by automated source-to-source transformations which enable automatic loop parallelization and memory partitioning by off-the-shelf HLS tools. We demonstrate the successful loop parallelization and memory partitioning by our tool flow using three real-life applications which build, traverse, update and dispose dynamically allocated data structures. Our case studies, comparing the automatically parallelized to the non-parallelized HLS implementations, show an average latency reduction by a factor of 2.5 across our benchmarks.
Keywords—FPGA; high-level synthesis; memory system; dynamic data structures; separation logic; static analysis
FPGA-BASED K-MEANS CLUSTERING USING TREE-BASED DATA STRUCTURES
K-means clustering is a popular technique for partitioning a data set into subsets of similar features. Due to their simple control flow and inherent fine-grain parallelism, K-means algorithms are well suited for hardware implementations, such as on field programmable gate arrays (FPGAs), to accelerate the computationally intensive calculation. However, the available hardware resources in massively parallel implementations are easily exhausted for large problem sizes. This paper presents an FPGA implementation of an efficient variant of K-means clustering which prunes the search space using a binary kd-tree data structure to reduce the computational burden. Our implementation uses on-chip dynamic memory allocation to ensure efficient use of memory resources. We describe the trade-off between data-level parallelism and search space reduction, which comes at the expense of increased control overhead. A data-sensitive analysis shows that our approach requires up to five times fewer computational FPGA resources than a conventional massively parallel implementation for the same throughput constraint.