1,657 research outputs found

    Instruction-set customization for multi-tasking embedded systems

    Ph.D. (Doctor of Philosophy) thesis

    Automated design of domain-specific custom instructions


    Automatic design of domain-specific instructions for low-power processors

    This paper explores hardware specialization of low-power processors to improve performance and energy efficiency. Our main contribution is an automated framework that analyzes instruction sequences of applications within a domain at the loop body level and identifies exactly- and partially-matching sequences across applications that can become custom instructions. Our framework transforms sequences into a new code abstraction, a Merging Diagram, that improves similarity identification, clusters alike groups of potential custom instructions to effectively reduce the search space, and selects merged custom instructions to efficiently exploit the available customizable area. For a set of 11 media applications, our fast framework generates instructions that significantly improve the energy-delay product and speedup, achieving more than double the savings compared to a technique analyzing sequences within basic blocks. This paper shows that partially-matched custom instructions, which do not significantly increase design time, are crucial to achieving higher energy efficiency at limited hardware areas.
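    To make the selection step concrete, below is a minimal C++ sketch of a greedy, area-budgeted selection of candidate custom instructions ranked by estimated savings per unit area. The struct fields and scoring are hypothetical illustrations; the paper's actual framework selects merged candidates via its Merging Diagram clustering, which this sketch does not reproduce.

```cpp
#include <algorithm>
#include <vector>

// Hypothetical candidate custom instruction produced by merging
// similar instruction sequences across applications.
struct Candidate {
    double savings;  // estimated energy-delay-product reduction
    double area;     // hardware area cost of the merged datapath (> 0)
};

// Greedy knapsack-style heuristic: take candidates in decreasing
// savings-per-area order until the customizable area is exhausted.
std::vector<Candidate> selectCustomInstructions(std::vector<Candidate> cands,
                                                double areaBudget) {
    std::sort(cands.begin(), cands.end(),
              [](const Candidate &a, const Candidate &b) {
                  return a.savings / a.area > b.savings / b.area;
              });
    std::vector<Candidate> chosen;
    for (const Candidate &c : cands) {
        if (c.area <= areaBudget) {
            areaBudget -= c.area;
            chosen.push_back(c);
        }
    }
    return chosen;
}
```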

    Design methodologies for instruction-set extensible processors

    Ph.D. (Doctor of Philosophy) thesis

    GeNN: a code generation framework for accelerated brain simulations

    Large-scale numerical simulations of detailed brain circuit models are important for identifying hypotheses on brain functions and testing their consistency and plausibility. An ongoing challenge for simulating realistic models is, however, computational speed. In this paper, we present the GeNN (GPU-enhanced Neuronal Networks) framework, which aims to facilitate the use of graphics accelerators for computational models of large-scale neuronal networks to address this challenge. GeNN is an open source library that generates code to accelerate the execution of network simulations on NVIDIA GPUs, through a flexible and extensible interface, which does not require in-depth technical knowledge from the users. We present performance benchmarks showing that a 200-fold speedup compared to a single core of a CPU can be achieved for a network of one million conductance-based Hodgkin-Huxley neurons, but that for other models the speedup can differ. GeNN is available for Linux, Mac OS X and Windows platforms. The source code, user manual, tutorials, Wiki, in-depth example projects and all other related information can be found on the project website http://genn-team.github.io/genn/.
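    As context for the per-neuron work such simulations parallelize, here is a minimal C++ sketch of one forward-Euler update of a conductance-based Hodgkin-Huxley neuron using the classic parameter set. It is illustrative only: GeNN generates CUDA kernels that apply this kind of state update to every neuron in parallel, and its built-in models use their own parameterizations and integration schemes.

```cpp
#include <cmath>

// Classic Hodgkin-Huxley constants (mS/cm^2, mV, uF/cm^2).
constexpr double gNa = 120.0, ENa = 50.0;
constexpr double gK  = 36.0,  EK  = -77.0;
constexpr double gL  = 0.3,   EL  = -54.387;
constexpr double Cm  = 1.0;

struct HHState { double V, m, h, n; };  // membrane potential and gating variables

// One Euler step; dt in ms, I in uA/cm^2. The removable singularities
// of the rate functions (e.g. at V = -40 mV) are ignored in this sketch.
void hhStep(HHState &s, double I, double dt) {
    const double V  = s.V;
    const double am = 0.1 * (V + 40.0) / (1.0 - std::exp(-(V + 40.0) / 10.0));
    const double bm = 4.0 * std::exp(-(V + 65.0) / 18.0);
    const double ah = 0.07 * std::exp(-(V + 65.0) / 20.0);
    const double bh = 1.0 / (1.0 + std::exp(-(V + 35.0) / 10.0));
    const double an = 0.01 * (V + 55.0) / (1.0 - std::exp(-(V + 55.0) / 10.0));
    const double bn = 0.125 * std::exp(-(V + 65.0) / 80.0);

    const double INa = gNa * s.m * s.m * s.m * s.h * (V - ENa);  // sodium current
    const double IK  = gK * s.n * s.n * s.n * s.n * (V - EK);    // potassium current
    const double IL  = gL * (V - EL);                            // leak current

    s.V += dt * (I - INa - IK - IL) / Cm;
    s.m += dt * (am * (1.0 - s.m) - bm * s.m);
    s.h += dt * (ah * (1.0 - s.h) - bh * s.h);
    s.n += dt * (an * (1.0 - s.n) - bn * s.n);
}
```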

    Comparative Study of Keccak SHA-3 Implementations

    This paper conducts an extensive comparative study of state-of-the-art solutions for implementing the SHA-3 hash function. SHA-3, a pivotal component in modern cryptography, has spawned numerous implementations across diverse platforms and technologies. This research aims to provide valuable insights into selecting and optimizing Keccak SHA-3 implementations. Our study encompasses an in-depth analysis of hardware, software, and software–hardware (hybrid) solutions. We assess the strengths, weaknesses, and performance metrics of each approach. Critical factors, including computational efficiency, scalability, and flexibility, are evaluated across different use cases. We investigate how each implementation performs in terms of speed and resource utilization. This research aims to improve the knowledge of cryptographic systems, aiding in the informed design and deployment of efficient cryptographic solutions. By providing a comprehensive overview of SHA-3 implementations, this study offers a clear understanding of the available options and equips professionals and researchers with the necessary insights to make informed decisions in their cryptographic endeavors.
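    For readers unfamiliar with the permutation under study, the sketch below shows the theta step of Keccak-f[1600] (one of the five step mappings applied in each round, per FIPS 202) in portable C++. It is a textbook rendering for illustration, not code taken from any of the surveyed implementations.

```cpp
#include <cstdint>

// Rotate a 64-bit lane left by n bits (0 < n < 64).
static inline uint64_t rotl64(uint64_t v, unsigned n) {
    return (v << n) | (v >> (64 - n));
}

// Theta step: XOR each lane with the parity of one neighbouring
// column and the rotated parity of another. A is the 5x5 Keccak
// state of 64-bit lanes, indexed A[x][y].
void theta(uint64_t A[5][5]) {
    uint64_t C[5], D[5];
    for (int x = 0; x < 5; ++x)
        C[x] = A[x][0] ^ A[x][1] ^ A[x][2] ^ A[x][3] ^ A[x][4];
    for (int x = 0; x < 5; ++x)
        D[x] = C[(x + 4) % 5] ^ rotl64(C[(x + 1) % 5], 1);
    for (int x = 0; x < 5; ++x)
        for (int y = 0; y < 5; ++y)
            A[x][y] ^= D[x];
}
```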

    Efficient Implementation of Particle Filters in Application-Specific Instruction-Set Processor

    This thesis considers the problem of implementing particle filters (PFs) in Application-Specific Instruction-set Processors (ASIPs). Due to the diversity and complexity of PFs, implementing them requires both computational efficiency and design flexibility. ASIP design can offer an interesting degree of design flexibility; hence, our research focuses on improving the throughput of PFs in this flexible ASIP design environment. A general approach is first proposed to characterize the computational complexity of PFs. Since PFs can be used for a wide variety of applications, we employ two types of blocks, application-specific and algorithm-specific, to distinguish the properties of PFs. In accordance with profiling results, we identify the likelihood processing and resampling processing blocks as the main bottlenecks among the algorithm-specific blocks.

    We explore the optimization of these two blocks at the algorithmic and architectural levels. The algorithmic level is at a high level of abstraction and therefore has high potential to offer speed and throughput improvements. Hence, this work begins at the algorithmic level by analyzing the complexity of the likelihood processing and resampling processing blocks, then proceeds with their simplification and modification. We simplify the likelihood processing block by proposing a uniform quantization scheme, the Uniform Quantization Likelihood Evaluation (UQLE). The results show a significant performance improvement without loss of accuracy: the worst case of the UQLE software implementation in fixed-point arithmetic with 32 quantized intervals achieves a 23.7× average speedup over the software implementation of ELE. We also propose two novel resampling algorithms to replace the sequential Systematic Resampling (SR) algorithm in PFs: the reformulated SR and Parallel Systematic Resampling (PSR) algorithms. The reformulated SR algorithm combines a group of loops into a single loop to facilitate parallel implementation in an ASIP. The PSR algorithm makes the iterations independent, thus allowing the resampling loop to execute in parallel; in addition, our proposed PSR algorithm has lower computational complexity than the SR algorithm.

    At the architectural level, ASIPs are appealing for the implementation of PFs because they strike a good balance between computational efficiency and design flexibility: they can provide considerable throughput improvement through the inclusion of custom instructions, while retaining the ease of programming of general-purpose processors. Hence, after identifying the bottlenecks of PFs in the algorithm-specific blocks, we describe customized instructions for the UQLE, reformulated SR, and PSR algorithms in an ASIP. These instructions provide significantly higher throughput than a pure software implementation running on a general-purpose processor. The custom-instruction implementation of UQLE with 32 intervals achieves a 34× speedup over the worst case of its software implementation, at a cost of 3.75 K additional gates. An implementation of the reformulated SR algorithm, evaluated with four weights calculated in parallel and eight categories defined by uniformly distributed bounds that are compared simultaneously, achieves a 23.9× speedup over the sequential SR algorithm on a general-purpose processor, at a cost of only 54 K additional gates. For the PSR algorithm, four custom instructions, configured to support four weights input in parallel, lead to a 53.4× speedup over the floating-point SR implementation on a general-purpose processor, at a cost of 47.3 K additional gates. Finally, we consider the specific application of video tracking and implement a histogram-based PF in an ASIP. We identify the histogram calculation as the main bottleneck among the application-specific blocks, and therefore propose a Parallel Array Histogram Architecture (PAHA) engine for accelerating histogram calculation in ASIPs. Implementation results show that a 16-way PAHA achieves a 43.75× speedup over its software implementation on a general-purpose processor.
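    For reference, here is a minimal C++ sketch of the baseline sequential systematic resampling (SR) algorithm that the thesis reformulates. Note the loop-carried dependence on the running cumulative weight, which is exactly what the reformulated SR and PSR algorithms remove to enable parallel execution; the thesis's own formulations differ in detail.

```cpp
#include <random>
#include <vector>

// Sequential systematic resampling: given normalised particle
// weights w (summing to 1), return for each of the N output
// particles the index of the ancestor particle it copies.
std::vector<int> systematicResample(const std::vector<double> &w,
                                    std::mt19937 &rng) {
    const int N = static_cast<int>(w.size());
    std::uniform_real_distribution<double> uni(0.0, 1.0 / N);
    const double u0 = uni(rng);  // one random offset shared by all strata
    std::vector<int> ancestors(N);
    double cumulative = w[0];
    int j = 0;
    for (int i = 0; i < N; ++i) {
        const double u = u0 + static_cast<double>(i) / N;  // stratum position
        // Sequential bottleneck: each iteration depends on the
        // cumulative sum advanced by all previous iterations.
        while (u > cumulative && j < N - 1)
            cumulative += w[++j];
        ancestors[i] = j;
    }
    return ancestors;
}
```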

    Energy Measurements of High Performance Computing Systems: From Instrumentation to Analysis

    Energy efficiency is a major criterion for computing in general and High Performance Computing in particular. When optimizing for energy efficiency, it is essential to measure the underlying metric: energy consumption. To fully leverage energy measurements, their quality needs to be well-understood. To that end, this thesis provides a rigorous evaluation of various energy measurement techniques. I demonstrate how the deliberate selection of instrumentation points, sensors, and analog processing schemes can enhance the temporal and spatial resolution while preserving a well-known accuracy. Further, I evaluate a scalable energy measurement solution for production HPC systems and address its shortcomings. Such high-resolution and large-scale measurements present challenges regarding the management of large volumes of generated metric data. I address these challenges with a scalable infrastructure for collecting, storing, and analyzing metric data. With this infrastructure, I also introduce a novel persistent storage scheme for metric time series data, which allows efficient queries for aggregate timelines. To ensure that it satisfies the demanding requirements for scalable power measurements, I conduct an extensive performance evaluation and describe a productive deployment of the infrastructure. Finally, I describe different approaches and practical examples of analyses based on energy measurement data. In particular, I focus on the combination of energy measurements and application performance traces. However, interweaving fine-grained power recordings and application events requires accurately synchronized timestamps on both sides. To overcome this obstacle, I develop a resilient and automated technique for time synchronization, which utilizes cross-correlation of a specifically influenced power measurement signal.
    Ultimately, this careful combination of sophisticated energy measurements and application performance traces yields a detailed insight into application and system energy efficiency at full-scale HPC systems and down to millisecond-range regions.

    Table of contents:
    1 Introduction
    2 Background and Related Work
      2.1 Basic Concepts of Energy Measurements
        2.1.1 Basics of Metrology
        2.1.2 Measuring Voltage, Current, and Power
        2.1.3 Measurement Signal Conditioning and Analog-to-Digital Conversion
      2.2 Power Measurements for Computing Systems
        2.2.1 Measuring Compute Nodes using External Power Meters
        2.2.2 Custom Solutions for Measuring Compute Node Power
        2.2.3 Measurement Solutions of System Integrators
        2.2.4 CPU Energy Counters
        2.2.5 Using Models to Determine Energy Consumption
      2.3 Processing of Power Measurement Data
        2.3.1 Time Series Databases
        2.3.2 Data Center Monitoring Systems
      2.4 Influences on the Energy Consumption of Computing Systems
        2.4.1 Processor Power Consumption Breakdown
        2.4.2 Energy-Efficient Hardware Configuration
      2.5 HPC Performance and Energy Analysis
        2.5.1 Performance Analysis Techniques
        2.5.2 HPC Performance Analysis Tools
        2.5.3 Combining Application and Power Measurements
      2.6 Conclusion
    3 Evaluating and Improving Energy Measurements
      3.1 Description of the Systems Under Test
      3.2 Instrumentation Points and Measurement Sensors
        3.2.1 Analog Measurement at Voltage Regulators
        3.2.2 Instrumentation with Hall Effect Transducers
        3.2.3 Modular Instrumentation of DC Consumers
        3.2.4 Optimal Wiring for Shunt-Based Measurements
        3.2.5 Node-Level Instrumentation for HPC Systems
      3.3 Analog Signal Conditioning and Analog-to-Digital Conversion
        3.3.1 Signal Amplification
        3.3.2 Analog Filtering and Analog-to-Digital Conversion
        3.3.3 Integrated Solutions for High-Resolution Measurement
      3.4 Accuracy Evaluation and Calibration
        3.4.1 Synthetic Workloads for Evaluating Power Measurements
        3.4.2 Improving and Evaluating the Accuracy of a Single-Node Measuring System
        3.4.3 Absolute Accuracy Evaluation of a Many-Node Measuring System
      3.5 Evaluating Temporal Granularity and Energy Correctness
        3.5.1 Measurement Signal Bandwidth at Different Instrumentation Points
        3.5.2 Retaining Energy Correctness During Digital Processing
      3.6 Evaluating CPU Energy Counters
        3.6.1 Energy Readouts with RAPL
        3.6.2 Methodology
        3.6.3 RAPL on Intel Sandy Bridge-EP
        3.6.4 RAPL on Intel Haswell-EP and Skylake-SP
      3.7 Conclusion
    4 A Scalable Infrastructure for Processing Power Measurement Data
      4.1 Requirements for Power Measurement Data Processing
      4.2 Concepts and Implementation of Measurement Data Management
        4.2.1 Message-Based Communication between Agents
        4.2.2 Protocols
        4.2.3 Application Programming Interfaces
        4.2.4 Efficient Metric Time Series Storage and Retrieval
        4.2.5 Hierarchical Timeline Aggregation
      4.3 Performance Evaluation
        4.3.1 Benchmark Hardware Specifications
        4.3.2 Throughput in Symmetric Configuration with Replication
        4.3.3 Throughput with Many Data Sources and Single Consumers
        4.3.4 Temporary Storage in Message Queues
        4.3.5 Persistent Metric Time Series Request Performance
        4.3.6 Performance Comparison with Contemporary Time Series Storage Solutions
        4.3.7 Practical Usage of MetricQ
      4.4 Conclusion
    5 Energy Efficiency Analysis
      5.1 General Energy Efficiency Analysis Scenarios
        5.1.1 Live Visualization of Power Measurements
        5.1.2 Visualization of Long-Term Measurements
        5.1.3 Integration in Application Performance Traces
        5.1.4 Graphical Analysis of Application Power Traces
      5.2 Correlating Power Measurements with Application Events
        5.2.1 Challenges for Time Synchronization of Power Measurements
        5.2.2 Reliable Automatic Time Synchronization with Correlation Sequences
        5.2.3 Creating a Correlation Signal on a Power Measurement Channel
        5.2.4 Processing the Correlation Signal and Measured Power Values
        5.2.5 Common Oversampling of the Correlation Signals at Different Rates
        5.2.6 Evaluation of Correlation and Time Synchronization
      5.3 Use Cases for Application Power Traces
        5.3.1 Analyzing Complex Power Anomalies
        5.3.2 Quantifying C-State Transitions
        5.3.3 Measuring the Dynamic Power Consumption of HPC Applications
      5.4 Conclusion
    6 Summary and Outlook
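    To illustrate the idea behind the correlation-based time synchronization, here is a minimal brute-force C++ sketch that locates the lag at which a known reference (correlation) sequence best matches a recorded power trace. The function name and interface are hypothetical; the thesis's actual technique additionally handles oversampling at different rates and robustness concerns.

```cpp
#include <cstddef>
#include <vector>

// Return the sample offset at which `ref` best aligns with `power`,
// i.e. the lag maximising their cross-correlation. Assumes
// ref.size() <= power.size().
std::ptrdiff_t bestLag(const std::vector<double> &power,
                       const std::vector<double> &ref) {
    const std::ptrdiff_t maxLag =
        static_cast<std::ptrdiff_t>(power.size()) -
        static_cast<std::ptrdiff_t>(ref.size());
    std::ptrdiff_t best = 0;
    double bestScore = -1e300;
    for (std::ptrdiff_t lag = 0; lag <= maxLag; ++lag) {
        double score = 0.0;  // dot product of the two signals at this lag
        for (std::size_t i = 0; i < ref.size(); ++i)
            score += power[static_cast<std::size_t>(lag) + i] * ref[i];
        if (score > bestScore) { bestScore = score; best = lag; }
    }
    return best;  // offset mapping power-meter time onto application time
}
```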