
    Adaptive Microarchitectural Optimizations to Improve Performance and Security of Multi-Core Architectures

    With the current technological barriers, microarchitectural optimizations are increasingly important to ensure performance scalability of computing systems. The shift to multi-core architectures increases the demands on the memory system and amplifies the role of microarchitectural optimizations in performance improvement. In a multi-core system, microarchitectural resources, such as the cache, are usually shared to maximize utilization, but sharing can also lead to contention and lower performance. This can be mitigated through partitioning of shared caches. However, microarchitectural optimizations, long assumed to be fundamentally secure, can be used in side-channel attacks to exploit secrets such as cryptographic keys. Timing-based side-channels exploit predictable timing variations due to the interaction with microarchitectural optimizations during program execution. Going forward, there is a strong need to leverage microarchitectural optimizations for performance without compromising security. This thesis contributes three adaptive microarchitectural resource management optimizations to improve the security and/or performance of multi-core architectures, and a systematization of knowledge of timing-based side-channel attacks.

    We observe that high-performance cache partitioning in a multi-core system must meet three requirements: i) fine granularity of partitions, ii) locality-aware placement and iii) frequent changes. These requirements lead to high overheads for current centralized partitioning solutions, especially as the number of cores in the system increases. To address this problem, we present an adaptive and scalable cache partitioning solution (DELTA) using a distributed and asynchronous allocation algorithm. Allocations occur through core-to-core challenges, in which applications with larger performance benefit gain cache capacity. The solution is implementable in hardware, due to its low computational complexity, and can scale to large core counts.

    According to our analysis, better performance can be achieved by coordinating multiple optimizations for different resources, e.g., off-chip bandwidth and cache, but this is challenging due to the increased number of possible allocations that need to be evaluated. Based on these observations, we present a solution (CBP) for coordinated management of three optimizations: cache partitioning, bandwidth partitioning and prefetching. Efficient allocations, considering the inter-resource interactions and trade-offs, are achieved using local resource managers to limit the solution space.

    The continuously growing number of side-channel attacks leveraging microarchitectural optimizations prompts us to review attacks and defenses to understand the vulnerabilities of different microarchitectural optimizations. We identify four root causes of timing-based side-channel attacks: determinism, sharing, access violation and information flow. Our key insight is that eliminating any of the exploited root causes, in any of the attack steps, is enough to provide protection. Based on our framework, we present a systematization of the attacks and defenses on a wide range of microarchitectural optimizations, which highlights their key similarities.

    Shared caches are an attractive attack surface for side-channel attacks, while defenses need to be efficient since the cache is crucial for performance. To address this issue, we present an adaptive and scalable cache partitioning solution (SCALE) for protection against cache side-channel attacks. The solution leverages randomness and provides quantifiable, information-theoretic security guarantees using differential privacy. It closes the performance gap to a state-of-the-art non-secure allocation policy for a mix of secure and non-secure applications.
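
    The abstract gives no pseudocode for DELTA's challenge mechanism, but the idea of distributed, pairwise reallocation can be sketched. In the Python sketch below, every name and the single marginal-benefit metric are illustrative assumptions, not the thesis's actual design: each core owns some cache ways, and a challenge transfers one way whenever the challenger's estimated gain exceeds the defender's estimated loss.

        import random

        class Core:
            """Per-core state; the single marginal-benefit number is an
            illustrative stand-in for whatever utility metric the real
            hardware would track."""
            def __init__(self, core_id, ways, marginal_benefit):
                self.core_id = core_id
                self.ways = ways
                self.marginal_benefit = marginal_benefit

        def challenge(challenger, defender):
            """One core-to-core challenge: the challenger takes one way from
            the defender only if its estimated gain exceeds the defender's
            estimated loss."""
            if defender.ways > 1 and challenger.marginal_benefit > defender.marginal_benefit:
                defender.ways -= 1
                challenger.ways += 1
                return True
            return False

        def simulate(cores, rounds=1000):
            # No central arbiter: each round a random core challenges a random
            # peer, which is what lets the scheme scale with core count.
            for _ in range(rounds):
                a, b = random.sample(cores, 2)
                challenge(a, b)

        cores = [Core(i, ways=4, marginal_benefit=random.random()) for i in range(8)]
        simulate(cores)
        print({c.core_id: c.ways for c in cores})  # higher-benefit cores hold more ways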
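
    Similarly, the combination of randomness and differential privacy behind SCALE can be illustrated with a toy allocator. The sketch below is an assumption-laden illustration of the general idea (Laplace-noised demands driving proportional partitioning), not SCALE's published mechanism:

        import random

        def laplace_noise(scale):
            # The difference of two independent exponentials is Laplace(0, scale);
            # stdlib-only sampling with no edge cases.
            return random.expovariate(1.0 / scale) - random.expovariate(1.0 / scale)

        def noisy_partition(demands, total_ways, epsilon=1.0, sensitivity=1.0):
            """Allocate ways proportionally to *noised* demands, so the
            observable partition sizes reveal little about any one core's
            true demand (the differential-privacy idea, in miniature)."""
            noised = [max(d + laplace_noise(sensitivity / epsilon), 0.0) for d in demands]
            total = sum(noised) or 1.0
            # A real allocator would reconcile rounding so sizes sum exactly
            # to total_ways; this sketch skips that bookkeeping.
            return [max(1, round(total_ways * n / total)) for n in noised]

        print(noisy_partition([10, 3, 3, 8], total_ways=16))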

    Towards Intelligent Runtime Framework for Distributed Heterogeneous Systems

    Scientific applications strive for increased memory and computing performance, requiring massive amounts of data and time to produce results. Applications utilize large-scale, parallel computing platforms with advanced architectures to accommodate their needs. However, developing performance-portable applications for modern, heterogeneous platforms requires significant effort and expertise in both the application and systems domains. This is even more relevant for unstructured applications, whose workflow is not statically predictable due to their heavily data-dependent nature. One possible solution to this problem is the introduction of an intelligent Domain-Specific Language (iDSL) that transparently helps to maintain correctness, hides the idiosyncrasies of low-level hardware, and scales applications. An iDSL includes domain-specific language constructs, a compilation toolchain, and a runtime providing task scheduling, data placement, and workload balancing across and within heterogeneous nodes. In this work, we focus on the runtime framework. We introduce a novel design and extension of a runtime framework, the Parallel Runtime Environment for Multicore Applications. In response to the ever-increasing intra/inter-node concurrency, the runtime system supports efficient task scheduling and workload balancing at both levels while allowing the development of custom policies. Moreover, the new framework provides abstractions supporting the utilization of heterogeneous distributed nodes consisting of CPUs and GPUs, and is extensible to other devices. We demonstrate that by utilizing this work, an application (or the iDSL) can scale its performance on heterogeneous exascale-era supercomputers with minimal effort. A future goal for this framework (out of the scope of this thesis) is to be integrated with machine learning to further improve its decision-making and performance. As a bridge to this goal, since the framework is under development, we experiment with data from nuclear physics particle accelerators and demonstrate the significant improvements achieved by utilizing machine learning in the hit-based track reconstruction process.
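
    As a rough illustration of what a pluggable placement policy over heterogeneous devices might look like, here is a minimal Python sketch; the Device abstraction, cost units and least_loaded_policy are invented for illustration and are not the framework's actual API:

        class Device:
            def __init__(self, name, speed):
                self.name, self.speed = name, speed  # relative throughput estimate
                self.load = 0.0                      # outstanding work, arbitrary units

        def least_loaded_policy(task_cost, devices):
            # Pluggable placement policy: pick the device that finishes soonest.
            return min(devices, key=lambda d: (d.load + task_cost) / d.speed)

        def schedule(tasks, devices, policy=least_loaded_policy):
            """Map (name, cost) tasks onto heterogeneous devices; swapping the
            policy function is the 'custom policies' idea in the abstract."""
            placement = []
            for name, cost in tasks:
                dev = policy(cost, devices)
                dev.load += cost
                placement.append((name, dev.name))
            return placement

        devices = [Device("cpu0", 1.0), Device("cpu1", 1.0), Device("gpu0", 4.0)]
        tasks = [("assemble", 2.0), ("solve", 8.0), ("io", 1.0), ("smooth", 8.0)]
        print(schedule(tasks, devices))  # heavy tasks gravitate to the fast device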

    CELLO: Compiler-Assisted Efficient Load-Load Ordering in Data-Race-Free Regions

    Efficient Total Store Order (TSO) implementations allow loads to execute speculatively out of order. To detect order violations, the load queue (LQ) holds all the in-flight loads and is searched on every invalidation and cache eviction. Moreover, in a simultaneous multithreading (SMT) processor, stores also search the LQ when writing to cache. LQ searches entail considerable energy consumption. Furthermore, the processor stalls when the LQ is full or its ports are busy. Hence, the LQ is a critical structure in terms of both energy and performance. In this work, we observe that the use of the LQ can be dramatically optimized under the guarantees of the data-race-free (DRF) property imposed by modern programming languages. To leverage this observation, we propose CELLO, a software-hardware co-design in which the compiler detects memory operations in DRF regions and the hardware optimizes their execution by safely skipping LQ searches without violating the TSO consistency model. Furthermore, CELLO allows DRF loads to be removed from the LQ earlier, as they do not need to be searched to detect consistency violations. With minimal hardware overhead, we show that an 8-core, 2-way SMT processor with CELLO avoids almost all conservative searches of the LQ and significantly reduces its occupancy. CELLO i) reduces LQ energy expenditure by 33% on average (up to 53%) while performing 2.8% better on average (up to 18.6%) than the baseline system, and ii) allows the LQ to shrink from 192 to only 80 entries, reducing its energy expenditure by as much as 69% while performing on par with a mainstream LQ implementation.
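
    A behavioral sketch may help convey the mechanism. The Python model below (not RTL, and with invented names) shows the two effects the abstract describes: compiler-marked DRF loads are excluded from invalidation-triggered searches and can leave the load queue early:

        class Load:
            def __init__(self, addr, drf):
                self.addr = addr
                self.drf = drf  # compiler-proved to be in a data-race-free region?

        class LoadQueue:
            def __init__(self):
                self.entries = []

            def insert(self, load):
                self.entries.append(load)

            def retire_drf_early(self):
                """DRF loads never need to be searched for ordering violations,
                so they can leave the LQ before commit, shrinking occupancy."""
                self.entries = [l for l in self.entries if not l.drf]

            def on_invalidation(self, addr):
                """Only non-DRF in-flight loads must be searched (the energy
                cost the paper attacks); a hit means a possible violation."""
                return any(l.addr == addr for l in self.entries if not l.drf)

        lq = LoadQueue()
        lq.insert(Load(0x40, drf=True))
        lq.insert(Load(0x80, drf=False))
        lq.retire_drf_early()            # the DRF load leaves early
        print(lq.on_invalidation(0x80))  # True: possible order violation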

    Structured parallelism discovery with hybrid static-dynamic analysis and evaluation technique

    Parallel computer architectures have dominated the computing landscape for the past two decades, a trend that is only expected to continue and intensify with increasing specialization and heterogeneity. This creates huge pressure across the software stack to produce programming languages, libraries, frameworks and tools that will efficiently exploit the capabilities of parallel computers, not only for new software but also for revitalizing existing sequential code. Automatic parallelization, despite decades of research, has had limited success in transforming sequential software to take advantage of efficient parallel execution. This thesis investigates three approaches that use commutativity analysis as the enabler for parallelization, which has the potential to overcome limitations of traditional techniques. We introduce the concept of liveness-based commutativity for sequential loops. We examine the use of a practical analysis utilizing liveness-based commutativity in a symbolic execution framework. Symbolic execution represents input values as groups of constraints, consequently deriving the output as a function of the input and enabling the identification of further program properties. We employ this feature to develop an analysis and discern commutativity properties between loop iterations. We study the application of this approach on loops taken from real-world programs in the OLDEN and NAS Parallel Benchmark (NPB) suites, and identify its limitations and related overheads. Informed by these findings, we develop Dynamic Commutativity Analysis (DCA), a new technique that leverages profiling information from program execution with specific input sets. Using profiling information, we track liveness information and detect loop commutativity by examining the code's live-out values. We evaluate DCA against almost 1400 loops of the NPB suite, discovering 86% of them to be parallelizable. Comparing our results against dependence-based methods, we match the detection efficacy of two dynamic approaches and outperform three static ones. Additionally, DCA is able to automatically detect parallelism in loops which iterate over Pointer-Linked Data Structures (PLDSs), taken from a wide range of benchmarks used in the literature, where all other techniques we considered failed. Parallelizing the discovered loops, our methodology achieves an average speedup of 3.6× across NPB (and up to 55×) and up to 36.9× for the PLDS-based loops on a 72-core host. We also demonstrate that our methodology, despite relying on specific input values for profiling each program, is able to correctly identify parallelism that is valid for all potential input sets. Lastly, we develop a methodology to utilize liveness-based commutativity, as implemented in DCA, to detect latent loop parallelism in the shape of patterns. Our approach applies a series of transformations which subsequently enable multiple applications of DCA over the generated multi-loop code section, and matches its loop commutativity outcomes against the expected criteria for each pattern. Applying our methodology on sets of sequential loops, we are able to identify well-known parallel patterns (i.e., maps, reductions and scans). This extends the scope of parallelism detection to loops, such as those performing scan operations, which cannot be determined to be parallelizable by simply evaluating liveness-based commutativity conditions on their original form.
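
    The core test behind commutativity-based detection can be sketched concretely. In the toy Python check below, comparing whole final state stands in for the live-out values DCA actually tracks, and the two loop bodies are invented examples of a commutative and a non-commutative iteration:

        import copy

        def iterations_commute(body, state, i, j):
            """Run iterations i and j in both orders on copies of the state and
            compare the results; equal outcomes mean the pair commutes."""
            s1, s2 = copy.deepcopy(state), copy.deepcopy(state)
            body(s1, i); body(s1, j)
            body(s2, j); body(s2, i)
            return s1 == s2

        def reduction_body(state, i):
            state["total"] += state["data"][i]  # commutative: a sum reduction

        def ordered_body(state, i):
            state["log"].append(i)              # not commutative: order observable

        data = {"data": [3, 1, 4, 1, 5], "total": 0}
        print(iterations_commute(reduction_body, data, 0, 1))        # True
        print(iterations_commute(ordered_body, {"log": []}, 0, 1))   # False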

    It is too hot in here! A performance, energy and heat aware scheduler for Asymmetric multiprocessing processors in embedded systems.

    Modern architectures present in self-powered devices such as mobile phones or tablet computers propose the use of asymmetric processors that allow either energy-efficient or performant computation on the same SoC. For energy-efficiency and performance reasons, the asymmetry resides in differences in CPU microarchitecture design and results in diverging raw computing capability. Other components, such as the processor memory subsystem, also show differences, resulting in different memory transaction timing. Moreover, based on a bus-snoop protocol, cache coherency between processors comes with a peculiarity in memory latency depending on the processors' operating frequencies. All these differences lead to challenging decisions on both application schedulability and processor operating frequencies. In addition, because of the small form factor of such embedded systems, these devices generally cannot afford active cooling systems; thermal mitigation therefore relies on dynamic software solutions. Current operating systems for embedded systems, such as Linux or Android, do not consider all these particularities, and as such they often fail to satisfy user expectations of a powerful device with long battery life. To remedy this situation, this thesis proposes a unified approach to deliver both high-performance and energy-efficient computation, considering the memory subsystem and all computation units available in the system. Performance is maximized even when the device is under heavy thermal constraints. The proposed unified solution is based on accurate models targeting both performance and thermal behaviour, and resides at the operating system's kernel level to manage all running applications in a global manner. In particular, the performance model considers both the computation part and the memory subsystem of the symmetric or asymmetric processors present in embedded devices, while the thermal model relies on the accurate physical thermal properties of the device. Using these models, application schedulability and processor frequency scaling decisions, to either maximize performance or energy efficiency within a thermal budget, are extensively studied. To cover a large range of application behaviour, both models are built and designed using a generative workload that considers fine-grain details of the underlying microarchitecture of the SoC, so the approach can be derived and applied to multiple devices with little effort. Extended evaluation on real-world benchmarks for high-performance and general computing, as well as common applications targeting the mobile and tablet market, shows the accuracy and completeness of the models used in this unified approach to deliver high performance and energy efficiency under tight thermal constraints for embedded devices.
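
    The kind of decision such models enable can be sketched with toy numbers. Below, performance is modeled as proportional to frequency and dynamic power with a cubic-in-frequency proxy; all constants and cluster names are invented for illustration, whereas the thesis builds calibrated models of the real SoC:

        # Illustrative big/little clusters; parameters are toy values.
        BIG = {"perf_per_hz": 1.0, "power_per_hz3": 2.0e-27,
               "freqs": [0.6e9, 1.2e9, 1.8e9]}
        LITTLE = {"perf_per_hz": 0.6, "power_per_hz3": 0.7e-27,
                  "freqs": [0.6e9, 1.0e9, 1.4e9]}

        def best_setting(clusters, power_budget_w):
            """Pick (cluster, frequency) maximizing modeled performance while
            the modeled dynamic power (~f^3) stays within the thermal budget."""
            best = None
            for cluster in clusters:
                for f in cluster["freqs"]:
                    power = cluster["power_per_hz3"] * f ** 3
                    perf = cluster["perf_per_hz"] * f
                    if power <= power_budget_w and (best is None or perf > best[0]):
                        best = (perf, cluster, f)
            return best

        # Under a tight 2 W budget the little cluster at full speed wins,
        # despite the big cluster's higher raw capability.
        perf, _, f = best_setting([BIG, LITTLE], power_budget_w=2.0)
        print(f"freq={f / 1e9:.1f} GHz, modeled perf={perf:.2e}")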

    Performance, memory efficiency and programmability: the ambitious triptych of combining vertex-centricity with HPC

    The field of graph processing has grown significantly due to the flexibility and wide applicability of the graph data structure. In the meantime, so has interest from the community in developing new approaches to graph processing applications. In 2010, Google introduced the vertex-centric programming model through their framework Pregel. This consists of expressing computation from the perspective of a vertex, whilst inter-vertex communications are achieved via data exchanges along incoming and outgoing edges, using the message-passing abstraction provided. Pregel's high-level programming interface, designed around a set of simple functions, provides ease of programmability to the user. The aim is to enable the development of graph processing applications without requiring expertise in optimisation or parallel programming. Such challenges are instead abstracted from the user and offloaded to the underlying framework. However, fine-grained synchronisation, unpredictable memory access patterns and multiple sources of load imbalance make it difficult to implement the vertex-centric model efficiently on high-performance computing platforms without sacrificing programmability. This research focuses on combining vertex-centricity and High-Performance Computing (HPC), resulting in the development of a shared-memory framework, iPregel, which demonstrates that performance and memory efficiency similar to that of non-vertex-centric approaches can be achieved while preserving the programmability benefits of vertex-centricity. Non-volatile memory is then explored to extend single-node capabilities, during which multiple versions of iPregel are implemented to experiment with various data-movement strategies. Then, distributed-memory parallelism is investigated to overcome the resource limitations of single-node processing. A second framework, named DiP, ports applicable iPregel optimisations to distributed memory while prioritising performance and high scalability. This research has resulted in a set of techniques and optimisations illustrated through the shared-memory framework iPregel and the distributed-memory framework DiP. The former closes a gap of several orders of magnitude in both performance and memory efficiency, and is even able to process a graph of 750 billion edges using non-volatile memory. The latter proves that this competitiveness can also be scaled beyond a single node, enabling the processing of the largest graph generated in this research, comprising 1.6 trillion edges. Most importantly, both frameworks achieve these performance and capability gains whilst also preserving programmability, which is the cornerstone of the vertex-centric programming model. This research therefore demonstrates that, by combining vertex-centricity and High-Performance Computing (HPC), it is possible to maintain performance, memory efficiency and programmability.
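
    The vertex-centric model itself is easy to show in miniature. The following toy synchronous engine and the classic maximum-propagation vertex program follow the Pregel scheme described above (compute per vertex, messages along edges, halt when quiescent); it is a didactic sketch, not iPregel's API:

        def run_supersteps(graph, values, compute, max_steps=20):
            """Toy synchronous vertex-centric engine: each superstep, every
            vertex consumes its messages, updates its value and emits messages
            along its outgoing edges; the engine then delivers the messages."""
            inbox = {v: [] for v in graph}
            for step in range(max_steps):
                outbox = {v: [] for v in graph}
                changed = False
                for v in graph:
                    new_val, msgs = compute(v, values[v], inbox[v], graph[v], step)
                    changed |= new_val != values[v]
                    values[v] = new_val
                    for dst, m in msgs:
                        outbox[dst].append(m)
                inbox = outbox
                if not changed and not any(outbox.values()):
                    break  # nothing in flight: all vertices have "voted to halt"
            return values

        def max_propagation(v, val, messages, neighbours, step):
            # Textbook example: every vertex converges to the global maximum.
            new_val = max([val] + messages)
            announce = step == 0 or new_val != val
            msgs = [(n, new_val) for n in neighbours] if announce else []
            return new_val, msgs

        graph = {"a": ["b"], "b": ["a", "c"], "c": ["b"]}
        values = {"a": 3, "b": 6, "c": 2}
        print(run_supersteps(graph, values, max_propagation))  # all become 6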

    Naval Postgraduate School Academic Catalog - February 2023


    RE-LANG---A Parallel-by-default Programming Language

    In recent years, programming language features such as lightweight threads have gained popularity in the software development workflow. Our research takes a critical look at these recent trends, rethinking them through an academic lens. We propose a construct called "smart assignment", supported by rewriting semantics, which enables a novel parallel-by-default programming paradigm. We present a new programming language, RE-LANG, that implements this feature. Specifically, we demonstrate how the design philosophy of RE-LANG makes imperative, parallel programming more developer-friendly. We discuss the implementation of the language and showcase performance benchmarks, as well as overhead analysis, to demonstrate its efficiency.
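
    The abstract does not spell out the semantics of smart assignment, so the following Python emulation is speculative: it approximates a parallel-by-default binding with a future that starts evaluating as soon as it is assigned and synchronises implicitly on first use.

        from concurrent.futures import ThreadPoolExecutor

        pool = ThreadPoolExecutor()

        class Smart:
            """A 'smart assignment' stand-in: binding a value starts its
            computation in parallel immediately; the first use blocks until
            the result is ready. (A real implementation would use lightweight
            threads and compiler support; Python's GIL makes this a demo of
            the semantics, not of speedup.)"""
            def __init__(self, fn, *args):
                self._future = pool.submit(fn, *args)

            def get(self):
                return self._future.result()  # implicit synchronisation point

        def expensive_sum(n):
            return sum(range(n))

        # Sequential-looking code, parallel by default: both bindings overlap.
        x = Smart(expensive_sum, 10_000_000)
        y = Smart(expensive_sum, 20_000_000)
        print(x.get() + y.get())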