21 research outputs found

    Principled Approaches to Last-Level Cache Management

    Memory is a critical component of all computing systems, and it represents a fundamental performance and energy bottleneck. Ideally, memory aspects such as energy cost, performance, and the cost of implementing management techniques would scale together with the size of computing systems; unfortunately, this is not the case. With upcoming trends in applications and new memory technologies, scaling becomes an even bigger problem, aggravating the performance bottleneck that memory represents. The memory hierarchy was proposed to alleviate the problem. Each level in the hierarchy tends to have a lower cost per bit, a larger capacity, and a higher access time than the previous level. Ideally, all data would be stored in the fastest level of memory; unfortunately, faster memory technologies tend to have a higher manufacturing cost, which often limits their capacity. The design challenge is to determine which data is frequently used and to store it in the faster levels of memory. A cache is a small, fast, on-chip memory. Any data stored in main memory can be stored in the cache. Many programs typically access data that they have accessed previously. Taking advantage of this behavior, a copy of frequently accessed data is kept in the cache to provide faster access the next time it is requested. Due to capacity constraints, the frequently reused data is unlikely to fit in the cache in its entirety; cache management policies therefore decide which data is kept in the cache and which is left to other levels of the memory hierarchy. Under an efficient cache management policy, a large share of memory requests is serviced from a fast on-chip cache. The disparity in access latency between the last-level cache and main memory motivates the search for efficient cache management policies.
A great deal of recently proposed work strives to use cache capacity in the way most favorable to performance. Related work optimizes cache performance along different dimensions, e.g., reducing miss rate, power consumption, storage overhead, or access latency. Our work focuses on improving the performance of last-level caches by designing policies based on principles adapted from other areas of interest. In this dissertation, we focus on several aspects of cache management policies. We first introduce a space-efficient placement and promotion policy whose goal is to minimize the updates to the replacement policy state on each cache access. We then introduce a mechanism that predicts whether a block in the cache will be reused; it feeds different features of a block to the predictor in order to increase the correlation between a previous access and a future access. Finally, we introduce a technique that tweaks traditional cache indexing, providing fast accesses to the vast majority of requests in the presence of a slow-access memory technology such as DRAM.
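As a minimal illustration of what a cache management policy decides, the sketch below simulates a small fully associative cache with least-recently-used (LRU) replacement. LRU is a standard baseline, not one of the policies proposed in the dissertation; the class name `LRUCache`, the helper `hit_rate`, the capacities, and the address trace are all assumptions made up for illustration.

```python
from collections import OrderedDict


class LRUCache:
    """Toy fully associative cache with LRU replacement (illustrative only)."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.blocks = OrderedDict()  # keys ordered from least to most recently used
        self.hits = 0
        self.misses = 0

    def access(self, address):
        """Access a block address; return True on a hit, False on a miss."""
        if address in self.blocks:
            self.blocks.move_to_end(address)  # promote to most recently used
            self.hits += 1
            return True
        self.misses += 1
        if len(self.blocks) >= self.capacity:
            self.blocks.popitem(last=False)  # evict the least recently used block
        self.blocks[address] = True  # place the new block at the MRU position
        return False


def hit_rate(trace, capacity):
    """Replay a trace of block addresses and return the fraction of hits."""
    cache = LRUCache(capacity)
    for address in trace:
        cache.access(address)
    return cache.hits / len(trace)


# Hypothetical trace: blocks 1 and 2 are reused often, 3, 4 and 5 only rarely.
trace = [1, 2, 3, 1, 2, 4, 1, 2, 5, 1, 2, 3]
print(hit_rate(trace, capacity=3))  # 0.5
print(hit_rate(trace, capacity=2))  # 0.0
```

On this trace, a three-block cache services half of the requests, while a two-block cache under LRU services none, because each rarely used block evicts a frequently used one just before it is needed again. That gap is the kind of behavior that placement, promotion, and reuse-prediction policies aim to improve.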

    Methoden zur applikationsspezifischen Effizienzsteigerung adaptiver Prozessorplattformen (Methods for Application-Specific Efficiency Improvement of Adaptive Processor Platforms)

    General-purpose processors are optimized for the average use case, so the available resources are not used efficiently. This thesis investigates to what extent a general-purpose processor can be adapted to individual applications in order to increase efficiency. The adaptation can be performed at runtime by the processor or the runtime system, based on the current system parameters, to achieve an efficiency gain.

    Design Space Exploration and Resource Management of Multi/Many-Core Systems

    The increasing demand for processing a larger number of applications and their related data on computing platforms has resulted in reliance on multi-/many-core chips, as they facilitate parallel processing. However, these platforms are also expected to be energy-efficient and reliable, and they need to perform secure computations in the interest of the whole community. This book provides perspectives on these aspects from leading researchers in terms of state-of-the-art contributions and upcoming trends.

    Intrinsically Evolvable Artificial Neural Networks

    Dedicated hardware implementations of neural networks promise faster, lower-power operation than software implementations executing on processors. Unfortunately, most custom hardware implementations do not support intrinsic training of these networks on-chip. The training is typically done using offline software simulations, and the resulting network is synthesized and targeted to the hardware offline. The FPGA design presented here facilitates on-chip intrinsic training of artificial neural networks. Block-based neural networks (BbNNs), the type of artificial neural network implemented here, are grid-based networks of neuron blocks. These networks are trained using genetic algorithms to simultaneously optimize the network structure and the internal synaptic parameters. The design supports online structure and parameter updates, and is an intrinsically evolvable BbNN platform supporting functional-level hardware evolution. Functional-level evolvable hardware (EHW) uses evolutionary algorithms to evolve interconnections and internal parameters of functional modules in reconfigurable computing (RC) systems such as FPGAs. Functional modules can be any hardware modules, such as multipliers, adders, and trigonometric functions. In the implementation presented, the functional module is a neuron block. The designed platform is suitable for applications in dynamic environments, and can be adapted and retrained online. The online training capability is demonstrated using a case study. A performance characterization model for RC implementations of BbNNs is also presented.

    Enabling multi-threaded execution and improved memory access in fine-grain near-data processing systems

    Advisor: Marco Antonio Zanata Alves. Thesis (doctorate) - Universidade Federal do Paraná, Setor de Ciências Exatas, Programa de Pós-Graduação em Informática. Defense: Curitiba, 08/07/2022. Includes references. Area of concentration: Computer Science.
    Abstract: Applications that deal with large amounts of data are increasingly popular. However, traditional computation-centric architectures are ill-equipped to handle such applications, as they cause much data movement across the system due to their near-constant data accesses. This leads to inefficient processing, with long execution times and high energy consumption. Issues caused by this disparity are widely known as the memory wall. Starting in the late 1990s, the idea of moving portions of the computation close to the memory, when beneficial, began to be considered. This concept became known as Near-Data Processing (NDP) and gained more attention in the early 2010s with the advent of Through-Silicon Via (TSV) technology, which enabled straightforward integration of processing logic and data storage in the same chip. 3D-stacked memories, which vertically integrate storage and logic, have since become commercially available, and computer architecture researchers have reacted by proposing many designs that place processing elements on the logic layer typically found in those devices. This thesis proposes the Vector-In-Memory Architecture (VIMA), a 3D-stacked memory-based NDP architecture that implements processing in the memory by placing Functional Units (FUs) on the logic layer of those devices.
Our design uses vector functional units and a cache memory for dedicated storage, and advances the state of the art by implementing near-data precise exceptions and enabling near-data multi-threading. We simulate the execution of several common data-driven applications on our architecture, and our results show that the proposed design, with only a single processing core and VIMA, is able to outperform a modern 16-thread architecture by at least 2× when dealing with large dataset sizes. Moreover, this speedup is achieved while reducing energy consumption by at least 75% according to our estimates. In comparison to its most closely related state-of-the-art work, VIMA is able to reduce the execution time of data-streaming applications by at least 32%.

    Minimisation of Energy Consumption Variance in Manufacturing through Production Schedule Manipulation

    In the manufacturing sector, despite the vital role it plays, the consumption of energy is rarely considered as a manufacturing process variable during the scheduling of production jobs. Due to both physical and contractual limits, the local power infrastructure can only deliver a finite amount of electrical energy at any one time. When energy usage is not considered during the scheduling process, this limited capacity can be inefficiently utilised or exceeded, potentially resulting in damage to the infrastructure. To address this, this thesis presents a novel schedule optimisation system. Here, a Genetic Algorithm is used to optimise the start times of manufacturing jobs such that the variance in production line energy consumption is minimised, while ensuring that typical hard and soft schedule constraints are maintained. Prediction accuracy is assured through the use of a novel library-based system which is able to provide historical energy data at a high temporal granularity, while accounting for the influence of machine conditions on the energy consumption. In cases where there is insufficient historical data for a particular manufacturing job, the library-based system is able to analyse the available energy data and utilise machine learning to generate temporary synthetic profiles compensated for probable machine conditions. The performance of the entire proposed system is optimised through significant experimentation and analysis, which allows an optimised schedule to be produced within an acceptable amount of time. Testing in a lab-based production line demonstrates that the optimised schedule is able to significantly reduce the energy consumption variance produced by a production schedule, while providing a highly accurate prediction of the energy consumption during the schedule's execution.
The proposed system is also demonstrated to be easily expandable, allowing it to consider local renewable energy generation and energy storage, along with objectives such as the minimisation of peak energy consumption and of energy drawn from the National Grid.
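To make the optimisation objective concrete, the sketch below computes the variance of the aggregate energy consumption for a given set of job start times. It is a simplified stand-in, not the thesis's Genetic Algorithm or its library-based prediction system: the per-job energy profiles, the discrete time-slot model, and all function names are assumptions for illustration, an exhaustive search replaces the GA, and hard and soft schedule constraints are not modelled.

```python
from itertools import product
from statistics import pvariance


def line_consumption(profiles, starts, horizon):
    """Sum the per-slot energy of all jobs, each shifted to its start slot."""
    total = [0.0] * horizon
    for profile, start in zip(profiles, starts):
        for offset, energy in enumerate(profile):
            total[start + offset] += energy
    return total


def consumption_variance(profiles, starts, horizon):
    """Objective to minimise: variance of the production-line energy consumption."""
    return pvariance(line_consumption(profiles, starts, horizon))


def best_schedule(profiles, horizon):
    """Try every feasible combination of start slots (toy-sized inputs only)."""
    ranges = [range(horizon - len(p) + 1) for p in profiles]
    return min(
        product(*ranges),
        key=lambda starts: consumption_variance(profiles, starts, horizon),
    )


# Hypothetical energy use per time slot for two identical two-slot jobs.
profiles = [[4.0, 4.0], [4.0, 4.0]]
horizon = 4

naive = (0, 0)  # both jobs start together, creating a consumption peak
best = best_schedule(profiles, horizon)

print(naive, consumption_variance(profiles, naive, horizon))  # (0, 0) 16.0
print(best, consumption_variance(profiles, best, horizon))    # (0, 2) 0.0
```

Starting both jobs at once concentrates the load in the first two slots, whereas running them back to back flattens the consumption completely. The exhaustive search grows exponentially with the number of jobs, which is one reason a heuristic such as a Genetic Algorithm is a natural choice at realistic scale.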

    A hybrid rate control mechanism for forwarding and congestion control in named data network

    Named Data Networking (NDN) is an emerging Internet architecture that employs a pull-based, in-path caching, hop-by-hop, and multi-path transport architecture. Transport algorithms that use conventional paradigms therefore do not work correctly in the NDN environment, since the content source location changes frequently. These changes raise forwarding and congestion control problems, and they directly affect the link utilization, fairness, and stability of the network. This study proposes a Hybrid Rate Control Mechanism (HRCM) that controls the forwarding rate and link congestion to enhance network scalability, stability, and fairness. HRCM consists of three schemes, namely Shaping Deficit Weight Round Robin (SDWRR), Queue-delay Parallel Multipath (QPM), and Explicit Control Agile-based conservative window adaptation (EC-Agile). The SDWRR scheme schedules different flows in router interfaces by fairly detecting and notifying link congestion. The QPM scheme is designed to forward Interest packets to all available paths that have idle bandwidth. The EC-Agile scheme controls forwarding rates by examining each packet received. The proposed HRCM was evaluated by comparing it with two other mechanisms, namely Practical Congestion Control (PCON) and Hop-by-hop Interest Shaping (HIS), through ndnSIM simulation. The findings show that HRCM enhances the forwarding rate and fairness. HRCM outperforms HIS and PCON in terms of throughput by 75%, delay by 20%, queue length by 55%, link utilization by 41%, fairness by 20%, and download time by 20%. The proposed HRCM contributes an enhanced forwarding rate and fairness in NDN with different types of traffic flow. Thus, the SDWRR, QPM, and EC-Agile schemes can be used in monitoring, controlling, and managing congestion and forwarding for the Internet of the future.

    Software for Exascale Computing - SPPEXA 2016-2019

    This open access book summarizes the research done and results obtained in the second funding phase of the Priority Program 1648 "Software for Exascale Computing" (SPPEXA) of the German Research Foundation (DFG), presented at the SPPEXA Symposium in Dresden during October 21-23, 2019. In that respect, it both represents a continuation of Vol. 113 in Springer’s series Lecture Notes in Computational Science and Engineering, the corresponding report of SPPEXA’s first funding phase, and provides an overview of SPPEXA’s contributions towards exascale computing in today's supercomputer technology. The individual chapters address one or more of the research directions (1) computational algorithms, (2) system software, (3) application software, (4) data management and exploration, (5) programming, and (6) software tools. The book has an interdisciplinary appeal: scholars from computational sub-fields in computer science, mathematics, physics, or engineering will find it of particular interest.

    A survey of the application of soft computing to investment and financial trading


    Elastic computation placement in edge-based environments

    Today, technologies such as machine learning, virtual reality, and the Internet of Things are integrated into end-user applications more and more frequently. These technologies demand high computational capabilities. Mobile devices in particular have limited resources in terms of execution performance and battery life. The offloading paradigm provides a solution to this problem by transferring computationally intensive parts of applications to more powerful resources, such as servers or cloud infrastructure. Recently, a new computation paradigm called edge computing has arisen, which exploits the huge number of end-user devices in the modern computing landscape. These devices encompass smartphones, tablets, microcontrollers, and PCs. In edge computing, devices cooperate with each other while avoiding cloud infrastructure. Due to the proximity among the participating devices, the communication latencies for offloading are reduced. However, edge computing brings new challenges in the form of device fluctuation, unreliability, and heterogeneity, which negatively affect the resource elasticity. As a solution, this thesis proposes a computation placement framework that provides an abstraction for computation and resource elasticity in edge-based environments. The design is middleware-based, encompasses heterogeneous platforms, and supports easy integration of existing applications. It is composed of two parts: the Tasklet system and the edge support layer. The Tasklet system is a flexible framework for computation placement on heterogeneous resources. It introduces closed units of computation that can be tailored to generic applications. The edge support layer handles the characteristics of edge resources. It copes with fluctuation and unreliability by applying reactive and proactive task migration. Furthermore, the performance heterogeneity and the consequent bottlenecks are handled by two edge-specific task partitioning approaches.
As a proof of concept, the thesis presents a fully-fledged prototype of the design, which is evaluated comprehensively in a real-world testbed. The evaluation shows that the design is able to substantially improve the resource elasticity in edge-based environments.