29 research outputs found

    A Survey of Techniques for Architecting TLBs

    A translation lookaside buffer (TLB) caches virtual-to-physical address translations and is used in systems ranging from embedded devices to high-end servers. Since the TLB is accessed very frequently and a TLB miss is extremely costly, prudent management of the TLB is important for improving the performance and energy efficiency of processors. In this paper, we present a survey of techniques for architecting and managing TLBs. We characterize the techniques across several dimensions to highlight their similarities and distinctions. We believe that this paper will be useful for chip designers, computer architects, and system engineers.
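    The survey's subject can be illustrated with a minimal, self-contained sketch (not taken from the paper): a software model of a direct-mapped TLB, where the entry count, page size, and field names are all illustrative assumptions.

        #include <stdbool.h>
        #include <stdint.h>

        /* Illustrative software model of a direct-mapped TLB (not from the paper). */
        #define TLB_ENTRIES 64               /* number of cached translations */
        #define PAGE_SHIFT  12               /* 4 KiB pages */

        typedef struct {
            uint64_t vpn;                    /* virtual page number (tag) */
            uint64_t pfn;                    /* physical frame number */
            bool     valid;
        } tlb_entry_t;

        static tlb_entry_t tlb[TLB_ENTRIES];

        /* Translate a virtual address; returns true on a TLB hit. */
        static bool tlb_lookup(uint64_t vaddr, uint64_t *paddr)
        {
            uint64_t vpn = vaddr >> PAGE_SHIFT;
            tlb_entry_t *e = &tlb[vpn % TLB_ENTRIES];       /* direct-mapped index */

            if (e->valid && e->vpn == vpn) {
                *paddr = (e->pfn << PAGE_SHIFT) | (vaddr & ((1u << PAGE_SHIFT) - 1));
                return true;                 /* hit: translation served from the TLB */
            }
            return false;                    /* miss: a page-table walk would be required */
        }

        /* After the walk, the translation is installed for future accesses. */
        static void tlb_fill(uint64_t vaddr, uint64_t pfn)
        {
            tlb_entry_t *e = &tlb[(vaddr >> PAGE_SHIFT) % TLB_ENTRIES];
            e->vpn   = vaddr >> PAGE_SHIFT;
            e->pfn   = pfn;
            e->valid = true;
        }

    Architectural techniques for TLBs then revolve around making this lookup hit more often and cost less energy per access.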

    CROSS-LAYER CUSTOMIZATION PLATFORM FOR LOW-POWER AND REAL-TIME EMBEDDED APPLICATIONS

    Modern embedded applications have become increasingly complex and diverse in their functionalities and requirements. Data processing, communication and multimedia signal processing, real-time control, and various other functionalities often need to be implemented on the same System-on-Chip (SoC) platform. The stringent power constraints and real-time guarantee requirements of these applications pose significant obstacles for traditional embedded system design methodologies. The general-purpose computing microarchitectures of these platforms are designed to achieve good performance on average, which is far from optimal for any particular application. The system must always assume worst-case scenarios, which results in significant power inefficiency and resource under-utilization. This dissertation introduces a cross-layer, application-customizable embedded platform that dynamically exploits application information and fine-tunes system components at the system-software and hardware layers. This is achieved through the close cooperation and seamless integration of the compiler, the operating system, and the hardware architecture. The compiler is responsible for extracting application regularities through static and profile-based analysis. The relevant application knowledge is propagated and utilized at run time across the system layers through judiciously introduced reconfigurability at both the OS and hardware layers. The introduced framework comprehensively covers the fundamental subsystems of memory management and multi-tasking execution control.

    WCET-aware prefetching of unlocked instruction caches: a technique for reconciling real-time guarantees and energy efficiency

    Doctoral thesis (Tese de doutorado) - Universidade Federal de Santa Catarina, Centro Tecnológico, Programa de Pós-Graduação em Engenharia de Automação e Sistemas, Florianópolis, 2015. Abstract: Embedded computing requires increasing throughput at low power budgets. It asks for growing energy efficiency when executing programs of rising complexity. Many embedded systems are also real-time systems, whose temporal correctness is asserted through schedulability analysis, which often assumes that the WCET of each task is known at design time. As a result of the growing software complexity, a significant amount of energy is spent in supplying instructions through the memory hierarchy. Since the instruction cache consumes around 40% of an embedded processor's energy and affects the energy spent in main memory, it becomes a relevant optimization target. However, since it largely impacts the WCET, cache behavior must be either constrained via cache locking or predicted by WCET analysis. To achieve energy efficiency under real-time constraints, a compiler must have extended awareness of the hardware platform. However, real-time compilers ignore energy, although they quickly determine upper bounds on the WCET, whereas embedded compilers accurately estimate energy but require time-consuming profiling. That is why this thesis proposes a unifying method to estimate memory energy consumption that is based on Abstract Interpretation, the very same mathematical framework employed for the WCET analysis of caches. The estimates exhibit derivatives that are as accurate as those obtained by profiling, but are computed 1000 times faster, making them suitable for driving code optimization through iterative improvement. Since cache locking trades energy efficiency for predictability, this thesis proposes a novel code optimization, based on software prefetching, which reduces the miss rate of unlocked instruction caches and provably does not increase the WCET. The proposed optimization is compared with a state-of-the-art partial cache locking technique for the 37 programs of the Mälardalen WCET benchmarks under 36 cache configurations and two distinct target technologies (2664 use cases). On average, to achieve an improvement of 68% in the WCET, partial cache locking requires 8% more energy. In contrast, software prefetching decreases energy consumption by 11% while improving the WCET by 15%, thereby reconciling energy efficiency and real-time guarantees.
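    The thesis's optimization inserts instruction-cache prefetches through a compiler pass that provably does not increase the WCET; the details are specific to that pass. As a loose analogy only, the sketch below shows the general principle of software prefetching, here using GCC/Clang's __builtin_prefetch data-prefetch hint with an illustrative look-ahead distance, so that the miss latency of a future access overlaps useful work.

        #include <stddef.h>

        /* Loose analogy to software prefetching (NOT the thesis's WCET-aware
         * instruction-prefetch pass): request a cache line a few iterations
         * before it is needed so its miss latency overlaps the computation. */
        #define PREFETCH_DISTANCE 8          /* illustrative look-ahead, in elements */

        long sum_with_prefetch(const long *a, size_t n)
        {
            long acc = 0;
            for (size_t i = 0; i < n; i++) {
                if (i + PREFETCH_DISTANCE < n)
                    __builtin_prefetch(&a[i + PREFETCH_DISTANCE], 0 /* read */, 1 /* low locality */);
                acc += a[i];                 /* work that overlaps the outstanding prefetch */
            }
            return acc;
        }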

    Low Power Memory/Memristor Devices and Systems

    This reprint focuses on achieving low-power computation using memristive devices. The topic was designed as a convenient reference point: it contains a mix of techniques ranging from the fundamental manufacturing of memristive devices all the way to applications such as physically unclonable functions, and it also covers perspectives on, e.g., in-memory computing, which is inextricably linked with emerging memory devices such as memristors. Finally, the reprint contains a few articles representing how other communities (from typical CMOS design to photonics) are fighting on their own fronts in the quest towards low-power computation, as a comparison with the memristor literature. We hope that readers will enjoy discovering the articles within.

    Micro-Viruses for Fast and Accurate Characterization of Voltage Margins and Variations in Multicore CPUs

    Energy-efficient computing can be largely enabled by fast and accurate identification of the pessimistic voltage margins of multicore CPU designs, and in particular by unveiling the variability of voltage margins among cores and among chips. In multi-socket systems with multiple CPUs, each consisting of several cores, the core-to-core and chip-to-chip voltage-margin variability can be effectively exploited by software layers for power-saving thread scheduling. Massive but straightforward characterization of the voltage margins of different CPU chips and their cores is an excessively long, and thus in most cases unaffordable, process if it is naively based on publicly available benchmarks or other in-house programs with long execution times. In this thesis, we follow a different strategy for characterizing the voltage margins of multicore CPUs and for measuring the voltage variability among chips and cores: we propose the employment of fast, targeted programs (diagnostic micro-viruses) that individually stress the main hardware components of a multicore CPU architecture which are known to determine the limits of voltage reduction, i.e. the Vmin values. We describe the development of micro-viruses that separately target the three cache levels and the main processing components, the integer and floating-point arithmetic units. The combined execution of the micro-viruses takes a very short time compared to regular programs, while extensively stressing the CPU cores to reveal their voltage limits when operating below the nominal voltage. To demonstrate the effectiveness of the synthetic micro-virus programs, we compare the safe Vmin reported by the combined micro-virus characterization against the corresponding safe Vmin values obtained with the SPEC CPU2006 benchmarks. The micro-virus-based characterization flow requires orders of magnitude less time, while delivering results very close to those of the excessively long characterization campaign (identical in most cases, with divergences of at most 2%) in terms of: (a) Vmin values for the different CPU chips, (b) Vmin values for the different cores within a chip, and (c) core-to-core and chip-to-chip voltage-margin variability. We evaluate the proposed micro-virus-based characterization flow (and compare it to the SPEC-based flow) on three different chips (one nominal grade and two corner parts) of AppliedMicro's X-Gene 2 micro-server family (8-core, ARMv8-based CPUs manufactured in 28nm); the reported results validate the speed and accuracy of the proposed method.
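    A hypothetical sketch in the spirit of such diagnostic micro-viruses (not the thesis's actual code) is shown below: each kernel hammers one component in a tight loop so that undervolting failures of that component surface quickly. Buffer sizes, strides, and iteration counts are illustrative assumptions.

        #include <stddef.h>
        #include <stdint.h>

        /* Stress one cache level: pick buf_size close to that level's capacity so
         * accesses keep hitting it, and stride by the line size to touch every line. */
        uint64_t stress_cache(volatile const uint8_t *buf, size_t buf_size,
                              size_t line_size, uint64_t iterations)
        {
            uint64_t sink = 0;
            for (uint64_t it = 0; it < iterations; it++)
                for (size_t i = 0; i < buf_size; i += line_size)
                    sink += buf[i];          /* repeated loads aimed at the target level */
            return sink;                     /* returned so the loop is not optimized away */
        }

        /* Stress the integer ALU: a long dependent chain of multiplies, adds, and
         * shifts keeps the execution units busy with minimal memory traffic. */
        uint64_t stress_int_alu(uint64_t seed, uint64_t iterations)
        {
            uint64_t x = seed | 1;
            for (uint64_t it = 0; it < iterations; it++) {
                x = x * 6364136223846793005ULL + 1442695040888963407ULL;
                x ^= x >> 33;
            }
            return x;
        }

    Running such kernels while progressively lowering the supply voltage exposes the Vmin of each stressed component far faster than long-running benchmark suites.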

    Memory-Aware Scheduling for Fixed Priority Hard Real-Time Computing Systems

    As a major component of a computing system, memory has been a key performance and power-consumption bottleneck in computer system design. While processor speeds have kept rising dramatically, the overall performance improvement of the entire system is limited by how fast the memory can feed instructions and data to the processing units (the so-called memory-wall problem). The increasing transistor density and the surging access demands from a rapidly growing number of processing cores have also significantly elevated the power consumption of the memory system. In addition, the interference among memory accesses from different applications and processing cores significantly degrades computation predictability, which is essential for meeting timing specifications in real-time system design. Recent IC technologies (such as 3D-IC technology) and emerging data-intensive real-time applications (such as Virtual Reality/Augmented Reality, Artificial Intelligence, and the Internet of Things) further amplify these challenges. We believe that it is not simply desirable but necessary to adopt a joint CPU/memory resource-management framework to deal with these grave challenges. In this dissertation, we focus on studying how to schedule fixed-priority hard real-time tasks with memory impacts taken into consideration. We target the fixed-priority real-time scheduling scheme since it is one of the most commonly used strategies in practical real-time applications. Specifically, we first develop an approach that takes into consideration not only the execution-time variations with cache allocations but also the relationships among task periods, showing a significant improvement in the feasibility of the system. We then study how to guarantee timing constraints for hard real-time systems under both CPU and memory thermal constraints. We first study the problem under an architecture model with a single core and its main memory individually packaged. We develop a thermal model that captures the thermal interaction between the processor and the memory, and incorporate the periodic resource server model into our scheduling framework to guarantee both the timing and thermal constraints. We further extend our research to multi-core architectures with processing cores and memory devices integrated into a single 3D platform. To the best of our knowledge, this is the first research that can guarantee hard deadline constraints for real-time tasks under temperature constraints for both processing cores and memory devices. Extensive simulation results demonstrate that our proposed scheduling approach can significantly improve the feasibility of hard real-time systems under thermal constraints.
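    The dissertation builds on fixed-priority schedulability analysis before adding cache, memory, and thermal effects. As background only, here is a minimal sketch of the classical response-time test for preemptive fixed-priority scheduling (task parameters and names are illustrative; the dissertation's actual analysis extends well beyond this).

        #include <math.h>
        #include <stdbool.h>
        #include <stddef.h>

        /* Classical response-time analysis for preemptive fixed-priority scheduling.
         * Tasks are sorted by priority (index 0 = highest); C = WCET, T = period,
         * D = relative deadline, all in the same time unit. */
        typedef struct { double C, T, D; } task_t;

        bool fixed_priority_schedulable(const task_t *tasks, size_t n)
        {
            for (size_t i = 0; i < n; i++) {
                double R = tasks[i].C;               /* initial guess: own WCET */
                for (;;) {
                    double interference = 0.0;
                    for (size_t j = 0; j < i; j++)   /* higher-priority tasks only */
                        interference += ceil(R / tasks[j].T) * tasks[j].C;

                    double R_next = tasks[i].C + interference;
                    if (R_next > tasks[i].D)
                        return false;                /* response time exceeds the deadline */
                    if (R_next == R)
                        break;                       /* fixed point reached */
                    R = R_next;
                }
            }
            return true;
        }

    For example, three tasks with (C, T, D) = (1, 4, 4), (2, 6, 6), and (3, 12, 12) pass this test with worst-case response times 1, 3, and 10; analyses like those in the dissertation additionally account for cache allocations and temperature when bounding execution times.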

    14th SC@RUG 2017 proceedings 2016-2017

