
    Tackling the Bus Turnaround Overhead in Real-Time SDRAM Controllers

    Synchronous dynamic random access memories (SDRAMs) are widely employed in multi- and many-core platforms due to their high density and low cost. Nevertheless, their benefits come at the price of a complex two-stage access protocol, which reflects their bank-based structure and an internal level of explicitly managed caching. In scenarios in which requestors demand real-time guarantees, these features pose a predictability challenge, and several SDRAM controllers have been proposed to tackle it. In this context, recent research shows that a combination of bank privatization and an open-row policy (exploiting the caching beyond the boundary of a single request) is an effective way to address the problem. However, such an approach uncovered a new challenge: the data bus turnaround overhead. In SDRAMs, a single data bus is shared by read and write operations. Alternating read and write operations is consequently highly undesirable, as the data bus must remain idle during each turnaround. Therefore, in this article, we propose an SDRAM controller that reorders read and write commands so as to minimize data bus turnarounds. Moreover, we compare our approach analytically and experimentally with existing real-time SDRAM controllers from both the worst-case latency and the power consumption perspectives.
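    The abstract describes the reordering idea but not the scheduling policy itself. The Python sketch below is a minimal, hypothetical illustration of the general principle of serving reads and writes in same-direction batches so the data bus changes direction as rarely as possible; the queueing policy, batch size, and request format are assumptions, not the controller proposed in the paper.

```python
from collections import deque

# Illustrative sketch only: a toy command scheduler that serves queued read and
# write requests in same-direction batches to reduce data bus turnarounds.
# Batch size, queue policy, and request format are assumptions.

def schedule(requests, batch_size=4):
    """requests: iterable of ('R' or 'W', request_id). Returns a service order."""
    reads, writes = deque(), deque()
    for op, rid in requests:
        (reads if op == 'R' else writes).append((op, rid))

    order, direction = [], 'R'
    while reads or writes:
        queue = reads if direction == 'R' else writes
        if not queue:                      # nothing left in this direction
            direction = 'W' if direction == 'R' else 'R'
            continue
        for _ in range(min(batch_size, len(queue))):
            order.append(queue.popleft())  # serve a same-direction batch
        direction = 'W' if direction == 'R' else 'R'  # a turnaround happens here
    return order

def turnarounds(order):
    """Count data bus direction changes in a service order."""
    return sum(1 for a, b in zip(order, order[1:]) if a[0] != b[0])

if __name__ == "__main__":
    mixed = [('R', 0), ('W', 1), ('R', 2), ('W', 3), ('R', 4), ('W', 5)]
    print(turnarounds(mixed))            # in-order service: 5 turnarounds
    print(turnarounds(schedule(mixed)))  # batched service: 1 turnaround
```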

    Architecture and Performance Analysis of a Multi-Generation SDRAM Controller for Mixed-Criticality Systems

    Due to their high density and low cost, DDR SDRAMs are the prevailing choice for implementing the main memory of a computer system. Nevertheless, the aforementioned benefits come at the cost of a complex two-stage access protocol, which ultimately means that the time required to serve a memory request depends on the history of previous requests. Stated differently, DDR SDRAMs are a stateful resource, which further complicates mixed-criticality integration, since different criticality levels have conflicting needs. The main goal of this dissertation is to design a controller that leverages the state of DDR SDRAMs in a mixed-criticality environment. More specifically, the controller should provide good average performance for best-effort requestors without compromising timing guarantees for critical requestors. In that regard, this dissertation first identifies two challenges of growing relevance for the design of memory controllers in the mixed-criticality domain. The first challenge is the data bus turnaround time. The second challenge is the rank-to-rank switching time, which only affects multi-rank modules. After pinpointing these two challenges, this dissertation proposes an SDRAM controller to tackle them. The proposed controller bundles read and write operations in their corresponding ranks, thus minimizing the number of data bus turnarounds and rank switching events. As a consequence, the average performance of the controller is improved. However, the bundling is carefully designed so that real-time guarantees for critical requestors can still be derived. Moreover, as will become clear, both the operation of the controller and the corresponding analysis of its temporal properties are described in terms of a generation-independent notation. This is a desirable feature because different SDRAM generations have different architectural features and, possibly, different timing constraints. Finally, an extensive comparison with the related work is performed, and trends in worst-case latency across DDR SDRAM devices from different speed bins and generations are presented and thoroughly discussed.
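    The generation-independent notation itself is not given in the abstract. As a rough sketch of the idea, the Python snippet below parameterizes a turnaround/rank-switch overhead formula with a per-generation set of timing constraints; the parameter names, their meanings, and the example values are placeholders chosen for illustration, not figures taken from any JEDEC datasheet or from the dissertation.

```python
from dataclasses import dataclass

# Illustrative sketch only: a generation-independent way to describe the timing
# constraints an overhead analysis depends on. Names and numbers are placeholders.

@dataclass(frozen=True)
class SdramTiming:
    clock_ns: float   # clock period in nanoseconds
    t_wtr: int        # assumed write-to-read turnaround penalty (cycles)
    t_rtw: int        # assumed read-to-write turnaround penalty (cycles)
    t_rr: int         # assumed rank-to-rank switching gap (cycles)

def turnaround_overhead_ns(t: SdramTiming, switches_wr: int, switches_rw: int,
                           rank_switches: int) -> float:
    """Idle data-bus time caused by direction and rank switches in a schedule."""
    idle_cycles = (switches_wr * t.t_wtr + switches_rw * t.t_rtw
                   + rank_switches * t.t_rr)
    return idle_cycles * t.clock_ns

# The same formula is instantiated for different generations by swapping the
# parameter object, which is the spirit of a generation-independent notation.
ddr3_like = SdramTiming(clock_ns=1.25, t_wtr=6, t_rtw=7, t_rr=2)
ddr4_like = SdramTiming(clock_ns=0.94, t_wtr=9, t_rtw=10, t_rr=3)

if __name__ == "__main__":
    for name, t in [("DDR3-like", ddr3_like), ("DDR4-like", ddr4_like)]:
        print(name, turnaround_overhead_ns(t, switches_wr=4, switches_rw=4,
                                           rank_switches=2))
```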

    230503

    In multiprocessor-based real-time systems, the main memory is identified as the main source of shared resource contention. Phased execution models, such as the 3-phase task execution model, have been shown to be good candidates for tackling the memory contention problem. The 3-phase model divides the execution of tasks into computation and memory phases, which enables a fine-grained memory contention analysis. However, the existing work on memory contention analysis for 3-phase tasks can overestimate the memory contention suffered by the task under analysis due to write requests. This overestimation can yield pessimistic bounds on the memory access times and memory contention suffered by tasks, which in turn lead to pessimistic worst-case response time (WCRT) bounds. Considering this limitation of the state of the art, this work proposes an improved memory contention analysis for the 3-phase task model. Specifically, we propose a memory contention analysis that tightly bounds the memory contention suffered by the task under analysis due to write requests. The proposed analysis integrates the memory address mapping of tasks to improve the bounds on the maximum memory contention suffered by tasks. This work was financed by FCT and EU ECSEL JU within project ADACORSA (ECSEL/0010/2019 - JU grant nr. 876019) - the JU receives support from the EU’s Horizon 2020 R&I Programme and Germany, Netherlands, Austria, France, Sweden, Cyprus, Greece, Lithuania, Portugal, Italy, Finland, Turkey (Disclaimer: this document reflects only the author’s view and the Commission is not responsible for any use that may be made of the information it contains); it is also a result of the work developed under the CISTER Unit (UIDP/UIDB/04234/2020), financed by FCT/MCTES (Portuguese Foundation for Science and Technology); and under project POCI-01-0247-FEDER-045912 (FLOYD), financed in the scope of CMU Portugal by the European Regional Development Fund (ERDF) under COMPETE 2020, and also by FCT under PhD grant 2020.09532.BD.
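    The actual analysis is not reproduced in the abstract. Purely as an illustration of why accounting for address mapping can tighten a write-contention bound, the Python sketch below contrasts a naive bound (every interfering write request delays the task) with a mapping-aware bound (only writes targeting banks the task also uses delay it); the request model, service times, and the bounds themselves are simplifying assumptions and not the analysis proposed in this work.

```python
from dataclasses import dataclass

# Illustrative sketch only: toy upper bounds on the contention a task's memory
# phase suffers from other cores' write requests. Service times, the bank-mapping
# model, and the bounds are assumptions made for illustration.

T_WRITE_NS = 60  # assumed worst-case service time of one write request

@dataclass
class MemPhase:
    reads: int        # read requests issued in the phase
    writes: int       # write requests issued in the phase
    banks: frozenset  # DRAM banks the phase's addresses map to

def naive_write_contention(interferers: list) -> int:
    """Pessimistic bound: every interfering write request delays the task."""
    return sum(p.writes * T_WRITE_NS for p in interferers)

def mapping_aware_write_contention(under_analysis: MemPhase,
                                   interferers: list) -> int:
    """Tighter bound: only writes mapped to banks the task also uses delay it."""
    return sum(p.writes * T_WRITE_NS
               for p in interferers
               if p.banks & under_analysis.banks)

if __name__ == "__main__":
    tau = MemPhase(reads=32, writes=16, banks=frozenset({0, 1}))
    others = [MemPhase(8, 8, frozenset({2, 3})),   # disjoint banks
              MemPhase(8, 8, frozenset({1, 4}))]   # shares bank 1 with tau
    print(naive_write_contention(others))               # 960 ns
    print(mapping_aware_write_contention(tau, others))  # 480 ns
```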

    A time-predictable many-core processor design for critical real-time embedded systems

    Critical Real-Time Embedded Systems (CRTES) are in charge of controlling fundamental parts of embedded systems, e.g. energy-harvesting solar panels in satellites, steering and braking in cars, or flight management systems in airplanes. To do so, CRTES require strong evidence of correct functional and timing behavior. The former guarantees that the system operates correctly in response to its inputs; the latter ensures that its operations are performed within a predefined time budget. CRTES aim at increasing the number and complexity of their functions. Examples include the incorporation of "smarter" Advanced Driver Assistance System (ADAS) functionality in modern cars or advanced collision avoidance systems in Unmanned Aerial Vehicles (UAVs). All these new features, implemented in software, lead to an exponential growth in both performance requirements and software development complexity. Furthermore, there is a strong need to integrate multiple functions into the same computing platform to reduce the number of processing units, mass and space requirements, etc. Overall, there is a clear need to increase the computing power of current CRTES in order to support new sophisticated and complex functionality, and to integrate multiple systems into a single platform. The use of multi- and many-core processor architectures is increasingly seen in the CRTES industry as the solution to cope with the performance demand and cost constraints of future CRTES. Many-cores supply higher performance by exploiting the parallelism of applications while providing better performance per watt, as their cores are kept simpler than those of complex single-core processors. Moreover, the parallelization capabilities allow scheduling multiple functions onto the same processor, maximizing hardware utilization. However, the use of multi- and many-cores in CRTES also brings a number of challenges related to providing evidence about the correct operation of the system, especially in the timing domain. Hence, despite the advantages of many-cores and the fact that they are nowadays a reality in the embedded domain (e.g. Kalray MPPA, Freescale NXP P4080, TI Keystone II), their use in CRTES still requires finding efficient ways of providing reliable evidence about the correct operation of the system. This thesis investigates the use of many-core processors in CRTES as a means to satisfy the performance demands of future complex applications while providing the necessary timing guarantees. To do so, this thesis contributes to advancing the state of the art towards the exploitation of the parallel capabilities of many-cores in CRTES in two computing domains. In the hardware domain, this thesis proposes new many-core designs that enable deriving reliable and tight timing guarantees. In the software domain, we present efficient scheduling and timing analysis techniques to exploit the parallelization capabilities of many-core architectures and to derive tight and trustworthy Worst-Case Execution Time (WCET) estimates for CRTES.

    A reconfigurable accelerator card for high performance computing

    This thesis describes the design, implementation, and testing of a reconfigurable accelerator card. The goal of the project was to provide a hardware platform for future students to carry out research into reconfigurable computing. Our accelerator design is an expansion card for a traditional Von Neumann host machine, and contains two field-programmable gate arrays. By inserting the card into a host machine, intrinsically parallel processing tasks can be exported to the FPGAs. This is similar to the way in which video game rendering tasks can be exported to the GPU on a graphics accelerator. We show how an FPGA is a suitable processing element, in terms of performance per watt, for many computing tasks. We set out to design and build a reconfigurable card that harnessed the latest FPGAs and the fastest available I/O interfaces. The resultant design is one that can run within a host machine, in an array of host machines, or as a stand-alone processing node.

    HyperFPGA: SoC-FPGA Cluster Architecture for Supercomputing and Scientific applications

    Since their inception, supercomputers have addressed problems that far exceed those of a single computing device. Modern supercomputers are made up of tens of thousands of CPUs and GPUs in racks that are interconnected via elaborate and often ad hoc networks. These large facilities provide scientists with unprecedented and ever-growing computing power capable of tackling more complex and larger problems. In recent years, the most powerful supercomputers have already reached megawatt power consumption levels, an important issue that challenges sustainability and shows the impossibility of maintaining this trend. With more pressure on energy efficiency, an alternative to traditional architectures is needed. Reconfigurable hardware, such as FPGAs, has repeatedly been shown to offer substantial advantages over the traditional supercomputing approach with respect to performance and power consumption. In fact, several works that advanced the field of heterogeneous supercomputing using FPGAs are described in this thesis [survey-2002]. Each cluster and its architectural characteristics can be studied from three interconnected domains: network, hardware, and software tools, resulting in intertwined challenges that designers must take into account. The classification and study of the architectures illustrate the trade-offs of the solutions and help identify open problems and research lines, which in turn served as inspiration and background for the HyperFPGA. In this thesis, the HyperFPGA cluster is presented as a way to build scalable SoC-FPGA platforms to explore new architectures for improved performance and energy efficiency in high-performance computing, with a focus on flexibility and openness. The HyperFPGA is a modular platform based on a SoM that includes power monitoring tools and high-speed general-purpose interconnects to offer a high level of flexibility and introspection. By exploiting the reconfigurability and programmability offered by the HyperFPGA infrastructure, which combines FPGAs and CPUs with high-speed general-purpose connectors, novel computing paradigms can be implemented. A custom Linux OS and drivers, along with a custom script for hardware definition, provide a uniform interface from application to platform for a programmable framework that integrates existing tools. The development environment is demonstrated using the N-Queens problem, a classic benchmark for evaluating the performance of parallel computing systems. Overall, the results obtained with the HyperFPGA on the N-Queens problem highlight the platform's ability to handle computationally intensive tasks and demonstrate its suitability for supercomputing experiments.
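    The N-Queens benchmark itself is not shown in the abstract. For reference, here is a minimal sequential solution counter in Python that illustrates the workload; it is not the HyperFPGA implementation, which targets parallel SoC-FPGA hardware.

```python
# Minimal sequential N-Queens solution counter, included only as a reference
# for the benchmark workload mentioned above.

def count_nqueens(n: int) -> int:
    """Count placements of n non-attacking queens via bitmask backtracking."""
    full = (1 << n) - 1

    def solve(cols: int, diag1: int, diag2: int) -> int:
        if cols == full:                       # every column holds a queen
            return 1
        count = 0
        free = full & ~(cols | diag1 | diag2)  # squares not attacked so far
        while free:
            bit = free & -free                 # pick the lowest free square
            free ^= bit
            count += solve(cols | bit, ((diag1 | bit) << 1) & full,
                           (diag2 | bit) >> 1)
        return count

    return solve(0, 0, 0)

if __name__ == "__main__":
    print(count_nqueens(8))   # expected: 92 solutions for the 8x8 board
```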

    Methodology for cycle-accurate performance analysis of DRAM memories

    Main memory is one of the key components of a computer system. In modern systems, main memory is almost always implemented using the DRAM type of memory. Memory has a direct impact on the price, power consumption, and performance of the system, and an indirect impact on its internal architecture and organization. That is why a lot of attention is paid to DRAM performance analysis during system development. Accuracy is of utmost importance in the measurement and analysis of DRAM performance. Inaccuracy reduces the reliability of results and conclusions, which can lead to wrong architectural or design decisions, and to a significant loss of time, effort, and money. General trends in manufacturing technology, computer architecture and organization, and system software development lead toward increasing virtualization and greater utilization of main memory, which exacerbates the problem of accuracy. Considering that there is no technology in sight that could replace DRAM in the near future, the importance of this problem will only increase. The main problem in the analysis of DRAM performance is the inability to determine which cycles on the memory bus are busy and which are idle. That makes even basic memory performance parameters, such as utilization or efficiency, difficult to measure. In essence, it is not clear how to answer the fundamental question: “How can DRAM performance be measured with the required level of accuracy?”. Even for performance parameters that can be measured based on observable signals, such as data bandwidth, there is a problem of interpreting the measured results. Theoretical maximums of performance parameters continually fluctuate and cannot be directly measured, so it is not possible to interpret results by comparing measured and maximum values. It is thus not clear how to answer the following key questions either: “How good or bad are measured performance results?” and “How can results measured in two different time periods or for different workloads be compared?”. The main goal of the dissertation was to overcome these fundamental problems. As a result, a new theoretical foundation for DRAM performance measurement and analysis was created, along with a methodology that specifies how to conduct the measurement and analysis in practice. The methodology is based on the accurate characterization of memory cycles as idle (cycles not used), active (cycles used, whose state is observable), or overhead (cycles that cannot be used due to DRAM protocol constraints, whose state is not observable). The most important components of the proposed methodology are: (1) a functional and timing model of DRAM memory that abstracts DRAM memory as a generic device in the form of a state machine parameterized by the DRAM device configuration and timing parameters, whose operation can be analyzed with the desired level of accuracy; (2) a DRAM measurement and performance analysis model that enables accurate performance characterization of memory cycles; (3) a DRAM performance metric that represents a new theoretical foundation for DRAM performance measurement and analysis; and (4) a method for estimating the DRAM performance maximum that enables solving the problem of interpretation of results. The methodology specifies how these components are defined, constructed, and parameterized according to a particular DRAM implementation, and it describes all the relevant processes and procedures that enable measurement and analysis of DRAM performance in real systems. The proposed methodology makes fundamental advancements over the existing solutions. Its most important advantages from the theoretical point of view are: guaranteed maximum accuracy; accurate estimation of the theoretical maximum; root-causing of sub-optimal DRAM performance; and completeness (it takes into account all DRAM commands and timing parameters). The most important advantages from the practical point of view are: system agnosticism (it does not depend on the system that generates the workload); workload agnosticism (it does not depend on the size or type of workload); portability (it can be implemented on any type of system); efficiency (it generates results fast and with little user effort); and low implementation and verification complexity and cost. The proposed methodology can replace existing solutions in all domains where accuracy and efficiency are of importance. At the same time, it can be applied in completely new domains, such as analysis of critical scenarios (analysis of sequences that occur sporadically but have a tangible impact on performance), transactional analysis (analysis of system operation by following individual transactions), comparative analysis (comparing results from different platforms, for different workloads, or in different time periods), etc. The methodology enables relatively simple application at the engineering level, with modest resource use and a high level of efficiency. As a result of systematization on firm theoretical grounds, all key problems in the domain of DRAM performance measurement and analysis were solved. The fundamental improvement made in this domain allows DRAM performance measurement and analysis to be elevated from an engineering art to a scientific method, which enables a quantitative and qualitative leap in computer system analysis.
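    The idle/active/overhead cycle characterization is described above only in words. As a rough illustration, the Python sketch below computes utilization-style figures from a trace of cycles that have already been classified; the trace format and the specific formulas are assumptions for illustration, not the metric defined by the methodology.

```python
from collections import Counter

# Illustrative sketch only: deriving utilization-style figures from a trace of
# memory-bus cycles classified as 'idle' (unused), 'active' (busy and
# observable, i.e. transferring data), or 'overhead' (busy due to protocol
# constraints, not observable on the bus). Formulas are assumed for illustration.

def dram_figures(cycle_trace: list) -> dict:
    counts = Counter(cycle_trace)
    total = len(cycle_trace)
    active, overhead = counts["active"], counts["overhead"]
    busy = active + overhead
    return {
        "utilization": busy / total,    # fraction of cycles the bus is busy
        "efficiency": active / total,   # fraction of cycles moving data
        # efficiency this workload could reach if every idle cycle were used
        "workload_max": active / busy if busy else 0.0,
    }

if __name__ == "__main__":
    trace = ["active"] * 60 + ["overhead"] * 20 + ["idle"] * 20
    print(dram_figures(trace))  # utilization 0.8, efficiency 0.6, workload_max 0.75
```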

    Novel Architectures for Offloading and Accelerating Computations in Artificial Intelligence and Big Data

    Due to the end of Moore's Law and Dennard Scaling, performance gains in general-purpose architectures have significantly slowed in recent years. While raising the number of cores has been a viable approach for further performance increases, Amdahl's Law and its implications on parallelization also limit further performance gains. Consequently, research has shifted towards different approaches, including domain-specific custom architectures tailored to specific workloads. This has led to a new golden age for computer architecture, as noted in the Turing Award Lecture by Hennessy and Patterson, which has spawned several new architectures and architectural advances specifically targeted at currently prominent workloads, including Machine Learning. This thesis introduces a hierarchy of architectural improvements ranging from minor incremental changes, such as High-Bandwidth Memory, to more complex architectural extensions that offload workloads from the general-purpose CPU towards more specialized accelerators. Finally, we introduce novel architectural paradigms, namely Near-Data and In-Network Processing, as the most complex architectural improvements. This cumulative dissertation then investigates several architectural improvements to accelerate Sum-Product Networks, a novel Machine Learning approach from the class of Probabilistic Graphical Models. Furthermore, we use these improvements as case studies to discuss the impact of novel architectures, showing that both minor and major architectural changes can significantly increase performance in Machine Learning applications. In addition, this thesis presents recent works on Near-Data Processing, which introduces Smart Storage Devices as a novel architectural paradigm that is especially interesting in the context of Big Data. We discuss how Near-Data Processing can be applied to improve performance in different database settings by offloading database operations to smart storage devices. Offloading data-reductive operations, such as selections, reduces the amount of data transferred, thus improving performance and alleviating bandwidth-related bottlenecks. Using Near-Data Processing as a use case, we also discuss how Machine Learning approaches, like Sum-Product Networks, can improve novel architectures. Specifically, we introduce an approach for offloading Cardinality Estimation using Sum-Product Networks that could enable more intelligent decision-making in smart storage devices. Overall, we show that Machine Learning can benefit from the development of novel architectures, while also showing that Machine Learning can be applied to improve the applications of novel architectures.
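    Sum-Product Networks are central to the thesis but not defined in the abstract. As a rough illustration of what evaluating one involves, the Python sketch below evaluates a tiny, hand-built SPN over two binary variables; the structure and weights are invented for the example and are unrelated to the models or accelerators described above.

```python
# Illustrative sketch only: bottom-up evaluation of a tiny Sum-Product Network
# over two binary variables X0 and X1. Structure and weights are invented.

from dataclasses import dataclass

@dataclass
class Leaf:                      # Bernoulli leaf over one variable
    var: int
    p_one: float
    def value(self, x):          # x: dict {var_index: 0 or 1}
        return self.p_one if x[self.var] == 1 else 1.0 - self.p_one

@dataclass
class Product:                   # children must cover disjoint variable scopes
    children: list
    def value(self, x):
        v = 1.0
        for c in self.children:
            v *= c.value(x)
        return v

@dataclass
class Sum:                       # weighted mixture; weights should sum to 1
    weighted_children: list      # list of (weight, node) pairs
    def value(self, x):
        return sum(w * c.value(x) for w, c in self.weighted_children)

# P(X0, X1) = 0.6 * P1(X0)P1(X1) + 0.4 * P2(X0)P2(X1)
spn = Sum([
    (0.6, Product([Leaf(0, 0.9), Leaf(1, 0.2)])),
    (0.4, Product([Leaf(0, 0.3), Leaf(1, 0.7)])),
])

if __name__ == "__main__":
    total = sum(spn.value({0: a, 1: b}) for a in (0, 1) for b in (0, 1))
    print(spn.value({0: 1, 1: 0}), round(total, 6))  # 0.468, and 1.0 in total
```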