461 research outputs found

    Doctor of Philosophy

    As the base of the software stack, system-level software is expected to provide efficient and scalable storage, communication, security, and resource management functionalities. However, many functionalities at the system level, such as encryption, packet inspection, and error correction, are computationally expensive and require substantial computing power. Moreover, today's application workloads have entered gigabyte and terabyte scales, which demand even more computing power. To meet this rapidly growing demand at the system level, this dissertation proposes using parallel graphics processing units (GPUs) in system software. GPUs excel at parallel computing, and their parallel performance is improving much faster than that of central processing units (CPUs). However, system-level software was originally designed to be latency-oriented, whereas GPUs are designed for long-running computation and large-scale data processing and are therefore throughput-oriented. This mismatch makes it difficult to fit system-level software to GPUs. This dissertation presents generic principles of system-level GPU computing developed while creating our two general frameworks for integrating GPU computing into storage and network packet processing. The principles are generic design techniques and abstractions for dealing with common system-level GPU computing challenges. These principles have been evaluated in concrete cases, including storage and network packet processing applications augmented with GPU computing. The significant performance improvements found in the evaluation show the effectiveness and efficiency of the proposed techniques and abstractions. The dissertation also presents a literature survey of the relatively young system-level GPU computing area to introduce the state of the art in both applications and techniques, as well as their future potential.
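    The core tension in the abstract above is that GPU offload pays a fixed per-launch cost that only amortizes over large batches, while system software wants low per-request latency. The following Python sketch is a minimal back-of-the-envelope model of that trade-off; the overhead and per-item costs are hypothetical placeholders, not measurements from the dissertation.

# Minimal model of batching system-level work (e.g., packets) for GPU offload.
# All constants are hypothetical placeholders, not measured values.

GPU_LAUNCH_OVERHEAD_US = 20.0   # fixed cost per offload (copy + kernel launch)
GPU_PER_ITEM_US = 0.05          # amortized GPU time per item at high occupancy
CPU_PER_ITEM_US = 2.0           # CPU time per item processed inline

def gpu_batch_latency_us(batch_size: int) -> float:
    """Latency to finish one batch offloaded to the GPU."""
    return GPU_LAUNCH_OVERHEAD_US + GPU_PER_ITEM_US * batch_size

def per_item_cost_us(batch_size: int) -> float:
    """GPU cost per item once the launch overhead is amortized over the batch."""
    return gpu_batch_latency_us(batch_size) / batch_size

if __name__ == "__main__":
    for batch in (1, 64, 1024, 16384):
        print(f"batch={batch:6d}  GPU per-item={per_item_cost_us(batch):7.3f} us  "
              f"CPU per-item={CPU_PER_ITEM_US:.3f} us  "
              f"batch latency={gpu_batch_latency_us(batch):9.1f} us")

    With a batch of one item the fixed offload cost dominates and the CPU wins; with large batches the per-item GPU cost falls well below the CPU cost, but the whole batch must wait for the offload to finish. That is exactly the latency/throughput tension a latency-oriented system layer has to manage, typically by bounding batch sizes or batching deadlines.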

    System-level management of hybrid memory hierarchies

    Unpublished thesis of the Universidad Complutense de Madrid, Facultad de Informática, Departamento de Arquitectura de Computadoras y Automática, and of KU Leuven, Arenberg Doctoral School, Faculty of Engineering Science, defended on 11/05/2017. In electronics and computer science, the term 'memory' generally refers to devices used to store information in appliances ranging from our PCs to hand-held devices, smart appliances, etc. Primary/main memory is used for storage that functions at high speed (i.e., RAM). Primary memory is typically addressable semiconductor memory, i.e., integrated circuits consisting of silicon-based transistors, used as main memory but also for other purposes in computers and other digital electronic devices. Secondary/auxiliary memory, in comparison, provides program and data storage that is slower to access but offers larger capacity. Examples include external hard drives, portable flash drives, CDs, and DVDs. These devices and media must be either plugged in or inserted into a computer in order to be accessed by the system. Since secondary storage is not always connected to the computer, it is commonly used for backing up data. The term storage is often used to describe secondary memory. Secondary memory stores large amounts of data at a lower cost per byte than primary memory; secondary storage is roughly two orders of magnitude less expensive than primary storage. There are two main types of semiconductor memory: volatile and non-volatile. Examples of non-volatile memory are 'Flash' memory (sometimes used as secondary, sometimes as primary computer memory) and ROM/PROM/EPROM/EEPROM memory (used for firmware such as boot programs). Examples of volatile memory are primary memory (typically dynamic RAM, DRAM) and fast CPU cache memory (typically static RAM, SRAM, which is fast but energy-consuming and offers lower memory capacity per area unit than DRAM). Non-volatile memory technologies in Si-based electronics date back to the 1990s. Flash memory is widely used in consumer electronic products such as cellphones and music players, and NAND Flash-based solid-state disks (SSDs) are increasingly displacing hard disk drives as the primary storage device in laptops, desktops, and even data centers. The integration limit of Flash memories is approaching, and many new types of memory to replace conventional Flash memories have been proposed. The rapid increase of leakage currents in silicon CMOS transistors with scaling poses a big challenge for the integration of SRAM memories, and these memories are also increasingly susceptible to read/write failures under low-power schemes. As a result, over the past decade there has been an extensive pooling of time, resources, and effort towards developing emerging memory technologies such as Resistive RAM (ReRAM/RRAM), STT-MRAM, Domain Wall Memory, and Phase Change Memory (PRAM). Emerging non-volatile memory technologies promise to store more data at less cost than the expensive-to-build silicon chips used by popular consumer gadgets, including digital cameras, cell phones, and portable music players. These new memory technologies combine the speed of static random-access memory (SRAM), the density of dynamic random-access memory (DRAM), and the non-volatility of Flash memory, and so are very attractive candidates for future memory hierarchies.
The research and information on these Non-Volatile Memory (NVM) technologies have matured over the last decade. These NVMs are now being explored thoroughly as viable replacements for conventional SRAM-based memories, even for the higher levels of the memory hierarchy. Many other new classes of emerging memory technologies, such as transparent and plastic, three-dimensional (3-D), and quantum dot memory technologies, have also gained tremendous popularity in recent years... In computing, the term 'memory' generally refers to devices used to store information that will later be used in a variety of devices, from personal computers (PCs) to mobile phones, smart devices, etc. The system's main memory is used to store the data and instructions of running processes, so it must operate at high speed (e.g., DRAM). Main memory is usually implemented with addressable semiconductor memories, DRAM and SRAM being the main examples. Auxiliary or secondary memory, on the other hand, provides storage (for files, for example); it is slower but offers larger capacity. Typical examples of secondary memory are hard disks, portable flash drives, CDs, and DVDs. Because these devices do not need to be permanently connected to the computer, they are widely used to store backups. Secondary memory stores large amounts of data at a lower cost per bit than main memory, typically being two orders of magnitude cheaper than primary memory. There are two types of semiconductor memory: volatile and non-volatile. Examples of non-volatile memories are Flash memories (sometimes used as secondary and sometimes as main memory) and ROM/PROM/EPROM/EEPROM memories (used for firmware such as boot programs). Examples of volatile memory are DRAM (dynamic RAM), currently the predominant choice for implementing main memory, and SRAM (static RAM), which is faster and more expensive and is used for the various cache levels. Non-volatile memory technologies based on silicon electronics date back to the 1990s. A charge-storage memory variant known as Flash memory is used worldwide in consumer electronics such as mobile phones and music players, while NAND Flash solid-state disks (SSDs) are progressively displacing hard disk drives as the main storage unit in laptops, desktops, and even data centers. At present, several factors threaten the dominance of charge-based (capacitive) semiconductor memories. On the one hand, the integration limit of Flash memories is being reached, which compromises their scaling in the medium term. On the other hand, the sharp increase in leakage currents of current silicon CMOS transistors poses an enormous challenge for the integration of SRAM memories. Moreover, these memories are increasingly susceptible to read/write failures in low-power designs. As a result of these problems, which worsen with each new technology generation, efforts to develop new technologies that replace, or at least complement, the current ones have intensified in recent years.
Ferroelectric field-effect transistors (FeFETs) are considered one of the most promising alternatives to replace both Flash (because of their higher density) and DRAM (because of their higher speed), but they are still at a very early stage of development. There are other, somewhat more mature technologies in the field of resistive RAM memories, among which ReRAM (or RRAM), STT-RAM, Domain Wall Memory, and Phase Change Memory (PRAM) stand out...
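    The entry above repeatedly contrasts a fast, expensive primary tier with a slower, denser, cheaper tier (DRAM versus Flash or emerging NVM). A minimal Python sketch of that trade-off for a two-level hybrid main memory is given below; the latencies, relative costs, and hit rates are illustrative assumptions, not figures from the thesis.

# Toy model of a two-level hybrid main memory: a DRAM tier in front of an NVM tier.
# All parameters are illustrative assumptions.

DRAM_LATENCY_NS = 60.0
NVM_LATENCY_NS = 300.0
DRAM_COST_PER_GB = 4.0    # arbitrary cost units
NVM_COST_PER_GB = 1.0

def average_latency_ns(dram_hit_rate: float) -> float:
    """Average access latency when a fraction of accesses hit the DRAM tier."""
    return dram_hit_rate * DRAM_LATENCY_NS + (1.0 - dram_hit_rate) * NVM_LATENCY_NS

def blended_cost_per_gb(dram_fraction: float) -> float:
    """Cost per GB of a hierarchy built from the given fraction of DRAM capacity."""
    return dram_fraction * DRAM_COST_PER_GB + (1.0 - dram_fraction) * NVM_COST_PER_GB

if __name__ == "__main__":
    for hit_rate, dram_frac in ((0.95, 0.25), (0.85, 0.10)):
        print(f"hit rate {hit_rate:.2f}, DRAM share {dram_frac:.2f}: "
              f"avg latency {average_latency_ns(hit_rate):.1f} ns, "
              f"cost {blended_cost_per_gb(dram_frac):.2f} units/GB")

    Sweeping the hit rate and the DRAM share turns this into the kind of design-space exploration that system-level management of a hybrid hierarchy has to perform: a small fast tier keeps the average latency close to DRAM while the blended cost approaches that of the denser tier.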

    Measurement and Analysis of IC Jitters and Soft Failures due to System-level ESD

    Department of Electrical Engineering. A human touch to the grounded metal of an electronic system can cause electrostatic discharge (ESD)-induced soft failures without physical damage. If data are lost due to ESD-induced noise, a system freeze, fault, or reboot can occur, and user intervention is required to restore normal operation. Such malfunctions are called system-level ESD soft failures, and they become more serious as electronic devices become faster and more compact. Achieving immunity of systems and integrated circuits (ICs) against soft failures due to system-level ESD is therefore an important design goal. In this thesis, two specific circuits whose soft failures can be fatal to the whole system are investigated: a delay-locked loop (DLL) and a sense amplifier flip-flop (SAFF). The DLL is widely used to compensate timing in high-speed data communication. The SAFF is commonly used as an input receiver for address and command signals in a DRAM. The DLL and SAFF were designed and fabricated in a 180-nm CMOS process. Each is mounted on a simplified dual in-line memory module (DIMM) by chip-on-board (COB) assembly, and the DIMMs are mounted on simplified motherboards. The input and output voltages of the DLL under ESD-induced noise were measured, and the average values of peak-to-peak jitter and jitter duration of the DLL clock were obtained from repeated measurements. The effects of the VDD-GND decoupling capacitors and a bias decoupling capacitor were investigated. The measured DLL outputs are reproduced in SPICE simulations using the measured DLL input voltages, and the root causes of the jitter are investigated. Additionally, measurements are conducted in the frequency domain to find the relationship between the power-ground impedance and the noise. The soft failures of the SAFF due to system-level ESD were investigated under ESD injection levels of 3, 5, and 8 kV. ESD test cases without and with VDD-GND decoupling capacitors (de-caps) were investigated, and the measurements were repeated 50 times for each test case. The noise voltages and the soft-failure ratio of the SAFF were obtained. SPICE simulations using the measured noise voltages were conducted to validate the results and to identify the root causes of the soft failures.
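    As an illustration of the kind of post-processing implied by "average values of peak-to-peak jitter ... obtained from repeated measurements", the Python sketch below extracts period deviations and peak-to-peak jitter from a list of clock rising-edge timestamps. It is a generic calculation under assumed units and made-up sample data, not the thesis's actual measurement or analysis flow.

# Generic jitter extraction from rising-edge timestamps (e.g., an oscilloscope export).
# Units are assumed to be nanoseconds; this is not the measurement flow of the thesis.
from statistics import mean

def period_deviations(edge_times_ns):
    """Deviation of each clock period from the mean period."""
    periods = [b - a for a, b in zip(edge_times_ns, edge_times_ns[1:])]
    avg = mean(periods)
    return [p - avg for p in periods]

def peak_to_peak_jitter(edge_times_ns):
    """Peak-to-peak jitter: spread between the longest and shortest period."""
    dev = period_deviations(edge_times_ns)
    return max(dev) - min(dev)

if __name__ == "__main__":
    # A nominally 5 ns clock disturbed around the middle edges (made-up numbers).
    edges = [0.0, 5.0, 10.0, 15.4, 19.8, 25.0, 30.0]
    print(f"peak-to-peak jitter: {peak_to_peak_jitter(edges):.2f} ns")

    The same period list can also feed cycle-to-cycle jitter or jitter-duration statistics.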

    Improving time predictability of shared hardware resources in real-time multicore systems: emphasis on the space domain

    Critical Real-Time Embedded Systems (CRTES) follow a verification and validation process for timing and functional correctness. This process includes timing analysis, which provides Worst-Case Execution Time (WCET) estimates as evidence that the execution time of the system, or parts of it, remains within the deadlines. A key design principle for CRTES is incremental qualification, whereby each software component can be subject to verification and validation independently of any other component, with obvious benefits for cost. At the timing level, this requires time composability, such that the timing behavior of a function is not affected by other functions. CRTES are experiencing unprecedented growth, with rising performance demands that have motivated the use of multicore architectures. Multicores can provide the required performance and bring the potential of integrating several software functions onto the same hardware. However, multicore contention in the access to shared hardware resources creates a dependence of a task's execution time on the rest of the tasks running simultaneously. This dependence threatens time predictability and jeopardizes time composability. In this thesis we analyze and propose hardware solutions to be applied to current multicore designs for CRTES to improve time predictability and time composability, focusing on the on-chip bus and the memory controller. At the hardware level, we propose new bus and memory controller designs that control and mitigate contention between different cores and allow time composability by design, also in the context of mixed-criticality systems. At the analysis level, we propose contention prediction models that factor in the impact of contenders and do not need modifications to the hardware. We also propose a set of Performance Monitoring Counters (PMCs) that provide evidence about the contention. We place special emphasis on the space domain, focusing on the Cobham Gaisler NGMP multicore processor, which is currently being assessed by the European Space Agency for its future missions.
Critical Real-Time Embedded Systems (CRTES) follow a verification and validation process for their functional and timing correctness. This process includes the timing analysis that provides worst-case execution time (WCET) estimates as evidence that the execution time of the system, or of parts of it, remains within the time limits. A key design principle for CRTES is incremental qualification, whereby each software component can be verified and validated independently of the other components, with obvious cost benefits. At the timing level, this requires time composability, whereby the timing behavior of a function is not affected by other functions. CRTES are experiencing unprecedented growth, with increasing performance demands that have motivated the use of multicore architectures. Multicore processors can provide the required performance and have the potential to integrate several software functions on the same hardware. However, the interference between the different cores that appears in the shared resources of multicore processors creates a dependence of a task's execution time on the rest of the tasks running simultaneously on the processor. This dependence threatens timing predictability and compromises timing composability.
In this thesis we analyze and propose hardware solutions to be applied to current multicore designs for CRTES that improve timing predictability and composability, focusing on the on-chip bus and memory controller. At the hardware level, we propose new bus and memory controller designs that control and mitigate the interference between the different cores and allow timing composability by design, also in the context of mixed-criticality systems. At the analysis level, we propose interference prediction models that factor in the impact of the contending cores and do not require hardware modifications. We also propose a set of Performance Monitoring Counters (PMCs) that provide evidence of the interference. In this thesis, we place special emphasis on the space domain, focusing on the Cobham Gaisler NGMP multicore processor, which is currently being assessed by the European Space Agency for its future missions.
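    The contention prediction models mentioned above factor in the impact of contenders without modifying the hardware. A common shape for such a model is sketched below in Python: pad the WCET measured in isolation with a worst-case delay per shared-resource access, scaled by the number of contenders. The access counts and per-access delays are hypothetical, and this is a generic illustration rather than the exact model proposed in the thesis.

# Simple contention-aware WCET padding model (illustrative only).
# wcet_isolation_us: WCET estimate of the task running alone.
# accesses: shared-resource access counts of the task (e.g., read from PMCs).
# per_access_delay_us: worst-case extra delay one contender can add per access.

def wcet_with_contention(wcet_isolation_us: float,
                         accesses: dict,
                         per_access_delay_us: dict,
                         num_contenders: int) -> float:
    """Upper-bound the execution time when num_contenders cores share the resources."""
    padding = sum(accesses[r] * per_access_delay_us[r] * num_contenders
                  for r in accesses)
    return wcet_isolation_us + padding

if __name__ == "__main__":
    accesses = {"bus": 12_000, "memory": 3_000}   # hypothetical PMC readings
    delays = {"bus": 0.01, "memory": 0.05}        # hypothetical worst-case delays (us)
    for n in (1, 3):
        print(f"{n} contender(s): bounded WCET = "
              f"{wcet_with_contention(1_000.0, accesses, delays, n):.1f} us")

    The model is deliberately pessimistic: it assumes every access suffers the worst-case delay from every contender, which is what makes the bound safe while only requiring event counts observable through PMCs.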

    Design of a highly energy-efficient memory system considering the power-performance trade-off of DRAM

    Master's thesis, Seoul National University Graduate School, Department of Transdisciplinary Studies (Intelligent Convergence Systems), February 2017. Advisor: Jung Ho Ahn (안정호). As the main-memory capacity required of servers has grown, servers have begun to be equipped with many more memory modules than before. As a result, in server systems with large main memory, main memory has become the second-largest energy-consuming component after the processor, and in some servers, depending on the system configuration, it can even consume as much energy as the processor. Improving the energy efficiency of main memory in big-memory servers has therefore become very important. To build more energy-efficient main-memory systems, prior studies tried to use LPDDR, a mobile DRAM that consumes less power than conventional DDR but suffers from long access latency and low bandwidth, and strove to overcome these performance limitations while improving energy efficiency. In this thesis, however, we show experimentally that main-memory architectures that use mobile DRAM (LPDDR4 instead of DDR4) are no longer effective: on memory-intensive workloads, the energy efficiency of an LPDDR4-based system is 49% lower than that of the DDR4 baseline. The reason is that DDR4 adopted the strengths of mobile and graphics DRAM (low power consumption, high bandwidth, many banks, etc.) to improve performance and energy efficiency at the same time, whereas LPDDR4 sacrificed energy efficiency to achieve higher bandwidth. We additionally confirm that the power consumption of DDR4 varies across manufacturers, and an in-depth investigation of DDR4's new energy-saving features shows that applying them can actually worsen energy efficiency. Based on these observations, we finally propose a simple and effective scheme that adaptively exploits DRAM power-down modes to reduce energy consumption; with the proposed scheme, the energy-delay product improves by 4% compared to the existing power-down policy.
As servers are equipped with more memory modules, each with larger capacity, main-memory systems are now the second-highest energy-consuming component in big-memory servers, and their energy consumption even becomes comparable to that of processors in some servers. It is therefore critical for big-memory servers and their main-memory systems to offer high energy efficiency. In pursuit of energy-efficient main-memory systems, prior work exploited the advantages of mobile LPDDR devices (lower power than DDR devices) while attempting to surmount their limitations (longer latency, lower bandwidth, or both). However, we demonstrate that such main-memory architectures (based on the latest LPDDR4 devices) are no longer effective and even hurt the overall energy efficiency of servers by 49% on memory-intensive workloads compared to ones based on DDR4 devices. This is because the power consumption of present DDR4 devices has substantially decreased by adopting the strengths of mobile and graphics memory, whereas LPDDR4 has sacrificed energy efficiency and focused more on increasing data transfer rates. We also show that the power consumption of DDR4 devices can vary substantially across manufacturers. Moreover, investigating the new energy-saving features of DDR4 devices in depth, we show that activating these features often hurts the overall energy efficiency of servers due to their performance penalties. Subsequently, we propose a simple but effective scheme that adaptively exploits DRAM power-down modes and improves the system energy-delay product by 4.0%.
Contents: Introduction; Background and Related Work (2.1 DRAM Organization and Operation; 2.2 Breaking Down DRAM Power Dissipation; 2.3 Recent Progresses in Improving the Energy Efficiency of Main Memory Systems); Energy Efficiency and Performance Trade-Offs of Modern Main Memory Devices (3.1 DDR4 is not Energy Inefficient Any More; 3.2 Saving Standby Power by Exploiting Power-down Modes; 3.3 Saving Data Transfer Energy with DBI/TSV: 3.3.1 Benefits of DBI, 3.3.2 Energy Savings by DBI Considering its Cost, 3.3.3 Impact of Module Types); Improving Main-Memory Efficiency Without Compromising Performance: Exploiting Power-Down Modes Adaptively; Experimental Setup; Evaluation; Conclusion; Bibliography; Abstract in Korean.
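    The "simple but effective scheme" above adaptively puts idle DRAM into a power-down mode. The Python sketch below illustrates the general idea with a timeout-based policy and an energy-delay-product (EDP) comparison; the power numbers, exit latency, idle-gap trace, and threshold are invented for illustration and are not the thesis's parameters or exact algorithm.

# Illustrative timeout-based DRAM power-down policy and energy-delay product (EDP).
# All numbers are invented for illustration; not the thesis's parameters or algorithm.

ACTIVE_STANDBY_MW = 200.0   # standby power while staying in active mode
POWER_DOWN_MW = 50.0        # power while in the power-down mode
EXIT_LATENCY_NS = 30.0      # extra delay paid when waking from power-down

def idle_energy_and_delay(idle_ns: float, threshold_ns: float):
    """Energy (mW*ns) spent in one idle gap and the wake-up delay it adds."""
    if idle_ns <= threshold_ns:
        return ACTIVE_STANDBY_MW * idle_ns, 0.0
    awake = ACTIVE_STANDBY_MW * threshold_ns
    asleep = POWER_DOWN_MW * (idle_ns - threshold_ns)
    return awake + asleep, EXIT_LATENCY_NS

def edp(idle_gaps_ns, busy_ns: float, busy_energy: float, threshold_ns: float) -> float:
    """Energy-delay product for a trace of idle gaps plus fixed busy work."""
    energy, delay = busy_energy, busy_ns
    for gap in idle_gaps_ns:
        e, d = idle_energy_and_delay(gap, threshold_ns)
        energy += e
        delay += gap + d
    return energy * delay

if __name__ == "__main__":
    gaps = [40, 500, 80, 2_000, 60]            # hypothetical idle gaps (ns)
    for threshold in (float("inf"), 100.0):    # never power down vs. a 100 ns timeout
        print(f"threshold={threshold}: EDP={edp(gaps, 10_000, 1e6, threshold):.3e}")

    Entering power-down only for idle gaps longer than the threshold saves standby energy at the cost of a small wake-up delay; balancing those two terms, and adapting the decision to the observed idle behavior, is the trade-off an adaptive power-down scheme manages.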

    Instruction Memory Hierarchy Generation for Customized Processors

    Because of the performance gap between processor cores and memories, instruction memory hierarchy design is an inseparable part of processor design. The memory hierarchy not only sustains processor performance but can also affect the power consumption of large memories. Designing embedded processors for low power consumption in mobile devices is also important, because a well-designed memory hierarchy can substantially reduce power consumption rather than merely speed up memory accesses. In this Master's thesis, generation of customized memory hierarchies was implemented and integrated into a processor generator. This generator is part of the TTA-based Co-design Environment (TCE) developed at Tampere University of Technology. It reads a memory hierarchy description as input and, based on it, creates a processor that contains the specified hierarchy. In addition, a tool was produced to collect performance statistics of the generated memory hierarchies, which is used in searching for a suitable hierarchy configuration. The implemented features were verified in register transfer level (RTL) simulation using processor test benches generated by TCE. Area and power estimates were produced with a synthesis tool for three low-power processor configurations running at a clock frequency of at least one gigahertz.
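    The generator described above consumes a memory hierarchy description and produces a processor containing that hierarchy, together with statistics used to search for a suitable configuration. The Python sketch below mimics only the outer shape of such a flow, with an invented description format and made-up area/energy constants; it is not TCE's actual input format, toolchain, or cost model.

# Toy 'hierarchy description -> estimate' flow. The description format and the
# area/energy numbers are invented; this is not TCE's actual format or cost model.

HIERARCHY = {
    "l0_loop_buffer": {"size_kib": 1,  "area_per_kib": 0.02, "energy_per_access": 0.5},
    "l1_icache":      {"size_kib": 16, "area_per_kib": 0.05, "energy_per_access": 2.0},
}

def estimate(hierarchy, accesses_per_level):
    """Sum rough area and access-energy figures for one candidate configuration."""
    area = sum(lv["size_kib"] * lv["area_per_kib"] for lv in hierarchy.values())
    energy = sum(accesses_per_level[name] * lv["energy_per_access"]
                 for name, lv in hierarchy.items())
    return area, energy

if __name__ == "__main__":
    accesses = {"l0_loop_buffer": 900_000, "l1_icache": 100_000}   # hypothetical counts
    area, energy = estimate(HIERARCHY, accesses)
    print(f"area ~ {area:.2f} (arbitrary units), access energy ~ {energy:.0f} (arbitrary units)")

    In the real flow described by the thesis, the equivalent numbers would come from RTL simulation of the generated processor and from the synthesis tool, rather than from a hand-written cost table like this one.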