Search CORE

125 research outputs found

KAVUAKA: a low-power application-specific processor architecture for digital hearing aids

Author: Gerlach Lukas
Publication venue: Hannover : Institutionelles Repositorium der Leibniz Universität Hannover
Publication date: 01/01/2021
Field of study

The power consumption of digital hearing aids is very restricted due to their small physical size and the available hardware resources for signal processing are limited. However, there is a demand for more processing performance to make future hearing aids more useful and smarter. Future hearing aids should be able to detect, localize, and recognize target speakers in complex acoustic environments to further improve the speech intelligibility of the individual hearing aid user. Computationally intensive algorithms are required for this task. To maintain acceptable battery life, the hearing aid processing architecture must be highly optimized for extremely low-power consumption and high processing performance.The integration of application-specific instruction-set processors (ASIPs) into hearing aids enables a wide range of architectural customizations to meet the stringent power consumption and performance requirements. In this thesis, the application-specific hearing aid processor KAVUAKA is presented, which is customized and optimized with state-of-the-art hearing aid algorithms such as speaker localization, noise reduction, beamforming algorithms, and speech recognition. Specialized and application-specific instructions are designed and added to the baseline instruction set architecture (ISA). Among the major contributions are a multiply-accumulate (MAC) unit for real- and complex-valued numbers, architectures for power reduction during register accesses, co-processors and a low-latency audio interface. With the proposed MAC architecture, the KAVUAKA processor requires 16 % less cycles for the computation of a 128-point fast Fourier transform (FFT) compared to related programmable digital signal processors. The power consumption during register file accesses is decreased by 6 %to 17 % with isolation and by-pass techniques. The hardware-induced audio latency is 34 %lower compared to related audio interfaces for frame size of 64 samples.The final hearing aid system-on-chip (SoC) with four KAVUAKA processor cores and ten co-processors is integrated as an application-specific integrated circuit (ASIC) using a 40 nm low-power technology. The die size is 3.6 mm2. Each of the processors and co-processors contains individual customizations and hardware features with a varying datapath width between 24-bit to 64-bit. The core area of the 64-bit processor configuration is 0.134 mm2. The processors are organized in two clusters that share memory, an audio interface, co-processors and serial interfaces. The average power consumption at a clock speed of 10 MHz is 2.4 mW for SoC and 0.6 mW for the 64-bit processor.Case studies with four reference hearing aid algorithms are used to present and evaluate the proposed hardware architectures and optimizations. The program code for each processor and co-processor is generated and optimized with evolutionary algorithms for operation merging,instruction scheduling and register allocation. The KAVUAKA processor architecture is com-pared to related processor architectures in terms of processing performance, average power consumption, and silicon area requirements

Institutionelles Repositorium der Leibniz Universität Hannover

Dynamically reconfigurable architecture for embedded computer vision systems

Author: Nieto Lareo Alejandro Manuel
Publication venue
Publication date: 01/01/2013
Field of study

The objective of this research work is to design, develop and implement a new architecture which integrates on the same chip all the processing levels of a complete Computer Vision system, so that the execution is efficient without compromising the power consumption while keeping a reduced cost. For this purpose, an analysis and classification of different mathematical operations and algorithms commonly used in Computer Vision are carried out, as well as a in-depth review of the image processing capabilities of current-generation hardware devices. This permits to determine the requirements and the key aspects for an efficient architecture. A representative set of algorithms is employed as benchmark to evaluate the proposed architecture, which is implemented on an FPGA-based system-on-chip. Finally, the prototype is compared to other related approaches in order to determine its advantages and weaknesses

LAReferencia - Red Federada de Repositorios Institucionales de Publicaciones Científicas Latinoamericanas

Repositorio Institucional da Universidade de Santiago de Compostela

Defining interfaces between hardware and software: Quality and performance

Author: Reid Alastair David
Publication venue
Publication date: 01/01/2019
Field of study

One of the most important interfaces in a computer system is the interface between hardware and software. This interface is the contract between the hardware designer and the programmer that defines the functional behaviour of the hardware. This thesis examines two critical aspects of defining the hardware-software interface: quality and performance. The first aspect is creating a high quality specification of the interface as conventionally defined in an instruction set architecture. The majority of this thesis is concerned with creating a specification that covers the full scope of the interface; that is applicable to all current implementations of the architecture; and that can be trusted to accurately describe the behaviour of implementations of the architecture. We describe the development of a formal specification of the two major types of Arm processors: A-class (for mobile devices such as phones and tablets) and M-class (for micro-controllers). These specifications are unparalleled in their scope, applicability and trustworthiness. This thesis identifies and illustrates what we consider the key ingredient in achieving this goal: creating a specification that is used by many different user groups. Supporting many different groups leads to improved quality as each group finds different problems in the specification; and, by providing value to each different group, it helps justify the considerable effort required to create a high quality specification of a major processor architecture. The work described in this thesis led to a step change in Arm's ability to use formal verification techniques to detect errors in their processors; enabled extensive testing of the specification against Arm's official architecture conformance suite; improved the quality of Arm's architecture conformance suite based on measuring the architectural coverage of the tests; supported earlier, faster development of architecture extensions by enabling animation of changes as they are being made; and enabled early detection of problems created from architecture extensions by performing formal validation of the specification against semi-structured natural language specifications. As far as we are aware, no other mainstream processor architecture has this capability. The formal specifications are included in Arm's publicly released architecture reference manuals and the A-class specification is also released in machine-readable form. The second aspect is creating a high performance interface by defining the hardware-software interface of a software-defined radio subsystem using a programming language. That is, an interface that allows software to exploit the potential performance of the underlying hardware. While the hardware-software interface is normally defined in terms of machine code, peripheral control registers and memory maps, we define it using a programming language instead. This higher level interface provides the opportunity for compilers to hide some of the low-level differences between different systems from the programmer: a potentially very efficient way of providing a stable, portable interface without having to add hardware to provide portability between different hardware platforms. We describe the design and implementation of a set of extensions to the C programming language to support programming high performance, energy efficient, software defined radio systems. The language extensions enable the programmer to exploit the pipeline parallelism typically present in digital signal processing applications and to make efficient use of the asymmetric multiprocessor systems designed to support such applications. The extensions consist primarily of annotations that can be checked for consistency and that support annotation inference in order to reduce the number of annotations required. Reducing the number of annotations does not just save programmer effort, it also improves portability by reducing the number of annotations that need to be changed when porting an application from one platform to another. This work formed part of a project that developed a high-performance, energy-efficient, software defined radio capable of implementing the physical layers of the 4G cellphone standard (LTE), 802.11a WiFi and Digital Video Broadcast (DVB) with a power and silicon area budget that was competitive with a conventional custom ASIC solution. The Arm architecture is the largest computer architecture by volume in the world. It behooves us to ensure that the interface it describes is appropriately defined

Glasgow Theses Service

Configurable Random Instruction Generator for RISC Processors

Author: Mange Krunal
Publication venue: RIT Scholar Works
Publication date: 01/12/2017
Field of study

Processors have evolved and grown more complex to serve enormous computational needs. Even though modern-day processors share same dna with processors half century ago, verifying them today is the huge wall to scale. Verification dominates production cycle even with advances both in software (programming as well as CAD tools) and manufacturing (fabrication) as there are too many test scenarios to cover. Testing complex devices like processors with manual-testing alone in certainty missing the dead lines. Automatic verification is a great way to overcome hurdles of manual testing viz. speed, manpower, and ultimately cost. The work described in this paper targets verification of processors which have in-order instruction execution. Verification is done using SystemVerilog testbench which compares output of device under test to the output of SystemC model, when random instructions are applied

RIT Scholar Works

Flexible Hardware-based Security-aware Mechanisms and Architectures

Author: Dessouky Ghada
Publication venue
Publication date: 01/01/2023
Field of study

For decades, software security has been the primary focus in securing our computing platforms. Hardware was always assumed trusted, and inherently served as the foundation, and thus the root of trust, of our systems. This has been further leveraged in developing hardware-based dedicated security extensions and architectures to protect software from attacks exploiting software vulnerabilities such as memory corruption. However, the recent outbreak of microarchitectural attacks has shaken these long-established trust assumptions in hardware entirely, thereby threatening the security of all of our computing platforms and bringing hardware and microarchitectural security under scrutiny. These attacks have undeniably revealed the grave consequences of hardware/microarchitecture security flaws to the entire platform security, and how they can even subvert the security guarantees promised by dedicated security architectures. Furthermore, they shed light on the sophisticated challenges particular to hardware/microarchitectural security; it is more critical (and more challenging) to extensively analyze the hardware for security flaws prior to production, since hardware, unlike software, cannot be patched/updated once fabricated. Hardware cannot reliably serve as the root of trust anymore, unless we develop and adopt new design paradigms where security is proactively addressed and scrutinized across the full stack of our computing platforms, at all hardware design and implementation layers. Furthermore, novel flexible security-aware design mechanisms are required to be incorporated in processor microarchitecture and hardware-assisted security architectures, that can practically address the inherent conflict between performance and security by allowing that the trade-off is configured to adapt to the desired requirements. In this thesis, we investigate the prospects and implications at the intersection of hardware and security that emerge across the full stack of our computing platforms and System-on-Chips (SoCs). On one front, we investigate how we can leverage hardware and its advantages, in contrast to software, to build more efficient and effective security extensions that serve security architectures, e.g., by providing execution attestation and enforcement, to protect the software from attacks exploiting software vulnerabilities. We further propose that they are microarchitecturally configured at runtime to provide different types of security services, thus adapting flexibly to different deployment requirements. On another front, we investigate how we can protect these hardware-assisted security architectures and extensions themselves from microarchitectural and software attacks that exploit design flaws that originate in the hardware, e.g., insecure resource sharing in SoCs. More particularly, we focus in this thesis on cache-based side-channel attacks, where we propose sophisticated cache designs, that fundamentally mitigate these attacks, while still preserving performance by enabling that the performance security trade-off is configured by design. We also investigate how these can be incorporated into flexible and customizable security architectures, thus complementing them to further support a wide spectrum of emerging applications with different performance/security requirements. Lastly, we inspect our computing platforms further beneath the design layer, by scrutinizing how the actual implementation of these mechanisms is yet another potential attack surface. We explore how the security of hardware designs and implementations is currently analyzed prior to fabrication, while shedding light on how state-of-the-art hardware security analysis techniques are fundamentally limited, and the potential for improved and scalable approaches

TUbiblio

tuprints

Proceedings of the Second International Workshop on Sustainable Ultrascale Computing Systems (NESUS 2015) Krakow, Poland

Author: Carretero Pérez Jesús
García Blas Francisco Javier
Jeannot Emmanuel
Wyrzykowski Roman
Publication venue
Publication date: 01/10/2015
Field of study

Proceedings of: Second International Workshop on Sustainable Ultrascale Computing Systems (NESUS 2015). Krakow (Poland), September 10-11, 2015

Universidad Carlos III de Madrid e-Archivo

Contributions to the fault tolerance of soft-core processors implemented in SRAM-based FPGA Systems.

Author: Gómez-Cornejo Barrena Julen
Publication venue
Publication date: 01/01/2018
Field of study

239 p.Gracias al desarrollo de las tecnologías de diseño y fabricación, los circuitos electrónicos han llegado a grandes niveles de integración. De esta forma, hoy en día es posible implementar completos y complejos sistemas dentro de un único dispositivo incorporando gran variedad de elementos como: procesadores, osciladores, lazos de seguimiento de fase (PLLs), interfaces, conversores ADC y DAC, módulos de memoria, etc. A este concepto de diseño se le denomina comúnmente SoC (System-on-Chip). Una de las plataformas para implementar estos sistemas que más importancia está cobrando son las FPGAs (Field Programmable Gate Array). Históricamente la plataforma más utilizada para albergar los SoCs han sido las ASICs (Application- Specific Integrated Circuits), debido a su bajo consumo energético y su gran rendimiento. No obstante, su costoso proceso de desarrollo y fabricación hace que solo sean rentables en el caso de producciones masivas. Las FPGAs, por el contrario, al ser dispositivos configurables ofrecen, la posibilidad de implementar diseños personalizados a un coste mucho más reducido. Por otro lado, los continuos avances en la tecnología de las FPGAs están haciendo que éstas compitan con las ASICs a nivel de prestaciones (consumo, nivel de integración y eficiencia). Ciertas tecnologías de FPGA, como las SRAM y Flash, poseen una característica que las hace especialmente interesantes en multitud de diseños: la capacidad de reconfiguración. Dicha característica, que incluso puede ser realizada de forma autónoma, permite cambiar completamente el diseño hardware implementado con solo cargar en la FPGA un archivo de configuración denominado bitstream. La reconfiguración puede incluso permitir modificar una parte del circuito configurado en la matriz de la FPGA, mientras el resto del circuito implementado continua inalterado. Esto que se conoce como reconfiguración parcial dinámica, posibilita que un mismo chip albergue en su interior numerosos diseños hardware que pueden ser cargados a demanda. Gracias a la capacidad de reconfiguración, las FPGAs ofrecen numerosas ventajas como: posibilidad de personalización de diseños, capacidad de readaptación durante el funcionamiento para responder a cambios o corregir errores, mitigación de obsolescencia, diferenciación, menores costes de diseño o reducido tiempo para el lanzamiento de productos al mercado. Los SoC basados en FPGAs allanan el camino hacia un nuevo concepto de integración de hardware y software, permitiendo que los diseñadores de sistemas electrónicos sean capaces de integrar procesadores embebidos en los diseños para beneficiarse de su gran capacidad de computación. Gracias a esto, una parte importante de la electrónica hace uso de la tecnología FPGA abarcando un gran abanico de campos, como por ejemplo: la electrónica de consumo y el entretenimiento, la medicina o industrias como la espacial, la aviónica, la automovilística o la militar. Las tecnologías de FPGA existentes ofrecen dos vías de utilización de procesado- res embebidos: procesadores hardcore y procesadores softcore. Los hardcore son procesadores discretos integrados en el mismo chip de la FPGA. Generalmente ofrecen altas frecuencias de trabajo y una mayor previsibilidad en términos de rendimiento y uso del área, pero su diseño hardware no puede alterarse para ser personalizado. Por otro lado, un procesador soft-core, es la descripción hardware en lenguaje HDL (normalmente VDHL o Verilog) de un procesador, sintetizable e implementable en una FPGA. Habitualmente, los procesadores softcore suelen basarse en diseños hardware ya existentes, siendo compatibles con sus juegos de instrucciones, muchos de ellos en forma de IP cores (Intellectual Property co- res). Los IP cores ofrecen procesadores softcore prediseñados y testeados, que dependiendo del caso pueden ser de pago, gratuitos u otro tipo de licencias. Debido a su naturaleza, los procesadores softcore, pueden ser personalizados para una adaptación óptima a diseños específicos. Así mismo, ofrecen la posibilidad de integrar en el diseño tantos procesadores como se desee (siempre que haya disponibles recursos lógicos suficientes). Otra ventaja importante es que, gracias a la reconfiguración parcial dinámica, es posible añadir el procesador al diseño únicamente en los casos necesarios, ahorrando de esta forma, recursos lógicos y consumo energético. Uno de los mayores problemas que surgen al usar dispositivos basados en las tecnologías SRAM o la flash, como es el caso de las FPGAs, es que son especialmente sensibles a los efectos producidos por partículas energéticas provenientes de la radiación cósmica (como protones, neutrones, partículas alfa u otros iones pesados) denominados efectos de eventos simples o SEEs (Single Event Effects). Estos efectos pueden ocasionar diferentes tipos de fallos en los sistemas: desde fallos despreciables hasta fallos realmente graves que comprometan la funcionalidad del sistema. El correcto funcionamiento de los sistemas cobra especial relevancia cuando se trata de tecnologías de elevado costo o aquellas en las que peligran vidas humanas, como, por ejemplo, en campos tales como el transporte ferroviario, la automoción, la aviónica o la industria aeroespacial. Dependiendo de distintos factores, los SEEs pueden causar fallos de operación transitorios, cambios de estados lógicos o daños permanentes en el dispositivo. Cuando se trata de un fallo físico permanente se denomina hard-error, mientras que cuando el fallo afecta el circuito momentáneamente se denomina soft-error. Los SEEs más frecuentes son los soft-errors y afectan tanto a aplicaciones comerciales a nivel terrestre, como a aplicaciones aeronáuticas y aeroespaciales (con mayor incidencia en estas últimas). La contribución exacta de este tipo de fallos a la tasa de errores depende del diseño específico de cada circuito, pero en general se asume que entorno al 90 % de la tasa de error se debe a fallos en elementos de memoria (latches, biestables o celdas de memoria). Los soft-errors pueden afectar tanto al circuito lógico como al bitstream cargado en la memoria de configuración de la FPGA. Debido a su gran tamaño, la memoria de configuración tiene más probabilidades de ser afectada por un SEE. La existencia de problemas generados por estos efectos reafirma la importancia del concepto de tolerancia a fallos. La tolerancia a fallos es una propiedad relativa a los sistemas digitales, por la cual se asegura cierta calidad en el funcionamiento ante la presencia de fallos, debiendo los sistemas poder soportar los efectos de dichos fallos y funcionar correctamente en todo momento. Por tanto, para lograr un diseño robusto, es necesario garantizar la funcionalidad de los circuitos y asegurar la seguridad y confiabilidad en las aplicaciones críticas que puedan verse comprometidos por los SEE. A la hora de hacer frente a los SEE existe la posibilidad de explotar tecnologías específicas centradas en la tolerancia a fallos, como por ejemplo las FPGAs de tipo fusible, o, por otro lado, utilizar la tecnología comercial combinada con técnicas de tolerancia a fallos. Esta última opción va cobrando importancia debido al menor precio y mayores prestaciones de las FPGAs comerciales. Generalmente las técnicas de endurecimiento se aplican durante la fase de diseño. Existe un gran número de técnicas y se pueden llegar a combinar entre sí. Las técnicas prevalentes se basan en emplear algún tipo de redundancia, ya sea hardware, software, temporal o de información. Cada tipo de técnica presenta diferentes ventajas e inconvenientes y se centra en atacar distintos tipos de SEE y sus efectos. Dentro de las técnicas de tipo redundancia, la más utilizada es la hardware, que se basa en replicar el modulo a endurecer. De esta forma, cada una de las réplicas es alimentada con la misma entrada y sus salidas son comparadas para detectar discrepancias. Esta redundancia puede implementarse a diferentes niveles. En términos generales, un mayor nivel de redundancia hardware implica una mayor robustez, pero también incrementa el uso de recursos. Este incremento en el uso de recursos de una FPGA supone tener menos recursos disponibles para el diseño, mayor consumo energético, el tener más elementos susceptibles de ser afectados por un SEE y generalmente, una reducción de la máxima frecuencia alcanzable por el diseño. Por ello, los niveles de redundancia hardware más utilizados son la doble, conocida como DMR (Dual Modular Redundancy) y la triple o TMR (Triple Modular Redundancy). La DMR minimiza el número de recursos redundantes, pero presenta el problema de no poder identificar el módulo fallido ya que solo es capaz de detectar que se ha producido un error. Ello hace necesario combinarlo con técnicas adicionales. Al caso de DMR aplicado a procesadores se le denomina lockstep y se suele combinar con las técnicas checkpoint y rollback recovery. El checkpoint consiste en guardar periódicamente el contexto (contenido de registros y memorias) de instantes identificados como correctos. Gracias a esto, una vez detectado y reparado un fallo es posible emplear el rollback recovery para cargar el último contexto correcto guardado. Las desventajas de estas estrategias son el tiempo requerido por ambas técnicas (checkpoint y rollback recovery) y la necesidad de elementos adicionales (como memorias auxiliares para guardar el contexto). Por otro lado, el TMR ofrece la posibilidad de detectar el módulo fallido mediante la votación por mayoría. Es decir, si tras comparar las tres salidas una de ellas presenta un estado distinto, se asume que las otras dos son correctas. Esto permite que el sistema continúe funcionando correctamente (como sistema DMR) aun cuando uno de los módulos quede inutilizado. En todo caso, el TMR solo enmascara los errores, es decir, no los corrige. Una de las desventajas más destacables de esta técnica es que incrementa el uso de recursos en más de un 300 %. También cabe la posibilidad de que la salida discrepante sea la realmente correcta (y que, por tanto, las otras dos sean incorrectas), aunque este caso es bastante improbable. Uno de los problemas que no se ha analizado con profundidad en la bibliografía es el problema de la sincronización de procesadores soft-core en sistemas TMR (o de mayor nivel de redundancia). Dicho problema reside en que, si tras un fallo se inutiliza uno de los procesadores y el sistema continúa funcionando con el resto de procesadores, una vez reparado el procesador fallido éste necesita sincronizar su contexto al nuevo estado del sistema. Una práctica bastante común en la implementación de sistemas redundantes es combinarlos con la técnica conocida como scrubbing. Esta técnica basada en la reconfiguración parcial dinámica, consiste en sobrescribir periódicamente el bitstream con una copia libre de errores apropiadamente guardada. Gracias a ella, es posible corregir los errores enmascarados por el uso de algunas técnicas de endurecimiento como la redundancia hardware. Esta copia libre de errores suele omitir los bits del bitstream correspondientes a la memoria de usuario, por lo que solo actualiza los bits relacionados con la configuración de la FPGA. Por ello, a esta técnica también se la conoce como configuration scrubbing. En toda la literatura consultada se ha detectado un vacío en cuanto a técnicas que propongan estrategias de scrubbing para la memoria de usuario. Con el objetivo de proponer alternativas innovadoras en el terreno de la tolerancia a fallos para procesadores softcore, en este trabajo de investigación se han desarrollado varias técnicas y flujos de diseño para manejar los datos de usuario a través del bitstream, pudiendo leer, escribir o copiar la información de registros o de memorias implementadas en bloques RAMs de forma autónoma. Así mismo se ha desarrollado un abanico de propuestas tanto como para estrategias lockstep como para la sincronización de sistemas TMR, de las cuales varias hacen uso de las técnicas desarrolladas para manejar las memorias de usuario a través del bitstream. Estas últimas técnicas tienen en común la minimización de utilización de recursos respecto a las estrategias tradicionales. De forma similar, se proponen dos alternativas adicionales basadas en dichas técnicas: una propuesta de scrubbing para las memorias de usuario y una para la recuperación de información en memorias implementadas en bloques RAM cuyas interfaces hayan sido inutilizadas por SEEs.Todas las propuestas han sido validadas en hardware utilizando una FPGA de Xilinx, la empresa líder en fabricación de dispositivos reconfigurables. De esta forma se proporcionan resultados sobre los impactos de las técnicas propuestas en términos de utilización de recursos, consumos energéticos y máximas frecuencias alcanzables

LAReferencia - Red Federada de Repositorios Institucionales de Publicaciones Científicas Latinoamericanas

Archivo Digital para la Docencia y la Investigación

Productive Programming Systems for Heterogeneous Supercomputers

Author: Grossman Max
Publication venue
Publication date: 01/08/2017
Field of study

The majority of today's scientific and data analytics workloads are still run on relatively energy inefficient, heavyweight, general-purpose processing cores, often referred to in the literature as latency-oriented architectures. The flexibility of these architectures and the programmer aids included (e.g. large and deep cache hierarchies, branch prediction logic, pre-fetch logic) makes them flexible enough to run a wide range of applications fast. However, we have started to see growth in the use of lightweight, simpler, energy-efficient, and functionally constrained cores. These architectures are commonly referred to as throughput-oriented. Within each shared memory node, the computational backbone of future throughput-oriented HPC machines will consist of large pools of lightweight cores. The first wave of throughput-oriented computing came in the mid 2000's with the use of GPUs for general-purpose and scientific computing. Today we are entering the second wave of throughput-oriented computing, with the introduction of NVIDIA Pascal GPUs, Intel Knights Landing Xeon Phi processors, the Epiphany Co-Processor, the Sunway MPP, and other throughput-oriented architectures that enable pre-exascale computing. However, while the majority of the FLOPS in designs for future HPC systems come from throughput-oriented architectures, they are still commonly paired with latency-oriented cores which handle management functions and lightweight/un-parallelizable computational kernels. Hence, most future HPC machines will be heterogeneous in their processing cores. However, the heterogeneity of future machines will not be limited to the processing elements. Indeed, heterogeneity will also exist in the storage, networking, memory, and software stacks of future supercomputers. As a result, it will be necessary to combine many different programming models and libraries in a single application. How to do so in a programmable and well-performing manner is an open research question. This thesis addresses this question using two approaches. First, we explore using managed runtimes on HPC platforms. As a result of their high-level programming models, these managed runtimes have a long history of supporting data analytics workloads on commodity hardware, but often come with overheads which make them less common in the HPC domain. Managed runtimes are also not supported natively on throughput-oriented architectures. Second, we explore the use of a modular programming model and work-stealing runtime to compose the programming and scheduling of multiple third-party HPC libraries. This approach leverages existing investment in HPC libraries, unifies the scheduling of work on a platform, and is designed to quickly support new programming model and runtime extensions. In support of these two approaches, this thesis also makes novel contributions in tooling for future supercomputers. We demonstrate the value of checkpoints as a software development tool on current and future HPC machines, and present novel techniques in performance prediction across heterogeneous cores

DSpace at Rice University

Understanding and Improving the Latency of DRAM-Based Memory Systems

Author: Chang Kevin K.
Publication venue
Publication date: 21/12/2017
Field of study

Over the past two decades, the storage capacity and access bandwidth of main memory have improved tremendously, by 128x and 20x, respectively. These improvements are mainly due to the continuous technology scaling of DRAM (dynamic random-access memory), which has been used as the physical substrate for main memory. In stark contrast with capacity and bandwidth, DRAM latency has remained almost constant, reducing by only 1.3x in the same time frame. Therefore, long DRAM latency continues to be a critical performance bottleneck in modern systems. Increasing core counts, and the emergence of increasingly more data-intensive and latency-critical applications further stress the importance of providing low-latency memory access. In this dissertation, we identify three main problems that contribute significantly to long latency of DRAM accesses. To address these problems, we present a series of new techniques. Our new techniques significantly improve both system performance and energy efficiency. We also examine the critical relationship between supply voltage and latency in modern DRAM chips and develop new mechanisms that exploit this voltage-latency trade-off to improve energy efficiency. The key conclusion of this dissertation is that augmenting DRAM architecture with simple and low-cost features, and developing a better understanding of manufactured DRAM chips together lead to significant memory latency reduction as well as energy efficiency improvement. We hope and believe that the proposed architectural techniques and the detailed experimental data and observations on real commodity DRAM chips presented in this dissertation will enable development of other new mechanisms to improve the performance, energy efficiency, or reliability of future memory systems.Comment: PhD Dissertatio

arXiv.org e-Print Archive

FigShare

An innovative vision system for industrial applications

Author: Ribalda Delgado Ricardo
Publication venue
Publication date: 01/01/2015
Field of study

Tesis doctoral inédita leída en la Universidad Autónoma de Madrid, Escuela Politécnica Superior, Departamento de Tecnología Electrónica y de las Comunicaciones. Fecha de lectura: 20-11-2015A pesar de que los sistemas de visión por computadora ocupan un puesto predominante en nuestra sociedad, su estructura no sigue ningún estándar. La implementación de aplicaciones de visión requiere de plataformas de alto rendimiento tales como GPUs o FPGAs y el uso de sensores de imagen con características muy distintas a las de la electrónica de consumo. En la actualidad, cada fabricante y equipo de investigación desarrollan sus plataformas de visión de forma independiente y sin ningún tipo de intercompatibilidad. En esta tesis se presenta una nueva plataforma de visión por computador utilizable en un amplio espectro de aplicaciones. Las características de dicha plataforma se han definido tras la implementación de tres aplicaciones de visión, basadas en: SOC, FPGA y GPU, respectivamente. Como resultado, se ha definido una plataforma modular con los siguientes componentes intercambiables: Sensor, procesador de imágenes ”al vuelo”, unidad de procesado principal, acelerador hardware y pila de software. Asimismo, se presenta un algoritmo para realizar transformaciones geométricas, sintetizable en FPGA y con una latencia de tan solo 90 líneas horizontales. Todos los elementos software de esta plataforma están desarrollados con licencias de Software Libre; durante el trascurso de esta tesis se han contribuido y aceptado más de 200 cambios a distintos proyectos de Software Libre, tales como: Linux, YoctoProject y U-boot, entre otros, promoviendo el ecosistema necesario para la creación de una comunidad alrededor de esta tesis.Tras la implementación de la plataforma en un producto comercial, Qtechnology QT5022, y su uso en varias aplicaciones industriales se ha demostrado que es posible el uso de una plataforma genérica de visión que permita reutilizar elementos y comparar resultados objetivamenteDespite the fact that computer vision systems place an important role in our society, its structure does not follow any standard. The implementation of computer vision application require high performance platforms, such as GPUs or FPGAs, and very specialized image sensors. Nowadays, each manufacturer and research lab develops their own vision platform independently without considering any inter-compatibility. This Thesis introduces a new computer vision platform that can be used in a wide spectrum of applications. The characteristics of the platform has been defined after the implementation of three different computer vision applications, based on: SOC, FPGA and GPU respectively. As a result, a new modular platform has been defined with the following interchangeably elements: Sensor, Image Processing Pipeline, Processing Unit, Acceleration unit and Computer Vision Stack. This thesis also presents an FPGA synthetizable algorithm for performing geometric transformations on the fly, with a latency under 90 horizontal lines. All the software elements of this platform have an Open Source licence; over the course of this thesis, more than 200 patches have been contributed and accepted into different Open Source projects like the Linux Kernel, Yocto Project and U-boot, among others, promoting the required ecosystem for the creation of a community around this novel system. The platform has been validated in an industrial product, Qtechnology QT5022, used on diverse industrial applications; demonstrating the great advantages of a generic computer vision system as a platform for reusing elements and comparing results objectivel

LAReferencia - Red Federada de Repositorios Institucionales de Publicaciones Científicas Latinoamericanas

Biblos-e Archivo