
    PIM-Enclave: Bringing Confidential Computation Inside Memory

    Data-intensive workloads and confidential computing are prominent research directions shaping the future of cloud computing. Computer architectures are evolving to better accommodate computing on large volumes of data. Protecting the computation of sensitive data is also an imperative yet challenging objective; processor-supported secure enclaves serve as the key element of confidential computing in the cloud. However, side-channel attacks threaten their security boundaries. Current processor architectures spend a considerable portion of their cycles moving data. Near-data computation is a promising approach that minimizes redundant data movement by placing computation inside storage. In this paper, we present a novel design for Processing-In-Memory (PIM) as a data-intensive workload accelerator for confidential computing. Based on our observation that moving computation closer to memory can achieve efficiency of computation and confidentiality of the processed information simultaneously, we study the advantages of confidential computing inside memory. We then explain the security model and programming model we developed for PIM-based computation offloading. We consolidate our findings into a software-hardware co-design, which we call PIM-Enclave. Our design illustrates the advantages of PIM-based confidential computing acceleration. Our evaluation shows that PIM-Enclave provides side-channel-resistant secure computation offloading and runs data-intensive applications with negligible performance overhead compared to the baseline PIM model.
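
    Since the abstract above describes a programming model for offloading computation into a memory-side enclave, a minimal host-side sketch of that flow may help fix the idea. Everything below is a hypothetical stand-in (the SimulatedPimEnclave class, the toy XOR cipher, the doubling kernel); it only illustrates the sequence of encrypting data, offloading it, computing next to the data, and decrypting the result, not the actual PIM-Enclave API or its attestation and key-exchange protocol.

```python
# Illustrative host-side offload flow for PIM-style confidential computation.
# All names (SimulatedPimEnclave, xor_cipher, the doubling kernel) are
# hypothetical stand-ins, not the PIM-Enclave API; the XOR "cipher" is a toy.
import os


def xor_cipher(data: bytes, key: bytes) -> bytes:
    """Toy symmetric cipher used only to mark where real encryption would sit."""
    return bytes(b ^ key[i % len(key)] for i, b in enumerate(data))


class SimulatedPimEnclave:
    """Stands in for an enclave running inside a PIM device."""

    def __init__(self, key: bytes):
        self._key = key  # in a real design, provisioned via attested key exchange

    def offload(self, ciphertext: bytes, kernel) -> bytes:
        plaintext = xor_cipher(ciphertext, self._key)  # decrypt inside memory
        result = kernel(plaintext)                     # compute next to the data
        return xor_cipher(result, self._key)           # re-encrypt before leaving


key = os.urandom(16)
enclave = SimulatedPimEnclave(key)

payload = bytes(range(64))
ciphertext = xor_cipher(payload, key)                  # host never ships plaintext
result = xor_cipher(
    enclave.offload(ciphertext, kernel=lambda d: bytes((x * 2) & 0xFF for x in d)),
    key,
)
assert result == bytes((x * 2) & 0xFF for x in payload)
```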

    Exploring New Computing Paradigms for Data-Intensive Applications

    The abstract is in the attachment.

    Runtime-assisted optimizations in the on-chip memory hierarchy

    Following Moore's Law, the number of transistors on chip has been increasing exponentially, which has led to the increasing complexity of modern processors. As a result, the efficient programming of such systems has become more difficult. Many programming models have been developed to address this issue. Of particular interest are task-based programming models that employ simple annotations to define parallel work in an application. The information available at the level of the runtime systems associated with these programming models offers great potential for improving hardware design. Moreover, due to technological limitations, Moore's Law is predicted to eventually come to an end, so novel paradigms are necessary to maintain the current performance improvement trends. The main goal of this thesis is to exploit the knowledge about a parallel application available at the runtime system level to improve the design of the on-chip memory hierarchy. The coupling of the runtime system and the microprocessor enables a better hardware design without hurting programmability. The first contribution is a set of insertion policies for shared last-level caches that exploit information about tasks and task data dependencies. The intuition behind this proposal revolves around the observation that parallel threads exhibit different memory access patterns; even within the same thread, accesses to different variables often follow distinct patterns. The proposed policies insert cache lines into different logical positions depending on the dependency type and task type to which the corresponding memory request belongs. The second proposal optimizes the execution of reductions, defined as a programming pattern that combines input data to form the resulting reduction variable. This is achieved with a runtime-assisted technique for performing reductions in the processor's cache hierarchy. The proposal's goal is to be a universally applicable solution regardless of the reduction variable's type, size and access pattern. On the software level, the programming model is extended to let a programmer specify the reduction variables for tasks, as well as the desired cache level where a certain reduction will be performed. The source-to-source compiler and the runtime system are extended to translate and forward this information to the underlying hardware. On the hardware level, private and shared caches are equipped with functional units and the accompanying logic to perform reductions at the cache level. This design avoids unnecessary data movement to the core and back, as the data is operated on where it resides. The third contribution is a runtime-assisted prioritization scheme for memory requests inside the on-chip memory hierarchy. The proposal is based on the notion of a critical path in the context of parallel codes and the known fact that accelerating critical tasks reduces the execution time of the whole application. In the context of this work, task criticality is observed at the level of the task type, as this enables simple annotation by the programmer. The acceleration of critical tasks is achieved by the prioritization of the corresponding memory requests in the microprocessor.
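
    The first contribution (dependency-aware insertion into shared last-level caches) can be illustrated with a small simulation. The sketch below, using assumed dependency types and insertion positions that are not the thesis' exact policy, shows how a memory request's dependency type can select a different logical insertion point in an LRU stack.

```python
# Sketch of a dependency-aware insertion policy for one set of an 8-way cache.
# The mapping from dependency type to insertion position is assumed for
# illustration; it is not the exact policy proposed in the thesis.
from collections import deque

WAYS = 8
# Position 0 = MRU end of the stack; position WAYS-1 = LRU end (evicted first).
INSERT_POS = {"in": 0, "inout": 0, "out": WAYS - 2, "non-task": WAYS - 1}


class CacheSet:
    def __init__(self):
        self.stack = deque()  # leftmost entry is the MRU line

    def access(self, line: int, dep_type: str) -> bool:
        if line in self.stack:
            self.stack.remove(line)
            self.stack.appendleft(line)              # promote on hit
            return True
        if len(self.stack) == WAYS:
            self.stack.pop()                         # evict the LRU line
        pos = min(INSERT_POS.get(dep_type, WAYS - 1), len(self.stack))
        self.stack.insert(pos, line)                 # dependency-aware insertion
        return False


cache_set = CacheSet()
hits = sum(cache_set.access(addr % 16, "out" if addr % 3 else "in")
           for addr in range(100))
print("hits:", hits)
```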

    Domain Specific Computing in Tightly-Coupled Heterogeneous Systems

    Over the past several decades, researchers and programmers across many disciplines have relied on Moore's law and Dennard scaling for increases in compute capability in modern processors. However, recent data suggest that the number of transistors per square inch on integrated circuits is losing pace with Moore's law's projection due to the breakdown of Dennard scaling at smaller semiconductor process nodes. This has signaled the beginning of a new “golden age in computer architecture” in which the paradigm will be shifted from improving traditional processor performance for general tasks to architecting hardware that executes a class of applications in a high-performing manner. This shift will be paved, in part, by making compute systems more heterogeneous and by investigating domain-specific architectures. However, the notion of domain-specific architectures raises many research questions. Specifically, what constitutes a domain? How does one architect hardware for a specific domain? In this dissertation, we present our work towards domain-specific computing. We start by constructing a guiding definition for our target domain and then creating a benchmark suite of applications based on our domain definition. We then use quantitative metrics from the literature to characterize our domain in order to gain insights regarding what would be most beneficial in hardware targeted specifically for the domain. From the characterization, we learn that data movement is a particularly salient aspect of our domain. Motivated by this fact, we evaluate our target platform, the Intel HARPv2 CPU+FPGA system, for architecting domain-specific hardware through a portability and performance evaluation. To guide the creation of domain-specific hardware for this platform, we create a novel tool to quantify spatial and temporal locality. We apply this tool to our benchmark suite and use the generated outputs as features for an unsupervised clustering algorithm. We posit that the resulting clusters represent sub-domains within our originally specified domain; specifically, these clusters inform whether a kernel of computation should be designed as a widely vectorized or deeply pipelined compute unit. Using the lessons learned from the domain characterization and hardware platform evaluation, we outline our process of designing hardware for our domain, and we empirically verify that our prediction regarding a wide or deep kernel implementation is correct.
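
    As a rough illustration of quantifying spatial and temporal locality from an address trace, the sketch below computes two common textbook-style proxies: a reuse score over a sliding window and a neighbouring-cache-line score. These are generic metrics assumed for illustration, not the metrics produced by the dissertation's tool.

```python
# Generic sketch of quantifying locality from a memory-address trace.
# The two scores are common textbook-style proxies assumed for illustration,
# not the feature set produced by the dissertation's tool.
from collections import OrderedDict


def temporal_score(trace, window=64):
    """Fraction of accesses whose address was also seen in the last `window` distinct addresses."""
    recent, hits = OrderedDict(), 0
    for addr in trace:
        if addr in recent:
            hits += 1
            recent.move_to_end(addr)
        else:
            recent[addr] = True
        if len(recent) > window:
            recent.popitem(last=False)               # drop the oldest address
    return hits / len(trace)


def spatial_score(trace, line_bytes=64):
    """Fraction of accesses touching the same or an adjacent cache line as the previous one."""
    close = sum(abs(trace[i] // line_bytes - trace[i - 1] // line_bytes) <= 1
                for i in range(1, len(trace)))
    return close / (len(trace) - 1)


trace = [i * 4 for i in range(256)] * 4              # a streaming access pattern, repeated
print(temporal_score(trace), spatial_score(trace))   # low temporal, high spatial locality
```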

    Applications in Electronics Pervading Industry, Environment and Society

    This book features the manuscripts accepted for the Special Issue “Applications in Electronics Pervading Industry, Environment and Society—Sensing Systems and Pervasive Intelligence” of the MDPI journal Sensors. Most of the papers come from a selection of the best papers of the 2019 edition of the “Applications in Electronics Pervading Industry, Environment and Society” (APPLEPIES) Conference, which was held in November 2019. All these papers have been significantly enhanced with novel experimental results. The papers give an overview of the trends in research and development activities concerning the pervasive application of electronics in industry, the environment, and society. The focus of these papers is on cyber-physical systems (CPS), with research proposals for new sensor acquisition and ADC (analog-to-digital converter) methods, high-speed communication systems, cybersecurity, big data management, and data processing, including emerging machine learning techniques. Physical implementation aspects are discussed, as well as the trade-offs found between functional performance and hardware/system costs.

    Algorithm-Hardware Co-Design for Performance-driven Embedded Genomics

    Genomics includes the development of techniques for the diagnosis, prognosis and therapy of over 6000 known genetic disorders. It is a major driver in the transformation of medicine from the reactive form to the personalized, predictive, preventive and participatory (P4) form. The availability of the genome is an essential prerequisite to genomics and is obtained from the sequencing and analysis pipelines of whole genome sequencing (WGS). The advent of second-generation sequencing (SGS) significantly reduced sequencing costs, leading to voluminous research in genomics. SGS technologies, however, generate massive volumes of data in the form of reads, which are fragments of the real genome. The performance requirements associated with mapping reads to the reference genome (RG), in order to reassemble the original genome, now stand disproportionate to the available computational capabilities. Conventionally, the hardware resources used are made of homogeneous many-core architectures employing complex general-purpose CPU cores. Although these cores provide high performance, a data-centric approach is required to identify alternative hardware systems more suitable for affordable and sustainable genome analysis. Most state-of-the-art genomic tools are performance oriented and do not address the crucial aspect of energy consumption. Although algorithmic innovations have reduced runtime on conventional hardware, energy consumption has scaled poorly. The associated monetary and environmental costs have made it a major bottleneck to translational genomics. This thesis is concerned with the development and validation of read mappers for the embedded genomics paradigm, aiming to provide a portable and energy-efficient hardware solution to the reassembly pipeline. It applies the algorithm-hardware co-design approach to bridge the saturation point reached in algorithmic innovations with emerging low-power/energy heterogeneous embedded platforms. Essential to the embedded paradigm is the ability to use heterogeneous hardware resources. Graphics processing units (GPUs) are often available in most modern devices alongside the CPU but, conventionally, state-of-the-art read mappers are not tuned to use both together. The first part of the thesis develops a Cross-platfOrm Read mApper using opencL (CORAL) that can distribute the workload across all available devices for high performance. The OpenCL framework mitigates the need for designing separate kernels for the CPU and GPU. It implements a verification-aware filtration algorithm for rapid pruning and identification of candidate locations for mapping reads to the RG. Mapping reads on embedded platforms decreases performance due to architectural differences such as limited on-chip/off-chip memory, smaller bandwidths and simpler cores. To mitigate performance degradation, in the second part of the thesis, we propose a REad maPper for heterogeneoUs sysTEms (REPUTE) which uses an efficient dynamic programming (DP) based filtration methodology. Using algorithm-hardware co-design and kernel-level optimizations to reduce its memory footprint, REPUTE demonstrates significant energy savings on the HiKey970 embedded platform with acceptable performance. The third part of the thesis concentrates on mapping the whole genome on an embedded platform. We propose a Pyopencl based tooL for gEnomic workloaDs tarGeting Embedded platfoRms (PLEDGER) which includes two novel contributions. The first is a novel preprocessing strategy to generate a low-memory-footprint (LMF) data structure that fits all human chromosomes at the cost of performance. The second contribution is an LMF DP-based filtration method that works in conjunction with the proposed data structures. To mitigate performance degradation, the kernel employs several optimisations, including extensive use of bit-vector operations. Extensive experiments using real human reads were carried out with state-of-the-art read mappers on five different platforms for CORAL, REPUTE and PLEDGER. The results show that embedded genomics provides significant energy savings with similar performance compared to conventional CPU-based platforms.
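
    A generic stand-in for the DP-based filtration idea is a banded edit-distance check that prunes candidate mapping locations exceeding an error budget before full verification. The sketch below assumes a plain banded global edit distance; the thesis' kernels use their own bit-vector-optimised formulations.

```python
# Generic banded edit-distance filter: keep a candidate location only if the
# read aligns to the reference window within `max_err` edits. This textbook
# banded DP is a stand-in for the thesis' filtration kernels.

def within_edits(read: str, window: str, max_err: int) -> bool:
    n, m, band = len(read), len(window), max_err
    inf = max_err + 1                                # any value > max_err behaves as "reject"
    prev = [j if j <= band else inf for j in range(m + 1)]
    for i in range(1, n + 1):
        cur = [i if i <= band else inf] + [inf] * m
        for j in range(max(1, i - band), min(m, i + band) + 1):
            cost = 0 if read[i - 1] == window[j - 1] else 1
            cur[j] = min(prev[j - 1] + cost,         # match / substitution
                         prev[j] + 1,                # deletion from the read
                         cur[j - 1] + 1)             # insertion into the read
        prev = cur
    return prev[m] <= max_err


reference = "ACGTACGTTTGACCA" * 10
read = "ACGTTTGACCA"
candidates = [pos for pos in range(len(reference) - len(read) + 1)
              if within_edits(read, reference[pos:pos + len(read)], max_err=2)]
print(candidates[:5])                                # candidate locations surviving the filter
```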

    Novel computational techniques for mapping and classifying Next-Generation Sequencing data

    Since their emergence around 2006, Next-Generation Sequencing technologies have been revolutionizing biological and medical research. Quickly obtaining an extensive amount of short or long DNA sequence reads from almost any biological sample enables detecting genomic variants, revealing the composition of species in a metagenome, deciphering cancer biology, decoding the evolution of living or extinct species, or understanding human migration patterns and human history in general. The pace at which the throughput of sequencing technologies is increasing surpasses the growth of storage and computing capacities, which creates new computational challenges in NGS data processing. In this thesis, we present novel computational techniques for read mapping and taxonomic classification. With more than a hundred published mappers, read mapping might be considered fully solved. However, the vast majority of mappers follow the same paradigm, and only little attention has been paid to non-standard mapping approaches. Here, we propound so-called dynamic mapping, which we show to significantly improve the resulting alignments compared to traditional mapping approaches. Dynamic mapping is based on exploiting the information from previously computed alignments, helping to improve the mapping of subsequent reads. We provide the first comprehensive overview of this method and demonstrate its qualities using Dynamic Mapping Simulator, a pipeline that compares various dynamic mapping scenarios to static mapping and iterative referencing. An important component of a dynamic mapper is an online consensus caller, i.e., a program collecting alignment statistics and guiding updates of the reference in an online fashion. We provide Ococo, the first online consensus caller, which maintains statistics for individual genomic positions using compact bit counters. Beyond its application to dynamic mapping, Ococo can be employed as an online SNP caller in various analysis pipelines, enabling SNP calling from a stream without saving the alignments on disk. Metagenomic classification of NGS reads is another major topic studied in the thesis. Given a database with thousands of reference genomes placed on a taxonomic tree, the task is to rapidly assign a huge number of NGS reads to tree nodes, and possibly estimate the relative abundance of the species involved. In this thesis, we propose improved computational techniques for this task. In a series of experiments, we show that spaced seeds consistently improve classification accuracy. We provide Seed-Kraken, a spaced-seed extension of Kraken, the most popular classifier at present. Furthermore, we suggest ProPhyle, a new indexing strategy based on a BWT-index, which obtains a much smaller and more informative index compared to Kraken. We provide a modified version of BWA that improves the BWT-index for quick k-mer look-up.
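
    The compact per-position counters behind an online consensus caller can be sketched as follows. This is an illustrative packing of four 4-bit nucleotide counters into one 16-bit word, halved on saturation, with the consensus taken as the most frequent base; it is in the spirit of compact online counting but is not Ococo's actual bit layout.

```python
# Illustrative per-position nucleotide counters packed into one 16-bit word,
# 4 bits per base and halved on saturation. The layout and update rule are
# assumptions for illustration, not Ococo's actual statistics format.
BASES = "ACGT"
BITS, MASK = 4, 0xF


def update(counters: int, base: str) -> int:
    """Record one observed base at a genomic position."""
    shift = BASES.index(base) * BITS
    count = (counters >> shift) & MASK
    if count == MASK:                                # saturation: halve all four counters
        counters = sum((((counters >> (b * BITS)) & MASK) >> 1) << (b * BITS)
                       for b in range(4))
        count = (counters >> shift) & MASK
    return (counters & ~(MASK << shift)) | ((count + 1) << shift)


def consensus(counters: int) -> str:
    """Return the base with the highest (possibly halved) counter."""
    counts = [(counters >> (b * BITS)) & MASK for b in range(4)]
    return BASES[counts.index(max(counts))]


position_stats = 0
for observed in "ACACAAACAAAAAAAAAAAG":              # stream of pileup bases at one position
    position_stats = update(position_stats, observed)
print(consensus(position_stats))                     # prints 'A'
```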

    Dynamic partial reconfiguration management for high performance and reliability in FPGAs

    Modern Field-Programmable Gate Arrays (FPGAs) are no longer used to implement small “glue logic” circuitry. The high density of reconfigurable logic resources in today’s FPGAs enables the implementation of large systems in a single chip. FPGAs are highly flexible devices; their functionality can be altered by simply loading a new binary file into their configuration memory. While the flexibility of FPGAs is comparable to General-Purpose Processors (GPPs), in the sense that different functions can be performed using the same hardware, the performance gain that can be achieved using FPGAs can be orders of magnitude higher, as FPGAs offer the ability to customise parallel computational architectures. Dynamic Partial Reconfiguration (DPR) allows for changing the functionality of certain blocks on the chip while the rest of the FPGA is operational. DPR has sparked the interest of researchers to explore new computational platforms where computational tasks are off-loaded from a main CPU and executed using dedicated reconfigurable hardware accelerators configured on demand at run-time. By having a battery of custom accelerators which can be swapped in and out of the FPGA at runtime, a higher computational density can be achieved compared to static systems where the accelerators are bound to fixed locations within the chip. Furthermore, the ability to relocate these accelerators across several locations on the chip allows for the implementation of adaptive systems which can mitigate emerging faults in the FPGA chip when operating in harsh environments. By porting the appropriate fault mitigation techniques to such computational platforms, the advantages of FPGAs can be harnessed in different applications in space and military electronics, where FPGAs are usually seen as unreliable devices due to their sensitivity to radiation and extreme environmental conditions. In light of the above, this thesis investigates the deployment of DPR as: 1) a method for enhancing performance by efficient exploitation of the FPGA resources, and 2) a method for enhancing the reliability of systems intended to operate in harsh environments. Achieving optimal performance in such systems requires an efficient internal configuration management system to manage the reconfiguration and execution of the reconfigurable modules in the FPGA. In addition, the system needs to support “fault-resilience” features by integrating parameterisable fault detection and recovery capabilities to meet the reliability standards of fault-tolerant applications. This thesis addresses all the design and implementation aspects of an Internal Configuration Manager (ICM) which supports a novel bitstream relocation model to enable the placement of relocatable accelerators across several locations on the FPGA chip. In addition to supporting all the configuration capabilities required to implement a Reconfigurable Operating System (ROS), the proposed ICM also supports a novel multiple-clone configuration technique which allows for cloning several instances of the same hardware accelerator at the same time, resulting in a much shorter configuration time compared to traditional configuration techniques. A fault-tolerant (FT) version of the proposed ICM which supports a comprehensive fault-recovery scheme is also introduced in this thesis. The proposed FT-ICM is designed with a much smaller area footprint compared to Triple Modular Redundancy (TMR) hardening techniques while keeping a comparable level of fault-resilience. The capabilities of the proposed ICM system are demonstrated with two novel applications. The first application demonstrates a proof-of-concept reliable FPGA server solution used for executing encryption/decryption queries. The proposed server deploys bitstream relocation and modular redundancy to mitigate both permanent and transient faults in the device. It also deploys a novel Built-In Self-Test (BIST) diagnosis scheme, specifically designed to detect emerging permanent faults in the system at run-time. The second application is a data mining application where DPR is used to increase the computational density of a system used to implement the Frequent Itemset Mining (FIM) problem.
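
    The benefit of the multiple-clone configuration technique can be pictured with a toy timing model: if a frame of the partial bitstream is fetched once and then written into every free region, the fetch cost is paid once rather than once per clone. The per-frame costs and region names below are invented for illustration and are not measurements of the proposed ICM.

```python
# Toy timing model contrasting per-clone configuration with the multiple-clone
# technique (fetch each bitstream frame once, write it into every free region).
# The per-frame costs and region names are invented for illustration; they are
# not measurements of the proposed ICM.
FRAME_FETCH_US = 3          # assumed cost to fetch one frame from external memory
FRAME_WRITE_US = 5          # assumed cost to write one frame into one region


def sequential_config(frames: int, clones: int) -> int:
    """Configure each clone independently: fetch and write every frame per clone."""
    return clones * frames * (FRAME_FETCH_US + FRAME_WRITE_US)


def multiple_clone_config(frames: int, clones: int) -> int:
    """Fetch each frame once, then replay the write into every target region."""
    return frames * (FRAME_FETCH_US + clones * FRAME_WRITE_US)


free_regions = ["RP0", "RP2", "RP3"]                 # hypothetical reconfigurable partitions
accelerator_frames = 400
print("sequential :", sequential_config(accelerator_frames, len(free_regions)), "us")
print("multi-clone:", multiple_clone_config(accelerator_frames, len(free_regions)), "us")
```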