4 research outputs found

    Separation logic for high-level synthesis

    High-level synthesis (HLS) promises a significant shortening of the digital hardware design cycle by raising the abstraction level of the design entry to high-level languages such as C/C++. However, applications using dynamic, pointer-based data structures remain difficult to implement well, yet such constructs are widely used in software. Automated optimisations that leverage the memory bandwidth of dedicated hardware implementations by distributing application data over separate on-chip memories and parallelising the implementation are often ineffective in the presence of dynamic data structures, due to the lack of an automated analysis that disambiguates pointer-based memory accesses. This thesis takes a step towards closing this gap. We explore recent advances in separation logic, a rigorous mathematical framework that enables formal reasoning about the memory accesses of heap-manipulating programs. We develop a static analysis that automatically splits heap-allocated data structures into provably disjoint regions. Our algorithm focuses on dynamic data structures accessed in loops and is accompanied by automated source-to-source transformations which enable loop parallelisation and physical memory partitioning by off-the-shelf HLS tools. We then extend the scope of our technique to pointer-based, memory-intensive implementations that require access to an off-chip memory. The extended HLS design aid generates parallel on-chip multi-cache architectures. It uses the disjointness property of memory accesses to back non-overlapping memory regions with private caches. It also identifies regions that remain shared after parallelisation and supports them with parallel caches featuring a coherency mechanism and synchronisation, resulting in automatically specialised memory systems. We show up to 15x acceleration from heap partitioning, parallelisation and the insertion of the custom cache system in demonstrably practical applications.
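
    The kind of transformation the thesis describes can be pictured with a small, hypothetical example (not code from the thesis; all names are illustrative): a pointer-chasing loop whose heap accesses an HLS tool must conservatively serialise, next to the disjoint-region version that the separation-logic analysis would justify, where each provably disjoint list can be mapped to its own on-chip memory and the loops scheduled in parallel.

        // Hypothetical sketch of the source-to-source transformation the
        // thesis motivates; names are illustrative.
        struct Node { int value; Node* next; };

        // Before: one pointer chase over the whole heap. Without alias
        // information, an HLS tool must serialise the memory accesses.
        int sum_all(Node* head) {
            int sum = 0;
            for (Node* p = head; p != nullptr; p = p->next)
                sum += p->value;
            return sum;
        }

        // After heap partitioning: the analysis has proved that list_a and
        // list_b occupy disjoint heap regions, so each loop can be bound to
        // its own memory bank and the two loops scheduled concurrently.
        int sum_partitioned(Node* list_a, Node* list_b) {
            int sum_a = 0, sum_b = 0;
            for (Node* p = list_a; p != nullptr; p = p->next)  // bank 0
                sum_a += p->value;
            for (Node* p = list_b; p != nullptr; p = p->next)  // bank 1
                sum_b += p->value;
            return sum_a + sum_b;
        }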

    Architectural explorations for streaming accelerators with customized memory layouts

    The basic concept behind the architecture of a general-purpose CPU core conforms well to a serial programming model. The integration of more cores on a single chip has helped CPUs run parts of a program in parallel. However, the huge parallelism available in many high-performance applications and their data is hard to exploit on these general-purpose multi-cores. Streaming accelerators and the corresponding programming models improve on this situation by providing throughput-oriented architectures. The basic idea behind the design of these architectures matches the ever-increasing requirement of processing huge data sets. These high-performance, throughput-oriented devices process data rapidly by using efficient parallel computation and streaming-based communication. Like other processors, streaming accelerators consist of numerous micro-architectural components, including memory structures, compute units, control units, and I/O channels and controls. The throughput requirements, however, add special features and impose other restrictions for performance reasons. These devices normally offer a large number of compute resources, but require applications to arrange their data into parallel, maximally independent sets that feed the compute resources in the form of streams. Arranging data into independent sets of parallel streams is not a simple task. It may require restructuring an algorithm as a whole, or even rewriting the algorithm from scratch for the target application. Even then, all this effort to rearrange an application's data access patterns may not achieve optimal performance, because of micro-architectural constraints of the target platform: the hardware prefetching mechanisms, the size and granularity of the local storage, and the flexibility of data marshaling inside the local storage. These constraints of a general-purpose streaming platform on prefetching, storing and marshaling data into parallel, independent streams can be removed by micro-architectural design approaches, including the use of application-specific customized memories in the front-end of a streaming architecture. The focus of this thesis is to present architectural explorations of streaming accelerators with customized memory layouts. The thesis covers three main aspects of such accelerators: i) the design of application-specific accelerators with customized memory layouts, ii) template-based design support for customized memory accelerators, and iii) design space explorations for throughput-oriented devices with standard and customized memories. The thesis concludes with a conceptual proposal of a Blacksmith Streaming Architecture (BSArc).
Blacksmith Computing allows the hardware-level adoption of an application-specific front-end with a GPU-like streaming back-end. This gives an opportunity to exploit the maximum possible data locality and data-level parallelism of an application while feeding a powerful, throughput-oriented back-end. We consider that the designs of these specialized memory layouts for the front-end of the device are provided by application-domain experts in the form of templates, adjustable to the device and the problem size at the device's configuration time. The physical availability of such an architecture may still take time; in the meantime, a simulation framework supports the architectural explorations, gives insight into the proposal and predicts the potential performance benefits of such an architecture.
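
    As a concrete reading of the template idea above, here is a minimal C++ sketch (all names are illustrative assumptions, not from the thesis) of an application-specific local-store layout captured as a template, with the bank count and tile size fixed at configuration (compile) time for a given device and problem size:

        // Hypothetical front-end store: columns are interleaved across
        // banks so a row-wise stream can feed `Banks` elements per cycle
        // from independent memories.
        #include <array>
        #include <cstddef>

        template <typename T, std::size_t Banks,
                  std::size_t TileRows, std::size_t TileCols>
        struct TiledFrontEndStore {
            static_assert(TileCols % Banks == 0, "columns must divide evenly");
            std::array<std::array<T, TileRows * TileCols / Banks>, Banks> bank;

            // Element (r, c) lives in bank c % Banks; neighbouring columns
            // therefore land in different banks and can be read in parallel.
            T& at(std::size_t r, std::size_t c) {
                return bank[c % Banks][r * (TileCols / Banks) + c / Banks];
            }
        };

        // Configured for a hypothetical device: 4 banks, 16x64 tiles of floats.
        TiledFrontEndStore<float, 4, 16, 64> store;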


    HLS-lohkojen evaluointi ASIC-piirien toteutusvuossa (Evaluation of HLS blocks in an ASIC implementation flow)

    Digital systems continue to grow in complexity, but design and verification productivity has not improved at the same rate, leading to a productivity gap. Raising the abstraction level of the design with high-level synthesis (HLS) has been proposed to increase productivity. However, at the higher abstraction level the designer has less control over the generated register-transfer level (RTL) code, which might cause problems later in the design flow. Moreover, certain design steps might be impractical to carry out with HLS. This thesis investigates whether HLS is compatible with an existing ASIC implementation flow. The research is conducted by creating an IP (intellectual property) block with a modern HLS tool and passing the generated RTL code through the various steps of the flow. The quality of results and the design effort are also compared to a manually coded RTL implementation of the same IP. The HLS tool and the generated RTL code are found to be mostly compatible with the existing flow, but a few problems are identified in ECOs (engineering change orders) and technology-specific component instantiation. The HLS design has almost the same physical area as the hand-written RTL design and meets the given timing constraints. Design effort with HLS is estimated to be 20-50% smaller than with traditional RTL design, and the C++ code contains 60% fewer lines than the manually written VHDL code.
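
    To make the abstraction difference concrete, the following hedged illustration (not the IP block evaluated in the thesis; the function and constants are invented for this example) shows the HLS style of description: an 8-tap FIR filter in C++, from which an HLS tool infers the shift register, pipeline and control logic that hand-written VHDL would describe explicitly as processes and signals. That inferred machinery is typically where such line-count differences come from.

        // Illustrative C++ HLS kernel: an 8-tap FIR filter.
        constexpr int TAPS = 8;

        int fir(int sample, const int coeff[TAPS]) {
            static int delay[TAPS] = {0};         // inferred as a shift register
            int acc = 0;
            for (int i = TAPS - 1; i > 0; --i) {  // unrollable MAC loop
                delay[i] = delay[i - 1];
                acc += delay[i] * coeff[i];
            }
            delay[0] = sample;
            acc += sample * coeff[0];
            return acc;
        }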

    Profilage, caractérisation et partitionnement fonctionnel dans une plate-forme de conception de systèmes embarqués (Profiling, characterization and functional partitioning in an embedded system design platform)

    Embedded systems have increasingly complex architectures and are now composed of several processors, buses, peripherals and hardware accelerators. The embedded system design methodologies currently used in industry are not keeping up with this evolution. System-level methodologies have been proposed in order to raise the level of abstraction of embedded system design. Such a methodology includes a virtual platform in which components can be allocated, and application tasks bound to them, for a transaction-level simulation of the resulting architecture. A system-level methodology can accelerate embedded system design by using an executable specification, automating design space exploration and synthesizing an architecture optimized for the application. However, current system-level methodologies have several shortcomings. They typically assume that the application is modeled with a restrictive model of computation and do not automate the synthesis of hardware blocks from application modules. They do not support non-intrusive profiling of the application or of an architecture implementing it. Their estimation methods do not automate the characterization of the application or of the platform. They also treat processor allocation, task binding to processors and the choice of a communication topology as separate problems rather than as different aspects of a single problem. We present a system-level methodology for the design, architectural exploration and synthesis of embedded systems based on the Space Codesign technology and its SPACE virtual platform.
This methodology tackles these problems by combining a more expressive model of computation, a method for the automated synthesis of hardware blocks from the modules of a SystemC specification, non-intrusive system-level profiling, a method for the automated characterization of the application and of the real-time operating system (RTOS), and heuristics for a unified formulation of the architectural exploration problem. We define for our methodology a novel model of computation, real-time process networks (RTPN), an extension of Kahn process networks. This extension enables the modeling of important aspects of real-time processing, such as polling, sensor sampling, input/output peripherals and real-time constraints. We define the denotational semantics of RTPNs, which is used to verify the functional correctness of a refinement from a SystemC executable specification to a concrete implementation. Our methodology includes an automated refinement from transaction-level communications to cycle- and pin-accurate protocols, as well as the automated generation of hardware blocks from application modules. Combined with an embedded-software generation method that includes an RTOS, this enables the generation of an implementation of the application that can be simulated with the virtual platform or synthesized and executed on the final target. A novel profiling method is applied to such simulations to non-intrusively extract data on the performance of modules, processors, the RTOS, buses and memories. A novel automated method characterizes, through profiled simulations, both the application functionality and the software and hardware implementations of its modules. The devices and buses of the virtual platform have also been characterized, and a novel method automates the characterization of the RTOS. These characterizations configure a high-level performance simulator for an accurate and very fast estimation of the performance of several candidate architectures for the application, taking into account bus contention and task scheduling on the processors. The characterization also powers a fast and accurate estimation of the required hardware resources. We formulate the architectural exploration problem so that it combines hardware/software partitioning, processor allocation, task binding to processors and the selection of a communication topology; candidate architectures are evaluated for performance and hardware cost with our estimation method. We present, for the first time, a combinatorial analysis of this problem and its formulation as a local-search problem, for which we define heuristics based on adaptive simulated annealing and reactive tabu search. The architecture selected by the exploration can then be synthesized into a final implementation in a well-established RTL design flow. The methodology as a whole has been applied to three case studies: a rover guidance system, a JPEG decoder with skin detection, and a WiMAX encoder/decoder.
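
    For readers unfamiliar with the starting point of the RTPN model, here is a minimal SystemC sketch of the baseline Kahn-process-network structure that the methodology extends: two threads communicating over a blocking FIFO channel. The module and variable names are illustrative, and the RTPN extensions named in the abstract (polling, sensor sampling, real-time constraints) are deliberately not modeled here.

        // Baseline KPN in SystemC: blocking reads and writes keep the
        // network deterministic regardless of process scheduling.
        #include <systemc.h>

        SC_MODULE(Producer) {
            sc_fifo_out<int> out;
            void run() {
                for (int i = 0; i < 10; ++i)
                    out.write(i);                    // blocking write
            }
            SC_CTOR(Producer) { SC_THREAD(run); }
        };

        SC_MODULE(Consumer) {
            sc_fifo_in<int> in;
            void run() {
                for (int i = 0; i < 10; ++i)
                    std::cout << in.read() << "\n";  // blocking read
            }
            SC_CTOR(Consumer) { SC_THREAD(run); }
        };

        int sc_main(int, char*[]) {
            sc_fifo<int> channel(4);                 // bounded FIFO channel
            Producer p("p");
            Consumer c("c");
            p.out(channel);
            c.in(channel);
            sc_start();
            return 0;
        }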