2 research outputs found

    MEMORY INTERFACE SYNTHESIS FOR FPGA-BASED COMPUTING

    Get PDF
    This dissertation describes a methodology for the generation of a custom memory interface and associated direct memory access (DMA) controller for FPGA-based kernels that have a regular access pattern. The interface provides explicit support for the following features: (1) memory latency hiding, (2) static access scheduling, and (3) data reuse. The target platform is a multi-FPGA platform, the Convey HC-1, which has an advanced memory system that presents the user logic with three critical design challenges: the memory system itself does not perform caching or prefetching, memory operations are arbitrarily reordered, and the memory performance depends on the access order provided by the user logic. The objective of the interface is to reconcile the three problems described above and maximize overall interface performance. This dissertation proposes three memory access orders, explores buffering and blocking techniques, and exploits data reuse for the synthesis of custom memory interfaces for specific types of kernels. We evaluate our techniques with two types of benchmark kernels: matrix-vector multiplication and 6- and 27-point stencil operations. Experimental results show the proposed memory interface designs that combine memory latency hiding, access scheduling and data reuse achieve an overall performance speedup of 1.6 for matrix-vector multiplication, 2.2 for a 6-point stencil, and 9.5 for a 27-point stencil as compared to using a na茂ve memory interface

    Architectural explorations for streaming accelerators with customized memory layouts

    Get PDF
    El concepto b谩sico de la arquitectura mono-nucleo en los procesadores de prop贸sito general se ajusta bien a un modelo de programaci贸n secuencial. La integraci贸n de multiples n煤cleos en un solo chip ha permitido a los procesadores correr partes del programa en paralelo. Sin embargo, la explotaci贸n del enorme paralelismo disponible en muchas aplicaciones de alto rendimiento y de los datos correspondientes es dif铆cil de conseguir usando unicamente multicores de prop贸sito general. La aparici贸n de aceleradores tipo streaming y de los correspondientes modelos de programaci贸n han mejorado esta situaci贸n proporcionando arquitecturas orientadas al proceso de flujos de datos. La idea b谩sica detr谩s del dise帽o de estas arquitecturas responde a la necesidad de procesar conjuntos enormes de datos. Estos dispositivos de alto rendimiento orientados a flujos permiten el procesamiento rapido de datos mediante el uso eficiente de computaci贸n paralela y comunicaci贸n entre procesos. Los aceleradores streaming orientados a flujos, igual que en otros procesadores, consisten en diversos componentes micro-arquitectonicos como por ejemplo las estructuras de memoria, las unidades de computo, las unidades de control, los canales de Entrada/Salida y controles de Entrada/Salida, etc. Sin embargo, los requisitos del flujo de datos agregan algunas caracter铆sticas especiales e imponen otras restricciones que afectan al rendimiento. Estos dispositivos, por lo general, ofrecen un gran n煤mero de recursos computacionales, pero obligan a reorganizar los conjuntos de datos en paralelo, maximizando la independiencia para alimentar los recursos de computaci贸n en forma de flujos. La disposici贸n de datos en conjuntos independientes de flujos paralelos no es una tarea sencilla. Es posible que se tenga que cambiar la estructura de un algoritmo en su conjunto o, incluso, puede requerir la reescritura del algoritmo desde cero. Sin embargo, todos estos esfuerzos para la reordenaci贸n de los patrones de las aplicaciones de acceso a datos puede que no sean muy 煤tiles para lograr un rendimiento 贸ptimo. Esto es debido a las posibles limitaciones microarquitectonicas de la plataforma de destino para los mecanismos hardware de prefetch, el tama帽o y la granularidad del almacenamiento local, y la flexibilidad para disponer de forma serial los datos en el interior del almacenamiento local. Las limitaciones de una plataforma de streaming de proposito general para el prefetching de datos, almacenamiento y demas procedimientos para organizar y mantener los datos en forma de flujos paralelos e independientes podr铆a ser eliminado empleando t茅cnicas a nivel micro-arquitectonico. Esto incluye el uso de memorias personalizadas especificamente para las aplicaciones en el front-end de una arquitectura streaming. El objetivo de esta tesis es presentar exploraciones arquitect贸nicas de los aceleradores streaming con dise帽os de memoria personalizados. En general, la tesis cubre tres aspectos principales de tales aceleradores. Estos aspectos se pueden clasificar como: i) Dise帽o de aceleradores de aplicaciones espec铆ficas con dise帽os de memoria personalizados, ii) dise帽o de aceleradores con memorias personalizadas basados en plantillas, y iii) exploraciones del espacio de dise帽o para dispositivos orientados a flujos con las memorias est谩ndar y personalizadas. Esta tesis concluye con la propuesta conceptual de una Blacksmith Streaming Architecture (BSArc). El modelo de computaci贸n Blacksmith permite la adopci贸n a nivel de hardware de un front-end de aplicaci贸n espec铆fico utilizando una GPU como back-end. Esto permite maximizar la explotaci贸n de la localidad de datos y el paralelismo a nivel de datos de una aplicaci贸n mientras que proporciona un flujo mayor de datos al back-end. Consideramos que el dise帽o de estos procesadores con memorias especializadas debe ser proporcionado por expertos del dominio de aplicaci贸n en la forma de plantillas.The basic concept behind the architecture of a general purpose CPU core conforms well to a serial programming model. The integration of more cores on a single chip helped CPUs in running parts of a program in parallel. However, the utilization of huge parallelism available from many high performance applications and the corresponding data is hard to achieve from these general purpose multi-cores. Streaming accelerators and the corresponding programing models improve upon this situation by providing throughput oriented architectures. The basic idea behind the design of these architectures matches the everyday increasing requirements of processing huge data sets. These high-performance throughput oriented devices help in high performance processing of data by using efficient parallel computations and streaming based communications. The throughput oriented streaming accelerators 驴 similar to the other processors 驴 consist of numerous types of micro-architectural components including the memory structures, compute units, control units, I/O channels and I/O controls etc. However, the throughput requirements add some special features and impose other restrictions for the performance purposes. These devices, normally, offer a large number of compute resources but restrict the applications to arrange parallel and maximally independent data sets to feed the compute resources in the form of streams. The arrangement of data into independent sets of parallel streams is not an easy and simple task. It may need to change the structure of an algorithm as a whole or even it can require to write a new algorithm from scratch for the target application. However, all these efforts for the re-arrangement of application data access patterns may still not be very helpful to achieve the optimal performance. This is because of the possible micro-architectural constraints of the target platform for the hardware pre-fetching mechanisms, the size and the granularity of the local storage and the flexibility in data marshaling inside the local storage. The constraints of a general purpose streaming platform on the data pre-fetching, storing and maneuvering to arrange and maintain it in the form of parallel and independent streams could be removed by employing micro-architectural level design approaches. This includes the usage of application specific customized memories in the front-end of a streaming architecture. The focus of this thesis is to present architectural explorations for the streaming accelerators using customized memory layouts. In general the thesis covers three main aspects of such streaming accelerators in this research. These aspects can be categorized as : i) Design of Application Specific Accelerators with Customized Memory Layout ii) Template Based Design Support for Customized Memory Accelerators and iii) Design Space Explorations for Throughput Oriented Devices with Standard and Customized Memories. This thesis concludes with a conceptual proposal on a Blacksmith Streaming Architecture (BSArc). The Blacksmith Computing allow the hardware-level adoption of an application specific front-end with a GPU like streaming back-end. This gives an opportunity to exploit maximum possible data locality and the data level parallelism from an application while providing a throughput natured powerful back-end. We consider that the design of these specialized memory layouts for the front-end of the device are provided by the application domain experts in the form of templates. These templates are adjustable according to a device and the problem size at the device's configuration time. The physical availability of such an architecture may still take time. However, simulation framework helps in architectural explorations to give insight into the proposal and predicts potential performance benefits for such an architecture
    corecore