5 research outputs found
Accelerating Similarly Structured Data
The failure of Dennard scaling [Bohr, 2007] and the rapid growth of data produced and consumed daily [NetApp, 2012] have made two challenges paramount for modern computer architecture: mitigating the dark silicon phenomenon [Esmaeilzadeh et al., 2011], and processing large volumes and a wide variety of data quickly while consuming minimal energy. This thesis introduces the concept that grouping data structures already defined in software and processing them with an accelerator can significantly improve application performance and energy efficiency.
To measure the potential performance benefits of this hypothesis, this research starts by examining the cache impact of accelerating commonly used data structures and the applicability of such acceleration to popular benchmarks. We found that accelerating similarly structured data can provide substantial benefits; however, most popular benchmark suites do not contain shared acceleration targets and therefore cannot obtain significant performance or energy improvements from a handful of accelerators. To examine the hypothesis further in an environment where common data structures are widely used, we target the database application domain, using tables and columns as the similarly structured data, accelerating the processing of such data, and evaluating the resulting performance and energy efficiency. Given that data partitioning is widely used in database applications to improve cache locality, we architect and design a streaming data partitioning accelerator to assess the feasibility of big data acceleration. The results show an order-of-magnitude improvement in partitioning performance and energy. To improve upon the present ad hoc communication between accelerators and general-purpose processors [Vo et al., 2013], we also architect and evaluate a streaming framework that can be used for the data partitioner and other streaming accelerators alike. The streaming framework can provide at least 5 GB/s per stream per thread using software control, and handles interrupts and context switches with a simple save/restore. As a final evaluation of this hypothesis, we architect a class of domain-specific database processors, or Database Processing Units (DPUs), to further improve the performance and energy efficiency of database applications. As a case study, we design and implement one DPU, called Q100, to execute industry-standard analytic database queries.
Despite Q100's sensitivity to on-chip and off-chip communication bandwidth, we find that the low-power configuration of Q100 provides three orders of magnitude better energy efficiency than a state-of-the-art software Database Management System (DBMS), while the high-performance configuration outperforms the same DBMS by 70X.
Based on these experiments, we conclude that grouping similarly structured data and processing it with accelerators vastly improves application performance and energy efficiency for a given application domain. This is primarily because creating specialized, encapsulated instruction and data accesses and datapaths allows us to mitigate unnecessary data movement, take advantage of data and pipeline parallelism, and consequently provide substantial energy savings while obtaining significant performance gains.
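The abstract above refers to data partitioning as a way to improve cache locality in database workloads. As a rough software illustration only, and not the accelerator's actual design, the sketch below range-partitions a stream of records in a single pass. The function name `range_partition`, the splitter values, and the sample rows are all hypothetical.

```python
from bisect import bisect_right


def range_partition(records, splitters, key=lambda record: record[0]):
    """Route each record to the partition whose key range contains it.

    `splitters` must be sorted; n splitters define n + 1 partitions.
    This is an illustrative software analogue, not the thesis's hardware.
    """
    partitions = [[] for _ in range(len(splitters) + 1)]
    for record in records:  # one pass over the input stream
        index = bisect_right(splitters, key(record))
        partitions[index].append(record)
    return partitions


# Hypothetical sample data: (key, payload) rows.
rows = [(17, "q"), (3, "a"), (42, "z"), (8, "c"), (25, "m")]
parts = range_partition(rows, splitters=[10, 30])
```

Each partition then holds a contiguous key range, so a later join or aggregation can work on one partition at a time with a smaller working set.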
Data layout types: a type-based approach to automatic data layout transformations for improved SIMD vectorisation
The increasing complexity of modern hardware requires sophisticated programming
techniques for programs to run efficiently. At the same time, increased power of
modern hardware enables more advanced analyses to be included in compilers. This
thesis focuses on one particular optimisation technique that improves utilisation
of vector units. The foundation of this technique is the ability to choose memory
mappings for data structures of a given program.
Usually programming languages use a fixed layout for logical data structures
in physical memory. Such a static mapping often has a negative effect on usability
of vector units. In this thesis we consider a compiler for a programming language
that allows every data structure in a program to have its own data layout. We
make sure that data layouts across the program are sound, and most importantly
we solve a problem of automatic data layout reconstruction. To consistently do this,
we formulate this as a type inference problem, where a type encodes a data layout
for a given structure as well as implied program transformations. We prove that
type-implied transformations preserve semantics of the original programs and we
demonstrate significant performance improvements when targeting SIMD-capable
architectures.
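To make the idea of a per-structure data layout concrete, here is a minimal sketch of one well-known layout change, array-of-structures to structure-of-arrays. It is offered as an assumption about the kind of transformation meant, not as the thesis's own type system; the names `to_soa`, `norms_aos`, and `norms_soa` are hypothetical.

```python
from array import array

# Array-of-structures: each logical point keeps its fields together.
points_aos = [(1.0, 2.0), (3.0, 4.0), (5.0, 6.0)]


def to_soa(points):
    """Re-lay the same data as one contiguous array per field."""
    xs = array("d", (p[0] for p in points))
    ys = array("d", (p[1] for p in points))
    return xs, ys


def norms_aos(points):
    """Squared norms computed against the original layout."""
    return [x * x + y * y for x, y in points]


def norms_soa(xs, ys):
    """The same computation rewritten for the transformed layout."""
    return [x * x + y * y for x, y in zip(xs, ys)]


xs, ys = to_soa(points_aos)
```

Both functions return the same values, which is the semantics-preservation property in miniature. In a compiled language the second layout places like fields contiguously, which is what vector units generally prefer.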
Using a parallel programming model for architecture synthesis (original title: Χρήση μοντέλου παράλληλου προγραμματισμού για σύνθεση αρχιτεκτονικών)
The problem of automatically generating hardware modules from high-level application representations has been at the forefront of EDA research during the last few years. In this dissertation we introduce a methodology to automatically synthesize hardware accelerators from OpenCL applications. OpenCL is a recent industry-supported standard for writing programs that execute on multicore platforms and accelerators such as GPUs. Our methodology maps OpenCL kernels into hardware accelerators based on architectural templates that explicitly decouple computation from memory communication whenever this is possible. The templates can be tuned to provide a wide repertoire of accelerators that meet user performance requirements and FPGA device characteristics. Furthermore, a set of high- and low-level compiler optimizations is applied to generate optimized accelerators. Our experimental evaluation shows that the generated accelerators are tuned efficiently to match each application's memory access pattern and computational complexity and to achieve user performance requirements. An important objective of our tool is to expand the FPGA development user base to software engineers, thereby expanding the scope of FPGAs beyond the realm of hardware design.
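The phrase "decouple computation from memory communication" can be pictured as separate load, compute, and store stages connected by streams. The sketch below is a loose software analogy built from Python generators, under the assumption that the templates resemble a staged pipeline; it is not the tool's generated hardware, and `load_bursts`, `compute`, `store`, and the burst length are hypothetical.

```python
def load_bursts(memory, burst_len):
    """Memory stage: fetch fixed-size bursts, unaware of the computation."""
    for start in range(0, len(memory), burst_len):
        yield memory[start:start + burst_len]


def compute(bursts, scale):
    """Compute stage: consumes bursts as they arrive, never indexes memory."""
    for burst in bursts:
        yield [scale * value for value in burst]


def store(results):
    """Write-back stage: collects the result bursts in order."""
    output = []
    for burst in results:
        output.extend(burst)
    return output


memory = list(range(10))
output = store(compute(load_bursts(memory, burst_len=4), scale=2))
```

Because the stages only communicate through streams, the burst length can be tuned to the memory system without touching the compute stage.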
Specialization and reconfiguration of lightweight mobile processors for data-parallel applications
The worldwide use of mobile devices makes low-power mobile processors the leading segment of the computer industry. Customers demand low-cost, high-performance and energy-efficient mobile devices that execute sophisticated mobile applications such as multimedia and 3D games. State-of-the-art mobile devices already use chip multiprocessors (CMPs) with dedicated accelerators that exploit data-level parallelism (DLP) in these applications. Such heterogeneous system design enables mobile processors to deliver the desired performance and efficiency. The heterogeneity, however, increases processor complexity and manufacturing cost, because the accelerators require extra special-purpose hardware. In this thesis, we propose new hardware techniques that leverage the available resources of a mobile CMP to achieve cost-effective acceleration of DLP workloads.
Our techniques are inspired by classic vector architectures and the latest reconfigurable architectures, both of which achieve high power efficiency when running DLP workloads. The high requirement of additional resources for these two architectures limits their applicability beyond high-performance computers. To achieve their advantages in mobile devices, we propose techniques that: 1) specialize the lightweight mobile cores for classic vector execution of DLP workloads; 2) dynamically tune the number of cores for the specialized execution; and 3) reconfigure the bulk of the existing general-purpose execution resources into a compute hardware accelerator. Specialization enables one or more cores to process large vector operands of configurable length with new special-purpose vector instructions. Reconfiguration goes one step further and allows the compute hardware in mobile cores to dynamically implement the entire functionality of diverse compute algorithms.
The proposed specialization and reconfiguration techniques are applicable to a diverse range of general-purpose processors available in today's mobile devices. However, we chose to implement and evaluate them on a lightweight processor based on the Explicit Data Graph Execution architecture, which we find promising for research on low-power processors. The implemented techniques improve the performance of the mobile processor and the efficiency of its existing general-purpose resources. The processor with the specialization and reconfiguration techniques enabled efficiently exploits DLP without the extra cost of special-purpose accelerators.
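Processing vector operands of configurable length is commonly handled by strip-mining: a long operand is consumed in chunks no larger than the vector length the hardware is currently configured for. The sketch below illustrates that general technique in software only. It is an assumption about how configurable vector lengths might be used, not the thesis's instruction set, and `vadd_strip_mined` and `max_vl` are hypothetical names.

```python
def vadd_strip_mined(a, b, max_vl):
    """Element-wise add of two equal-length vectors in chunks of at most max_vl."""
    if len(a) != len(b):
        raise ValueError("operands must have the same length")
    if max_vl < 1:
        raise ValueError("max_vl must be positive")
    result = []
    for start in range(0, len(a), max_vl):
        # The final chunk may be shorter than max_vl.
        vl = min(max_vl, len(a) - start)
        chunk_a = a[start:start + vl]
        chunk_b = b[start:start + vl]
        # Stands in for one vector instruction over vl elements.
        result.extend(x + y for x, y in zip(chunk_a, chunk_b))
    return result


total = vadd_strip_mined([1, 2, 3, 4, 5], [10, 20, 30, 40, 50], max_vl=2)
```

The result does not depend on `max_vl`, so the vector length can be changed, for example when more cores are assigned to vector execution, without changing the program's meaning.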
RISPP: A Run-time Adaptive Reconfigurable Embedded Processor
This Ph.D. thesis describes a new approach for adaptive processors that uses a reconfigurable fabric (embedded FPGA) to implement application-specific accelerators. A novel modular Special Instruction composition is presented along with a run-time system that exploits the provided adaptivity. The approach was simulated and prototyped using an FPGA. Comparisons with state-of-the-art application-specific and reconfigurable processors demonstrate significant improvements in performance and efficiency.