
    Locality analysis and its hardware implications for graph pattern mining

    In this work, we have addressed the acceleration of graph pattern mining (GPM) applications from the perspective offered by the near-data processing (NDP) architecture. We have developed a new simulation tool based on the integration of two well-known simulators: ZSim (for the cores and the caches) and Ramulator (for the memory). We had to design this integration ourselves because the available implementation for the joint use of the two simulators does not take advantage of the techniques that ZSim uses to reduce the loss of simulation accuracy. On top of this tool, we have implemented NDMiner, a state-of-the-art GPM accelerator based on the NDP architecture. The simulation tool enables detailed profiling of NDMiner, which is very useful for identifying its weak points; in this way, the simulator facilitates the design of strategies to improve the accelerator's performance. Through a series of simulation experiments, we have elaborated a set of concrete proposals to solve the problems detected and to improve NDMiner.
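    To make the coupling concrete, the sketch below illustrates one plausible shape for such an integration: a bridge that buffers memory requests from a phase-based core model (ZSim simulates cores in bounded phases under its bound-weave scheme, presumably the accuracy-preserving technique the abstract alludes to) and drains them into a cycle-accurate DRAM model at phase boundaries. All types here (MemRequest, DramBackend, ZSimToRamulatorBridge) and the fixed-latency DRAM stub are illustrative assumptions, not the real ZSim or Ramulator APIs.

        // Hypothetical sketch of a ZSim/Ramulator coupling point; all
        // interfaces are illustrative stand-ins, not the real simulator APIs.
        #include <cstdint>
        #include <cstdio>
        #include <functional>
        #include <queue>
        #include <utility>

        struct MemRequest {
            uint64_t addr;
            bool is_write;
            std::function<void(uint64_t)> on_complete; // receives completion cycle
        };

        // Stand-in for Ramulator: a fixed-latency DRAM stub so the sketch runs;
        // the real backend models banks, timing constraints, and request queues.
        class DramBackend {
            struct InFlight { MemRequest req; uint64_t ready_cycle; };
            std::queue<InFlight> inflight_;
            uint64_t cycle_ = 0;
            static constexpr uint64_t kLatency = 100; // assumed access latency
        public:
            void send(MemRequest req) {
                inflight_.push({std::move(req), cycle_ + kLatency});
            }
            void tick() { // advance DRAM one memory cycle, retire finished requests
                ++cycle_;
                while (!inflight_.empty() && inflight_.front().ready_cycle <= cycle_) {
                    inflight_.front().req.on_complete(cycle_);
                    inflight_.pop();
                }
            }
        };

        // Bridge between clock domains: cores run in bounded phases, so requests
        // are buffered and drained into the DRAM model only at phase boundaries,
        // where their relative ordering is known to be safe.
        class ZSimToRamulatorBridge {
            DramBackend& dram_;
            std::queue<MemRequest> pending_; // requests from the current phase
        public:
            explicit ZSimToRamulatorBridge(DramBackend& dram) : dram_(dram) {}
            void enqueue(MemRequest req) { pending_.push(std::move(req)); }
            void end_phase(uint64_t mem_cycles_in_phase) {
                while (!pending_.empty()) {
                    dram_.send(std::move(pending_.front()));
                    pending_.pop();
                }
                for (uint64_t c = 0; c < mem_cycles_in_phase; ++c)
                    dram_.tick();
            }
        };

        int main() {
            DramBackend dram;
            ZSimToRamulatorBridge bridge(dram);
            bridge.enqueue({0x1000, false, [](uint64_t done) {
                std::printf("request done at cycle %llu\n", (unsigned long long)done);
            }});
            bridge.end_phase(200); // phase long enough to cover the DRAM latency
            return 0;
        }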

    Deployment of Deep Neural Networks on Dedicated Hardware Accelerators

    Deep Neural Networks (DNNs) have established themselves as powerful tools for a wide range of complex tasks, for example computer vision or natural language processing. DNNs are notoriously demanding on compute resources, and as a result, dedicated hardware accelerators are developed for all use cases. Different accelerators provide solutions ranging from hyperscale cloud environments for the training of DNNs to inference devices in embedded systems. They implement intrinsics for complex operations directly in hardware; a common example is intrinsics for matrix multiplication. However, there exists a gap between the ecosystems of applications for deep learning practitioners and hardware accelerators. How DNNs can efficiently utilize the specialized hardware intrinsics is still mainly defined by human hardware and software experts. Methods to automatically utilize hardware intrinsics in DNN operators are a subject of active research. Existing literature often works with transformation-driven approaches, which aim to establish a sequence of program rewrites and data-layout transformations such that the hardware intrinsic can be used to compute the operator. However, the complexity of this task has not yet been explored, especially for less frequently used operators like Capsule Routing. Not only is the implementation of DNN operators with intrinsics challenging; their optimization on the target device is also difficult. Hardware-in-the-loop tools are often used for this problem. They use latency measurements of implementation candidates to find the fastest one. However, specialized accelerators can have memory and programming limitations, so that not every arithmetically correct implementation is a valid program for the accelerator. These invalid implementations can lead to unnecessarily long optimization times. This work investigates the complexity of transformation-driven processes to automatically embed hardware intrinsics into DNN operators. It is explored with a custom, graph-based intermediate representation (IR). While operators like Fully Connected layers can be handled with reasonable effort, increasing operator complexity or advanced data-layout transformations can lead to scaling issues. Building on these insights, this work proposes a novel method, based on a dataflow analysis, to embed hardware intrinsics into DNN operators. The dataflow embedding method allows the exploration of how intrinsics and operators match without explicit transformations. From the results, it can derive the data layout and program structure necessary to compute the operator with the intrinsic. A prototype implementation for a dedicated hardware accelerator demonstrates state-of-the-art performance for a wide range of convolutions, while being agnostic to the data layout. For some operators in the benchmark, the presented method can also generate alternative implementation strategies to improve hardware utilization, resulting in a geo-mean speed-up of ×2.813 while reducing the memory footprint. Lastly, by curating the initial set of possible implementations for the hardware-in-the-loop optimization, the median time-to-solution is reduced by a factor of ×2.40. At the same time, the risk of prolonged searches due to a bad initial set of implementations is reduced, improving the optimization's robustness by ×2.35.
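    As a concrete illustration of the gap the abstract describes, the sketch below maps a Fully Connected layer onto a fixed-shape matrix-multiplication intrinsic by tiling and packing the operands. The TxT "intrinsic" is a software stand-in for a hardware instruction, and the tile size and layout are assumptions; the packing loops are exactly the kind of data-layout transformation that becomes hard to derive automatically for operators like Capsule Routing.

        #include <algorithm>
        #include <cstddef>
        #include <cstdio>
        #include <vector>

        constexpr std::size_t T = 16; // assumed intrinsic tile shape (16x16x16)

        // Software stand-in for the hardware intrinsic: C += A * B on TxT tiles.
        void matmul_intrinsic(const float* A, const float* B, float* C) {
            for (std::size_t i = 0; i < T; ++i)
                for (std::size_t k = 0; k < T; ++k)
                    for (std::size_t j = 0; j < T; ++j)
                        C[i * T + j] += A[i * T + k] * B[k * T + j];
        }

        // Fully Connected layer: out[M x N] = in[M x K] * w[K x N], assuming
        // M, N, K are multiples of T. The packing loops below are the
        // data-layout transformation: row-major operands are gathered into
        // the contiguous tiles the intrinsic can consume.
        void fully_connected(const std::vector<float>& in,
                             const std::vector<float>& w,
                             std::vector<float>& out,
                             std::size_t M, std::size_t N, std::size_t K) {
            std::vector<float> a(T * T), b(T * T), c(T * T);
            for (std::size_t mi = 0; mi < M; mi += T)
                for (std::size_t ni = 0; ni < N; ni += T) {
                    std::fill(c.begin(), c.end(), 0.0f);
                    for (std::size_t ki = 0; ki < K; ki += T) {
                        for (std::size_t i = 0; i < T; ++i)
                            for (std::size_t j = 0; j < T; ++j) {
                                a[i * T + j] = in[(mi + i) * K + (ki + j)];
                                b[i * T + j] = w[(ki + i) * N + (ni + j)];
                            }
                        matmul_intrinsic(a.data(), b.data(), c.data());
                    }
                    for (std::size_t i = 0; i < T; ++i)
                        for (std::size_t j = 0; j < T; ++j)
                            out[(mi + i) * N + (ni + j)] = c[i * T + j];
                }
        }

        int main() {
            const std::size_t M = 16, N = 16, K = 16;
            std::vector<float> in(M * K, 1.0f), w(K * N, 1.0f), out(M * N);
            fully_connected(in, w, out, M, N, K);
            std::printf("out[0] = %.1f\n", out[0]); // prints 16.0
            return 0;
        }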

    Customizing the Computation Capabilities of Microprocessors.

    Designers of microprocessor-based systems must constantly improve performance and increase computational efficiency in their designs to create value. To this end, it is increasingly common to see computation accelerators in general-purpose processor designs. Computation accelerators collapse portions of an application's dataflow graph, reducing the critical path of computations, easing the burden on processor resources, and reducing energy consumption in systems. There are many problems associated with adding accelerators to microprocessors, though. Design of accelerators, architectural integration, and software support all present major challenges. This dissertation tackles these challenges in the context of accelerators targeting acyclic and cyclic patterns of computation. First, a technique to identify critical computation subgraphs within an application set is presented. This technique is hardware-cognizant and effectively generates a set of instruction set extensions given a domain of target applications. Next, several general-purpose accelerator structures are quantitatively designed using critical subgraph analysis for a broad application set. The next challenge is architectural integration of accelerators. Traditionally, software invokes accelerators by statically encoding new instructions into the application binary. This is incredibly costly, though, requiring many portions of hardware and software to be redesigned. This dissertation develops strategies to utilize accelerators without changing the instruction set. In the proposed approach, the microarchitecture translates applications at run-time, replacing computation subgraphs with microcode to utilize accelerators. We explore the tradeoffs in performing difficult aspects of the translation at compile-time, while retaining run-time replacement. This culminates in a simple microarchitectural interface that supports a plug-and-play model for integrating accelerators into a pre-designed microprocessor. Software support is the last challenge in dealing with computation accelerators. The primary issue is difficulty in generating high-quality code utilizing accelerators. Hand-written assembly code is standard in industry, and if compiler support does exist, simple greedy algorithms are common. In this work, we investigate more thorough techniques for compiling for computation accelerators. Where greedy heuristics only explore one possible solution, the techniques in this dissertation explore the entire design space, when possible. Intelligent pruning methods ensure that compilation is both tractable and scalable.
    Ph.D. dissertation, Computer Science & Engineering, University of Michigan, Horace H. Rackham School of Graduate Studies. http://deepblue.lib.umich.edu/bitstream/2027.42/57633/2/ntclark_1.pd
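    The sketch below gives a toy flavor of the run-time translation idea: scanning a decoded instruction stream for a dataflow subgraph (here, a multiply feeding an add) and collapsing it into a single accelerator operation, much as the proposed microcode replacement would. The opcode set and the fused ACC_MAC operation are illustrative assumptions, not the dissertation's actual interface.

        #include <cstddef>
        #include <cstdint>
        #include <cstdio>
        #include <vector>

        enum class Op : uint8_t { MUL, ADD, ACC_MAC /* fused multiply-add */ };

        struct Insn { Op op; uint8_t dst, src1, src2, src3; };

        // Replace MUL t = a*b immediately followed by ADD d = t + c with one
        // fused op: the smallest example of collapsing a dataflow subgraph.
        std::vector<Insn> fuse_mac(const std::vector<Insn>& in) {
            std::vector<Insn> out;
            for (std::size_t i = 0; i < in.size(); ++i) {
                bool fusible = i + 1 < in.size() && in[i].op == Op::MUL &&
                               in[i + 1].op == Op::ADD &&
                               in[i + 1].src1 == in[i].dst;
                if (fusible) {
                    // dst = src1 * src2 + src3. A real translator must also
                    // prove the intermediate in[i].dst is dead after the ADD
                    // and handle the commuted case (product in ADD's src2).
                    out.push_back({Op::ACC_MAC, in[i + 1].dst, in[i].src1,
                                   in[i].src2, in[i + 1].src2});
                    ++i; // skip the consumed ADD
                } else {
                    out.push_back(in[i]);
                }
            }
            return out;
        }

        int main() {
            // r3 = r1 * r2; r4 = r3 + r0  ->  r4 = r1 * r2 + r0
            std::vector<Insn> code = {{Op::MUL, 3, 1, 2, 0},
                                      {Op::ADD, 4, 3, 0, 0}};
            std::printf("%zu instruction(s) after fusion\n",
                        fuse_mac(code).size()); // prints 1
            return 0;
        }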

    GraphMineSuite: Enabling High-Performance and Programmable Graph Mining Algorithms with Set Algebra

    We propose GraphMineSuite (GMS): the first benchmarking suite for graph mining that facilitates evaluating and constructing high-performance graph mining algorithms. First, GMS comes with a benchmark specification based on extensive literature review, prescribing representative problems, algorithms, and datasets. Second, GMS offers a carefully designed software platform for seamless testing of different fine-grained elements of graph mining algorithms, such as graph representations or algorithm subroutines. The platform includes parallel implementations of more than 40 considered baselines, and it facilitates developing complex and fast mining algorithms. High modularity is possible by harnessing set algebra operations such as set intersection and difference, which enables breaking complex graph mining algorithms into simple building blocks that can be separately experimented with. GMS is supported by a broad concurrency analysis that makes its performance insights portable, and a novel performance metric to assess the throughput of graph mining algorithms, enabling more insightful evaluation. As use cases, we harness GMS to rapidly redesign and accelerate state-of-the-art baselines of core graph mining problems: degeneracy reordering (by up to >2x), maximal clique listing (by up to >9x), k-clique listing (by 1.1x), and subgraph isomorphism (by up to 2.5x), also obtaining better theoretical performance bounds.
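    The set-algebra decomposition GMS promotes can be pictured with a small sketch (not GMS code): triangle counting expressed entirely through one reusable building block, a sorted-set intersection, which could then be swapped, vectorized, or parallelized independently of the surrounding algorithm.

        #include <cstddef>
        #include <cstdint>
        #include <cstdio>
        #include <vector>

        using AdjList = std::vector<std::vector<uint32_t>>; // sorted neighbor lists

        // Reusable building block: |A ∩ B| over two sorted vectors.
        std::size_t intersect_count(const std::vector<uint32_t>& a,
                                    const std::vector<uint32_t>& b) {
            std::size_t i = 0, j = 0, count = 0;
            while (i < a.size() && j < b.size()) {
                if (a[i] < b[j]) ++i;
                else if (a[i] > b[j]) ++j;
                else { ++count; ++i; ++j; }
            }
            return count;
        }

        std::size_t count_triangles(const AdjList& g) {
            // Orient every edge from lower to higher vertex id; each triangle
            // u < v < w is then counted exactly once, as w in N+(u) ∩ N+(v).
            AdjList out(g.size());
            for (uint32_t u = 0; u < g.size(); ++u)
                for (uint32_t v : g[u])
                    if (v > u) out[u].push_back(v); // stays sorted (g is sorted)
            std::size_t triangles = 0;
            for (uint32_t u = 0; u < out.size(); ++u)
                for (uint32_t v : out[u])
                    triangles += intersect_count(out[u], out[v]);
            return triangles;
        }

        int main() {
            // A 4-cycle with one chord: triangles {0,1,2} and {0,2,3}.
            AdjList g = {{1, 2, 3}, {0, 2}, {0, 1, 3}, {0, 2}};
            std::printf("%zu triangles\n", count_triangles(g)); // prints 2
            return 0;
        }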