152 research outputs found

Design and Implementation of a Domain Specific Language for Deep Learning

Author: Huang Xiao Bing
Publication venue: UWM Digital Commons
Publication date: 01/05/2018

Deep Learning (DL) has recently found great success in diverse areas such as machine vision, speech recognition, big data analysis, and multimedia understanding. However, the existing state-of-the-art DL frameworks, e.g. Caffe2, Theano, TensorFlow, MxNet, Torch7, and CNTK, are programming libraries with fixed user interfaces, internal representations, and execution environments. Modifying the code of DL layers or data structures is very challenging without an in-depth understanding of the underlying implementation. The optimization of code and execution in these tools is often limited and relies on specific DL computation graph manipulation and scheduling that lack systematic and universal strategies. Furthermore, most of these tools demand many dependencies besides the tool itself and must be built for specific platforms for DL training or inference. This dissertation presents DeepDSL, a domain specific language (DSL) embedded in Scala, that compiles DL networks encoded with DeepDSL to efficient, compact, and portable Java source programs for DL training and inference. DeepDSL represents DL networks as abstract tensor functions, performs symbolic gradient derivations to generate the Intermediate Representation (IR), optimizes the IR expressions, and compiles the optimized IR expressions to cross-platform Java code that is easily modifiable and debuggable. The generated code also runs directly on GPU without additional dependencies except a small set of JNI (Java Native Interface) wrappers for invoking the underlying GPU libraries. Moreover, DeepDSL provides static analysis for memory consumption and error detection. DeepDSL has been evaluated with many current state-of-the-art DL networks, e.g. Alexnet, GoogleNet, VGG, Overfeat, and Deep Residual Network (earlier results are reported in [zhao2017]; design and implementation details are summarized in [Zhao2018]). While the DSL code is highly compact, with fewer than 100 lines for each network, the Java source code generated by the DeepDSL compiler is highly efficient. Our experiments show that the output Java source has very competitive runtime performance and memory efficiency compared to the existing DL frameworks.
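
The "symbolic gradient derivation" step mentioned in this abstract can be illustrated with a minimal sketch. It is written in Python rather than Scala, works on scalar expressions rather than tensors, and every name in it (`Var`, `Const`, `Add`, `Mul`, `grad`, `simplify`) is hypothetical. It is not DeepDSL's IR or API, only an example of deriving and then simplifying a gradient expression symbolically.

```python
from dataclasses import dataclass


# Hypothetical expression nodes; DeepDSL's real IR is tensor-based and richer.
@dataclass(frozen=True)
class Var:
    name: str


@dataclass(frozen=True)
class Const:
    value: float


@dataclass(frozen=True)
class Add:
    left: object
    right: object


@dataclass(frozen=True)
class Mul:
    left: object
    right: object


def grad(expr, wrt):
    """Return the symbolic derivative of expr with respect to the variable wrt."""
    if isinstance(expr, Var):
        return Const(1.0 if expr == wrt else 0.0)
    if isinstance(expr, Const):
        return Const(0.0)
    if isinstance(expr, Add):  # sum rule
        return Add(grad(expr.left, wrt), grad(expr.right, wrt))
    if isinstance(expr, Mul):  # product rule
        return Add(Mul(grad(expr.left, wrt), expr.right),
                   Mul(expr.left, grad(expr.right, wrt)))
    raise TypeError(f"unsupported node: {expr!r}")


def simplify(expr):
    """Fold constants and drop additions of 0 and multiplications by 0 or 1."""
    if not isinstance(expr, (Add, Mul)):
        return expr
    left, right = simplify(expr.left), simplify(expr.right)
    zero, one = Const(0.0), Const(1.0)
    if isinstance(expr, Add):
        if left == zero:
            return right
        if right == zero:
            return left
        if isinstance(left, Const) and isinstance(right, Const):
            return Const(left.value + right.value)
        return Add(left, right)
    if left == zero or right == zero:
        return zero
    if left == one:
        return right
    if right == one:
        return left
    if isinstance(left, Const) and isinstance(right, Const):
        return Const(left.value * right.value)
    return Mul(left, right)
```

For the expression `w * x + b`, `simplify(grad(Add(Mul(Var("w"), Var("x")), Var("b")), Var("w")))` reduces to `Var("x")`. A compiler of the kind the abstract describes would go on to generate target code from such simplified expressions.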

University of Wisconsin-Milwaukee

Abstractions and performance optimisations for finite element methods

Author: Sun Tianjiao
Publication venue: Computing, Imperial College London
Publication date: 01/01/2022

Finding numerical solutions to partial differential equations (PDEs) is an essential task in the discipline of scientific computing. In designing software tools for this task, one of the ultimate goals is to balance the needs for generality, ease of use and high performance. Domain-specific systems based on code generation techniques, such as Firedrake, attempt to address this problem with a design consisting of a hierarchy of abstractions, where users specify the mathematical problems via a high-level, descriptive interface, which is progressively lowered through the intermediate abstractions. Well-designed abstraction layers are essential for performing code transformations and optimisations robustly and efficiently, generating high-performance code without user intervention. This thesis discusses several topics on the design of the abstraction layers of Firedrake, and presents the benefit of its software architecture by providing examples of various optimising code transformations at the appropriate abstraction layers. In particular, we discuss the advantage of describing the local assembly stage of a finite element solver in an intermediate representation based on symbolic tensor algebra. We successfully lift specific loop optimisations, previously implemented by rewriting ASTs of the local assembly kernels, to this higher-level tensor language, improving the compilation speed and optimisation effectiveness. The global assembly phase involves the application of local assembly kernels on a collection of entities of an unstructured mesh. We redesign the abstraction to express the global assembly loop nests using tools and concepts based on the polyhedral model. This enables us to implement the cross-element vectorisation algorithm that delivers stable vectorisation performance on CPUs automatically. 
This abstraction also improves the portability of Firedrake, as we demonstrate targeting GPU devices transparently from the same software stack.

Spiral - Imperial College Digital Repository

A massively parallel GPU-accelerated model for analysis of fully nonlinear free surface waves

Author: Axelsson, Bingham, Brodtkorb, Cai, Demmel, Engsig-Karup, Engsig-Karup, Fructus, Göddeke, Hillis, Hwu, Li, McKee, Nickolls, Owens, Panchang, Patterson, Rienecker, Svendsen, Thompson, Trottenberg, Wallin, Wong, Young, Zakharov
Publication venue: Wiley
Publication date: 01/01/2011

Crossref

Online Research Database In Technology

Viability of Numerical Full-Wave Techniques in Telecommunication Channel Modelling

Author: Roman Novak
Publication venue: Croatian Communications and Information Society
Publication date: 01/01/2020

In telecommunication channel modelling, the wavelength is small compared to the physical features of interest; deterministic ray tracing techniques therefore provide solutions that are more efficient and faster than current numerical full-wave techniques while staying within time constraints. Solving the fundamental Maxwell equations is at the core of computational electrodynamics and is best suited for modelling electrical field interactions with physical objects where the characteristic dimensions of the computing domain are on the order of a few wavelengths. However, extreme communication speeds, wireless access points closer to the user, and smaller pico and femto cells will require increased accuracy in predicting and planning wireless signals, testing the accuracy limits of ray tracing methods. Increased computing capabilities and the demand for better characterization of communication channels that span smaller geographical areas make numerical full-wave techniques an attractive alternative even for larger problems. The paper surveys ways of overcoming the excessive time requirements of numerical full-wave techniques while providing acceptable channel modelling accuracy for the smallest radio cells and possibly larger ones. We identify several research paths that could lead to improved channel modelling, including numerical algorithm adaptations for large-scale problems, alternative finite-difference approaches, such as meshless methods, and dedicated parallel hardware, possibly as a realization of a dataflow machine.

HRČAK - Portal of Croatian Scientific and Professional Journals

Hrčak - Portal of scientific journals of Croatia

Productive and efficient computational science through domain-specific abstractions

Author: Rathgeber Florian
Publication venue: Computing, Imperial College London
Publication date: 01/11/2014

In an ideal world, scientific applications are computationally efficient, maintainable and composable and allow scientists to work very productively. We argue that these goals are achievable for a specific application field by choosing suitable domain-specific abstractions that encapsulate domain knowledge with a high degree of expressiveness. This thesis demonstrates the design and composition of domain-specific abstractions by abstracting the stages a scientist goes through in formulating a problem of numerically solving a partial differential equation. Domain knowledge is used to transform this problem into a different, lower level representation and decompose it into parts which can be solved using existing tools. A system for the portable solution of partial differential equations using the finite element method on unstructured meshes is formulated, in which contributions from different scientific communities are composed to solve sophisticated problems. The concrete implementations of these domain-specific abstractions are Firedrake and PyOP2. Firedrake allows scientists to describe variational forms and discretisations for linear and non-linear finite element problems symbolically, in a notation very close to their mathematical models. PyOP2 abstracts the performance-portable parallel execution of local computations over the mesh on a range of hardware architectures, targeting multi-core CPUs, GPUs and accelerators. Thereby, a separation of concerns is achieved, in which Firedrake encapsulates domain knowledge about the finite element method separately from its efficient parallel execution in PyOP2, which in turn is completely agnostic to the higher abstraction layer. As a consequence of the composability of those abstractions, optimised implementations for different hardware architectures can be automatically generated without any changes to a single high-level source. Performance matches or exceeds what is realistically attainable by hand-written code. 
Firedrake and PyOP2 are combined to form a tool chain that is demonstrated to be competitive with or faster than available alternatives on a wide range of different finite element problems.
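
The separation of concerns described in this abstract, where a local computation is written once and an execution layer applies it over the mesh, can be sketched in a few lines of plain Python. The names `par_loop`, `lumped_mass_kernel`, and `cell_to_nodes` are hypothetical and chosen for illustration. This is not the PyOP2 or Firedrake API, and the loop runs sequentially where a real execution layer would generate parallel code.

```python
def lumped_mass_kernel(xs):
    """Local kernel for a 1D linear element: each node gets half the element length.

    Knows nothing about the mesh or how the loop over elements is executed.
    """
    h = abs(xs[1] - xs[0])
    return [h / 2.0, h / 2.0]


def par_loop(kernel, cell_to_nodes, coords, out):
    """Apply kernel to every element and scatter-add the results into out.

    Knows nothing about what the kernel computes; a real execution layer
    could parallelise or vectorise this loop without touching the kernel.
    """
    for nodes in cell_to_nodes:
        local = kernel([coords[n] for n in nodes])  # gather via the indirection map
        for n, value in zip(nodes, local):          # scatter-add into global data
            out[n] += value
    return out


if __name__ == "__main__":
    # Hypothetical 1D mesh with three nodes and two elements.
    coords = [0.0, 1.0, 3.0]
    cell_to_nodes = [(0, 1), (1, 2)]
    print(par_loop(lumped_mass_kernel, cell_to_nodes, coords, [0.0, 0.0, 0.0]))
```

Swapping in a different kernel changes what is computed, and swapping in a different `par_loop` changes how it is executed. Neither change requires touching the other side.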

Spiral - Imperial College Digital Repository

A method to improve interest point detection and its GPU implementation

Author: Karuppannan Gunashekhar Prabakar
Publication venue: LSU Digital Commons
Publication date: 01/01/2012

Interest point detection is an important low-level image processing technique with a wide range of applications. Point detectors have to be robust under affine, scale and photometric changes. There are many scale and affine invariant point detectors, but they are not robust to large illumination changes. Many affine invariant interest point detectors and region descriptors work on points detected using scale invariant operators. Since the performance of those detectors depends on the performance of the scale invariant detectors, it is important that the scale invariant initial-stage detectors be robust. It is therefore important to design a detector that is very robust to illumination, because illumination changes are the most common. This research takes the illumination problem as its main focus and develops a scale invariant detector that is robust to illumination changes. It was shown in [6] that a contrast stretching technique considerably improves the performance of the Harris operator under illumination variations. In this research, the same contrast stretching function is incorporated into two different scale invariant operators to make them illumination invariant. The performance of the algorithms is compared with the Harris-Laplace and Hessian-Laplace algorithms [15].

Louisiana State University

Astaroth: Ohjelmistokirjasto stensiililaskentaan grafiikkasuorittimilla (Astaroth: A software library for stencil computations on graphics processing units)

Author: Pekkilä Johannes
Publication date: 17/06/2019

Graphics processing units (GPUs) are coprocessors that offer higher throughput and better power efficiency than central processing units in data-parallel tasks. For this reason, graphics processors provide a good platform for high-performance computing. However, programming GPUs such that all the available performance is utilized requires in-depth knowledge of the architecture of the hardware. Additionally, the problem of high-order stencil computations on GPUs in challenging multiphysics applications has not been adequately explored in previous work. In this thesis, we address these issues by presenting a library, an efficient algorithm and a domain-specific language for solving stencil computations within a structured grid. We tested our implementation by simulating magnetohydrodynamics, which involved the computation of first, second, and cross partial derivatives using second-, fourth-, sixth-, and eighth-order finite differences with single and double precision. The running time of our integration kernel was 2.8–9.1 times slower than the theoretical minimum time, which is the time it would take to read the computational domain and write it back to device memory exactly once, without taking into account the effects of finite caches or arithmetic operations on performance. Additionally, we made a performance comparison with a CPU solver widely used for scientific computations, which we benchmarked on a total of 24 cores of two Intel Xeon E5-2690 v3 processors. Our solver, benchmarked on a Tesla P100 PCIe GPU, outperformed the CPU solver by factors of 6.7 and 10.4 when using single and double precision, respectively.
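
The "theoretical minimum time" used as a baseline in this abstract is the time needed to read the computational domain once and write it back once. A minimal sketch of that bound follows, assuming it is simply the bytes moved divided by the sustained memory bandwidth. The function names and all numeric inputs are hypothetical placeholders, not values from the thesis.

```python
def min_transfer_time(n_cells, n_fields, bytes_per_value, bandwidth_bytes_per_s):
    """Lower bound on kernel time: one read plus one write of the whole domain.

    Ignores caches and arithmetic, as the abstract's definition does.
    """
    bytes_moved = 2 * n_cells * n_fields * bytes_per_value  # read once + write once
    return bytes_moved / bandwidth_bytes_per_s


def slowdown(measured_s, ideal_s):
    """Ratio of measured running time to the bandwidth-bound ideal."""
    return measured_s / ideal_s


if __name__ == "__main__":
    # Hypothetical inputs: 256^3 grid, 8 fields, double precision, 500 GB/s.
    ideal = min_transfer_time(256 ** 3, 8, 8, 500e9)
    measured = 0.02  # hypothetical measured kernel time in seconds
    print(f"ideal {ideal:.6f} s, slowdown {slowdown(measured, ideal):.2f}x")
```

A slowdown of 1.0 would mean the kernel is limited purely by memory traffic. The 2.8–9.1 range reported in the abstract is a ratio of this kind.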

Aaltodoc Publication Archive