174 research outputs found
When Deep Learning Meets Polyhedral Theory: A Survey
In the past decade, deep learning became the prevalent methodology for
predictive modeling thanks to the remarkable accuracy of deep neural networks
in tasks such as computer vision and natural language processing. Meanwhile,
the structure of neural networks converged back to simpler representations
based on piecewise constant and piecewise linear functions such as the
Rectified Linear Unit (ReLU), which became the most commonly used type of
activation function in neural networks. That made certain types of network
structure \unicode{x2014}such as the typical fully-connected feedforward
neural network\unicode{x2014} amenable to analysis through polyhedral theory
and to the application of methodologies such as Linear Programming (LP) and
Mixed-Integer Linear Programming (MILP) for a variety of purposes. In this
paper, we survey the main topics emerging from this fast-paced area of work,
which bring a fresh perspective to understanding neural networks in more detail
as well as to applying linear optimization techniques to train, verify, and
reduce the size of such networks
Tools for efficient Deep Learning
In the era of Deep Learning (DL), there is a fast-growing demand for building and deploying Deep Neural Networks (DNNs) on various platforms. This thesis proposes five tools to address the challenges for designing DNNs that are efficient in time, in resources and in power consumption.
We first present Aegis and SPGC to address the challenges in improving the memory efficiency of DL training and inference. Aegis makes mixed precision training (MPT) stabler by layer-wise gradient scaling. Empirical experiments show that Aegis can improve MPT accuracy by at most 4\%. SPGC focuses on structured pruning: replacing standard convolution with group convolution (GConv) to avoid irregular sparsity. SPGC formulates GConv pruning as a channel permutation problem and proposes a novel heuristic polynomial-time algorithm. Common DNNs pruned by SPGC have maximally 1\% higher accuracy than prior work.
This thesis also addresses the challenges lying in the gap between DNN descriptions and executables by Polygeist for software and POLSCA for hardware. Many novel techniques, e.g. statement splitting and memory partitioning, are explored and used to expand polyhedral optimisation. Polygeist can speed up software execution in sequential and parallel by 2.53 and 9.47 times on Polybench/C. POLSCA achieves 1.5 times speedup over hardware designs directly generated from high-level synthesis on Polybench/C.
Moreover, this thesis presents Deacon, a framework that generates FPGA-based DNN accelerators of streaming architectures with advanced pipelining techniques to address the challenges from heterogeneous convolution and residual connections. Deacon provides fine-grained pipelining, graph-level optimisation, and heuristic exploration by graph colouring. Compared with prior designs, Deacon shows resource/power consumption efficiency improvement of 1.2x/3.5x for MobileNets and 1.0x/2.8x for SqueezeNets.
All these tools are open source, some of which have already gained public engagement. We believe they can make efficient deep learning applications easier to build and deploy.Open Acces
LIPIcs, Volume 261, ICALP 2023, Complete Volume
LIPIcs, Volume 261, ICALP 2023, Complete Volum
Analysing and Reducing Costs of Deep Learning Compiler Auto-tuning
Deep Learning (DL) is significantly impacting many industries, including automotive, retail and medicine, enabling autonomous driving, recommender systems and genomics modelling, amongst other applications. At the same time, demand for complex and fast DL models is continually growing. The most capable models tend to exhibit highest operational costs, primarily due to their large computational resource footprint and inefficient utilisation of computational resources employed by DL systems. In an attempt to tackle these problems, DL compilers and auto-tuners emerged, automating the traditionally manual task of DL model performance optimisation. While auto-tuning improves model inference speed, it is a costly process, which limits its wider adoption within DL deployment pipelines. The high operational costs associated with DL auto-tuning have multiple causes. During operation, DL auto-tuners explore large search spaces consisting of billions of tensor programs, to propose potential candidates that improve DL model inference latency. Subsequently, DL auto-tuners measure candidate performance in isolation on the target-device, which constitutes the majority of auto-tuning compute-time. Suboptimal candidate proposals, combined with their serial measurement in an isolated target-device lead to prolonged optimisation time and reduced resource availability, ultimately reducing cost-efficiency of the process. In this thesis, we investigate the reasons behind prolonged DL auto-tuning and quantify their impact on the optimisation costs, revealing directions for improved DL auto-tuner design. Based on these insights, we propose two complementary systems: Trimmer and DOPpler. Trimmer improves tensor program search efficacy by filtering out poorly performing candidates, and controls end-to-end auto-tuning using cost objectives, monitoring optimisation cost. Simultaneously, DOPpler breaks long-held assumptions about the serial candidate measurements by successfully parallelising them intra-device, with minimal penalty to optimisation quality. Through extensive experimental evaluation of both systems, we demonstrate that they significantly improve cost-efficiency of autotuning (up to 50.5%) across a plethora of tensor operators, DL models, auto-tuners and target-devices
Structured parallelism discovery with hybrid static-dynamic analysis and evaluation technique
Parallel computer architectures have dominated the computing landscape for the
past two decades; a trend that is only expected to continue and intensify, with increasing specialization and heterogeneity. This creates huge pressure across the software
stack to produce programming languages, libraries, frameworks and tools which will
efficiently exploit the capabilities of parallel computers, not only for new software, but
also revitalizing existing sequential code. Automatic parallelization, despite decades of
research, has had limited success in transforming sequential software to take advantage
of efficient parallel execution. This thesis investigates three approaches that use commutativity analysis as the enabler for parallelization. This has the potential to overcome
limitations of traditional techniques.
We introduce the concept of liveness-based commutativity for sequential loops.
We examine the use of a practical analysis utilizing liveness-based commutativity in a
symbolic execution framework. Symbolic execution represents input values as groups
of constraints, consequently deriving the output as a function of the input and enabling
the identification of further program properties. We employ this feature to develop an
analysis and discern commutativity properties between loop iterations. We study the
application of this approach on loops taken from real-world programs in the OLDEN
and NAS Parallel Benchmark (NPB) suites, and identify its limitations and related
overheads.
Informed by these findings, we develop Dynamic Commutativity Analysis (DCA), a
new technique that leverages profiling information from program execution with specific
input sets. Using profiling information, we track liveness information and detect loop
commutativity by examining the code’s live-out values. We evaluate DCA against almost
1400 loops of the NPB suite, discovering 86% of them as parallelizable. Comparing
our results against dependence-based methods, we match the detection efficacy of two
dynamic and outperform three static approaches, respectively. Additionally, DCA is
able to automatically detect parallelism in loops which iterate over Pointer-Linked
Data Structures (PLDSs), taken from wide range of benchmarks used in the literature,
where all other techniques we considered failed. Parallelizing the discovered loops, our
methodology achieves an average speedup of 3.6× across NPB (and up to 55×) and up
to 36.9× for the PLDS-based loops on a 72-core host. We also demonstrate that our
methodology, despite relying on specific input values for profiling each program, is able
to correctly identify parallelism that is valid for all potential input sets.
Lastly, we develop a methodology to utilize liveness-based commutativity, as implemented in DCA, to detect latent loop parallelism in the shape of patterns. Our approach
applies a series of transformations which subsequently enable multiple applications
of DCA over the generated multi-loop code section and match its loop commutativity
outcomes against the expected criteria for each pattern. Applying our methodology on
sets of sequential loops, we are able to identify well-known parallel patterns (i.e., maps,
reduction and scans). This extends the scope of parallelism detection to loops, such
as those performing scan operations, which cannot be determined as parallelizable by
simply evaluating liveness-based commutativity conditions on their original form
LIPIcs, Volume 274, ESA 2023, Complete Volume
LIPIcs, Volume 274, ESA 2023, Complete Volum
Fundamentals
Volume 1 establishes the foundations of this new field. It goes through all the steps from data collection, their summary and clustering, to different aspects of resource-aware learning, i.e., hardware, memory, energy, and communication awareness. Machine learning methods are inspected with respect to resource requirements and how to enhance scalability on diverse computing architectures ranging from embedded systems to large computing clusters
2021-2022, University of Memphis bulletin
University of Memphis bulletin containing the graduate catalog for 2021-2022.https://digitalcommons.memphis.edu/speccoll-ua-pub-bulletins/1441/thumbnail.jp
Compact Lens Technologies: Curved Image Sensor and Volumetric Imaging Efficiency
Compact image systems bring up people\u27s attention in the field of target recognition, surveillance, situation awareness or even photography. Conventional metrics assess image system based on image quality without considering systems\u27 volume. More comprehensive metrics, such as General Image-Quality Equation and the Targeting Task Performance metric, incorporates all image system components from object, lenses to detector and even imaging processing algorithm. All these key factors prohibit these metrics from being applied to image system in a convenient manner. Here, we propose a simple metric, volumetric imaging efficiency, considering both image quality and volume. Only concentrate on optical lenses enables the metric being implemented onto conventional bulk optics and flat optics efficiently. Curved image sensor with monocentric lenses shows an exceptional performance based on our metric but potentially challenging in fabrication due to conventional flat substrate process. Normally, this can be done with inorganic photodetector array and perform bending as the last step or organic photodetector being directly deposited on a curved plastic substrate. Inorganic method utilizes state-of-the-art CMOS technology but the interconnects suffer great strain and stress after bending and ultimately runs the risk of device failure while organic device ensures minimum strain, but the fabrication is not compatible with CMOS technology, thus a pattern transfer method is involved for contacts deposition. Here, we introduce both techniques and addressing their challenges. For inorganic device, several interconnect deposition methods are developed and both 1D and 2D bending test are performed to test their stretchability. For organic device, without CMOS circuity, we developed a new type of photodetector, frustrated organic photodetector (F-OPD), which enables single pixel selection by biasing device in different directions. A total of 45 devices are fabricated and perform as an input of 30X30 detectors array. A variety of noise sources are discussed and applied to each pixel. The image is then restored by color leveling and 2-points or 3-points non-uniformity correction (NUC). The results are compared with original figure as a proof of concept showing the capability of the device being extended to an array
- …