195 research outputs found
Recommended from our members
A SIMD architecture for hard real-time systems
Emerging safety-critical systems require high-performance data-parallel architectures and, problematically, ones that can guarantee tight and safe worst-case execution times. Given the complexity of existing architectures like GPUs, it is unlikely that sufficiently accurate models and algorithms for timing analysis will emerge in the foreseeable future. This motivates a clean-slate approach to designing a real-time data-parallel architecture.
In this work I present Sim-D: a wide-SIMD architecture for hard real-time systems. Similar to GPUs, Sim-D performs hardware strip-mining to schedule the work for a compute kernel in entities called work-groups. Sim-D schedules the work for each work-group as a sequence of uninterruptible access- and execute program phases, interleaving the phases of two work-groups. By providing performance isolation between the memory- and compute resources, the execution time of each phase can be tightly bound through static analysis.
I present a predictable closed-page DRAM controller that processes requests for large 1D- and 2D blocks of data, as well as indirect indexed transfers. These large transfers coalesce the data requests of a whole work-group. For a linear 4KiB transfer over a 64-bit data bus, the utilisation provably exceeds 78% for DDR4-3200AA DRAM. For 2D blocks, a well-chosen tiling configuration can achieve near-similar efficiency. I show that bounds on the execution time of indexed transfers are pessimistic by nature, but propose a novel snoopy indexed transfer mechanism that permits more reasonable bounds when the buffer size is limited.
Finally, I present a worst-case execution time calculation algorithm for Sim-D. This algorithm is paired with two hardware work-group scheduling policies that deterministically reduce run-time variance. The worst-case execution time analysis algorithm combines static control flow analysis with a simulation-based cost model for execution and DRAM transfers. Its key novelty is the addition of a stage that considers work-group scheduling effects. I show that the work-group scheduling policies degrade performance on average by 8.9%, but permit the calculation of worst-case execution time bounds that are tight within 14.3% on average for benchmarks that avoid inefficient indexed transfers
Utilising path-vertex data to improve Monte Carlo global illumination.
Efficient techniques for photo-realistic rendering are in high demand across a wide array of industries. Notable applications include visual effects for film, entertainment and virtual reality. Less direct applications such as visualisation for architecture, lighting design and product development also rely on the synthesis of realistic and physically based illumination. Such applications assert ever increasing demands on light transport algorithms, requiring the computation of photo-realistic effects while handling complex geometry, light scattering models and illumination. Techniques based on Monte Carlo integration handle such scenarios elegantly and robustly, but despite seeing decades of focused research and wide commercial support, these methods and their derivatives still exhibit undesirable side effects that are yet to be resolved. In this thesis, Monte Carlo path tracing techniques are improved upon by utilizing path vertex data and intermediate radiance contributions readily available during rendering. This permits the development of novel progressive algorithms that render low noise global illumination while striving to maintain the desirable accuracy and convergence properties of unbiased methods. The thesis starts by presenting a discussion into optical phenomenon, physically based rendering and achieving photo realistic image synthesis. This is followed by in-depth discussion of the published theoretical and practical research in this field, with a focus on stochastic methods and modem rendering methodologies. This provides insight into the issues surrounding Monte Carlo integration both in the general and rendering specific contexts, along with an appreciation for the complexities of solving global light transport. Alternative methods that aim to address these issues are discussed, providing an insight into modem rendering paradigms and their characteristics. Thus, an understanding of the key aspects is obtained, that is necessary to build up and discuss the novel research and contributions to the field developed throughout this thesis. First, a path space filtering strategy is proposed that allows the path-based space of light transport to be classified into distinct subsets. This permits the novel combination of robust path tracing and recent progressive photon mapping algorithms to handle each subset based on the characteristics of the light transport in that space. This produces a hybrid progressive rendering technique that utilises the strengths of existing state of the art Monte Carlo and photon mapping methods to provide efficient and consistent rendering of complex scenes with vanishing bias. The second original contribution is a probabilistic image-based filtering and sample clustering framework that provides high quality previews of global illumination whilst remaining aware of high frequency detail and features in geometry, materials and the incident illumination. As will be seen, the challenges of edge-aware noise reduction are numerous and long standing, particularly when identifying high frequency features in noisy illumination signals. Discontinuities such as hard shadows and glossy reflections are commonly overlooked by progressive filtering techniques, however by dividing path space into multiple layers, once again based on utilising path vertex data, the overlapping illumination of varying intensities, colours and frequencies is more effectively handled. Thus noise is removed from each layer independent of features present in the remaining path space, effectively preserving such features
Advances in quantitative microscopy
Microscopy allows us to peer into the complex deeply shrouded world that the cells of our body grow and thrive in. With the emergence of automated digital microscopes and software for anlysing and processing the large numbers of image that they produce; quantitative microscopy approaches are now allowing us to answer ever larger and more complex biological questions. In this thesis I explore two trends. Firstly, that of using quantitative microscopy for performing unbiased screens, the advances made here include developing strategies to handle imaging data captured from physiological models, and unsupervised analysis screening data to derive unbiased biological insights. Secondly, I develop software for analysing live cell imaging data, that can now be captured at greater rates than ever before and use this to help answer key questions covering the biology of how cells make the decision to arrest or proliferate in response to DNA damage. Together this thesis represents a view of the current state of the art in high-throughput quantitative microscopy and details where the field is heading as machine learning approaches become ever more sophisticated.Open Acces
Parallel implementation of fractal image compression
Thesis (M.Sc.Eng.)-University of Natal, Durban, 2000.Fractal image compression exploits the piecewise self-similarity present in real images
as a form of information redundancy that can be eliminated to achieve compression. This
theory based on Partitioned Iterated Function Systems is presented. As an alternative to the
established JPEG, it provides a similar compression-ratio to fidelity trade-off. Fractal
techniques promise faster decoding and potentially higher fidelity, but the computationally
intensive compression process has prevented commercial acceptance.
This thesis presents an algorithm mapping the problem onto a parallel processor
architecture, with the goal of reducing the encoding time. The experimental work involved
implementation of this approach on the Texas Instruments TMS320C80 parallel processor
system. Results indicate that the fractal compression process is unusually well suited to
parallelism with speed gains approximately linearly related to the number of processors used.
Parallel processing issues such as coherency, management and interfacing are discussed. The
code designed incorporates pipelining and parallelism on all conceptual and practical levels
ensuring that all resources are fully utilised, achieving close to optimal efficiency.
The computational intensity was reduced by several means, including conventional
classification of image sub-blocks by content with comparisons across class boundaries
prohibited. A faster approach adopted was to perform estimate comparisons between blocks
based on pixel value variance, identifying candidates for more time-consuming, accurate
RMS inter-block comparisons. These techniques, combined with the parallelism, allow
compression of 512x512 pixel x 8 bit images in under 20 seconds, while maintaining a 30dB
PSNR. This is up to an order of magnitude faster than reported for conventional sequential
processor implementations. Fractal based compression of colour images and video sequences
is also considered.
The work confirms the potential of fractal compression techniques, and demonstrates
that a parallel implementation is appropriate for addressing the compression time problem.
The processor system used in these investigations is faster than currently available PC
platforms, but the relevance lies in the anticipation that future generations of affordable
processors will exceed its performance. The advantages of fractal image compression may
then be accessible to the average computer user, leading to commercial acceptance
Improving the performance of dataflow systems for deep neural network training
Deep neural networks (DNNs) have led to significant advancements in machine learning.
With deep structure and flexible model parameterisation, they exhibit state-of-the-art accuracies for many complex tasks e.g. image recognition. To achieve this, models are trained iteratively over large datasets. This process involves expensive matrix operations, making it time-consuming to obtain converged models. To accelerate training, dataflow systems parallelise computation. A scalable approach is to use parameter server framework: it has workers that train model replicas in parallel and parameter servers that synchronise the replicas to ensure the convergence.
With distributed DNN systems, there are three challenges that determine the training completion time. In this thesis, we propose practical and effective techniques to address each of these challenges.
Since frequent model synchronisation results in high network utilisation, the parameter server approach can suffer from network bottlenecks, thus requiring decisions on resource allocation. Our idea is to use all available network bandwidth and synchronise subject to the available bandwidth. We present Ako, a DNN system that uses partial gradient exchange for synchronising replicas in a peer-to-peer fashion. We show that our technique exhibits a 25% lower convergence time than a hand-tuned parameter-server deployments.
For a long training, the compute efficiency of worker nodes is important. We argue that processing hardware should be fully utilised for the best speed-up. The key observation is it is possible to overlap the execution of several matrix operations with other workloads. We describe Crossbow, a GPU-based system that maximises hardware utilisation. By using a multi-streaming scheduler, multiple models are trained in parallel on GPU and achieve a 2.3x speed-up compared to a state-of-the-art system.
The choice of model configuration for replicas also directly determines convergence quality. Dataflow systems are used for exploring the promising configurations but provide little support for efficient exploratory workflows. We present Meta-dataflow (MDF), a dataflow model that expresses complex workflows. By taking into account all configurations as a unified workflow, MDFs efficiently reduce time spent on configuration exploration.Open Acces
Enhancing remanufacturing automation using deep learning approach
In recent years, remanufacturing has significant interest from researchers and practitioners to improve efficiency through maximum value recovery of products at end-of-life (EoL). It is a process of returning used products, known as EoL products, to as-new condition with matching or higher warranty than the new products. However, these remanufacturing processes are complex and time-consuming to implement manually, causing reduced productivity and posing dangers to personnel. These challenges require automating the various remanufacturing process stages to achieve higher throughput, reduced lead time, cost and environmental impact while maximising economic gains. Besides, as highlighted by various research groups, there is currently a shortage of adequate remanufacturing-specific technologies to achieve full automation. -- This research explores automating remanufacturing processes to improve competitiveness by analysing and developing deep learning-based models for automating different stages of the remanufacturing processes. Analysing deep learning algorithms represents a viable option to investigate and develop technologies with capabilities to overcome the outlined challenges. Deep learning involves using artificial neural networks to learn high-level abstractions in data. Deep learning (DL) models are inspired by human brains and have produced state-of-the-art results in pattern recognition, object detection and other applications. The research further investigates the empirical data of torque converter components recorded from a remanufacturing facility in Glasgow, UK, using the in-case and cross-case analysis to evaluate the remanufacturing inspection, sorting, and process control applications. -- Nevertheless, the developed algorithm helped capture, pre-process, train, deploy and evaluate the performance of the respective processes. The experimental evaluation of the in-case and cross-case analysis using model prediction accuracy, misclassification rate, and model loss highlights that the developed models achieved a high prediction accuracy of above 99.9% across the sorting, inspection and process control applications. Furthermore, a low model loss between 3x10-3 and 1.3x10-5 was obtained alongside a misclassification rate that lies between 0.01% to 0.08% across the three applications investigated, thereby highlighting the capability of the developed deep learning algorithms to perform the sorting, process control and inspection in remanufacturing. The results demonstrate the viability of adopting deep learning-based algorithms in automating remanufacturing processes, achieving safer and more efficient remanufacturing. -- Finally, this research is unique because it is the first to investigate using deep learning and qualitative torque-converter image data for modelling remanufacturing sorting, inspection and process control applications. It also delivers a custom computational model that has the potential to enhance remanufacturing automation when utilised. The findings and publications also benefit both academics and industrial practitioners. Furthermore, the model is easily adaptable to other remanufacturing applications with minor modifications to enhance process efficiency in today's workplaces.In recent years, remanufacturing has significant interest from researchers and practitioners to improve efficiency through maximum value recovery of products at end-of-life (EoL). It is a process of returning used products, known as EoL products, to as-new condition with matching or higher warranty than the new products. However, these remanufacturing processes are complex and time-consuming to implement manually, causing reduced productivity and posing dangers to personnel. These challenges require automating the various remanufacturing process stages to achieve higher throughput, reduced lead time, cost and environmental impact while maximising economic gains. Besides, as highlighted by various research groups, there is currently a shortage of adequate remanufacturing-specific technologies to achieve full automation. -- This research explores automating remanufacturing processes to improve competitiveness by analysing and developing deep learning-based models for automating different stages of the remanufacturing processes. Analysing deep learning algorithms represents a viable option to investigate and develop technologies with capabilities to overcome the outlined challenges. Deep learning involves using artificial neural networks to learn high-level abstractions in data. Deep learning (DL) models are inspired by human brains and have produced state-of-the-art results in pattern recognition, object detection and other applications. The research further investigates the empirical data of torque converter components recorded from a remanufacturing facility in Glasgow, UK, using the in-case and cross-case analysis to evaluate the remanufacturing inspection, sorting, and process control applications. -- Nevertheless, the developed algorithm helped capture, pre-process, train, deploy and evaluate the performance of the respective processes. The experimental evaluation of the in-case and cross-case analysis using model prediction accuracy, misclassification rate, and model loss highlights that the developed models achieved a high prediction accuracy of above 99.9% across the sorting, inspection and process control applications. Furthermore, a low model loss between 3x10-3 and 1.3x10-5 was obtained alongside a misclassification rate that lies between 0.01% to 0.08% across the three applications investigated, thereby highlighting the capability of the developed deep learning algorithms to perform the sorting, process control and inspection in remanufacturing. The results demonstrate the viability of adopting deep learning-based algorithms in automating remanufacturing processes, achieving safer and more efficient remanufacturing. -- Finally, this research is unique because it is the first to investigate using deep learning and qualitative torque-converter image data for modelling remanufacturing sorting, inspection and process control applications. It also delivers a custom computational model that has the potential to enhance remanufacturing automation when utilised. The findings and publications also benefit both academics and industrial practitioners. Furthermore, the model is easily adaptable to other remanufacturing applications with minor modifications to enhance process efficiency in today's workplaces
Parallel implementation of fractal image compression
Thesis (M.Sc.Eng.)-University of Natal, Durban, 2000.Fractal image compression exploits the piecewise self-similarity present in real images
as a form of information redundancy that can be eliminated to achieve compression. This
theory based on Partitioned Iterated Function Systems is presented. As an alternative to the
established JPEG, it provides a similar compression-ratio to fidelity trade-off. Fractal
techniques promise faster decoding and potentially higher fidelity, but the computationally
intensive compression process has prevented commercial acceptance.
This thesis presents an algorithm mapping the problem onto a parallel processor
architecture, with the goal of reducing the encoding time. The experimental work involved
implementation of this approach on the Texas Instruments TMS320C80 parallel processor
system. Results indicate that the fractal compression process is unusually well suited to
parallelism with speed gains approximately linearly related to the number of processors used.
Parallel processing issues such as coherency, management and interfacing are discussed. The
code designed incorporates pipelining and parallelism on all conceptual and practical levels
ensuring that all resources are fully utilised, achieving close to optimal efficiency.
The computational intensity was reduced by several means, including conventional
classification of image sub-blocks by content with comparisons across class boundaries
prohibited. A faster approach adopted was to perform estimate comparisons between blocks
based on pixel value variance, identifying candidates for more time-consuming, accurate
RMS inter-block comparisons. These techniques, combined with the parallelism, allow
compression of 512x512 pixel x 8 bit images in under 20 seconds, while maintaining a 30dB
PSNR. This is up to an order of magnitude faster than reported for conventional sequential
processor implementations. Fractal based compression of colour images and video sequences
is also considered.
The work confirms the potential of fractal compression techniques, and demonstrates
that a parallel implementation is appropriate for addressing the compression time problem.
The processor system used in these investigations is faster than currently available PC
platforms, but the relevance lies in the anticipation that future generations of affordable
processors will exceed its performance. The advantages of fractal image compression may
then be accessible to the average computer user, leading to commercial acceptance
A multi-level functional IR with rewrites for higher-level synthesis of accelerators
Specialised accelerators deliver orders of magnitude higher energy-efficiency than
general-purpose processors. Field Programmable Gate Arrays (FPGAs) have become
the substrate of choice, because the ever-changing nature of modern workloads, such
as machine learning, demands reconfigurability. However, they are notoriously hard
to program directly using Hardware Description Languages (HDLs). Traditional High-Level Synthesis (HLS) tools improve productivity, but come with their own problems.
They often produce sub-optimal designs and programmers are still required to write
hardware-specific code, thus development cycles remain long.
This thesis proposes Shir, a higher-level synthesis approach for high-performance
accelerator design with a hardware-agnostic programming entry point, a multi-level
Intermediate Representation (IR), a compiler and rewrite rules for optimisation.
First, a novel, multi-level functional IR structure for accelerator design is described.
The IRs operate on different levels of abstraction, cleanly separating different hardware
concerns. They enable the expression of different forms of parallelism and standard
memory features, such as asynchronous off-chip memories or synchronous on-chip
buffers, as well as arbitration of such shared resources. Exposing these features at the
IR level is essential for achieving high performance.
Next, mechanical lowering procedures are introduced to automatically compile
a program specification through Shir’s functional IRs until low-level HDL code for
FPGA synthesis is emitted. Each lowering step gradually adds implementation details.
Finally, this thesis presents rewrite rules for automatic optimisations around parallelisation, buffering and data reshaping. Reshaping operations pose a challenge to
functional approaches in particular. They introduce overheads that compromise performance or even prevent the generation of synthesisable hardware designs altogether.
This fundamental issue is solved by the application of rewrite rules.
The viability of this approach is demonstrated by running matrix multiplication
and 2D convolution on an Intel Arria 10 FPGA. A limited design space exploration is
conducted, confirming the ability of the IR to exploit various hardware features. Using
rewrite rules for optimisation, it is possible to generate high-performance designs
that are competitive with highly tuned OpenCL implementations and that outperform
hardware-agnostic OpenCL code. The performance impact of the optimisations is
further evaluated showing that they are essential to achieving high performance, and
in many cases also necessary to produce hardware that fits the resource constraints
A kinematic numerical camera model for the SPOT-1 sensor
A novel method for modelling linear push-broom sensors has been developed. A numerical model which incorporates the satellite attitude and position data is used to compute the absolute orientation. This method makes a break with traditional photogrammetric practice, in that instead of using an approach based on collinearity equations, the absolute orientation is computed iteratively using a numerical multi-variable minimisation scheme. All current implementations of the model use the Powell direction-set method, but in principle, any multivariable minimisation scheme could be substituted. The numerical method has significant advantages over the collinearity approach. The number of ground control points needed to form an accurate model is reduced and the numerical approach offers a superior basis for the development of general purpose multi sensor modelling software. In order to test these assertions, a numerical model of the SPOT-1 sensor was coded and tested against a pre-existing collinearity based model. Exhaustive tests showed the numerical model, using 3 or fewer ground control points, consistendy equaled or bettered the performance of the earlier model, using between 6 and 15 ground control points, on the same test data. A general purpose sensor modelling system was developed using the code developed for the initial SPOT-1 model. Currently this system supports many rigid linear sensors systems including SPOT-1, SPOT-2, FTIR, MISR, MEOSS and ASAS. Further extensions to the system to enable it to model non-rigid linear sensors such as AVHRR and ATM are planned. Work to enable the system to perform relative orientations for a variety of sensor types is also ongoing
- …