Advancing synthesis of decision tree-based multiple classifier systems: an approximate computing case study
So far, multiple classifier systems have increasingly been designed to take advantage of hardware features such as high parallelism and computational power. Indeed, compared to software implementations, hardware accelerators guarantee higher throughput and lower latency. Although combining multiple classifiers leads to high classification accuracy, the required area overhead can make the design of a hardware accelerator unfeasible, hindering the adoption of commercial configurable devices. For this reason, in this paper we exploit the approximate computing design paradigm to trade hardware area overhead off against classification accuracy. In particular, starting from trained decision tree (DT) models and employing the precision-scaling technique, we explore approximate decision tree variants by means of a multi-objective optimization problem, demonstrating a significant performance improvement when targeting field-programmable gate array (FPGA) devices.
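The core trade-off the abstract describes can be sketched in a few lines: quantizing a decision tree's comparison thresholds to fewer bits shrinks the hardware comparators at the cost of some classification accuracy. This is a minimal illustration of the precision-scaling idea only; the toy tree, the `quantize` helper, and all names below are hypothetical, not taken from the paper.

```python
def quantize(value, bits, lo=0.0, hi=1.0):
    """Snap a value in [lo, hi] onto a 'bits'-bit grid (precision scaling)."""
    levels = (1 << bits) - 1
    step = (hi - lo) / levels
    return lo + round((value - lo) / step) * step

# A tiny decision tree: (feature_index, threshold, left, right); leaves are labels.
tree = (0, 0.62, (1, 0.31, "A", "B"), "C")

def classify(tree, sample, bits=None):
    if not isinstance(tree, tuple):
        return tree  # reached a leaf label
    feat, thr, left, right = tree
    if bits is not None:
        thr = quantize(thr, bits)  # precision-scaled comparator
    branch = left if sample[feat] <= thr else right
    return classify(branch, sample, bits)

sample = [0.64, 0.5]
print(classify(tree, sample))           # full precision -> "C"
print(classify(tree, sample, bits=2))   # 2-bit thresholds -> "B" (accuracy lost)
```

A multi-objective search of the kind the paper describes would then sweep `bits` per node, scoring each variant on accuracy versus estimated area.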
Implementing and evaluating candidate-based invariant generation
The discovery of inductive invariants lies at the heart of static program verification. Presently, many automatic solutions to inductive invariant generation are inflexible, only applicable to certain classes of programs, or unpredictable. An automatic technique that circumvents these deficiencies to some extent is candidate-based invariant generation, whereby a large number of candidate invariants are guessed and then proven to be inductive or rejected using a sound program analyser. This paper describes our efforts to apply candidate-based invariant generation in GPUVerify, a static checker of programs that run on GPUs. We study a set of 383 GPU programs that contain loops, drawn from a number of open source suites and vendor SDKs. Among this set, 253 benchmarks require provision of loop invariants for verification to succeed. We describe the methodology we used to incrementally improve the invariant generation capabilities of GPUVerify to handle these benchmarks through candidate-based invariant generation, whereby potential program invariants are speculated using cheap static analysis and subsequently either refuted or proven. We also describe a set of experiments that we used to examine the effectiveness of our rules for candidate generation, assessing rules based on their generality (the extent to which they generate candidate invariants), hit rate (the extent to which the generated candidates hold), effectiveness (the extent to which provable candidates actually help in allowing verification to succeed), and influence (the extent to which the success of one generation rule depends on candidates generated by another rule). We believe that our methodology for devising and evaluating candidate-generation rules may serve as a useful framework for other researchers interested in candidate-based invariant generation. The candidates produced by GPUVerify help to verify 231 of these 253 programs.
An increase in precision, however, has created sluggishness in GPUVerify because more candidates are generated and hence more time is spent on computing those which are inductive invariants. To speed up this process, we have investigated four under-approximating program analyses that aim to reject false candidates quickly and a framework whereby these analyses can run in sequence or in parallel. Across two platforms, running Windows and Linux, our results show that the best combination of these techniques running sequentially speeds up invariant generation across our benchmarks by 1.17× (Windows) and 1.01× (Linux), with per-benchmark best speedups of 93.58× (Windows) and 48.34× (Linux), and worst slowdowns of 10.24× (Windows) and 43.31× (Linux). We find that parallelising the strategies marginally improves overall invariant generation speedups to 1.27× (Windows) and 1.11× (Linux), maintains good best-case speedups of 91.18× (Windows) and 44.60× (Linux), and, importantly, dramatically reduces worst-case slowdowns to 3.15× (Windows) and 3.17× (Linux).
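The guess-then-refute loop at the heart of candidate-based invariant generation can be sketched as follows: speculate a pool of candidate loop invariants, then use a cheap under-approximating analysis (here: concretely executing a few loop iterations, one of the simplest under-approximations) to discard candidates that are visibly violated, leaving only the survivors for the expensive inductiveness check. The loop and the candidate set are illustrative, not drawn from GPUVerify.

```python
# Loop under analysis:  i = 0; s = 0; while i < n: s += i; i += 1
def loop_states(n, max_iters=8):
    """Yield the concrete (i, s) states seen while running a few iterations."""
    i = s = 0
    yield i, s
    while i < n and i < max_iters:
        s += i
        i += 1
        yield i, s

candidates = {
    "i >= 0":          lambda i, s: i >= 0,
    "s >= 0":          lambda i, s: s >= 0,
    "s == i":          lambda i, s: s == i,              # false: refuted quickly
    "2*s == i*(i-1)":  lambda i, s: 2 * s == i * (i - 1),
}

def refute(candidates, n=5):
    """Drop any candidate witnessed to fail in some reachable state."""
    surviving = dict(candidates)
    for i, s in loop_states(n):
        for name in list(surviving):
            if not surviving[name](i, s):
                del surviving[name]
    return set(surviving)

print(refute(candidates))  # "s == i" is gone; the rest go to the prover
```

The paper's four under-approximating analyses play the role of `refute` here: they cannot prove anything, but every candidate they reject is one fewer for the sound analyser to consider.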
Low-Impact Profiling of Streaming, Heterogeneous Applications
Computer engineers are continually faced with the task of translating improvements in fabrication process technology (i.e., Moore's Law) into architectures that allow computer scientists to accelerate application performance. As feature size continues to shrink, architects of commodity processors are designing increasingly more cores on a chip. While additional cores can operate independently for some tasks (e.g., the OS and user tasks), many applications see little to no improvement from adding more processor cores alone. For many applications, heterogeneous systems offer a path toward higher performance. Significant performance and power gains have been realized by combining specialized processors (e.g., field-programmable gate arrays, graphics processing units) with general-purpose multi-core processors. Heterogeneous applications need to be programmed differently than traditional software. One approach, stream processing, fits these systems particularly well because of their segmented memories and explicit expression of parallelism. Unfortunately, debugging and performance tools that support streaming, heterogeneous applications do not exist. This dissertation presents TimeTrial, a performance measurement system that enables performance optimization of streaming applications by profiling the application deployed on a heterogeneous system. TimeTrial performs low-impact measurements by dedicating computing resources to monitoring and by aggressively compressing performance traces into statistical summaries guided by user specification of the performance queries of interest.
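The trace-compression idea can be illustrated with a constant-size streaming summary: instead of recording every event, each observation is folded into running count/mean/variance/min/max statistics (here via Welford's online algorithm). This is a generic sketch of the principle; TimeTrial's actual summaries are driven by the user's performance queries, and the class and values below are hypothetical.

```python
class StreamSummary:
    """Constant-memory summary of a stream of measurements."""

    def __init__(self):
        self.count = 0
        self.mean = 0.0
        self.m2 = 0.0          # running sum of squared deviations (Welford)
        self.min = float("inf")
        self.max = float("-inf")

    def observe(self, x):
        self.count += 1
        delta = x - self.mean
        self.mean += delta / self.count
        self.m2 += delta * (x - self.mean)
        self.min = min(self.min, x)
        self.max = max(self.max, x)

    @property
    def variance(self):
        return self.m2 / self.count if self.count else 0.0

summary = StreamSummary()
for latency_us in [120, 80, 95, 300, 110]:   # e.g. per-item queue latencies
    summary.observe(latency_us)
print(summary.count, summary.mean, summary.min, summary.max)
```

The key property for low-impact profiling is that memory and reporting bandwidth stay constant no matter how many events the monitored stream produces.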
P3 problem and Magnolia language: Specializing array computations for emerging architectures
The problem of producing portable high-performance computing (HPC) software that is cheap to develop and maintain is called the P3 (performance, portability, productivity) problem. Good solutions to the P3 problem have been achieved when the performance profiles of the target machines have been similar. The variety of HPC architectures is, however, large and can be expected to grow larger. Software for HPC therefore needs to be highly adaptable, and there is a pressing need to provide developers with tools to produce software that can target machines with vastly different profiles. Multi-dimensional array manipulation constitutes a core component of numerous numerical methods, such as finite difference solvers of Partial Differential Equations (PDEs). The efficiency of these computations is tightly connected to traversing and distributing array data in a hardware-friendly way. The Mathematics of Arrays (MoA) allows for formally reasoning about array computations and enables systematic transformations of array-based programs, e.g., to use data layouts that fit a specific architecture. This paper presents a programming methodology aimed at tackling the P3 problem in well-explored domains using Magnolia, a language designed to embody generic programming. The Magnolia programmer can restrict the semantic properties of abstract generic types and operations by defining so-called axioms. Axioms can be used to produce tests for concrete implementations of specifications, for formal verification, or to perform semantics-preserving program transformations. We leverage Magnolia's semantic specification facilities to extend the Magnolia compiler with a term rewriting system. We implement MoA's transformation rules in Magnolia, and demonstrate through a case study on a finite difference solver of PDEs how our rewriting system allows exploring the space of possible optimizations.
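A term rewriting system of the kind the abstract describes treats array expressions as terms and applies semantics-preserving rules until no more apply. The sketch below shows the mechanism on one toy rule, `transpose(transpose(A)) -> A`; MoA's actual rules (and Magnolia's axiom-driven machinery) are far richer, and the encoding here is purely illustrative.

```python
def rewrite(term):
    """Bottom-up application of the rule transpose(transpose(X)) -> X."""
    if isinstance(term, tuple):
        # rewrite subterms first, then try the rule at this node
        term = tuple(rewrite(t) for t in term)
        if term[0] == "transpose" and isinstance(term[1], tuple) \
                and term[1][0] == "transpose":
            return term[1][1]
    return term

# transpose(transpose(map(f, A))) simplifies to map(f, A)
expr = ("transpose", ("transpose", ("map", "f", "A")))
print(rewrite(expr))  # -> ('map', 'f', 'A')
```

In the paper's setting, such rules are justified by Magnolia axioms, so every rewrite is known to preserve the program's semantics while changing its (layout- or traversal-related) cost.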
Memory models for heterogeneous systems
Heterogeneous systems, in which a CPU and an accelerator can execute together while sharing memory, are becoming popular in several computing sectors. Nowadays, programmers can split their computation into multiple specialised threads that can take advantage of each specialised component. FPGAs are popular accelerators with configurable logic for various tasks, and hardware manufacturers are developing platforms with tightly integrated multicore CPUs and FPGAs. In such tightly integrated platforms, the CPU threads and the FPGA threads access shared memory locations in a fine-grained manner. However, architectural optimisations can lead to instructions being observed out of order by different cores. Programmers must consider these reorderings to ensure correct program executions.
Memory models can aid in reasoning about these complex systems since they can be used to explore guarantees regarding the systems' behaviours. These models are helpful for low-level programmers, compiler writers, and designers of analysis tools. Memory models are specified according to two main paradigms: operational and axiomatic. An operational model is an abstract representation of the actual machine, described by states that represent idealised components such as buffers and queues, and the legal transitions between these states. Axiomatic models define relations between memory accesses to constrain the allowed and disallowed behaviours.
This dissertation makes the following main contributions: an operational model of a CPU/FPGA system, an axiomatic one and an exploration of simulation techniques for operational models. The operational model is implemented in C and validated using all the behaviours described in the available documentation. We will see how the ambiguities from the documentation can be clarified by running tests on the hardware and consulting with the designers. Finally, to demonstrate the model's utility, we reason about a producer/consumer buffer implemented across the CPU and the FPGA.
The simulation of axiomatic models can be orders of magnitude faster than operational models. For this reason, we also provide an axiomatic version of the memory model. This model allows us to generate small concurrent programs to reveal whether a specific memory model behaviour can occur. However, synthesising a single test for the FPGA requires significant time and prevents us from directly running many tests. To overcome this issue, we develop a soft-core processor that allows us to quickly run large numbers of such tests and gain higher confidence in the accuracy of our models.
The simulation of the operational model faces a path-explosion problem that limits the exploration of large models. Observing that program analysis tools tackle a similar path-explosion problem, we investigate the idea of reducing the decision problem of "whether a given memory model allows a given behaviour" to the decision problem of "whether a given C program is safe", which can be handled by a variety of off-the-shelf tools. Using this approach, we can simulate our model more deeply and gain more confidence in its accuracy.
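The style of simulation described above can be sketched on the classic store-buffering (SB) litmus test: exhaustively interleave two threads' events against a single shared memory and collect the reachable final register states. Under this sequentially consistent toy model the outcome `r0 == 0 and r1 == 0` is unreachable; an operational model with store buffers (as on real CPU/FPGA platforms) would admit it. The encoding is deliberately minimal and is not the dissertation's model.

```python
def interleavings(a, b):
    """All interleavings of two event lists, preserving per-thread order."""
    if not a:
        yield list(b); return
    if not b:
        yield list(a); return
    for rest in interleavings(a[1:], b):
        yield [a[0]] + rest
    for rest in interleavings(a, b[1:]):
        yield [b[0]] + rest

def run(schedule):
    """Execute one interleaving against a single shared memory (SC)."""
    mem = {"x": 0, "y": 0}
    regs = {}
    for op, loc, arg in schedule:
        if op == "st":
            mem[loc] = arg
        else:
            regs[arg] = mem[loc]
    return regs["r0"], regs["r1"]

# Store-buffering litmus test: each thread stores to one location,
# then loads the other thread's location.
t0 = [("st", "x", 1), ("ld", "y", "r0")]
t1 = [("st", "y", 1), ("ld", "x", "r1")]
outcomes = {run(s) for s in interleavings(t0, t1)}
print(sorted(outcomes))  # (0, 0) is absent under sequential consistency
```

Adding idealised components such as per-thread store buffers to `run` is exactly what turns this into an operational model of a relaxed architecture, at the price of the path explosion the dissertation addresses.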
Three-Dimensional Photoacoustic Computed Tomography: Imaging Models and Reconstruction Algorithms
Photoacoustic computed tomography (PACT), also known as optoacoustic tomography, is a rapidly emerging imaging modality that holds great promise for a wide range of biomedical imaging applications. Much effort has been devoted to the investigation of imaging physics and the optimization of experimental designs. Meanwhile, a variety of image reconstruction algorithms have been developed for the purpose of computed tomography. Most of these algorithms assume full knowledge of the acoustic pressure function on a measurement surface that either encloses the object or extends to infinity, which poses many difficulties for practical applications. To overcome these limitations, iterative image reconstruction algorithms have been actively investigated. However, little work has been conducted on imaging models that incorporate the characteristics of data acquisition systems. Moreover, when applied to experimental data, most studies simplify the inherent three-dimensional wave propagation into two-dimensional imaging models by introducing heuristic assumptions on the transducer responses and/or the object structures. One important reason is that three-dimensional image reconstruction is computationally burdensome. These inaccurate imaging models severely limit the performance of iterative image reconstruction algorithms in practice. In this dissertation, we propose a framework to construct imaging models that incorporate the characteristics of ultrasonic transducers. Based on these imaging models, we systematically investigate various iterative image reconstruction algorithms, including advanced algorithms that employ total variation (TV) norm regularization. In order to accelerate three-dimensional image reconstruction, we develop parallel implementations on graphics processing units. In addition, we derive a fast Fourier-transform-based analytical image reconstruction formula.
By using iterative image reconstruction algorithms based on the proposed imaging models, PACT imaging scanners can have a compact size while maintaining high spatial resolution. The research demonstrates, for the first time, the feasibility and advantages of iterative image reconstruction algorithms in three-dimensional PACT.
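The iterative-reconstruction idea underlying the dissertation can be sketched in its simplest form: model the scanner as a linear operator H mapping the image f to the measured pressure data g, then recover f by gradient descent on ||Hf - g||² (a Landweber iteration). The 2×2 system below is a toy stand-in for the transducer-aware models the work constructs, and all values are illustrative.

```python
def matvec(H, v):
    """Dense matrix-vector product."""
    return [sum(h * x for h, x in zip(row, v)) for row in H]

def transpose(H):
    return [list(col) for col in zip(*H)]

def landweber(H, g, steps=500, step_size=0.1):
    """Gradient descent on ||H f - g||^2 starting from f = 0."""
    f = [0.0] * len(H[0])
    Ht = transpose(H)
    for _ in range(steps):
        residual = [y - gi for y, gi in zip(matvec(H, f), g)]
        grad = matvec(Ht, residual)          # H^T (H f - g)
        f = [fi - step_size * gi for fi, gi in zip(f, grad)]
    return f

H = [[2.0, 1.0], [1.0, 3.0]]   # toy imaging operator
f_true = [1.0, 2.0]
g = matvec(H, f_true)          # simulated noiseless measurements
print([round(x, 3) for x in landweber(H, g)])
```

In the real three-dimensional setting H is enormous, which is why the dissertation resorts to GPU implementations, and a TV-regularised variant adds a penalty term to the objective being minimised.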
FlashVideo: A Framework for Swift Inference in Text-to-Video Generation
In the evolving field of machine learning, video generation has witnessed significant advancements with autoregressive transformer models and diffusion models, known for synthesizing dynamic and realistic scenes. However, these models often face challenges with prolonged inference times, even for generating short video clips such as GIFs. This paper introduces FlashVideo, a novel framework tailored for swift Text-to-Video generation. FlashVideo represents the first successful adaptation of the RetNet architecture for video generation, bringing a unique approach to the field. Leveraging the RetNet-based architecture, FlashVideo reduces the time complexity of inference from quadratic to linear in the sequence length, significantly accelerating inference speed. Additionally, we adopt a redundancy-free frame interpolation method, enhancing the efficiency of frame interpolation. Our comprehensive experiments demonstrate that FlashVideo achieves a substantial efficiency improvement over a traditional autoregressive transformer model, and that its inference speed is of the same order of magnitude as that of BERT-based transformer models.
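The source of RetNet's linear-time inference can be shown with a scalar toy: a transformer re-attends over all previous tokens at each step (quadratic overall), while retention folds the exponentially decayed history into a fixed-size state updated once per token (linear overall), and the two computations agree. Real retention uses matrix-valued states, multi-scale decays, and learned projections; everything below is a simplified illustration.

```python
decay = 0.9  # retention decay factor (toy scalar version)

def attention_style(keys_values, q):
    """Touches every past (k, v) pair: O(t) work at step t, O(L^2) overall."""
    t = len(keys_values) - 1
    return q * sum((decay ** (t - i)) * k * v
                   for i, (k, v) in enumerate(keys_values))

def retention_step(state, k, v):
    """O(1) work per step: the state already summarises the decayed past."""
    return decay * state + k * v

tokens = [(1.0, 2.0), (0.5, 4.0), (2.0, 1.0)]  # toy (key, value) pairs
state, outputs = 0.0, []
for k, v in tokens:
    state = retention_step(state, k, v)
    outputs.append(state * 1.0)                # query fixed at q = 1.0

# the recurrent outputs match the attention-style recomputation
checks = [attention_style(tokens[: t + 1], 1.0) for t in range(len(tokens))]
print(outputs, checks)
```

This equivalence between the parallel (attention-like) and recurrent forms is what lets RetNet-style models train in parallel yet infer with constant work per generated token.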
An evaluation of the GAMA/StarPU frameworks for heterogeneous platforms: the progressive photon mapping algorithm
Master's dissertation in Informatics Engineering (Engenharia Informática).
The recent evolution of high-performance computing has moved towards heterogeneous platforms: multiple devices with different architectures, characteristics, and programming models share application workloads. To help the programmer efficiently exploit these heterogeneous platforms, several frameworks have been under development. These dynamically manage the available computing resources through workload scheduling and data distribution, dealing with the inherent difficulties of different programming models and memory accesses. Among other frameworks, these include GAMA and StarPU.
The GAMA framework aims to unify the multiple execution and memory models of each device in a computer system into a single, hardware-agnostic model. It was designed to efficiently manage resources for both regular and irregular applications, and currently supports only conventional CPU devices and CUDA-enabled accelerators. StarPU has similar goals and features and a wider user community, but it lacks a single programming model.
The main goal of this dissertation was an in-depth evaluation of a heterogeneous framework using a complex application as a case study. GAMA provided the starting vehicle for training, while StarPU was the framework selected for a thorough evaluation. The progressive photon mapping algorithm, which is irregular, was the selected case study. The goal of the evaluation was to assess StarPU's effectiveness with a robust irregular application and to make a high-level comparison with the still-in-development GAMA, in order to provide some guidelines for GAMA's improvement.
Results show that two main factors contribute to the performance of applications written with StarPU: the consideration of data transfers in the performance model, and the chosen scheduler. The study also uncovered some caveats in the StarPU API. Although these have no effect on performance, they present a challenge for newcomers. Both of these analyses resulted in a better understanding of the framework and enabled a comparative analysis with GAMA, pointing out the aspects where GAMA could be further improved.
Fundação para a Ciência e a Tecnologia (FCT) - Program UT Austin | Portugal
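The first of the two performance factors above can be sketched concretely: a scheduler that picks the device with the lowest *compute* estimate can lose badly to one that also counts host-to-device transfer time, which is the kind of transfer-aware cost model StarPU's dmda-style schedulers use. The device table, task, and numbers below are entirely illustrative, not measurements from the dissertation.

```python
def pick_device(task, devices, include_transfer):
    """Greedy scheduling: choose the device minimising estimated cost (seconds)."""
    def cost(dev):
        c = dev["compute_time"][task["kind"]]
        if include_transfer:
            c += task["bytes"] / dev["bandwidth"]  # add host<->device transfer
        return c
    return min(devices, key=cost)["name"]

devices = [
    {"name": "cpu", "compute_time": {"pm_pass": 40.0}, "bandwidth": 1e12},
    {"name": "gpu", "compute_time": {"pm_pass": 5.0},  "bandwidth": 1e9},
]
task = {"kind": "pm_pass", "bytes": 80e9}  # a large photon-map pass (toy size)

print(pick_device(task, devices, include_transfer=False))  # fast compute wins
print(pick_device(task, devices, include_transfer=True))   # transfers flip it
```

For an irregular, data-heavy workload like progressive photon mapping, the transfer term can dominate, which is why the performance model's treatment of data movement mattered as much as the scheduler itself.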