
    Advancing synthesis of decision tree-based multiple classifier systems: an approximate computing case study

    So far, multiple classifier systems have been increasingly designed to take advantage of hardware features, such as high parallelism and computational power. Indeed, compared to software implementations, hardware accelerators guarantee higher throughput and lower latency. Although the combination of multiple classifiers leads to high classification accuracy, the required area overhead makes the design of a hardware accelerator unfeasible, hindering the adoption of commercial configurable devices. For this reason, in this paper, we exploit the approximate computing design paradigm to trade off hardware area overhead against classification accuracy. In particular, starting from trained decision tree (DT) models and employing a precision-scaling technique, we explore approximate decision tree variants by means of a multi-objective optimization problem, demonstrating a significant performance improvement targeting field-programmable gate array devices.
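
As a rough illustration of the precision-scaling idea mentioned in this abstract, the following minimal sketch truncates feature values and node thresholds to a fixed number of fractional bits and measures how accuracy changes. The tree, the samples, the bit-widths, and all names (`Node`, `quantize`, `predict`, `accuracy`) are hypothetical; the paper's actual flow, including its multi-objective search and FPGA synthesis, is not reproduced here.

```python
import math
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass
class Node:
    """Binary decision-tree node; a leaf carries a label and no children."""
    feature: int = 0
    threshold: float = 0.0
    left: Optional["Node"] = None
    right: Optional["Node"] = None
    label: Optional[int] = None


def quantize(value: float, frac_bits: int) -> float:
    """Truncate a value to a fixed-point grid with frac_bits fractional bits."""
    scale = 1 << frac_bits
    return math.floor(value * scale) / scale


def predict(node: Node, x: Sequence[float], frac_bits: Optional[int] = None) -> int:
    """Walk the tree; if frac_bits is given, compare at reduced precision."""
    while node.label is None:
        value, threshold = x[node.feature], node.threshold
        if frac_bits is not None:
            value = quantize(value, frac_bits)
            threshold = quantize(threshold, frac_bits)
        node = node.left if value <= threshold else node.right
    return node.label


def accuracy(tree: Node, samples, labels, frac_bits: Optional[int] = None) -> float:
    """Fraction of samples classified correctly at the given precision."""
    hits = sum(predict(tree, x, frac_bits) == y for x, y in zip(samples, labels))
    return hits / len(samples)


# Hypothetical one-split tree and data set, for illustration only.
TREE = Node(feature=0, threshold=0.30, left=Node(label=0), right=Node(label=1))
SAMPLES = [[0.10], [0.45], [0.60], [0.26]]
LABELS = [0, 1, 1, 0]

if __name__ == "__main__":
    for bits in (None, 8, 2):
        print(bits, accuracy(TREE, SAMPLES, LABELS, bits))
```

Narrower comparators need less hardware area, so sweeping `frac_bits` per tree gives one axis of the accuracy/area trade-off; how area is actually estimated is outside the scope of this sketch.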

    Implementing and evaluating candidate-based invariant generation

    The discovery of inductive invariants lies at the heart of static program verification. Presently, many automatic solutions to inductive invariant generation are inflexible, only applicable to certain classes of programs, or unpredictable. An automatic technique that circumvents these deficiencies to some extent is candidate-based invariant generation, whereby a large number of candidate invariants are guessed and then proven to be inductive or rejected using a sound program analyser. This paper describes our efforts to apply candidate-based invariant generation in GPUVerify, a static checker of programs that run on GPUs. We study a set of 383 GPU programs that contain loops, drawn from a number of open source suites and vendor SDKs. Among this set, 253 benchmarks require provision of loop invariants for verification to succeed. We describe the methodology we used to incrementally improve the invariant generation capabilities of GPUVerify to handle these benchmarks, through candidate-based invariant generation, whereby potential program invariants are speculated using cheap static analysis and subsequently either refuted or proven. We also describe a set of experiments that we used to examine the effectiveness of our rules for candidate generation, assessing rules based on their generality (the extent to which they generate candidate invariants), hit rate (the extent to which the generated candidates hold), effectiveness (the extent to which provable candidates actually help in allowing verification to succeed), and influence (the extent to which the success of one generation rule depends on candidates generated by another rule). We believe that our methodology for devising and evaluating candidate generation rules may serve as a useful framework for other researchers interested in candidate-based invariant generation. The candidates produced by GPUVerify help to verify 231 of these 253 programs.
An increase in precision, however, has created sluggishness in GPUVerify because more candidates are generated and hence more time is spent on determining which of them are inductive invariants. To speed up this process, we have investigated four under-approximating program analyses that aim to reject false candidates quickly and a framework whereby these analyses can run in sequence or in parallel. Across two platforms, running Windows and Linux, our results show that the best combination of these techniques running sequentially speeds up invariant generation across our benchmarks by 1.17× (Windows) and 1.01× (Linux), with per-benchmark best speedups of 93.58× (Windows) and 48.34× (Linux), and worst slowdowns of 10.24× (Windows) and 43.31× (Linux). We find that parallelising the strategies marginally improves overall invariant generation speedups to 1.27× (Windows) and 1.11× (Linux), maintains good best-case speedups of 91.18× (Windows) and 44.60× (Linux), and, importantly, dramatically reduces worst-case slowdowns to 3.15× (Windows) and 3.17× (Linux).
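
The guess-then-refute loop described in this abstract can be sketched on a toy example. The code below is a minimal illustration, not GPUVerify's implementation: the loop (`i += 1; x += 2` while `i < N`), the candidate set, and every name are hypothetical, and the inductiveness check is a brute-force enumeration over a small bounded state space standing in for a sound program analyser (bounded enumeration is not sound in general).

```python
from itertools import product

N = 5  # hypothetical loop bound

# Bounded state space (i, x) used by the stand-in checker.
STATES = list(product(range(-2, N + 3), range(-4, 2 * N + 6)))
INIT = [(0, 0)]  # the loop starts with i = 0, x = 0


def guard(state):
    """Loop condition: i < N."""
    return state[0] < N


def step(state):
    """Loop body: i += 1; x += 2."""
    i, x = state
    return (i + 1, x + 2)


# Guessed candidate invariants; some are deliberately false or non-inductive.
CANDIDATES = {
    "x == 2*i": lambda s: s[1] == 2 * s[0],
    "i <= N": lambda s: s[0] <= N,
    "x <= i": lambda s: s[1] <= s[0],
    "x >= 0": lambda s: s[1] >= 0,
    "i != 3": lambda s: s[0] != 3,
}


def houdini(candidates):
    """Drop candidates until the remaining conjunction is inductive."""
    kept = dict(candidates)
    changed = True
    while changed:
        changed = False
        for name, pred in list(kept.items()):
            holds_initially = all(pred(s) for s in INIT)
            # Assume every surviving candidate before the body runs,
            # then require this candidate to hold afterwards.
            preserved = all(
                pred(step(s))
                for s in STATES
                if guard(s) and all(p(s) for p in kept.values())
            )
            if not (holds_initially and preserved):
                del kept[name]
                changed = True
    return set(kept)


if __name__ == "__main__":
    print(sorted(houdini(CANDIDATES)))
```

Note that `i != 3` survives the first checks only because the false candidate `x <= i` initially restricts the assumed states; once `x <= i` is refuted, `i != 3` falls too. This is why the loop must iterate to a fixpoint.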

    Low-Impact Profiling of Streaming, Heterogeneous Applications

    Computer engineers are continually faced with the task of translating improvements in fabrication process technology (i.e., Moore's Law) into architectures that allow computer scientists to accelerate application performance. As feature size continues to shrink, architects of commodity processors are placing an increasing number of cores on a chip. While additional cores can operate independently with some tasks (e.g., the OS and user tasks), many applications see little to no improvement from adding more processor cores alone. For many applications, heterogeneous systems offer a path toward higher performance. Significant performance and power gains have been realized by combining specialized processors (e.g., Field-Programmable Gate Arrays, Graphics Processing Units) with general purpose multi-core processors. Heterogeneous applications need to be programmed differently than traditional software. One approach, stream processing, fits these systems particularly well because of the segmented memories and explicit expression of parallelism. Unfortunately, debugging and performance tools that support streaming, heterogeneous applications do not exist. This dissertation presents TimeTrial, a performance measurement system that enables performance optimization of streaming applications by profiling the application deployed on a heterogeneous system. TimeTrial performs low-impact measurements by dedicating computing resources to monitoring and by aggressively compressing performance traces into statistical summaries guided by user specification of the performance queries of interest.

    P3 problem and Magnolia language: Specializing array computations for emerging architectures

    The problem of producing portable high-performance computing (HPC) software that is cheap to develop and maintain is called the P3 (performance, portability, productivity) problem. Good solutions to the P3 problem have been achieved when the performance profiles of the target machines have been similar. The variety of HPC architectures is, however, large and can be expected to grow larger. Software for HPC therefore needs to be highly adaptable, and there is a pressing need to provide developers with tools to produce software that can target machines with vastly different profiles. Multi-dimensional array manipulation constitutes a core component of numerous numerical methods, such as finite difference solvers of Partial Differential Equations (PDEs). The efficiency of these computations is tightly connected to traversing and distributing array data in a hardware-friendly way. The Mathematics of Arrays (MoA) allows for formally reasoning about array computations and enables systematic transformations of array-based programs, e.g., to use data layouts that fit a specific architecture. This paper presents a programming methodology aimed at tackling the P3 problem in domains that are well-explored using Magnolia, a language designed to embody generic programming. The Magnolia programmer can restrict the semantic properties of abstract generic types and operations by defining so-called axioms. Axioms can be used to produce tests for concrete implementations of specifications, for formal verification, or to perform semantics-preserving program transformations. We leverage Magnolia's semantic specification facilities to extend the Magnolia compiler with a term rewriting system. We implement MoA's transformation rules in Magnolia, and demonstrate through a case study on a finite difference solver of PDEs how our rewriting system allows exploring the space of possible optimizations.

    Memory models for heterogeneous systems

    Heterogeneous systems, in which a CPU and an accelerator can execute together while sharing memory, are becoming popular in several computing sectors. Nowadays, programmers can split their computation into multiple specialised threads that can take advantage of each specialised component. FPGAs are popular accelerators with configurable logic for various tasks, and hardware manufacturers are developing platforms with tightly integrated multicore CPUs and FPGAs. In such tightly integrated platforms, the CPU threads and the FPGA threads access shared memory locations in a fine-grained manner. However, architectural optimisations will lead to instructions being observed out of order by different cores. The programmers must consider these reorderings for correct program executions. Memory models can aid in reasoning about these complex systems since they can be used to explore guarantees regarding the systems' behaviours. These models are helpful for low-level programmers, compiler writers, and designers of analysis tools. Memory models are specified according to two main paradigms: operational and axiomatic. An operational model is an abstract representation of the actual machine, described by states that represent idealised components such as buffers and queues, and the legal transitions between these states. Axiomatic models define relations between memory accesses to constrain the allowed and disallowed behaviours. This dissertation makes the following main contributions: an operational model of a CPU/FPGA system, an axiomatic one and an exploration of simulation techniques for operational models. The operational model is implemented in C and validated using all the behaviours described in the available documentation. We will see how the ambiguities from the documentation can be clarified by running tests on the hardware and consulting with the designers. 
Finally, to demonstrate the model's utility, we reason about a producer/consumer buffer implemented across the CPU and the FPGA. The simulation of axiomatic models can be orders of magnitude faster than operational models. For this reason, we also provide an axiomatic version of the memory model. This model allows us to generate small concurrent programs to reveal whether a specific memory model behaviour can occur. However, synthesising a single test for the FPGA requires significant time and prevents us from directly running many tests. To overcome this issue, we develop a soft-core processor that allows us to quickly run large numbers of such tests and gain higher confidence in the accuracy of our models. The simulation of the operational model faces a path-explosion problem that limits the exploration of large models. Observing that program analysis tools tackle a similar path-explosion problem, we investigate the idea of reducing the decision problem of "whether a given memory model allows a given behaviour" to the decision problem of "whether a given C program is safe", which can be handled by a variety of off-the-shelf tools. Using this approach, we can simulate our model more deeply and gain more confidence in its accuracy.
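
To make the question "does a given memory model allow a given behaviour" concrete, here is a minimal sketch of an operational model in the sense described above: states made of idealised components (program counters, shared memory, per-thread FIFO store buffers) and the legal transitions between them, explored exhaustively. It is a generic store-buffering toy in the style of textbook TSO models, not the dissertation's CPU/FPGA model; the litmus test and all names are assumptions made for illustration.

```python
# Each thread is a list of ("store", var, value) or ("load", var, register).
SB_TEST = [
    [("store", "x", 1), ("load", "y", "r0")],
    [("store", "y", 1), ("load", "x", "r1")],
]


def freeze(mapping):
    """Hashable snapshot of a dict."""
    return tuple(sorted(mapping.items()))


def replace(tup, index, value):
    """Copy of a tuple with one element replaced."""
    return tup[:index] + (value,) + tup[index + 1:]


def explore(program, use_buffers):
    """Return every final register valuation reachable in the toy model."""
    variables = sorted({instr[1] for thread in program for instr in thread})
    init = (
        tuple(0 for _ in program),           # program counters
        freeze({v: 0 for v in variables}),   # shared memory, all zero
        tuple(() for _ in program),          # per-thread store buffers
        (),                                  # registers
    )
    seen, outcomes, stack = set(), set(), [init]
    while stack:
        state = stack.pop()
        if state in seen:
            continue
        seen.add(state)
        pcs, mem, bufs, regs = state
        memory = dict(mem)
        successors = []
        for t, thread in enumerate(program):
            if bufs[t]:  # transition: flush the oldest buffered store
                var, val = bufs[t][0]
                flushed = dict(memory)
                flushed[var] = val
                successors.append(
                    (pcs, freeze(flushed), replace(bufs, t, bufs[t][1:]), regs))
            if pcs[t] < len(thread):  # transition: execute next instruction
                kind, var, arg = thread[pcs[t]]
                next_pcs = replace(pcs, t, pcs[t] + 1)
                if kind == "store" and use_buffers:
                    new_bufs = replace(bufs, t, bufs[t] + ((var, arg),))
                    successors.append((next_pcs, mem, new_bufs, regs))
                elif kind == "store":
                    written = dict(memory)
                    written[var] = arg
                    successors.append((next_pcs, freeze(written), bufs, regs))
                else:
                    val = memory[var]
                    for bvar, bval in bufs[t]:  # newest own store wins
                        if bvar == var:
                            val = bval
                    new_regs = dict(regs)
                    new_regs[arg] = val
                    successors.append((next_pcs, mem, bufs, freeze(new_regs)))
        if not successors:  # all threads done and all buffers drained
            outcomes.add(regs)
        stack.extend(successors)
    return outcomes


if __name__ == "__main__":
    both_zero = (("r0", 0), ("r1", 0))
    print("without buffers:", both_zero in explore(SB_TEST, use_buffers=False))
    print("with buffers:   ", both_zero in explore(SB_TEST, use_buffers=True))
```

Even for this four-instruction test the explorer visits many interleavings, which hints at the path-explosion problem the abstract mentions for larger operational models.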

    Three-Dimensional Photoacoustic Computed Tomography: Imaging Models and Reconstruction Algorithms

    Photoacoustic computed tomography (PACT), also known as optoacoustic tomography, is a rapidly emerging imaging modality that holds great promise for a wide range of biomedical imaging applications. Much effort has been devoted to the investigation of imaging physics and the optimization of experimental designs. Meanwhile, a variety of image reconstruction algorithms have been developed for the purpose of computed tomography. Most of these algorithms assume full knowledge of the acoustic pressure function on a measurement surface that either encloses the object or extends to infinity, which poses many difficulties for practical applications. To overcome these limitations, iterative image reconstruction algorithms have been actively investigated. However, little work has been conducted on imaging models that incorporate the characteristics of data acquisition systems. Moreover, when applied to experimental data, most studies simplify the inherent three-dimensional wave propagation to two-dimensional imaging models by introducing heuristic assumptions on the transducer responses and/or the object structures. One important reason is that three-dimensional image reconstruction is computationally burdensome. The inaccurate imaging models severely limit the performance of iterative image reconstruction algorithms in practice. In the dissertation, we propose a framework to construct imaging models that incorporate the characteristics of ultrasonic transducers. Based on the imaging models, we systematically investigate various iterative image reconstruction algorithms, including advanced algorithms that employ total variation-norm regularization. In order to accelerate three-dimensional image reconstruction, we develop parallel implementations on graphics processing units. In addition, we derive a fast Fourier-transform based analytical image reconstruction formula.
By using iterative image reconstruction algorithms based on the proposed imaging models, PACT imaging scanners can have a compact size while maintaining high spatial resolution. The research demonstrates, for the first time, the feasibility and advantages of iterative image reconstruction algorithms in three-dimensional PACT.

    FlashVideo: A Framework for Swift Inference in Text-to-Video Generation

    In the evolving field of machine learning, video generation has witnessed significant advancements with autoregressive-based transformer models and diffusion models, known for synthesizing dynamic and realistic scenes. However, these models often face challenges with prolonged inference times, even for generating short video clips such as GIFs. This paper introduces FlashVideo, a novel framework tailored for swift Text-to-Video generation. FlashVideo represents the first successful adaptation of the RetNet architecture for video generation, bringing a unique approach to the field. Leveraging the RetNet-based architecture, FlashVideo reduces the time complexity of inference from $\mathcal{O}(L^2)$ to $\mathcal{O}(L)$ for a sequence of length $L$, significantly accelerating inference speed. Additionally, we adopt a redundant-free frame interpolation method, enhancing the efficiency of frame interpolation. Our comprehensive experiments demonstrate that FlashVideo achieves a $9.17\times$ efficiency improvement over a traditional autoregressive-based transformer model, and its inference speed is of the same order of magnitude as that of BERT-based transformer models.
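
The quadratic-to-linear reduction quoted above comes from the retention mechanism of RetNet, which can be evaluated either in a parallel form over all token pairs or in a recurrent form that carries a fixed-size state. The sketch below compares the two in plain Python for a single head. It is a simplified illustration (no position rotation, normalisation, or gating), the dimensions and decay value are arbitrary assumptions, and it says nothing about how FlashVideo itself is implemented.

```python
import math
import random


def retention_parallel(q, k, v, gamma):
    """Quadratic form: o_n = sum over m <= n of gamma^(n-m) * (q_n . k_m) * v_m."""
    outputs = []
    for n in range(len(q)):
        acc = [0.0] * len(v[0])
        for m in range(n + 1):
            weight = gamma ** (n - m) * sum(a * b for a, b in zip(q[n], k[m]))
            for j in range(len(acc)):
                acc[j] += weight * v[m][j]
        outputs.append(acc)
    return outputs


def retention_recurrent(q, k, v, gamma):
    """Linear form: S_n = gamma * S_{n-1} + k_n^T v_n, then o_n = q_n S_n."""
    d_k, d_v = len(k[0]), len(v[0])
    state = [[0.0] * d_v for _ in range(d_k)]  # fixed-size state matrix
    outputs = []
    for q_n, k_n, v_n in zip(q, k, v):
        for i in range(d_k):
            for j in range(d_v):
                state[i][j] = gamma * state[i][j] + k_n[i] * v_n[j]
        outputs.append(
            [sum(q_n[i] * state[i][j] for i in range(d_k)) for j in range(d_v)])
    return outputs


def random_matrix(rows, cols, rng):
    """Matrix of uniform values in [-1, 1] as nested lists."""
    return [[rng.uniform(-1.0, 1.0) for _ in range(cols)] for _ in range(rows)]


def max_abs_difference(a, b):
    """Largest element-wise absolute difference between two matrices."""
    return max(abs(x - y) for row_a, row_b in zip(a, b) for x, y in zip(row_a, row_b))


if __name__ == "__main__":
    rng = random.Random(0)
    q, k, v = (random_matrix(6, 3, rng) for _ in range(3))
    gap = max_abs_difference(
        retention_parallel(q, k, v, 0.9), retention_recurrent(q, k, v, 0.9))
    print(math.isclose(gap, 0.0, abs_tol=1e-9))
```

In the recurrent form each new token costs a constant amount of work regardless of how many tokens came before, so generating $L$ tokens is linear in $L$, whereas recomputing the pairwise form grows quadratically.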

    An evaluation of the GAMA/StarPU frameworks for heterogeneous platforms : the progressive photon mapping algorithm

    Master's dissertation in Informatics Engineering. The recent evolution of high performance computing has moved towards heterogeneous platforms: multiple devices with different architectures, characteristics and programming models share application workloads. To help the programmer exploit these heterogeneous platforms efficiently, several frameworks have been under development. These dynamically manage the available computing resources through workload scheduling and data distribution, dealing with the inherent difficulties of different programming models and memory accesses. Among other frameworks, these include GAMA and StarPU. The GAMA framework aims to unify the multiple execution and memory models of each different device in a computer system into a single, hardware-agnostic model. It was designed to efficiently manage resources with both regular and irregular applications, and currently only supports conventional CPU devices and CUDA-enabled accelerators. StarPU has similar goals and features and a wider user community, but it lacks a single programming model. The main goal of this dissertation was an in-depth evaluation of a heterogeneous framework using a complex application as a case study. GAMA provided the starting vehicle for training, while StarPU was the selected framework for a thorough evaluation. The irregular progressive photon mapping algorithm was the selected case study. The evaluation goal was to assess the effectiveness of StarPU with a robust irregular application, and to make a high-level comparison with GAMA, which is still under development, in order to provide some guidelines for its improvement. Results show that two main factors contribute to the performance of applications written with StarPU: the consideration of data transfers in the performance model, and the chosen scheduler. The study also allowed some caveats to be found within the StarPU API. Although these have no effect on performance, they present a challenge for newly arriving developers.
Both of these analyses resulted in a better understanding of the framework, and a comparative analysis with GAMA could be made, pointing out the aspects where GAMA could be further improved.
Fundação para a Ciência e a Tecnologia (FCT) - Program UT Austin | Portugal