
    Low Power Processor Architectures and Contemporary Techniques for Power Optimization – A Review

    Technological evolution has significantly increased the number of transistors for a given die area and raised switching speeds from a few MHz to the GHz range. This simultaneous shrinking of feature size and boost in performance demands lower supply voltages and effective power dissipation in chips with millions of transistors, and it has triggered a substantial amount of research into power reduction techniques for almost every aspect of the chip, particularly the processor cores it contains. This paper presents an overview of techniques for achieving power efficiency mainly at the processor core level, but also visits related domains such as buses and memories. Various processor parameters and features, such as supply voltage, clock frequency, caches and pipelining, can be optimized to reduce the power consumption of the processor, and this paper discusses ways in which each of them can be optimized. Emerging power-efficient processor architectures and ongoing research activities are also surveyed, which should help the reader identify how these factors contribute to a processor's power consumption. Some of these concepts are already established, whereas others are still active research areas. © 2009 ACADEMY PUBLISHER
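    As standard background (basic CMOS theory, not a result of this paper), the first-order model of dynamic power underlying the supply-voltage and clock-frequency optimizations surveyed here is

        P_{\mathrm{dyn}} = \alpha \, C_L \, V_{dd}^{2} \, f

    where \alpha is the switching activity factor, C_L the switched load capacitance, V_{dd} the supply voltage, and f the clock frequency. The quadratic dependence on V_{dd} is why voltage scaling, and DVFS schemes that lower voltage and frequency together, are typically the most effective of these knobs.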

    High-level synthesis of fine-grained weakly consistent C concurrency

    High-level synthesis (HLS) is the process of automatically compiling high-level programs into a netlist (a collection of gates). Given an input program, HLS tools exploit its inherent parallelism and pipelining opportunities to generate efficient customised hardware. C-based programs are the most popular input for HLS tools, but these tools historically only synthesise sequential C programs. As the appeal of software concurrency rises, HLS tools are beginning to synthesise concurrent C programs, such as C/C++ pthreads and OpenCL. Although supporting software concurrency leads to better hardware parallelism, shared memory synchronisation is typically serialised via locks to ensure correct memory behaviour. Locks are safety mechanisms that ensure exclusive access to shared memory, eliminating data races and providing synchronisation guarantees for programmers. As an alternative to lock-based synchronisation, the C memory model also defines the possibility of lock-free synchronisation via fine-grained atomic operations (`atomics'). However, most HLS tools either do not support atomics at all or implement atomics using locks. Instead, we treat the synthesis of atomics as a scheduling problem. We show that we can augment the intra-thread memory constraints during memory scheduling of concurrent programs to support atomics. On average, hardware generated by our method is 7.5x faster than the state-of-the-art, for our set of experiments. Our method of synthesising atomics enables several unique possibilities. Chiefly, we can support weakly consistent (`weak') atomics, which require fewer ordering constraints than sequentially consistent (SC) atomics. However, implementing weak atomics is complex and error-prone, so we formally verify our methods via automated model checking to ensure our generated hardware is correct. Furthermore, since the C memory model defines memory behaviour globally, we can analyse the entire program globally to generate its memory constraints. We can also support loop pipelining by extending our methods to generate inter-iteration memory constraints. On average, weak atomics, global analysis and loop pipelining improve performance by 1.6x, 3.4x and 1.4x respectively, for our set of experiments. Finally, we present a case study of a real-world example, an HLS-based Google PageRank algorithm, whose performance improves by 4.4x via lock-free streaming and work-stealing.
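    For readers unfamiliar with the distinction between SC and weak atomics that the thesis exploits, here is a minimal C11 sketch (an illustration using the standard <stdatomic.h> API, not code from the thesis) of the classic message-passing idiom written with release/acquire atomics; replacing the explicit orderings with the default memory_order_seq_cst would give the SC version, which must be ordered more conservatively:

        #include <stdatomic.h>

        atomic_int data;   /* payload                    */
        atomic_int flag;   /* 0 = not ready, 1 = ready   */

        /* Producer thread: the release store to flag orders the
         * preceding store to data before it. */
        void produce(void) {
            atomic_store_explicit(&data, 42, memory_order_relaxed);
            atomic_store_explicit(&flag, 1, memory_order_release);
        }

        /* Consumer thread: the acquire load of flag guarantees that,
         * once flag == 1 is observed, the store to data is visible. */
        int consume(void) {
            while (!atomic_load_explicit(&flag, memory_order_acquire))
                ;                       /* spin until published */
            return atomic_load_explicit(&data, memory_order_relaxed);
        }

    The release/acquire version imposes only one cross-thread ordering edge; that extra slack, compared with the single global order SC atomics require, is exactly what a memory scheduler, in software or in synthesised hardware, can exploit.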

    Distributed systems: architecture-driven specification using extended LOTOS

    The thesis uses the LOTOS language (ISO International Standard ISO 8807) as a basis for the formal specification of distributed systems. Contributions are made to two key research areas: architecture-driven specification and LOTOS language extensions. The notion of architecture-driven specification is to guide the specification process by providing a reference base of pre-defined domain-specific components. The thesis builds an infrastructure of architectural elements, and provides Extended LOTOS (XL) definitions of these elements. The thesis develops Extended LOTOS (XL) for the specification of distributed systems. XL is LOTOS enhanced with features for the formal specification of quantitative timing, probabilistic and priority requirements. For distributed systems, the specification of these 'performance' requirements can be as important as the specification of the associated functional requirements. To support quantitative timing features, the XL semantics define a global, discrete clock which can be used both to force events to occur at specific times and to measure intervals between event occurrences. XL introduces the time policy operators ASAP ('as soon as possible', corresponding to "maximal progress" semantics) and ALAP ('as late as possible'). Special internal transitions are introduced in the XL semantics for the specification of probability. Conformance relations based on a notion of probabilization, together with a testing framework, are defined to support reasoning about probabilistic XL specifications. Priority within the XL semantics ensures that permitted events with the highest priority weighting of their class are allowed first. Both functional and performance specification play important roles in CIM (Computer Integrated Manufacturing) systems. The thesis uses a CIM system known as the CIM-OSA Integrating Infrastructure as a case study of architecture-driven specification using XL. The thesis thus constitutes a step in the evolution of distributed system specification methods that have both an architectural basis and a formal basis.

    Hardware Design and Implementation of Role-Based Cryptography

    Traditional public key cryptographic methods provide access control to sensitive data by allowing the message sender to grant a single recipient permission to read the encrypted message. The Need2Know® system (N2K) improves upon these methods by providing role-based access control. N2K defines data access permissions similar to those of a multi-user file system, but N2K strictly enforces access through cryptographic standards. Since custom hardware can efficiently implement many cryptographic algorithms and can provide additional security, N2K stands to benefit greatly from a hardware implementation. To this end, the main N2K algorithm, the Key Protection Module (KPM), is being specified in VHDL. The design is being built and tested incrementally: this first phase implements the core control logic of the KPM without integrating its cryptographic sub-modules. Both RTL simulation and formal verification are used to test the design. This is the first N2K implementation in hardware, and it promises to provide an accelerated and secured alternative to the software-based system. A hardware implementation is a necessary step toward highly secure and flexible deployments of the N2K system.

    An Empirical Evaluation of the Effectiveness of JML Assertions as Test Oracles

    Test oracles remain one of the least understood aspects of the modern testing process. An oracle is a mechanism used by software testers and software engineers for determining whether a test has passed or failed. One widely-supported approach to oracles is the use of runtime assertion checking during the testing activity. Method invariants, pre- and postconditions help detect bugs at runtime. While assertions are supported by virtually all programming environments, are used widely in practice, and are often assumed to be effective as test oracles, there are few empirical studies of their efficacy in this role. In this thesis, we present the results of an experiment we conducted to help answer this question. To do this, we studied seven of the core Java classes that had been annotated by others with assertions in the Java Modeling Language, used the muJava mutation analysis tool to create mutant implementations of these classes, exercised them with input-only (i.e., no oracle) test suites that achieve branch coverage, and used a machine learning tool, Weka, to determine which annotations were effective at "killing" these mutants. We also evaluate how effective the "null oracle" (in our case, the Java runtime system) is at catching these bugs. The results of our study are interesting, and help provide software engineers with insight into situations in which assertions can be relied upon to find bugs, and situations in which assertions may need to be augmented with other approaches to test oracles.
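    JML itself annotates Java, but the oracle role that pre- and postconditions play can be sketched in any language with runtime assertions. The following C analogue (my illustration, not code from the study) shows how a postcondition "kills" a mutant whose loop is broken:

        #include <assert.h>

        /* Contract, in the JML spirit:
         *   requires: a is non-null and n > 0
         *   ensures:  the result is >= every element of a
         * Under mutation analysis, a mutant that, say, starts the search
         * loop at i = 2 can return a wrong maximum; any test input that
         * exercises the faulty path then trips the postcondition assert,
         * so the assertion itself serves as the test oracle. */
        int max_of(const int *a, int n) {
            assert(a && n > 0);                  /* precondition  */
            int best = a[0];
            for (int i = 1; i < n; i++)
                if (a[i] > best)
                    best = a[i];
            for (int i = 0; i < n; i++)
                assert(best >= a[i]);            /* postcondition */
            return best;
        }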

    Elastic Program Transformations: Automatically Optimizing the Reliability/Performance Trade-off in Systems Software

    Performance and reliability are important yet conflicting properties of systems software. Software today often crashes, has security vulnerabilities and loses data, while many techniques that could address such issues remain unused due to performance concerns. This thesis presents elastic program transformations, a set of techniques to manage the trade-off between reliability and performance in a way that is optimal for the software's use case. Our work is based on the following insights:
    - Program transformations can be used to tailor many software properties: they can make software easier to verify, safer against security attacks, and faster to test. Developers can write software once and use transformations to subsequently specialize it for different use cases and environments.
    - Many classes of transformations are elastic: they can be applied selectively to parts of the software, and both their cost and effect scale accordingly.
    - The trade-off is governed by the Pareto Principle: the right choice of program transformations yields 80% of the benefit for 20% of the cost.
    This thesis makes four contributions that use these insights:
    1. We developed Overify, a strategy and technique for choosing compiler optimizations to make programs easier to verify, rather than faster to run. Overify demonstrates that program transformations have the power to reduce verification time by up to 95x.
    2. We show that known program transformations to detect memory errors and undefined behavior can be elastic. We developed ASAP, a technique and tool to apply these transformations selectively to make a program as secure as possible while staying within a maximum overhead budget specified by the developer; a rough source-level analogue of this selectivity is sketched below. For a maximum overhead of 5%, ASAP can protect 87% of the critical locations in CPU-intensive programs.
    3. We improve the performance of fuzz testing tools that rely on program transformations to observe the software under test. We present FUSS, a technique and tool for focusing these transformations where they are needed, improving the time to find bugs by up to 3.2x.
    4. We show that elasticity is a useful design principle for program transformations. In the BinKungfu project, we develop a novel elastic transformation to enforce control-flow integrity. Exploiting elasticity reduces the number of programs that violate performance constraints by 59%.
    The essence of our work is identifying elasticity as a first-class property of program transformations. This thesis explains how elasticity manifests and how Overify, ASAP, FUSS and BinKungfu exploit it automatically. We implemented each technique in a prototype, released it as open-source software, and evaluated it to show that it obtains more favorable trade-offs between reliability and performance than was previously possible.
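    ASAP performs its selection automatically inside the compiler, but the elasticity it relies on is visible even at the source level. As a crude manual stand-in for that idea (a real Clang feature, used here for illustration; this is not the ASAP tool itself), sanitizer attributes let one strip checks from the hottest code while the rest of the program stays instrumented:

        /* Build with: clang -O2 -fsanitize=address prog.c
         * (GCC spells the attribute no_sanitize_address.) */
        #include <string.h>

        /* Hot path: AddressSanitizer checks elided, so this function
         * runs at native speed. A tool like ASAP would pick such sites
         * automatically to meet a cost budget; here the choice is made
         * by hand for illustration. */
        __attribute__((no_sanitize("address")))
        void hot_copy(char *dst, const char *src, size_t n) {
            memcpy(dst, src, n);
        }

        /* Cold path: checks kept, so memory errors here are still
         * detected at runtime. */
        void cold_init(char *buf, size_t n) {
            memset(buf, 0, n);
        }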

    A Framework for the Design and Analysis of High-Performance Applications on FPGAs using Partial Reconfiguration

    The field-programmable gate array (FPGA) is a dynamically reconfigurable digital logic chip used to implement custom hardware. The large densities of modern FPGAs and their capability for on-the-fly reconfiguration have made the FPGA a viable alternative to fixed-logic hardware chips such as the ASIC. In high-performance computing, FPGAs are used as co-processors to speed up computationally intensive processes or as autonomous systems that realize a complete hardware application. However, due to the limited capacity of FPGA logic resources, denser FPGAs must be purchased if more logic resources are required to realize all the functions of a complex application. Alternatively, partial reconfiguration (PR) can be used to swap, on demand, idle components of the application with active components. This research uses PR to improve the performance of an application given the limited logic resources available on smaller but economical FPGAs; the swap is called "resource sharing PR". In a pipelined design of multiple hardware modules (pipeline stages), resource sharing PR is a technique that uses PR to relieve pipeline bottlenecks. This is done by reconfiguring other pipeline stages, typically those that are idle waiting for data from a bottleneck, into an additional parallel bottleneck module. The target pipeline of this research is a two-stage "slow-to-fast" pipeline, in which the flow of data traversing the pipeline transitions from a relatively slow bottleneck stage to a fast stage. A two-stage pipeline that combines FPGA-based hardware implementations of well-known bioinformatics search algorithms, the X! Tandem algorithm and the Smith-Waterman algorithm, is implemented for this research; the implemented pipeline demonstrates the characteristics of these algorithms. The experimental results show that, in a database of unknown peptide spectra, when matching spectra with 388 peaks or greater, performing resource sharing PR to instantiate a parallel X! Tandem module is worth the cost of PR. In addition, from timings gathered during the experiments, a general formula was derived for determining the value of performing PR upon a fast module.
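    The paper's formula is derived from its measured timings; as a generic back-of-envelope model (my own notation and assumptions, not the paper's formula), let t_slow be the bottleneck stage's per-item time, t_PR the one-off cost of partial reconfiguration, and N the number of items to process. Reconfiguring the idle fast stage into a second bottleneck module roughly halves the bottleneck time, so PR pays off when

        \frac{N \, t_{\mathrm{slow}}}{2} + t_{\mathrm{PR}} \;<\; N \, t_{\mathrm{slow}}
        \quad\Longleftrightarrow\quad
        t_{\mathrm{PR}} \;<\; \frac{N \, t_{\mathrm{slow}}}{2}

    i.e. the reconfiguration overhead must be amortized over a sufficiently large batch, which is consistent with the paper's finding that PR is worthwhile only above a certain spectrum size.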

    Reconfiguration of field programmable logic in embedded systems

