54 research outputs found

    A Structured Design Methodology for High Performance VLSI Arrays

    With the geometric growth of integrated circuit technology, driven by transistor scaling and the system-on-chip design strategy, the complexity of integrated circuits has increased manifold. Short time to market with high reliability and performance is one of the most competitive challenges. Both custom and ASIC design methodologies have evolved over time to cope with this, but the high manual labor in custom design and the statistic design in ASIC are still causes for concern. This work proposes a new circuit design strategy, focused mostly on arrayed structures such as TLB, RF, cache, and IPCAM, that greatly reduces manual effort and makes the design regular and repetitive while still achieving high performance. The method makes the complete design a custom schematic but builds it from standard cells. This requires adding some custom cells to the already exhaustive library to optimize the design for performance. Once the schematic is finalized, the designer places these standard cells in a spreadsheet, keeping the cells on the critical paths close together. A Perl script then generates a Cadence Encounter-compatible placement file, and the design is routed in Encounter. Since the designer is the best judge of the circuit architecture, placement by the designer allows the most optimal design to be achieved. Several designs, including IPCAM, issue logic, TLB, RF, and cache, were carried out, and their performance was compared against the fully custom and ASIC flows. The TLB, RF, and cache were part of the HEMES microprocessor. Dissertation/Thesis. Ph.D. Electrical Engineering 201

    Energy-Efficient Cache Coherence for Embedded Multi-Processor Systems through Application-Driven Snoop Filtering

    We present a novel methodology for power reduction in embedded multiprocessor systems. Keeping local caches coherent in bus-based multiprocessor systems significantly increases power consumption, because bus snooping protocols trigger a local cache lookup for each memory reference placed on the common bus. Such a conservative approach is warranted in general-purpose systems, where no prior knowledge of the communication structure between threads or processes is available. In such a general-purpose context, the assumption is that each memory request is potentially a reference to a shared memory region, which may result in cache inconsistency if no corrective action is taken. The approach we propose exploits the fact that, in embedded systems, system designers have important knowledge of the communication activities between tasks allocated to the different processor nodes. We demonstrate how snoop-related cache probing can be drastically reduced by deterministically identifying all the shared memory regions and the communication patterns between the processor nodes. Cache snoop activity is enabled only for the fraction of bus transactions that refer to locations belonging to a known shared memory region for each processor node; for the remaining, larger part of memory references, known to be unrelated to the given processor node, snoop probes in the local cache are completely disabled, saving a large amount of power. The required hardware support is not only cost-efficient but also software programmable, which allows the system software to dynamically customize the cache coherence controller to the needs of different tasks or even different parts of the same program. Experiments on a number of important applications demonstrate the effectiveness of the proposed approach.
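
The filtering idea in this abstract can be illustrated with a minimal software sketch. It is not the authors' hardware design: the names `Region` and `SnoopFilter`, the address values, and the counters are all hypothetical, and the sketch only assumes that each node's shared regions are known as address ranges and can be reprogrammed by system software.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass(frozen=True)
class Region:
    """Half-open address range [base, base + size) assumed to be shared."""
    base: int
    size: int

    def contains(self, addr: int) -> bool:
        return self.base <= addr < self.base + self.size


@dataclass
class SnoopFilter:
    """Hypothetical per-node filter that system software can reprogram."""
    regions: list[Region] = field(default_factory=list)
    probes: int = 0    # bus transactions that trigger a local cache lookup
    filtered: int = 0  # bus transactions skipped without any lookup

    def program(self, regions: list[Region]) -> None:
        # Models the software-programmable part: install a new region set.
        self.regions = list(regions)

    def should_snoop(self, addr: int) -> bool:
        # Probe the local cache only if the address is in a known shared region.
        hit = any(r.contains(addr) for r in self.regions)
        if hit:
            self.probes += 1
        else:
            self.filtered += 1
        return hit


def demo() -> tuple[int, int]:
    snoop_filter = SnoopFilter()
    snoop_filter.program([Region(base=0x1000, size=0x100)])
    for addr in (0x0FFF, 0x1000, 0x10FF, 0x1100, 0x8000):
        snoop_filter.should_snoop(addr)
    return snoop_filter.probes, snoop_filter.filtered
```

In this toy trace only the two addresses inside the programmed region cause a probe; the other three bus transactions are filtered out, which is where the power saving described in the abstract would come from.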

    A DYNAMIC HETEROGENEOUS MULTI-CORE ARCHITECTURE

    Ph.D., Doctor of Philosophy

    Software-assisted cache mechanisms for embedded systems

    Thesis (Ph. D.)--Massachusetts Institute of Technology, Dept. of Electrical Engineering and Computer Science, 2008. This electronic version was submitted by the student author. The certified thesis is available in the Institute Archives and Special Collections. Includes bibliographical references (leaves 120-135). Embedded systems are increasingly using on-chip caches as part of their on-chip memory system. This thesis presents cache mechanisms to improve cache performance and provide opportunities to improve data availability that can lead to more predictable cache performance. The first cache mechanism presented is an intelligent cache replacement policy that utilizes information about dead data and data that is very frequently used. This mechanism is analyzed theoretically to show that the number of misses using intelligent cache replacement is guaranteed to be no more than the number of misses using traditional LRU replacement. Hardware and software-assisted mechanisms to implement intelligent cache replacement are presented and evaluated. The second cache mechanism presented is that of cache partitioning which exploits disjoint access sequences that do not overlap in the memory space. A theoretical result is proven that shows that modifying an access sequence into a concatenation of disjoint access sequences is guaranteed to improve the cache hit rate. Partitioning mechanisms inspired by the concept of disjoint sequences are designed and evaluated. A profit-based analysis, annotation, and simulation framework has been implemented to evaluate the cache mechanisms. This framework takes a compiled benchmark program and a set of program inputs and evaluates various cache mechanisms to provide a range of possible performance improvement scenarios. The proposed cache mechanisms have been evaluated using this framework by measuring cache miss rates and Instructions Per Clock (IPC) information.
The results show that the proposed cache mechanisms show promise in improving cache performance and predictability with a modest increase in silicon area. By Prabhat Jain. Ph.D.
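
As a rough illustration of replacement guided by dead-data information, the sketch below extends a plain LRU cache with a per-line "dead" hint and evicts dead lines first. It is an assumption-laden toy, not the thesis's mechanism or its proof: the class name `KillBitLRU`, the `last_use` hint, and the trace are hypothetical, and the hints are assumed to be correct.

```python
from collections import OrderedDict


class KillBitLRU:
    """Fully associative LRU cache that prefers to evict lines marked dead."""

    def __init__(self, capacity: int) -> None:
        self.capacity = capacity
        self.lines: "OrderedDict[str, bool]" = OrderedDict()  # tag -> dead flag
        self.misses = 0

    def access(self, tag: str, last_use: bool = False) -> None:
        if tag in self.lines:
            self.lines.move_to_end(tag)      # hit: becomes most recently used
        else:
            self.misses += 1
            if len(self.lines) >= self.capacity:
                self._evict()
        self.lines[tag] = last_use           # record (hypothetical) dead hint

    def _evict(self) -> None:
        # Prefer the least recently used dead line; otherwise fall back to LRU.
        victim = next((t for t, dead in self.lines.items() if dead), None)
        if victim is None:
            self.lines.popitem(last=False)
        else:
            del self.lines[victim]


def count_misses(trace, capacity: int, use_hints: bool) -> int:
    cache = KillBitLRU(capacity)
    for tag, last_use in trace:
        cache.access(tag, last_use and use_hints)
    return cache.misses


# Hypothetical trace: B is never reused, so its access carries a dead hint.
TRACE = [("A", False), ("B", True), ("C", False), ("A", False)]
```

With a two-line cache, `count_misses(TRACE, 2, use_hints=False)` behaves as plain LRU and misses on all four accesses, while `use_hints=True` evicts the dead line `B` instead of `A` and misses only three times.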

    Scalable and fault-tolerant data stream processing on multi-core architectures

    With increasing data volumes and velocity, many applications are shifting from the classical “process-after-store” paradigm to a stream processing model: data is produced and consumed as continuous streams. Stream processing captures latency-sensitive applications as diverse as credit card fraud detection and high-frequency trading. These applications are expressed as queries of algebraic operations (e.g., aggregation) over the most recent data using windows, i.e., finite evolving views over the input streams. To guarantee correct results, streaming applications require precise window semantics (e.g., temporal ordering) for operations that maintain state. While high processing throughput and low latency are performance desiderata for stateful streaming applications, achieving both poses challenges. Computing the state of overlapping windows causes redundant aggregation operations: incremental execution (i.e., reusing previous results) reduces latency but prevents parallelization; at the same time, parallelizing window execution for stateful operations with precise semantics demands ordering guarantees and state access coordination. Finally, streams and state must be recovered to produce consistent and repeatable results in the event of failures. Given the rise of shared-memory multi-core CPU architectures and high-speed networking, we argue that it is possible to address these challenges in a single node without compromising window semantics, performance, or fault-tolerance. In this thesis, we analyze, design, and implement stream processing engines (SPEs) that achieve high performance on multi-core architectures. 
To this end, we introduce new approaches for in-memory processing that address the previous challenges: (i) for overlapping windows, we provide a family of window aggregation techniques that enable computation sharing based on the algebraic properties of aggregation functions; (ii) for parallel window execution, we balance parallelism and incremental execution by developing abstractions for both and combining them into a novel design; and (iii) for reliable single-node execution, we enable strong fault-tolerance guarantees without sacrificing performance by reducing the required disk I/O bandwidth using a novel persistence model. We combine the above to implement an SPE that processes hundreds of millions of tuples per second with sub-second latencies. These results reveal the opportunity to reduce resource and maintenance footprint by replacing cluster-based SPEs with single-node deployments. Open Access
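
One simple instance of computation sharing for overlapping windows is incremental aggregation with an invertible function such as sum. The sketch below is a generic textbook technique under stated assumptions (count-based windows, the hypothetical function names `sliding_sum` and `naive_sliding_sum`); it is not the family of techniques developed in the thesis.

```python
from collections import deque
from typing import Iterable, Iterator, List


def sliding_sum(values: Iterable[int], size: int, slide: int) -> Iterator[int]:
    """Yield the sum of each count-based window of `size` items.

    A new window starts every `slide` items. Because sum is invertible,
    each step adds the arriving item and subtracts the expired one
    instead of re-aggregating the whole window.
    """
    window: deque = deque()
    total = 0
    for i, value in enumerate(values, start=1):
        window.append(value)
        total += value
        if len(window) > size:
            total -= window.popleft()  # undo the contribution of the expired item
        if i >= size and (i - size) % slide == 0:
            yield total


def naive_sliding_sum(values: Iterable[int], size: int, slide: int) -> List[int]:
    """Reference version that recomputes every window from scratch."""
    vals = list(values)
    return [sum(vals[s:s + size]) for s in range(0, len(vals) - size + 1, slide)]
```

For example, `list(sliding_sum([1, 2, 3, 4, 5, 6], size=3, slide=1))` gives `[6, 9, 12, 15]`, matching the naive version while doing constant work per item. Non-invertible functions such as max need a different sharing strategy, which this sketch does not cover.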

    Tracing the Compositional Process. Sound art that rewrites its own past: formation, praxis and a computer framework

    The domain of this thesis is electroacoustic computer-based music and sound art. It investigates a facet of composition which is often neglected or ill-defined: the process of composing itself and its embedding in time. Previous research mostly focused on instrumental composition or, when electronic music was included, the computer was treated as a tool which would eventually be subtracted from the equation. The aim was either to explain a resultant piece of music by reconstructing the intention of the composer, or to explain human creativity by building a model of the mind. Our aim instead is to understand composition as an irreducible unfolding of material traces which takes place in its own temporality. This understanding is formalised as a software framework that traces creation time as a version graph of transactions. The instantiation and manipulation of any musical structure implemented within this framework is thereby automatically stored in a database. Not only can it be queried ex post by an external researcher, providing a new quality for the empirical analysis of the activity of composing, but it is an integral part of the composition environment. Therefore it can recursively become a source for the ongoing composition and introduce new ways of aesthetic expression. The framework aims to unify creation and performance time, fixed and generative composition, human and algorithmic “writing”, a writing that includes indeterminate elements which condense as concurrent vertices in the version graph. The second major contribution is a critical epistemological discourse on the question of observability and the function of observation. Our goal is to explore a new direction of artistic research which is characterised by a mixed methodology of theoretical writing, technological development and artistic practice.
The form of the thesis is an exercise in becoming process-like itself, wherein the epistemic thing is generated by translating the gaps between these three levels. This is my idea of the new aesthetics: That through the operation of a re-entry one may establish a sort of process “form”, yielding works which go beyond a categorical either “sound-in-itself” or “conceptualism”. Exemplary processes are revealed by deconstructing a series of existing pieces, as well as through the successful application of the new framework in the creation of new pieces

    On the Distribution of Control in Asynchronous Processor Architectures

    Institute for Computing Systems Architecture. The effective performance of computer systems is to a large measure determined by the synergy between the processor architecture, the instruction set and the compiler. In the past, the sequencing of information within processor architectures has normally been synchronous: controlled centrally by a clock. However, this global signal could possibly limit the future gains in performance that can potentially be achieved through improvements in implementation technology. This thesis investigates the effects of relaxing this strict synchrony by distributing control within processor architectures through the use of a novel asynchronous design model known as a micronet. The impact of asynchronous control on the performance of a RISC-style processor is explored at different levels. Firstly, improvements in the performance of individual instructions by exploiting actual run-time behaviours are demonstrated. Secondly, it is shown that micronets are able to exploit further (both spatial and temporal) instruction-level parallelism (ILP) efficiently through the distribution of control to datapath resources. Finally, exposing fine-grain concurrency within a datapath can only be of benefit to a computer system if it can easily be exploited by the compiler. Although compilers for micronet-based asynchronous processors may be considered to be more complex than their synchronous counterparts, it is shown that the variable execution time of an instruction does not adversely affect the compiler's ability to schedule code efficiently. In conclusion, the modelling of a processor's datapath as a micronet permits the exploitation of both fine-grain ILP and actual run-time delays, thus leading to the efficient utilisation of functional units and in turn resulting in an improvement in overall system performance.

    Language design for distributed stream processing

    Thesis (Ph. D.)--Massachusetts Institute of Technology, Dept. of Electrical Engineering and Computer Science, 2009. This electronic version was submitted by the student author. The certified thesis is available in the Institute Archives and Special Collections. Includes bibliographical references (p. 149-152). Applications that combine live data streams with embedded, parallel, and distributed processing are becoming more commonplace. WaveScript is a domain-specific language that brings high-level, type-safe, garbage-collected programming to these domains. This is made possible by three primary implementation techniques, each of which leverages characteristics of the streaming domain. First, WaveScript employs an evaluation strategy that uses a combination of interpretation and reification to partially evaluate programs into stream dataflow graphs. Second, we use profile-driven compilation to enable many optimizations that are normally only available in the synchronous (rather than asynchronous) dataflow domain. Finally, an empirical, profile-driven approach also allows us to compute practical partitions of dataflow graphs, spreading them across embedded nodes and more powerful servers. We have used our language to build and deploy applications, including a sensor-network for the acoustic localization of wild animals such as the Yellow-Bellied marmot. We evaluate WaveScript's performance on this application, showing that it yields good performance on both embedded and desktop-class machines. Our language allowed us to implement the application rapidly, while outperforming a previous C implementation by over 35%, using fewer than half the lines of code. We evaluate the contribution of our optimizations to this success. We also evaluate WaveScript's ability to extract parallelism from this and other applications. By Ryan Rhodes Newton. Ph.D.

    Improving the Performance of User-level Runtime Systems for Concurrent Applications

    Concurrency is an essential part of many modern large-scale software systems. Applications must handle millions of simultaneous requests from millions of connected devices. Handling such a large number of concurrent requests requires runtime systems that efficiently manage concurrency and communication among tasks in an application across multiple cores. Existing low-level programming techniques provide scalable solutions with low overhead, but require non-linear control flow. Alternative approaches to concurrent programming, such as Erlang and Go, support linear control flow by mapping multiple user-level execution entities across multiple kernel threads (M:N threading). However, these systems provide comprehensive execution environments that make it difficult to assess the performance impact of user-level runtimes in isolation. This thesis presents a nimble M:N user-level threading runtime that closes this conceptual gap and provides a software infrastructure to precisely study the performance impact of user-level threading. Multiple design alternatives are presented and evaluated for scheduling, I/O multiplexing, and synchronization components of the runtime. The performance of the runtime is evaluated in comparison to event-driven software, system-level threading, and other user-level threading runtimes. An experimental evaluation is conducted using benchmark programs, as well as the popular Memcached application. The user-level runtime supports high levels of concurrency without sacrificing application performance. In addition, the user-level scheduling problem is studied in the context of an existing actor runtime that maps multiple actors to multiple kernel-level threads. In particular, two locality-aware work-stealing schedulers are proposed and evaluated. It is shown that locality-aware scheduling can significantly improve the performance of a class of applications with a high level of concurrency.
In general, the performance and resource utilization of large-scale concurrent applications depend on the level of concurrency that can be expressed by the programming model. This fundamental effect is studied by refining and customizing existing concurrency models.

    Parallel Simulation of Individual-Based, Physiologically-Structured Population and Predator-Prey Ecology Models

    Using physiologically-structured, individual-based models for fish and Daphnia populations as testbeds, techniques for parallelizing the simulation are developed and analyzed. The techniques developed are generally applicable to individual-based models. For rapidly reproducing populations such as Daphnia, which are load balanced, global birth combining is required. Super-scalar speedup was observed in simulations on multi-core desktop computers. The two populations are combined via a size-structured predation module into a predator-prey system, with sharing of resource weighted by relative mass. The individual-based structure requires multiple stages to complete predation. Two different styles of parallelization are presented. The first distributes both populations. It decouples the populations for parallel simulation by compiling, at each stage, tables of information for each of the distributed predators. Predation is completed for all fish at one time. This method is found to be generally applicable, scales almost perfectly with increasing numbers of processors, and improves in performance as the workload-to-communications ratio improves with increasing numbers of predator cohorts. However, it does not take best advantage of our testbed models. The second design decouples the workload for parallel simulation by duplicating the predator population on all nodes. This reduces communications to simple parallel reductions similar to those of the population models, but increases the number of cycles required for predation. The performance of the population models is mimicked. Finally, the extinction and persistence behaviors of the predator-prey model are analyzed. The roles of the predation parameters, individual models, and initial populations are determined. In the presence of density-dependent mortality moderating the prey population, competition for resource between the larger and the smaller fish is found to be a vital control that prevents extinction of the prey population.
If unconstrained, the juvenile fish classes can, through their rapid initial growth and predation upon the juvenile prey classes, push the prey population to extinction. Persistence of the predator-prey community is thus threatened when the fish population is dominated by juveniles. Conversely, the presence of larger fish moderates the juveniles and stabilizes the community via competition for shared resource.