A Structured Design Methodology for High Performance VLSI Arrays
abstract: With the geometric growth of integrated circuit technology driven by transistor scaling, together with the system-on-chip design strategy, the complexity of integrated circuits has increased manifold. A short time to market with high reliability and performance is among the most competitive challenges. Both custom and ASIC design methodologies have evolved over time to cope with this, but the heavy manual labor in custom design and the statistical design in ASIC flows remain causes of concern. This work proposes a new circuit design strategy, focused mostly on arrayed structures such as TLBs, RFs, caches, and IPCAM, that greatly reduces manual effort and makes the design regular and repetitive while still achieving high performance. The method makes the complete design a custom schematic, but built from standard cells. This requires adding some custom cells to the already exhaustive library to optimize the design for performance. Once the schematic is finalized, the designer places these standard cells in a spreadsheet, placing cells on critical paths close together. A Perl script then generates a Cadence Encounter-compatible placement file, and the design is routed in Encounter. Since the designer is the best judge of the circuit architecture, designer-driven placement allows the most optimal design to be achieved. Several designs, including IPCAM, issue logic, TLB, RF, and cache, were carried out, and their performance was compared against fully custom and ASIC flows. The TLB, RF, and cache were part of the HEMES microprocessor. Dissertation/Thesis. Ph.D. Electrical Engineering 201
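The spreadsheet-to-placement step can be sketched roughly as follows. This is a hypothetical Python stand-in for the Perl script the abstract mentions; the pitch values and the output record format are illustrative assumptions, not Encounter's actual placement syntax.

```python
# Hypothetical re-creation of the spreadsheet-to-placement flow: each
# spreadsheet cell names a standard-cell instance, and its row/column
# position maps to chip coordinates via an assumed site pitch.

CELL_PITCH_X = 0.8   # assumed horizontal placement pitch (microns)
ROW_PITCH_Y = 1.2    # assumed standard-cell row height (microns)

def spreadsheet_to_placement(grid):
    """grid: list of rows, each a list of instance names ('' = empty site).
    Returns placement records (instance, x, y) in spreadsheet order."""
    records = []
    for row_idx, row in enumerate(grid):
        for col_idx, inst in enumerate(row):
            if inst:
                records.append((inst, col_idx * CELL_PITCH_X,
                                row_idx * ROW_PITCH_Y))
    return records

def emit_placement_file(records):
    # Assumed record format; a real flow would emit Encounter's own syntax.
    return "\n".join(f"{inst} {x:.2f} {y:.2f} N" for inst, x, y in records)
```

Cells the designer puts in adjacent spreadsheet columns of the same row land physically adjacent, which is how critical-path cells end up close together.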
Energy-Efficient Cache Coherence for Embedded Multi-Processor Systems through Application-Driven Snoop Filtering
We present a novel methodology for power reduction in embedded multiprocessor systems. Keeping local caches coherent in bus-based multiprocessor systems results in significantly elevated power consumption, as bus snooping protocols trigger local cache lookups for each memory reference placed on the common bus. Such a conservative approach is warranted in general-purpose systems, where no prior knowledge regarding the communication structure between threads or processes is available. In such a general-purpose context, the assumption is that each memory request is potentially a reference to a shared memory region, which may result in cache inconsistency if no corrective action is taken. The approach we propose exploits the fact that in embedded systems, important knowledge is available to the system designers regarding communication activities between tasks allocated to the different processor nodes. We demonstrate how the snoop-related cache probing activity can be drastically reduced by identifying, in a deterministic way, all the shared memory regions and the communication patterns between the processor nodes. Cache snoop activity is enabled only for the fraction of bus transactions that refer to locations belonging to a known shared memory region for each processor node; for the remaining, larger part of memory references known to be of no relation to the given processor node, snoop probes in the local cache are completely disabled, thus saving a large amount of power. The required hardware support is not only cost-efficient but also software programmable, which allows the system software to dynamically customize the cache coherence controller to the needs of different tasks or even different parts of the same program. The experiments we have performed on a number of important applications demonstrate the effectiveness of the proposed approach.
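The address-range filtering idea can be illustrated with a small model. The class below is an illustrative sketch, not the authors' hardware; the range-table interface is an assumption.

```python
class SnoopFilter:
    """Illustrative model of a per-node, software-programmable snoop
    filter: the local cache is probed only for bus addresses that fall
    inside a shared region registered for this node."""

    def __init__(self):
        self.shared_regions = []   # list of (base, size) pairs
        self.probes = 0            # local cache lookups actually performed

    def register_region(self, base, size):
        # System software programs the filter for the current task set.
        self.shared_regions.append((base, size))

    def snoop(self, addr):
        if any(base <= addr < base + size
               for base, size in self.shared_regions):
            self.probes += 1       # relevant traffic: probe the local cache
            return True
        return False               # unrelated traffic: no probe, power saved
```

With one shared buffer registered, only bus references into that buffer cost a local cache probe; everything else is filtered out.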
Software-assisted cache mechanisms for embedded systems
Thesis (Ph. D.)--Massachusetts Institute of Technology, Dept. of Electrical Engineering and Computer Science, 2008. This electronic version was submitted by the student author. The certified thesis is available in the Institute Archives and Special Collections. Includes bibliographical references (leaves 120-135). Embedded systems are increasingly using on-chip caches as part of their on-chip memory system. This thesis presents cache mechanisms to improve cache performance and provide opportunities to improve data availability that can lead to more predictable cache performance. The first cache mechanism presented is an intelligent cache replacement policy that utilizes information about dead data and data that is very frequently used. This mechanism is analyzed theoretically to show that the number of misses using intelligent cache replacement is guaranteed to be no more than the number of misses using traditional LRU replacement. Hardware and software-assisted mechanisms to implement intelligent cache replacement are presented and evaluated. The second cache mechanism presented is that of cache partitioning, which exploits disjoint access sequences that do not overlap in the memory space. A theoretical result is proven that shows that modifying an access sequence into a concatenation of disjoint access sequences is guaranteed to improve the cache hit rate. Partitioning mechanisms inspired by the concept of disjoint sequences are designed and evaluated. A profit-based analysis, annotation, and simulation framework has been implemented to evaluate the cache mechanisms. This framework takes a compiled benchmark program and a set of program inputs and evaluates various cache mechanisms to provide a range of possible performance improvement scenarios. The proposed cache mechanisms have been evaluated using this framework by measuring cache miss rates and Instructions Per Clock (IPC) information.
The results indicate that the proposed cache mechanisms show promise in improving cache performance and predictability with a modest increase in silicon area. by Prabhat Jain. Ph.D.
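The replacement idea can be sketched as a victim-selection rule layered on LRU: blocks hinted dead are evicted first, falling back to plain LRU otherwise. The hint interface and fully associative organization below are simplifying assumptions for illustration, not the thesis's actual mechanism.

```python
from collections import OrderedDict

class HintedCache:
    """Fully associative cache sketch: LRU order, but blocks hinted dead
    are preferred as victims. The dead-block hints would come from the
    hardware or software-assisted mechanisms (assumed interface here)."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.lines = OrderedDict()   # addr -> dead-hint flag, in LRU order
        self.misses = 0

    def mark_dead(self, addr):
        if addr in self.lines:
            self.lines[addr] = True

    def access(self, addr):
        if addr in self.lines:
            self.lines.move_to_end(addr)   # hit: refresh LRU position
            self.lines[addr] = False       # a touched block is live again
            return True
        self.misses += 1
        if len(self.lines) >= self.capacity:
            # Prefer the least-recently-used dead block; else plain LRU.
            victim = next((a for a, dead in self.lines.items() if dead),
                          next(iter(self.lines)))
            del self.lines[victim]
        self.lines[addr] = False
        return False
```

Evicting a hinted-dead block in place of the LRU block is exactly the case where this policy can beat LRU, and it never evicts a live block that LRU would have kept, which is consistent with the never-worse-than-LRU guarantee the abstract describes.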
Scalable and fault-tolerant data stream processing on multi-core architectures
With increasing data volumes and velocity, many applications are shifting from the classical “process-after-store” paradigm to a stream processing model: data is produced and consumed as continuous streams. Stream processing captures latency-sensitive applications as diverse as credit card fraud detection and high-frequency trading. These applications are expressed as queries of algebraic operations (e.g., aggregation) over the most recent data using windows, i.e., finite evolving views over the input streams. To guarantee correct results, streaming applications require precise window semantics (e.g., temporal ordering) for operations that maintain state.
While high processing throughput and low latency are performance desiderata for stateful streaming applications, achieving both poses challenges. Computing the state of overlapping windows causes redundant aggregation operations: incremental execution (i.e., reusing previous results) reduces latency but prevents parallelization; at the same time, parallelizing window execution for stateful operations with precise semantics demands ordering guarantees and state access coordination. Finally, streams and state must be recovered to produce consistent and repeatable results in the event of failures.
Given the rise of shared-memory multi-core CPU architectures and high-speed networking, we argue that it is possible to address these challenges in a single node without compromising window semantics, performance, or fault-tolerance. In this thesis, we analyze, design, and implement stream processing engines (SPEs) that achieve high performance on multi-core architectures. To this end, we introduce new approaches for in-memory processing that address the previous challenges: (i) for overlapping windows, we provide a family of window aggregation techniques that enable computation sharing based on the algebraic properties of aggregation functions; (ii) for parallel window execution, we balance parallelism and incremental execution by developing abstractions for both and combining them into a novel design; and (iii) for reliable single-node execution, we enable strong fault-tolerance guarantees without sacrificing performance by reducing the required disk I/O bandwidth using a novel persistence model. We combine the above to implement an SPE that processes hundreds of millions of tuples per second with sub-second latencies. These results reveal the opportunity to reduce resource and maintenance footprint by replacing cluster-based SPEs with single-node deployments. Open Access
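For invertible aggregation functions such as sum, the computation-sharing idea can be illustrated with an incremental sliding-window aggregate: each window slide reuses the previous result instead of re-aggregating the whole window. This is a generic textbook sketch of the algebraic idea, not the thesis's specific technique family.

```python
from collections import deque

def sliding_sums(stream, size, slide):
    """Incremental count-based sliding-window sum: evicted tuples are
    subtracted and new tuples added, so each slide costs O(slide)
    instead of re-aggregating O(size) tuples per window."""
    window, total, results = deque(), 0, []
    for value in stream:
        window.append(value)
        total += value
        if len(window) == size:
            results.append(total)
            for _ in range(slide):          # evict the oldest `slide` tuples
                total -= window.popleft()
    return results
```

This works because sum has an inverse (subtraction); non-invertible functions like max need the more general sharing structures the thesis develops.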
Tracing the Compositional Process. Sound art that rewrites its own past: formation, praxis and a computer framework
The domain of this thesis is electroacoustic computer-based music and sound art. It investigates
a facet of composition which is often neglected or ill-defined: the process of composing itself
and its embedding in time. Previous research mostly focused on instrumental composition or,
when electronic music was included, the computer was treated as a tool which would eventually
be subtracted from the equation. The aim was either to explain a resultant piece of music by
reconstructing the intention of the composer, or to explain human creativity by building a model
of the mind.
Our aim instead is to understand composition as an irreducible unfolding of material traces which
takes place in its own temporality. This understanding is formalised as a software framework
that traces creation time as a version graph of transactions. The instantiation and manipulation
of any musical structure implemented within this framework is thereby automatically stored
in a database. Not only can it be queried ex post by an external researcher—providing a new
quality for the empirical analysis of the activity of composing—but it is an integral part of
the composition environment. Therefore it can recursively become a source for the ongoing
composition and introduce new ways of aesthetic expression. The framework aims to unify
creation and performance time, fixed and generative composition, human and algorithmic
“writing”, a writing that includes indeterminate elements which condense as concurrent vertices
in the version graph.
The second major contribution is a critical epistemological discourse on the question of ob-
servability and the function of observation. Our goal is to explore a new direction of artistic
research which is characterised by a mixed methodology of theoretical writing, technological
development and artistic practice. The form of the thesis is an exercise in becoming process-like
itself, wherein the epistemic thing is generated by translating the gaps between these three levels.
This is my idea of the new aesthetics: That through the operation of a re-entry one may establish
a sort of process “form”, yielding works which go beyond a categorical either “sound-in-itself”
or “conceptualism”.
Exemplary processes are revealed by deconstructing a series of existing pieces, as well as
through the successful application of the new framework in the creation of new pieces.
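The version-graph idea can be sketched minimally: every transaction on a musical structure appends a vertex whose parents record the versions it derives from, so concurrent (indeterminate) edits appear as sibling vertices that a later merge vertex joins. The names below are illustrative, not the framework's actual API.

```python
class VersionGraph:
    """Minimal sketch of a creation-time version graph: each transaction
    adds a vertex pointing at its parent versions, so the compositional
    process stays queryable ex post and can feed back into the piece."""

    def __init__(self):
        self.vertices = {}      # id -> (payload, parent ids)
        self.next_id = 0

    def commit(self, payload, parents=()):
        vid = self.next_id
        self.next_id += 1
        self.vertices[vid] = (payload, tuple(parents))
        return vid

    def ancestors(self, vid):
        """Query the full history of a version (ex-post analysis)."""
        seen, stack = set(), [vid]
        while stack:
            v = stack.pop()
            if v not in seen:
                seen.add(v)
                stack.extend(self.vertices[v][1])
        return seen
```

Two commits sharing a parent model the concurrent vertices the text mentions; a commit with both as parents models their condensation into one continuation.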
On the Distribution of Control in Asynchronous Processor Architectures
Institute for Computing Systems Architecture
The effective performance of computer systems is to a large measure
determined by the synergy between the processor architecture, the
instruction set and the compiler. In the past, the sequencing of
information within processor architectures has normally been
synchronous: controlled centrally by a clock. However, this global
signal could possibly limit the future gains in performance that can
potentially be achieved through improvements in implementation
technology.
This thesis investigates the effects of relaxing this strict synchrony
by distributing control within processor architectures through the use
of a novel asynchronous design model known as a micronet. The impact
of asynchronous control on the performance of a RISC-style processor
is explored at different levels. Firstly, improvements in the
performance of individual instructions by exploiting actual run-time
behaviours are demonstrated. Secondly, it is shown that micronets are
able to exploit further (both spatial and temporal) instruction-level
parallelism (ILP) efficiently through the distribution of control to
datapath resources. Finally, exposing fine-grain concurrency within a
datapath can only be of benefit to a computer system if it can easily
be exploited by the compiler. Although compilers for micronet-based
asynchronous processors may be considered to be more complex than
their synchronous counterparts, it is shown that the variable
execution time of an instruction does not adversely affect the
compiler's ability to schedule code efficiently. In conclusion, the
modelling of a processor's datapath as a micronet permits the
exploitation of both fine-grain ILP and actual run-time delays, thus
leading to the efficient utilisation of functional units and in turn
resulting in an improvement in overall system performance.
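The performance argument (a central clock must budget every operation for the worst case, while distributed control lets each operation complete in its actual time) can be made concrete with a toy completion-time model. The delay figures in the test are invented for illustration only.

```python
def clocked_time(op_delays, worst_case):
    """Synchronous model: every operation occupies one clock period,
    and the period must cover the slowest operation."""
    return len(op_delays) * worst_case

def asynchronous_time(op_delays):
    """Idealized micronet-style model: each operation completes as soon
    as its actual delay elapses, handshakes replacing the clock."""
    return sum(op_delays)
```

Whenever run-time delays vary below the worst case, the asynchronous total is strictly smaller, which is the "exploiting actual run-time behaviours" gain the abstract claims; real handshake overheads, which this toy model omits, eat into that margin.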
Language design for distributed stream processing
Thesis (Ph. D.)--Massachusetts Institute of Technology, Dept. of Electrical Engineering and Computer Science, 2009. This electronic version was submitted by the student author. The certified thesis is available in the Institute Archives and Special Collections. Includes bibliographical references (p. 149-152). Applications that combine live data streams with embedded, parallel, and distributed processing are becoming more commonplace. WaveScript is a domain-specific language that brings high-level, type-safe, garbage-collected programming to these domains. This is made possible by three primary implementation techniques, each of which leverages characteristics of the streaming domain. First, WaveScript employs an evaluation strategy that uses a combination of interpretation and reification to partially evaluate programs into stream dataflow graphs. Second, we use profile-driven compilation to enable many optimizations that are normally only available in the synchronous (rather than asynchronous) dataflow domain. Finally, an empirical, profile-driven approach also allows us to compute practical partitions of dataflow graphs, spreading them across embedded nodes and more powerful servers. We have used our language to build and deploy applications, including a sensor-network for the acoustic localization of wild animals such as the Yellow-Bellied marmot. We evaluate WaveScript's performance on this application, showing that it yields good performance on both embedded and desktop-class machines. Our language allowed us to implement the application rapidly, while outperforming a previous C implementation by over 35%, using fewer than half the lines of code. We evaluate the contribution of our optimizations to this success. We also evaluate WaveScript's ability to extract parallelism from this and other applications. by Ryan Rhodes Newton. Ph.D.
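The profile-driven partitioning step can be sketched for the simple case of a linear operator pipeline: given profiled CPU costs on the embedded node and profiled data rates on each edge, pick the cut that keeps the embedded node within its CPU budget while shipping the least data over the network. This greedy formulation is an illustration, not WaveScript's actual algorithm.

```python
def partition_pipeline(embedded_costs, edge_rates, cpu_budget):
    """Choose a cut index k: operators [0, k) run on the embedded node,
    the rest on the server. Among the cuts whose embedded CPU load fits
    the budget, return the one crossing the lowest-rate edge.

    embedded_costs[i] : profiled cost of operator i on the embedded node
    edge_rates[k]     : profiled data rate crossing a cut at position k
                        (edge_rates[0] is the raw source rate)
    """
    best = None
    for k in range(len(embedded_costs) + 1):
        load = sum(embedded_costs[:k])
        if load > cpu_budget:
            break                      # moving the cut further only adds load
        rate = edge_rates[k]
        if best is None or rate < best[0]:
            best = (rate, k)
    return best[1]
```

Note how the profiled rates do the work: filters or feature extractors that shrink the data pull the cut toward the sensor, exactly the behavior a profile-driven partitioner exploits.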
Improving the Performance of User-level Runtime Systems for Concurrent Applications
Concurrency is an essential part of many modern large-scale software systems. Applications must handle millions of simultaneous requests from millions of connected devices. Handling
such a large number of concurrent requests requires runtime systems that efficiently man-
age concurrency and communication among tasks in an application across multiple cores.
Existing low-level programming techniques provide scalable solutions with low overhead,
but require non-linear control flow. Alternative approaches to concurrent programming,
such as Erlang and Go, support linear control flow by mapping multiple user-level execution
entities across multiple kernel threads (M:N threading). However, these systems provide
comprehensive execution environments that make it difficult to assess the performance
impact of user-level runtimes in isolation.
This thesis presents a nimble M:N user-level threading runtime that closes this con-
ceptual gap and provides a software infrastructure to precisely study the performance
impact of user-level threading. Multiple design alternatives are presented and evaluated
for scheduling, I/O multiplexing, and synchronization components of the runtime. The
performance of the runtime is evaluated in comparison to event-driven software, system-
level threading, and other user-level threading runtimes. An experimental evaluation is
conducted using benchmark programs, as well as the popular Memcached application.
The user-level runtime supports high levels of concurrency without sacrificing application
performance. In addition, the user-level scheduling problem is studied in the context of
an existing actor runtime that maps multiple actors to multiple kernel-level threads. In
particular, two locality-aware work-stealing schedulers are proposed and evaluated. It is
shown that locality-aware scheduling can significantly improve the performance of a class
of applications with a high level of concurrency. In general, the performance and resource
utilization of large-scale concurrent applications depends on the level of concurrency that
can be expressed by the programming model. This fundamental effect is studied by refining
and customizing existing concurrency models.
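The locality-aware stealing policy can be sketched deterministically: each worker owns a task deque, and a worker that runs dry first tries victims on its own node (cache or NUMA domain) before stealing remotely. The names and node layout below are illustrative, not the thesis's runtime API.

```python
from collections import deque

class Worker:
    def __init__(self, wid, node):
        self.wid, self.node = wid, node
        self.tasks = deque()

def steal(thief, workers):
    """Locality-aware victim selection: prefer victims sharing the
    thief's node; fall back to remote victims only if no local victim
    has work. Steals from the oldest end of the victim's deque."""
    local = [w for w in workers if w is not thief and w.node == thief.node]
    remote = [w for w in workers if w.node != thief.node]
    for victim in local + remote:
        if victim.tasks:
            return victim.tasks.popleft()
    return None
```

Preferring local victims keeps stolen tasks' working sets in a shared cache, which is the source of the speedups the thesis reports for concurrency-heavy applications.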
Parallel Simulation of Individual-Based, Physiologically-Structured Population and Predator-Prey Ecology Models
Using physiologically-structured, individual-based models of fish and Daphnia populations as testbeds, techniques for parallelizing the simulation are developed and analyzed. The techniques developed are generally applicable to individual-based models. For rapidly reproducing populations like Daphnia, global birth combining is required when the population is load balanced. Super-scalar speedup was observed in simulations on multi-core desktop computers.
The two populations are combined via a size-structured predation module into a predator-prey system with sharing of resource weighted by relative mass. The individual-based structure requires multiple stages to complete predation.
Two different styles of parallelization are presented. The first distributes both populations. It decouples the populations for parallel simulation by compiling, at each stage, tables of information for each of the distributed predators. Predation is completed for all fish at one time. This method is found to be generally applicable, has near perfect scaling with increasing processors, and improves performance as the workload to communications ratio improves with increasing numbers of predator cohorts. But it does not take best advantage of our testbed models.
The second design decouples the workload for parallel simulation by duplicating the predator population on all nodes. This reduces communications to simple parallel reductions similar to the population models, but increases the number of cycles required for predation. The performance of the population models is mimicked.
Finally, the extinction and persistence behaviors of the predator-prey model are analyzed. The roles of the predation parameters, individual models, and initial populations are determined. In the presence of density-dependent mortality moderating the prey population, competition via resource of the larger fish versus the smaller is found to be a vital control preventing extinction of the prey population. If unconstrained, the juvenile fish classes can, through their rapid initial growth and predation upon the juvenile prey classes, push the prey population to extinction. Persistence of the predator-prey community is thus threatened when the fish population is dominated by juveniles. Conversely, the presence of larger fish moderates the juveniles and stabilizes the community via competition for shared resource.
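The second design's communication pattern, duplicating the predator population on every node and combining per-node partial results with a parallel reduction, can be sketched as follows. The consumption rule is a toy placeholder, not the actual physiologically-structured predation module.

```python
def node_partial_consumption(predators, local_prey):
    """Each node holds the full predator list but only its shard of the
    prey population; it computes the prey biomass its shard contributes
    to each predator. (Toy rule: intake shared in proportion to
    predator mass, mimicking mass-weighted resource sharing.)"""
    total_mass = sum(p["mass"] for p in predators)
    shard_biomass = sum(local_prey)
    return [p["mass"] / total_mass * shard_biomass for p in predators]

def reduce_partials(partials):
    """Reduction step: element-wise sum of per-node partial intakes,
    standing in for an all-reduce across the nodes."""
    return [sum(vals) for vals in zip(*partials)]
```

Because every node already knows the full predator state, the only communication per predation step is this vector reduction, which matches the abstract's claim that the design reduces communications to simple parallel reductions.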