68 research outputs found

    Evaluation of a "Stall" Cache: An Efficient Restricted On-chip Instruction Cache

    No full text
    In this report we compare the cost and performance of a new kind of restricted instruction cache architecture -- the stall cache -- against several other conventional cache architectures. The stall cache minimizes the size of an on-chip instruction cache by caching only those instructions whose instruction fetch phase collides with the memory access phase of a preceding load or store instruction. Many existing machines provide a single-cycle external cache memory [6, 17, 2]. Our results show that, under this assumption, the stall cache always outperforms an equivalently sized on-chip instruction cache, reducing external memory access stalls by approximately 10%. In addition we present results for a system using an on-chip data cache, and for one with a double-width data bus and a short instruction prefetch buffer.
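The admission policy described above can be sketched as a toy simulator: only instructions fetched immediately after a load or store (whose fetch collides with that instruction's memory-access phase) are admitted to the cache. The names (`run`, `MEM_OPS`) and the four-instruction loop are illustrative assumptions, not the paper's experimental setup.

```python
# Toy sketch of a stall-cache admission policy: cache only instructions
# whose fetch collides with a preceding load/store's memory access.
MEM_OPS = {"load", "store"}

def run(trace, cache=None):
    """Return (stall_count, cache) for a list of (pc, opcode) pairs."""
    if cache is None:
        cache = set()
    stalls = 0
    prev_op = None
    for pc, op in trace:
        if prev_op in MEM_OPS:      # fetch collides with memory access
            if pc not in cache:
                stalls += 1         # miss: stall one cycle...
                cache.add(pc)       # ...and admit into the stall cache
        prev_op = op
    return stalls, cache

loop = [(0, "load"), (1, "add"), (2, "store"), (3, "branch")]
s1, cache = run(loop)          # first iteration: pcs 1 and 3 miss and stall
s2, _ = run(loop, cache)       # second iteration: both colliding fetches hit
```

On a loop, only the instructions after memory operations ever enter the cache, which is why the structure can stay much smaller than a conventional instruction cache.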

    How Much Non-strictness do Lenient Programs Require?

    No full text
    Lenient languages, such as Id90, have been touted as among the best functional languages for massively parallel machines [AHN88]. Lenient evaluation combines non-strict semantics with eager evaluation [Tra91]. Non-strictness gives these languages more expressive power than strict semantics, while eager evaluation ensures the highest degree of parallelism. Unfortunately, non-strictness incurs a large overhead, as it requires dynamic scheduling and synchronization. As a result, many powerful program analysis techniques have been developed to statically determine when non-strictness is not required [CPJ85, Tra91, Sch94]. This paper studies a large set of lenient programs and quantifies the degree of non-strictness they require. We identify several forms of non-strictness, including functional, conditional, and data structure non-strictness. Surprisingly, most Id90 programs require neither functional nor conditional non-strictness. Many benchmark programs, however, make use of a limited fo..
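The functional non-strictness identified above can be mimicked in Python by passing arguments as explicit thunks that are forced only when used; this is a hypothetical illustration (Id90 provides non-strictness at the language level, with no such encoding).

```python
def const_first(a, b):
    """Non-strict in its second argument: the thunk b is never forced."""
    return a()

def diverges():
    raise RuntimeError("would not terminate under strict evaluation")

# Under strict semantics both arguments would be evaluated before the call,
# and the program would fail; with thunks, the unused argument costs nothing.
ok = const_first(lambda: 42, lambda: diverges())
```

The dynamic scheduling and synchronization overhead mentioned in the abstract corresponds to the cost of allocating and forcing such thunks at run time, which is exactly what the cited analyses try to eliminate statically.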

    Predicting the Running Times of Parallel Programs by Simulation

    No full text
    Predicting the running time of a parallel program is useful for determining the optimal values for the parameters of the implementation and the optimal mapping of data on processors. However, deriving an explicit formula for the running time of a certain parallel program is a difficult task. We present a new method for the analysis of parallel programs: simulating the execution of parallel programs by following their control flow and by determining, for each processor, the sequence of send and receive operations according to the LogGP model. We developed two algorithms to simulate the LogGP communication between processors and we tested them on the blocked parallel version of the Gaussian Elimination algorithm on the Meiko CS-2 parallel machine. Our implementation showed that the LogGP simulation is able to detect the nonlinear behavior of the program running times, to indicate the differences in running times for different data layouts and to find the local optimal value of the block ..
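A minimal cost model in the spirit of the LogGP simulation described above can be sketched as follows. Parameter names follow LogGP (L: latency, o: per-message overhead, g: gap between messages, G: gap per byte); the helper functions are illustrative assumptions, not the authors' simulator.

```python
def inject_done(start, nbytes, o, G):
    """Time at which the sender finishes injecting an nbytes message."""
    return start + o + (nbytes - 1) * G

def arrival(start, nbytes, L, o, G):
    """Time at which the receiver has fully received the message."""
    return inject_done(start, nbytes, o, G) + L + o

def pipeline(k, nbytes, L, o, g, G):
    """Completion time of k messages sent back to back by one processor."""
    t, last = 0.0, 0.0
    for _ in range(k):
        last = arrival(t, nbytes, L, o, G)
        t += max(g, o + (nbytes - 1) * G)   # next send must respect the gap
    return last
```

A full simulator applies this accounting to every send and receive along each processor's control flow, which is how nonlinear running-time behavior emerges without an explicit closed-form formula.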

    Active Messages Implementations for the Meiko CS-2

    No full text
    Active messages provide a low-latency communication architecture which on modern parallel machines achieves more than an order of magnitude performance improvement over more traditional communication libraries. They are used by library and compiler writers to obtain the utmost performance and have been used to implement the novel parallel language Split-C. This paper discusses the experience we gained while implementing active messages on the Meiko CS-2, and discusses implementations for similar architectures. The CS-2 is an interesting experimental platform, as it resembles a cluster of Sparc workstations, each equipped with a dedicated communication co-processor. During our work we identified two mismatches between the requirements of active messages and the Meiko CS-2 architecture. First, architectures which only support efficient remote write operations (or DMA transfers, as in the case of the CS-2) make it difficult to transfer both data and control as required by active messages. ..
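The core idea of active messages is that each message carries a handler reference together with its data, and the receiver runs the handler on arrival, transferring both control and data in one step. A minimal sketch follows; the names (`Node`, `am_send`, `remote_write`) are illustrative and not the Meiko CS-2 implementation.

```python
class Node:
    """One endpoint: local memory plus a queue of incoming active messages."""
    def __init__(self):
        self.mem = {}
        self.inbox = []

    def poll(self):
        # Draining the queue transfers both control (the handler)
        # and data (the arguments) at the receiver.
        while self.inbox:
            handler, args = self.inbox.pop(0)
            handler(self, *args)

def remote_write(node, addr, value):
    """A typical handler: deposit a value into the receiver's memory."""
    node.mem[addr] = value

def am_send(dest, handler, *args):
    dest.inbox.append((handler, args))   # in reality this crosses the network

n = Node()
am_send(n, remote_write, 0x10, 7)
n.poll()
```

The mismatch the abstract describes is visible here: hardware that can only deposit data into remote memory (a remote write or DMA) gives you the `args` half of the message cheaply, but delivering the `handler` half still requires extra machinery.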

    Optimal Broadcast and Summation in the LogP Model

    No full text
    In many distributed-memory parallel computers the only built-in communication primitive is point-to-point message transmission, and more powerful operations such as broadcast and synchronization must be realized using this primitive. Within the LogP model of parallel computation we present algorithms that yield optimal communication schedules for several broadcast and synchronization operations. Most of our algorithms are the absolute best possible in that not even the constant factors can be improved upon. For one particular broadcast problem, called continuous broadcast, the optimality of our algorithm is not yet completely proven, although proofs have been achieved for a certain range of parameters. We also devise an optimal algorithm for summing or, more generally, applying a non-commutative associative binary operator to a set of operands.
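A single-item broadcast schedule of the kind analyzed above can be computed greedily under LogP: every informed processor keeps sending to uninformed ones as soon as the gap allows. This sketch computes the completion time under that greedy schedule; it is an illustration under stated assumptions, not the paper's algorithm for all its variants.

```python
import heapq

def broadcast_time(P, L, o, g):
    """Time at which the last of P processors learns the value (greedy schedule)."""
    step = max(g, o)              # LogP: successive sends at least g apart
    ready = [0.0]                 # min-heap of times informed processors can next send
    informed, finish = 1, 0.0     # the root holds the value at time 0
    while informed < P:
        t = heapq.heappop(ready)
        arrive = t + o + L + o    # send overhead + network latency + receive overhead
        informed += 1
        finish = max(finish, arrive)
        heapq.heappush(ready, t + step)   # sender may transmit again after the gap
        heapq.heappush(ready, arrive)     # newly informed processor starts sending
    return finish
```

With L = o = g = 1, processors become informed at times 0, 3, 4, 5, ...: each message costs 2o + L = 3, and the root reuses its send port every g cycles.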

    Consh: Confined Execution Environment for Internet Computations

    No full text
    The recent rapid growth of the Internet has made a vast pool of resources available globally and enabled new kinds of applications, raising the need for transparent remote access and for protected computing. Currently users need specialized software such as web browsers or FTP clients to access global resources. It is desirable to instead provide OS support for transparent access to these resources so that they can be accessed through standard applications such as text editors or command shells. The new applications made possible by the expanding Internet require provisions for safe and protected computing. For example, global computing projects harness the power of thousands of idle machines to solve complex problems and, similarly, highly flexible servers allow users to upload and execute their code on the server to perform otherwise difficult tasks. In both cases, users or servers need to execute applications which they cannot trust completely. Such untrusted applications could poten..

    Lazy Threads: Implementing a Fast Parallel Call

    No full text
    In this paper we describe lazy threads, a new approach for implementing multi-threaded execution models on conventional machines. We show how they can implement a parallel call at nearly the efficiency of a sequential call. The central idea is to specialize the representation of a parallel call so that it can execute as a parallel-ready sequential call. This allows excess parallelism to degrade into sequential calls with the attendant efficient stack management and direct transfer of control and data, yet a call that truly needs to execute in parallel gets its own thread of control. The efficiency of lazy threads is achieved through careful attention to storage management and a code generation strategy that allows us to represent potential parallel work with no overhead.
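The "parallel-ready sequential call" can be loosely imitated in Python: a fork that is an ordinary call in the common case and is lifted into a real thread only when parallelism is actually needed. This is an analogy under stated assumptions; the paper's mechanism operates at the stack-layout and code-generation level, not via a runtime flag.

```python
import threading

def pcall(fn, *args, need_parallel=False):
    """A parallel call that usually degrades into a plain sequential call."""
    if not need_parallel:
        v = fn(*args)                    # common case: no thread, no new stack
        return lambda: v                 # the 'join' is just a variable read
    box = {}
    t = threading.Thread(target=lambda: box.setdefault("v", fn(*args)))
    t.start()
    def join():
        t.join()
        return box["v"]
    return join

f = pcall(lambda x: x * x, 6)                      # sequential fast path
g = pcall(lambda x: x + 1, 6, need_parallel=True)  # lifted into a real thread
```

The caller's code is identical in both cases (fork, then join), which mirrors the paper's goal of paying for a thread of control only when one is truly required.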