99,195 research outputs found
Variable-based multi-module data caches for clustered VLIW processors
Memory structures consume an important fraction of the total processor energy. One solution to reduce the energy consumed by cache memories consists of reducing their supply voltage and/or increase their threshold voltage at an expense in access time. We propose to divide the L1 data cache into two cache modules for a clustered VLIW processor consisting of two clusters. Such division is done on a variable basis so that the address of a datum determines its location. Each cache module is assigned to a cluster and can be set up as a fast power-hungry module or as a slow power-aware module. We also present compiler techniques in order to distribute variables between the two cache modules and generate code accordingly. We have explored several cache configurations using the Mediabench suite and we have observed that the best distributed cache organization outperforms traditional cache organizations by 19%-31% in energy-delay and by 11%-29% in energy-delay. In addition, we also explore a reconfigurable distributed cache, where the cache can be reconfigured on a context switch. This reconfigurable scheme further outperforms the best previous distributed organization by 3%-4%.Peer ReviewedPostprint (published version
Location and Spatial Pricing in Agricultural Markets
Agricultural markets often feature significant transport costs and spatially distributed production and processing which causes spatial imperfect competition. Spatial economics considers the firms’ decisions regarding location and spatial price strategy separately, usually on the demand side, and under restrictive assumptions. Therefore, alternative approaches are needed to explain, e.g., the location of new ethanol plants in the U.S. at peripheral as well as at central locations and the observation of different spatial price strategies in the market. We use an agent-based simulation model to analyze location and spatial pricing in a general model under multi-firm competition, two-dimensional space, and a continuum of potential price strategies. The results show, e.g., that depending on the location of a processor, different price strategies can be observed, spatial price discrimination can increase with the number of competitors, and elasticity in the producers’ supply functions can be identified as stabilizing factor of processor’s location.spatial competition, location, price discrimination, oligopsony, simulation, Industrial Organization, Research Methods/ Statistical Methods, C63, Q11, R32,
Hierarchical clustered register file organization for VLIW processors
Technology projections indicate that wire delays will become one of the biggest constraints in future microprocessor designs. To avoid long wire delays and therefore long cycle times, processor cores must be partitioned into components so that most of the communication is done locally. In this paper, we propose a novel register file organization for VLIW cores that combines clustering with a hierarchical register file organization. Functional units are organized in clusters, each one with a local first level register file. The local register files are connected to a global second level register file, which provides access to memory. All intercluster communications are done through the second level register file. This paper also proposes MIRS-HC, a novel modulo scheduling technique that simultaneously performs instruction scheduling, cluster selection, inserts communication operations, performs register allocation and spill insertion for the proposed organization. The results show that although more cycles are required to execute applications, the execution time is reduced due to a shorter cycle time. In addition, the combination of clustering and hierarchy provides a larger design exploration space that trades-off performance and technology requirements.Peer ReviewedPostprint (published version
Distributed memory compiler methods for irregular problems: Data copy reuse and runtime partitioning
Outlined here are two methods which we believe will play an important role in any distributed memory compiler able to handle sparse and unstructured problems. We describe how to link runtime partitioners to distributed memory compilers. In our scheme, programmers can implicitly specify how data and loop iterations are to be distributed between processors. This insulates users from having to deal explicitly with potentially complex algorithms that carry out work and data partitioning. We also describe a viable mechanism for tracking and reusing copies of off-processor data. In many programs, several loops access the same off-processor memory locations. As long as it can be verified that the values assigned to off-processor memory locations remain unmodified, we show that we can effectively reuse stored off-processor data. We present experimental data from a 3-D unstructured Euler solver run on iPSC/860 to demonstrate the usefulness of our methods
Cache Equalizer: A Cache Pressure Aware Block Placement Scheme for Large-Scale Chip Multiprocessors
This paper describes Cache Equalizer (CE), a novel distributed cache management scheme for large scale chip multiprocessors (CMPs). Our work is motivated by large asymmetry in cache sets usages. CE decouples the physical locations of cache blocks from their addresses for the sake of reducing misses caused by destructive interferences. Temporal pressure at the on-chip last-level cache, is continuously collected at a group (comprised of cache sets) granularity, and periodically recorded at the memory controller to guide the placement process. An incoming block is consequently placed at a cache group that exhibits the minimum pressure. CE provides Quality of Service (QoS) by robustly offering better performance than the baseline shared NUCA cache. Simulation results using a full-system simulator demonstrate that CE outperforms shared NUCA caches by an average of 15.5% and by as much as 28.5% for the benchmark programs we examined. Furthermore, evaluations manifested the outperformance of CE versus related CMP cache designs
Scaling of a large-scale simulation of synchronous slow-wave and asynchronous awake-like activity of a cortical model with long-range interconnections
Cortical synapse organization supports a range of dynamic states on multiple
spatial and temporal scales, from synchronous slow wave activity (SWA),
characteristic of deep sleep or anesthesia, to fluctuating, asynchronous
activity during wakefulness (AW). Such dynamic diversity poses a challenge for
producing efficient large-scale simulations that embody realistic metaphors of
short- and long-range synaptic connectivity. In fact, during SWA and AW
different spatial extents of the cortical tissue are active in a given timespan
and at different firing rates, which implies a wide variety of loads of local
computation and communication. A balanced evaluation of simulation performance
and robustness should therefore include tests of a variety of cortical dynamic
states. Here, we demonstrate performance scaling of our proprietary Distributed
and Plastic Spiking Neural Networks (DPSNN) simulation engine in both SWA and
AW for bidimensional grids of neural populations, which reflects the modular
organization of the cortex. We explored networks up to 192x192 modules, each
composed of 1250 integrate-and-fire neurons with spike-frequency adaptation,
and exponentially decaying inter-modular synaptic connectivity with varying
spatial decay constant. For the largest networks the total number of synapses
was over 70 billion. The execution platform included up to 64 dual-socket
nodes, each socket mounting 8 Intel Xeon Haswell processor cores @ 2.40GHz
clock rates. Network initialization time, memory usage, and execution time
showed good scaling performances from 1 to 1024 processes, implemented using
the standard Message Passing Interface (MPI) protocol. We achieved simulation
speeds of between 2.3x10^9 and 4.1x10^9 synaptic events per second for both
cortical states in the explored range of inter-modular interconnections.Comment: 22 pages, 9 figures, 4 table
The communication processor of TUMULT-64
Tumult (Twente University MULTi-processor system) is a modular extendible multi-processor system designed and implemented at the Twente University of Technology in co-operation with Oce Nederland B.V. and the Dr. Neher Laboratories (Dutch PTT). Characteristics of the hardware are: MIMD type, distributed memory, message passing, high performance, real-time and fault tolerant. A distributed real-time operating system has been realized, consisting of a multi-tasking kernel per node, inter process communication via typed messages and a distributed file system. In this paper first a brief description of the system is given, after that the architecture of the communication processor will be discussed. Reduction of the communication overhead due to message passing will be emphasized.\ud
\u
Low-complexity distributed issue queue
As technology evolves, power density significantly increases and cooling systems become more complex and expensive. The issue logic is one of the processor hotspots and, at the same time, its latency is crucial for the processor performance. We present a low-complexity FP issue logic (MB/spl I.bar/distr) that achieves high performance with small energy requirements. The MB/spl I.bar/distr scheme is based on classifying instructions and dispatching them into a set of queues depending on their data dependences. These instructions are selected for issuing based on an estimation of when their operands will be available, so the conventional wakeup activity is not required. Additionally, the functional units are distributed across the different queues. The energy required by the proposed scheme is substantially lower than that required by a conventional issue design, even if the latter has the ability of waking-up only unready operands. MB/spl I.bar/distr scheme reduces the energy-delay product by 35% and the energy-delay product by 18% with respect to a state-of-the-art approach.Peer ReviewedPostprint (published version
- …