Evolutionary system for prediction and optimization of hardware architecture performance
The design of computer architectures is a very complex problem. The multiple parameters make the number of possible combinations extremely high. Many researchers have used simulation, although it is a slow solution, since evaluating a single point of the search space can take hours. In this work we propose using an evolutionary multilayer perceptron (MLP) to predict the performance of architecture parameter settings. Instead of exploring the search space by simulating many configurations, our method randomly selects some architecture configurations; those are simulated to obtain their performance, and an artificial neural network is then trained to predict the performance of the remaining configurations. Results show that the estimates are highly accurate, using a simple method to select which configurations to simulate in order to optimize the MLP. To explore the search space, we have designed a genetic algorithm that uses the MLP as its fitness function to find the niche where the best architecture configurations (those with the highest performance) are located. Our models need only a small fraction of the design space, obtain small errors, and reduce the required simulation by two orders of magnitude.
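The workflow described above (sample a few configurations, simulate them, train a surrogate, then search with a genetic algorithm) can be sketched in a few dozen lines. This is a minimal illustration, not the authors' implementation: the three-parameter design space, the synthetic simulate() function, and the 1-nearest-neighbour surrogate standing in for the evolutionary MLP are all assumptions made for the example.

```python
import random

# Hypothetical 3-parameter design space: L2 size (KB), issue width, core count.
SPACE = {"cache_kb": [256, 512, 1024, 2048],
         "width":    [1, 2, 4, 8],
         "cores":    [1, 2, 4, 8, 16]}

def simulate(cfg):
    """Stand-in for a slow cycle-accurate simulator (synthetic performance model)."""
    return (cfg["cache_kb"] ** 0.5 + 3 * cfg["width"] + 2 * cfg["cores"]
            - 0.1 * cfg["width"] * cfg["cores"])

def random_cfg():
    return {k: random.choice(v) for k, v in SPACE.items()}

# Step 1: simulate only a small random sample of the design space.
train = [(c, simulate(c)) for c in (random_cfg() for _ in range(30))]

# Step 2: fit a surrogate on the simulated sample. A 1-nearest-neighbour
# predictor stands in for the paper's evolutionary MLP; any regressor
# plays the same role of replacing further simulation.
def predict(cfg):
    def dist(a, b):
        return sum((a[k] - b[k]) ** 2 for k in SPACE)
    return min(train, key=lambda t: dist(t[0], cfg))[1]

# Step 3: genetic algorithm using the surrogate as its fitness function.
def ga(generations=40, pop_size=20):
    pop = [random_cfg() for _ in range(pop_size)]
    for _ in range(generations):
        pop.sort(key=predict, reverse=True)     # rank by predicted performance
        elite = pop[: pop_size // 2]            # keep the best half
        children = []
        while len(elite) + len(children) < pop_size:
            a, b = random.sample(elite, 2)
            child = {k: random.choice([a[k], b[k]]) for k in SPACE}  # crossover
            if random.random() < 0.2:                                # mutation
                k = random.choice(list(SPACE))
                child[k] = random.choice(SPACE[k])
            children.append(child)
        pop = elite + children
    return max(pop, key=predict)

best = ga()
```

The expensive simulator is called only 30 times here; every one of the GA's hundreds of fitness evaluations goes through the cheap surrogate, which is the source of the two-orders-of-magnitude simulation reduction the abstract reports.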
Toward Dark Silicon in Servers
Server chips will not scale beyond a few tens to low hundreds of cores, and an increasing fraction of the chip in future technologies will be dark silicon that we cannot afford to power. Specialized multicore processors, however, can leverage the underutilized die area to overcome the initial power barrier, delivering significantly higher performance for the same bandwidth and power envelopes
Identifying optimal multicore cache hierarchies for loop-based parallel programs via reuse distance analysis
Understanding multicore memory behavior is crucial, but can be challenging due to the complex cache hierarchies employed in modern CPUs. In today's hierarchies, performance is determined by complicated thread interactions, such as interference in shared caches and replication and communication in private caches. Researchers normally perform extensive simulations to study these interactions, but this can be costly and not very insightful. An alternative is multicore reuse distance (RD) analysis, which can provide extremely rich information about multicore memory behavior. In this paper, we apply multicore RD analysis to better understand cache system design. We focus on loop-based parallel programs, an important class of programs for which RD analysis provides high accuracy. We propose a novel framework to identify optimal multicore cache hierarchies, and extract several new insights. We also characterize how the optimal cache hierarchies vary with core count and problem size
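The reuse distance (LRU stack distance) at the heart of RD analysis is straightforward to compute for a single reference stream: it is the number of distinct addresses touched since the previous access to the same address. A minimal sketch (the trace and address names are made up for illustration; a fully associative LRU cache of capacity C hits exactly when the distance is less than C):

```python
from collections import OrderedDict

def reuse_distances(trace):
    """Reuse distance of each reference: the number of distinct addresses
    touched since the previous access to the same address
    (None for first-time, i.e. cold, accesses)."""
    stack = OrderedDict()   # LRU stack; most recently used address is last
    out = []
    for addr in trace:
        if addr in stack:
            keys = list(stack)
            out.append(len(keys) - 1 - keys.index(addr))
            del stack[addr]
        else:
            out.append(None)
        stack[addr] = True  # move addr to the top of the stack
    return out

print(reuse_distances(["a", "b", "c", "a", "b", "b"]))
# [None, None, None, 2, 2, 0]
```

Histogramming these distances over a whole trace yields the RD profile from which miss rates at any LRU cache capacity can be read off without further simulation.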
Database Servers on Chip Multiprocessors: Limitations and Opportunities
Prior research shows that database system performance is dominated by off-chip data stalls, resulting in a concerted effort to bring data into on-chip caches. At the same time, high levels of integration have enabled the advent of chip multiprocessors and increasingly large (and slow) on-chip caches. These two trends pose the imminent technical and research challenge of adapting high-performance data management software to a shifting hardware landscape. In this paper we characterize the performance of a commercial database server running on emerging chip multiprocessor technologies. We find that the major bottleneck of current software is data cache stalls, with L2 hit stalls rising from oblivion to become the dominant execution time component in some cases. We analyze the source of this shift and derive a list of features for future database designs to attain maximum performance
Improving OLTP Concurrency through Early Lock Release
Since the beginning of the multi-core era, database systems research has refocused on increasing concurrency. Even though database systems have long been able to accommodate concurrent requests, the exploding number of available cores per chip has surfaced new difficulties. More and more transactions can be served in parallel (since more threads can run simultaneously) and, thus, concurrency in a database system is more important than ever for exploiting the available resources. In this paper, we evaluate Early Lock Release (ELR), a technique that allows locks to be released early to improve the concurrency level and overall throughput in OLTP. This technique has been proven to produce correct and recoverable histories, but it has never been implemented in a full-scale DBMS. A new action is introduced which decouples the commit action from the log flush to non-volatile storage. ELR can help us increase the concurrency and the predictability of a database system without losing correctness and recoverability. We conclude that applying ELR to a DBMS, especially one with a centralized log scheme, makes absolute sense, because (a) it carries negligible overhead, (b) it improves the concurrency level by allowing a transaction to acquire the necessary locks as soon as the previous holder of the locks has finished its useful work (and not after it commits to disk), and (c), as a result, the overall throughput can be increased by up to 2x for TPC-C and 7x for TPC-B workloads. Additionally, the variation in the time spent waiting for the log flush is eliminated, because transactions no longer wait for a log flush before they can release their locks
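The decoupling the abstract describes can be sketched as follows. This is a simplified illustration of the idea, not the paper's implementation: real ELR must also track commit dependencies so that a transaction that read early-released data is not acknowledged before the writer's log records are durable. The Log class and commit() helper here are hypothetical, and the flush is a stand-in for a real fsync to non-volatile storage.

```python
import threading

class Log:
    """Centralized log: append is cheap (in-memory); flush is the slow
    durable write that commit acknowledgement waits on."""
    def __init__(self):
        self.buf = []
        self.flushed_lsn = 0
        self.lock = threading.Lock()

    def append(self, rec):
        with self.lock:
            self.buf.append(rec)
            return len(self.buf)          # log sequence number (LSN)

    def flush(self, lsn):
        with self.lock:                    # stand-in for the slow fsync()
            self.flushed_lsn = max(self.flushed_lsn, lsn)

def commit(txn_id, log, held_locks):
    lsn = log.append(("commit", txn_id))   # 1. append the commit record
    for l in held_locks:                   # 2. Early Lock Release: free locks
        l.release()                        #    BEFORE the record reaches disk
    log.flush(lsn)                         # 3. durable flush
    return lsn                             # 4. only now acknowledge the client

log = Log()
row_lock = threading.Lock()
row_lock.acquire()                         # T1 holds a row lock
commit("T1", log, [row_lock])
print(row_lock.acquire(blocking=False))    # True: lock was freed at step 2
```

A transaction waiting on row_lock can acquire it during step 3, overlapping its useful work with the previous holder's log flush; with a centralized log scheme this removes the flush latency from the lock hold time entirely, which is where the throughput gains come from.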
Scaling Single-Program Performance on Large-Scale Chip Multiprocessors
Due to power constraints, computer architects will exploit TLP instead of ILP for future performance gains. Today, 4-8 state-of-the-art cores or 10s of smaller cores can fit on a single die. For the foreseeable future, the number of cores will likely double with each successive processor generation. Hence, CMPs with 100s of cores, so-called large-scale chip multiprocessors (LCMPs), will become a reality after only 2 or 3 generations.
Unfortunately, simply scaling the number of on-chip cores will not guarantee improved performance; all of the cores must also be utilized effectively. Perhaps the greatest threat to processor utilization will be the overhead incurred waiting on the memory system, especially as on-chip concurrency scales to 100s of threads. In particular, remote cache bank access and off-chip bandwidth contention are likely to be the most significant obstacles to scaling memory performance.
This paper conducts an in-depth study of CMP scalability for parallel programs. We assume a tiled CMP in which tiles contain a simple core along with a private L1 cache and a local slice of a shared L2 cache. Our study considers scaling from 1-256 cores and 4-128MB of total L2 cache, and addresses several issues related to the impact of scaling on off-chip bandwidth and on-chip communication. In particular, we find off-chip bandwidth increases linearly with core count, but the rate of increase reduces dramatically once enough L2 cache is provided to capture inter-thread sharing. Our results also show for the range 1-256 cores, there should be ample on-chip bandwidth to support the communication requirements of our benchmarks. Finally, we find that applications become off-chip limited when their L2 cache miss rates exceed some minimum threshold. Moreover, we expect off-chip overheads to dominate on-chip overheads for memory intensive programs and LCMPs with aggressive cores
Memory Performance Analysis for Parallel Programs Using Concurrent Reuse Distance
Performance on multicore processors is determined largely by on-chip
cache. Computer architects have conducted numerous studies in the past
that vary core count and cache capacity as well as problem size to
understand impact on cache behavior. These studies are very costly due
to the combinatorial design spaces they must explore.
Reuse distance (RD) analysis can help architects explore multicore cache
performance more efficiently. One problem, however, is that multicore
RD analysis requires measuring concurrent reuse distance (CRD) profiles
across thread-interleaved memory reference streams. Sensitivity to
memory interleaving makes CRD profiles architecture dependent,
undermining RD analysis benefits. But for parallel programs with
symmetric threads, CRD profiles vary with architecture tractably: they
change only slightly with cache capacity scaling, and shift predictably
to larger CRD values with core count scaling. This enables analysis of a
large number of multicore configurations from a small set of measured
CRD profiles.
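Under the symmetric-thread assumption above, a CRD profile can be sketched by interleaving the per-thread address streams and histogramming LRU stack distances over the merged stream. This is a minimal illustration, not the paper's measurement tool: round-robin interleaving is a simplifying assumption (real CRD measurement follows the actual thread interleaving), and the tiny traces are made up. The example also shows the predictable shift to larger CRD values as core count grows.

```python
from collections import Counter, OrderedDict
from itertools import zip_longest

def crd_profile(thread_traces):
    """Histogram of concurrent reuse distances over the interleaved stream.
    First-time (cold) accesses are not counted."""
    # Round-robin interleave the per-thread reference streams.
    merged = [a for group in zip_longest(*thread_traces)
                for a in group if a is not None]
    stack, hist = OrderedDict(), Counter()   # LRU stack, CRD histogram
    for addr in merged:
        if addr in stack:
            keys = list(stack)
            hist[len(keys) - 1 - keys.index(addr)] += 1  # distinct addrs since last use
            del stack[addr]
        stack[addr] = True                   # move addr to top of LRU stack
    return hist

# One thread alone: the reuse of "a" spans 1 distinct address.
print(crd_profile([["a", "b", "a"]]))                     # Counter({1: 1})
# Two symmetric threads: interleaving dilates the distance to 3.
print(crd_profile([["a", "b", "a"], ["x", "y", "x"]]))    # Counter({3: 2})
```

The second call illustrates the core-count-scaling effect the paragraph describes: the thread's own reuses are unchanged, but references from the other thread push each reuse further down the shared LRU stack, shifting the profile to larger CRD values.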
This paper investigates using RD analysis to efficiently analyze
multicore cache performance for parallel programs, making several
contributions. First, we characterize how CRD profiles change with core
count and cache capacity. One of our findings is that core count scaling
degrades locality, but the degradation only impacts last-level caches
(LLCs) below 16MB for our benchmarks and problem sizes, increasing to
128MB if problem size scales by 64x. Second, we apply reference groups
to predict CRD profiles across core count scaling, and evaluate
prediction accuracy. Finally, we use CRD profiles to analyze multicore
cache performance. We find predicted CRD profiles can estimate LLC MPKI
within 76% of simulation for configurations without pathological cache
conflicts in 1/1200th the time to perform simulation of the full design
space