Evolutionary system for prediction and optimization of hardware architecture performance
The design of computer architectures is a very complex problem. The multiple parameters make the number of possible combinations extremely high. Many researchers have used simulation, although it is a slow solution, since evaluating a single point of the search space can take hours. In this work we propose using an evolutionary multilayer perceptron (MLP) to predict the performance of architecture parameter settings. Instead of exploring the search space by simulating many configurations, our method randomly selects some architecture configurations; those are simulated to obtain their performance, and an artificial neural network is then trained to predict the performance of the remaining configurations. Results show that the estimates are highly accurate, using a simple method to select which configurations to simulate in order to optimize the MLP. To explore the search space, we have designed a genetic algorithm that uses the MLP as its fitness function to find the niche where the best architecture configurations (those with the highest performance) are located. Our models need only a small fraction of the design space, obtain small errors, and reduce the required simulation by two orders of magnitude.
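The workflow described above (sample a few configurations, simulate them, train a surrogate, then search with a genetic algorithm) can be sketched in a few dozen lines. This is a minimal illustration, not the authors' implementation: the three-parameter design space, the synthetic simulate() function, and the 1-nearest-neighbour surrogate standing in for the evolutionary MLP are all assumptions made for the example.

```python
import random

# Hypothetical 3-parameter design space: L2 size (KB), issue width, core count.
SPACE = {"cache_kb": [256, 512, 1024, 2048],
         "width":    [1, 2, 4, 8],
         "cores":    [1, 2, 4, 8, 16]}

def simulate(cfg):
    """Stand-in for a slow cycle-accurate simulator (synthetic performance model)."""
    return (cfg["cache_kb"] ** 0.5 + 3 * cfg["width"] + 2 * cfg["cores"]
            - 0.1 * cfg["width"] * cfg["cores"])

def random_cfg():
    return {k: random.choice(v) for k, v in SPACE.items()}

# Step 1: simulate only a small random sample of the design space.
train = [(c, simulate(c)) for c in (random_cfg() for _ in range(30))]

# Step 2: fit a surrogate on the simulated sample. A 1-nearest-neighbour
# predictor stands in for the paper's evolutionary MLP; any regressor
# plays the same role of replacing further simulation.
def predict(cfg):
    def dist(a, b):
        return sum((a[k] - b[k]) ** 2 for k in SPACE)
    return min(train, key=lambda t: dist(t[0], cfg))[1]

# Step 3: genetic algorithm using the surrogate as its fitness function.
def ga(generations=40, pop_size=20):
    pop = [random_cfg() for _ in range(pop_size)]
    for _ in range(generations):
        pop.sort(key=predict, reverse=True)     # rank by predicted performance
        elite = pop[: pop_size // 2]            # keep the best half
        children = []
        while len(elite) + len(children) < pop_size:
            a, b = random.sample(elite, 2)
            child = {k: random.choice([a[k], b[k]]) for k in SPACE}  # crossover
            if random.random() < 0.2:                                # mutation
                k = random.choice(list(SPACE))
                child[k] = random.choice(SPACE[k])
            children.append(child)
        pop = elite + children
    return max(pop, key=predict)

best = ga()
```

The expensive simulator is called only 30 times here; every one of the GA's hundreds of fitness evaluations goes through the cheap surrogate, which is the source of the two-orders-of-magnitude simulation reduction the abstract reports.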
Toward Dark Silicon in Servers
Server chips will not scale beyond a few tens to low hundreds of cores, and an increasing fraction of the chip in future technologies will be dark silicon that we cannot afford to power. Specialized multicore processors, however, can leverage the underutilized die area to overcome the initial power barrier, delivering significantly higher performance for the same bandwidth and power envelopes
Identifying optimal multicore cache hierarchies for loop-based parallel programs via reuse distance analysis
Understanding multicore memory behavior is crucial, but can be challenging due to the complex cache hierarchies employed in modern CPUs. In today's hierarchies, performance is determined by complicated thread interactions, such as interference in shared caches and replication and communication in private caches. Researchers normally perform extensive simulations to study these interactions, but this can be costly and not very insightful. An alternative is multicore reuse distance (RD) analysis, which can provide extremely rich information about multicore memory behavior. In this paper, we apply multicore RD analysis to better understand cache system design. We focus on loop-based parallel programs, an important class of programs for which RD analysis provides high accuracy. We propose a novel framework to identify optimal multicore cache hierarchies, and extract several new insights. We also characterize how the optimal cache hierarchies vary with core count and problem size
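The reuse distance (LRU stack distance) at the heart of RD analysis is straightforward to compute for a single reference stream: it is the number of distinct addresses touched since the previous access to the same address. A minimal sketch (the trace and address names are made up for illustration; a fully associative LRU cache of capacity C hits exactly when the distance is less than C):

```python
from collections import OrderedDict

def reuse_distances(trace):
    """Reuse distance of each reference: the number of distinct addresses
    touched since the previous access to the same address
    (None for first-time, i.e. cold, accesses)."""
    stack = OrderedDict()   # LRU stack; most recently used address is last
    out = []
    for addr in trace:
        if addr in stack:
            keys = list(stack)
            out.append(len(keys) - 1 - keys.index(addr))
            del stack[addr]
        else:
            out.append(None)
        stack[addr] = True  # move addr to the top of the stack
    return out

print(reuse_distances(["a", "b", "c", "a", "b", "b"]))
# [None, None, None, 2, 2, 0]
```

Histogramming these distances over a whole trace yields the RD profile from which miss rates at any LRU cache capacity can be read off without further simulation.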
Database Servers on Chip Multiprocessors: Limitations and Opportunities
Prior research shows that database system performance is dominated by off-chip data stalls, resulting in a concerted effort to bring data into on-chip caches. At the same time, high levels of integration have enabled the advent of chip multiprocessors and increasingly large (and slow) on-chip caches. These two trends pose the imminent technical and research challenge of adapting high-performance data management software to a shifting hardware landscape. In this paper we characterize the performance of a commercial database server running on emerging chip multiprocessor technologies. We find that the major bottleneck of current software is data cache stalls, with L2 hit stalls rising from oblivion to become the dominant execution time component in some cases. We analyze the source of this shift and derive a list of features for future database designs to attain maximum performance
Improving OLTP Concurrency through Early Lock Release
Since the beginning of the multi-core era, database systems research has refocused on increasing concurrency. Even though database systems have long been able to accommodate concurrent requests, the exploding number of available cores per chip has surfaced new difficulties. More and more transactions can be served in parallel (since more threads can run simultaneously) and, thus, concurrency in a database system is more important than ever for exploiting the available resources. In this paper, we evaluate Early Lock Release (ELR), a technique that allows locks to be released early to improve the concurrency level and overall throughput in OLTP. This technique has been proven to produce correct and recoverable histories, but it has never been implemented in a full-scale DBMS. A new action is introduced which decouples the commit action from the log flush to non-volatile storage. ELR can help us increase the concurrency and the predictability of a database system without losing correctness and recoverability. We conclude that applying ELR to a DBMS, especially one with a centralized log scheme, makes absolute sense, because (a) it carries negligible overhead, (b) it improves the concurrency level by allowing a transaction to acquire the necessary locks as soon as the previous holder of the locks has finished its useful work (and not after it commits to disk), and (c), as a result, the overall throughput can be increased by up to 2x for TPC-C and 7x for TPC-B workloads. Additionally, the variation in the time spent waiting for the log flush is eliminated, because transactions no longer wait for a log flush before they can release their locks
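The decoupling the abstract describes can be sketched as follows. This is a simplified illustration of the idea, not the paper's implementation: real ELR must also track commit dependencies so that a transaction that read early-released data is not acknowledged before the writer's log records are durable. The Log class and commit() helper here are hypothetical, and the flush is a stand-in for a real fsync to non-volatile storage.

```python
import threading

class Log:
    """Centralized log: append is cheap (in-memory); flush is the slow
    durable write that commit acknowledgement waits on."""
    def __init__(self):
        self.buf = []
        self.flushed_lsn = 0
        self.lock = threading.Lock()

    def append(self, rec):
        with self.lock:
            self.buf.append(rec)
            return len(self.buf)          # log sequence number (LSN)

    def flush(self, lsn):
        with self.lock:                    # stand-in for the slow fsync()
            self.flushed_lsn = max(self.flushed_lsn, lsn)

def commit(txn_id, log, held_locks):
    lsn = log.append(("commit", txn_id))   # 1. append the commit record
    for l in held_locks:                   # 2. Early Lock Release: free locks
        l.release()                        #    BEFORE the record reaches disk
    log.flush(lsn)                         # 3. durable flush
    return lsn                             # 4. only now acknowledge the client

log = Log()
row_lock = threading.Lock()
row_lock.acquire()                         # T1 holds a row lock
commit("T1", log, [row_lock])
print(row_lock.acquire(blocking=False))    # True: lock was freed at step 2
```

A transaction waiting on row_lock can acquire it during step 3, overlapping its useful work with the previous holder's log flush; with a centralized log scheme this removes the flush latency from the lock hold time entirely, which is where the throughput gains come from.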
Scaling Single-Program Performance on Large-Scale Chip Multiprocessors
Due to power constraints, computer architects will exploit TLP instead of ILP for future performance gains. Today, 4-8 state-of-the-art cores or 10s of smaller cores can fit on a single die. For the foreseeable future, the number of cores will likely double with each successive processor generation. Hence, CMPs with 100s of cores, so-called large-scale chip multiprocessors (LCMPs), will become a reality after only 2 or 3 generations.
Unfortunately, simply scaling the number of on-chip cores will not guarantee improved performance; all of the cores must also be utilized effectively. Perhaps the greatest threat to processor utilization will be the overhead incurred waiting on the memory system, especially as on-chip concurrency scales to 100s of threads. In particular, remote cache bank access and off-chip bandwidth contention are likely to be the most significant obstacles to scaling memory performance.
This paper conducts an in-depth study of CMP scalability for parallel programs. We assume a tiled CMP in which tiles contain a simple core along with a private L1 cache and a local slice of a shared L2 cache. Our study considers scaling from 1-256 cores and 4-128MB of total L2 cache, and addresses several issues related to the impact of scaling on off-chip bandwidth and on-chip communication. In particular, we find off-chip bandwidth increases linearly with core count, but the rate of increase reduces dramatically once enough L2 cache is provided to capture inter-thread sharing. Our results also show for the range 1-256 cores, there should be ample on-chip bandwidth to support the communication requirements of our benchmarks. Finally, we find that applications become off-chip limited when their L2 cache miss rates exceed some minimum threshold. Moreover, we expect off-chip overheads to dominate on-chip overheads for memory intensive programs and LCMPs with aggressive cores
Memory Performance Analysis for Parallel Programs Using Concurrent Reuse Distance
Performance on multicore processors is determined largely by on-chip
cache. Computer architects have conducted numerous studies in the past
that vary core count and cache capacity as well as problem size to
understand impact on cache behavior. These studies are very costly due
to the combinatorial design spaces they must explore.
Reuse distance (RD) analysis can help architects explore multicore cache
performance more efficiently. One problem, however, is that multicore
RD analysis requires measuring concurrent reuse distance (CRD) profiles
across thread-interleaved memory reference streams. Sensitivity to
memory interleaving makes CRD profiles architecture dependent,
undermining RD analysis benefits. But for parallel programs with
symmetric threads, CRD profiles vary with architecture tractably: they
change only slightly with cache capacity scaling, and shift predictably
to larger CRD values with core count scaling. This enables analysis of a
large number of multicore configurations from a small set of measured
CRD profiles.
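Under the symmetric-thread assumption above, a CRD profile can be sketched by interleaving the per-thread address streams and histogramming LRU stack distances over the merged stream. This is a minimal illustration, not the paper's measurement tool: round-robin interleaving is a simplifying assumption (real CRD measurement follows the actual thread interleaving), and the tiny traces are made up. The example also shows the predictable shift to larger CRD values as core count grows.

```python
from collections import Counter, OrderedDict
from itertools import zip_longest

def crd_profile(thread_traces):
    """Histogram of concurrent reuse distances over the interleaved stream.
    First-time (cold) accesses are not counted."""
    # Round-robin interleave the per-thread reference streams.
    merged = [a for group in zip_longest(*thread_traces)
                for a in group if a is not None]
    stack, hist = OrderedDict(), Counter()   # LRU stack, CRD histogram
    for addr in merged:
        if addr in stack:
            keys = list(stack)
            hist[len(keys) - 1 - keys.index(addr)] += 1  # distinct addrs since last use
            del stack[addr]
        stack[addr] = True                   # move addr to top of LRU stack
    return hist

# One thread alone: the reuse of "a" spans 1 distinct address.
print(crd_profile([["a", "b", "a"]]))                     # Counter({1: 1})
# Two symmetric threads: interleaving dilates the distance to 3.
print(crd_profile([["a", "b", "a"], ["x", "y", "x"]]))    # Counter({3: 2})
```

The second call illustrates the core-count-scaling effect the paragraph describes: the thread's own reuses are unchanged, but references from the other thread push each reuse further down the shared LRU stack, shifting the profile to larger CRD values.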
This paper investigates using RD analysis to efficiently analyze
multicore cache performance for parallel programs, making several
contributions. First, we characterize how CRD profiles change with core
count and cache capacity. One of our findings is that core count scaling
degrades locality, but the degradation only impacts last-level caches
(LLCs) below 16MB for our benchmarks and problem sizes, increasing to
128MB if problem size scales by 64x. Second, we apply reference groups
to predict CRD profiles across core count scaling, and evaluate
prediction accuracy. Finally, we use CRD profiles to analyze multicore
cache performance. We find predicted CRD profiles can estimate LLC MPKI
within 76% of simulation for configurations without pathological cache
conflicts in 1/1200th the time to perform simulation of the full design
space