
1,347 research outputs found

A Study on Performance and Power Efficiency of Dense Non-Volatile Caches in Multi-Core Systems

Author: Arjomand Mohammad
Das Chita R.
Jadidi Amin
Kandemir Mahmut T.
Publication venue: 'Association for Computing Machinery (ACM)'
Publication date: 17/04/2017

In this paper, we present a novel cache design based on Multi-Level Cell Spin-Transfer Torque RAM (MLC STTRAM) that can dynamically adapt the set capacity and associativity to efficiently use the full potential of MLC STTRAM. We exploit the asymmetric nature of the MLC storage scheme to build cache lines featuring heterogeneous performance; that is, half of the cache lines are read-friendly, while the other half is write-friendly. Furthermore, we propose to opportunistically deactivate ways in underutilized sets to convert MLC to Single-Level Cell (SLC) mode, which features overall better performance and lifetime. Our ultimate goal is to build a cache architecture that combines the capacity advantages of MLC and the performance/energy advantages of SLC. Our experiments show an improvement of 43% in the total number of conflict misses, 27% in memory access latency, 12% in system performance, and 26% in LLC access energy, with a slight degradation in cache lifetime (about 7%) compared to an SLC cache.

arXiv.org e-Print Archive

Crossref

Cost-Effective Cache Deployment in Mobile Heterogeneous Networks

Author: Shen Xuemin
Yang Peng
Zhang Ning
Zhang Shan
Publication date: 11/07/2017

This paper investigates one of the fundamental issues in cache-enabled heterogeneous networks (HetNets): how many cache instances should be deployed at different base stations in order to provide guaranteed service in a cost-effective manner. Specifically, we consider two-tier HetNets with hierarchical caching, where the most popular files are cached at small cell base stations (SBSs) while the less popular ones are cached at macro base stations (MBSs). For a given network cache deployment budget, the cache sizes for MBSs and SBSs are optimized to maximize network capacity while satisfying the file transmission rate requirements. As the cache sizes of MBSs and SBSs affect the traffic load distribution, inter-tier traffic steering is also employed for load balancing. Based on stochastic geometry analysis, the optimal cache sizes for MBSs and SBSs are obtained, which are threshold-based with respect to the cache budget in networks constrained by SBS backhauls. Simulation results are provided to evaluate the proposed schemes and demonstrate their application in cost-effective network deployment.
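The two-tier placement described above can be illustrated with a minimal sketch. It assumes a request-popularity vector that sums to 1 and cache sizes counted in whole files; the function name `tier_hit_ratios` and its parameters are hypothetical and do not come from the paper, which derives its results analytically with stochastic geometry.

```python
def tier_hit_ratios(popularity, sbs_cache_size, mbs_cache_size):
    """Split request probability across tiers under hierarchical caching.

    Assumption (not from the paper): `popularity` holds per-file request
    probabilities summing to 1, and cache sizes are counted in files.
    The most popular files go to the SBS tier, the next most popular to
    the MBS tier, and the remainder is served from neither cache.
    """
    ranked = sorted(popularity, reverse=True)
    sbs_hit = sum(ranked[:sbs_cache_size])
    mbs_hit = sum(ranked[sbs_cache_size:sbs_cache_size + mbs_cache_size])
    uncached = 1.0 - sbs_hit - mbs_hit
    return sbs_hit, mbs_hit, uncached


if __name__ == "__main__":
    # Four files, one cached at the SBS tier and two at the MBS tier.
    print(tier_hit_ratios([0.5, 0.25, 0.15, 0.1], 1, 2))
```

Shifting budget between the two cache sizes changes how much traffic each tier absorbs, which is the load-distribution effect the abstract refers to.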

arXiv.org e-Print Archive

Crossref

Texas A&M University - Corpus Christi: DSpace Repository

Modeling Data Center Co-Tenancy Performance Interference

Author: Kuang Wei
Publication venue: Digital Commons @ Michigan Tech
Publication date: 01/01/2018

A multi-core machine allows several applications to execute simultaneously. Those jobs are scheduled on different cores and compete for shared resources such as the last-level cache and memory bandwidth. Such competition may cause performance degradation. Data centers often utilize virtualization to provide a certain level of performance isolation. However, some of the shared resources cannot be divided, even in a virtualized system, to ensure complete isolation. If the performance degradation of co-tenancy is not known to the cloud administrator, a data center often has to dedicate a whole machine to a latency-sensitive application to guarantee its quality of service. Co-run scheduling attempts to make good use of resources by scheduling compatible jobs onto one machine while maintaining their service level agreements. An ideal co-run scheduling scheme requires accurate contention modeling. Recent studies of co-run modeling and scheduling have made steady progress in predicting performance for two co-run applications sharing a specific system. This thesis advances co-tenancy modeling in three aspects. First, with an accurate co-run model for one system at hand, we propose a regression model to transfer the knowledge and create a model for a new system with a different hardware configuration. Second, by examining those programs that yield high prediction errors, we further leverage clustering techniques to create a model for each group of applications that show similar behavior. Clustering helps improve the prediction accuracy of those pathological cases. Third, existing research typically focuses on modeling two-application co-run cases. We extend a two-core model to a three- and four-core model by introducing a light-weight micro-kernel that emulates a complicated benchmark through program instrumentation. 
Our experimental evaluation shows that our cross-architecture model achieves an average prediction error of less than 2% for pairwise co-runs across the SPECCPU2006 benchmark suite. For co-tenancy of more than two applications, we show that our model is more scalable and can achieve an average prediction error of 2-3%.
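As a rough illustration of the cross-system transfer idea, the sketch below fits a one-variable least-squares line that maps slowdowns predicted for a known system to slowdowns measured on a new one, and reports the average relative prediction error. The linear form, the single feature, and the names `fit_line` and `mean_relative_error` are assumptions made for illustration; the abstract does not specify the thesis's actual regression model.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of ys ~ slope * xs + intercept.

    Assumption: xs are slowdowns predicted by the existing model on the
    known system, ys are slowdowns measured on the new system.
    """
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def mean_relative_error(predicted, measured):
    """Average of |predicted - measured| / measured over all pairs."""
    return sum(abs(p - m) / m for p, m in zip(predicted, measured)) / len(measured)


if __name__ == "__main__":
    known = [1.05, 1.20, 1.40, 1.10]   # hypothetical slowdowns on system A
    new = [1.08, 1.30, 1.55, 1.15]     # hypothetical slowdowns on system B
    slope, intercept = fit_line(known, new)
    transferred = [slope * x + intercept for x in known]
    print(mean_relative_error(transferred, new))
```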

Michigan Technological University

A performance focused, development friendly and model aided parallelization strategy for scientific applications

Author: Joshi Anagha S.
Publication venue: Clemson University Libraries
Publication date: 01/12/2016

The amelioration of high performance computing platforms has provided unprecedented computing power with the evolution of multi-core CPUs, massively parallel architectures such as General Purpose Graphics Processing Units (GPGPUs) and Many Integrated Core (MIC) architectures such as Intel's Xeon Phi coprocessor. However, it is a great challenge to leverage the capabilities of such advanced supercomputing hardware, as it requires efficient and effective parallelization of scientific applications. This task is difficult mainly due to the complexity of scientific algorithms coupled with the variety of available hardware and disparate programming models. To address the aforementioned challenges, this thesis presents a parallelization strategy to accelerate scientific applications that maximizes the opportunities for achieving speedup while minimizing the development effort. Parallelization is a three-step process: (1) choose a compatible combination of architecture and parallel programming language, (2) translate the base code/algorithm to a parallel language, and (3) optimize and tune the application. In this research, a quantitative comparison of run time for various implementations of the k-means algorithm is used to establish that native languages (OpenMP, MPI, CUDA) perform better on their respective architectures than vendor-neutral languages such as OpenCL. A qualitative model is used to select an optimal architecture for a given application by aligning the capabilities of accelerators with the characteristics of the application. Once the optimal architecture is chosen, the corresponding native language is employed. This approach provides the best performance with reasonable accuracy (78%) in predicting a fitting combination, while eliminating the need to explore different architectures individually. It reduces the required development effort considerably, as the application need not be rewritten in multiple languages. 
The focus can be solely on optimization and tuning to achieve the best performance on available architectures with minimized investment in terms of cost and effort. To verify the prediction accuracy of the qualitative model, the OpenDwarfs benchmark suite, which implements Berkeley's dwarfs in OpenCL, is used. A dwarf is an algorithmic method that captures a pattern of computation and communication. For the purpose of this research, the focus is on 9 applications from various algorithmic domains that cover the seven dwarfs of symbolic computation, which were identified by Phillip Colella as omnipresent in scientific and engineering applications. To validate the parallelization strategy collectively, a case study is undertaken. This case study involves parallelization of the Lower Upper Decomposition for the Gaussian Elimination algorithm from the linear algebra domain, using conventional trial and error methods as well as the proposed 'Architecture First, Language Later' strategy. The development efforts incurred are contrasted for both methods. The proposed strategy is observed to reduce the development effort by an average of 50%.
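For readers unfamiliar with the case-study kernel, the following is a minimal serial sketch of LU decomposition by Gaussian elimination (Doolittle form, no pivoting). It is not the thesis's implementation; the function name `lu_decompose` is hypothetical, and the absence of pivoting is a simplifying assumption.

```python
def lu_decompose(a):
    """Return (lower, upper) such that lower * upper equals the square matrix a.

    Doolittle form: `lower` has a unit diagonal. No pivoting is done, so a
    zero pivot raises an error instead of being handled by row exchange.
    """
    n = len(a)
    lower = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    upper = [[float(v) for v in row] for row in a]
    for k in range(n):
        if upper[k][k] == 0.0:
            raise ZeroDivisionError("zero pivot; this sketch does no pivoting")
        # Rows below the pivot are updated independently of one another.
        for i in range(k + 1, n):
            factor = upper[i][k] / upper[k][k]
            lower[i][k] = factor
            for j in range(k, n):
                upper[i][j] -= factor * upper[k][j]
    return lower, upper


if __name__ == "__main__":
    lower, upper = lu_decompose([[4, 3], [6, 3]])
    print(lower)
    print(upper)
```

For a fixed pivot `k`, the row updates in the loop over `i` do not depend on each other, which is the kind of data parallelism that OpenMP, MPI, or CUDA versions of this kernel typically exploit.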

Clemson University: TigerPrints

JCSP: Joint Caching and Service Placement for Edge Computing Systems

Author: Casale G
Gao Y
Publication venue: 'Center for Open Science'
Publication date: 24/04/2022

With constrained resources, what, where, and how to cache at the edge is one of the key challenges for edge computing systems. The cached items include not only the application data contents but also the local caching of edge services that handle incoming requests. However, current systems separate the contents and services without considering the latency interplay of caching and queueing. Therefore, in this paper, we propose a novel class of stochastic models that enable the joint optimization of content caching and service placement decisions. We first explain how to apply layered queueing network (LQN) models to edge service placement and show that combining this with genetic algorithms provides higher accuracy in resource allocation than an established baseline. Next, we extend LQNs with caching components to establish a joint modeling method for content caching and service placement (JCSP) and present analytical methods to analyze the resulting model. Finally, we simulate real-world Azure traces to evaluate the JCSP method and find that JCSP achieves up to a 35% improvement in response time and a 500 MB reduction in memory usage compared with baseline heuristics for edge caching resource allocation.

Spiral - Imperial College Digital Repository

NEUROSURGERY ENTHUSIASTIC WOMEN SOCIETY

IMP: Indirect Memory Prefetcher

Author: Devadas Srinivas
Hughes Christopher J.
Satish Nadathur
Yu Xiangyao
Publication venue: 'Association for Computing Machinery (ACM)'
Publication date: 01/12/2015

Machine learning, graph analytics and sparse linear algebra-based applications are dominated by irregular memory accesses resulting from following edges in a graph or non-zero elements in a sparse matrix. These accesses have little temporal or spatial locality, and thus incur long memory stalls and large bandwidth requirements. A traditional streaming or striding prefetcher cannot capture these irregular access patterns. A majority of these irregular accesses come from indirect patterns of the form A[B[i]]. We propose an efficient hardware indirect memory prefetcher (IMP) to capture this access pattern and hide latency. We also propose a partial cacheline accessing mechanism for these prefetches to reduce the network and DRAM bandwidth pressure from the lack of spatial locality. Evaluated on 7 applications, IMP shows 56% speedup on average (up to 2.3×) compared to a baseline 64-core system with streaming prefetchers. This is within 23% of an idealized system. With partial cacheline accessing, we see another 9.4% speedup on average (up to 46.6%). Intel Science and Technology Center for Big Dat
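The indirect pattern A[B[i]] mentioned above can be shown in software form. The sketch below also computes, for a chosen lookahead distance, the addresses that a prefetcher could request once it has read the index array ahead of the current position. The function names, the fixed `distance` parameter, and the address formula `base_addr + index * elem_size` are illustrative assumptions, not a description of the IMP hardware.

```python
def indirect_gather(a, b):
    """Software form of the indirect pattern: read a[b[i]] for each i."""
    return [a[index] for index in b]


def prefetch_addresses(b, base_addr, elem_size, distance):
    """Addresses of a[b[i + distance]] for each position i in b.

    Assumption: b is read sequentially, so b[i + distance] is available
    ahead of time, and element k of a lives at base_addr + k * elem_size.
    """
    return [base_addr + b[i + distance] * elem_size
            for i in range(len(b) - distance)]


if __name__ == "__main__":
    values = [10, 20, 30, 40]
    indices = [3, 0, 2, 1]
    print(indirect_gather(values, indices))
    print(prefetch_addresses(indices, base_addr=1000, elem_size=8, distance=2))
```

The index array is accessed with unit stride, so a streaming prefetcher can cover it; the data-dependent targets in the second array are what such a prefetcher misses.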

DSpace@MIT

Crossref