68 research outputs found

    Scale-Out Processors

    Get PDF
    Global-scale online services, such as Google's Web search and Facebook's social networking, run in large-scale datacenters. Due to their massive scale, these services are designed to scale out (or distribute) their respective loads and datasets across thousands of servers. The growing demand for online services has forced service providers to build networks of datacenters, which require an enormous capital outlay for infrastructure, hardware, and power. Consequently, efficiency has become a major concern in the design and operation of such datacenters, with processor efficiency being of utmost importance due to the significant contribution of processors to overall datacenter performance and cost. Scale-out workloads, which underlie today's online services, serve independent requests, have large instruction footprints, and exhibit little data locality. As such, they benefit from processor designs that feature many cores, a modestly sized Last-Level Cache (LLC) with a fast access path, and high-bandwidth interfaces to memory. Existing server-class processors with large LLCs and a handful of aggressive out-of-order cores are inefficient at executing scale-out workloads, and their scaling trajectory leads to even lower efficiency in future technology nodes. This thesis presents a family of throughput-optimal processors, called Scale-Out Processors, for the efficient execution of scale-out workloads. A unique feature of Scale-Out Processors is that they consist of multiple stand-alone modules, called pods, each of which is a server running an operating system and a full software stack. To design a throughput-optimal processor, we developed a methodology based on performance density, defined as throughput per unit area, to quantify how effectively an architecture uses the silicon real estate. The proposed methodology derives a performance-density-optimal processor building block (i.e., pod), which tightly couples a number of cores to a small LLC via a fast interconnect. Scale-Out Processors simply consist of multiple pods with no inter-pod connectivity or coherence. They deliver the highest throughput in today's technology and afford near-ideal scalability as process technology advances. We demonstrate that Scale-Out Processors improve datacenter efficiency by 4.4x-7.1x over datacenters designed using existing server-class processors.
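The performance-density methodology can be sketched as a search over candidate pod configurations, keeping the one with the highest throughput per unit area. The cost and throughput models below are illustrative placeholders, not figures from the thesis:

```python
# Toy sketch of a performance-density search. All area and throughput
# numbers are hypothetical; the thesis derives these from real designs.

def performance_density(cores, llc_mb, core_area=2.0, llc_area_per_mb=1.5,
                        interconnect_area=0.5):
    """Throughput per unit silicon area for one pod configuration."""
    # Toy throughput model: per-core throughput degrades once the shared
    # LLC capacity per core drops below the workload's footprint.
    per_core = min(1.0, llc_mb / (cores * 0.5))
    throughput = cores * per_core
    area = cores * core_area + llc_mb * llc_area_per_mb + interconnect_area
    return throughput / area

# Enumerate candidate (core count, LLC size) pods and pick the optimum.
candidates = [(c, m) for c in (4, 8, 16, 32) for m in (2, 4, 8)]
best = max(candidates, key=lambda cm: performance_density(*cm))
print("performance-density-optimal pod (cores, LLC MB):", best)
```

A full chip is then just the optimal pod replicated across the die, which is why the design scales near-ideally with process technology.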

    BDGS: A Scalable Big Data Generator Suite in Big Data Benchmarking

    Full text link
    Data generation is a key issue in big data benchmarking: the goal is to generate application-specific data sets that meet the 4V requirements of big data. Specifically, big data generators need to generate scalable data (Volume) of different types (Variety) under controllable generation rates (Velocity) while keeping the important characteristics of raw data (Veracity). This raises new challenges in designing efficient and effective generators. To date, most existing techniques can only generate limited types of data and support specific big data systems such as Hadoop. We therefore developed a tool, called Big Data Generator Suite (BDGS), to efficiently generate scalable big data while employing data models derived from real data to preserve data veracity. We demonstrate the effectiveness of BDGS by developing six data generators covering three representative data types (structured, semi-structured, and unstructured) and three data sources (text, graph, and table data).
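The volume/veracity idea can be illustrated with a minimal sketch (not actual BDGS code; all names are hypothetical): fit an empirical word-frequency model to a small seed corpus of real data, then emit an arbitrary volume of synthetic documents drawn from that model.

```python
# Illustrative text-data generator: a frequency model derived from real
# seed data preserves one simple "veracity" characteristic (word
# distribution), while the generator scales output volume arbitrarily.
import random

def build_model(seed_corpus):
    """Empirical word-frequency model derived from real seed data."""
    freq = {}
    for doc in seed_corpus:
        for word in doc.split():
            freq[word] = freq.get(word, 0) + 1
    words = sorted(freq)
    weights = [freq[w] for w in words]
    return words, weights

def generate(model, n_docs, words_per_doc, seed=42):
    """Emit n_docs synthetic documents sampled from the model."""
    words, weights = model
    rng = random.Random(seed)
    for _ in range(n_docs):
        yield " ".join(rng.choices(words, weights=weights, k=words_per_doc))

seed_corpus = ["big data benchmarking", "data generation for big data systems"]
model = build_model(seed_corpus)
docs = list(generate(model, n_docs=3, words_per_doc=5))
```

A rate limiter around `generate` would add the Velocity control; the real suite uses richer models (e.g., for graph and table data) per data source.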

    The Implications of Diverse Applications and Scalable Data Sets in Benchmarking Big Data Systems

    Full text link
    We now live in an era of big data, and big data applications are becoming increasingly pervasive. How to benchmark datacenter computer systems running big data applications (in short, big data systems) is a hot topic. In this paper, we focus on measuring the performance impact of diverse applications and scalable volumes of data sets on big data systems. For four typical data analysis applications, an important class of big data applications, our experiments yield two major results. First, data scale has a significant impact on the performance of big data systems, so big data benchmarks must provide scalable volumes of data sets. Second, even though all four applications use simple algorithms, their performance trends differ as data scale increases; hence, benchmarking big data systems must consider variety of applications as well as variety of data sets.
    Comment: 16 pages, 3 figures
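The second finding, that performance trends diverge as data scale grows, can be illustrated with a toy cost model (this is not the paper's experiment): two simple "applications" over the same data set, a linear scan and a sort, have different asymptotic costs, so their relative ranking depends on the data volume at which they are measured.

```python
# Toy operation-count model showing why single-scale measurements can
# mis-rank applications: the scan/sort cost ratio grows with data scale.
import math

def scan_cost(n):
    """e.g. grep or word count: one pass over the data, O(n)."""
    return n

def sort_cost(n):
    """e.g. sort: comparison-based, O(n log n)."""
    return int(n * math.log2(n))

scales = [10**3, 10**4, 10**5]
ratios = [sort_cost(n) / scan_cost(n) for n in scales]
# The ratio keeps growing, so a benchmark fixed at one scale would
# understate the gap between the two applications at larger scales.
```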

    Comprehensive characterization of an open source document search engine

    Get PDF
    This work performs a thorough characterization and analysis of the open source Lucene search library. The article describes in detail the architecture, functionality, and micro-architectural behavior of the search engine, and investigates prominent online document search research issues. In particular, we study how intra-server index partitioning affects response time and throughput, explore the potential use of low-power servers for document search, and examine the sources of performance degradation and the causes of tail latencies. Some of our main conclusions are the following: (a) intra-server index partitioning can reduce tail latencies, but with diminishing benefits as incoming query traffic increases; (b) low-power servers, given enough partitioning, can provide the same average and tail response times as conventional high-performance servers; (c) index search is a CPU-intensive, cache-friendly application; and (d) C-states are the main culprits for performance degradation in document search.
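Intra-server index partitioning, as studied above, can be sketched with a toy in-memory index: the document set is split across shards, each shard is searched independently (Lucene would search its partitions in parallel, shrinking the per-shard work that dominates tail latency), and the per-shard hits are merged into a global top-k. The code below is an illustrative mock-up, not Lucene's API:

```python
# Toy intra-server index partitioning: round-robin sharding, per-shard
# term-frequency scoring, and a top-k merge across shards.
from heapq import nlargest

def partition(docs, n_shards):
    """Round-robin split of (doc_id, text) pairs across shards."""
    shards = [[] for _ in range(n_shards)]
    for i, doc in enumerate(docs):
        shards[i % n_shards].append(doc)
    return shards

def search_shard(shard, term):
    """Score = term frequency; returns (score, doc_id) hits for one shard."""
    return [(text.split().count(term), doc_id)
            for doc_id, text in shard if term in text.split()]

def search(shards, term, k=3):
    """Search every shard and merge per-shard hits into a global top-k."""
    hits = [h for shard in shards for h in search_shard(shard, term)]
    return nlargest(k, hits)

docs = [(i, f"doc {i} about search " + "search " * (i % 3)) for i in range(8)]
shards = partition(docs, n_shards=4)
top = search(shards, "search")  # [(3, 5), (3, 2), (2, 7)]
```

In a real engine each `search_shard` call runs on its own core, which is why partitioning helps tails only until query traffic saturates those cores.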

    Power Reduction Achieved by Applying a Register Cache to the STRAIGHT Architecture

    Get PDF
    Microprocessors, which underpin the information society, have continued to advance alongside semiconductor technology. The number of transistors on a chip has grown with progress in integration technology, and architectural techniques have continued to improve by exploiting these transistor resources. Around 2000, power consumption and wire delay were identified as problems, and transistor resources that had previously gone into wider pipelines and wider issue/execution widths were redirected to the execution resources of the processor itself, such as core counts and memory bandwidth. However, it has been pointed out that the many-core approach of increasing core counts is also reaching its limits in performance scaling. In addition, there is the so-called dark silicon problem, in which not all transistors on a chip can be powered simultaneously; if the core-count approach continues, more than half of the transistors on a chip are projected to be idle at any given time. Furthermore, to fundamentally rethink conventional architectures and sustain processor performance growth, each core's single-thread performance must be raised while keeping power in check.
    To achieve the performance/power ratio needed to overcome the dark silicon problem, we have proposed the STRAIGHT architecture, which combines lightweight control with high single-thread capability. By providing a logical register space large enough to follow a write-once manner, STRAIGHT eliminates register renaming, a major control mechanism in conventional processors. Its abundant physical registers also free it from free-list management, a source of power overhead, and it aims to improve single-thread performance by raising functional-unit utilization through a wider instruction window. An initial evaluation showed a 30% performance improvement and an 18% improvement in performance/power; however, that evaluation used an existing processor as a stand-in for STRAIGHT.
    In this work, through a detailed evaluation on a dedicated STRAIGHT simulator, we evaluate optimizations of the register file and the scheduler, the main power overheads in the STRAIGHT architecture. For the register file, introducing a register cache that assists the reuse of register values via the bypass network reduced the register file's power consumption by 33.1%; for the scheduler, applying a matrix scheduler reduced its power consumption by 47.4%. Adopting both techniques together reduced the power of the execution section by 33.9%.
    University of Electro-Communications, 201
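The register-cache saving reported above can be approximated with a back-of-the-envelope energy model (the energy costs below are illustrative, not measurements from the thesis): hits are served by a small, cheap structure, and only misses pay the full register-file access energy.

```python
# Toy energy model for a register cache in front of a large register file.
# cache_cost and rf_cost are hypothetical per-access energies (RF = 1.0).

def rf_energy_saving(hit_rate, cache_cost=0.2, rf_cost=1.0):
    """Fractional energy saving versus accessing the register file always."""
    # Every access probes the cache; only misses also access the RF.
    with_cache = hit_rate * cache_cost + (1 - hit_rate) * (cache_cost + rf_cost)
    return 1 - with_cache / rf_cost

# Register values that are produced and reused shortly afterwards (as with
# the bypass-assisted reuse described above) give a high hit rate; a low
# hit rate can make the extra cache probe a net loss.
saving = rf_energy_saving(hit_rate=0.5)  # 0.3, i.e. a 30% saving
```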