263 research outputs found

    BigDataBench: a Big Data Benchmark Suite from Internet Services

    Full text link
    As architecture, systems, and data management communities pay greater attention to innovative big data systems and architectures, the pressure of benchmarking and evaluating these systems rises. Considering the broad use of big data systems, big data benchmarks must include diversity of data and workloads. Most of the state-of-the-art big data benchmarking efforts target evaluating specific types of applications or system software stacks, and hence they are not qualified for serving the purposes mentioned above. This paper presents our joint research efforts on this issue with several industrial partners. Our big data benchmark suite BigDataBench not only covers broad application scenarios, but also includes diverse and representative data sets. BigDataBench is publicly available from http://prof.ict.ac.cn/BigDataBench . Also, we comprehensively characterize 19 big data workloads included in BigDataBench with varying data inputs. On a typical state-of-practice processor, Intel Xeon E5645, we have the following observations: First, in comparison with the traditional benchmarks: including PARSEC, HPCC, and SPECCPU, big data applications have very low operation intensity; Second, the volume of data input has non-negligible impact on micro-architecture characteristics, which may impose challenges for simulation-based big data architecture research; Last but not least, corroborating the observations in CloudSuite and DCBench (which use smaller data inputs), we find that the numbers of L1 instruction cache misses per 1000 instructions of the big data applications are higher than in the traditional benchmarks; also, we find that L3 caches are effective for the big data applications, corroborating the observation in DCBench.Comment: 12 pages, 6 figures, The 20th IEEE International Symposium On High Performance Computer Architecture (HPCA-2014), February 15-19, 2014, Orlando, Florida, US

    Unsupervised Anomaly-based Malware Detection using Hardware Features

    Get PDF
    Recent works have shown promise in using microarchitectural execution patterns to detect malware programs. These detectors belong to a class of detectors known as signature-based detectors as they catch malware by comparing a program's execution pattern (signature) to execution patterns of known malware programs. In this work, we propose a new class of detectors - anomaly-based hardware malware detectors - that do not require signatures for malware detection, and thus can catch a wider range of malware including potentially novel ones. We use unsupervised machine learning to build profiles of normal program execution based on data from performance counters, and use these profiles to detect significant deviations in program behavior that occur as a result of malware exploitation. We show that real-world exploitation of popular programs such as IE and Adobe PDF Reader on a Windows/x86 platform can be detected with nearly perfect certainty. We also examine the limits and challenges in implementing this approach in face of a sophisticated adversary attempting to evade anomaly-based detection. The proposed detector is complementary to previously proposed signature-based detectors and can be used together to improve security.Comment: 1 page, Latex; added description for feature selection in Section 4, results unchange

    DRackSim: Simulator for Rack-scale Memory Disaggregation

    Full text link
    Memory disaggregation has emerged as an alternative to traditional server architecture in data centers. This paper introduces DRackSim, a simulation infrastructure to model rack-scale hardware disaggregated memory. DRackSim models multiple compute nodes, memory pools, and a rack-scale interconnect similar to GenZ. An application-level simulation approach simulates an x86 out-of-order multi-core processor with a multi-level cache hierarchy at compute nodes. A queue-based simulation is used to model a remote memory controller and rack-level interconnect, which allows both cache-based and page-based access to remote memory. DRackSim models a central memory manager to manage address space at the memory pools. We integrate community-accepted DRAMSim2 to perform memory simulation at local and remote memory using multiple DRAMSim2 instances. An incremental approach is followed to validate the core and cache subsystem of DRackSim with that of Gem5. We measure the performance of various HPC workloads and show the performance impact for different nodes/pools configuration

    Scalable system software for high performance large-scale applications

    Get PDF
    In the last decades, high-performance large-scale systems have been a fundamental tool for scientific discovery and engineering advances. The sustained growth of supercomputing performance and the concurrent reduction in cost have made this technology available for a large number of scientists and engineers working on many different problems. The design of next-generation supercomputers will include traditional HPC requirements as well as the new requirements to handle data-intensive computations. Data intensive applications will hence play an important role in a variety of fields, and are the current focus of several research trends in HPC. Due to the challenges of scalability and power efficiency, next-generation of supercomputers needs a redesign of the whole software stack. Being at the bottom of the software stack, system software is expected to change drastically to support the upcoming hardware and to meet new application requirements. This PhD thesis addresses the scalability of system software. The thesis start at the Operating System level: first studying general-purpose OS (ex. Linux) and then studying lightweight kernels (ex. CNK). Then, we focus on the runtime system: we implement a runtime system for distributed memory systems that includes many of the system services required by next-generation applications. Finally we focus on hardware features that can be exploited at user-level to improve applications performance, and potentially included into our advanced runtime system. The thesis contributions are the following: Operating System Scalability: We provide an accurate study of the scalability problems of modern Operating Systems for HPC. We design and implement a methodology whereby detailed quantitative information may be obtained for each OS noise event. We validate our approach by comparing it to other well-known standard techniques to analyze OS noise, such FTQ (Fixed Time Quantum). Evaluation of the address translation management for a lightweight kernel: we provide a performance evaluation of different TLB management approaches ¿ dynamic memory mapping, static memory mapping with replaceable TLB entries, and static memory mapping with fixed TLB entries (no TLB misses) on a IBM BlueGene/P system. Runtime System Scalability: We show that a runtime system can efficiently incorporate system services and improve scalability for a specific class of applications. We design and implement a full-featured runtime system and programming model to execute irregular appli- cations on a commodity cluster. The runtime library is called Global Memory and Threading library (GMT) and integrates a locality-aware Partitioned Global Address Space communication model with a fork/join program structure. It supports massive lightweight multi-threading, overlapping of communication and computation and small messages aggregation to tolerate network latencies. We compare GMT to other PGAS models, hand-optimized MPI code and custom architectures (Cray XMT) on a set of large scale irregular applications: breadth first search, random walk and concurrent hash map access. Our runtime system shows performance orders of magnitude higher than other solutions on commodity clusters and competitive with custom architectures. User-level Scalability Exploiting Hardware Features: We show the high complexity of low-level hardware optimizations for single applications, as a motivation to incorporate this logic into an adaptive runtime system. We evaluate the effects of controllable hardware-thread priority mechanism that controls the rate at which each hardware-thread decodes instruction on IBM POWER5 and POWER6 processors. Finally, we show how to effectively exploits cache locality and network-on-chip on the Tilera many-core architecture to improve intra-core scalability

    Early Experience on Using Knights Landing Processors for Lattice Boltzmann Applications

    Full text link
    The Knights Landing (KNL) is the codename for the latest generation of Intel processors based on Intel Many Integrated Core (MIC) architecture. It relies on massive thread and data parallelism, and fast on-chip memory. This processor operates in standalone mode, booting an off-the-shelf Linux operating system. The KNL peak performance is very high - approximately 3 Tflops in double precision and 6 Tflops in single precision - but sustained performance depends critically on how well all parallel features of the processor are exploited by real-life applications. We assess the performance of this processor for Lattice Boltzmann codes, widely used in computational fluid-dynamics. In our OpenMP code we consider several memory data-layouts that meet the conflicting computing requirements of distinct parts of the application, and sustain a large fraction of peak performance. We make some performance comparisons with other processors and accelerators, and also discuss the impact of the various memory layouts on energy efficiency

    Implementing and evaluating graph algorithms for long vector architectures

    Get PDF
    High-Performance Computing can be accelerated using long-vector architectures. However, creating efficient coding implementations for these architectures can be challenging. This Master's thesis focuses on implementing four well-known and widely-used graph processing algorithms using the RISC-V Vector Extension, leveraging an experimental system in an FPGA. I present a graph storage format that benefits from long vectors and describe how these four algorithms can be rewritten to utilize it. This thesis also introduces an instrumentation tool for FPGA that I developed to link the output of electrical engineering software with performance analysis tools for HPC. This tool allows users to visualize information coming from the logic analyzer internal to the FPGA with powerful visualization tools, permitting fine-grain analysis of the FPGA signals correlated with the code running on it. This tool has been integrated into the experimental performance analysis tools of BSC. In this thesis I leverage this tool to analyze and improve my implementations of graph algorithms for long-vector architectures, collecting the process and thoughts behind each optimization. Finally, I compare the performance of my vector implementations with other machines, such as the NEC SX-Aurora, a commercial RISC-V board, and an Intel chip
    corecore