238 research outputs found

    Pervasive Parallel And Distributed Computing In A Liberal Arts College Curriculum

    Get PDF
    We present a model for incorporating parallel and distributed computing (PDC) throughout an undergraduate CS curriculum. Our curriculum is designed to introduce students early to parallel and distributed computing topics and to expose students to these topics repeatedly in the context of a wide variety of CS courses. The key to our approach is the development of a required intermediate-level course that serves as a introduction to computer systems and parallel computing. It serves as a requirement for every CS major and minor and is a prerequisite to upper-level courses that expand on parallel and distributed computing topics in different contexts. With the addition of this new course, we are able to easily make room in upper-level courses to add and expand parallel and distributed computing topics. The goal of our curricular design is to ensure that every graduating CS major has exposure to parallel and distributed computing, with both a breadth and depth of coverage. Our curriculum is particularly designed for the constraints of a small liberal arts college, however, much of its ideas and its design are applicable to any undergraduate CS curriculum

    Characterizing and Subsetting Big Data Workloads

    Full text link
    Big data benchmark suites must include a diversity of data and workloads to be useful in fairly evaluating big data systems and architectures. However, using truly comprehensive benchmarks poses great challenges for the architecture community. First, we need to thoroughly understand the behaviors of a variety of workloads. Second, our usual simulation-based research methods become prohibitively expensive for big data. As big data is an emerging field, more and more software stacks are being proposed to facilitate the development of big data applications, which aggravates hese challenges. In this paper, we first use Principle Component Analysis (PCA) to identify the most important characteristics from 45 metrics to characterize big data workloads from BigDataBench, a comprehensive big data benchmark suite. Second, we apply a clustering technique to the principle components obtained from the PCA to investigate the similarity among big data workloads, and we verify the importance of including different software stacks for big data benchmarking. Third, we select seven representative big data workloads by removing redundant ones and release the BigDataBench simulation version, which is publicly available from http://prof.ict.ac.cn/BigDataBench/simulatorversion/.Comment: 11 pages, 6 figures, 2014 IEEE International Symposium on Workload Characterizatio

    Improving capacity-performance tradeoffs in the storage tier

    Get PDF
    Data-set sizes are growing. New techniques are emerging to organize and analyze these data-sets. There is a key access pattern emerging with these new techniques, large sequential file accesses. The trend toward bigger files exists to help amortize the cost of data accesses from the storage layer, as many workloads are recognized to be I/O bound. The storage layer is widely recognized as the slowest layer in the system. This work focuses on the tradeoff one can make with that storage capacity to improve system performance. ^ Capacity can be leveraged for improved availability or improved performance. This tradeoff is key in the storage layer, as this allows for data loss prevention and bandwidth aggregation. Typically these tradeoffs do not allow much choice with regard to capacity use. This work will leverage replication as the enabling mechanism to improve the capacity-performance tradeoff in the storage tier, while still providing for availability. ^ This capacity-performance tradeoff can be made at both the local and distributed file system level. I propose two techniques that allow for an improved tradeoff of capacity. The local file system can be employed on scale-out or scale-up infrastructures to improve performance. The distributed file system is targeted at distributed frameworks, such as MapReduce, to improve the cluster performance. The local file system design is MorphStore, and the distributed file system is BoostDFS. ^ MorphStore is a file system that significantly improves performance when accessing large files by using two innovations. MorphStore combines (a) load-adaptive I/O access scheduling to dynamically optimize throughput (aggregation), and (b) utility-xiii driven replication to best use capacity for performance. Additionally, adaptive-access scheduling can be utilized to optimize scheduling of requests (for throughput) on systems with a large number of storage devices. Replication is utilized to make available high utility files and then optimize throughput of these high utility files based on system load. ^ BoostDFS is a distributed file system that allows a better capacity-performance tradeoff via inter-node file replication. BoostDFS is built on the observation that distributed file systems currently inter-node replication for availability, but provide no mechanism to further improve performance. Replication for availability provides diminishing returns on performance, this is due to saturation of locality. BoostDFS exploits the common by improving I/O performance of these local tasks. This is done via intra-node replication by leveraging MorphStore as the local file system. This technique allows for capacity to be traded for availability as well as performance, with a small capacity overhead under constant availability. ^ Both MorphStore and BoostDFS utilize replication. Replication allows for both bandwidth aggregation and availability, This work primarily focuses on the performance utility of replication, but does not sacrifice availability in the process. These techniques provide an improved capacity-performance tradeoff while allowing the desired level of availability
    • …
    corecore