141 research outputs found

    Towards co-designed optimizations in parallel frameworks: A MapReduce case study

    The explosion of Big Data was followed by the proliferation of numerous complex parallel software stacks whose aim is to tackle the challenges of the data deluge. A drawback of such a multi-layered hierarchical deployment is the inability to maintain and delegate vital semantic information between layers in the stack. Software abstractions increase the semantic distance between an application and its generated code. However, parallel software frameworks contain inherent semantic information that general-purpose compilers are not designed to exploit. This paper presents a case study demonstrating how the specific semantic information of the MapReduce paradigm can be exploited on multicore architectures. MR4J, a MapReduce framework implemented in Java, has been evaluated against hand-optimized C and C++ equivalents. The initial results led to the design of a semantically aware optimizer that runs automatically, without requiring modification to application code. The optimizer speeds up the execution of MR4J by up to 2.0x. The introduced optimization not only improves the performance of the generated code during the map phase, but also reduces the pressure on the garbage collector. This demonstrates how semantic information can be harnessed without sacrificing sound software engineering practices when using parallel software frameworks.
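    As a rough illustration of the kind of map-phase semantic shortcut described above (merging intermediate values as they are emitted, so fewer objects survive to stress the garbage collector), the following Python sketch shows a toy MapReduce driver with an optional combine path. All names are hypothetical and this is not MR4J's Java API.

```python
from collections import defaultdict

def word_count_map(record, emit):
    """Map function: emit (word, 1) for every word in the record."""
    for word in record.split():
        emit(word, 1)

def run_mapreduce(records, map_fn, reduce_fn, combine_fn=None):
    """Toy MapReduce driver. When combine_fn is given, intermediate values are
    merged as soon as they are emitted, so the framework never buffers one
    entry per emitted pair -- the kind of semantic shortcut a general-purpose
    compiler cannot derive on its own."""
    intermediate = defaultdict(list)
    combined = {}

    if combine_fn is None:
        def emit(key, value):
            intermediate[key].append(value)
    else:
        def emit(key, value):
            combined[key] = value if key not in combined else combine_fn(combined[key], value)

    for record in records:
        map_fn(record, emit)

    if combine_fn is not None:
        return dict(combined)
    return {key: reduce_fn(values) for key, values in intermediate.items()}

# Usage: both paths give the same word counts, but the combining path keeps
# far fewer live intermediate objects during the map phase.
records = ["the quick brown fox", "the lazy dog", "the fox"]
print(run_mapreduce(records, word_count_map, sum))
print(run_mapreduce(records, word_count_map, sum, combine_fn=lambda a, b: a + b))
```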

    DReAM: Dynamic Re-arrangement of Address Mapping to Improve the Performance of DRAMs

    The initial placement of data in a DRAM is determined and controlled by the address mapping, and even modern memory controllers use a fixed, run-time-agnostic address-mapping scheme. The memory access pattern seen at the memory interface, however, changes dynamically at run time. The mismatch between this dynamic access behavior and the fixed address mapping in the DRAM controller means that DRAM performance cannot be exploited efficiently. DReAM is a novel hardware technique that detects a workload-specific address mapping at run time, based on the application's access pattern, and thereby improves DRAM performance. The experimental results show that DReAM outperforms the best evaluated address mapping by 9% on average for mapping-sensitive workloads, by 2% for mapping-insensitive workloads, and by up to 28% across all the workloads. DReAM can be seen as an insurance policy capable of detecting which scenarios are not well served by the predefined address mapping.
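    To make the gap between a fixed and a workload-specific mapping concrete, here is a minimal Python sketch: it profiles which physical-address bits toggle between consecutive accesses and uses the most active ones as bank-select bits, so a strided stream spreads over banks instead of hammering one. The profiling heuristic, names, and trace are hypothetical and do not describe DReAM's actual hardware detection mechanism.

```python
def decode_bank(addr, bank_bits):
    """Extract the bank index from a physical address, using the given ordered
    list of address-bit positions as the bank-select bits."""
    bank = 0
    for i, b in enumerate(bank_bits):
        bank |= ((addr >> b) & 1) << i
    return bank

def pick_bank_bits(trace, addr_width=16, bank_width=2):
    """Greatly simplified run-time profiling: choose as bank bits the address
    bits that toggle most often between consecutive accesses."""
    toggles = [0] * addr_width
    for prev, cur in zip(trace, trace[1:]):
        diff = prev ^ cur
        for b in range(addr_width):
            if (diff >> b) & 1:
                toggles[b] += 1
    return sorted(range(addr_width), key=lambda b: toggles[b], reverse=True)[:bank_width]

# A 64-byte-strided trace: the fixed low-order mapping sends every access to
# bank 0, while the profiled mapping alternates banks and exposes parallelism.
trace = [i * 64 for i in range(32)]
naive_bits = [0, 1]                 # fixed, run-time-agnostic mapping
tuned_bits = pick_bank_bits(trace)  # workload-specific mapping
print([decode_bank(a, naive_bits) for a in trace[:8]])
print([decode_bank(a, tuned_bits) for a in trace[:8]])
```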

    HAPPY: Hybrid Address-based Page Policy in DRAMs

    Memory controllers have used static page-closure policies to decide whether a row should be left open (open-page policy) or closed immediately (close-page policy) after it has been accessed. The appropriate choice for a particular access can reduce the average memory latency. However, since application access patterns change at run time, static page policies cannot guarantee optimum execution time. Hybrid page policies have been investigated as a means of covering these dynamic scenarios and are now implemented in state-of-the-art processors. Hybrid page policies switch between the open-page and close-page policies while the application is running, by monitoring the pattern of row hits and conflicts and predicting future behavior. Unfortunately, as DRAM capacity increases, fine-grained tracking and analysis of memory access patterns is no longer practical. We propose a compact, memory-address-based encoding technique which can improve or maintain the performance of DRAM page-closure predictors while reducing the hardware overhead in comparison with state-of-the-art techniques. As a case study, we integrate our technique, HAPPY, with a state-of-the-art monitor, the Intel-adaptive open-page policy predictor employed by the Intel Xeon X5650, and with a traditional hybrid page policy. We evaluate them across 70 memory-intensive workload mixes consisting of single-threaded and multi-threaded applications. The experimental results show that applying the HAPPY encoding to the Intel-adaptive page-closure policy can reduce the hardware overhead by 5X for the evaluated 64 GB memory (and by up to 40X for a 512 GB memory) while maintaining the prediction accuracy.
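    The general idea behind compacting address-indexed predictor state can be sketched as follows: instead of one tracking entry per DRAM row, row addresses are folded into a small table of saturating counters that vote for leaving a row open or closing it. This is an illustrative Python sketch of that general idea only; it is not the HAPPY encoding nor the Intel-adaptive predictor, and all names and sizes are hypothetical.

```python
class CompactPagePolicyPredictor:
    """Toy hybrid page-closure predictor backed by a small hashed counter table."""

    def __init__(self, entries=256, threshold=2):
        self.entries = entries
        self.threshold = threshold
        self.counters = [threshold] * entries   # 2-bit saturating counters

    def _index(self, row_addr):
        # Fold (hash) the row address into the small table.
        return (row_addr ^ (row_addr >> 8) ^ (row_addr >> 16)) % self.entries

    def predict_leave_open(self, row_addr):
        """True -> keep the row open (open-page behavior);
        False -> close it immediately (close-page behavior)."""
        return self.counters[self._index(row_addr)] >= self.threshold

    def update(self, row_addr, was_row_hit):
        i = self._index(row_addr)
        if was_row_hit:
            self.counters[i] = min(self.counters[i] + 1, 3)
        else:
            self.counters[i] = max(self.counters[i] - 1, 0)

# Usage: repeated hits train the predictor to keep the row open; repeated
# conflicts flip it towards closing the row after each access.
pred = CompactPagePolicyPredictor()
for _ in range(4):
    pred.update(row_addr=0x1A2B, was_row_hit=True)
print(pred.predict_leave_open(0x1A2B))   # True
for _ in range(3):
    pred.update(0x1A2B, was_row_hit=False)
print(pred.predict_leave_open(0x1A2B))   # False
```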

    FINESSD: Near-Storage Feature Selection with Mutual Information for Resource-Limited FPGAs

    Feature selection is the data analysis process that selects a smaller, curated subset of the original dataset by filtering out features that are irrelevant or redundant. The most important features can be ranked and selected based on statistical measures such as mutual information. Feature selection not only reduces the size of the dataset and the execution time for training Machine Learning (ML) models, it can also improve the accuracy of the inference. This paper analyses mutual-information-based feature selection for resource-constrained FPGAs and proposes FINESSD, a novel approach that can be deployed for near-storage acceleration. This paper highlights that the Mutual Information Maximization (MIM) algorithm does not require multiple passes over the data and, when approximated appropriately, offers a good trade-off between accuracy and FPGA resources. The new FPGA accelerator for MIM generated by FINESSD can fully utilize the NVMe bandwidth of a modern SSD and perform feature selection without requiring full dataset transfers onto the main processor. The evaluation using a Samsung SmartSSD over small, large, and out-of-core datasets shows that, compared to the mainstream multiprocessing Python ML libraries and an optimized C library, FINESSD yields up to 35x and 19x speedup respectively, while being more than 70x more energy efficient for large, out-of-core datasets.
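    For readers unfamiliar with MIM, a minimal software sketch of the algorithm is shown below: each feature is scored independently by its mutual information with the labels (estimated from a histogram built in a single pass) and the top-k features are kept. This independence is what lets the data be streamed once, which is the property exploited for near-storage acceleration; the sketch is a plain NumPy illustration, not FINESSD's approximated FPGA implementation, and the toy dataset is invented.

```python
import numpy as np

def mutual_information(x, y, bins=16):
    """Estimate I(X; Y) in bits for one feature x against labels y,
    from a joint histogram built in a single pass over the values."""
    joint, _, _ = np.histogram2d(x, y, bins=(bins, len(np.unique(y))))
    pxy = joint / joint.sum()
    px = pxy.sum(axis=1, keepdims=True)
    py = pxy.sum(axis=0, keepdims=True)
    nz = pxy > 0
    return float((pxy[nz] * np.log2(pxy[nz] / (px @ py)[nz])).sum())

def mim_select(X, y, k):
    """Mutual Information Maximization: score every feature independently
    against the labels and keep the indices of the top-k scores."""
    scores = [mutual_information(X[:, j], y) for j in range(X.shape[1])]
    return np.argsort(scores)[::-1][:k]

# Usage on a toy dataset: feature 0 tracks the label, feature 1 is pure noise.
rng = np.random.default_rng(0)
y = rng.integers(0, 2, size=1000)
X = np.column_stack([y + 0.1 * rng.standard_normal(1000),
                     rng.standard_normal(1000)])
print(mim_select(X, y, k=1))   # -> [0]
```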

    Efficient Sharing of Optical Resources in Low-Power Optical Networks-on-Chip

    With the ever-growing core counts in modern computing systems, NoCs consume an increasing share of the power budget due to the bandwidth and power-density limitations of electrical interconnects. To maintain performance and power scaling, alternative technologies such as silicon photonics are required; however, sophisticated network designs are needed to minimize their static power overheads. In this paper, we propose Amon, a low-power ONoC that decreases the number of μRings, wavelengths, and path losses to reduce power consumption. Amon performs destination checking prior to data transmission on an underlying control network, allowing optical resources to be shared between nodes. It improves performance per Watt by at least 23% (up to 70%), while reducing power without latency overheads on both synthetic and realistic applications. For aggressive optical technology parameters, Amon considerably outperforms all alternative NoCs in terms of power, outlining its increasing superiority as the technology matures.
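    The check-before-transmit idea can be pictured with a very small Python sketch: a sender first queries, over the control network, whether the destination's shared receive resources are free, and only then reserves them for the data transfer, so one set of resources can be safely shared. This is purely an illustration of the handshake concept with hypothetical names; it is not Amon's microarchitecture or arbitration protocol.

```python
class SharedOpticalReceiver:
    """Toy model of a receive site whose detector ring is shared:
    at most one sender may drive it at a time."""
    def __init__(self, node_id):
        self.node_id = node_id
        self.busy_until = 0   # cycle at which the shared ring becomes free

def try_send(control_log, receivers, src, dst, start_cycle, length):
    """Destination check on the control network before any data transmission:
    the sender only uses the shared data path if the destination is free."""
    rx = receivers[dst]
    if rx.busy_until > start_cycle:             # control network reports "busy"
        control_log.append((start_cycle, src, dst, "deferred"))
        return rx.busy_until                    # earliest cycle to retry
    rx.busy_until = start_cycle + length        # reserve the shared resources
    control_log.append((start_cycle, src, dst, "sent"))
    return start_cycle + length

# Two senders targeting the same destination: the second is deferred by the
# check instead of colliding on the shared receive ring.
receivers = {2: SharedOpticalReceiver(2)}
log = []
try_send(log, receivers, src=0, dst=2, start_cycle=0, length=4)
try_send(log, receivers, src=1, dst=2, start_cycle=1, length=4)
print(log)   # [(0, 0, 2, 'sent'), (1, 1, 2, 'deferred')]
```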

    Subchannel Scheduling for Shared Optical On-chip Buses

    Maximizing bandwidth utilization of optical on-chip interconnects is essential to compensate for the static power overheads of optical networks-on-chip. Shared optical buses have been shown to be a power-efficient, modular design solution with tremendous power-saving potential, as they allow the optical bandwidth to be shared by all connected nodes. Previous proposals resolve bus contention by scheduling senders sequentially on the entire optical bandwidth; however, logically splitting a bus into subchannels to allow both sequential and parallel data transmission has been shown to be highly efficient in electrical interconnects and could also be applied to shared optical buses. In this paper, we propose an efficient subchannel scheduling algorithm that aims to minimize the number of bus utilization cycles by assigning sender-receiver pairs both to subchannels and to time slots. We present both a distributed and a centralized bus arbitration scheme and show that both can be implemented with low overheads. Our results show that subchannel scheduling can more than double throughput on shared optical buses compared to sequential scheduling, without any power overheads in most cases. Arbitration latency overheads compared to state-of-the-art sequential schemes are moderate to low for significant bus bandwidths and only noticeable for low injection rates.
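    To illustrate what assigning sender-receiver pairs to both subchannels and time slots buys over purely sequential scheduling, here is a minimal greedy first-fit scheduler in Python: requests with disjoint endpoints are packed into the same bus cycle on different subchannels. It is a simplified illustration under obvious assumptions (one transfer per endpoint per slot, equal-length transfers), not the distributed or centralized arbitration scheme evaluated in the paper.

```python
def schedule_subchannels(requests, num_subchannels):
    """Greedy first-fit: place each (sender, receiver) request in the earliest
    time slot that still has a free subchannel and in which neither endpoint
    is already busy, reducing the total number of bus utilization cycles."""
    schedule = []   # schedule[slot] = list of (subchannel, sender, receiver)
    for sender, receiver in requests:
        slot = 0
        while True:
            if slot == len(schedule):
                schedule.append([])
            used_channels = {ch for ch, _, _ in schedule[slot]}
            busy_nodes = {n for _, s, r in schedule[slot] for n in (s, r)}
            if (len(used_channels) < num_subchannels
                    and sender not in busy_nodes
                    and receiver not in busy_nodes):
                free_channel = next(c for c in range(num_subchannels)
                                    if c not in used_channels)
                schedule[slot].append((free_channel, sender, receiver))
                break
            slot += 1
    return schedule

# Four requests on a two-subchannel bus: pairs with disjoint endpoints share a
# slot, so the batch completes in 2 bus cycles instead of the 4 a purely
# sequential scheme would need.
requests = [(0, 1), (2, 3), (0, 2), (1, 3)]
for t, slot in enumerate(schedule_subchannels(requests, num_subchannels=2)):
    print(f"cycle {t}: {slot}")
```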