    Feedback Generation for Performance Problems in Introductory Programming Assignments

    Providing feedback on programming assignments manually is a tedious, error-prone, and time-consuming task. In this paper, we motivate and address the problem of generating feedback on performance aspects in introductory programming assignments. We studied a large number of functionally correct student solutions to introductory programming assignments and observed: (1) There are different algorithmic strategies, with varying levels of efficiency, for solving a given problem. These different strategies merit different feedback. (2) The same algorithmic strategy can be implemented in countless different ways, which are not relevant for reporting feedback on the student program. We propose a light-weight programming language extension that allows a teacher to define an algorithmic strategy by specifying certain key values that should occur during the execution of an implementation. We describe a dynamic analysis based approach to test whether a student's program matches a teacher's specification. Our experimental results illustrate the effectiveness of both our specification language and our dynamic analysis. On one of our benchmarks consisting of 2316 functionally correct implementations of 3 programming problems, we identified 16 strategies that we were able to describe using our specification language (in 95 minutes after inspecting 66, i.e., around 3%, implementations). Our dynamic analysis correctly matched each implementation with its corresponding specification, thereby automatically producing the intended feedback. Comment: Tech report/extended version of FSE 2014 paper

    Parallel Sort-Based Matching for Data Distribution Management on Shared-Memory Multiprocessors

    In this paper we consider the problem of identifying intersections between two sets of d-dimensional axis-parallel rectangles. This is a common problem that arises in many agent-based simulation studies, and is of central importance in the context of High Level Architecture (HLA), where it is at the core of the Data Distribution Management (DDM) service. Several realizations of the DDM service have been proposed; however, many of them are either inefficient or inherently sequential. These are serious limitations since multicore processors are now ubiquitous, and DDM algorithms -- being CPU-intensive -- could benefit from additional computing power. We propose a parallel version of the Sort-Based Matching algorithm for shared-memory multiprocessors. Sort-Based Matching is one of the most efficient serial algorithms for the DDM problem, but is quite difficult to parallelize due to data dependencies. We describe the algorithm and compute its asymptotic running time; we complete the analysis by assessing its performance and scalability through extensive experiments on two commodity multicore systems based on a dual-socket Intel Xeon processor and a single-socket Intel Core i7 processor. Comment: Proceedings of the 21st ACM/IEEE International Symposium on Distributed Simulation and Real Time Applications (DS-RT 2017). Best Paper Award @DS-RT 2017
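
The serial idea behind sort-based matching can be sketched in one dimension as an endpoint sweep. This is only an illustration under stated assumptions, not the parallel algorithm the abstract proposes: the function name `sort_based_matching`, the closed-interval convention, and the tuple-based event encoding are all hypothetical choices made for this sketch.

```python
def sort_based_matching(subs, upds):
    """Return the set of (i, j) such that closed intervals subs[i] and upds[j] overlap.

    Serial 1-D sketch: sort all endpoints, then sweep once while tracking
    which intervals of each set are currently open.
    """
    events = []
    for i, (lo, hi) in enumerate(subs):
        events.append((lo, 0, "S", i))  # 0 = interval opens
        events.append((hi, 1, "S", i))  # 1 = interval closes
    for j, (lo, hi) in enumerate(upds):
        events.append((lo, 0, "U", j))
        events.append((hi, 1, "U", j))

    # At equal coordinates, opens sort before closes, so intervals that
    # merely touch at an endpoint are reported as intersecting.
    events.sort()

    active_s, active_u = set(), set()
    matches = set()
    for _, is_end, kind, idx in events:
        if is_end:
            (active_s if kind == "S" else active_u).discard(idx)
        elif kind == "S":
            # A newly opened subscription overlaps every open update.
            matches.update((idx, j) for j in active_u)
            active_s.add(idx)
        else:
            # A newly opened update overlaps every open subscription.
            matches.update((i, idx) for i in active_s)
            active_u.add(idx)
    return matches


subscriptions = [(0, 2), (5, 7)]
updates = [(1, 6), (8, 9), (2, 2)]
print(sorted(sort_based_matching(subscriptions, updates)))
```

The sweep carries its active sets from one event to the next, which is one way to see the data dependencies that the abstract says make parallelization difficult.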

    Modern Approaches to Exact Diagonalization and Selected Configuration Interaction with the Adaptive Sampling CI Method.

    Recent advances in selected configuration interaction methods have made them competitive with the most accurate techniques available and, hence, an increasingly powerful tool for solving quantum Hamiltonians. In this work, we build on recent advances from the adaptive sampling configuration interaction (ASCI) algorithm. We show that a useful paradigm for generating efficient selected CI/exact diagonalization algorithms is driven by fast sorting algorithms, much in the same way iterative diagonalization is based on the paradigm of matrix vector multiplication. We present several new algorithms for all parts of performing a selected CI, which include new ASCI search, dynamic bit masking, fast orbital rotations, fast diagonal matrix elements, and residue arrays. The ASCI search algorithm can be used in several different modes, which include an integral driven search and a coefficient driven search. The algorithms presented here are fast and scalable, and we find that because they are built on fast sorting algorithms they are more efficient than all other approaches we considered. After introducing these techniques, we present ASCI results applied to a large range of systems and basis sets to demonstrate the types of simulations that can be practically treated at the full-CI level with modern methods and hardware, presenting double- and triple-ζ benchmark data for the G1 data set. The largest of these calculations is Si2H6, which is a simulation of 34 electrons in 152 orbitals. We also present some preliminary results for fast deterministic perturbation theory simulations that use hash functions to maintain high efficiency for treating large basis sets.

    A Parallel Adaptive P3M code with Hierarchical Particle Reordering

    We discuss the design and implementation of HYDRA_OMP, a parallel implementation of the Smoothed Particle Hydrodynamics-Adaptive P3M (SPH-AP3M) code HYDRA. The code is designed primarily for conducting cosmological hydrodynamic simulations and is written in Fortran77+OpenMP. A number of optimizations for RISC processors and SMP-NUMA architectures have been implemented, the most important being hierarchical reordering of particles within chaining cells, which greatly improves data locality, thereby removing the cache misses typically associated with linked lists. Parallel scaling is good, with a minimum parallel scaling of 73% achieved on 32 nodes for a variety of modern SMP architectures. We give performance data in terms of the number of particle updates per second, which is a more useful performance metric than raw MFlops. A basic version of the code will be made available to the community in the near future. Comment: 34 pages, 12 figures, accepted for publication in Computer Physics Communications

    Even faster sorting of (not only) integers

    In this paper we introduce RADULS2, the fastest parallel sorter based on the radix algorithm. It is optimized to process huge amounts of data, making use of modern multicore CPUs. The main novelties include an extremely optimized algorithm for handling tiny arrays (up to about a hundred records), which can appear as subproblems even billions of times, and improved processing of larger subarrays with better use of non-temporal memory stores.
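
The general shape of a radix sorter that switches to a simpler routine for tiny subarrays can be sketched as follows. This is a serial, most-significant-byte-first illustration, not RADULS2 itself: the names `msd_radix_sort` and `insertion_sort`, the threshold `tiny=32`, and the restriction to non-negative integers that fit in `key_bytes` bytes are assumptions made for the example.

```python
def insertion_sort(items):
    """Sort a small list in place and return it; cheap for tiny inputs."""
    for i in range(1, len(items)):
        x = items[i]
        j = i - 1
        while j >= 0 and items[j] > x:
            items[j + 1] = items[j]
            j -= 1
        items[j + 1] = x
    return items


def msd_radix_sort(keys, key_bytes=4, tiny=32):
    """Sort non-negative integers that fit in key_bytes bytes (assumed)."""

    def sort_level(items, byte_index):
        if byte_index < 0:
            # Every byte has been examined, so all keys here are equal.
            return list(items)
        if len(items) <= tiny:
            # Tiny subproblem: fall back to a simple comparison sort.
            return insertion_sort(list(items))
        # Distribute keys into 256 buckets by the current byte.
        buckets = [[] for _ in range(256)]
        shift = 8 * byte_index
        for k in items:
            buckets[(k >> shift) & 0xFF].append(k)
        # Recurse on each non-empty bucket using the next lower byte.
        out = []
        for bucket in buckets:
            if bucket:
                out.extend(sort_level(bucket, byte_index - 1))
        return out

    return sort_level(keys, key_bytes - 1)


data = [(i * 2654435761) % (2 ** 32) for i in range(1000)]
print(msd_radix_sort(data) == sorted(data))
```

After one or two distribution passes most buckets hold only a handful of keys, so the small-array path runs far more often than the bucket pass, which is why the abstract singles out tiny arrays as worth optimizing.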

    An Efficient Thread Mapping Strategy for Multiprogramming on Manycore Processors

    The emergence of multicore and manycore processors is set to change the parallel computing world. Applications are shifting towards increased parallelism in order to utilise these architectures efficiently. This leads to a situation where every application creates its desirable number of threads, based on its parallel nature and the system resources allowance. Task scheduling in such a multithreaded multiprogramming environment is a significant challenge. In task scheduling, not only the order of the execution, but also the mapping of threads to the execution resources is of great importance. In this paper we state and discuss some fundamental rules based on results obtained from selected applications of the BOTS benchmarks on the 64-core TILEPro64 processor. We demonstrate how previously efficient mapping policies such as those of the SMP Linux scheduler become inefficient when the number of threads and cores grows. We propose a novel, low-overhead technique, a heuristic based on the amount of time spent by each CPU doing some useful work, to fairly distribute the workloads amongst the cores in a multiprogramming environment. Our novel approach could be implemented as a pragma similar to those in the new task-based OpenMP versions, or can be incorporated as a distributed thread mapping mechanism in future manycore programming frameworks. We show that our thread mapping scheme can outperform the native GNU/Linux thread scheduler in both single-programming and multiprogramming environments. Comment: ParCo Conference, Munich, Germany, 201
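
A busy-time heuristic of the kind described above can be illustrated with a greedy "least-busy core first" mapping. This is a hypothetical sketch, not the authors' scheme: the function `map_threads` and the use of estimated per-thread costs in place of useful-work time measured at runtime are assumptions made for the example.

```python
import heapq


def map_threads(estimated_costs, num_cores):
    """Assign each thread to the core with the least accumulated busy time.

    estimated_costs[i] stands in for the useful-work time of thread i; a
    real scheduler would measure this per CPU at runtime (assumption).
    Returns a list where entry i is the core chosen for thread i.
    """
    # Min-heap of (accumulated busy time, core id).
    heap = [(0.0, core) for core in range(num_cores)]
    heapq.heapify(heap)

    assignment = []
    for cost in estimated_costs:
        busy, core = heapq.heappop(heap)  # least-busy core so far
        assignment.append(core)
        heapq.heappush(heap, (busy + cost, core))
    return assignment


print(map_threads([5, 3, 2, 4], num_cores=2))
```

Ties are broken by core id because the heap compares the `(busy, core)` tuples lexicographically.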

    Finding Top-k Dominance on Incomplete Big Data Using Map-Reduce Framework

    Incomplete data is a major kind of multi-dimensional dataset in which values are missing at random across dimensions. Retrieving information from this type of dataset is very difficult once it becomes huge, and finding top-k dominant values in it is a challenging procedure. Some existing algorithms speed up this process, but they are mostly efficient only on small incomplete datasets. One algorithm that makes the top-k dominance (TKD) query possible is the Bitmap Index Guided (BIG) algorithm. It strongly improves performance on incomplete data, but it was not designed for, and cannot handle, finding top-k dominant values in incomplete big data. Several other algorithms have been proposed to answer the TKD query, such as the Skyband Based and Upper Bound Based algorithms, but their performance is also questionable. These earlier algorithms were among the first attempts to apply the TKD query to incomplete data; however, all of them performed poorly or were not compatible with incomplete data. This thesis proposes the MapReduced Enhanced Bitmap Index Guided Algorithm (MRBIG) to address these issues. MRBIG uses the MapReduce framework to speed up top-k dominance queries on huge incomplete datasets: the framework divides the work among several computing nodes that operate independently and simultaneously to find the result. This method finds the TKD query result up to two times faster than previously presented algorithms.
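
To make the query concrete, here is a brute-force sketch of a top-k dominance query over incomplete data. It is neither BIG nor MRBIG and involves no MapReduce. It assumes a definition the abstract does not spell out: an object dominates another if it is no worse on every dimension both have observed and strictly better on at least one, with smaller values taken as better and `None` marking a missing value. The names `dominates` and `top_k_dominating` are hypothetical.

```python
def dominates(a, b):
    """True if a dominates b on their commonly observed dimensions.

    Assumed convention: smaller is better, None means 'missing'.
    """
    common = [(x, y) for x, y in zip(a, b) if x is not None and y is not None]
    return (
        bool(common)
        and all(x <= y for x, y in common)
        and any(x < y for x, y in common)
    )


def top_k_dominating(objects, k):
    """Return the k object names that dominate the most other objects."""
    scores = {
        name: sum(
            dominates(values, other_values)
            for other, other_values in objects.items()
            if other != name
        )
        for name, values in objects.items()
    }
    # Highest score first; ties broken by name for a deterministic result.
    return sorted(scores, key=lambda name: (-scores[name], name))[:k]


objects = {
    "a": (1, None, 2),
    "b": (2, 3, None),
    "c": (None, 4, 3),
    "d": (3, 5, 4),
}
print(top_k_dominating(objects, 2))
```

The pairwise comparison is quadratic in the number of objects, which is the cost that index-based and parallel approaches such as those named in the abstract aim to reduce.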

    Cache-aware Parallel Programming for Manycore Processors

    With rapidly evolving technology, multicore and manycore processors have emerged as promising architectures to benefit from increasing transistor numbers. The transition towards these parallel architectures makes today an exciting time to investigate challenges in parallel computing. The TILEPro64 is a manycore accelerator, composed of 64 tiles interconnected via multiple 8x8 mesh networks. It contains per-tile caches and supports cache-coherent shared memory by default. In this paper we present a programming technique to take advantage of distributed caching facilities in manycore processors. However, unlike other work in this area, our approach does not use architecture-specific libraries. Instead, we provide the programmer with a novel technique on how to program future Non-Uniform Cache Architecture (NUCA) manycore systems, bearing in mind their caching organisation. We show that our localised programming approach can result in a significant improvement of the parallelisation efficiency (speed-up). Comment: This work was presented at the international symposium on Highly-Efficient Accelerators and Reconfigurable Technologies (HEART2013), Edinburgh, Scotland, June 13-14, 2013