33 research outputs found

    A dataflow IR for memory efficient RIPL compilation to FPGAs

    Field programmable gate arrays (FPGAs) are fundamentally different from fixed processor architectures because their memory hierarchies can be tailored to the needs of an algorithm. FPGA compilers for high-level languages are therefore not hindered by fixed memory hierarchies; the constraint when compiling to FPGAs is the availability of resources. In this paper we describe how the dataflow intermediate representation (IR) of our declarative FPGA image processing DSL, RIPL (Rathlin Image Processing Language), enables us to constrain memory use. We use five benchmarks to demonstrate that memory use with RIPL is comparable to the Vivado HLS OpenCV library, without the need for language pragmas to guide hardware synthesis. The benchmarks also show that RIPL is more expressive than the Darkroom FPGA image processing language.

    Topology-Aware Parallelism for NUMA Copying Collectors

    NUMA-aware parallel algorithms in runtime systems attempt to improve locality by allocating memory from local NUMA nodes. Researchers have suggested that the garbage collector should profile memory access patterns or use object locality heuristics to determine the target NUMA node before moving an object. However, these solutions are costly when applied to every live object in the reference graph. Our earlier research suggests that connected objects represented by rooted sub-graphs provide abundant locality and are appropriate for NUMA architectures. In this paper, we utilize the intrinsic locality of rooted sub-graphs to improve parallel copying collector performance. Our new topology-aware parallel copying collector preserves rooted sub-graph integrity by moving the connected objects as a unit to the target NUMA node. In addition, it distributes and assigns copying tasks to appropriate (i.e. NUMA-node-local) GC threads. For load balancing, our solution enforces locality on the work-stealing mechanism by stealing from local NUMA nodes only. We evaluated our approach on SPECjbb2013, DaCapo 9.12 and Neo4j. Results show an improvement in GC performance by up to 2.5x speedup and 37% better application performance.

    PHARMA 4.0–IMPACT OF THE INTERNET OF THINGS ON HEALTH CARE

    The IoT is currently booming in the world of health care in particular. Industry has risen from generation 1.0 to 4.0 during the Internet of Things period. Under the traditional health care system, the patient has needed to visit clinics or hospitals even for small complications, adding to the patient's medical costs along with time and energy. Another significant factor is emergencies: under the older system of health care, elderly patients were unable to demand urgent assistance. The situation has changed with the use of the cyber-physical world; we are heading into the fourth phase of the health care industry, the smart health care network. This paper offers an insight into different facets of how health care actors such as doctors, hospitals, and of course patients are powered by the Internet of Things, and how it can track and ensure fast, high-quality, and efficient care in less time and in a smart way. A patient can be tracked using a collection of different wearable sensor nodes for real-time monitoring and examination of specific patient criteria. One of the most promising directions is the development of medical technology within patients' own homes, enabling older or physically weak people to stay at home as long as possible while being medically cared for and monitored. We searched literature and guidelines in PubMed, Web of Science, Google Scholar, Scopus, CNKI, and Embase databases up to 2019. The following search terms, alone or combined with the Boolean operators "AND" or "OR", were used: "Nanoparticles", "Anticancer treatment", "Bioflavonoids", "Plant origin drugs", "Nano formulations", "Cancer", and "Novel drug delivery systems". We focused on full-text articles, but abstracts were considered if relevant.

    Locality-Aware Task Scheduling and Data Distribution for OpenMP Programs on NUMA Systems and Manycore Processors

    No full text
    Performance degradation due to nonuniform data access latencies has worsened on NUMA systems and can now be felt on-chip in manycore processors. Distributing data across NUMA nodes and manycore processor caches is necessary to reduce the impact of nonuniform latencies. However, techniques for distributing data are error-prone and fragile and require low-level architectural knowledge. Existing task scheduling policies favor quick load balancing at the expense of locality and ignore NUMA node and manycore cache access latencies while scheduling. Locality-aware scheduling, in conjunction with or as a replacement for existing scheduling, is necessary to minimize NUMA effects and sustain performance. We present a data distribution and locality-aware scheduling technique for task-based OpenMP programs executing on NUMA systems and manycore processors. Our technique relieves the programmer from thinking about NUMA system and manycore processor architecture details by delegating data distribution to the runtime system, and it uses task data-dependence information to guide the scheduling of OpenMP tasks to reduce data stall times. We demonstrate our technique on a four-socket AMD Opteron machine with eight NUMA nodes and on the TILEPro64 processor, and find that data distribution and locality-aware task scheduling improve performance by up to 69% for scientific benchmarks compared to default policies while providing an architecture-oblivious approach for programmers.

    Improving Perfect Parallelism

    No full text

    Characterizing task-based OpenMP programs.

    No full text
    Programmers struggle to understand the performance of task-based OpenMP programs because profiling tools only report thread-based performance. Performance tuning also requires task-based performance information in order to balance per-task memory hierarchy utilization against exposed task parallelism. We provide a cost-effective method to extract detailed task-based performance information from OpenMP programs. We demonstrate the utility of our method by quickly diagnosing performance problems and characterizing exposed task parallelism and per-task instruction profiles of benchmarks in the widely used Barcelona OpenMP Tasks Suite. Using our method, programmers can tune performance faster and understand performance tradeoffs more effectively than with existing tools.

    anamud/mir-dev: MIR v1.0.0

    No full text
    This is the first release of MIR, hence v1.0.0. MIR is a task-based runtime system library specialized for high-performance execution and detailed yet cost-effective profiling of OpenMP programs. Fork the latest development version and submit issues at https://github.com/anamud/mir-dev. Thank you for your interest in MIR.

    Data sizes to study BOTS input sensitivity.

    No full text
    * UTS is a synthetic stress benchmark whose default inputs produce an extraordinary number of tasks (approx. 1.5–4 billion), which cannot be profiled using our system. We have chosen input sets for UTS which produce approx. 100–300 thousand tasks while maintaining stress.

    Diagnosing performance problems using thread-based performance metrics in BOTS Sort and Strassen.

    No full text
    Sort input: array size = 64M elements, quicksort cutoff = {4096 (default), 262144}, sequential merge sort cutoff same as quicksort cutoff, insertion sort cutoff = 128. Strassen input: dimension = 4096, cutoff = 128 (default). Blocked matrix multiplication (blk-matmul) input: dimension = 4096, block size = 128. Executed on all cores of a 48-core AMD Opteron 6172 machine running at highest frequency with frequency scaling turned off. (a) Speedup. (b) Visualization of state traces from 6 of 48 threads executing Sort with default cutoffs. White bars indicate task creation; black bars, task execution; gray bars, task synchronization. The six threads are bound to cores on different dies.