
    Exploring multiple levels of performance modeling for heterogeneous systems

    One of the major challenges faced by the HPC community today is user-friendly and accurate heterogeneous performance modeling. Although performance prediction models exist to fine-tune applications, they are seldom easy to use and do not address multiple levels of design-space abstraction. Our research aims to bridge the gap between reliable performance model selection and user-friendly analysis. We propose a straightforward and accurate performance prediction suite for multi-GPGPU systems that primarily targets synchronous iterative algorithms using our synchronous iterative GPGPU execution model. The performance modeling suite addresses two levels of system abstraction: a low level, where partial implementation details are present along with system specifications, and a high level, where implementation details are minimal and only high-level system specifications are known. The low-level abstraction models use statistical techniques for performance prediction, whereas the high-level abstraction models are composed of existing analytical and quantitative models. Our initial validation results yield high prediction accuracy, with less than a 10% error rate for several tested GPGPU cluster configurations and case studies. The final goal of our research is to offer a reliable and user-friendly performance prediction framework that allows users to select an optimal performance modeling strategy for the given design goals.
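
    The abstract does not specify the statistical techniques behind the low-level models. As a purely hypothetical illustration of the simplest such approach, the sketch below fits a linear timing model T(n) = a + b*n to measured runtimes by ordinary least squares and extrapolates to an untested problem size; the model form and the sample data are assumptions, not the suite's actual method.

        // fit_timing.cu (host-only; compiles as plain C++) -- hypothetical
        // sketch: fit T(n) = a + b*n to measured (workload, runtime) pairs
        // with ordinary least squares, then predict an untested size.
        #include <cstdio>
        #include <cstddef>

        static void fit_linear(const double *n, const double *t, size_t k,
                               double *a, double *b) {
            double sn = 0, st = 0, snn = 0, snt = 0;
            for (size_t i = 0; i < k; ++i) {
                sn += n[i]; st += t[i]; snn += n[i] * n[i]; snt += n[i] * t[i];
            }
            *b = (k * snt - sn * st) / (k * snn - sn * sn); // cost per work unit
            *a = (st - *b * sn) / k;                        // fixed overhead
        }

        int main() {
            // Illustrative measurements only (problem size, runtime in ms).
            double n[] = {1e6, 2e6, 4e6, 8e6};
            double t[] = {12.1, 23.8, 47.5, 94.2};
            double a, b;
            fit_linear(n, t, 4, &a, &b);
            std::printf("predicted T(16e6) = %.1f ms\n", a + b * 16e6);
            return 0;
        }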

    The Realm of Graphical Processing Unit (GPU) Computing

    The goal of the chapter is to introduce upper-level Computer Engineering/Computer Science undergraduate (UG) students to general-purpose graphical processing unit (GPGPU) computing. The specific focus of the chapter is on GPGPU computing using the Compute Unified Device Architecture (CUDA) C framework for three reasons: (1) Nvidia GPUs are ubiquitous in high-performance computing, (2) CUDA is relatively easy to understand compared with OpenCL, especially for UG students with limited heterogeneous device programming experience, and (3) CUDA experience simplifies learning OpenCL and OpenACC. The chapter consists of nine pedagogical sections with several active-learning exercises to effectively engage students with the text. The chapter opens with an introduction to GPGPU computing. The chapter sections include: (1) Data parallelism; (2) CUDA program structure; (3) CUDA compilation flow; (4) CUDA thread organization; (5) Kernel: Execution configuration and kernel structure; (6) CUDA memory organization; (7) CUDA optimizations; (8) Case study: Image convolution on GPUs; and (9) GPU computing: The future. The authors believe that the chapter layout facilitates effective student learning by starting from the basics of GPGPU computing and then leading up to the advanced concepts. With this chapter, the authors aim to equip students with the necessary skills to undertake graduate-level courses on GPU programming and make a strong start with undergraduate research.
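
    The chapter's own code is not reproduced here, but a minimal kernel in the spirit of its thread organization, execution configuration, and image convolution sections might look like the following; the 3x3 averaging filter and the 16x16 block shape are illustrative choices, not the chapter's actual case study.

        // blur3x3.cu -- illustrative kernel, not the chapter's actual code.
        __global__ void blur3x3(const float *in, float *out, int w, int h) {
            // Thread organization: a 2D grid assigns one thread per pixel.
            int x = blockIdx.x * blockDim.x + threadIdx.x;
            int y = blockIdx.y * blockDim.y + threadIdx.y;
            if (x >= w || y >= h) return;
            float sum = 0.0f;
            int count = 0;
            for (int dy = -1; dy <= 1; ++dy)       // 3x3 neighborhood average,
                for (int dx = -1; dx <= 1; ++dx) { // clamped at image borders
                    int nx = x + dx, ny = y + dy;
                    if (nx >= 0 && nx < w && ny >= 0 && ny < h) {
                        sum += in[ny * w + nx];
                        ++count;
                    }
                }
            out[y * w + x] = sum / count;
        }

        // Execution configuration: enough 16x16 blocks to cover the image.
        void launch(const float *d_in, float *d_out, int w, int h) {
            dim3 block(16, 16);
            dim3 grid((w + block.x - 1) / block.x, (h + block.y - 1) / block.y);
            blur3x3<<<grid, block>>>(d_in, d_out, w, h);
        }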

    True-Ed Select Enters Social Computing: A Machine Learning Based University Selection Framework

    University/college selection is a daunting task for young adults and their parents alike. This research presents True-Ed Select, a machine learning framework that simplifies the college selection process. The framework uses a four-layered approach comprising user survey, machine learning, consolidation, and recommendation. The first layer collects both the objective and subjective attributes from users that best characterize their ideal college experience. The second layer employs machine learning techniques to analyze the objective and subjective attributes. The third layer combines the results from the machine learning techniques. The fourth layer takes the consolidated result as input and presents a user-friendly list of top educational institutions that best match the user's interests. We use our framework to analyze over 3,500 United States post-secondary institutions and show a search-space reduction to the top 20 institutions. This drastically reduced search space facilitates effective and assured college selection for end users.
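
    The framework's models are not described in detail in the abstract; as a hypothetical sketch of the consolidation and recommendation layers only, the code below combines two per-institution model scores with a user-chosen weight and returns the top 20. The Institution fields and the weighted-sum rule are assumptions for illustration.

        // rank_institutions.cu (host-only) -- hypothetical consolidation and
        // recommendation layers; field names and the weighted-sum rule are
        // illustrative assumptions, not the framework's actual models.
        #include <algorithm>
        #include <string>
        #include <vector>

        struct Institution {
            std::string name;
            double objective_score;  // output of the objective-attribute model
            double subjective_score; // output of the subjective-attribute model
        };

        std::vector<Institution> top20(std::vector<Institution> all, double w) {
            // Consolidation: weighted sum of the two machine learning outputs.
            auto combined = [w](const Institution &i) {
                return w * i.objective_score + (1.0 - w) * i.subjective_score;
            };
            // Recommendation: sort by combined score, keep the top 20.
            std::sort(all.begin(), all.end(),
                      [&combined](const Institution &x, const Institution &y) {
                          return combined(x) > combined(y);
                      });
            if (all.size() > 20) all.resize(20);
            return all;
        }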

    GPU acceleration of a best-features based digital rotoscope

    This paper presents the first hybrid GPU-CPU implementation of a best-features based digital rotoscope. Unlike other rotoscoping and Non-Photorealistic Rendering (NPR) implementations, this best-features rotoscoping technique incorporates four major stages, namely background subtraction, best-feature corner detection, marker-based watershed segmentation, and color palette, to produce a cartoon-stylized video sequence. Our implementation uses the computing power of both the CPU host and the GPU device for fast processing. We implement the computationally intensive and parallel stages on the GPU device using optimizations such as shared memory, for reduced global memory accesses, and an execution configuration that maximizes the GPU multiprocessor occupancy. We also devise a novel window-based reduction method on the GPU device to select optimally spaced best features; this task is inherently sequential in the serial algorithm. We test our implementation using videos of different resolutions. Our GPU implementation reports a speedup as high as 3.71x for a video resolution of 1440x2560, and an end-to-end execution time reduction from 21 minutes to 6 minutes for the largest video, of resolution 2160x3840.
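
    The paper's window-based reduction is not listed in the abstract, so the kernel below is only a sketch of the general idea under stated assumptions: corner scores are non-negative, each thread block owns one win x win image window, and a shared-memory tree reduction keeps the single strongest corner per window, which enforces a minimum spacing between selected features.

        // window_reduce.cu -- sketch of a window-based reduction; launch with
        // grid((w + win - 1) / win, (h + win - 1) / win) and block(BLOCK).
        #define BLOCK 256

        __global__ void best_feature_per_window(const float *score, int w,
                                                int h, int win, int *best) {
            __shared__ float sval[BLOCK];
            __shared__ int sidx[BLOCK];
            int wx = blockIdx.x * win, wy = blockIdx.y * win; // window origin
            float v = -1.0f; // scores are assumed non-negative
            int idx = -1;
            // Each thread strides over the window, tracking its local maximum.
            for (int i = threadIdx.x; i < win * win; i += BLOCK) {
                int x = wx + i % win, y = wy + i / win;
                if (x < w && y < h && score[y * w + x] > v) {
                    v = score[y * w + x];
                    idx = y * w + x;
                }
            }
            sval[threadIdx.x] = v;
            sidx[threadIdx.x] = idx;
            __syncthreads();
            // Shared-memory tree reduction over (score, index) pairs.
            for (int s = BLOCK / 2; s > 0; s >>= 1) {
                if (threadIdx.x < s && sval[threadIdx.x + s] > sval[threadIdx.x]) {
                    sval[threadIdx.x] = sval[threadIdx.x + s];
                    sidx[threadIdx.x] = sidx[threadIdx.x + s];
                }
                __syncthreads();
            }
            if (threadIdx.x == 0) // one surviving (strongest) corner per window
                best[blockIdx.y * gridDim.x + blockIdx.x] = sidx[0];
        }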

    A testing engine for high-performance and cost-effective workflow execution in the cloud

    While pursuing high performance and cost effectiveness for directed acyclic graph (DAG)-structured scientific workflow executions in the cloud, it is critical to identify appropriate resource instances and their quantity. This paper presents a testing engine that employs a resource-selection heuristic, which statically analyzes the DAG structure to guide the selection of resource instances: both how many and which ones. The testing engine combines the heuristic with two platform-independent DAG-scheduling policies, the Area-oriented DAG-scheduling heuristic (AO) and the Locally-Optimal heuristic (L-OPT), to perform extensive validation assessments. The testing engine ensures the realism of these assessments by modeling the performance variability of the cloud platform using real traces. It also enables a cost-effectiveness analysis that guides users to select a small set of instance candidates that provide a performance-cost trade-off. Our empirical results show that pairing the resource-selection heuristic with the AO scheduling policy is a powerful method for cost-effective DAG-structured workflow execution in the cloud.
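
    The resource-selection heuristic itself, like the AO and L-OPT policies, is not specified in the abstract. As a hedged, much-simplified stand-in for a static DAG analysis, the sketch below computes the peak number of simultaneously ready tasks via a Kahn-style level traversal; that width bounds how many instances can ever be busy at once, so renting more cannot help.

        // dag_width.cu (host-only) -- hypothetical, much-simplified static
        // DAG analysis; the paper's heuristic and the AO/L-OPT policies are
        // more involved.
        #include <algorithm>
        #include <vector>

        // adj[u] lists the children of task u; the graph is assumed acyclic.
        int max_useful_instances(const std::vector<std::vector<int>> &adj) {
            int n = (int)adj.size();
            std::vector<int> indeg(n, 0);
            for (const auto &kids : adj)
                for (int v : kids) ++indeg[v];
            std::vector<int> frontier; // tasks whose predecessors are all done
            for (int u = 0; u < n; ++u)
                if (indeg[u] == 0) frontier.push_back(u);
            int widest = 0;
            while (!frontier.empty()) {
                widest = std::max(widest, (int)frontier.size());
                std::vector<int> next;
                for (int u : frontier)
                    for (int v : adj[u])
                        if (--indeg[v] == 0) next.push_back(v);
                frontier.swap(next);
            }
            return widest; // peak level width = peak exploitable parallelism
        }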

    A best-features based digital rotoscope

    The paper presents a best-features based digital rotoscope to animate a video sequence. Our thesis is that corners in an image are viable feature points for animation because they move appreciably across frames. Our proposed rotoscope processes a video sequence one frame at a time and comprises four image processing stages, namely the background subtraction stage, the two-phase Shi-Tomasi feature extraction stage, the watershed segmentation stage, and the color palette stage. The background subtraction stage subtracts the background from the input image, thereby isolating the foreground colors and producing an inverted image. The two-phase Shi-Tomasi feature extraction stage performs two passes of corner detection on the inverted image to identify a user-defined number of best features. The watershed segmentation stage uses the best features as markers to segment the input frame. Finally, the color palette stage colors each one of the segments with the average of the input frame's color values in that segment, thereby creating a rotoscoped frame. After processing all of the frames in a video sequence, the digital rotoscope produces an animated movie. We study the impact of choosing different numbers of best features on the quality of animation. Our empirical study reveals that for a frame size of 480 × 640 × 3, 1000 features or more produce effective animation. Our four-stage digital rotoscope provides opportunities for parallel implementations on high-performance architectures, thereby creating avenues for fast analysis.
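
    For the feature-extraction stage, the Shi-Tomasi corner measure is textbook material: the response at a pixel is the smaller eigenvalue of the 2x2 structure tensor accumulated from image gradients over a neighborhood. The helper below computes that score from the tensor sums; it reflects the standard formula, not necessarily the paper's implementation.

        // shi_tomasi.cu (host-only) -- textbook Shi-Tomasi corner score.
        #include <cmath>

        // sxx = sum(Ix*Ix), sxy = sum(Ix*Iy), syy = sum(Iy*Iy) over a window
        // around the pixel, where Ix and Iy are the image gradients.
        double shi_tomasi_score(double sxx, double sxy, double syy) {
            double trace = sxx + syy;
            double det = sxx * syy - sxy * sxy;
            // Smaller eigenvalue of the tensor [[sxx, sxy], [sxy, syy]];
            // it is large only when both gradient directions are strong.
            return 0.5 * (trace - std::sqrt(trace * trace - 4.0 * det));
        }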

    Applying frequency analysis techniques to DAG-based workflows to benchmark and predict resource behavior on non-dedicated clusters

    Today, scientific workflows on high-end non-dedicated clusters increasingly resemble directed acyclic graphs (DAGs). The execution trace analysis of the associated DAG-based workflows can provide valuable insights into system behavior in general, and the occurrences of events like idle times in particular, thereby opening avenues for optimized resource utilization. In this paper, we propose a bipartite tool that uses frequency analysis techniques to benchmark and predict event occurrences in DAG-based workflows, highlighting the system behavior for a given cluster configuration. Using an empirically determined prediction window, the tool parses real-time traces to generate the cumulative distribution function (CDF) of the event occurrences. The CDF is then queried to predict the likelihood of a given number of event instances on the cluster resources in a future time frame. Our results yield average prediction hit rates as high as 94%. The proposed research enables a runtime system to identify unfavorable event occurrences, thereby allowing for preventive scheduling strategies that maximize system utilization.
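
    As a small illustration of the CDF step (the trace format and prediction window are assumptions here, not the paper's), the sketch below builds the empirical distribution of per-window event counts and answers queries of the form "with what probability do at most k events occur in a future window?".

        // event_cdf.cu (host-only) -- empirical CDF of per-window event counts.
        #include <algorithm>
        #include <vector>

        // counts[i] = number of event instances observed in trace window i;
        // the trace is assumed non-empty.
        double prob_at_most(std::vector<int> counts, int k) {
            std::sort(counts.begin(), counts.end());
            // F(k) = fraction of observed windows containing <= k events.
            auto it = std::upper_bound(counts.begin(), counts.end(), k);
            return (double)(it - counts.begin()) / counts.size();
        }
        // Example: if prob_at_most(trace, 3) returns 0.9, the tool would
        // predict at most 3 such events in a future window with 90% likelihood.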

    Incorporating research in the undergraduate experience at a private teaching-centric institution

    This paper presents several case studies that demonstrate how undergraduate students at a teaching-centric institution can actively participate in research. We elaborate on three mechanisms for including students in research: independent study, technical electives, and the senior design (capstone) project. Our examples of these mechanisms present techniques to keep students interested and motivated throughout a research project. These techniques also create possibilities for publication and for the creation of new research areas, both of which are vital to the faculty's and students' success. We elaborate on lessons learned with the aim of helping faculty at teaching-centric institutions effectively incorporate research into the undergraduate experience while conducting an effective research program of their own.

    Evaluation of GPU architectures using spiking neural networks

    During recent years, General-Purpose Graphical Processing Units (GP-GPUs) have entered the field of High-Performance Computing (HPC) as one of the primary architectural focuses for many research groups working with complex scientific applications. Nvidia's Tesla C2050, codenamed Fermi, and AMD's Radeon 5870 are two devices positioned to meet the computationally demanding needs of supercomputing research groups across the globe. Though Nvidia GPUs powered by CUDA have been the frequent choice of performance-centric research groups, the introduction and growth of OpenCL has promoted AMD GP-GPUs as potential accelerator candidates that can challenge Nvidia's stronghold. These architectures not only offer a plethora of features for application developers to explore, but their radically different designs also call for a detailed study that weighs their merits and evaluates their potential to accelerate complex scientific applications. In this paper, we present our performance analysis research comparing Nvidia's Fermi and AMD's Radeon 5870 using OpenCL as the common programming model. We have chosen four different neuron models for Spiking Neural Networks (SNNs), each with different communication and computation requirements, namely the Izhikevich, Wilson, Morris-Lecar (ML), and Hodgkin-Huxley (HH) models. We compare the runtime performance of the Fermi and Radeon GPUs with an implementation that exhausts all optimization techniques available with OpenCL. Several equivalent architectural parameters of the two GPUs are studied and correlated with the application performance. In addition to the comparative study, our implementations achieve speedups of 857.3x and 658.51x on the Fermi and Radeon architectures, respectively, for the most compute-intensive HH model with a dense network containing 9.72 million neurons. The final outcome of this research is a detailed architectural comparison of the two GPU architectures on a common programming platform. © 2011 IEEE
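
    The paper's implementations are in OpenCL and are not reproduced in the abstract; for illustration, an equivalent CUDA sketch of one 1-ms Euler step of the least expensive of the four models, the Izhikevich neuron, is shown below. The per-neuron parameter arrays and the single-step integration scheme are assumptions of this sketch.

        // izhikevich.cu -- CUDA sketch (the paper used OpenCL) of one 1-ms
        // Euler step of the Izhikevich neuron model; a, b, c, d are the
        // standard per-neuron parameters and I is the input current.
        __global__ void izhikevich_step(float *v, float *u, const float *I,
                                        const float *a, const float *b,
                                        const float *c, const float *d,
                                        int n, int *fired) {
            int i = blockIdx.x * blockDim.x + threadIdx.x;
            if (i >= n) return;
            float vi = v[i], ui = u[i];
            // Membrane dynamics: v' = 0.04v^2 + 5v + 140 - u + I; u' = a(bv - u).
            vi += 0.04f * vi * vi + 5.0f * vi + 140.0f - ui + I[i];
            ui += a[i] * (b[i] * vi - ui);
            fired[i] = 0;
            if (vi >= 30.0f) { // spike threshold: reset v, bump recovery u
                fired[i] = 1;
                vi = c[i];
                ui += d[i];
            }
            v[i] = vi;
            u[i] = ui;
        }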