30 research outputs found

    Design and Development of a Run-Time Monitor for Multi-Core Architectures in Cloud Computing

    Cloud computing is a new information technology trend that moves computing and data away from desktops and portable PCs into large data centers. The basic principle of cloud computing is to deliver applications, as well as infrastructure, as services over the Internet. A cloud is a type of parallel and distributed system consisting of a collection of inter-connected and virtualized computers that are dynamically provisioned and presented as one or more unified computing resources. Large-scale distributed applications on a cloud require adaptive service-based software, which has the capability of monitoring system status changes, analyzing the monitored information, and adapting its service configuration while considering tradeoffs among multiple QoS features simultaneously. In this paper, we design and develop a Run-Time Monitor (RTM), system software that monitors application behavior at run-time, analyzes the collected information, and optimizes cloud computing resources for multi-core architectures. RTM monitors application software through library instrumentation, as well as the underlying hardware through performance counters, optimizing its computing configuration based on the analyzed data.
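    The monitor-analyze-adapt loop described in the abstract can be sketched as follows. This is a minimal illustration, not the paper's implementation: the counter read is simulated, and the adaptation policy, threshold, and class names are all assumptions.

    ```python
    import random

    def read_perf_counter():
        # Placeholder for a hardware performance counter read (e.g. a
        # cache-miss-rate sample); simulated here with a random value.
        return random.uniform(0.0, 1.0)

    class RunTimeMonitor:
        """Minimal monitor loop: sample, analyze, adapt (illustrative policy)."""

        def __init__(self, threshold=0.8):
            self.threshold = threshold
            self.num_threads = 4

        def analyze_and_adapt(self, miss_rate):
            # Simple policy: shrink the thread pool when contention is high,
            # grow it again when contention is low.
            if miss_rate > self.threshold and self.num_threads > 1:
                self.num_threads -= 1
            elif miss_rate < self.threshold / 2:
                self.num_threads += 1
            return self.num_threads

    monitor = RunTimeMonitor()
    for _ in range(5):
        monitor.analyze_and_adapt(read_perf_counter())
    ```

    A real RTM would replace the simulated sample with instrumented library calls and hardware counters, and adapt richer parameters than a thread count.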

    QuantPipe: Applying Adaptive Post-Training Quantization for Distributed Transformer Pipelines in Dynamic Edge Environments

    Pipeline parallelism has achieved great success in deploying large-scale transformer models in cloud environments, but has received less attention in edge environments. Unlike cloud scenarios with high-speed, stable network interconnects, dynamic bandwidth in edge systems can degrade distributed pipeline performance. We address this issue with QuantPipe, a communication-efficient distributed edge system that introduces post-training quantization (PTQ) to compress the communicated tensors. QuantPipe uses adaptive PTQ to change bitwidths in response to bandwidth dynamics, maintaining transformer pipeline performance while incurring limited inference accuracy loss. We further improve accuracy with a directed-search analytical clipping for integer quantization method (DS-ACIQ), which bridges the gap between estimated and real data distributions. Experimental results show that QuantPipe adapts to dynamic bandwidth to maintain pipeline performance while achieving practical model accuracy across a wide range of quantization bitwidths, e.g., improving accuracy under 2-bit quantization by 15.85% on ImageNet compared to naive quantization.
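    The core idea of bandwidth-adaptive PTQ can be sketched with symmetric uniform quantization. The bitwidth policy, thresholds, and clipping value below are illustrative assumptions; the paper derives the clip from DS-ACIQ rather than fixing it by hand.

    ```python
    import numpy as np

    def choose_bitwidth(bandwidth_mbps):
        # Illustrative policy: lower bandwidth -> fewer bits per element.
        if bandwidth_mbps < 10:
            return 2
        if bandwidth_mbps < 100:
            return 4
        return 8

    def quantize(x, bits, clip):
        # Symmetric uniform PTQ to signed integers in [-2^(bits-1), 2^(bits-1)-1];
        # `clip` would come from an analytical clipping method such as DS-ACIQ.
        levels = 2 ** (bits - 1) - 1
        scale = clip / levels
        q = np.clip(np.round(x / scale), -levels - 1, levels)
        return q.astype(np.int8), scale

    x = np.array([0.5, -1.2, 2.0, 0.05])
    q, scale = quantize(x, choose_bitwidth(50), clip=2.0)
    x_hat = q * scale  # dequantized tensor at the next pipeline stage
    ```

    Sending `q` instead of the float tensor cuts inter-stage traffic by roughly 32/bits, at the cost of the quantization error visible in `x_hat`.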

    Opportunities for Concurrent Dynamic Analysis with Explicit Inter-core Communication

    Multicore is now the dominant processor trend, and the number of cores is rapidly increasing. The paradigm shift to multicore forces the redesign of the software stack, which includes dynamic analysis. Dynamic analyses provide rich features to software in various areas, such as debugging, testing, optimization, and security. However, these techniques often suffer from excessive overhead, which makes them less practical. Previously, this overhead has been overcome by improved processor performance as each generation gets faster, but the performance requirements of dynamic analyses in the multicore era cannot be fulfilled without redesigning for parallelism. Scalable design of dynamic analysis is a challenging problem. Not only must the analysis itself be parallel, but the analysis must also be decoupled from the application and run concurrently. A typical method of decoupling the analysis from the application is to send the analysis data from the application to the core that runs the analysis thread via buffering. However, buffering can perturb application cache performance, and the cache coherence protocol may not be efficient, or even implemented, with large numbers of cores in the future. This paper presents our initial effort to explore the hardware design space and software approach that will alleviate the scalability problem for dynamic analysis on multicore. We choose to make use of explicit inter-core communication that is already available in a real processor, the TILE64, and evaluate the opportunity for scalable dynamic analyses. We provide our model and implement concurrent call graph profiling as a case study. Our evaluation shows that pure communication overhead from the application's point of view is as low as 1%. We expect that our work will help design scalable dynamic analyses and will influence the design of future many-core processors.
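    The decoupled-analysis pattern described above can be sketched in software: the application thread pushes call events into a buffer, and a separate analysis thread builds the call-graph profile concurrently. This is an illustrative sketch using a software queue; the paper instead uses the TILE64's explicit inter-core communication channels, and the names below are assumptions.

    ```python
    import queue
    import threading
    from collections import Counter

    events = queue.Queue()   # stands in for the hardware channel
    profile = Counter()      # edge -> call count (the call-graph profile)

    def analysis_thread():
        # Consumer: runs concurrently with the application, off its critical path.
        while True:
            caller, callee = events.get()
            if caller is None:  # sentinel: application finished
                break
            profile[(caller, callee)] += 1

    t = threading.Thread(target=analysis_thread)
    t.start()

    def traced_call(caller, callee):
        # Instrumentation hook: a cheap buffered send from the app's view.
        events.put((caller, callee))

    traced_call("main", "parse")
    traced_call("main", "parse")
    traced_call("parse", "lex")
    events.put((None, None))
    t.join()
    ```

    The application only pays for the enqueue; all aggregation happens on the analysis side, which is the property the paper measures as ~1% communication overhead.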

    PIM- and Stream Processor-based Processing for Radar Signal Applications

    The growing gap in performance between processor and memory speeds has created a problem for data-intensive applications. Recent approaches for solving this problem are processor-in-memory (PIM) technology and stream processor technology. In this paper, we assess the performance of systems based on PIM and stream processors by implementing data-intensive applications. The implementation results are compared with the measured performance of conventional systems based on the PowerPC and Pentium processors. The results show that the performance of systems based on these processors is improved by up to 70 compared with conventional systems for these data-intensive applications.

    Dynamic power management of multiprocessor systems

    Power management is critical to power-constrained real-time systems. In this paper, we present a dynamic power management algorithm. Unlike other approaches that focus on the tradeoff between power and performance, our algorithm maximizes both power utilization and performance. Our algorithm considers the dynamic nature of the environment, such as changes in the available energy, and adapts system parameters such as the operating voltage, the frequency, and the number of processors. In our algorithm, we divide the power management problem into three sub-problems: initial power allocation, system parameter computation based on the allocated power, and dynamic update of the power and system parameters at run time. Initial power allocation minimizes wasted energy by using extra energy for useful work. It also avoids undersupplied power situations by reducing power usage before such situations happen. The system parameters are computed to maximize the performance for a given power. During runtime, the system parameters are updated continuously to accommodate differences between the expected and real situations. Simulation results of the algorithm for a satellite system using eight Processor-In-Memory (PIM) processors are presented.
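    The three sub-problems above can be sketched as three small functions. All numbers, the power model (power ~ n_procs * v^2 * f), and the update gain are illustrative assumptions, not the paper's actual model.

    ```python
    def allocate_power(available_energy, horizon_s):
        # Initial allocation: spread the energy budget over the planning horizon.
        return available_energy / horizon_s

    def compute_parameters(power_budget, configs):
        # Pick the (voltage, frequency, num_processors) point with the best
        # performance (here: total frequency) that fits within the budget.
        feasible = [(v, f, n) for v, f, n in configs if n * v * v * f <= power_budget]
        return max(feasible, key=lambda c: c[1] * c[2], default=None)

    def runtime_update(power_budget, measured_power, gain=0.5):
        # Dynamic update: correct the budget toward the observed mismatch.
        return power_budget + gain * (power_budget - measured_power)

    # (voltage, frequency, processors) operating points, illustrative values
    configs = [(1.0, 1.0, 8), (0.9, 0.8, 8), (0.8, 0.6, 4)]
    budget = allocate_power(available_energy=3600.0, horizon_s=600.0)  # 6.0 W
    best = compute_parameters(budget, configs)
    ```

    In the paper's setting the update step would run continuously, re-selecting voltage, frequency, and processor count as the measured energy supply drifts from the forecast.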