Design and Development of a Run-Time Monitor for Multi-Core Architectures in Cloud Computing
Cloud computing is a new information technology trend that moves computing and data away from desktops and portable PCs into large data centers. The basic principle of cloud computing is to deliver both applications and infrastructure as services over the Internet. A cloud is a type of parallel and distributed system consisting of a collection of inter-connected and virtualized computers that are dynamically provisioned and presented as one or more unified computing resources. Large-scale distributed applications on a cloud require adaptive service-based software, which has the capability of monitoring system status changes, analyzing the monitored information, and adapting its service configuration while considering tradeoffs among multiple QoS features simultaneously. In this paper, we design and develop a Run-Time Monitor (RTM), system software that monitors application behavior at run-time, analyzes the collected information, and optimizes cloud computing resources for multi-core architectures. RTM monitors application software through library instrumentation and the underlying hardware through performance counters, optimizing its computing configuration based on the analyzed data.
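The monitor-analyze-adapt loop the abstract describes can be sketched as follows. This is a minimal illustration, not RTM's actual interface: the class name, the fixed `target_util` threshold, and the stubbed `sample()` method (which stands in for a real performance-counter read) are all assumptions.

```python
# Minimal sketch of a monitor/analyze/adapt loop in the spirit of RTM.
# All names and thresholds are illustrative assumptions; the real RTM reads
# hardware performance counters and instruments libraries.
class RunTimeMonitor:
    def __init__(self, target_util=0.8):
        self.target_util = target_util
        self.num_workers = 4  # current computing configuration

    def sample(self):
        # Stand-in for a performance-counter read; returns utilization in [0, 1].
        return 0.5

    def adapt(self):
        util = self.sample()
        # Analyze the monitored value and adjust the configuration.
        if util > self.target_util and self.num_workers > 1:
            self.num_workers -= 1   # contention: shrink the worker pool
        elif util < self.target_util / 2:
            self.num_workers += 1   # headroom: grow the worker pool
        return self.num_workers
```

In a real deployment the `sample()` stub would be replaced by a counter read (e.g. via the OS performance-monitoring interface), and `adapt()` would run periodically alongside the application.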
QuantPipe: Applying Adaptive Post-Training Quantization for Distributed Transformer Pipelines in Dynamic Edge Environments
Pipeline parallelism has achieved great success in deploying large-scale transformer models in cloud environments, but has received less attention in edge environments. Unlike cloud scenarios with high-speed and stable network interconnects, dynamic bandwidth in edge systems can degrade distributed pipeline performance. We address this issue with QuantPipe, a communication-efficient distributed edge system that introduces post-training quantization (PTQ) to compress the communicated tensors. QuantPipe uses adaptive PTQ to change bitwidths in response to bandwidth dynamics, maintaining transformer pipeline performance while incurring limited inference accuracy loss. We further improve accuracy with a directed-search analytical clipping for integer quantization method (DS-ACIQ), which bridges the gap between estimated and real data distributions. Experimental results show that QuantPipe adapts to dynamic bandwidth to maintain pipeline performance while achieving practical model accuracy across a wide range of quantization bitwidths, e.g., improving accuracy under 2-bit quantization by 15.85% on ImageNet compared to naive quantization.
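The two building blocks of this approach, picking a bitwidth from the current bandwidth and uniformly quantizing with a clipping range, can be sketched as below. This is not QuantPipe's API: the function names, the bitwidth options, and the latency-budget heuristic are assumptions for illustration.

```python
# Hedged sketch of adaptive PTQ for pipeline communication (illustrative only).
def choose_bitwidth(n_elems, bandwidth_bps, budget_s, options=(2, 4, 8, 16)):
    # Keep the largest bitwidth whose transfer time still meets the budget;
    # fall back to the smallest option when bandwidth is too low for any.
    feasible = [b for b in options if n_elems * b / bandwidth_bps <= budget_s]
    return max(feasible) if feasible else min(options)

def quantize(xs, bits, clip):
    # Symmetric uniform quantization into signed integers, values clipped to
    # [-clip, clip]; the clipping range is what a method like ACIQ estimates.
    qmax = 2 ** (bits - 1) - 1
    scale = clip / qmax
    return [max(-qmax, min(qmax, round(max(-clip, min(clip, x)) / scale)))
            for x in xs]
```

The key design point the abstract highlights is that `clip` matters most at low bitwidths (e.g. 2-bit), which is where a better clipping search such as DS-ACIQ pays off.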
Opportunities for Concurrent Dynamic Analysis with Explicit Inter-core Communication
Multicore is now the dominant processor trend, and the number of cores is rapidly increasing. The paradigm shift to multicore forces a redesign of the software stack, including dynamic analysis. Dynamic analyses provide rich features to software in areas such as debugging, testing, optimization, and security. However, these techniques often suffer from excessive overhead, which makes them less practical. Previously, this overhead was overcome by improved processor performance as each generation grew faster, but the performance requirements of dynamic analyses in the multicore era cannot be fulfilled without redesigning them for parallelism. Scalable design of dynamic analysis is a challenging problem. Not only must the analysis itself be parallel, but it must also be decoupled from the application and run concurrently. A typical method of decoupling the analysis from the application is to send the analysis data from the application to the core that runs the analysis thread via buffering. However, buffering can perturb application cache performance, and the cache coherence protocol may not be efficient, or even implemented, with the large core counts of future processors. This paper presents our initial effort to explore the hardware design space and software approaches that alleviate the scalability problem for dynamic analysis on multicore. We make use of the explicit inter-core communication already available in a real processor, the TILE64, and evaluate the opportunity for scalable dynamic analyses. We provide our model and implement concurrent call graph profiling as a case study. Our evaluation shows that pure communication overhead from the application's point of view is as low as 1%. We expect that our work will help in designing scalable dynamic analyses and will influence the design of future many-core processors.
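The decoupled, buffered analysis the abstract describes can be illustrated in software with a producer/consumer pair: the application enqueues call events cheaply, while a separate analysis thread consumes them and builds the call graph. This is a thread-and-queue analogue, assumed for illustration; the paper's point is that the TILE64's explicit inter-core channels can replace this shared-memory buffer.

```python
import queue
import threading
from collections import defaultdict

# Sketch of decoupled concurrent call-graph profiling (illustrative analogue
# of the paper's buffered inter-core communication, not its implementation).
def profile(events):
    buf = queue.Queue()           # stands in for the inter-core channel
    call_graph = defaultdict(int)

    def analyzer():
        # Analysis thread: drain the buffer and count caller->callee edges.
        while True:
            item = buf.get()
            if item is None:      # sentinel: the application has finished
                break
            call_graph[item] += 1

    t = threading.Thread(target=analyzer)
    t.start()
    for e in events:              # application side: cheap enqueue only
        buf.put(e)
    buf.put(None)
    t.join()
    return dict(call_graph)
```

The application-side cost is just the enqueue, which is why the measured pure communication overhead can be as low as the 1% the abstract reports when the channel is a hardware one.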
PIM- and Stream Processor-based Processing for Radar Signal Applications
The growing gap between processor and memory speeds has created a problem for data-intensive applications. Recent approaches to this problem are processor-in-memory (PIM) technology and stream processor technology. In this paper, we assess the performance of systems based on PIM and stream processors by implementing data-intensive applications. The implementation results are compared with the measured performance of conventional systems based on PowerPC and Pentium processors. The results show that the performance of systems based on these processors is improved by up to 70 compared with conventional systems for these data-intensive applications.
Dynamic power management of multiprocessor systems
Power management is critical to power-constrained real-time systems. In this paper, we present a dynamic power management algorithm. Unlike other approaches that focus on the tradeoff between power and performance, our algorithm maximizes both power utilization and performance. It considers the dynamic nature of the environment, such as changes in the available energy, and adapts system parameters such as the operating voltage, frequency, and number of processors. We divide the power management problem into three sub-problems: initial power allocation, computation of system parameters based on the allocated power, and dynamic update of the power and system parameters at run time. Initial power allocation minimizes wasted energy by using extra energy for useful work; it also avoids undersupplied-power situations by reducing power usage before they occur. The system parameters are computed to maximize performance for a given power. At runtime, the system parameters are updated continuously to accommodate differences between expected and actual conditions. Simulation results of the algorithm for a satellite system using eight Processor-In-Memory (PIM) processors are presented.
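The three sub-problems can be sketched as three small functions. Everything here is an assumption for illustration: the P ≈ n·V²·f dynamic-power model, the performance proxy f·n, and the proportional runtime correction are not the paper's actual formulation.

```python
# Illustrative three-step dynamic power management sketch (assumed model:
# dynamic power ~ n * V^2 * f, performance ~ f * n; not the paper's equations).

def allocate(total_energy_j, horizon_s):
    # Sub-problem 1: initial allocation spreads available energy over the
    # mission horizon, yielding an average power budget in watts.
    return total_energy_j / horizon_s

def configure(power_w, configs):
    # Sub-problem 2: among (freq, volts, n_procs) candidates, pick the one
    # with the best performance whose modeled power fits the budget.
    feasible = [(f * n, (f, v, n)) for f, v, n in configs
                if n * v * v * f <= power_w]
    return max(feasible)[1] if feasible else min(configs)

def update(budget_w, used_avg_w, elapsed_frac):
    # Sub-problem 3: at runtime, redistribute the energy surplus or deficit
    # relative to plan over the remaining fraction of the horizon.
    expected_e = budget_w * elapsed_frac
    actual_e = used_avg_w * elapsed_frac
    remaining = 1.0 - elapsed_frac
    return budget_w + (expected_e - actual_e) / max(remaining, 1e-9)
```

For example, a system that has spent only 8 W of a 10 W budget halfway through the horizon can raise its budget to 12 W for the remainder, using the surplus for useful work rather than wasting it.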