Optimizing datacenter power with memory system levers for guaranteed quality-of-service
Pre-print. Co-location of applications is a proven technique to improve hardware utilization. Recent advances in virtualization have made co-location of independent applications on shared hardware a common scenario in datacenters. Co-location while maintaining Quality-of-Service (QoS) for each application is a complex problem that is fast gaining relevance for these datacenters. The problem is exacerbated by the need for effective resource utilization at datacenter scales. In this work, we show that the memory system is a primary bottleneck in many workloads and is a more effective focal point when enforcing QoS. We examine four different memory system levers to enforce QoS: two that have been previously proposed, and two novel levers. We compare the effectiveness of each lever in minimizing power and resource needs while enforcing QoS guarantees. We also evaluate the effectiveness of combining various levers and show that this combined approach can yield power reductions of up to 28%.
Hand Foot and Mouth Disease in an infant
ABSTRACT Hand, foot and mouth disease is a self-limiting enteroviral infection characterized by papulovesicular eruptions over the hands and feet with circinate oral ulcers on the palate. The fever associated with the illness subsides in 48 hours and the rash lasts for 7 to 10 days. Rarely, it can also be associated with complications such as encephalitis and myocarditis. We report a one-year-six-month-old infant with hand, foot and mouth disease which resolved without complications.
Performance characterization and optimization of mobile augmented reality on handheld platforms
Abstract: The introduction of low power general purpose processors (like the Intel® Atom™ processor) expands the capability of handheld and mobile internet devices (MIDs) to include compelling visual computing applications. One rapidly emerging visual computing usage model is known as mobile augmented reality (MAR). In the MAR usage model, the user is able to point the handheld camera at an object (like a wine bottle) or a set of objects (like an outdoor scene of buildings or monuments) and the device automatically recognizes and displays information regarding the object(s). Achieving this on the handheld requires significant compute processing, resulting in a response time on the order of several seconds. In this paper, we analyze a MAR workload and identify the primary hotspot functions that incur a large fraction of the overall response time. We also present a detailed architectural characterization of the hotspot functions in terms of CPI, MPI, etc. We then implement and analyze the benefits of several software optimizations: (a) vectorization, (b) multi-threading, (c) cache conflict avoidance and (d) miscellaneous code optimizations that reduce the number of computations. We show that a 3X performance improvement in execution time can be achieved by implementing these optimizations. Overall, we believe our analysis provides a detailed understanding of the processing for a new domain of visual computing workloads (i.e. MAR) running on low power handheld compute platforms.
Prefetching vs. the Memory System: Optimizations for Multi-core Server Platforms
This dissertation investigates prefetching schemes for servers with respect to realistic memory systems. A large body of research has been done on prefetching, even for server workloads that have sparse locality. Yet real systems disable prefetching in server settings, suggesting a fundamental disconnect between research and practice. Our theory, a major point of this thesis, is that this disconnect is due to the use of simplistic memory models; our experimental results show that, among other things, simplistic models can over-predict system performance by up to 65%. Our investigation proceeds as follows:
• (In)Accuracy of Simplistic Memory Models. We demonstrate the degree of inaccuracy of models commonly used in system design: simple models are reasonably accurate when applied to simple systems (e.g. uniprocessors), but they become increasingly inaccurate as system complexity grows, that is, as cores and prefetching are added.
• Memory-side Prefetching. We then perform a detailed case study of a well-known server-oriented prefetch scheme, memory-side sequential prefetch, to understand the interaction between prefetch schemes and memory systems. In particular, we find that the projected performance gains fail to materialize due to the lack of locality in the server benchmarks and the bandwidth constraints introduced by the prefetch requests. We conclude that prefetching studies so far have used the wrong metric to gauge idleness of the memory subsystem and consequently saturate the bus with prefetch requests.
• Multi-core Server Prefetching. We use our newfound understanding of the interplay between prefetching and memory systems to develop a novel scheme for prefetching in server platforms that does interact well with real memory systems. We find that tuning the aggressiveness of prefetching to the average memory latency, which depends on the available bandwidth, performs best in server platforms.
Transparent data-memory organizations for digital signal processors
Today’s digital signal processors (DSPs), unlike general-purpose processors, use a non-uniform addressing model in which the primary components of the memory system (the DRAM and dual tagless SRAMs) are referenced through completely separate segments of the address space. The recent trend of programming DSPs in high-level languages instead of assembly code has exposed this memory model as a potential weakness, as the model makes for a poor compiler target. In many of today’s high-performance DSPs this non-uniform model is being replaced by a uniform model: a transparent organization like that of most general-purpose systems, in which all memory structures share the same address space as the DRAM system. In such a memory organization, one must replace the DSP’s tagless SRAMs with something resembling a general-purpose cache. This study investigates the performance of a range of traditional and slightly non-traditional cache organizations for a high-performanc
Extended Split-Issue: Enabling Flexibility in the Hardware Implementation of NUAL VLIW DSPs
VLIW architecture based DSPs have become widespread due to the combined benefits of simple hardware and compiler-extracted instruction-level parallelism. However, the VLIW instruction set architecture and its hardware implementation are tightly coupled, especially so for Non-Unit Assumed Latency (NUAL) VLIWs. The problem of object code compatibility across processors having different numbers of functional units or hardware latencies has been the Achilles' heel of this otherwise powerful architecture. In this paper, we propose eXtended Split-Issue (XSI), a novel mechanism that breaks the instruction packet syntax of an NUAL VLIW compiler without violating the dataflow dependences. XSI gives a designer the freedom to dissociate the hardware implementation of the NUAL VLIW processor from the instruction set architecture. Further, we investigate fairly radical (in the context of VLIW) changes to the hardware, such as removing an adder, adding a multiplier, and incorporating simultaneous multithreading (SMT), to show that our technique works for a variety of hardware configurations without compromising on performance. The technique can be used in both single-threaded and multi-threaded architectures to achieve a level of flexibility heretofore unavailable in the VLIW arena.
Breast feeding practices among health care professionals in a tertiary care hospital from South India
Personal breastfeeding experiences of health care professionals play a major role in influencing their attitudes and expertise regarding counseling and managing breastfeeding issues in patients. This study was done with the objective of studying the current breastfeeding practices among health care professionals (HP) and their spouses and the factors influencing them. All children < 5 years of age residing in the hospital's residential quarters were included. A detailed breastfeeding history and demographic data were obtained through a semi-structured interview with mothers. Among 81 children included for analysis, breastfeeding was initiated within 24 hours of birth in 73 children (90.1%) and within the first hour of life in 36 children (44.4%). 43 children (58.1%) were exclusively breastfed for 6 months. Mean duration of exclusive breastfeeding (EBF) was 5.3 months and total duration of breastfeeding was 13.2 months. Gender of HP, gender of the child and socio-economic factors were not found to significantly affect breastfeeding practices among HP.