Enhancing the Accuracy of Synthetic File System Benchmarks

Author: Farhat Salam
Publication venue: NSUWorks
Publication date: 01/01/2017

File system benchmarking plays an essential part in assessing a file system’s performance. Measuring and studying that performance is especially difficult because a file system deals with several layers of hardware and software. Furthermore, different systems have different workload characteristics, so a file system optimized for one workload might not perform optimally under other types of workloads. Thus, it is imperative that the file system under study be examined with a workload equivalent to its production workload to ensure that it is optimized according to its usage. The most widely used benchmarking method is synthetic benchmarking due to its ease of use and flexibility. The flexibility of synthetic benchmarks allows system designers to produce a variety of different workloads that provide insight into how the file system will perform under slightly different conditions. The downside of synthetic benchmarks is that they produce generic workloads that do not have the same characteristics as production workloads. For instance, synthetic benchmarks do not take into consideration the effects of the cache, which can greatly impact the performance of the underlying file system. In addition, they do not model the variation in a given workload. This can lead to file systems that are not optimally designed for their usage. This work enhanced synthetic workload generation methods by taking into consideration how the file system operations are satisfied by the lower level function calls. In addition, this work modeled the variations of the workload’s footprint when present. The first step in the methodology was to run a given workload and trace it with a tool called tracefs. The collected traces contained data on the file system operations and the lower level function calls that satisfied these operations. Then the trace was divided into chunks small enough that the workload characteristics of each chunk could be considered uniform.
Then a configuration file that modeled each chunk was generated and supplied to FileRunner, a synthetic workload generator tool created by this work. The workload definition for each chunk allowed FileRunner to generate a synthetic workload that produced the same workload footprint as the corresponding segment in the original workload. In other words, the synthetic workload would exercise the lower level function calls in the same way as the original workload. Furthermore, FileRunner generated a synthetic workload for each specified segment in the order in which the segments appeared in the trace, resulting in a final workload mimicking the variation present in the original workload. The results indicated that the methodology can create a workload whose throughput is within 10% of the original and whose operation latencies, with the exception of the create latencies, are within the allowable 10% difference or, in some cases, within the 15% maximum allowable difference. The work was able to accurately model the I/O footprint. In some cases the difference was negligible, and in the worst case it was 2.49%.
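
The chunking step described in this abstract can be illustrated with a minimal Python sketch. It is not FileRunner or tracefs code: the record layout (a timestamp paired with an operation name), the fixed time window, and the function names `split_into_chunks`, `operation_mix`, and `percent_difference` are all assumptions made for illustration, and the abstract does not say how chunk boundaries were actually chosen.

```python
from collections import Counter
from typing import Dict, List, Tuple

# Hypothetical trace record: (timestamp in seconds, operation name).
TraceRecord = Tuple[float, str]


def split_into_chunks(trace: List[TraceRecord], window: float) -> List[List[TraceRecord]]:
    """Group records into consecutive time windows of `window` seconds.

    Assumes the trace is sorted by timestamp. Windows with no records are skipped.
    """
    if not trace:
        return []
    start = trace[0][0]
    chunks: Dict[int, List[TraceRecord]] = {}
    for ts, op in trace:
        index = int((ts - start) // window)
        chunks.setdefault(index, []).append((ts, op))
    return [chunks[i] for i in sorted(chunks)]


def operation_mix(chunk: List[TraceRecord]) -> Dict[str, float]:
    """Return the fraction of each operation type within one chunk."""
    counts = Counter(op for _, op in chunk)
    total = sum(counts.values())
    return {op: n / total for op, n in counts.items()}


def percent_difference(original: float, synthetic: float) -> float:
    """Relative difference of a synthetic metric from the original, in percent."""
    return abs(synthetic - original) / original * 100.0


trace = [(0.0, "read"), (0.5, "write"), (1.2, "read"), (1.8, "read")]
chunks = split_into_chunks(trace, window=1.0)
mixes = [operation_mix(c) for c in chunks]
```

Each per-chunk mix could then be written out as one workload definition, in trace order, so that a generator replays the variation across chunks rather than a single averaged profile.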

NSU Works


Record and Transplay: Partial Checkpointing for Replay Debugging

Author: Nieh Jason
Subhraveti Dinesh Kumar
Publication venue: Columbia University Libraries/Information Services
Publication date: 01/01/2009

Software bugs that occur in production are often difficult to reproduce in the lab due to subtle differences in the application environment and nondeterminism. Toward addressing this problem, we present Transplay, a system that captures application software bugs as they occur in production and deterministically reproduces them in a completely different environment, potentially running a different operating system, where the application, its binaries and other support data do not exist. Transplay introduces partial checkpointing, a new mechanism that provides two key properties. It efficiently captures the minimal state necessary to re-execute just the last few moments of the application before it encountered a failure. The recorded state, which typically consists of a few megabytes of data, is used to replay the application without requiring the specific application binaries or the original execution environment. Transplay integrates with existing debuggers to provide facilities such as breakpoints and single-stepping to allow the user to examine the contents of variables and other program state at each source line of the application's replayed execution. We have implemented a Transplay prototype that can record unmodified Linux applications and replay them on different versions of Linux as well as Windows. Experiments with server applications such as the Apache web server show that Transplay can be used in production with modest recording overhead.

Columbia University Academic Commons

Doctor of Philosophy

Author: Burtsev Anton
Publication venue: University of Utah
Publication date: 01/05/2013

A modern software system is a composition of parts that are themselves highly complex: operating systems, middleware, libraries, servers, and so on. In principle, compositionality of interfaces means that we can understand any given module independently of the internal workings of other parts. In practice, however, abstractions are leaky, and with every generation, modern software systems grow in complexity. Traditional ways of understanding failures, explaining anomalous executions, and analyzing performance are reaching their limits in the face of emergent behavior, unrepeatability, cross-component execution, software aging, and adversarial changes to the system at run time. Deterministic systems analysis has the potential to change the way we analyze and debug software systems. Recorded once, the execution of the system becomes an independent artifact, which can be analyzed offline. The availability of the complete system state, the guaranteed behavior of re-execution, and the absence of limitations on the run-time complexity of analysis collectively enable the deep, iterative, and automatic exploration of the dynamic properties of the system. This work creates a foundation for making deterministic replay a ubiquitous system analysis tool. It defines design and engineering principles for building fast and practical replay machines capable of capturing the complete execution of the entire operating system with an overhead of a few percent, on a realistic workload, and with minimal installation costs. To enable an intuitive interface for constructing replay analysis tools, this work implements a powerful virtual machine introspection layer that enables an analysis algorithm to be programmed against the state of the recorded system through the familiar terms of source-level variable and type names. To support performance analysis, the replay engine provides a faithful performance model of the original execution during replay.

The University of Utah: J. Willard Marriott Digital Library

Enabling preemptive multiprogramming on GPUs

Author: Cabezas Javier
Gelado Fernandez Isaac
Navarro Nacho
Ramírez Bellido Alejandro
Tanasic Ivan
Valero Cortés Mateo
Publication venue: Institute of Electrical and Electronics Engineers (IEEE)
Publication date: 01/01/2014

GPUs are being increasingly adopted as compute accelerators in many domains, spanning environments from mobile systems to cloud computing. These systems usually run multiple applications, from one or several users. However, GPUs do not provide the support for resource sharing traditionally expected in these scenarios. Thus, such systems are unable to provide key multiprogrammed workload requirements, such as responsiveness, fairness or quality of service. In this paper, we propose a set of hardware extensions that allow GPUs to efficiently support multiprogrammed GPU workloads. We argue for preemptive multitasking and design two preemption mechanisms that can be used to implement GPU scheduling policies. We extend the architecture to allow concurrent execution of GPU kernels from different user processes and implement a scheduling policy that dynamically distributes the GPU cores among concurrently running kernels, according to their priorities. We extend an NVIDIA GK110 (Kepler)-like GPU architecture with our proposals and evaluate them on a set of multiprogrammed workloads with up to eight concurrent processes. Our proposals improve the execution time of high-priority processes by 15.6x, the average application turnaround time by 1.5x to 2x, and system fairness by up to 3.4x.

We would like to thank the anonymous reviewers, Alexander Veidenbaum, Carlos Villavieja, Lluis Vilanova, Lluc Alvarez, and Marc Jorda for their comments and help improving our work and this paper. This work is supported by the European Commission through the TERAFLUX (FP7-249013), Mont-Blanc (FP7-288777), and RoMoL (GA-321253) projects, NVIDIA through the CUDA Center of Excellence program, the Spanish Government through Programa Severo Ochoa (SEV-2011-0067), and the Spanish Ministry of Science and Technology through the TIN2007-60625 and TIN2012-34557 projects.

Peer Reviewed. Postprint (author’s final draft).

Crossref

UPCommons. Portal del coneixement obert de la UPC

CRAID: Online RAID upgrades using dynamic hot data reorganization

Author: Cortés Toni
Miranda Bueno Alberto
Publication venue: USENIX Association
Publication date: 01/01/2014

Current algorithms used to upgrade RAID arrays typically require large amounts of data to be migrated, even those that move only the minimum amount of data required to keep a balanced data load. This paper presents CRAID, a self-optimizing RAID array that performs an online block reorganization of frequently used, long-term accessed data in order to reduce this migration even further. To achieve this objective, CRAID tracks frequently used, long-term data blocks and copies them to a dedicated partition spread across all the disks in the array. When new disks are added, CRAID only needs to extend this process to the new devices to redistribute this partition, thus greatly reducing the overhead of the upgrade process. In addition, the reorganized access patterns within this partition improve the array’s performance, amortizing the copy overhead and allowing CRAID to offer performance competitive with traditional RAIDs. We describe CRAID’s motivation and design, and we evaluate it by replaying seven real-world workloads including a file server, a web server and a user share. Our experiments show that CRAID can successfully detect hot data variations and begin using new disks as soon as they are added to the array. Also, the use of a dedicated partition improves the sequentiality of relevant data access, which amortizes the cost of reorganizations. Finally, we prove that a full-HDD CRAID array with a small distributed partition (<1.28% per disk) can compete in performance with an ideally restriped RAID-5 and a hybrid RAID-5 with a small SSD cache.

Peer Reviewed. Postprint (published version).
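
As a rough illustration of the idea of tracking frequently used blocks for a dedicated partition, the sketch below counts accesses per block and picks the most frequently accessed ones up to a fixed capacity. This is not CRAID's actual policy: the function name `select_hot_blocks`, the integer block IDs, and the plain frequency count (which ignores the long-term aspect the abstract mentions) are assumptions made for illustration.

```python
from collections import Counter
from typing import Iterable, List


def select_hot_blocks(accesses: Iterable[int], capacity: int) -> List[int]:
    """Return up to `capacity` block IDs with the highest access counts."""
    counts = Counter(accesses)
    return [block for block, _ in counts.most_common(capacity)]


# Hypothetical access log: each entry is the ID of the block that was accessed.
access_log = [7, 3, 7, 9, 7, 3, 1]
hot = select_hot_blocks(access_log, capacity=2)
```

Here `capacity` stands in for the size of the dedicated partition, expressed as a number of blocks.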

UPCommons. Portal del coneixement obert de la UPC

Simulation of MPI applications with time-independent traces

Author: Adve
Bagrodia
Browne
Chassin de Kergommeaux
Dickens
Gabriel
Geimer
Genovese
Gropp
Knüpfer
Núñez
Prakash
Reussner
Shende
Zheng
Publication venue: Wiley
Publication date: 01/04/2015

Analyzing and understanding the performance behavior of parallel applications on parallel computing platforms is a long-standing concern in the High Performance Computing community. When the targeted platforms are not available, simulation is a reasonable approach to obtain objective performance indicators and explore various hypothetical scenarios. In the context of applications implemented with the Message Passing Interface, two simulation methods have been proposed, on-line simulation and off-line simulation, both with their own drawbacks and advantages. In this work we present an off-line simulation framework, i.e., one that simulates the execution of an application based on event traces obtained from an actual execution. The main novelty of this work, when compared to previously proposed off-line simulators, is that the traces that drive the simulation can be acquired on large, distributed, heterogeneous, and non-dedicated platforms. As a result the scalability of trace acquisition is increased, which is achieved by enforcing that traces contain no time-related information. Moreover, our framework is based on a state-of-the-art scalable, fast, and validated simulation kernel. We introduce the notion of performing off-line simulation from time-independent traces, propose and evaluate several trace acquisition strategies, describe our simulation framework, and assess its quality in terms of trace acquisition scalability, simulation accuracy, and simulation time.

HAL-ENS-LYON

HAL-IN2P3

Crossref

INRIA a CCSD electronic archive server

Hal-Diderot