420 research outputs found
Performance and Reliability of Non-Markovian Heterogeneous Distributed Computing Systems
Average service time, quality-of-service (QoS), and service reliability associated with heterogeneous parallel and distributed computing systems (DCSs) are analytically characterized in a realistic setting for which tangible, stochastic communication delays are present with nonexponential distributions. The departure from the traditionally assumed exponential distributions for event times, such as task-execution times, communication arrival times and load-transfer delays, gives rise to a non-Markovian dynamical problem for which a novel age dependent, renewal-based distributed queuing model is developed. Numerical examples offered by the model shed light on the operational and system settings for which the Markovian setting, resulting from employing an exponential-distribution assumption on the event times, yields inaccurate predictions. A key benefit of the model is that it offers a rigorous framework for devising optimal dynamic task reallocation (DTR) policies systematically in heterogeneous DCSs by optimally selecting the fraction of the excess loads that need to be exchanged among the servers, thereby controlling the degree of cooperative processing in a DCSs. Key results on performance prediction and optimization of DCSs are validated using Monte-Carlo (MC) simulation as well as experiments on a distributed computing testbed. The scalability, in the number of servers, of the age-dependent model is studied and a linearly scalable analytical approximation is derived
Evaluating the Robustness of Resource Allocations Obtained through Performance Modeling with Stochastic Process Algebra
Recent developments in the field of parallel and distributed computing has led to a proliferation of solving large and computationally intensive mathematical, science, or engineering problems, that consist of several parallelizable parts and several non-parallelizable (sequential) parts. In a parallel and distributed computing environment, the performance goal is to optimize the execution of parallelizable parts of an application on concurrent processors. This requires efficient application scheduling and resource allocation for mapping applications to a set of suitable parallel processors such that the overall performance goal is achieved. However, such computational environments are often prone to unpredictable variations in application (problem and algorithm) and system characteristics. Therefore, a robustness study is required to guarantee a desired level of performance. Given an initial workload, a mapping of applications to resources is considered to be robust if that mapping optimizes execution performance and guarantees a desired level of performance in the presence of unpredictable perturbations at runtime. In this research, a stochastic process algebra, Performance Evaluation Process Algebra (PEPA), is used for obtaining resource allocations via a numerical analysis of performance modeling of the parallel execution of applications on parallel computing resources. The PEPA performance model is translated into an underlying mathematical Markov chain model for obtaining performance measures. Further, a robustness analysis of the allocation techniques is performed for finding a robustmapping from a set of initial mapping schemes. The numerical analysis of the performance models have confirmed similarity with the simulation results of earlier research available in existing literature. When compared to direct experiments and simulations, numerical models and the corresponding analyses are easier to reproduce, do not incur any setup or installation costs, do not impose any prerequisites for learning a simulation framework, and are not limited by the complexity of the underlying infrastructure or simulation libraries
A synthesis of logic and bio-inspired techniques in the design of dependable systems
Much of the development of model-based design and dependability analysis in the design of dependable systems, including software intensive systems, can be attributed to the application of advances in formal logic and its application to fault forecasting and verification of systems. In parallel, work on bio-inspired technologies has shown potential for the evolutionary design of engineering systems via automated exploration of potentially large design spaces. We have not yet seen the emergence of a design paradigm that effectively combines these two techniques, schematically founded on the two pillars of formal logic and biology, from the early stages of, and throughout, the design lifecycle. Such a design paradigm would apply these techniques synergistically and systematically to enable optimal refinement of new designs which can be driven effectively by dependability requirements. The paper sketches such a model-centric paradigm for the design of dependable systems, presented in the scope of the HiP-HOPS tool and technique, that brings these technologies together to realise their combined potential benefits. The paper begins by identifying current challenges in model-based safety assessment and then overviews the use of meta-heuristics at various stages of the design lifecycle covering topics that span from allocation of dependability requirements, through dependability analysis, to multi-objective optimisation of system architectures and maintenance schedules
Designing, Building, and Modeling Maneuverable Applications within Shared Computing Resources
Extending the military principle of maneuver into war-fighting domain of cyberspace, academic and military researchers have produced many theoretical and strategic works, though few have focused on researching actual applications and systems that apply this principle. We present our research in designing, building and modeling maneuverable applications in order to gain the system advantages of resource provisioning, application optimization, and cybersecurity improvement. We have coined the phrase “Maneuverable Applications” to be defined as distributed and parallel application that take advantage of the modification, relocation, addition or removal of computing resources, giving the perception of movement. Our work with maneuverable applications has been within shared computing resources, such as the Clemson University Palmetto cluster, where multiple users share access and time to a collection of inter-networked computers and servers. In this dissertation, we describe our implementation and analytic modeling of environments and systems to maneuver computational nodes, network capabilities, and security enhancements for overcoming challenges to a cyberspace platform. Specifically we describe our work to create a system to provision a big data computational resource within academic environments. We also present a computing testbed built to allow researchers to study network optimizations of data centers. We discuss our Petri Net model of an adaptable system, which increases its cybersecurity posture in the face of varying levels of threat from malicious actors. Lastly, we present work and investigation into integrating these technologies into a prototype resource manager for maneuverable applications and validating our model using this implementation
Optimizing performance of workflow executions under authorization control
“Business processes or workflows are often used to
model enterprise or scientific applications. It has
received considerable attention to automate workflow
executions on computing resources. However, many
workflow scenarios still involve human activities and
consist of a mixture of human tasks and computing
tasks.
Human involvement introduces security and
authorization concerns, requiring restrictions on who
is allowed to perform which tasks at what time. Role-
Based Access Control (RBAC) is a popular authorization
mechanism. In RBAC, the authorization concepts such as
roles and permissions are defined, and various
authorization constraints are supported, including
separation of duty, temporal constraints, etc. Under
RBAC, users are assigned to certain roles, while the
roles are associated with prescribed permissions.
When we assess resource capacities, or evaluate the
performance of workflow executions on supporting
platforms, it is often assumed that when a task is
allocated to a resource, the resource will accept the
task and start the execution once a processor becomes available. However, when the authorization policies
are taken into account,” this assumption may not be
true and the situation becomes more complex. For
example, when a task arrives, a valid and activated
role has to be assigned to a task before the task can
start execution. The deployed authorization
constraints may delay the workflow execution due to
the roles’ availability, or other restrictions on the
role assignments, which will consequently have
negative impact on application performance.
When the authorization constraints are present to
restrict the workflow executions, it entails new
research issues that have not been studied yet in
conventional workflow management. This thesis aims to
investigate these new research issues.
First, it is important to know whether a feasible
authorization solution can be found to enable the
executions of all tasks in a workflow, i.e., check the
feasibility of the deployed authorization constraints.
This thesis studies the issue of the feasibility
checking and models the feasibility checking problem
as a constraints satisfaction problem.
Second, it is useful to know when the performance of
workflow executions will not be affected by the given
authorization constraints. This thesis proposes the
methods to determine the time durations when the given
authorization constraints do not have impact.
Third, when the authorization constraints do have
the performance impact, how can we quantitatively
analyse and determine the impact? When there are multiple choices to assign the roles to the tasks,
will different choices lead to the different
performance impact? If so, can we find an optimal way
to conduct the task-role assignments so that the
performance impact is minimized? This thesis proposes
the method to analyze the delay caused by the
authorization constraints if the workflow arrives
beyond the non-impact time duration calculated above.
Through the analysis of the delay, we realize that the
authorization method, i.e., the method to select the
roles to assign to the tasks affects the length of the
delay caused by the authorization constraints. Based
on this finding, we propose an optimal authorization
method, called the Global Authorization Aware (GAA)
method.
Fourth, a key reason why authorization constraints
may have impact on performance is because the
authorization control directs the tasks to some
particular roles. Then how to determine the level of
workload directed to each role given a set of
authorization constraints? This thesis conducts the
theoretical analysis about how the authorization
constraints direct the workload to the roles, and
proposes the methods to calculate the arriving rate of
the requests directed to each role under the role,
temporal and cardinality constraints.
Finally, the amount of resources allocated to
support each individual role may have impact on the
execution performance of the workflows. Therefore, it
is desired to develop the strategies to determine the
adequate amount of resources when the authorization
control is present in the system. This thesis presents the methods to allocate the appropriate quantity for
resources, including both human resources and
computing resources. Different features of human
resources and computing resources are taken into
account. For human resources, the objective is to
maximize the performance subject to the budgets to
hire the human resources, while for computing
resources, the strategy aims to allocate adequate
amount of computing resources to meet the QoS
requirements
Theory of Resource Allocation for Robust Distributed Computing
Lately, distributed computing (DC) has emerged in several application scenarios such as grid computing, high-performance and reconfigurable computing, wireless sensor networks, battle management systems, peer-to-peer networks, and donation grids. When DC is performed in these scenarios, the distributed computing system (DCS) supporting the applications not only exhibits heterogeneous computing resources and a significant communication latency, but also becomes highly dynamic due to the communication network as well as the computing servers are affected by a wide class of anomalies that change the topology of the system in a random fashion. These anomalies exhibit spatial and/or temporal correlation when they result, for instance, from wide-area power or network outages These correlated failures may not only inflict a large amount of damage to the system, but they may also induce further failures in other servers as a result of the lack of reliable communication between the components of the DCS. In order to provide a robust DC environment in the presence of component failures, it is key to develop a general framework for accurately modeling the complex dynamics of a DCS. In this dissertation a novel approach has been undertaken for modeling a general class of DCSs and for analytically characterizing the performance and reliability of parallel applications executed on such systems. A general probabilistic model has been constructed by assuming that the random times governing the dynamics of the DCS follow arbitrary probability distributions with heterogeneous parameters. Auxiliary age variables have been introduced in the modeling of a DCS and a hybrid continuous and discrete state-space model the system has been constructed. This hybrid model has enabled the development of an age-dependent stochastic regeneration theory, which, in turn, has been employed to analytically characterize the average execution time, the quality-of-service and the reliability in serving an application. These are three metrics of performance and reliability of practical interest in DC. Analytical approximations as well as mathematical lower and upper bounds for these metrics have also been derived in an attempt to reduce the amount of computational resources demanded by the exact characterizations. In order to systematically assess the reliability of DCSs in the presence of correlated component failures, a novel probabilistic model for spatially correlated failures has been developed. The model, based on graph theory and Markov random fields, captures both geographical and logical correlations induced by the arbitrary topology of the communication network of a DCS. The modeling framework, in conjunction with a general class of dynamic task reallocation (DTR) control policies, has been used to optimize the performance and reliability of applications in the presence of independent as well as spatially correlated anomalies. Theoretical predictions, Monte- Carlo simulations as well as experimental results have shown that optimizing these metrics can significantly impact the performance of a DCS. Moreover, the general setting developed here has shed insights on: (i) the effect of different stochastic mod- els on the accuracy of the performance and reliability metrics, (ii) the dependence of the DTR policies on system parameters such as failure rates and task-processing rates, (iii) the severe impact of correlated failures on the reliability of DCSs, (iv) the dependence of the DTR policies on degree of correlation in the failures, and (v) the fundamental trade-off between minimizing the execution time of an application and maximizing its reliability
Architecture-Level Software Performance Models for Online Performance Prediction
Proactive performance and resource management of modern IT infrastructures requires the ability to predict at run-time, how the performance of running services would be affected if the workload or the system changes. In this thesis, modeling and prediction facilities that enable online performance prediction during system operation are presented. Analyses about the impact of reconfigurations and workload trends can be conducted on the model level, without executing expensive performance tests
Workflow models for heterogeneous distributed systems
The role of data in modern scientific workflows becomes more and more crucial. The unprecedented amount of data available in the digital era, combined with the recent advancements in Machine Learning and High-Performance Computing (HPC), let computers surpass human performances in a wide range of fields, such as Computer Vision, Natural Language Processing and Bioinformatics. However, a solid data management strategy becomes crucial for key aspects like performance optimisation, privacy preservation and security.
Most modern programming paradigms for Big Data analysis adhere to the principle of data locality: moving computation closer to the data to remove transfer-related overheads and risks. Still, there are scenarios in which it is worth, or even unavoidable, to transfer data between different steps of a complex workflow.
The contribution of this dissertation is twofold. First, it defines a novel methodology for distributed modular applications, allowing topology-aware scheduling and data management while separating business logic, data dependencies, parallel patterns and execution environments. In addition, it introduces computational notebooks as a high-level and user-friendly interface to this new kind of workflow, aiming to flatten the learning curve and improve the adoption of such methodology.
Each of these contributions is accompanied by a full-fledged, Open Source implementation, which has been used for evaluation purposes and allows the interested reader to experience the related methodology first-hand. The validity of the proposed approaches has been demonstrated on a total of five real scientific applications in the domains of Deep Learning, Bioinformatics and Molecular Dynamics Simulation, executing them on large-scale mixed cloud-High-Performance Computing (HPC) infrastructures
Simulating fog and edge computing scenarios: an overview and research challenges
The fourth industrial revolution heralds a paradigm shift in how people, processes, things, data and networks communicate and connect with each other. Conventional computing infrastructures are struggling to satisfy dramatic growth in demand from a deluge of connected heterogeneous endpoints located at the edge of networks while, at the same time, meeting quality of service levels. The complexity of computing at the edge makes it increasingly difficult for infrastructure providers to plan for and provision resources to meet this demand. While simulation frameworks are used extensively in the modelling of cloud computing environments in order to test and validate technical solutions, they are at a nascent stage of development and adoption for fog and edge computing. This paper provides an overview of challenges posed by fog and edge computing in relation to simulation
- …