
    Scalable Impairment-Aware Anycast Routing in Multi-Domain Optical Grid Networks

    ABSTRACT In optical Grid networks, the main challenge is to account not only for network parameters but also for resource availability. Anycast routing has previously been proposed as an effective solution to provide job scheduling services in optical Grids, offering a generic interface to access Grid resources and services. The main weakness of this approach is its limited scalability, especially in a multi-domain scenario. This paper proposes a novel anycast proxy architecture, which extends the anycast principle to a multi-domain scenario. The main purpose of the architecture is to aggregate resource and network states, and thereby improve computational scalability and reduce control plane traffic. Furthermore, the architecture has the desirable properties of allowing Grid domains to maintain their autonomy and hide internal configuration details from other domains. Finally, we propose an impairment-aware anycast routing algorithm that incorporates the main physical layer characteristics of large-scale optical networks into its path computation process. By integrating the proposed routing scheme into the introduced architecture we demonstrate significant network performance improvements. Keywords: Grid computing, routing algorithms, optical networks, physical impairments, anycast routing. INTRODUCTION Today, the need for network systems to support storage and computing services for science and business is often satisfied by relatively isolated computing infrastructure (clusters). Migration to truly distributed and integrated applications requires optimization and (re)design of the underlying network technology to create a Grid platform for the cost- and resource-efficient delivery of network services with substantial data transfer, processing power and/or data storage requirements. Optical networks offer an undeniable potential for the Grid, given their proven track record in the context of high-speed, long-haul networking.
Not only eScience applications dealing with large experimental data sets (e.g. particle physics) but also business/consumer oriented applications can benefit from optical Grid infrastructure [1]: both the high data rates typical of eScience applications and the low latency requirements of consumer/business applications (cf. interactivity) can effectively be addressed. When using transparent WDM as such network technology, signals are transported end-to-end optically without being converted to the electrical domain in between. Provisioning of all-optical connections (lightpaths) between source and destination nodes is based on specific routing and wavelength assignment (RWA) algorithms. Traditional RWA schemes only account for network conditions such as connectivity and available capacity, without considering physical layer details. However, in transparent optical networks covering large geographical areas, the optical signal experiences the accumulation of physical impairments through transmission and switching, possibly resulting in unacceptable signal quality. Another emerging and challenging task in distributed and heterogeneous computing environments is job scheduling: when and where to execute a given Grid job, based on the requirements of the job (for instance a deadline and minimal computational power) and the current state of the network and resources. Traditionally, a local scheduler optimizes utilization and performance of a single Grid site, while a meta-scheduler distributes workload across different sites. Current implementations of these (meta-)schedulers only account for Grid resource availability. In this paper we propose a novel architecture to support impairment-aware anycast routing for large-scale optical Grid networks. Section 2 discusses general approaches to support multi-domain networks. We then proceed to introduce a novel architecture, which can provide anycast Grid services in a multi-domain scenario (Section 3).
Simulation analysis is used to demonstrate the improved scalability without incurring significant performance loss. Furthermore, Section 4 shows how to incorporate physical layer impairments, to further improve the performance of optical Grid networks. Conclusions are presented in Section 5.
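
The idea of impairment-aware anycast routing described in this abstract can be illustrated with a minimal Python sketch. It is not the authors' algorithm: it assumes, purely for illustration, that physical impairments can be modelled as a single additive, non-negative penalty per link and that a lightpath is acceptable when the accumulated penalty stays below a threshold. The function name `anycast_route`, the graph layout and the `max_impairment` parameter are all hypothetical.

```python
import heapq

def anycast_route(graph, source, candidates, max_impairment):
    """Return (site, path, impairment) for the candidate site reachable with
    the lowest accumulated impairment, or None if no site is within budget.

    graph: {node: [(neighbor, link_impairment), ...]} with non-negative
           penalties (an assumed, simplified impairment model).
    candidates: set of nodes that currently have free Grid resources.
    """
    best = {source: 0.0}
    prev = {}
    heap = [(0.0, source)]
    settled = set()
    while heap:
        cost, node = heapq.heappop(heap)
        if node in settled:
            continue
        settled.add(node)
        # Dijkstra settles nodes in non-decreasing cost order, so once the
        # budget is exceeded no later node can satisfy it either.
        if cost > max_impairment:
            return None
        # Anycast: the first candidate settled is the best destination.
        if node in candidates:
            path = [node]
            while path[-1] != source:
                path.append(prev[path[-1]])
            return node, path[::-1], cost
        for neighbor, penalty in graph.get(node, []):
            new_cost = cost + penalty
            if new_cost < best.get(neighbor, float("inf")):
                best[neighbor] = new_cost
                prev[neighbor] = node
                heapq.heappush(heap, (new_cost, neighbor))
    return None

# Hypothetical toy topology: two resource sites, "b" and "c".
toy_graph = {
    "s": [("a", 1.0), ("b", 4.0)],
    "a": [("c", 1.5)],
    "b": [],
    "c": [],
}
route = anycast_route(toy_graph, "s", {"b", "c"}, max_impairment=3.0)
blocked = anycast_route(toy_graph, "s", {"b", "c"}, max_impairment=2.0)
```

In this toy case the request is served at site `c` over `s`, `a`, `c`; with the tighter budget no site qualifies and the request is blocked. A realistic model would also have to handle wavelength assignment and non-additive impairments, which this sketch leaves out.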

    Scalable dimensioning of resilient Lambda Grids

    This article appeared in a journal published by Elsevier.

    Resilient network dimensioning for optical grid/clouds using relocation

    In this paper we address the problem of dimensioning infrastructure, comprising both network and server resources, for large-scale decentralized distributed systems such as grids or clouds. We will provide an overview of our work in this area, and in particular focus on how to design the resulting grid/cloud to be resilient against network link and/or server site failures. To this end, we will exploit relocation: under failure conditions, a request may be sent to a different destination from the one used under failure-free conditions. We will provide a comprehensive overview of related work in this area, and focus in some detail on our own most recent work. The latter comprises a case study where traffic has a known origin, but we assume a degree of freedom as to where it ends up being processed, which is typically the case for, e.g., grid applications of the bag-of-tasks (BoT) type or for providing cloud services. In particular, we will provide in this paper a new integer linear programming (ILP) formulation to solve the resilient grid/cloud dimensioning problem using failure-dependent backup routes. Our algorithm will simultaneously decide on server and network capacity. We find that in the anycast routing problem we address, the benefit of using failure-dependent (FD) rerouting is limited compared to failure-independent (FID) backup routing. We confirm our earlier findings in terms of network capacity savings achieved by relocation compared to not exploiting relocation (on the order of 6-10% in the current case studies).
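
The relocation principle in this abstract can be illustrated with a small Python sketch. It is not the ILP formulation of the paper: it only compares, on a hypothetical five-node topology, the hop count of a backup route that must reach the original server site with one that may relocate to any server site after a link failure. The helper `hops` and the topology `adj` are assumptions made for this example.

```python
from collections import deque

def hops(adj, src, targets, banned=None):
    """Breadth-first search: hop count from src to the nearest node in
    targets, ignoring the undirected link `banned` (a pair of nodes).
    Returns None if no target is reachable."""
    seen = {src}
    queue = deque([(src, 0)])
    while queue:
        node, dist = queue.popleft()
        if node in targets:
            return dist
        for nxt in adj[node]:
            if banned is not None and {node, nxt} == set(banned):
                continue  # skip the failed link in both directions
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, dist + 1))
    return None

# Hypothetical topology: source "s", server sites "d1" and "d2".
adj = {
    "s": ["d1", "a"],
    "a": ["s", "b", "d2"],
    "b": ["a", "d1"],
    "d1": ["s", "b"],
    "d2": ["a"],
}
failed_link = ("s", "d1")

primary = hops(adj, "s", {"d1"})                            # failure-free route
same_site = hops(adj, "s", {"d1"}, banned=failed_link)      # no relocation
relocated = hops(adj, "s", {"d1", "d2"}, banned=failed_link)  # relocation allowed
```

Here the backup route to the original site needs three hops, while relocating to the other site needs two. Because relocation only enlarges the set of admissible destinations, the relocated backup can never be longer than the non-relocated one; how much network capacity this saves in practice depends on the topology and is what the dimensioning studies quantify.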

    A Case for Cooperative and Incentive-Based Coupling of Distributed Clusters

    Research interest in Grid computing has grown significantly over the past five years. Management of distributed resources is one of the key issues in Grid computing. Central to management of resources is the effectiveness of resource allocation, as it determines the overall utility of the system. The current approaches to superscheduling in a grid environment are non-coordinated, since application-level schedulers or brokers make scheduling decisions independently of the others in the system. Clearly, this can exacerbate the load sharing and utilization problems of distributed resources due to suboptimal schedules that are likely to occur. To overcome these limitations, we propose a mechanism for coordinated sharing of distributed clusters based on computational economy. The resulting environment, called Grid-Federation, allows the transparent use of resources from the federation when local resources are insufficient to meet its users' requirements. The use of computational economy methodology in coordinating resource allocation not only facilitates QoS-based scheduling, but also enhances the utility delivered by resources. Comment: 22 pages, extended version of the conference paper published at IEEE Cluster'05, Boston, M

    Learning scalable and transferable multi-robot/machine sequential assignment planning via graph embedding

    Can the success of reinforcement learning methods for simple combinatorial optimization problems be extended to multi-robot sequential assignment planning? In addition to the challenge of achieving near-optimal performance in large problems, transferability to an unseen number of robots and tasks is another key challenge for real-world applications. In this paper, we suggest a method that achieves the first success in both challenges for robot/machine scheduling problems. Our method comprises three components. First, we show that a robot scheduling problem can be expressed as a random probabilistic graphical model (PGM). We develop a mean-field inference method for random PGMs and use it for Q-function inference. Second, we show that transferability can be achieved by carefully designing a two-step sequential encoding of the problem state. Third, we resolve the computational scalability issue of fitted Q-iteration by suggesting a heuristic auction-based Q-iteration fitting method enabled by the transferability we achieved. We apply our method to discrete-time, discrete-space problems (Multi-Robot Reward Collection (MRRC)) and scalably achieve 97% optimality with transferability. This optimality is maintained under stochastic contexts. By extending our method to a continuous-time, continuous-space formulation, we claim to be the first learning-based method with scalable performance among multi-machine scheduling problems; our method scalably achieves performance comparable to popular metaheuristics in identical parallel machine scheduling (IPMS) problems.

    Joint dimensioning of server and network infrastructure for resilient optical grids/clouds

    We address the dimensioning of infrastructure, comprising both network and server resources, for large-scale decentralized distributed systems such as grids or clouds. We design the resulting grid/cloud to be resilient against network link or server failures. To this end, we exploit relocation: under failure conditions, a grid job or cloud virtual machine may be served at an alternate destination (i.e., different from the one under failure-free conditions). We thus consider grid/cloud requests to have a known origin, but assume a degree of freedom as to where they end up being served, which is the case for grid applications of the bag-of-tasks (BoT) type or hosted virtual machines in the cloud case. We present a generic methodology based on integer linear programming (ILP) that: 1) chooses a given number of sites in a given network topology at which to install server infrastructure; and 2) determines the amount of both network and server capacity needed to cater for both the failure-free scenario and failures of links or nodes. For the latter, we consider either failure-independent (FID) or failure-dependent (FD) recovery. Case studies on European-scale networks show that relocation allows a considerable reduction of the total amount of network and server resources, especially in sparse topologies and for higher numbers of server sites. Adopting a failure-dependent backup routing strategy does lead to lower resource dimensions, but only when we adopt relocation (especially for a high number of server sites): without exploiting relocation, potential savings of FD versus FID are not meaningful.