
    Multi-elastic Datacenters: Auto-scaled Virtual Clusters on Energy-Aware Physical Infrastructures

    [EN] Computer clusters are widely used platforms to execute different computational workloads. The advent of virtualization and Cloud computing has paved the way to deploy virtual elastic clusters on top of Cloud infrastructures, which are typically backed by physical computing clusters. In turn, advances in Green computing have fostered the ability to dynamically power on the nodes of physical clusters as required. This paper therefore introduces an open-source framework to deploy elastic virtual clusters running on elastic physical clusters, where the computing capabilities of the virtual clusters are dynamically changed both to satisfy the user application's computing requirements and to minimise the energy consumed by the underlying physical cluster that supports an on-premises Cloud. For that, we integrate: i) an elasticity manager both at the infrastructure level (power management) and at the virtual infrastructure level (horizontal elasticity); ii) an automatic Virtual Machine (VM) consolidation agent that reduces the number of powered-on physical nodes using live migration; and iii) a vertical elasticity manager that dynamically and transparently changes the memory allocated to VMs, thus fostering enhanced consolidation. A case study based on real datasets executed on a production infrastructure is used to validate the proposed solution. The results show that a multi-elastic virtualized datacenter provides users with the ability to deploy customized scalable computing clusters while reducing its energy footprint.
    This work has been partially supported by ATMOSPHERE (Adaptive, Trustworthy, Manageable, Orchestrated, Secure, Privacy-assuring Hybrid, Ecosystem for Resilient Cloud Computing), funded by the European Commission under the Cooperation Programme, Horizon 2020 grant agreement No 777154.
    Alfonso Laguna, CD.; Caballer Fernández, M.; Calatrava Arroyo, A.; Moltó, G.; Blanquer Espert, I. (2018). 
Multi-elastic Datacenters: Auto-scaled Virtual Clusters on Energy-Aware Physical Infrastructures. Journal of Grid Computing. 17(1):191-204. https://doi.org/10.1007/s10723-018-9449-z
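
    The abstract above mentions a VM consolidation agent that reduces the number of powered-on physical nodes through live migration, but it does not say which placement algorithm is used. The following is only a minimal sketch of one common heuristic (drain the least-loaded host into the tightest-fitting remaining hosts), assuming memory is the only constraint. The names `Host` and `consolidate` are hypothetical and do not come from the paper or from any Cloud platform API; the function only computes a migration plan and performs no real migration.

```python
from dataclasses import dataclass, field


@dataclass
class Host:
    """Hypothetical physical node: memory capacity and hosted VMs (name -> MiB)."""
    name: str
    mem_capacity: int
    vms: dict = field(default_factory=dict)

    def used(self) -> int:
        return sum(self.vms.values())

    def free(self) -> int:
        return self.mem_capacity - self.used()


def consolidate(hosts):
    """Greedy drain heuristic (an assumption, not the paper's algorithm).

    Visits hosts from least to most loaded and tries to move all of a host's
    VMs onto other hosts that already run VMs. A host is drained only if
    every one of its VMs fits elsewhere. Returns (plan, idle), where plan is
    a list of (vm, source, destination) migrations and idle lists the hosts
    that end up empty and could be powered off.
    """
    by_name = {h.name: h for h in hosts}
    plan = []
    for src in sorted(hosts, key=Host.used):
        if not src.vms:
            continue
        # Candidate destinations: other hosts that are already in use.
        free = {h.name: h.free() for h in hosts if h is not src and h.vms}
        moves = []
        # Place the largest VMs first, each on the tightest host that fits.
        for vm, mem in sorted(src.vms.items(), key=lambda kv: kv[1], reverse=True):
            fitting = [n for n in sorted(free, key=free.get) if free[n] >= mem]
            if not fitting:
                moves = None  # src cannot be fully drained; leave it untouched
                break
            free[fitting[0]] -= mem
            moves.append((vm, src.name, fitting[0]))
        if moves:
            for vm, _, dest in moves:
                by_name[dest].vms[vm] = src.vms.pop(vm)
            plan.extend(moves)
    idle = [h.name for h in hosts if not h.vms]
    return plan, idle


if __name__ == "__main__":
    cluster = [
        Host("A", 8192, {"a1": 4096, "a2": 2048}),
        Host("B", 8192, {"b1": 1024}),
        Host("C", 8192, {"c1": 2048}),
    ]
    plan, idle = consolidate(cluster)
    print("migrations:", plan)
    print("hosts that could be powered off:", idle)
```

    In this sketch a host is only drained when all of its VMs fit elsewhere, so a partial drain never triggers migrations that free no node. Shrinking the memory allocated to VMs, as the vertical elasticity manager in the abstract does, would increase the free memory on each host and so could let more hosts be drained.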

    An Overview of Search Strategies in Distributed Environments

    [EN] Distributed systems are populated by a large number of heterogeneous entities that join and leave the systems dynamically. These entities act as clients and providers and interact with each other in order to obtain a resource or to achieve a goal. To facilitate collaboration between entities, the system should provide mechanisms to manage the information about which entities or resources are available at a given moment, as well as how to locate them in an efficient way. However, this is not an easy task in open and dynamic environments, where the available resources change and global information is not always available. In this paper, we present a comprehensive vision of search in distributed environments. This review considers not only the approaches of the Peer-to-Peer area but also the approaches from three more areas: Service-Oriented Environments, Multi-Agent Systems, and Complex Networks. In these areas, the search for resources, services, or entities plays a key role in the proper performance of the systems built on them. The aim of this analysis is to compare approaches from these areas, taking into account the underlying system structure and the algorithms or strategies that participate in the search process.
    Work partially supported by the Spanish Ministry of Science and Innovation through grants TIN2009-13839-C03-01, CSD2007-0022 (CONSOLIDER-INGENIO 2010), PROMETEO 2008/051, PAID-06-11-2048, and FPU grant AP-2008-00601 awarded to E. del Val.
    Del Val Noguera, E.; Rebollo Pedruelo, M.; Botti, V. (2013). An Overview of Search Strategies in Distributed Environments. Knowledge Engineering Review. 1-33. https://doi.org/10.1017/S0269888913000143

    Predictive Reliability and Fault Management in Exascale Systems: State of the Art and Perspectives

    © ACM, 2020. This is the author's version of the work. It is posted here by permission of ACM for your personal use. Not for redistribution. The definitive version was published in ACM Computing Surveys, Vol. 53, No. 5, Article 95. Publication date: September 2020. https://doi.org/10.1145/3403956

    [EN] Performance and power constraints come together with Complementary Metal Oxide Semiconductor technology scaling in future Exascale systems. Technology scaling makes each individual transistor more prone to faults and, because the number of devices per chip grows exponentially, leads to higher system fault rates. Consequently, High-performance Computing (HPC) systems need to integrate prediction, detection, and recovery mechanisms to cope with faults efficiently. This article reviews fault detection, fault prediction, and recovery techniques in HPC systems, from electronics to system level. We analyze their strengths and limitations. Finally, we identify the promising paths to meet the reliability levels of Exascale systems.

    This work has received funding from the European Union's Horizon 2020 (H2020) research and innovation program under the FET-HPC Grant Agreement No. 801137 (RECIPE). Jaume Abella was also partially supported by the Ministry of Economy and Competitiveness of Spain under Contract No. TIN2015-65316-P and under Ramon y Cajal Postdoctoral Fellowship No. RYC-2013-14717, as well as by the HiPEAC Network of Excellence. Ramon Canal is partially supported by the Generalitat de Catalunya under Contract No. 2017SGR0962.

    Canal, R.; Hernández Luz, C.; Tornero-Gavilá, R.; Cilardo, A.; Massari, G.; Reghenzani, F.; Fornaciari, W.... (2020). Predictive Reliability and Fault Management in Exascale Systems: State of the Art and Perspectives. ACM Computing Surveys. 53(5):1-32. https://doi.org/10.1145/3403956

    Analyzing Users' Activity in On-line Social Networks over Time through a Multi-Agent Framework

    [EN] The number of people and organizations using online social networks as a new way of communication is continually increasing. The messages that users write in these networks and their interactions with other users leave a digital trace that is recorded. To understand what is going on in these virtual environments, systems are needed that collect, process, and analyze the information generated. The majority of existing tools analyze information related to an online event once it has finished or at a specific point in time (i.e., without an in-depth analysis of how users' activity evolves during the event). They focus on statistics about the quantity of information generated in an event. In this article, we present a multi-agent system that automates the process of gathering data from users' activity in social networks and performs an in-depth analysis of the evolution of social behavior at different levels of granularity in online events, based on network theory metrics. We evaluated its functionality by analyzing users' activity in events on Twitter.

    This work is partially supported by PROMETEOII/2013/019, TIN2014-55206-R, TIN2015-65515-C4-1-R, and H2020-ICT-2015-688095.

    Del Val Noguera, E.; Martínez, C.; Botti, V. (2016). Analyzing Users' Activity in On-line Social Networks over Time through a Multi-Agent Framework. Soft Computing. 20(11):4331-4345. https://doi.org/10.1007/s00500-016-2301-0
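
    The idea in the abstract above, following how users' activity evolves at different levels of granularity through network metrics, can be illustrated with a minimal sketch. This is not the authors' multi-agent system: the function name `degree_by_window`, the `(timestamp, source, target)` event format, and the sample data are all assumptions made for illustration, and plain node degree stands in for the network theory metrics, which the abstract does not name.

```python
from collections import defaultdict
from datetime import datetime, timedelta


def degree_by_window(interactions, window):
    """Group (timestamp, source, target) interactions into fixed-size time
    windows and count, per window, how many distinct users each user
    interacted with (undirected degree). Hypothetical helper."""
    if not interactions:
        return {}
    start = min(ts for ts, _, _ in interactions)
    neighbours = defaultdict(lambda: defaultdict(set))
    for ts, src, dst in interactions:
        # Dividing two timedeltas yields a float number of windows elapsed.
        index = int((ts - start) / window)
        neighbours[index][src].add(dst)
        neighbours[index][dst].add(src)
    return {
        index: {user: len(peers) for user, peers in users.items()}
        for index, users in sorted(neighbours.items())
    }


# Hypothetical sample data: who interacted with whom, and when.
events = [
    (datetime(2016, 1, 1, 10, 0), "alice", "bob"),
    (datetime(2016, 1, 1, 10, 20), "alice", "carol"),
    (datetime(2016, 1, 1, 11, 5), "bob", "carol"),
]

# Hourly granularity: one degree table per one-hour window.
hourly = degree_by_window(events, timedelta(hours=1))
```

    Calling the same function with a different `window` (for example `timedelta(days=1)`) changes the level of granularity without touching the collected data, which is the property the abstract emphasizes.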

    On the Benefits of the Remote GPU Virtualization Mechanism: the rCUDA Case

    [EN] Graphics processing units (GPUs) are being adopted in many computing facilities given their extraordinary computing power, which makes it possible to accelerate many general purpose applications from different domains. However, GPUs also present several side effects, such as increased acquisition costs as well as larger space requirements. They also require more powerful energy supplies. Furthermore, GPUs still consume some amount of energy while idle, and their utilization is usually low for most workloads. In a similar way to virtual machines, the use of virtual GPUs may address the aforementioned concerns. In this regard, the remote GPU virtualization mechanism allows an application being executed in a node of the cluster to transparently use the GPUs installed at other nodes. Moreover, this technique allows to share the GPUs present in the computing facility among the applications being executed in the cluster. In this way, several applications being executed in different (or the same) cluster nodes can share 1 or more GPUs located in other nodes of the cluster. Sharing GPUs should increase overall GPU utilization, thus reducing the negative impact of the side effects mentioned before. Reducing the total amount of GPUs installed in the cluster may also be possible. In this paper, we explore some of the benefits that remote GPU virtualization brings to clusters. For instance, this mechanism allows an application to use all the GPUs present in the computing facility. Another benefit of this technique is that cluster throughput, measured as jobs completed per time unit, is noticeably increased when this technique is used. In this regard, cluster throughput can be doubled for some workloads. Furthermore, in addition to increase overall GPU utilization, total energy consumption can be reduced up to 40%. This may be key in the context of exascale computing facilities, which present an important energy constraint. 
Other benefits are related to the cloud computing domain, where a GPU can be easily shared among several virtual machines. Finally, GPU migration (and therefore server consolidation) is one more benefit of this novel technique. Generalitat Valenciana, Grant/Award Number: PROMETEOII/2013/009; MINECO and FEDER, Grant/Award Number: TIN2014-53495-R. Silla Jiménez, F.; Iserte Agut, S.; Reaño González, C.; Prades, J. (2017). On the Benefits of the Remote GPU Virtualization Mechanism: the rCUDA Case. Concurrency and Computation Practice and Experience. 29(13):1-17. https://doi.org/10.1002/cpe.4072

    On the Effect of using rCUDA to Provide CUDA Acceleration to Xen Virtual Machines

    [EN] Nowadays, many data centers use virtual machines (VMs) in order to achieve a more efficient use of hardware resources. The use of VMs provides a reduction in equipment and maintenance expenses as well as a lower electricity consumption. Nevertheless, current virtualization solutions, such as Xen, do not easily provide graphics processing units (GPUs) to applications running in the virtualized domain with the flexibility usually required in data centers (i.e., managing virtual GPU instances and concurrently sharing them among several VMs). Therefore, the execution of GPU-accelerated applications within VMs is hindered by this lack of flexibility. In this regard, remote GPU virtualization solutions may address this concern. In this paper we analyze the use of the remote GPU virtualization mechanism to accelerate scientific applications running inside Xen VMs. We conduct our study with six different applications, namely CUDA-MEME, CUDASW++, GPU-BLAST, LAMMPS, a triangle count application, referred to as TRICO, and a synthetic benchmark used to emulate different application behaviors. Our experiments show that the use of remote GPU virtualization is a feasible approach to address the current concerns of sharing GPUs among several VMs, featuring a very low overhead if an InfiniBand fabric is already present in the cluster. This work was funded by the Generalitat Valenciana under Grant PROMETEO/2017/077. Authors are also grateful for the generous support provided by Mellanox Technologies Inc. Prades, J.; Reaño González, C.; Silla Jiménez, F. (2019). On the Effect of using rCUDA to Provide CUDA Acceleration to Xen Virtual Machines. Cluster Computing. 22(1):185-204. https://doi.org/10.1007/s10586-018-2845-0

    On Evaluating Commercial Cloud Services: A Systematic Review

    Background: Cloud Computing is booming in industry, with many competing providers and services. Accordingly, evaluation of commercial Cloud services is necessary. However, the existing evaluation studies are relatively chaotic. There is considerable confusion, and a gap between practice and theory, in Cloud services evaluation. Aim: To help relieve this chaos, this work aims to synthesize the existing evaluation implementations to outline the state-of-the-practice and also identify research opportunities in Cloud services evaluation. Method: Based on a conceptual evaluation model comprising six steps, the Systematic Literature Review (SLR) method was employed to collect relevant evidence to investigate the Cloud services evaluation step by step. Results: This SLR identified 82 relevant evaluation studies. The overall data collected from these studies essentially represent the current practical landscape of implementing Cloud services evaluation, and in turn can be reused to facilitate future evaluation work. Conclusions: Evaluation of commercial Cloud services has become a world-wide research topic. Some of the findings of this SLR identify several research gaps in the area of Cloud services evaluation (e.g., the Elasticity and Security evaluation of commercial Cloud services could be a long-term challenge), while some other findings suggest the trend of applying commercial Cloud services (e.g., compared with PaaS, IaaS seems more suitable for customers and is particularly important in industry). This SLR study itself also confirms some previous experiences and reveals new Evidence-Based Software Engineering (EBSE) lessons.