73 research outputs found

    Design of Multi-Gigabit Network Interconnect Elements and Protocols for a Data Acquisition System in Radiation Environments

    Get PDF
    Modern High Energy Physics experiments (HEP) explore the fundamental nature of matter in more depth than ever before and thereby benefit greatly from the advances in the field of communication technology. The huge data volumes generated by the increasingly precise detector setups pose severe problems for the Data Acquisition Systems (DAQ), which are used to process and store this information. In addition, detector setups and their read-out electronics need to be synchronized precisely to allow a later correlation of experiment events accurately in time. Moreover, the substantial presence of charged particles from accelerator-generated beams results in strong ionizing radiation levels, which has a severe impact on the electronic systems. This thesis recommends an architecture for unified network protocol IP cores with custom developed physical interfaces for the use of reliable data acquisition systems in strong radiation environments. Special configured serial bidirectional point-to-point interconnects are proposed to realize high speed data transmission, slow control access, synchronization and global clock distribution on unified links to reduce costs and to gain compact and efficient read-out setups. Special features are the developed radiation hardened functional units against single and multiple bit upsets, and the common interface for statistical error and diagnosis information, which integrates well into the protocol capabilities and eases the error handling in large experiment setups. Many innovative designs for several custom FPGA and ASIC platforms have been implemented and are described in detail. Special focus is placed on the physical layers and network interface elements from high-speed serial LVDS interconnects up to 20 Gb/s SSTL links in state-of-the-art process technology. The developed IP cores are fully tested by an adapted verification environment for electronic design automation tools and also by live application. They are available in a global repository allowing a broad usage within further HEP experiments

    Predictive Reliability and Fault Management in Exascale Systems: State of the Art and Perspectives

    Get PDF
    © ACM, 2020. This is the author's version of the work. It is posted here by permission of ACM for your personal use. Not for redistribution. The definitive version was published in ACM Computing Surveys, Vol. 53, No. 5, Article 95. Publication date: September 2020. https://doi.org/10.1145/3403956[EN] Performance and power constraints come together with Complementary Metal Oxide Semiconductor technology scaling in future Exascale systems. Technology scaling makes each individual transistor more prone to faults and, due to the exponential increase in the number of devices per chip, to higher system fault rates. Consequently, High-performance Computing (HPC) systems need to integrate prediction, detection, and recovery mechanisms to cope with faults efficiently. This article reviews fault detection, fault prediction, and recovery techniques in HPC systems, from electronics to system level. We analyze their strengths and limitations. Finally, we identify the promising paths to meet the reliability levels of Exascale systems.This work has received funding from the European Union's Horizon 2020 (H2020) research and innovation program under the FET-HPC Grant Agreement No. 801137 (RECIPE). Jaume Abella was also partially supported by the Ministry of Economy and Competitiveness of Spain under Contract No. TIN2015-65316-P and under Ramon y Cajal Postdoctoral Fellowship No. RYC-2013-14717, as well as by the HiPEAC Network of Excellence. Ramon Canal is partially supported by the Generalitat de Catalunya under Contract No. 2017SGR0962.Canal, R.; Hernández Luz, C.; Tornero-Gavilá, R.; Cilardo, A.; Massari, G.; Reghenzani, F.; Fornaciari, W.... (2020). Predictive Reliability and Fault Management in Exascale Systems: State of the Art and Perspectives. ACM Computing Surveys. 53(5):1-32. https://doi.org/10.1145/3403956S132535Abella, J., Hernandez, C., Quinones, E., Cazorla, F. J., Conmy, P. R., Azkarate-askasua, M., … Vardanega, T. (2015). WCET analysis methods: Pitfalls and challenges on their trustworthiness. 10th IEEE International Symposium on Industrial Embedded Systems (SIES). doi:10.1109/sies.2015.7185039E. Agullo L. Giraud A. Guermouche J. Roman and M. Zounon. 2013. Towards resilient parallel linear Krylov solvers: Recover-restart strategies. INRIA Research Report RR-8324. E. Agullo L. Giraud A. Guermouche J. Roman and M. Zounon. 2013. Towards resilient parallel linear Krylov solvers: Recover-restart strategies. INRIA Research Report RR-8324.Agullo, E., Giraud, L., Salas, P., & Zounon, M. (2016). Interpolation-Restart Strategies for Resilient Eigensolvers. SIAM Journal on Scientific Computing, 38(5), C560-C583. doi:10.1137/15m1042115Al-Qawasmeh, A. M., Pasricha, S., Maciejewski, A. A., & Siegel, H. J. (2015). Power and Thermal-Aware Workload Allocation in Heterogeneous Data Centers. IEEE Transactions on Computers, 64(2), 477-491. doi:10.1109/tc.2013.116ARM. 2017. ARM Reliability Availability and Serviceability (RAS) Specification—ARMv8 for the ARMv8-A Architecture Profile. White paper. Retrieved from https://developer.arm.com/docs/ddi0587/latest. ARM. 2017. ARM Reliability Availability and Serviceability (RAS) Specification—ARMv8 for the ARMv8-A Architecture Profile. White paper. Retrieved from https://developer.arm.com/docs/ddi0587/latest.Avizienis, A., Laprie, J.-C., Randell, B., & Landwehr, C. (2004). Basic concepts and taxonomy of dependable and secure computing. IEEE Transactions on Dependable and Secure Computing, 1(1), 11-33. doi:10.1109/tdsc.2004.2Bautista-Gomez, L., Zyulkyarov, F., Unsal, O., & McIntosh-Smith, S. (2016). Unprotected Computing: A Large-Scale Study of DRAM Raw Error Rate on a Supercomputer. SC16: International Conference for High Performance Computing, Networking, Storage and Analysis. doi:10.1109/sc.2016.54Berrocal, E., Bautista-Gomez, L., Di, S., Lan, Z., & Cappello, F. (2017). Toward General Software Level Silent Data Corruption Detection for Parallel Applications. IEEE Transactions on Parallel and Distributed Systems, 28(12), 3642-3655. doi:10.1109/tpds.2017.2735971M.-A. Breuer and A. D. Friedman. 1976. Diagnosis 8 Reliable Design of Digital Systems. Springer. M.-A. Breuer and A. D. Friedman. 1976. Diagnosis 8 Reliable Design of Digital Systems. Springer.P. Bridges K. Ferreira M. Heroux and M. Hoemmen. 2012. Fault-tolerant linear solvers via selective reliability. ArXiv e-prints June 2012. arXiv:1206.1390 [math.NA]. P. Bridges K. Ferreira M. Heroux and M. Hoemmen. 2012. Fault-tolerant linear solvers via selective reliability. ArXiv e-prints June 2012. arXiv:1206.1390 [math.NA].F. Cappello A. Geist W. Gropp S. Kale B. Kramer and M. Snir. 2014. Toward exascale resilience: 2014 update. Supercomput. Front. Innovat. 1 1 (2014). http://superfri.org/superfri/article/view/14. F. Cappello A. Geist W. Gropp S. Kale B. Kramer and M. Snir. 2014. Toward exascale resilience: 2014 update. Supercomput. Front. Innovat. 1 1 (2014). http://superfri.org/superfri/article/view/14.F. J. Cazorla L. Kosmidis E. Mezzetti C. Hernandez J. Abella and T. Vardanega. 2019. Probabilistic worst-case timing analysis: Taxonomy and comprehensive survey. ACM Comput. Surv. 52 1 Article 14 (Feb. 2019) 35 pages. DOI:https://doi.org/10.1145/3301283 F. J. Cazorla L. Kosmidis E. Mezzetti C. Hernandez J. Abella and T. Vardanega. 2019. Probabilistic worst-case timing analysis: Taxonomy and comprehensive survey. ACM Comput. Surv. 52 1 Article 14 (Feb. 2019) 35 pages. DOI:https://doi.org/10.1145/3301283Chan, C. S., Pan, B., Gross, K., Vaidyanathan, K., & Rosing, T. Š. (2014). Correcting vibration-induced performance degradation in enterprise servers. ACM SIGMETRICS Performance Evaluation Review, 41(3), 83-88. doi:10.1145/2567529.2567555Chantem, T., Hu, X. S., & Dick, R. P. (2011). Temperature-Aware Scheduling and Assignment for Hard Real-Time Applications on MPSoCs. IEEE Transactions on Very Large Scale Integration (VLSI) Systems, 19(10), 1884-1897. doi:10.1109/tvlsi.2010.2058873Chen, M. Y., Kiciman, E., Fratkin, E., Fox, A., & Brewer, E. (s. f.). Pinpoint: problem determination in large, dynamic Internet services. Proceedings International Conference on Dependable Systems and Networks. doi:10.1109/dsn.2002.1029005Chen, Z. (2011). Algorithm-based recovery for iterative methods without checkpointing. Proceedings of the 20th international symposium on High performance distributed computing - HPDC ’11. doi:10.1145/1996130.1996142Chen, Z. (2013). Online-ABFT. Proceedings of the 18th ACM SIGPLAN symposium on Principles and practice of parallel programming - PPoPP ’13. doi:10.1145/2442516.2442533Coskun, A. K., Rosing, T. S., Mihic, K., De Micheli, G., & Leblebici, Y. (2006). Analysis and Optimization of MPSoC Reliability. Journal of Low Power Electronics, 2(1), 56-69. doi:10.1166/jolpe.2006.007G. Da Costa A. Oleksiak W. Piatek J. Salom and L. Sisó. 2015. Minimization of costs and energy consumption in a data center by a workload-based capacity management. In Energy Efficient Data Centers S. Klingert M. Chinnici and M. Rey Porto (Eds.). Springer International Publishing Cham 102--119. G. Da Costa A. Oleksiak W. Piatek J. Salom and L. Sisó. 2015. Minimization of costs and energy consumption in a data center by a workload-based capacity management. In Energy Efficient Data Centers S. Klingert M. Chinnici and M. Rey Porto (Eds.). Springer International Publishing Cham 102--119.Cupertino, L., Da Costa, G., Oleksiak, A., Pia¸tek, W., Pierson, J.-M., Salom, J., … Zilio, T. (2015). Energy-efficient, thermal-aware modeling and simulation of data centers: The CoolEmAll approach and evaluation results. Ad Hoc Networks, 25, 535-553. doi:10.1016/j.adhoc.2014.11.002Dally, W. J. (1991). Express cubes: improving the performance of k-ary n-cube interconnection networks. IEEE Transactions on Computers, 40(9), 1016-1023. doi:10.1109/12.83652Dauwe, D., Pasricha, S., Maciejewski, A. A., & Siegel, H. J. (2018). Resilience-Aware Resource Management for Exascale Computing Systems. IEEE Transactions on Sustainable Computing, 3(4), 332-345. doi:10.1109/tsusc.2018.2797890R. I. Davis and A. Burns. 2011. A survey of hard real-time scheduling for multiprocessor systems. ACM Comput. Surv. 43 4 Article 35 (Oct. 2011) 44 pages. DOI:https://doi.org/10.1145/1978802.1978814 R. I. Davis and A. Burns. 2011. A survey of hard real-time scheduling for multiprocessor systems. ACM Comput. Surv. 43 4 Article 35 (Oct. 2011) 44 pages. DOI:https://doi.org/10.1145/1978802.1978814Di, S., & Cappello, F. (2016). Adaptive Impact-Driven Detection of Silent Data Corruption for HPC Applications. IEEE Transactions on Parallel and Distributed Systems, 27(10), 2809-2823. doi:10.1109/tpds.2016.2517639Di, S., Guo, H., Gupta, R., Pershey, E. R., Snir, M., & Cappello, F. (2019). Exploring Properties and Correlations of Fatal Events in a Large-Scale HPC System. IEEE Transactions on Parallel and Distributed Systems, 30(2), 361-374. doi:10.1109/tpds.2018.2864184Di, S., Robert, Y., Vivien, F., & Cappello, F. (2017). Toward an Optimal Online Checkpoint Solution under a Two-Level HPC Checkpoint Model. IEEE Transactions on Parallel and Distributed Systems, 28(1), 244-259. doi:10.1109/tpds.2016.2546248J. Dongarra T. Herault and Y. Robert. 2015. Fault Tolerance Techniques for High-Performance Computing. Springer. J. Dongarra T. Herault and Y. Robert. 2015. Fault Tolerance Techniques for High-Performance Computing. Springer.DOWNING, S., & SOCIE, D. (1982). Simple rainflow counting algorithms. International Journal of Fatigue, 4(1), 31-40. doi:10.1016/0142-1123(82)90018-4Eghbalkhah, B., Kamal, M., Afzali-Kusha, H., Afzali-Kusha, A., Ghaznavi-Ghoushchi, M. B., & Pedram, M. (2015). Workload and temperature dependent evaluation of BTI-induced lifetime degradation in digital circuits. Microelectronics Reliability, 55(8), 1152-1162. doi:10.1016/j.microrel.2015.06.004Gottscho, M., Shoaib, M., Govindan, S., Sharma, B., Wang, D., & Gupta, P. (2017). Measuring the Impact of Memory Errors on Application  Performance. IEEE Computer Architecture Letters, 16(1), 51-55. doi:10.1109/lca.2016.2599513Greenberg, A., Hamilton, J. R., Jain, N., Kandula, S., Kim, C., Lahiri, P., … Sengupta, S. (2011). VL2. Communications of the ACM, 54(3), 95-104. doi:10.1145/1897852.1897877Heroux, M. A., Bartlett, R. A., Howle, V. E., Hoekstra, R. J., Hu, J. J., Kolda, T. G., … Stanley, K. S. (2005). An overview of the Trilinos project. ACM Transactions on Mathematical Software, 31(3), 397-423. doi:10.1145/1089014.1089021Hoffmann, G. A., Trivedi, K. S., & Malek, M. (2007). A Best Practice Guide to Resource Forecasting for Computing Systems. IEEE Transactions on Reliability, 56(4), 615-628. doi:10.1109/tr.2007.909764Hsiao, M. Y., Carter, W. C., Thomas, J. W., & Stringfellow, W. R. (1981). Reliability, Availability, and Serviceability of IBM Computer Systems: A Quarter Century of Progress. IBM Journal of Research and Development, 25(5), 453-468. doi:10.1147/rd.255.0453Hughes, G. F., Murray, J. F., Kreutz-Delgado, K., & Elkan, C. (2002). Improved disk-drive failure warnings. IEEE Transactions on Reliability, 51(3), 350-357. doi:10.1109/tr.2002.802886S. Hukerikar and C. Engelmann. 2017. Resilience design patterns: A structured approach to resilience at extreme scale. Supercomput. Front. Innov. 4 3 (2017). DOI:https://doi.org/10.14529/jsfi170301 S. Hukerikar and C. Engelmann. 2017. Resilience design patterns: A structured approach to resilience at extreme scale. Supercomput. Front. Innov. 4 3 (2017). DOI:https://doi.org/10.14529/jsfi170301Hussain, H., Malik, S. U. R., Hameed, A., Khan, S. U., Bickler, G., Min-Allah, N., … Rayes, A. (2013). A survey on resource allocation in high performance distributed computing systems. Parallel Computing, 39(11), 709-736. doi:10.1016/j.parco.2013.09.009Intel Corporation. [n.d.]. Intel Xeon Processor E7 Family: Reliability Availability and Serviceability. White paper. https://www.intel.com/content/www/us/en/processors/xeon/xeon-e7-family-ras-server-paper.html. Intel Corporation. [n.d.]. Intel Xeon Processor E7 Family: Reliability Availability and Serviceability. White paper. https://www.intel.com/content/www/us/en/processors/xeon/xeon-e7-family-ras-server-paper.html.Jha, S., Formicola, V., Martino, C. D., Dalton, M., Kramer, W. T., Kalbarczyk, Z., & Iyer, R. K. (2018). Resiliency of HPC Interconnects: A Case Study of Interconnect Failures and Recovery in Blue Waters. IEEE Transactions on Dependable and Secure Computing, 15(6), 915-930. doi:10.1109/tdsc.2017.2737537Kiciman, E., & Fox, A. (2005). Detecting Application-Level Failures in Component-Based Internet Services. IEEE Transactions on Neural Networks, 16(5), 1027-1041. doi:10.1109/tnn.2005.853411Kim, T., Sun, Z., Cook, C., Zhao, H., Li, R., Wong, D., & Tan, S. X.-D. (2016). Invited - Cross-layer modeling and optimization for electromigration induced reliability. Proceedings of the 53rd Annual Design Automation Conference. doi:10.1145/2897937.2905010Kurowski, K., Oleksiak, A., Piątek, W., Piontek, T., Przybyszewski, A., & Węglarz, J. (2013). DCworms – A tool for simulation of energy efficiency in distributed computing infrastructures. Simulation Modelling Practice and Theory, 39, 135-151. doi:10.1016/j.simpat.2013.08.007Langou, J., Chen, Z., Bosilca, G., & Dongarra, J. (2008). Recovery Patterns for Iterative Methods in a Parallel Unstable Environment. SIAM Journal on Scientific Computing, 30(1), 102-116. doi:10.1137/040620394J. C. Laprie (Ed.). 1995. Dependability—Its Attributes Impairments and Means. Springer-Verlag Berlin. J. C. Laprie (Ed.). 1995. Dependability—Its Attributes Impairments and Means. Springer-Verlag Berlin.Laprie, J.-C. (s. f.). DEPENDABLE COMPUTING AND FAULT TOLERANCE : CONCEPTS AND TERMINOLOGY. Twenty-Fifth International Symposium on Fault-Tolerant Computing, 1995, ’ Highlights from Twenty-Five Years’. doi:10.1109/ftcsh.1995.532603Lasance, C. J. M. (2003). Thermally driven reliability issues in microelectronic systems: status-quo and challenges. Microelectronics Reliability, 43(12), 1969-1974. doi:10.1016/s0026-2714(03)00183-5Yinglung Liang, Yanyong Zhang, Sivasubramaniam, A., Jette, M., & Sahoo, R. (s. f.). BlueGene/L Failure Analysis and Prediction Models. International Conference on Dependable Systems and Networks (DSN’06). doi:10.1109/dsn.2006.18Lin, T.-T. Y., & Siewiorek, D. P. (1990). Error log analysis: statistical modeling and heuristic trend analysis. IEEE Transactions on Reliability, 39(4), 419-432. doi:10.1109/24.58720Losada, N., González, P., Martín, M. J., Bosilca, G., Bouteiller, A., & Teranishi, K. (2020). Fault tolerance of MPI applications in exascale systems: The ULFM solution. Future Generation Computer Systems, 106, 467-481. doi:10.1016/j.future.2020.01.026Lyons, R. E., & Vanderkulk, W. (1962). The Use of Triple-Modular Redundancy to Improve Computer Reliability. IBM Journal of Research and Development, 6(2), 200-209. doi:10.1147/rd.62.0200M. Médard and S. S. Lumetta. 2003. Network Reliability and Fault Tolerance. American Cancer Society. Retrieved from arXiv:https://onlinelibrary.wiley.com/doi/pdf/10.1002/0471219282.eot281. M. Médard and S. S. Lumetta. 2003. Network Reliability and Fault Tolerance. American Cancer Society. Retrieved from arXiv:https://onlinelibrary.wiley.com/doi/pdf/10.1002/0471219282.eot281.Moody, A., Bronevetsky, G., Mohror, K., & de Supinski, B. (2010). Detailed Modeling, Design, and Evaluation of a Scalable Multi-level Checkpointing System. doi:10.2172/984082Moor Insights 8 Strategy. 2017. AMD EPYC Brings New RAS Capability. White paper. Retrieved from https://www.amd.com/system/files/2017-06/AMD-EPYC-Brings-New-RAS-Capability.pdf. Moor Insights 8 Strategy. 2017. AMD EPYC Brings New RAS Capability. White paper. Retrieved from https://www.amd.com/system/files/2017-06/AMD-EPYC-Brings-New-RAS-Capability.pdf.Mulas, F., Atienza, D., Acquaviva, A., Carta, S., Benini, L., & De Micheli, G. (2009). Thermal Balancing Policy for Multiprocessor Stream Computing Platforms. IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, 28(12), 1870-1882. doi:10.1109/tcad.2009.2032372Oleksiak, A., Kierzynka, M., Piatek, W., Agosta, G., Barenghi, A., Brandolese, C., … Janssen, U. (2017). M2DC – Modular Microserver DataCentre with heterogeneous hardware. Microprocessors and Microsystems, 52, 117-130. doi:10.1016/j.micpro.2017.05.019Oxley, M. A., Jonardi, E., Pasricha, S., Maciejewski, A. A., Siegel, H. J., Burns, P. J., & Koenig, G. A. (2018). Rate-based thermal, power, and co-location aware resource management for heterogeneous data centers. Journal of Parallel and Distributed Computing, 112, 126-139. doi:10.1016/j.jpdc.2017.04.015K. O’brien I. Pietri R. Reddy A. Lastovetsky and R. Sakellariou. 2017. A survey of power and energy predictive models in HPC systems and applications. ACM Comput. Surv. 50 3 Article 37 (June 2017) 38 pages. DOI:https://doi.org/10.1145/3078811 K. O’brien I. Pietri R. Reddy A. Lastovetsky and R. Sakellariou. 2017. A survey of power and energy predictive models in HPC systems and applications. ACM Comput. Surv. 50 3 Article 37 (June 2017) 38 pages. DOI:https://doi.org/10.1145/3078811Park, S.-M., & Humphrey, M. (2011). Predictable High-Performance Computing Using Feedback Control and Admission Control. IEEE Transactions on Parallel and Distributed Systems, 22(3), 396-411. doi:10.1109/tpds.2010.100Pfefferman, J. D., & Cernuschi-Frias, B. (2002). A nonparametric nonstationary procedure for failure prediction. IEEE Transactions on Reliability, 51(4), 434-442. doi:10.1109/tr.2002.804733Rangan, K. K., Wei, G.-Y., & Brooks, D. (2009). Thread motion. ACM SIGARCH Computer Architecture News, 37(3), 302-313. doi:10.1145/1555815.1555793Paolo Rech. [n.d.]. Reliability Issues in Current and Future Supercomputers. Retrieved from http://energysfe.ufsc.br/slides/Paolo-Rech-260917.pdf. Paolo Rech. [n.d.]. Reliability Issues in Current and Future Supercomputers. Retrieved from http://energysfe.ufsc.br/slides/Paolo-Rech-260917.pdf.F. Reghenzani G. Massari and W. Fornaciari. 2019. The real-time Linux kernel: A survey on PREEMPT_RT. Comput. Surveys 52 1 Article 18 (Feb. 2019) 36 pages. DOI:https://doi.org/10.1145/3297714 F. Reghenzani G. Massari and W. Fornaciari. 2019. The real-time Linux kernel: A survey on PREEMPT_RT. Comput. Surveys 52 1 Article 18 (Feb. 2019) 36 pages. DOI:https://doi.org/10.1145/3297714F. Salfner M. Lenk and M. Malek. 2010. A survey of online failure prediction methods. ACM Comput. Surv. 42 3 Article 10 (March 2010) 42 pages. DOI:https://doi.org/10.1145/1670679.1670680 F. Salfner M. Lenk and M. Malek. 2010. A survey of online failure prediction methods. ACM Comput. Surv. 42 3 Article 10 (March 2010) 42 pages. DOI:https://doi.org/10.1145/1670679.1670680Salfner, F., Schieschke, M., & Malek, M. (2006). Predicting failures of computer systems: a case study for a telecommunication system. Proceedings 20th IEEE International Parallel & Distributed Processing Symposium. doi:10.1109/ipdps.2006.1639672Shi, L., Chen, H., Sun, J., & Li, K. (2012). vCUDA: GPU-Accelerated High-Performance Computing in Virtual Machines. IEEE Transactions on Computers, 61(6), 804-816. doi:10.1109/tc.2011.112D. P. Siewiorek and R. S. Swarz. 1998. Reliable Computer Systems 3rd ed. A. K. Peters Ltd. D. P. Siewiorek and R. S. Swarz. 1998. Reliable Computer Systems 3rd ed. A. K. Peters Ltd.Singh, S., & Chana, I. (2016). A Survey on Resource Scheduling in Cloud Computing: Issues and Challenges. Journal of Grid Computing, 14(2), 217-264. doi:10.1007/s10723-015-9359-2Slegel, T. J., Averill, R. M., Check, M. A., Giamei, B. C., Krumm, B. W., Krygowski, C. A., … Webb, C. F. (1999). IBM’s S/390 G5 microprocessor design. IEEE Micro, 19(2), 12-23. doi:10.1109/40.755464Sridhar, A., Sabry, M. M., & Atienza, D. (2014). A Semi-Analytical Thermal Modeling Framework for Liquid-Cooled ICs. IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, 33(8), 1145-1158. doi:10.1109/tcad.2014.2323194Sridharan, V., DeBardeleben, N., Blanchard, S., Ferreira, K. B., Stearley, J., Shalf, J., & Gurumurthi, S. (2015). Memory Errors in Modern Systems. ACM SIGARCH Computer Architecture News, 43(1), 297-310. doi:10.1145/2786763.2694348Stathis, J. H. (2018). The physics of NBTI: What do we really know? 2018 IEEE International Reliability Physics Symposium (IRPS). doi:10.1109/irps.2018.8353539Stellner, G. (s. f.). CoCheck: checkpointing and process migration for MPI. Proceedings of International Conference on Parallel Processing. doi:10.1109/ipps.1996.508106Stone, J. E., Gohara, D., & Shi, G. (2010). OpenCL: A Parallel Programming Standard for Heterogeneous Computing Systems. Computing in Science & Engineering, 12(3), 66-73. doi:10.1109/mcse.2010.69Subasi, O., Di, S., Bautista-Gomez, L., Balaprakash, P., Unsal, O., Labarta, J., … Cappello, F. (2018). Exploring the capabilities of support vector machines in detecting silent data corruptions. Sustainable Computing: Informatics and Systems, 19, 277-290. doi:10.1016/j.suscom.2018.01.004Tang, D., & Iyer, R. K. (1993). Dependability measurement and modeling of a multicomputer system. IEEE Transactions on Computers, 42(1), 62-75. doi:10.1109/12.192214D. Turnbull and N. Alldrin. 2003. Failure Prediction in Hardware Systems. Tech. rep. University of California San Diego CA. Retrieved from http://www.cs.ucsd.edu/ dturnbul/Papers/ServerPrediction.pdf. D. Turnbull and N. Alldrin. 2003. Failure Prediction in Hardware Systems. Tech. rep. University of California San Diego CA. Retrieved from http://www.cs.ucsd.edu/ dturnbul/Papers/ServerPrediction.pdf.Vilalta, R., Apte, C. V., Hellerstein, J. L., Ma, S., & Weiss, S. M. (2002). Predictive algorithms in the management of computer systems. IBM Systems Journal, 41(3), 461-474. doi:10.1147/sj.413.0461Vinoski, S. (2007). Reliability with Erlang. IEEE Internet Com

    Design and Validation of Network-on-Chip Architectures for the Next Generation of Multi-synchronous, Reliable, and Reconfigurable Embedded Systems

    Get PDF
    NETWORK-ON-CHIP (NoC) design is today at a crossroad. On one hand, the design principles to efficiently implement interconnection networks in the resource-constrained on-chip setting have stabilized. On the other hand, the requirements on embedded system design are far from stabilizing. Embedded systems are composed by assembling together heterogeneous components featuring differentiated operating speeds and ad-hoc counter measures must be adopted to bridge frequency domains. Moreover, an unmistakable trend toward enhanced reconfigurability is clearly underway due to the increasing complexity of applications. At the same time, the technology effect is manyfold since it provides unprecedented levels of system integration but it also brings new severe constraints to the forefront: power budget restrictions, overheating concerns, circuit delay and power variability, permanent fault, increased probability of transient faults. Supporting different degrees of reconfigurability and flexibility in the parallel hardware platform cannot be however achieved with the incremental evolution of current design techniques, but requires a disruptive approach and a major increase in complexity. In addition, new reliability challenges cannot be solved by using traditional fault tolerance techniques alone but the reliability approach must be also part of the overall reconfiguration methodology. In this thesis we take on the challenge of engineering a NoC architectures for the next generation systems and we provide design methods able to overcome the conventional way of implementing multi-synchronous, reliable and reconfigurable NoC. Our analysis is not only limited to research novel approaches to the specific challenges of the NoC architecture but we also co-design the solutions in a single integrated framework. Interdependencies between different NoC features are detected ahead of time and we finally avoid the engineering of highly optimized solutions to specific problems that however coexist inefficiently together in the final NoC architecture. To conclude, a silicon implementation by means of a testchip tape-out and a prototype on a FPGA board validate the feasibility and effectivenes

    Network-on-Chip

    Get PDF
    Addresses the Challenges Associated with System-on-Chip Integration Network-on-Chip: The Next Generation of System-on-Chip Integration examines the current issues restricting chip-on-chip communication efficiency, and explores Network-on-chip (NoC), a promising alternative that equips designers with the capability to produce a scalable, reusable, and high-performance communication backbone by allowing for the integration of a large number of cores on a single system-on-chip (SoC). This book provides a basic overview of topics associated with NoC-based design: communication infrastructure design, communication methodology, evaluation framework, and mapping of applications onto NoC. It details the design and evaluation of different proposed NoC structures, low-power techniques, signal integrity and reliability issues, application mapping, testing, and future trends. Utilizing examples of chips that have been implemented in industry and academia, this text presents the full architectural design of components verified through implementation in industrial CAD tools. It describes NoC research and developments, incorporates theoretical proofs strengthening the analysis procedures, and includes algorithms used in NoC design and synthesis. In addition, it considers other upcoming NoC issues, such as low-power NoC design, signal integrity issues, NoC testing, reconfiguration, synthesis, and 3-D NoC design. This text comprises 12 chapters and covers: The evolution of NoC from SoC—its research and developmental challenges NoC protocols, elaborating flow control, available network topologies, routing mechanisms, fault tolerance, quality-of-service support, and the design of network interfaces The router design strategies followed in NoCs The evaluation mechanism of NoC architectures The application mapping strategies followed in NoCs Low-power design techniques specifically followed in NoCs The signal integrity and reliability issues of NoC The details of NoC testing strategies reported so far The problem of synthesizing application-specific NoCs Reconfigurable NoC design issues Direction of future research and development in the field of NoC Network-on-Chip: The Next Generation of System-on-Chip Integration covers the basic topics, technology, and future trends relevant to NoC-based design, and can be used by engineers, students, and researchers and other industry professionals interested in computer architecture, embedded systems, and parallel/distributed systems

    New Fault Detection, Mitigation and Injection Strategies for Current and Forthcoming Challenges of HW Embedded Designs

    Full text link
    Tesis por compendio[EN] Relevance of electronics towards safety of common devices has only been growing, as an ever growing stake of the functionality is assigned to them. But of course, this comes along the constant need for higher performances to fulfill such functionality requirements, while keeping power and budget low. In this scenario, industry is struggling to provide a technology which meets all the performance, power and price specifications, at the cost of an increased vulnerability to several types of known faults or the appearance of new ones. To provide a solution for the new and growing faults in the systems, designers have been using traditional techniques from safety-critical applications, which offer in general suboptimal results. In fact, modern embedded architectures offer the possibility of optimizing the dependability properties by enabling the interaction of hardware, firmware and software levels in the process. However, that point is not yet successfully achieved. Advances in every level towards that direction are much needed if flexible, robust, resilient and cost effective fault tolerance is desired. The work presented here focuses on the hardware level, with the background consideration of a potential integration into a holistic approach. The efforts in this thesis have focused several issues: (i) to introduce additional fault models as required for adequate representativity of physical effects blooming in modern manufacturing technologies, (ii) to provide tools and methods to efficiently inject both the proposed models and classical ones, (iii) to analyze the optimum method for assessing the robustness of the systems by using extensive fault injection and later correlation with higher level layers in an effort to cut development time and cost, (iv) to provide new detection methodologies to cope with challenges modeled by proposed fault models, (v) to propose mitigation strategies focused towards tackling such new threat scenarios and (vi) to devise an automated methodology for the deployment of many fault tolerance mechanisms in a systematic robust way. The outcomes of the thesis constitute a suite of tools and methods to help the designer of critical systems in his task to develop robust, validated, and on-time designs tailored to his application.[ES] La relevancia que la electrónica adquiere en la seguridad de los productos ha crecido inexorablemente, puesto que cada vez ésta copa una mayor influencia en la funcionalidad de los mismos. Pero, por supuesto, este hecho viene acompañado de una necesidad constante de mayores prestaciones para cumplir con los requerimientos funcionales, al tiempo que se mantienen los costes y el consumo en unos niveles reducidos. En este escenario, la industria está realizando esfuerzos para proveer una tecnología que cumpla con todas las especificaciones de potencia, consumo y precio, a costa de un incremento en la vulnerabilidad a múltiples tipos de fallos conocidos o la introducción de nuevos. Para ofrecer una solución a los fallos nuevos y crecientes en los sistemas, los diseñadores han recurrido a técnicas tradicionalmente asociadas a sistemas críticos para la seguridad, que ofrecen en general resultados sub-óptimos. De hecho, las arquitecturas empotradas modernas ofrecen la posibilidad de optimizar las propiedades de confiabilidad al habilitar la interacción de los niveles de hardware, firmware y software en el proceso. No obstante, ese punto no está resulto todavía. Se necesitan avances en todos los niveles en la mencionada dirección para poder alcanzar los objetivos de una tolerancia a fallos flexible, robusta, resiliente y a bajo coste. El trabajo presentado aquí se centra en el nivel de hardware, con la consideración de fondo de una potencial integración en una estrategia holística. Los esfuerzos de esta tesis se han centrado en los siguientes aspectos: (i) la introducción de modelos de fallo adicionales requeridos para la representación adecuada de efectos físicos surgentes en las tecnologías de manufactura actuales, (ii) la provisión de herramientas y métodos para la inyección eficiente de los modelos propuestos y de los clásicos, (iii) el análisis del método óptimo para estudiar la robustez de sistemas mediante el uso de inyección de fallos extensiva, y la posterior correlación con capas de más alto nivel en un esfuerzo por recortar el tiempo y coste de desarrollo, (iv) la provisión de nuevos métodos de detección para cubrir los retos planteados por los modelos de fallo propuestos, (v) la propuesta de estrategias de mitigación enfocadas hacia el tratamiento de dichos escenarios de amenaza y (vi) la introducción de una metodología automatizada de despliegue de diversos mecanismos de tolerancia a fallos de forma robusta y sistemática. Los resultados de la presente tesis constituyen un conjunto de herramientas y métodos para ayudar al diseñador de sistemas críticos en su tarea de desarrollo de diseños robustos, validados y en tiempo adaptados a su aplicación.[CA] La rellevància que l'electrònica adquireix en la seguretat dels productes ha crescut inexorablement, puix cada volta més aquesta abasta una major influència en la funcionalitat dels mateixos. Però, per descomptat, aquest fet ve acompanyat d'un constant necessitat de majors prestacions per acomplir els requeriments funcionals, mentre es mantenen els costos i consums en uns nivells reduïts. Donat aquest escenari, la indústria està fent esforços per proveir una tecnologia que complisca amb totes les especificacions de potència, consum i preu, tot a costa d'un increment en la vulnerabilitat a diversos tipus de fallades conegudes, i a la introducció de nous tipus. Per oferir una solució a les noves i creixents fallades als sistemes, els dissenyadors han recorregut a tècniques tradicionalment associades a sistemes crítics per a la seguretat, que en general oferixen resultats sub-òptims. De fet, les arquitectures empotrades modernes oferixen la possibilitat d'optimitzar les propietats de confiabilitat en habilitar la interacció dels nivells de hardware, firmware i software en el procés. Tot i això eixe punt no està resolt encara. Es necessiten avanços a tots els nivells en l'esmentada direcció per poder assolir els objectius d'una tolerància a fallades flexible, robusta, resilient i a baix cost. El treball ací presentat se centra en el nivell de hardware, amb la consideració de fons d'una potencial integració en una estratègia holística. Els esforços d'esta tesi s'han centrat en els següents aspectes: (i) la introducció de models de fallada addicionals requerits per a la representació adequada d'efectes físics que apareixen en les tecnologies de fabricació actuals, (ii) la provisió de ferramentes i mètodes per a la injecció eficient del models proposats i dels clàssics, (iii) l'anàlisi del mètode òptim per estudiar la robustesa de sistemes mitjançant l'ús d'injecció de fallades extensiva, i la posterior correlació amb capes de més alt nivell en un esforç per retallar el temps i cost de desenvolupament, (iv) la provisió de nous mètodes de detecció per cobrir els reptes plantejats pels models de fallades proposats, (v) la proposta d'estratègies de mitigació enfocades cap al tractament dels esmentats escenaris d'amenaça i (vi) la introducció d'una metodologia automatitzada de desplegament de diversos mecanismes de tolerància a fallades de forma robusta i sistemàtica. Els resultats de la present tesi constitueixen un conjunt de ferramentes i mètodes per ajudar el dissenyador de sistemes crítics en la seua tasca de desenvolupament de dissenys robustos, validats i a temps adaptats a la seua aplicació.Espinosa García, J. (2016). New Fault Detection, Mitigation and Injection Strategies for Current and Forthcoming Challenges of HW Embedded Designs [Tesis doctoral no publicada]. Universitat Politècnica de València. https://doi.org/10.4995/Thesis/10251/73146TESISCompendi

    Dependable Embedded Systems

    Get PDF
    This Open Access book introduces readers to many new techniques for enhancing and optimizing reliability in embedded systems, which have emerged particularly within the last five years. This book introduces the most prominent reliability concerns from today’s points of view and roughly recapitulates the progress in the community so far. Unlike other books that focus on a single abstraction level such circuit level or system level alone, the focus of this book is to deal with the different reliability challenges across different levels starting from the physical level all the way to the system level (cross-layer approaches). The book aims at demonstrating how new hardware/software co-design solution can be proposed to ef-fectively mitigate reliability degradation such as transistor aging, processor variation, temperature effects, soft errors, etc. Provides readers with latest insights into novel, cross-layer methods and models with respect to dependability of embedded systems; Describes cross-layer approaches that can leverage reliability through techniques that are pro-actively designed with respect to techniques at other layers; Explains run-time adaptation and concepts/means of self-organization, in order to achieve error resiliency in complex, future many core systems

    Special Topics in Information Technology

    Get PDF
    This open access book presents thirteen outstanding doctoral dissertations in Information Technology from the Department of Electronics, Information and Bioengineering, Politecnico di Milano, Italy. Information Technology has always been highly interdisciplinary, as many aspects have to be considered in IT systems. The doctoral studies program in IT at Politecnico di Milano emphasizes this interdisciplinary nature, which is becoming more and more important in recent technological advances, in collaborative projects, and in the education of young researchers. Accordingly, the focus of advanced research is on pursuing a rigorous approach to specific research topics starting from a broad background in various areas of Information Technology, especially Computer Science and Engineering, Electronics, Systems and Control, and Telecommunications. Each year, more than 50 PhDs graduate from the program. This book gathers the outcomes of the thirteen best theses defended in 2020-21 and selected for the IT PhD Award. Each of the authors provides a chapter summarizing his/her findings, including an introduction, description of methods, main achievements and future work on the topic. Hence, the book provides a cutting-edge overview of the latest research trends in Information Technology at Politecnico di Milano, presented in an easy-to-read format that will also appeal to non-specialists

    Signaling in 3-D integrated circuits, benefits and challenges

    Get PDF
    Three-dimensional (3-D) or vertical integration is a design and packaging paradigm that can mitigate many of the increasing challenges related to the design of modern integrated systems. 3-D circuits have recently been at the spotlight, since these circuits provide a potent approach to enhance the performance and integrate diverse functions within amulti-plane stack. Clock networks consume a great portion of the power dissipated in a circuit. Therefore, designing a low-power clock network in synchronous circuits is an important task. This requirement is stricter for 3-D circuits due to the increased power densities. Synchronization issues can be more challenging for 3-D circuits since a clock path can spread across several planes with different physical and electrical characteristics. Consequently, designing low power clock networks for 3-D circuits is an important issue. Resonant clock networks are considered efficient low-power alternatives to conventional clock distribution schemes. These networks utilize additional inductive circuits to reduce power while delivering a full swing clock signal to the sink nodes. In this research, a design method to apply resonant clocking to synthesized clock trees is proposed. Manufacturing processes for 3-D circuits include some additional steps as compared to standard CMOS processes which makes 3-D circuits more susceptible to manufacturing defects and lowers the overall yield of the bonded 3-D stack. Testing is another complicated task for 3-D ICs, where pre-bond test is a prerequisite. Pre-bond testability, in turn, presents new challenges to 3-D clock network design primarily due to the incomplete clock distribution networks prior to the bonding of the planes. A design methodology of resonant 3-D clock networks that support wireless pre-bond testing is introduced. To efficiently address this issue, inductive links are exploited to wirelessly transmit the clock signal to the disjoint resonant clock networks. The inductors comprising the LC tanks are used as the receiver circuit for the links, essentially eliminating the need for additional circuits and/or interconnect resources during pre-bond test. Recent FPGAs are quite complex circuits which provide reconfigurablity at the cost of lower performance and higher power consumption as compared to ASIC circuits. Exploiting a large number of programmable switches, routing structures are mainly responsible for performance degradation in FPAGs. Employing 3-D technology can providemore efficient switches which drastically improve the performance and reduce the power consumption of the FPGA. RRAM switches are one of the most promising candidates to improve the FPGA routing architecture thanks to their low on-resistance and non-volatility. Along with the configurable switches, buffers are the other important element of the FPGAs routing structure. Different characteristics of RRAM switches change the properties of signal paths in RRAM-based FPGAs. The on resistance of RRAMswitches is considerably lower than CMOS pass gate switches which results in lower RC delay for RRAM-based routing paths. This different nature in critical path and signal delay in turn affect the need for intermediate buffers. Thus the buffer allocation should be reconsidered. In the last part of this research, the effect of intermediate buffers on signal propagation delay is studied and a modified buffer allocation scheme for RRAM-based FPGA routing path is proposed

    Special Topics in Information Technology

    Get PDF
    This open access book presents thirteen outstanding doctoral dissertations in Information Technology from the Department of Electronics, Information and Bioengineering, Politecnico di Milano, Italy. Information Technology has always been highly interdisciplinary, as many aspects have to be considered in IT systems. The doctoral studies program in IT at Politecnico di Milano emphasizes this interdisciplinary nature, which is becoming more and more important in recent technological advances, in collaborative projects, and in the education of young researchers. Accordingly, the focus of advanced research is on pursuing a rigorous approach to specific research topics starting from a broad background in various areas of Information Technology, especially Computer Science and Engineering, Electronics, Systems and Control, and Telecommunications. Each year, more than 50 PhDs graduate from the program. This book gathers the outcomes of the thirteen best theses defended in 2020-21 and selected for the IT PhD Award. Each of the authors provides a chapter summarizing his/her findings, including an introduction, description of methods, main achievements and future work on the topic. Hence, the book provides a cutting-edge overview of the latest research trends in Information Technology at Politecnico di Milano, presented in an easy-to-read format that will also appeal to non-specialists

    Inter-module Interfacing techniques for SoCs with multiple clock domains to address challenges in modern deep sub-micron technologies

    Get PDF
    Miniaturization of integrated circuits (ICs) due to the improvement in lithographic techniques in modem deep sub-micron (DSM) technologies allows several complex processing elements to coexist in one IC, which are called System-on-Chip. As a first contribution, this thesis quantitatively analyzes the severity of timing constraints associated with Clock Distribution Network (CDN) in modem DSM technologies and shows that different processing elements may work in different dock domains to alleviate these constraints. Such systems are known as Globally Asynchronous Locally Synchronous (GALS) systems. It is imperative that different processing elements of a GALS system need to communicate with each other through some interfacing technique, and these interfaces can be asynchronous or synchronous. Conventionally, the asynchronous interfaces are described at the Register Transfer Logic (RTL) or system level. Such designs are susceptible to certain design constraints that cannot be addressed at higher abstraction levels; crosstalk glitch is one such constraint. This thesis initially identifies, using an analytical model, the possibility of asynchronous interface malfunction due to crosstalk glitch propagation. Next, we characterize crosstalk glitch propagation under normal operating conditions for two different classes of asynchronous protocols, namely bundled data protocol based and delay insensitive asynchronous designs. Subsequently, we propose a logic abstraction level modeling technique, which provides a framework to the designer to verify the asynchronous protocols against crosstalk glitches. The utility of this modeling technique is demonstrated experimentally on a Xilinx Virtex-II Pro FPGA. Furthermore, a novel methodology is proposed to quench such crosstalk glitch propagation through gating the asynchronous interface from sending the signal during potential glitch vulnerable instances. This methodology is termed as crosstalk glitch gating. This technique is successfully applied to obtain crosstalk glitch quenching in the representative interfaces. This thesis also addresses the dock skew challenges faced by high-performance synchronous interfacing methodologies in modem DSM technologies. The proposed methodology allows communicating modules to run at a frequency that is independent of the dock skew. Leveraging a novel clock-scheduling algorithm, our technique permits a faster module to communicate safely with a slower module without slowing down. Safe data communications for mesochronous schemes and for the cases when communicating modules have dock frequency ratios of integer or coprime numbers are theoretically explained and experimentally demonstrated. A clock-scheduling technique to dynamically accommodate phase variations is also proposed. These methods are implemented to the Xilinx Virtex II Pro technology. Experiments prove that the proposed interfacing scheme allows modules to communicate data safely, for mesochronous schemes, at 350 MHz, which is the limit of the technology used, under a dock skew of more than twice the time period (i.e. a dock skew of 12 ns
    • …
    corecore