DRAGON: Decentralized fault tolerance in edge federations

Authors: Casale G; Jennings NR; Tuli S
Publication venue: Institute of Electrical and Electronics Engineers (IEEE)
Publication date: 16/08/2022

Edge Federation is a new computing paradigm that seamlessly interconnects the resources of multiple edge service providers. A key challenge in such systems is the deployment of latency-critical and AI-based resource-intensive applications on constrained devices. To address this challenge, we propose a novel memory-efficient deep learning based model, namely generative optimization networks (GON). Unlike GANs, GONs use a single network to both discriminate input and generate samples, significantly reducing their memory footprint. Leveraging the low memory footprint of GONs, we propose a decentralized fault-tolerance method called DRAGON that runs simulations (as per a digital modeling twin) to quickly predict and optimize the performance of the edge federation. Extensive experiments with real-world edge computing benchmarks on multiple Raspberry-Pi based federated edge configurations show that DRAGON can outperform the baseline methods in fault-detection and Quality of Service (QoS) metrics. Specifically, the proposed method gives higher F1 scores for fault detection than the best deep learning (DL) method, while consuming less memory than the heuristic methods. This allows for improvements in energy consumption, response time and service level agreement violations of up to 74, 63 and 82 percent, respectively.

Spiral - Imperial College Digital Repository

Contego: An Adaptive Framework for Integrating Security Tasks in Real-Time Systems

Authors: Bobba Rakesh B.; Hasan Monowar; Mohan Sibin; Pellizzoni Rodolfo
Publication venue: LIPIcs - Leibniz International Proceedings in Informatics. 29th Euromicro Conference on Real-Time Systems (ECRTS 2017)
Publication date: 01/01/2017

Embedded real-time systems (RTS) are pervasive. Many modern RTS are exposed to unknown security flaws, and threats to RTS are growing in both number and sophistication. However, until recently, cyber-security considerations were an afterthought in the design of such systems. Any security mechanisms integrated into RTS must (a) co-exist with the real-time tasks in the system and (b) operate without impacting the timing and safety constraints of the control logic. We introduce Contego, an approach to integrating security tasks into RTS without affecting temporal requirements. Contego is specifically designed for legacy systems, viz., the real-time control systems in which major alterations of the system parameters for constituent tasks are not always feasible. Contego combines the concept of opportunistic execution with hierarchical scheduling to maintain compatibility with legacy systems while still providing flexibility by allowing security tasks to operate in different modes. We also define a metric to measure the effectiveness of such integration. We evaluate Contego using synthetic workloads as well as an implementation on a realistic embedded platform (an open-source ARM CPU running real-time Linux).

arXiv.org e-Print Archive

Dagstuhl Research Online Publication Server

Real-Time Wireless Sensor-Actuator Networks for Cyber-Physical Systems

Author: Saifullah Abusayeed
Publication venue: Washington University Open Scholarship
Publication date: 01/09/2014

A cyber-physical system (CPS) employs tight integration of, and coordination between, computational, networking, and physical elements. Wireless sensor-actuator networks provide a new communication technology for a broad range of CPS applications such as process control, smart manufacturing, and data center management. Sensing and control in these systems need to meet stringent real-time performance requirements on communication latency in challenging environments. There have been limited results on real-time scheduling theory for wireless sensor-actuator networks. Real-time transmission scheduling and analysis for wireless sensor-actuator networks require new methodologies to deal with the unique characteristics of wireless communication. Furthermore, the performance of a wireless control system involves intricate interactions between real-time communication and control. This thesis research tackles these challenges and makes a series of contributions to the theory and system for wireless CPS. (1) We establish a new real-time scheduling theory for wireless sensor-actuator networks. (2) We develop a scheduling-control co-design approach for holistic optimization of control performance in a wireless control system. (3) We design and implement a wireless sensor-actuator network for CPS in data center power management. (4) We expand our research to develop scheduling algorithms and analyses for real-time parallel computing to support computation-intensive CPS.

Washington University St. Louis: Open Scholarship

Scheduling Algorithms for Parallel Execution of Computer Programs

Author: Samadzadeh Farideh Ansari-jafari
Publication venue: Oklahoma State University Library
Publication date: 01/07/1992
Field of study: Computer Science

SHAREOK repository

TRANSOM: An Efficient Fault-Tolerant System for Training LLMs

Authors: Chen Xu; Chen Yuheng; Guo Yongqiang; Li Kangyu; Li Qingping; Li Shigang; Wu Baodong; Xia Lei; Xiang Tieyao
Publication date: 18/10/2023

Large language models (LLMs) with hundreds of billions or trillions of parameters, represented by ChatGPT, have had a profound impact on various fields. However, training LLMs with super-large-scale parameters requires large high-performance GPU clusters and long training periods lasting for months. Due to the inevitable hardware and software failures in large-scale clusters, maintaining uninterrupted and long-duration training is extremely challenging. As a result, a substantial amount of training time is devoted to task checkpoint saving and loading, task rescheduling and restart, and manual task anomaly checks, which greatly harms the overall training efficiency. To address these issues, we propose TRANSOM, a novel fault-tolerant LLM training system. In this work, we design three key subsystems: the training pipeline automatic fault tolerance and recovery mechanism named Transom Operator and Launcher (TOL), the training task multi-dimensional metric automatic anomaly detection system named Transom Eagle Eye (TEE), and the training checkpoint asynchronous access automatic fault tolerance and recovery technology named Transom Checkpoint Engine (TCE). Here, TOL manages the lifecycle of training tasks, while TEE is responsible for task monitoring and anomaly reporting. TEE detects training anomalies and reports them to TOL, which automatically applies the fault tolerance strategy to eliminate abnormal nodes and restart the training task. The asynchronous checkpoint saving and loading functionality provided by TCE greatly reduces the fault tolerance overhead. The experimental results indicate that TRANSOM significantly enhances the efficiency of large-scale LLM training on clusters. Specifically, the pre-training time for GPT3-175B has been reduced by 28%, while checkpoint saving and loading performance has improved by a factor of 20. Comment: 14 pages, 9 figures

arXiv.org e-Print Archive

Sequence Prediction in Real-time Systems

Author: Arets Cody F.
Publication date: 01/09/2022

Pure OAI Repository