Search CORE

1,093 research outputs found

Artificial intelligence driven anomaly detection for big data systems

Author: Alnafessah Ahmad
Publication venue: Computing, Imperial College London
Publication date: 01/06/2022
Field of study

The main goal of this thesis is to contribute to the research on automated performance anomaly detection and interference prediction by implementing Artificial Intelligence (AI) solutions for complex distributed systems, especially for Big Data platforms within cloud computing environments. The late detection and manual resolutions of performance anomalies and system interference in Big Data systems may lead to performance violations and financial penalties. Motivated by this issue, we propose AI-based methodologies for anomaly detection and interference prediction tailored to Big Data and containerized batch platforms to better analyze system performance and effectively utilize computing resources within cloud environments. Therefore, new precise and efficient performance management methods are the key to handling performance anomalies and interference impacts to improve the efficiency of data center resources. The first part of this thesis contributes to performance anomaly detection for in-memory Big Data platforms. We examine the performance of Big Data platforms and justify our choice of selecting the in-memory Apache Spark platform. An artificial neural network-driven methodology is proposed to detect and classify performance anomalies for batch workloads based on the RDD characteristics and operating system monitoring metrics. Our method is evaluated against other popular machine learning algorithms (ML), as well as against four different monitoring datasets. The results prove that our proposed method outperforms other ML methods, typically achieving 98–99% F-scores. Moreover, we prove that a random start instant, a random duration, and overlapped anomalies do not significantly impact the performance of our proposed methodology. The second contribution addresses the challenge of anomaly identification within an in-memory streaming Big Data platform by investigating agile hybrid learning techniques. We develop TRACK (neural neTwoRk Anomaly deteCtion in sparK) and TRACK-Plus, two methods to efficiently train a class of machine learning models for performance anomaly detection using a fixed number of experiments. Our model revolves around using artificial neural networks with Bayesian Optimization (BO) to find the optimal training dataset size and configuration parameters to efficiently train the anomaly detection model to achieve high accuracy. The objective is to accelerate the search process for finding the size of the training dataset, optimizing neural network configurations, and improving the performance of anomaly classification. A validation based on several datasets from a real Apache Spark Streaming system is performed, demonstrating that the proposed methodology can efficiently identify performance anomalies, near-optimal configuration parameters, and a near-optimal training dataset size while reducing the number of experiments up to 75% compared with naïve anomaly detection training. The last contribution overcomes the challenges of predicting completion time of containerized batch jobs and proactively avoiding performance interference by introducing an automated prediction solution to estimate interference among colocated batch jobs within the same computing environment. An AI-driven model is implemented to predict the interference among batch jobs before it occurs within system. Our interference detection model can alleviate and estimate the task slowdown affected by the interference. This model assists the system operators in making an accurate decision to optimize job placement. Our model is agnostic to the business logic internal to each job. Instead, it is learned from system performance data by applying artificial neural networks to establish the completion time prediction of batch jobs within the cloud environments. We compare our model with three other baseline models (queueing-theoretic model, operational analysis, and an empirical method) on historical measurements of job completion time and CPU run-queue size (i.e., the number of active threads in the system). The proposed model captures multithreading, operating system scheduling, sleeping time, and job priorities. A validation based on 4500 experiments based on the DaCapo benchmarking suite was carried out, confirming the predictive efficiency and capabilities of the proposed model by achieving up to 10% MAPE compared with the other models.Open Acces

Spiral - Imperial College Digital Repository

Scalability resilience framework using application-level fault injection for cloud-based software services

Author: Al-Said Ahmad Amro
Andras Peter
Publication venue: Springer
Publication date: 01/01/2022
Field of study

This paper presents an investigation into the effect of faults on the scalability resilience of cloud-based software services. The study introduces an experimental framework using the Application-Level Fault Injection (ALFI) to investigate how the faults at the application level affect the scalability resilience and behaviour of cloud-based software services. Previous studies on scalability analysis of cloud-based software services provide a baseline of the scalability behaviour of such services, allowing to conduct in-depth scalability investigation of these services. Experimental analysis on the EC2 cloud using a real-world cloud-based software service is used to demonstrate the framework, considering delay latency of software faults with two varied settings and two demand scenarios. The experimental approach is explained in detail. Here we simulate delay latency injection with two different times, 800 and 1600 ms, and compare the results with the baseline data. The results show that the proposed approach allows a fair assessment of the fault scenario’s impact on the cloud software service’s scalability resilience. We explain the use of the methodology to determine the impact of injected faults on the scalability behaviour and resilience of cloud-based software services

Keele Research Repository

Repository@Napier

The Cloud-to-Thing Continuum

Author
Publication venue: 'Springer Science and Business Media LLC'
Publication date
Field of study

The Internet of Things offers massive societal and economic opportunities while at the same time significant challenges, not least the delivery and management of the technical infrastructure underpinning it, the deluge of data generated from it, ensuring privacy and security, and capturing value from it. This Open Access Pivot explores these challenges, presenting the state of the art and future directions for research but also frameworks for making sense of this complex area. This book provides a variety of perspectives on how technology innovations such as fog, edge and dew computing, 5G networks, and distributed intelligence are making us rethink conventional cloud computing to support the Internet of Things. Much of this book focuses on technical aspects of the Internet of Things, however, clear methodologies for mapping the business value of the Internet of Things are still missing. We provide a value mapping framework for the Internet of Things to address this gap. While there is much hype about the Internet of Things, we have yet to reach the tipping point. As such, this book provides a timely entrée for higher education educators, researchers and students, industry and policy makers on the technologies that promise to reshape how society interacts and operates

OAPEN Library

Data Labeling for Fault Detection in Cloud: A Test Suite-Based Active Learning Approach

Author: Bagora Prateek
Publication venue
Publication date: 24/05/2023
Field of study

Cloud computing enables ubiquitous on-demand network access to a shared pool of configurable computing resources with minimal management efforts from the user. It has evolved as a key computing paradigm to enable a wide variety of applications such as e-commerce, social networks, high-performance computing, mission-critical applications, and Internet of Things (IoT). Ensuring the quality of service of applications deployed in inherently complex and fault-prone cloud environments is of utmost concern to service providers and end users. Machine learning-based fault management solutions enable proactive identification and mitigation of faults in cloud environments to attain the desired reliability, though they require labeled cloud metrics data for training and evaluation. Moreover, the high dynamicity in cloud environments brings forth emerging data distributions, which necessitate frequent labeling of cloud metrics data stemming from an evolving data distribution for model adaptation. In this thesis, we study the problem of data labeling for fault detection in cloud environments, paying close attention to the phenomenon of evolving cloud metric data distributions. More specifically, we propose a test suite-based active learning framework for automated labeling of cloud metrics data with the corresponding cloud system state while accounting for emerging fault patterns and data or concept drifts. We implemented our solution on a cloud testbed and introduced various emerging data distribution scenarios to evaluate the proposed framework's labeling efficacy over known and emerging data distributions. According to our evaluation results, the proposed framework achieves about 41% higher weighted F1-score and 34% higher average Area Under One-vs-Rest Receiver Operating Characteristic curves (OvR ROC AUC score) than a system without any adaptation for emerging data distributions

Concordia University Research Repository

START: Straggler Prediction and Mitigation for Cloud Computing Environments using Encoder LSTM Networks

Author: Buyya R
Casale G
Garraghan P
Gill SS
Jennings N
Tuli S
Publication venue: 'Institute of Electrical and Electronics Engineers (IEEE)'
Publication date: 19/11/2021
Field of study

A common performance problem in large-scale cloud systems is dealing with straggler tasks that are slow running instances which increase the overall response time. Such tasks impact the system's QoS and the SLA. There is a need for automatic straggler detection and mitigation mechanisms that execute jobs without violating the SLA. Prior work typically builds reactive models that focus first on detection and then mitigation of straggler tasks, which leads to delays. Other works use prediction based proactive mechanisms, but ignore volatile task characteristics. We propose a Straggler Prediction and Mitigation Technique (START) that is able to predict which tasks might be stragglers and dynamically adapt scheduling to achieve lower response times. START analyzes all tasks and hosts based on compute and network resource consumption using an Encoder LSTM network to predict and mitigate expected straggler tasks. This reduces the SLA violation rate and execution time without compromising QoS. Specifically, we use the CloudSim toolkit to simulate START and compare it with IGRU-SD, SGC, Dolly, GRASS, NearestFit and Wrangler in terms of QoS parameters. Experiments show that START reduces execution time, resource contention, energy and SLA violations by 13%, 11%, 16%, 19%, compared to the state-of-the-art

arXiv.org e-Print Archive

Spiral - Imperial College Digital Repository

Queen Mary Research Online

Lancaster E-Prints

Sustainable Edge Computing: Challenges and Future Directions

Author: Arroba Patricia
Buyya Rajkumar
Cárdenas Román
Moya José M.
Risco-Martín José L.
Publication venue
Publication date: 10/04/2023
Field of study

An increasing amount of data is being injected into the network from IoT (Internet of Things) applications. Many of these applications, developed to improve society's quality of life, are latency-critical and inject large amounts of data into the network. These requirements of IoT applications trigger the emergence of Edge computing paradigm. Currently, data centers are responsible for a global energy use between 2% and 3%. However, this trend is difficult to maintain, as bringing computing infrastructures closer to the edge of the network comes with its own set of challenges for energy efficiency. In this paper, we propose our approach for the sustainability of future computing infrastructures to provide (i) an energy-efficient and economically viable deployment, (ii) a fault-tolerant automated operation, and (iii) a collaborative resource management to improve resource efficiency. We identify the main limitations of applying Cloud-based approaches close to the data sources and present the research challenges to Edge sustainability arising from these constraints. We propose two-phase immersion cooling, formal modeling, machine learning, and energy-centric federated management as Edge-enabling technologies. We present our early results towards the sustainability of an Edge infrastructure to demonstrate the benefits of our approach for future computing environments and deployments.Comment: 26 pages, 16 figure

arXiv.org e-Print Archive

Security and trust in cloud computing and IoT through applying obfuscation, diversification, and trusted computing technologies

Author: Hosseinzadeh Shohreh
Publication venue: fi=Turku Centre for Computer Science|en=Turku Centre for Computer Science|
Publication date: 28/11/2020
Field of study

Cloud computing and Internet of Things (IoT) are very widely spread and commonly used technologies nowadays. The advanced services offered by cloud computing have made it a highly demanded technology. Enterprises and businesses are more and more relying on the cloud to deliver services to their customers. The prevalent use of cloud means that more data is stored outside the organization’s premises, which raises concerns about the security and privacy of the stored and processed data. This highlights the significance of effective security practices to secure the cloud infrastructure. The number of IoT devices is growing rapidly and the technology is being employed in a wide range of sectors including smart healthcare, industry automation, and smart environments. These devices collect and exchange a great deal of information, some of which may contain critical and personal data of the users of the device. Hence, it is highly significant to protect the collected and shared data over the network; notwithstanding, the studies signify that attacks on these devices are increasing, while a high percentage of IoT devices lack proper security measures to protect the devices, the data, and the privacy of the users. In this dissertation, we study the security of cloud computing and IoT and propose software-based security approaches supported by the hardware-based technologies to provide robust measures for enhancing the security of these environments. To achieve this goal, we use obfuscation and diversification as the potential software security techniques. Code obfuscation protects the software from malicious reverse engineering and diversification mitigates the risk of large-scale exploits. We study trusted computing and Trusted Execution Environments (TEE) as the hardware-based security solutions. Trusted Platform Module (TPM) provides security and trust through a hardware root of trust, and assures the integrity of a platform. We also study Intel SGX which is a TEE solution that guarantees the integrity and confidentiality of the code and data loaded onto its protected container, enclave. More precisely, through obfuscation and diversification of the operating systems and APIs of the IoT devices, we secure them at the application level, and by obfuscation and diversification of the communication protocols, we protect the communication of data between them at the network level. For securing the cloud computing, we employ obfuscation and diversification techniques for securing the cloud computing software at the client-side. For an enhanced level of security, we employ hardware-based security solutions, TPM and SGX. These solutions, in addition to security, ensure layered trust in various layers from hardware to the application. As the result of this PhD research, this dissertation addresses a number of security risks targeting IoT and cloud computing through the delivered publications and presents a brief outlook on the future research directions.Pilvilaskenta ja esineiden internet ovat nykyään hyvin tavallisia ja laajasti sovellettuja tekniikkoja. Pilvilaskennan pitkälle kehittyneet palvelut ovat tehneet siitä hyvin kysytyn teknologian. Yritykset enenevässä määrin nojaavat pilviteknologiaan toteuttaessaan palveluita asiakkailleen. Vallitsevassa pilviteknologian soveltamistilanteessa yritykset ulkoistavat tietojensa käsittelyä yrityksen ulkopuolelle, minkä voidaan nähdä nostavan esiin huolia taltioitavan ja käsiteltävän tiedon turvallisuudesta ja yksityisyydestä. Tämä korostaa tehokkaiden turvallisuusratkaisujen merkitystä osana pilvi-infrastruktuurin turvaamista. Esineiden internet -laitteiden lukumäärä on nopeasti kasvanut. Teknologiana sitä sovelletaan laajasti monilla sektoreilla, kuten älykkäässä terveydenhuollossa, teollisuusautomaatiossa ja älytiloissa. Sellaiset laitteet keräävät ja välittävät suuria määriä informaatiota, joka voi sisältää laitteiden käyttäjien kannalta kriittistä ja yksityistä tietoa. Tästä syystä johtuen on erittäin merkityksellistä suojata verkon yli kerättävää ja jaettavaa tietoa. Monet tutkimukset osoittavat esineiden internet -laitteisiin kohdistuvien tietoturvahyökkäysten määrän olevan nousussa, ja samaan aikaan suuri osuus näistä laitteista ei omaa kunnollisia teknisiä ominaisuuksia itse laitteiden tai niiden käyttäjien yksityisen tiedon suojaamiseksi. Tässä väitöskirjassa tutkitaan pilvilaskennan sekä esineiden internetin tietoturvaa ja esitetään ohjelmistopohjaisia tietoturvalähestymistapoja turvautumalla osittain laitteistopohjaisiin teknologioihin. Esitetyt lähestymistavat tarjoavat vankkoja keinoja tietoturvallisuuden kohentamiseksi näissä konteksteissa. Tämän saavuttamiseksi työssä sovelletaan obfuskaatiota ja diversifiointia potentiaalisiana ohjelmistopohjaisina tietoturvatekniikkoina. Suoritettavan koodin obfuskointi suojaa pahantahtoiselta ohjelmiston takaisinmallinnukselta ja diversifiointi torjuu tietoturva-aukkojen laaja-alaisen hyödyntämisen riskiä. Väitöskirjatyössä tutkitaan luotettua laskentaa ja luotettavan laskennan suoritusalustoja laitteistopohjaisina tietoturvaratkaisuina. TPM (Trusted Platform Module) tarjoaa turvallisuutta ja luottamuksellisuutta rakentuen laitteistopohjaiseen luottamukseen. Pyrkimyksenä on taata suoritusalustan eheys. Työssä tutkitaan myös Intel SGX:ää yhtenä luotettavan suorituksen suoritusalustana, joka takaa suoritettavan koodin ja datan eheyden sekä luottamuksellisuuden pohjautuen suojatun säiliön, saarekkeen, tekniseen toteutukseen. Tarkemmin ilmaistuna työssä turvataan käyttöjärjestelmä- ja sovellusrajapintatasojen obfuskaation ja diversifioinnin kautta esineiden internet -laitteiden ohjelmistokerrosta. Soveltamalla samoja tekniikoita protokollakerrokseen, työssä suojataan laitteiden välistä tiedonvaihtoa verkkotasolla. Pilvilaskennan turvaamiseksi työssä sovelletaan obfuskaatio ja diversifiointitekniikoita asiakaspuolen ohjelmistoratkaisuihin. Vankemman tietoturvallisuuden saavuttamiseksi työssä hyödynnetään laitteistopohjaisia TPM- ja SGX-ratkaisuja. Tietoturvallisuuden lisäksi nämä ratkaisut tarjoavat monikerroksisen luottamuksen rakentuen laitteistotasolta ohjelmistokerrokseen asti. Tämän väitöskirjatutkimustyön tuloksena, osajulkaisuiden kautta, vastataan moniin esineiden internet -laitteisiin ja pilvilaskentaan kohdistuviin tietoturvauhkiin. Työssä esitetään myös näkemyksiä jatkotutkimusaiheista

UTUPub

DRAGON: Decentralized fault tolerance in edge federations

Author: Casale G
Jennings NR
Tuli S
Publication venue: 'Institute of Electrical and Electronics Engineers (IEEE)'
Publication date: 16/08/2022
Field of study

Edge Federation is a new computing paradigm that seamlessly interconnects the resources of multiple edge service providers. A key challenge in such systems is the deployment of latency-critical and AI based resource-intensive applications in constrained devices. To address this challenge, we propose a novel memory-efficient deep learning based model, namely generative optimization networks (GON). Unlike GANs, GONs use a single network to both discriminate input and generate samples, significantly reducing their memory footprint. Leveraging the low memory footprint of GONs, we propose a decentralized fault-tolerance method called DRAGON that runs simulations (as per a digital modeling twin) to quickly predict and optimize the performance of the edge federation. Extensive experiments with real-world edge computing benchmarks on multiple Raspberry-Pi based federated edge configurations show that DRAGON can outperform the baseline methods in fault-detection and Quality of Service (QoS) metrics. Specifically, the proposed method gives higher F1 scores for fault-detection than the best deep learning (DL) method, while consuming lower memory than the heuristic methods. This allows for improvement in energy consumption, response time and service level agreement violations by up to 74, 63 and 82 percent, respectively

Spiral - Imperial College Digital Repository