
    DeepFT: Fault-tolerant edge computing using a self-supervised deep surrogate model

    The emergence of latency-critical AI applications has been supported by the evolution of the edge computing paradigm. However, edge solutions are typically resource-constrained, posing reliability challenges due to heightened contention for compute capacities and faulty application behavior in the presence of overload conditions. Although a large amount of generated log data can be mined for fault prediction, labeling this data for training is a manual process and thus a limiting factor for automation. Due to this, many companies resort to unsupervised fault-tolerance models. Yet, failure models of this kind can incur a loss of accuracy when they need to adapt to non-stationary workloads and diverse host characteristics. Thus, we propose a novel modeling approach, DeepFT, to proactively avoid system overloads and their adverse effects by optimizing the task scheduling decisions. DeepFT uses a deep surrogate model to accurately predict and diagnose faults in the system and co-simulation-based self-supervised learning to dynamically adapt the model in volatile settings. Experimentation on an edge cluster shows that DeepFT can outperform state-of-the-art methods in fault-detection and QoS metrics. Specifically, DeepFT gives the highest F1 scores for fault detection, reducing service deadline violations by up to 37% while also improving response time by up to 9%.
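    To make the scheduling idea concrete, below is a minimal sketch of a surrogate model that scores candidate task placements by predicted fault risk; the FaultSurrogate class, its tensor shapes, and the min-max selection rule are illustrative assumptions, not DeepFT's published architecture or its self-supervised training loop.

```python
# Hypothetical sketch: a surrogate network scores candidate schedules
# by predicted per-host fault probability (not DeepFT's actual code).
import torch
import torch.nn as nn

class FaultSurrogate(nn.Module):
    def __init__(self, n_hosts: int, n_features: int, hidden: int = 64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(n_hosts * n_features, hidden),
            nn.ReLU(),
            nn.Linear(hidden, n_hosts),  # one fault logit per host
        )

    def forward(self, state: torch.Tensor) -> torch.Tensor:
        # state: (batch, n_hosts, n_features) utilization metrics
        return torch.sigmoid(self.net(state.flatten(1)))

def pick_schedule(model: FaultSurrogate, candidates: torch.Tensor) -> int:
    # Choose the candidate whose worst-case host fault risk is lowest.
    with torch.no_grad():
        risk = model(candidates).max(dim=1).values  # (n_candidates,)
    return int(risk.argmin())

model = FaultSurrogate(n_hosts=4, n_features=3)
candidates = torch.rand(8, 4, 3)  # 8 candidate task placements
print("chosen schedule:", pick_schedule(model, candidates))
```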

    Regularized Bottleneck with Early Labeling

    Small IoT devices, such as drones and lightweight battery-powered robots, are emerging as a major platform for the deployment of AI/ML capabilities. Autonomous and semi-autonomous device operation relies on the systematic use of deep neural network models for solving complex tasks, such as image classification. The challenging restrictions of these devices in terms of computing capabilities, network connectivity, and power consumption are the main limits to the accuracy of latency-sensitive inferences. This paper presents ReBEL, a split computing architecture enabling the dynamic remote offload of partial computations or, alternatively, a local approximate labeling based on a jointly-trained classifier. Our approach combines elements of head network distillation, early exit classification, and bottleneck injection with the goal of reducing the average end-to-end latency of AI/ML inference on constrained IoT devices.
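    As a rough illustration of the split-computing pattern described above, the following PyTorch sketch places a narrow bottleneck between an on-device head and a server-side tail and attaches an early-exit classifier at the split; the SplitNet layout, channel counts, and confidence threshold are hypothetical, not ReBEL's actual design.

```python
# Hedged sketch of split computing with an early-exit head at a
# bottleneck; sizes and thresholds are illustrative, not ReBEL's.
import torch
import torch.nn as nn
import torch.nn.functional as F

class SplitNet(nn.Module):
    def __init__(self, n_classes: int = 10):
        super().__init__()
        self.head = nn.Sequential(            # runs on the IoT device
            nn.Conv2d(3, 16, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(16, 4, 3, stride=2, padding=1),  # bottleneck: 4 ch
        )
        self.early_exit = nn.Linear(4 * 8 * 8, n_classes)  # local classifier
        self.tail = nn.Sequential(            # runs on the edge server
            nn.Conv2d(4, 32, 3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(32, n_classes),
        )

    def forward(self, x: torch.Tensor, conf_thresh: float = 0.9):
        z = self.head(x)                      # compressed representation
        early = self.early_exit(z.flatten(1))
        conf = F.softmax(early, dim=1).max(dim=1).values
        if bool((conf >= conf_thresh).all()):
            return early, "local"             # label on-device
        return self.tail(z), "offloaded"      # ship bottleneck tensor

net = SplitNet()
logits, path = net(torch.rand(1, 3, 32, 32))  # assumes 32x32 RGB input
print(path, logits.shape)
```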

    Deep Learning for Automated Experimentation in Scanning Transmission Electron Microscopy

    Machine learning (ML) has become critical for post-acquisition data analysis in (scanning) transmission electron microscopy, (S)TEM, imaging and spectroscopy. An emerging trend is the transition to real-time analysis and closed-loop microscope operation. The effective use of ML in electron microscopy now requires the development of strategies for microscopy-centered experiment workflow design and optimization. Here, we discuss the associated challenges with the transition to active ML, including sequential data analysis and out-of-distribution drift effects, the requirements for edge operation, local and cloud data storage, and theory-in-the-loop operations. Specifically, we discuss the relative contributions of human scientists and ML agents in the ideation, orchestration, and execution of experimental workflows and the need to develop universal hyper-languages that can apply across multiple platforms. These considerations will collectively inform the operationalization of ML in next-generation experimentation.

    Anomaly Detection Methods for Log Files

    This thesis is dedicated to methods of anomaly detection applied to log files. Current state-of-the-art anomaly detection methods usually follow the traditional approach to log processing. First, log files are processed by a log parsing technique, which transforms text information into non-specific structured data. Next, the data is converted into a numerical representation. The feature extraction is often related to natural language processing techniques. However, the traditional approach requires extensive domain knowledge and retraining of the model when new log messages become available. Thanks to recent advancements in the natural language processing domain, we can directly learn embedding vectors instead of relying on feature extraction based on log parsing. We propose novel autoencoder-based models leveraging the embedding vectors, since autoencoders are a recommended choice in the field of anomaly detection. Moreover, we experiment with various techniques incorporated into the autoencoders, such as convolutional layers and the self-attention mechanism. We verify that autoencoders utilizing convolutional layers are effective for anomaly detection in log files. Furthermore, we demonstrate that boosting the models with the self-attention mechanism can be advantageous and opens room for future work and further research. Finally, we conclude that the traditional approach combined with an autoencoder can achieve impressive results on the provided testing data set. Nonetheless, the AECNN1D model achieves the most promising results among the models leveraging the embedding representation of logs, with an F1-score of 0.8597 on the testing data set. The AECNN1D model is generally suitable for production deployment, since it has no additional requirements and needs no periodic retraining.
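    The following is a minimal sketch of the general pattern the thesis describes, a 1D convolutional autoencoder over embedded log sequences whose reconstruction error serves as the anomaly score; the ConvAutoencoder layout and all sizes are assumptions for illustration, not the actual AECNN1D model.

```python
# Illustrative 1-D convolutional autoencoder over embedded log
# sequences; internals and sizes are assumptions, not AECNN1D's.
import torch
import torch.nn as nn

class ConvAutoencoder(nn.Module):
    def __init__(self, emb_dim: int = 32):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv1d(emb_dim, 16, kernel_size=3, padding=1), nn.ReLU(),
            nn.Conv1d(16, 8, kernel_size=3, padding=1), nn.ReLU(),
        )
        self.decoder = nn.Sequential(
            nn.Conv1d(8, 16, kernel_size=3, padding=1), nn.ReLU(),
            nn.Conv1d(16, emb_dim, kernel_size=3, padding=1),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, emb_dim, seq_len) window of embedded log lines
        return self.decoder(self.encoder(x))

def anomaly_score(model: ConvAutoencoder, x: torch.Tensor) -> torch.Tensor:
    # High reconstruction error flags an anomalous log window.
    with torch.no_grad():
        return ((model(x) - x) ** 2).mean(dim=(1, 2))

model = ConvAutoencoder()
window = torch.rand(4, 32, 20)  # 4 windows of 20 embedded log lines
print(anomaly_score(model, window))
```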

    TPMCF: Temporal QoS Prediction using Multi-Source Collaborative Features

    Recently, with the rapid deployment of service APIs, personalized service recommendations have played a paramount role in the growth of the e-commerce industry. Quality-of-Service (QoS) parameters, which determine service performance and are often used for recommendation, fluctuate over time. Thus, QoS prediction is essential to identify a suitable service among functionally equivalent services over time. Contemporary temporal QoS prediction methods hardly achieve the desired accuracy due to various limitations, such as the inability to handle data sparsity and outliers or to capture higher-order temporal relationships among user-service interactions. Even though some recent recurrent neural-network-based architectures can model temporal relationships among QoS data, prediction accuracy degrades due to the absence of other features (e.g., collaborative features) that capture the relationships among user-service interactions. This paper addresses the above challenges and proposes a scalable strategy for Temporal QoS Prediction using Multi-source Collaborative-Features (TPMCF), achieving high prediction accuracy and fast responsiveness. TPMCF combines the collaborative features of users/services, obtained by exploiting the user-service relationship, with spatio-temporal features auto-extracted by graph convolution and a transformer encoder with multi-head self-attention. We validated our proposed method on the WS-DREAM-2 datasets. Extensive experiments showed that TPMCF outperforms major state-of-the-art approaches in prediction accuracy while ensuring high scalability and reasonably fast responsiveness.
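    As a hedged sketch of how graph convolution and a transformer encoder might be combined for temporal QoS prediction, consider the toy model below; the TemporalQoS class, its single graph-convolution step, and all dimensions are illustrative assumptions rather than TPMCF's actual architecture.

```python
# Rough sketch: graph convolution over a user-service graph feeding a
# transformer encoder over time; not TPMCF's actual design.
import torch
import torch.nn as nn

class TemporalQoS(nn.Module):
    def __init__(self, n_nodes: int, dim: int = 16):
        super().__init__()
        self.embed = nn.Embedding(n_nodes, dim)
        self.gcn_w = nn.Linear(dim, dim)          # one graph-conv layer
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=4,
                                           batch_first=True)
        self.temporal = nn.TransformerEncoder(layer, num_layers=2)
        self.out = nn.Linear(dim, 1)              # predicted QoS value

    def forward(self, adj: torch.Tensor, node_seq: torch.Tensor):
        # adj: (n_nodes, n_nodes) normalized adjacency of the
        # user-service graph; node_seq: (batch, T) node ids over time.
        h = torch.relu(self.gcn_w(adj @ self.embed.weight))  # collab feats
        seq = h[node_seq]                          # (batch, T, dim)
        return self.out(self.temporal(seq)[:, -1])  # QoS at last step

n = 6
adj = torch.eye(n)                                 # trivial demo graph
model = TemporalQoS(n)
print(model(adj, torch.randint(0, n, (2, 5))).shape)  # (2, 1)
```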

    Distributed Training Large-Scale Deep Architectures

    Scale of data and scale of computation infrastructures together enable the current deep learning renaissance. However, training large-scale deep architectures demands both algorithmic improvement and careful system configuration. In this paper, we focus on employing the system approach to speed up large-scale training. Via lessons learned from our routine benchmarking effort, we first identify bottlenecks and overheads that hinder data parallelism. We then devise guidelines that help practitioners configure an effective system and fine-tune parameters to achieve the desired speedup. Specifically, we develop a procedure for setting minibatch size and choosing computation algorithms. We also derive lemmas for determining the quantity of key components, such as the number of GPUs and parameter servers. Experiments and examples show that these guidelines help effectively speed up large-scale deep learning training.
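    In the spirit of such configuration guidelines, the toy helpers below estimate a memory-bound minibatch size and an Amdahl-style multi-GPU speedup; the formulas and constants are generic illustrative assumptions, not the lemmas derived in the paper.

```python
# Back-of-the-envelope helpers for configuring data-parallel training;
# these are generic illustrative formulas, not the paper's lemmas.

def max_minibatch(gpu_mem_gb: float, bytes_per_sample: float,
                  model_overhead_gb: float) -> int:
    """Largest per-GPU minibatch that fits after model/activation overhead."""
    free = (gpu_mem_gb - model_overhead_gb) * 1e9
    return max(1, int(free // bytes_per_sample))

def speedup(n_gpus: int, comm_fraction: float) -> float:
    """Amdahl-style estimate: comm_fraction of step time is serialized."""
    return 1.0 / (comm_fraction + (1.0 - comm_fraction) / n_gpus)

print(max_minibatch(16.0, 50e6, 4.0))   # -> 240 samples per GPU
print(round(speedup(8, 0.1), 2))        # -> 4.71x with 10% comm cost
```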

    Detecting Anomalies From Big Data System Logs

    Nowadays, big data systems (e.g., Hadoop and Spark) are being widely adopted by many domains for offering effective data solutions, such as manufacturing, healthcare, education, and media. A common problem in big data systems is the anomaly, i.e., a status that deviates from normal execution, decreasing computation performance or killing running programs. It is becoming a necessity to detect anomalies and analyze their causes. An effective and economical approach is to analyze system logs. Big data systems produce numerous unstructured logs that contain buried valuable information. However, manually detecting anomalies from system logs is a tedious and daunting task. This dissertation proposes four approaches that can accurately and automatically analyze anomalies from big data system logs without extra monitoring overhead. Moreover, to detect abnormal tasks in Spark logs and analyze root causes, we design a utility to conduct fault injection and collect logs from multiple compute nodes. (1) Our first method is a statistical approach that can locate abnormal tasks and calculate the weights of factors for analyzing the root causes. In the experiment, four potential root causes are considered, i.e., CPU, memory, network, and disk I/O. The experimental results show that the proposed approach is accurate in detecting abnormal tasks as well as finding the root causes. (2) To give a more reasonable probability result and avoid ad-hoc factor-weight calculation, we propose a neural network approach to analyze the root causes of abnormal tasks. We leverage a General Regression Neural Network (GRNN) to identify root causes for abnormal tasks. The likelihood of reported root causes is presented to users according to the factors weighted by the GRNN. (3) To further improve anomaly detection by avoiding feature extraction, we propose a novel approach leveraging Convolutional Neural Networks (CNNs). Our proposed model can automatically learn event relationships in system logs and detect anomalies with high accuracy. Our deep neural network consists of logkey2vec embeddings, three 1D convolutional layers, a dropout layer, and max pooling. According to our experiments, our CNN-based approach achieves better accuracy than approaches using Long Short-Term Memory (LSTM) and Multilayer Perceptron (MLP) models for detecting anomalies in Hadoop Distributed File System (HDFS) logs. (4) To analyze system logs more accurately, we extend our CNN-based approach with two attention schemes to detect anomalies in system logs. The two proposed attention schemes focus on different features of the CNN's output. We evaluate our approaches on several benchmarks, and the attention-based CNN model shows the best performance among all state-of-the-art methods.
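    Since the abstract spells out the CNN architecture of approach (3), here is a minimal PyTorch sketch of that shape, logkey2vec embeddings feeding three 1D convolutional layers with dropout and max pooling; the vocabulary size, kernel sizes, and channel counts are assumptions, not the dissertation's exact hyperparameters.

```python
# Sketch of the described CNN: log-key embeddings, three 1-D conv
# layers, dropout, and max pooling; exact sizes are assumptions.
import torch
import torch.nn as nn

class LogCNN(nn.Module):
    def __init__(self, vocab: int = 50, emb: int = 32, n_classes: int = 2):
        super().__init__()
        self.logkey2vec = nn.Embedding(vocab, emb)  # log-key embeddings
        self.convs = nn.Sequential(
            nn.Conv1d(emb, 64, kernel_size=3, padding=1), nn.ReLU(),
            nn.Conv1d(64, 64, kernel_size=4, padding=2), nn.ReLU(),
            nn.Conv1d(64, 64, kernel_size=5, padding=2), nn.ReLU(),
        )
        self.dropout = nn.Dropout(0.5)
        self.fc = nn.Linear(64, n_classes)

    def forward(self, keys: torch.Tensor) -> torch.Tensor:
        # keys: (batch, seq_len) sequence of parsed log-key ids
        x = self.logkey2vec(keys).transpose(1, 2)   # (batch, emb, seq)
        x = self.convs(x)
        x = torch.max(x, dim=2).values              # global max pooling
        return self.fc(self.dropout(x))

model = LogCNN()
print(model(torch.randint(0, 50, (4, 30))).shape)  # (4, 2) logits
```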