253 research outputs found

    Review and Analysis of Failure Detection and Prevention Techniques in IT Infrastructure Monitoring

    Get PDF
    Maintaining the health of IT infrastructure components for improved reliability and availability is a research and innovation topic for many years. Identification and handling of failures are crucial and challenging due to the complexity of IT infrastructure. System logs are the primary source of information to diagnose and fix failures. In this work, we address three essential research dimensions about failures, such as the need for failure handling in IT infrastructure, understanding the contribution of system-generated log in failure detection and reactive & proactive approaches used to deal with failure situations. This study performs a comprehensive analysis of existing literature by considering three prominent aspects as log preprocessing, anomaly & failure detection, and failure prevention. With this coherent review, we (1) presume the need for IT infrastructure monitoring to avoid downtime, (2) examine the three types of approaches for anomaly and failure detection such as a rule-based, correlation method and classification, and (3) fabricate the recommendations for researchers on further research guidelines. As far as the authors\u27 knowledge, this is the first comprehensive literature review on IT infrastructure monitoring techniques. The review has been conducted with the help of meta-analysis and comparative study of machine learning and deep learning techniques. This work aims to outline significant research gaps in the area of IT infrastructure failure detection. This work will help future researchers understand the advantages and limitations of current methods and select an adequate approach to their problem

    Network problems detection and classification by analyzing syslog data

    Get PDF
    Network troubleshooting is an important process which has a wide research field. The first step in troubleshooting procedures is to collect information in order to diagnose the problems. Syslog messages which are sent by almost all network devices contain a massive amount of data related to the network problems. It is found that in many studies conducted previously, analyzing syslog data which can be a guideline for network problems and their causes was used. Detecting network problems could be more efficient if the detected problems have been classified in terms of network layers. Classifying syslog data needs to identify the syslog messages that describe the network problems for each layer, taking into account the different formats of various syslog for vendors’ devices. This study provides a method to classify syslog messages that indicates the network problem in terms of network layers. The method used data mining tool to classify the syslog messages while the description part of the syslog message was used for classification process. Related syslog messages were identified; features were then selected to train the classifiers. Six classification algorithms were learned; LibSVM, SMO, KNN, Naïve Bayes, J48, and Random Forest. A real data set which was obtained from the Universiti Utara Malaysia’s (UUM) network devices is used for the prediction stage. Results indicate that SVM shows the best performance during the training and prediction stages. This study contributes to the field of network troubleshooting, and the field of text data classification

    EvLog: Evolving Log Analyzer for Anomalous Logs Identification

    Full text link
    Software logs record system activities, aiding maintainers in identifying the underlying causes for failures and enabling prompt mitigation actions. However, maintainers need to inspect a large volume of daily logs to identify the anomalous logs that reveal failure details for further diagnosis. Thus, how to automatically distinguish these anomalous logs from normal logs becomes a critical problem. Existing approaches alleviate the burden on software maintainers, but they are built upon an improper yet critical assumption: logging statements in the software remain unchanged. While software keeps evolving, our empirical study finds that evolving software brings three challenges: log parsing errors, evolving log events, and unstable log sequences. In this paper, we propose a novel unsupervised approach named Evolving Log analyzer (EvLog) to mitigate these challenges. We first build a multi-level representation extractor to process logs without parsing to prevent errors from the parser. The multi-level representations preserve the essential semantics of logs while leaving out insignificant changes in evolving events. EvLog then implements an anomaly discriminator with an attention mechanism to identify the anomalous logs and avoid the issue brought by the unstable sequence. EvLog has shown effectiveness in two real-world system evolution log datasets with an average F1 score of 0.955 and 0.847 in the intra-version setting and inter-version setting, respectively, which outperforms other state-of-the-art approaches by a wide margin. To our best knowledge, this is the first study on tackling anomalous logs over software evolution. We believe our work sheds new light on the impact of software evolution with the corresponding solutions for the log analysis community

    Report from GI-Dagstuhl Seminar 16394: Software Performance Engineering in the DevOps World

    Get PDF
    This report documents the program and the outcomes of GI-Dagstuhl Seminar 16394 "Software Performance Engineering in the DevOps World". The seminar addressed the problem of performance-aware DevOps. Both, DevOps and performance engineering have been growing trends over the past one to two years, in no small part due to the rise in importance of identifying performance anomalies in the operations (Ops) of cloud and big data systems and feeding these back to the development (Dev). However, so far, the research community has treated software engineering, performance engineering, and cloud computing mostly as individual research areas. We aimed to identify cross-community collaboration, and to set the path for long-lasting collaborations towards performance-aware DevOps. The main goal of the seminar was to bring together young researchers (PhD students in a later stage of their PhD, as well as PostDocs or Junior Professors) in the areas of (i) software engineering, (ii) performance engineering, and (iii) cloud computing and big data to present their current research projects, to exchange experience and expertise, to discuss research challenges, and to develop ideas for future collaborations

    Fault Injection Analytics: A Novel Approach to Discover Failure Modes in Cloud-Computing Systems

    Full text link
    Cloud computing systems fail in complex and unexpected ways due to unexpected combinations of events and interactions between hardware and software components. Fault injection is an effective means to bring out these failures in a controlled environment. However, fault injection experiments produce massive amounts of data, and manually analyzing these data is inefficient and error-prone, as the analyst can miss severe failure modes that are yet unknown. This paper introduces a new paradigm (fault injection analytics) that applies unsupervised machine learning on execution traces of the injected system, to ease the discovery and interpretation of failure modes. We evaluated the proposed approach in the context of fault injection experiments on the OpenStack cloud computing platform, where we show that the approach can accurately identify failure modes with a low computational cost.Comment: IEEE Transactions on Dependable and Secure Computing; 16 pages. arXiv admin note: text overlap with arXiv:1908.1164

    Data-Driven Approach for Automatic Telephony Threat Analysis and Campaign Detection

    Get PDF
    The growth of the telephone network and the availability of Voice over Internet Protocol (VoIP) have both contributed to the availability of a flexible and easy to use artifact for users, but also to a significant increase in cyber-criminal activity. These criminals use emergent technologies to conduct illegal and suspicious activities. For instance, they use VoIP’s flexibility to abuse and scam victims. A lot of interest has been expressed into the analysis and assessment of telephony cyber-threats. A better understanding of these types of abuse is required in order to detect, mitigate, and attribute these attacks. The purpose of this research work is to generate relevant and timely telephony abuse intelligence that can support the mitigation and/or the investigation of such activities. To achieve this objective, we present, in this thesis, the design and implementation of a Telephony Abuse Intelligence Framework (TAINT) that automatically aggregates, analyzes and reports on telephony abuse activities. Such a framework monitors and analyzes, in near-real-time, crowd-sourced telephony complaints data from various sources. We deploy our framework on a large dataset of telephony complaints, spanning over seven years, to provide in-depth insights and intelligence about merging telephony threats. The framework presented in this thesis is of paramount importance when it comes to the mitigation, the prevention and the attribution of telephony abuse incidents. We analyze the data and report on the complaint distribution, the used numbers and the spoofed callers’ identifiers. In addition, we identify and geo-locate the sources of the phone calls, and further investigate the underlying telephony threats. Moreover, we quantify the similarity between reported phone numbers to unveil potential groups that are behind specific telephony abuse activities that are actually launched as telephony abuse campaigns

    Do Androids Dream of Electric Sheep? On Privacy in the Android Supply Chain

    Get PDF
    The Android Open Source Project (AOSP) was first released by Google in 2008 and has since become the most used operating system [Andaf]. Thanks to the openness of its source code, any smartphone vendor or original equipment manufacturer (OEM) can modify and adapt Android to their specific needs, or add proprietary features before installing it on their devices in order to add custom features to differentiate themselves from competitors. This has created a complex and diverse supply chain, completely opaque to end-users, formed by manufacturers, resellers, chipset manufacturers, network operators, and prominent actors of the online industry that partnered with OEMs. Each of these stakeholders can pre-install extra apps, or implement proprietary features at the framework level. However, such customizations can create privacy and security threats to end-users. Preinstalled apps are privileged by the operating system, and can therefore access system APIs or personal data more easily than apps installed by the user. Unfortunately, despite these potential threats, there is currently no end-to-end control over what apps come pre-installed on a device and why, and no traceability of the different software and hardware components used in a given Android device. In fact, the landscape of pre-installed software in Android and its security and privacy implications has largely remained unexplored by researchers. In this thesis, I investigate the customization of Android devices and their impact on the privacy and security of end-users. Specifically, I perform the first large-scale and systematic analysis of pre-installed Android apps and the supply chain. To do so, I first develop an app, Firmware Scanner [Sca], to crowdsource close to 34,000 Android firmware versions from 1,000 different OEMs from all over the world. This dataset allows us to map the stakeholders involved in the supply chain and their relationships, from device manufacturers and mobile network operators to third-party organizations like advertising and tracking services, and social network platforms. I could identify multiple cases of privacy-invasive and potentially harmful behaviors. My results show a disturbing lack of transparency and control over the Android supply chain, thus showing that it can be damageable privacy- and security-wise to end-users. Next, I study the evolution of the Android permission system, an essential security feature of the Android framework. Coupled with other protection mechanisms such as process sandboxing, the permission system empowers users to control what sensitive resources (e.g., user contacts, the camera, location sensors) are accessible to which apps. The research community has extensively studied the permission system, but most previous studies focus on its limitations or specific attacks. In this thesis, I present an up-to-date view and longitudinal analysis of the evolution of the permissions system. I study how some lesser-known features of the permission system, specifically permission flags, can impact the permission granting process, making it either more restrictive or less. I then highlight how pre-installed apps developers use said flags in the wild and focus on the privacy and security implications. Specifically, I show the presence of third-party apps, installed as privileged system apps, potentially using said features to share resources with other third-party apps. Another salient feature of the permission system is its extensibility: apps can define their own custom permissions to expose features and data to other apps. However, little is known about how widespread the usage of custom permissions is, and what impact these permissions may have on users’ privacy and security. In the last part of this thesis, I investigate the exposure and request of custom permissions in the Android ecosystem and their potential for opening privacy and security risks. I gather a 2.2-million-app-large dataset of both pre-installed and publicly available apps using both Firmware Scanner and purpose-built app store crawlers. I find the usage of custom permissions to be pervasive, regardless of the origin of the apps, and seemingly growing over time. Despite this prevalence, I find that custom permissions are virtually invisible to end-users, and their purpose is mostly undocumented. While Google recommends that developers use their reverse domain name as the prefix of their custom permissions [Gpla], I find widespread violations of this recommendation, making sound attribution at scale virtually impossible. Through static analysis methods, I demonstrate that custom permissions can facilitate access to permission-protected system resources to apps that lack those permissions, without user awareness. Due to the lack of tools for studying such risks, I design and implement two tools, PermissionTracer [Pere] and PermissionTainter [Perd] to study custom permissions. I highlight multiple cases of concerning use of custom permissions by Android apps in the wild. In this thesis, I systematically studied, at scale, the vast and overlooked ecosystem of preinstalled Android apps. My results show a complete lack of control of the supply chain which is worrying, given the huge potential impact of pre-installed apps on the privacy and security of end-users. I conclude with a number of open research questions and future avenues for further research in the ecosystem of the supply chain of Android devices.This work has been supported by IMDEA Networks InstitutePrograma de Doctorado en Ingeniería Telemática por la Universidad Carlos III de MadridPresidente: Douglas Leith.- Secretario: Rubén Cuevas Rumín.- Vocal: Hamed Haddad

    Development and Validation of a Proof-of-Concept Prototype for Analytics-based Malicious Cybersecurity Insider Threat in a Real-Time Identification System

    Get PDF
    Insider threat has continued to be one of the most difficult cybersecurity threat vectors detectable by contemporary technologies. Most organizations apply standard technology-based practices to detect unusual network activity. While there have been significant advances in intrusion detection systems (IDS) as well as security incident and event management solutions (SIEM), these technologies fail to take into consideration the human aspects of personality and emotion in computer use and network activity, since insider threats are human-initiated. External influencers impact how an end-user interacts with both colleagues and organizational resources. Taking into consideration external influencers, such as personality, changes in organizational polices and structure, along with unusual technical activity analysis, would be an improvement over contemporary detection tools used for identifying at-risk employees. This would allow upper management or other organizational units to intervene before a malicious cybersecurity insider threat event occurs, or mitigate it quickly, once initiated. The main goal of this research study was to design, develop, and validate a proof-of-concept prototype for a malicious cybersecurity insider threat alerting system that will assist in the rapid detection and prediction of human-centric precursors to malicious cybersecurity insider threat activity. Disgruntled employees or end-users wishing to cause harm to the organization may do so by abusing the trust given to them in their access to available network and organizational resources. Reports on malicious insider threat actions indicated that insider threat attacks make up roughly 23% of all cybercrime incidents, resulting in $2.9 trillion in employee fraud losses globally. The damage and negative impact that insider threats cause was reported to be higher than that of outsider or other types of cybercrime incidents. Consequently, this study utilized weighted indicators to measure and correlate simulated user activity to possible precursors to malicious cybersecurity insider threat attacks. This study consisted of a mixed method approach utilizing an expert panel, developmental research, and quantitative data analysis using the developed tool on simulated data set. To assure validity and reliability of the indicators, a panel of subject matter experts (SMEs) reviewed the indicators and indicator categorizations that were collected from prior literature following the Delphi technique. The SMEs’ responses were incorporated into the development of a proof-of-concept prototype. Once the proof-of-concept prototype was completed and fully tested, an empirical simulation research study was conducted utilizing simulated user activity within a 16-month time frame. The results of the empirical simulation study were analyzed and presented. Recommendations resulting from the study also be provided

    Performance and Reliability Evaluation of Apache Kafka Messaging System

    Get PDF
    Streaming data is now flowing across various devices and applications around us. This type of data means any unbounded, ever growing, infinite data set which is continuously generated by all kinds of sources. Examples include sensor data transmitted among different Internet of Things (IoT) devices, user activity records collected on websites and payment requests sent from mobile devices. In many application scenarios, streaming data needs to be processed in real-time because its value can be futile over time. A variety of stream processing systems have been developed in the last decade and are evolving to address rising challenges. A typical stream processing system consists of multiple processing nodes in the topology of a DAG (directed acyclic graph). To build real-time streaming data pipelines across those nodes, message middleware technology is widely applied. As a distributed messaging system with high durability and scalability, Apache Kafka has become very popular among modern companies. It ingests streaming data from upstream applications and store the data in its distributed cluster, which provides a fault-tolerant data source for stream processors. Therefore, Kafka plays a critical role to ensure the completeness, correctness and timeliness of streaming data delivery. However, it is impossible to meet all the user requirements in real-time cases with a simple and fixed data delivery strategy. In this thesis, we address the challenge of choosing a proper configuration to guarantee both performance and reliability of Kafka for complex streaming application scenarios. We investigate the features that have an impact on the performance and reliability metrics. We propose a queueing based prediction model to predict the performance metrics, including producer throughput and packet latency of Kafka. We define two reliability metrics, the probability of message loss and the probability of message duplication. We create an ANN model to predict these metrics given unstable network metrics like network delay and packet loss rate. To collect sufficient training data we build a Docker-based Kafka testbed with a fault injection module. We use a new quality-of-service metric, timely throughput to help us choosing proper batch size in Kafka. Based on this metric, we propose a dynamic configuration method, which reactively guarantees both performance and reliability of Kafka under complex operation conditions
    • …
    corecore