82 research outputs found

    Towards Data-Driven Autonomics in Data Centers

    Get PDF
    Continued reliance on human operators for managing data centers is a major impediment to their ever reaching extreme dimensions. Large computer systems in general, and data centers in particular, will ultimately be managed using predictive computational and executable models obtained through data-science tools, and at that point, the intervention of humans will be limited to setting high-level goals and policies rather than performing low-level operations. Data-driven autonomics, where management and control are based on holistic predictive models that are built and updated using generated data, opens one possible path towards limiting the role of operators in data centers. In this paper, we present a data-science study of a public Google dataset collected in a 12K-node cluster with the goal of building and evaluating a predictive model for node failures. We use BigQuery, the big data SQL platform from the Google Cloud suite, to process massive amounts of data and generate a rich feature set characterizing machine state over time. We describe how an ensemble classifier can be built out of many Random Forest classifiers, each trained on these features, to predict whether machines will fail in a future 24-hour window. Our evaluation reveals that if we limit false positive rates to 5%, we can achieve true positive rates between 27% and 88%, with precision varying between 50% and 72%. We discuss the practicality of including our predictive model as the central component of a data-driven autonomic manager and operating it on-line with live data streams (rather than off-line on data logs). All of the scripts used for BigQuery and classification analyses are publicly available from the authors' website. (Comment: 12 pages, 6 figures)
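
    For illustration, a minimal sketch of the classification setup described above: several Random Forests are trained on bootstrap resamples, their failure scores averaged, and the decision threshold chosen to keep the false-positive rate at or below 5%. The feature matrix X and the 24-hour failure labels y are assumed inputs; this is not the authors' published pipeline.

```python
# Minimal sketch, assuming X (per-machine features) and y (fails within the
# next 24 h) are numpy arrays; not the paper's exact implementation.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_curve

def train_rf_ensemble(X, y, n_members=10, seed=0):
    """Train several Random Forests, each on a bootstrap resample."""
    rng = np.random.default_rng(seed)
    members = []
    for _ in range(n_members):
        idx = rng.choice(len(X), size=len(X), replace=True)
        members.append(
            RandomForestClassifier(n_estimators=100, random_state=seed).fit(X[idx], y[idx])
        )
    return members

def ensemble_scores(members, X):
    """Average the predicted failure probability over all members."""
    return np.mean([m.predict_proba(X)[:, 1] for m in members], axis=0)

def threshold_at_fpr(y_true, scores, max_fpr=0.05):
    """Most permissive threshold whose false-positive rate stays under the cap."""
    fpr, _, thresholds = roc_curve(y_true, scores)
    return thresholds[fpr <= max_fpr][-1]
```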

    Towards Operator-less Data Centers Through Data-Driven, Predictive, Proactive Autonomics

    Get PDF
    Continued reliance on human operators for managing data centers is a major impediment to their ever reaching extreme dimensions. Large computer systems in general, and data centers in particular, will ultimately be managed using predictive computational and executable models obtained through data-science tools, and at that point, the intervention of humans will be limited to setting high-level goals and policies rather than performing low-level operations. Data-driven autonomics, where management and control are based on holistic predictive models that are built and updated using live data, opens one possible path towards limiting the role of operators in data centers. In this paper, we present a data-science study of a public Google dataset collected in a 12K-node cluster with the goal of building and evaluating predictive models for node failures. Our results support the practicality of a data-driven approach by showing the effectiveness of predictive models based on data found in typical data center logs. We use BigQuery, the big data SQL platform from the Google Cloud suite, to process massive amounts of data and generate a rich feature set characterizing node state over time. We describe how an ensemble classifier can be built out of many Random Forest classifiers, each trained on these features, to predict whether nodes will fail in a future 24-hour window. Our evaluation reveals that if we limit false positive rates to 5%, we can achieve true positive rates between 27% and 88%, with precision varying between 50% and 72%. This level of performance allows us to recover a large fraction of jobs' executions (by redirecting them to other nodes when a failure of the present node is predicted) that would otherwise have been wasted due to failures. [...]

    Mining Time-aware Actor-level Evolution Similarity for Link Prediction in Dynamic Network

    Get PDF
    Topological evolution over time in a dynamic network triggers both the addition and deletion of actors and the links among them. A dynamic network can be represented as a time series of network snapshots, where each snapshot represents the state of the network over an interval of time (for example, a minute, hour or day). The duration of each snapshot denotes the temporal scale/sliding window of the dynamic network, and all the links within the duration of the window are aggregated together irrespective of their order in time. The inherent trade-off in selecting the timescale in analysing dynamic networks is that choosing a short temporal window may lead to chaotic changes in network topology and measures (for example, the actors' centrality measures and the average path length), whereas choosing a long window may compromise the study and the investigation of network dynamics. Therefore, to facilitate the analysis and understand different patterns of actor-oriented evolutionary aspects, it is necessary to define an optimal window length (temporal duration) with which to sample a dynamic network. In addition to determining the optimal temporal duration, another key task for understanding the dynamics of evolving networks is being able to predict the likelihood of future links among pairs of actors given the existing states of the link structure at the present time. This phenomenon is known as the link prediction problem in network science. Instead of considering a static state of a network where the associated topology does not change, dynamic link prediction attempts to predict emerging links by considering different types of historical/temporal information, for example the different types of temporal evolution experienced by the actors of a dynamic network due to topological evolution over time, known as actor dynamicities. Although there has been some success in developing various methodologies and metrics for the purpose of dynamic link prediction, mining actor-oriented evolutions to address this problem has received little attention from the research community. In addition, the existing methodologies were developed without considering the sampling window size of the dynamic network, even though the sampling duration has a large impact on mining the network dynamics of an evolutionary network. Therefore, although the principal focus of this thesis is link prediction in dynamic networks, the determination of the optimal sampling window was also considered.
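
    To make the snapshot notion concrete, here is a minimal sketch that slices a timestamped edge list into aggregated snapshots of a chosen window length. The (u, v, t) edge tuples and the window size are illustrative assumptions, not the thesis's data model.

```python
# Sketch: aggregate a timestamped edge list into snapshots of a fixed window.
# Links inside a window are merged regardless of their order in time.
from collections import defaultdict
import networkx as nx

def snapshots(edges, window):
    """edges: iterable of (u, v, t) tuples; window: snapshot duration."""
    buckets = defaultdict(list)
    for u, v, t in edges:
        buckets[int(t // window)].append((u, v))
    graphs = []
    for k in sorted(buckets):
        g = nx.Graph()
        g.add_edges_from(buckets[k])
        graphs.append((k * window, g))
    return graphs  # list of (window start time, snapshot graph)
```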

    Online semi-supervised learning in non-stationary environments

    Get PDF
    Existing Data Stream Mining (DSM) algorithms assume the availability of labelled and balanced data, immediately or after some delay, to extract worthwhile knowledge from continuous and rapid data streams. However, in many real-world applications such as robotics, weather monitoring, fraud detection systems, cyber security and computer network traffic flow, an enormous amount of high-speed data is generated by Internet of Things sensors and real-time data on the Internet. Manual labelling of these data streams is not practical due to the time it consumes and the need for domain expertise. Another challenge is learning under Non-Stationary Environments (NSEs), which occurs due to changes in the data distributions of a set of input variables and/or class labels. The problem of Extreme Verification Latency (EVL) under NSEs is referred to as an Initially Labelled Non-Stationary Environment (ILNSE). This is a challenging task because the learning algorithms have no direct access to the true class labels when the concept evolves. Several approaches exist that deal with NSE and EVL in isolation; however, few algorithms address both issues simultaneously. This research directly responds to ILNSE's challenge by proposing two novel algorithms: the "Predictor for Streaming Data with Scarce Labels" (PSDSL) and the Heterogeneous Dynamic Weighted Majority (HDWM) classifier. PSDSL is an Online Semi-Supervised Learning (OSSL) method for real-time DSM and is closely related to label scarcity issues in online machine learning. The key capabilities of PSDSL include learning from a small amount of labelled data in an incremental or online manner and being available to predict at any time. To achieve this, PSDSL utilises both labelled and unlabelled data to train the prediction models, meaning it continuously learns from incoming data and updates the model as new labelled or unlabelled data becomes available over time. Furthermore, it can predict under NSE conditions even when class labels are scarce. PSDSL is built on top of the HDWM classifier, which preserves the diversity of the classifiers; PSDSL and HDWM can intelligently switch and adapt to the conditions. PSDSL adapts its learning state between self-learning, micro-clustering and CGC, whichever approach is beneficial, based on the characteristics of the data stream. HDWM makes use of "seed" learners of different types in an ensemble to maintain its diversity; an ensemble is simply a combination of predictive models grouped to improve on the predictive performance of a single classifier. PSDSL is empirically evaluated against COMPOSE, LEVELIW, SCARGC and MClassification on benchmark NSE datasets as well as Massive Online Analysis (MOA) data streams and real-world datasets. The results showed that PSDSL performed significantly better than existing approaches on most real-time data streams, including randomised data instances. PSDSL performed significantly better than 'Static', i.e. a classifier that is not updated after being trained on the first examples in the data stream. When applied to MOA-generated data streams, PSDSL ranked highest (1.5) and thus performed significantly better than SCARGC, while SCARGC performed the same as Static. PSDSL achieved better average prediction accuracies in a shorter time than SCARGC. The HDWM algorithm is evaluated on artificial and real-world data streams against existing well-known approaches such as the heterogeneous Weighted Majority Algorithm (WMA) and the homogeneous Dynamic Weighted Majority (DWM) algorithm.
The results showed that HDWM performed significantly better than WMA and DWM. Also, when recurring concept drifts were present, the predictive performance of HDWM showed an improvement over DWM. In both drift and real-world streams, significance tests and post hoc comparisons found significant differences between algorithms: HDWM performed significantly better than DWM and WMA when applied to MOA data streams and four real-world datasets (Electric, Spam, Sensor and Forest Cover). The seeding mechanism and the dynamic inclusion of new base learners in HDWM benefit from both forgetting and retaining models. The algorithm also allows the optimal base classifier in its ensemble to be selected independently, depending on the problem. A new approach, Envelope-Clustering, is introduced to resolve cluster overlap conflicts during the cluster labelling process. In this process, PSDSL transforms the centroid information of micro-clusters into micro-instances and generates new clusters called Envelopes. The nearest envelope clusters assist the conflicted micro-clusters and successfully guide the cluster labelling process after concept drifts in the absence of true class labels. PSDSL has been evaluated on the real-world problem of keystroke dynamics: PSDSL achieved higher prediction accuracy (85.3%) than SCARGC (81.6%), while Static (49.0%) degraded significantly due to changes in the users' typing patterns. Furthermore, the predictive accuracy of SCARGC fluctuated widely (41.1% to 81.6%) depending on the value of the parameter 'k' (number of clusters), while PSDSL determines the best value for this parameter automatically.
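
    As a loose illustration of the dynamic-weighted-majority idea behind HDWM, the sketch below keeps heterogeneous seed learners whose weighted votes decide the prediction, decays an expert's weight when it errs, and prunes experts whose weight falls below a floor. It is a simplified reading of the DWM family with assumed seed learners, not the thesis implementation.

```python
# Simplified dynamic-weighted-majority sketch with heterogeneous seeds;
# an illustrative reading of DWM/HDWM, not the thesis code.
from sklearn.linear_model import SGDClassifier
from sklearn.naive_bayes import GaussianNB

SEEDS = [GaussianNB, lambda: SGDClassifier(loss="log_loss")]

class HeterogeneousDWM:
    def __init__(self, classes, beta=0.5, theta=0.01):
        self.classes, self.beta, self.theta = list(classes), beta, theta
        self.experts = [[seed(), 1.0] for seed in SEEDS]

    def _vote(self, clf, x):
        try:
            return clf.predict(x.reshape(1, -1))[0]
        except Exception:          # expert has not been fitted yet
            return None

    def predict(self, x):
        """Weighted-majority vote over the fitted experts; x is a numpy row."""
        votes = {}
        for clf, w in self.experts:
            y = self._vote(clf, x)
            if y is not None:
                votes[y] = votes.get(y, 0.0) + w
        return max(votes, key=votes.get) if votes else self.classes[0]

    def partial_fit(self, x, y):
        for expert in self.experts:
            clf, w = expert
            if self._vote(clf, x) not in (None, y):
                expert[1] = w * self.beta     # penalise the mistaken expert
            clf.partial_fit(x.reshape(1, -1), [y], classes=self.classes)
        # Prune decayed experts; re-seed if everything was pruned.
        self.experts = [e for e in self.experts if e[1] >= self.theta] \
            or [[seed(), 1.0] for seed in SEEDS]
```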

    Strategic Project Portfolio Management by Predicting Project Performance and Estimating Strategic Fit

    Get PDF
    Candidate project selections are crucial for infrastructure construction companies. First, they determine how well the planned strategy will be realized during the following years. If the selected projects do not align with the competences of the organization, major losses can occur during the projects' execution phase. Second, participating in tendering competitions is costly manual labour, and losing a bid directly increases the overhead costs of the organization. Still, contractors rarely utilize statistical methods to select projects that are more likely to be successful. In response to these two issues, a tool for the project portfolio selection phase was developed based on the existing literature on strategic fit estimation and project performance prediction. One way to define the strategic fit of a project is to evaluate the alignment between the characteristics of the project and the strategic objectives of the organisation. Project performance, on the other hand, can be measured with various financial, technical, production, risk or human-resource related criteria. Depending on which measure is highlighted, the likelihood of succeeding with regard to that performance measure can be predicted with numerous machine learning methods, of which decision trees were used in this study. By combining the strategic fit and likelihood-of-success measures, a two-by-two matrix was formed. The matrix can be used to sort project opportunities into four categories (ignore, analyse, cash-in and focus) that can guide candidate project selections. To test and demonstrate the performance of the matrix, the case company's CRM data was used to estimate strategic fit and the likelihood of succeeding in tendering competitions. First, the projects were plotted on the matrix and their position and accuracy were analysed per quadrant. Afterwards, project selections were simulated and compared against the case company's real selections during a six-month period. The first implication after plotting the projects on the matrix was that only a handful of projects were positioned in the focus category, which indicates a discrepancy between the planned strategy and the competences of the case company in tendering competitions. Second, tendering competition outcomes were easier to predict in the low strategic fit quadrants, as the project selections in them were more accurate than in the high strategic fit quadrants. Finally, the matrix also quite accurately filtered the worst low strategic fit projects out from the market. The simulation was done in two stages. First, by emphasizing the likelihood-of-success predictions, the matrix increased the hit rate and average strategic fit of the selected project portfolio. When strategic fit values were emphasized, on the other hand, the simulation did not yield useful results. The study contributes to the project portfolio management literature by developing a practice-oriented tool that emphasizes the strategic and statistical perspectives of the candidate project selection phase.
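
    A minimal sketch of how the matrix could be operationalised: a decision tree (the method used in the study) estimates the probability of winning a tender, a strategic-fit score is assumed to be given, and each opportunity is mapped to a quadrant. The thresholds and the mapping of categories to quadrants are assumptions for illustration, not the case company's model.

```python
# Sketch: two-by-two portfolio matrix driven by a decision-tree win estimate.
# Quadrant mapping and cut-offs are hypothetical.
from sklearn.tree import DecisionTreeClassifier

def categorize(fit, p_win, fit_cut=0.5, win_cut=0.5):
    """Map one opportunity to a quadrant of the two-by-two matrix."""
    if fit >= fit_cut:
        return "focus" if p_win >= win_cut else "analyse"
    return "cash-in" if p_win >= win_cut else "ignore"

def score_opportunities(X_hist, won_hist, X_new, fit_scores):
    """Predict win likelihood from historical tenders, then place each project."""
    tree = DecisionTreeClassifier(max_depth=4, random_state=0).fit(X_hist, won_hist)
    p_win = tree.predict_proba(X_new)[:, 1]   # probability of winning the bid
    return [categorize(f, p) for f, p in zip(fit_scores, p_win)]
```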

    Unsupervised Algorithms to Detect Zero-Day Attacks: Strategy and Application

    Get PDF
    In the last decade, researchers, practitioners and companies have striven to devise mechanisms for detecting cyber-security threats. Among others, those efforts produced rule-based, signature-based and supervised Machine Learning (ML) algorithms that have proven effective for detecting intrusions that have already been encountered and characterized. By contrast, new unknown threats, often referred to as zero-day attacks or zero-days, are likely to go undetected, as they are often misclassified by those techniques. In recent years, unsupervised anomaly detection algorithms have shown potential to detect zero-days. However, dedicated support for quantitative analyses of unsupervised anomaly detection algorithms is still scarce and often does not promote meta-learning, which has the potential to improve classification performance. To that end, this paper introduces the problem of zero-days and reviews unsupervised algorithms for their detection. Then, the paper applies a question-answer approach to identify typical issues in conducting quantitative analyses for zero-day detection, and shows how to set up and exercise unsupervised algorithms with appropriate tooling. Using a very recent attack dataset, we discuss i) the impact of features on the detection performance of unsupervised algorithms, ii) the relevant metrics to evaluate intrusion detectors, iii) means to compare multiple unsupervised algorithms, and iv) the application of meta-learning to reduce misclassifications. Ultimately, v) we measure the detection performance of unsupervised anomaly detection algorithms with respect to zero-days. Overall, the paper exemplifies how to practically orchestrate and apply an appropriate methodology, process and tooling, providing even non-experts with the means to select appropriate strategies to deal with zero-days.
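
    As a small example of setting up and exercising unsupervised algorithms, the sketch below fits two common detectors on (presumed mostly benign) training traffic and compares them on labelled test traffic with a single metric. The train/test split, choice of detectors and contamination rate are illustrative assumptions, not the paper's exact tooling.

```python
# Sketch: compare two unsupervised anomaly detectors on a labelled dataset,
# assuming y_test marks attacks as 1 and normal traffic as 0.
from sklearn.ensemble import IsolationForest
from sklearn.neighbors import LocalOutlierFactor
from sklearn.metrics import matthews_corrcoef

def compare_detectors(X_train, X_test, y_test, contamination=0.05):
    detectors = {
        "IsolationForest": IsolationForest(contamination=contamination,
                                           random_state=0),
        "LOF": LocalOutlierFactor(novelty=True, contamination=contamination),
    }
    scores = {}
    for name, det in detectors.items():
        det.fit(X_train)
        pred = (det.predict(X_test) == -1).astype(int)  # -1 means anomalous
        scores[name] = matthews_corrcoef(y_test, pred)
    return scores  # one MCC per detector, higher is better
```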

    Application of advanced machine learning techniques to early network traffic classification

    Get PDF
    The fast-paced evolution of the Internet is drawing a complex context which imposes demanding requirements to assure end-to-end Quality of Service. The development of advanced intelligent approaches in networking envisions features such as autonomous resource allocation and fast reaction against unexpected network events. Internet Network Traffic Classification constitutes a crucial source of information for Network Management and is decisive in assisting the emerging network control paradigms. Monitoring traffic flowing through network devices supports tasks such as network orchestration, traffic prioritization, network arbitration and cyberthreat detection, amongst others. Traditional traffic classifiers have become obsolete owing to the rapid Internet evolution: port-based classifiers suffer significant accuracy losses due to port masking, while Deep Packet Inspection approaches have severe user-privacy limitations. The advent of Machine Learning has propelled the application of advanced algorithms in diverse research areas, and some learning approaches have proved to be an interesting alternative to the classic traffic classification approaches. Addressing Network Traffic Classification from a Machine Learning perspective implies numerous challenges demanding research efforts to achieve feasible classifiers. In this dissertation, we endeavor to formulate and solve important research questions in Machine-Learning-based Network Traffic Classification. As a result of numerous experiments, the knowledge provided in this research constitutes an engaging case study in which network traffic data from two different environments are successfully collected, processed and modeled. Firstly, we approached the Feature Extraction and Selection processes, providing our own contributions. A Feature Extractor was designed to create Machine-Learning-ready datasets from real traffic data, and a Feature Selection Filter based on fast correlation is proposed and tested on several classification datasets. The original Network Traffic Classification datasets are then reduced using our Selection Filter to provide efficient classification models. Many classification models based on CART Decision Trees were analyzed, exhibiting excellent outcomes in identifying various Internet applications. The experiments presented in this research comprise a comparison amongst ensemble learning schemes, an exploratory study on Class Imbalance and its solutions, and an analysis of IP-header predictors for early traffic classification. This thesis is presented as a compendium of JCR-indexed scientific manuscripts together with one conference paper. In the present work we study a wide number of learning approaches employing the most advanced methodologies in Machine Learning. As a result, we identify the strengths and weaknesses of these algorithms, providing our own solutions to overcome the observed limitations. In short, this thesis shows that Machine Learning offers interesting advanced techniques that open prominent prospects in Internet Network Traffic Classification.
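
    A minimal sketch in the spirit of the correlation-based Selection Filter described above, followed by a CART-style tree on the reduced feature set. Pearson correlation and the thresholds are simple stand-ins for the thesis's fast-correlation measure.

```python
# Sketch: rank features by correlation with the class, drop near-duplicate
# features, then fit a CART-style tree (sklearn trees are CART-based).
# Assumes non-constant numeric features; thresholds are illustrative.
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def correlation_filter(X, y, relevance=0.1, redundancy=0.9):
    """Keep class-correlated features, skipping redundant near-duplicates."""
    rel = np.array([abs(np.corrcoef(X[:, j], y)[0, 1]) for j in range(X.shape[1])])
    selected = []
    for j in np.argsort(-rel):          # most relevant first
        if rel[j] < relevance:
            break
        if all(abs(np.corrcoef(X[:, j], X[:, k])[0, 1]) < redundancy
               for k in selected):
            selected.append(j)
    return selected

# Usage on flow-feature arrays X_train / y_train (assumed to exist):
# cols = correlation_filter(X_train, y_train)
# clf = DecisionTreeClassifier().fit(X_train[:, cols], y_train)
```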

    A Risk-Based IoT Decision-Making Framework Based on Literature Review with Human Activity Recognition Case Studies

    Get PDF
    The Internet of Things (IoT) is a key and growing technology for many critical real-life applications, where it can be used to improve decision making. The existence of several sources of uncertainty in the IoT infrastructure, however, can lead decision makers into taking inappropriate actions. The present work focuses on proposing a risk-based IoT decision-making framework in order to effectively manage uncertainties as well as integrating domain knowledge in the decision-making process. A structured literature review of the risks and sources of uncertainty in IoT decision-making systems is the basis for the development of the framework and the Human Activity Recognition (HAR) case studies. More specifically, as one of the main targeted challenges, the potential sources of uncertainty in an IoT framework, at different levels of abstraction, are first reviewed and then summarized. The modules included in the framework are detailed, with the main focus given to a novel risk-based analytics module, where an ensemble-based data analytic approach, called Calibrated Random Forest (CRF), is proposed to extract useful information while quantifying and managing the uncertainty associated with predictions by using confidence scores. Its output is subsequently integrated with domain-knowledge-based action rules to perform decision making in a cost-sensitive and rational manner. The proposed CRF method is first evaluated and demonstrated on a HAR scenario in a Smart Home environment in case study I, and is further evaluated and illustrated with a remote health monitoring scenario for a diabetes use case in case study II. The experimental results indicate that, using the framework, raw sensor data can be converted into meaningful actions despite several sources of uncertainty. The comparison of the proposed framework to existing approaches highlights the key metrics that make decision making more rational and transparent.
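
    A minimal sketch of the CRF idea under stated assumptions: a Random Forest is wrapped in probability calibration so its confidence scores can gate rule-based actions in a cost-sensitive way. The calibration method, thresholds and actions are illustrative, not the paper's exact module.

```python
# Sketch: calibrated Random Forest whose confidence gates rule-based actions.
# Thresholds and action names are hypothetical.
from sklearn.calibration import CalibratedClassifierCV
from sklearn.ensemble import RandomForestClassifier

def build_crf(X_train, y_train):
    rf = RandomForestClassifier(n_estimators=200, random_state=0)
    # Sigmoid (Platt) calibration so predict_proba behaves like a confidence.
    return CalibratedClassifierCV(rf, method="sigmoid", cv=5).fit(X_train, y_train)

def decide(model, x, act_threshold=0.9, review_threshold=0.6):
    """Map the calibrated confidence to actions in a cost-sensitive way."""
    proba = model.predict_proba(x.reshape(1, -1))[0]
    conf, label = proba.max(), model.classes_[proba.argmax()]
    if conf >= act_threshold:
        return ("act", label)             # confident: trigger the domain rule
    if conf >= review_threshold:
        return ("flag-for-review", label)
    return ("defer", None)                # too uncertain: gather more data
```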

    Performance Evaluation of Network Anomaly Detection Systems

    Get PDF
    Nowadays, there is a huge and growing concern about security in information and communication technology (ICT) among the scientific community, because any attack or anomaly in the network can greatly affect many domains such as national security, private data storage, social welfare and economic issues. The anomaly detection domain is therefore a broad research area, and many different techniques and approaches for this purpose have emerged through the years. Attacks, problems and internal failures, when not detected early, may badly harm an entire network system. Thus, this thesis presents an autonomous profile-based anomaly detection system based on the statistical method Principal Component Analysis (PCADS-AD). This approach creates a network profile, called Digital Signature of Network Segment using Flow Analysis (DSNSF), that denotes the predicted normal behavior of network traffic activity through historical data analysis. That digital signature is used as a threshold for volume anomaly detection, flagging disparities from the normal traffic trend. The proposed system uses seven traffic flow attributes: bits, packets and number of flows to detect problems, and source and destination IP addresses and ports to provide the network administrator with the information needed to solve them. Through evaluation techniques, the addition of a complementary anomaly detection approach, and comparisons to other methods using real network traffic data, the results showed good traffic prediction by the DSNSF and encouraging false-alarm and detection-accuracy figures for the detection scheme. These results seek to contribute to the advance of the state of the art in methods and strategies for anomaly detection, aiming to surpass some of the challenges that emerge from the constant growth in complexity, speed and size of today's large-scale networks; the low complexity and agility of the proposed system further support its application to detection in real time.
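
    A minimal sketch of PCA-profile-based volume anomaly detection in the spirit of the DSNSF: normal behaviour is modelled from historical flow attributes, and samples whose reconstruction error exceeds a threshold are flagged. The attribute set and the 3-sigma threshold are illustrative assumptions.

```python
# Sketch: learn a PCA profile of normal traffic and flag large deviations.
# history: rows of per-interval flow attributes (bits, packets, flows, ...).
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

def reconstruction_error(X, scaler, pca):
    Z = scaler.transform(X)
    return np.linalg.norm(Z - pca.inverse_transform(pca.transform(Z)), axis=1)

def fit_profile(history, n_components=3):
    scaler = StandardScaler().fit(history)
    pca = PCA(n_components=n_components).fit(scaler.transform(history))
    errs = reconstruction_error(history, scaler, pca)
    threshold = errs.mean() + 3 * errs.std()   # assumed 3-sigma anomaly cut
    return scaler, pca, threshold

def is_anomalous(X_now, scaler, pca, threshold):
    return reconstruction_error(X_now, scaler, pca) > threshold
```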

    Anomalous behaviour detection for cyber defence in modern industrial control systems

    Get PDF
    A thesis submitted in partial fulfilment of the requirements of the University of Wolverhampton for the degree of Doctor of Philosophy. The fusion of pervasive internet connectivity and emerging technologies in smart cities creates fragile cyber-physical-natural ecosystems. Industrial Control Systems (ICS) are intrinsic parts of smart cities and critical to modern societies. Not designed for interconnectivity or security, modern ICS nevertheless host disruptive technologies that enable ubiquitous computing. Aided by artificial intelligence and the industrial internet of things, these technologies transform the ICS environment towards better automation, process control and monitoring. However, investigations reveal that leveraging disruptive technologies in ICS creates security challenges, exposing critical infrastructure to sophisticated threat actors, including increasingly hostile, well-organised cybercrime and Advanced Persistent Threats. Besides external factors, insider threats are prevalent, spanning malicious intent, accidental hazards and professional errors. The sensing capabilities of ICS create opportunities to capture various data types which, beyond operational use, can be combined with artificial intelligence to model anomalous behaviour as part of defence-in-depth strategies. As such, this research aims to investigate and develop a security mechanism to improve cyber defence in ICS. Firstly, this thesis contributes a Systematic Literature Review (SLR), which analyses frameworks and systems that address CPS cyber resilience and digital forensic incident response in smart cities. The SLR uncovers emerging themes and reaches several key findings. For example, the chronological analysis reveals key influencing factors, whereas the data source analysis points to a lack of real CPS datasets and a prevalent reliance on software- and infrastructure-based simulations. Further in-depth analysis shows that cross-sector proposals or applications to improve digital forensics with a focus on cyber resilience are addressed by only a small number of research studies in some smart sectors. Next, this research introduces a novel super learner ensemble anomaly detection and cyber risk quantification framework to profile anomalous behaviour in ICS and derive a cyber risk score. The proposed framework and associated learning models are experimentally validated. The results are promising, achieving an overall F1-score of 99.13% and an anomalous recall score of 99%, detecting anomalies lasting as little as 17 seconds and ranging from 0.5% to 89% of the dataset. Further, a one-class classification model is developed, leveraging stream rebalancing followed by adaptive machine learning algorithms and drift detection methods. The model is experimentally validated, producing promising results, including an overall Matthews Correlation Coefficient (MCC) score of 0.999 and a Cohen's Kappa (K) score of 0.9986 on single-type anomalous behaviour with limited variables per data stream; wide data streams achieve an MCC score of 0.981 and a K score of 0.9808 in the presence of multiple types of anomalous instances. Additionally, the thesis scrutinises the applicability of the learning models to support digital forensic readiness. The research presents the concepts of the digital witness and the digital chain of custody in ICS. Following that, a use case integrating blockchain technologies into the design of ICS to support digital forensic readiness is discussed.
In conclusion, the contributions of this research help towards developing the next generation of state-of-the-art methods for anomalous behaviour detection in ICS defence-in-depth.
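
    As a rough illustration of the super-learner idea, the sketch below stacks heterogeneous base models with a cross-validated meta-learner and reports the MCC and Kappa metrics used in the thesis. The base learners and features are assumptions; this does not reproduce the thesis framework or its streaming variant.

```python
# Sketch: super-learner-style stacking (cross-validated meta-learning over
# heterogeneous base models) for ICS anomaly classification.
from sklearn.ensemble import RandomForestClassifier, StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import matthews_corrcoef, cohen_kappa_score

def build_super_learner():
    base = [
        ("rf", RandomForestClassifier(n_estimators=100, random_state=0)),
        ("knn", KNeighborsClassifier(n_neighbors=5)),
    ]
    # The meta-learner is fit on out-of-fold base predictions (cv=5).
    return StackingClassifier(estimators=base,
                              final_estimator=LogisticRegression(),
                              cv=5)

# Usage on assumed labelled ICS traffic arrays:
# model = build_super_learner().fit(X_train, y_train)
# pred = model.predict(X_test)
# print(matthews_corrcoef(y_test, pred), cohen_kappa_score(y_test, pred))
```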