
    Application of advanced machine learning techniques to early network traffic classification

    The fast-paced evolution of the Internet is creating a complex context that imposes demanding requirements to assure end-to-end Quality of Service. The development of advanced intelligent approaches in networking envisions features such as autonomous resource allocation and fast reaction to unexpected network events. Internet Network Traffic Classification constitutes a crucial source of information for Network Management and is decisive in assisting the emerging network control paradigms. Monitoring the traffic flowing through network devices supports tasks such as network orchestration, traffic prioritization, network arbitration and cyberthreat detection, amongst others. Traditional traffic classifiers have become obsolete owing to the rapid evolution of the Internet: port-based classifiers suffer significant accuracy losses due to port masking, while Deep Packet Inspection approaches have severe user-privacy limitations. The advent of Machine Learning has propelled the application of advanced algorithms in diverse research areas, and some learning approaches have proved an interesting alternative to the classic traffic classification approaches. Addressing Network Traffic Classification from a Machine Learning perspective implies numerous challenges that demand research effort to achieve feasible classifiers. In this dissertation, we endeavor to formulate and solve important research questions in Machine-Learning-based Network Traffic Classification. As a result of numerous experiments, the knowledge provided in this research constitutes an engaging case study in which network traffic data from two different environments are successfully collected, processed and modeled. Firstly, we approached the Feature Extraction and Selection processes, providing our own contributions: a Feature Extractor was designed to create Machine-Learning-ready datasets from real traffic data, and a Feature Selection Filter based on fast correlation is proposed and tested on several classification datasets. The original Network Traffic Classification datasets are then reduced using our Selection Filter to provide efficient classification models. Many classification models based on CART Decision Trees were analyzed, exhibiting excellent outcomes in identifying various Internet applications. The experiments presented in this research comprise a comparison amongst ensemble learning schemes, an exploratory study on Class Imbalance and its solutions, and an analysis of IP-header predictors for early traffic classification. This thesis is presented as a compendium of JCR-indexed scientific manuscripts, and one conference paper is also included. In the present work we study a wide range of learning approaches employing the most advanced methodology in Machine Learning. As a result, we identify the strengths and weaknesses of these algorithms and provide our own solutions to overcome the observed limitations. In short, this thesis proves that Machine Learning offers advanced techniques that open promising prospects in Internet Network Traffic Classification.

    Departamento de Teoría de la Señal y Comunicaciones e Ingeniería Telemática. Doctorado en Tecnologías de la Información y las Telecomunicaciones.
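
    As a rough illustration of the kind of pipeline the thesis describes (a correlation-based feature selection filter followed by a CART decision tree), the sketch below uses scikit-learn on synthetic flow-like features; the filter logic, the 0.95 threshold and the feature counts are illustrative assumptions, not the author's implementation.

```python
# Sketch only: correlation-style feature filtering followed by a CART decision
# tree, loosely mirroring the pipeline described in the abstract. The synthetic
# features, the 0.95 threshold and the filter logic are illustrative assumptions.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import classification_report

# Synthetic stand-in for per-flow features (packet sizes, inter-arrival times, ...).
X, y = make_classification(n_samples=5000, n_features=40, n_informative=12,
                           n_classes=5, n_clusters_per_class=1, random_state=0)

# Crude correlation filter: drop one feature from every highly correlated pair.
corr = np.abs(np.corrcoef(X, rowvar=False))
keep = []
for j in range(corr.shape[0]):
    if all(corr[j, k] < 0.95 for k in keep):
        keep.append(j)
X_sel = X[:, keep]

X_tr, X_te, y_tr, y_te = train_test_split(X_sel, y, stratify=y, random_state=0)
clf = DecisionTreeClassifier(criterion="gini", random_state=0)  # CART-style tree
clf.fit(X_tr, y_tr)
print(classification_report(y_te, clf.predict(X_te)))
```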

    A case study: Failure prediction in a real LTE network

    Mobile traffic and the number of connected devices have been increasing exponentially, and customer expectations of mobile operators in terms of quality and reliability are ever higher. This places pressure on operators to invest in and operate their growing infrastructures, making telecom network management an essential problem. To reduce cost and maintain network performance, operators need to bring more automation and intelligence into their management systems. The Self-Organizing Networks (SON) function is an automation technology that aims to maximize performance in mobile networks by bringing autonomous adaptability and reducing human intervention in network management and operations. The three main areas of SON are self-configuration (auto-configuration when new elements enter the network), self-optimization (optimization of network parameters during operation) and self-healing (maintenance). The main purpose of the thesis is to illustrate how anomaly detection methods can be applied to SON functions, in particular self-healing functions such as fault detection and cell outage management. The thesis is illustrated by a case study in which the anomalies, in this case failure alarms, are predicted in advance using performance measurement (PM) data collected from a real LTE network within a certain timeframe. Failure prediction and anomaly detection can help reduce cost and maintenance time in mobile network base stations. The author aims to answer the research questions: which anomaly detection models can detect the anomalies in advance, and what types of anomalies can be well detected using those models. Using cross-validation, the thesis shows that the random forest method is the best-performing model among those chosen, with F1-scores of 0.58, 0.96 and 0.52 for the anomalies Failure in Optical Interface, Temperature alarm, and VSWR minor alarm, respectively. These are also the anomalies that the model detects well.
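
    A minimal sketch of the evaluation approach described here (a cross-validated random forest scored with F1) is given below; the PM counters and the binary alarm target are synthetic stand-ins, not the thesis dataset.

```python
# Illustrative sketch: a cross-validated random forest predicting an alarm from
# performance-measurement counters, scored with F1 as in the thesis. The feature
# matrix and the binary alarm target below are synthetic stand-ins.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Synthetic PM data: rows = (cell, time-window) samples, columns = counters.
X, y = make_classification(n_samples=4000, n_features=30, weights=[0.95, 0.05],
                           random_state=1)

rf = RandomForestClassifier(n_estimators=200, class_weight="balanced", random_state=1)
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=1)
f1_scores = cross_val_score(rf, X, y, scoring="f1", cv=cv)
print("F1 per fold:", np.round(f1_scores, 2), "mean:", round(f1_scores.mean(), 2))
```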

    Big data analytics: a predictive analysis applied to cybersecurity in a financial organization

    Project Work presented as a partial requirement for obtaining the Master's degree in Information Management, with a specialization in Knowledge Management and Business Intelligence.

    With the generalization of Internet access, cyber attacks have registered an alarming growth in frequency and severity of damages, along with growing awareness among organizations with heavy investments in cybersecurity, such as those in the financial sector. This work focuses on an organization's financial service that operates in the international markets of the payment systems industry. The objective was to develop a predictive framework for threat detection to support the security team in opening investigations into intrusive server requests, over the exponentially growing log events collected by the SIEM from the Apache Web Servers of the financial service. A Big Data framework, using Hadoop and Spark, was developed to perform classification tasks over the financial service requests, using Neural Network, Logistic Regression, SVM, and Random Forest algorithms, while handling the training on the imbalanced dataset through BEV. The main conclusion of the analysis was that the best scoring performance was obtained by the Random Forest classifier using all the preprocessed features available. Using all the available worker nodes with a balanced configuration of the Spark executors, the fastest elapsed times for loading and preprocessing the data were achieved with the column-oriented native ORC format, while the row-oriented CSV format performed best for the training of the classifiers.
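
    Assuming BEV here refers to the Bagging Ensemble Variation scheme for imbalanced data, the sketch below illustrates the idea with scikit-learn on synthetic data; the actual work runs on Hadoop/Spark over SIEM log events, which is not reproduced here, and all counts and model choices are illustrative.

```python
# Sketch of the BEV idea (here assumed to be Bagging Ensemble Variation): split
# the majority class into roughly minority-sized chunks, train one classifier
# per balanced chunk, and combine predictions by majority vote.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report

X, y = make_classification(n_samples=10000, weights=[0.98, 0.02], random_state=2)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=2)

rng = np.random.default_rng(2)
minority = np.where(y_tr == 1)[0]
majority = rng.permutation(np.where(y_tr == 0)[0])
n_bags = max(1, len(majority) // len(minority))

models = []
for chunk in np.array_split(majority, n_bags):
    idx = np.concatenate([chunk, minority])            # one balanced training bag
    m = RandomForestClassifier(n_estimators=50, random_state=2)
    models.append(m.fit(X_tr[idx], y_tr[idx]))

# Majority vote across the bagged classifiers.
votes = np.mean([m.predict(X_te) for m in models], axis=0)
print(classification_report(y_te, (votes >= 0.5).astype(int)))
```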

    Bayes classifiers for imbalanced traffic accidents datasets

    [EN] Traffic accident data sets are usually imbalanced: the number of instances classified under the killed or severe injuries class (minority) is much lower than those classified under the slight injuries class (majority). This poses a challenging problem for classification algorithms and may yield a model that covers the slight injuries instances well while frequently misclassifying the killed or severe injuries instances. Based on traffic accident data collected on urban and suburban roads in Jordan over three years (2009-2011), three different data balancing techniques were used: undersampling, which removes some instances of the majority class; oversampling, which creates new instances of the minority class; and a mixed technique that combines both. In addition, different Bayes classifiers were compared on the imbalanced and balanced data sets: Averaged One-Dependence Estimators, Weightily Averaged One-Dependence Estimators, and Bayesian networks, in order to identify factors that affect the severity of an accident. The results indicated that using the balanced data sets, especially those created with oversampling techniques, together with Bayesian networks improved the classification of a traffic accident according to its severity and reduced the misclassification of killed and severe injuries instances. The following variables were found to contribute to the occurrence of a killed casualty or a severe injury in a traffic accident: number of vehicles involved, accident pattern, number of directions, accident type, lighting, surface condition, and speed limit. This work, to the knowledge of the authors, is the first that aims at analyzing historical data records for traffic accidents occurring in Jordan and the first to apply balancing techniques to analyze injury severity of traffic accidents. (C) 2015 Elsevier Ltd. All rights reserved.

    The authors are grateful to the Police Traffic Department in Jordan for providing the data necessary for this research. Griselda Lopez wishes to express her acknowledgement to the regional Ministry of Economy, Innovation and Science of the regional government of Andalusia (Spain) for the scholarship to train teachers and researchers in Deficit Areas, which has made this work possible. The authors appreciate the reviewers' comments and effort to improve the paper.

    Mujalli, R.; López-Maldonado, G.; Garach, L. (2016). Bayes classifiers for imbalanced traffic accidents datasets. Accident Analysis & Prevention, 88:37-51. https://doi.org/10.1016/j.aap.2015.12.003
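
    A hedged sketch of the balancing step follows: random oversampling of the minority (killed or severe injuries) class before fitting a Bayes-family classifier. The paper uses AODE, WAODE and Bayesian networks; a naive Bayes model and synthetic categorical attributes stand in here purely for illustration.

```python
# Rough sketch: random oversampling of the minority (killed/severe) class before
# fitting a Bayes-family classifier. CategoricalNB stands in for the AODE/WAODE
# and Bayesian-network models of the paper; features and labels are synthetic.
import numpy as np
from collections import Counter
from imblearn.over_sampling import RandomOverSampler
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import CategoricalNB
from sklearn.metrics import classification_report

rng = np.random.default_rng(3)
X = rng.integers(0, 4, size=(3000, 7))          # coded accident attributes
y = (rng.random(3000) < 0.08).astype(int)       # 1 = killed or severe injury

X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=3)
X_bal, y_bal = RandomOverSampler(random_state=3).fit_resample(X_tr, y_tr)
print("class counts after oversampling:", Counter(y_bal))

clf = CategoricalNB().fit(X_bal, y_bal)
print(classification_report(y_te, clf.predict(X_te)))
```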

    Benchmarking datasets for Anomaly-based Network Intrusion Detection: KDD CUP 99 alternatives

    Machine Learning has been steadily gaining traction for its use in Anomaly-based Network Intrusion Detection Systems (A-NIDS). Research in this domain is frequently performed using the KDD CUP 99 dataset as a benchmark. Several studies question its usability for constructing a contemporary NIDS, due to its skewed response distribution, non-stationarity, and failure to incorporate modern attacks. In this paper, we compare the performance of KDD-99 alternatives when trained using classification models commonly found in the literature: Neural Network, Support Vector Machine, Decision Tree, Random Forest, Naive Bayes and K-Means. Applying the SMOTE oversampling technique and random undersampling, we create a balanced version of NSL-KDD and show that the skewed target classes in KDD-99 and NSL-KDD hamper the efficacy of classifiers on the minority classes (U2R and R2L), leading to possible security risks. We explore UNSW-NB15, a modern substitute for KDD-99 with greater uniformity of pattern distribution, and benchmark this dataset before and after SMOTE oversampling to observe the effect on minority-class performance. Our results indicate that classifiers trained on UNSW-NB15 match or better the weighted F1-score of those trained on NSL-KDD and KDD-99 in the binary case, thus advocating UNSW-NB15 as a modern substitute for these datasets.

    Comment: Paper accepted into the Proceedings of the IEEE International Conference on Computing, Communication and Security 2018 (ICCCS-2018). Statistics: 8 pages, 7 tables, 3 figures, 34 references.
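
    The sketch below illustrates the balancing strategy described above, combining random undersampling of the dominant classes with SMOTE oversampling of the rare ones before scoring a classifier with weighted F1; the dataset, class counts and classifier choice are placeholders rather than the paper's NSL-KDD setup.

```python
# Sketch of the balancing strategy: undersample the dominant classes, SMOTE the
# rare ones up to the reduced majority, then score with weighted F1. The class
# labels, target counts and classifier are placeholders, not the NSL-KDD setup.
from imblearn.pipeline import Pipeline
from imblearn.over_sampling import SMOTE
from imblearn.under_sampling import RandomUnderSampler
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import f1_score

# Five synthetic classes with a skew vaguely reminiscent of intrusion data.
X, y = make_classification(n_samples=20000, n_classes=5, n_informative=10,
                           weights=[0.52, 0.30, 0.15, 0.02, 0.01], random_state=4)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=4)

pipe = Pipeline([
    ("under", RandomUnderSampler(sampling_strategy={0: 4000, 1: 3000}, random_state=4)),
    ("smote", SMOTE(random_state=4)),   # raise rare classes to the reduced majority size
    ("clf", RandomForestClassifier(n_estimators=100, random_state=4)),
])
pipe.fit(X_tr, y_tr)
print("weighted F1:", round(f1_score(y_te, pipe.predict(X_te), average="weighted"), 3))
```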

    Predicting weighing deviations in the dispatch workflow process: a case study in a cement industry

    The emergence of the Industry 4.0 concept and the profound digital transformation of industry play a crucial role in improving organisations' supply chain (SC) performance and, consequently, in achieving competitive advantage. The order fulfilment process (OFP) is one of the key business processes for the organisation's SC and represents a core process of the operational logistics flow. The dispatch workflow process is an integral part of the OFP and is also a crucial process in the SC of cement industry organisations. In this work, we focus on enhancing the order fulfilment process by improving the dispatch workflow process, specifically with respect to the cement loading process. Thus, we propose a machine learning (ML) approach to predict weighing deviations in the cement loading process. We adopt a realistic and robust rolling window scheme to evaluate six classification models in a real-world case study, in which the random forest (RF) model provides the best predictive performance. We also extract explainable knowledge from the RF classifier using the Shapley additive explanations (SHAP) method, demonstrating the influence of each input data attribute used in the prediction process.

    This work is supported by European Structural and Investment Funds in the FEDER component, through the Operational Competitiveness and Internationalization Programme (COMPETE 2020) [Project nº 069716; Funding Reference: POCI-01-0247-FEDER-069716].
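
    A minimal sketch of the evaluation and explanation steps described above (rolling-window evaluation of a random forest, followed by SHAP attributions) is shown below; the synthetic time-ordered dataset, window sizes and AUC metric are assumptions, not the study's dispatch data.

```python
# Minimal sketch: rolling-window evaluation of a random forest followed by SHAP
# attributions, in the spirit of the study above. The time-ordered synthetic
# dataset, window sizes and AUC metric are assumptions, not the dispatch data.
import numpy as np
import shap
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score

X, y = make_classification(n_samples=6000, n_features=15, random_state=5)

train_size, test_size, step = 3000, 500, 500
scores = []
for start in range(0, len(X) - train_size - test_size + 1, step):
    tr = slice(start, start + train_size)                              # training window
    te = slice(start + train_size, start + train_size + test_size)     # next block
    rf = RandomForestClassifier(n_estimators=100, random_state=5).fit(X[tr], y[tr])
    scores.append(roc_auc_score(y[te], rf.predict_proba(X[te])[:, 1]))
print("rolling-window AUC per iteration:", np.round(scores, 3))

# Per-prediction feature attributions for the last fitted model.
shap_values = shap.TreeExplainer(rf).shap_values(X[te])
```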

    The Smart in Smart Cities: A Framework for Image Classification Using Deep Learning

    The need for a smart city is more pressing today due to the recent pandemic, lockdowns, climate change, population growth, and limits on the availability of and access to natural resources. These challenges can be better faced with the utilization of new technologies, and the zoning design of smart cities can help mitigate them. This work identifies the main components of a new smart city and then proposes a general framework for designing a smart city that addresses these elements. We then propose a technology-driven model to support this framework and introduce a mapping between the proposed general framework and the proposed technology model. To highlight the importance and usefulness of the proposed framework, we designed and implemented a smart image-handling system targeted at non-technical personnel. High cost, security, and inconvenience issues may limit cities' ability to adopt such solutions; therefore, this work also proposes the design and implementation of a generalized image-processing model using deep learning. The proposed model accepts images from users, performs self-tuning operations to select the best deep network, and produces the required insights without any human intervention. This helps automate the decision-making process without the need for a specialized data scientist.
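
    As a purely hypothetical illustration of the self-tuning step (trying several candidate deep networks and keeping the most accurate one), the sketch below compares two pretrained torchvision models on a placeholder validation set; the candidate list, data and selection criterion are assumptions, not the framework's actual mechanism.

```python
# Hypothetical sketch of the "self-tuning" step: evaluate candidate deep networks
# on a validation set and keep the most accurate one. Pretrained weights are
# downloaded by torchvision; FakeData replaces real city imagery so the loop runs.
import torch
from torch.utils.data import DataLoader
from torchvision import datasets, models, transforms

tfm = transforms.Compose([transforms.Resize(224), transforms.ToTensor()])
val = DataLoader(datasets.FakeData(size=64, num_classes=1000, transform=tfm),
                 batch_size=16)

candidates = {
    "resnet18": models.resnet18(weights="DEFAULT"),
    "mobilenet_v3_small": models.mobilenet_v3_small(weights="DEFAULT"),
}

def val_accuracy(model):
    model.eval()
    correct = total = 0
    with torch.no_grad():
        for images, labels in val:
            correct += (model(images).argmax(dim=1) == labels).sum().item()
            total += labels.numel()
    return correct / total

best = max(candidates, key=lambda name: val_accuracy(candidates[name]))
print("selected network:", best)
```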

    Phishing Detection using Base Classifier and Ensemble Technique

    Phishing attacks continue to pose a significant threat in today's digital landscape, with both individuals and organizations falling victim on a regular basis. One of the primary methods used to carry out phishing attacks is phishing websites, which are designed to look like legitimate sites in order to trick users into giving away personal information, including sensitive data such as credit card details and passwords. This research paper proposes a model that uses several benchmark classifiers, including LR, Bagging, RF, K-NN, DT, SVM, and AdaBoost, to identify and classify phishing websites, evaluated in terms of accuracy, precision, recall, F1-score, and confusion matrix. Additionally, a meta-learner and a stacking model were combined to identify phishing websites. The proposed ensemble learning approach using stack-based meta-learners proved highly effective in identifying both legitimate and phishing websites, achieving an accuracy of up to 97.19%, with precision, recall, and F1-scores of 97%, 98%, and 98%, respectively. Thus, it is recommended that ensemble learning, particularly stacking and its meta-learner variations, be implemented to detect and prevent phishing attacks and other digital cyber threats.
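
    The sketch below shows a stacking setup of the kind described, with several base classifiers feeding a logistic-regression meta-learner via scikit-learn's StackingClassifier; the synthetic features and default hyper-parameters are illustrative, not the paper's phishing dataset or tuned models.

```python
# Illustrative stacking sketch: base classifiers comparable to those listed above
# feed a logistic-regression meta-learner. The synthetic features and default
# hyper-parameters are placeholders for the paper's phishing-website dataset.
from sklearn.datasets import make_classification
from sklearn.ensemble import (AdaBoostClassifier, BaggingClassifier,
                              RandomForestClassifier, StackingClassifier)
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=5000, n_features=30, random_state=6)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=6)

base = [
    ("lr", LogisticRegression(max_iter=1000)),
    ("bag", BaggingClassifier(random_state=6)),
    ("rf", RandomForestClassifier(random_state=6)),
    ("knn", KNeighborsClassifier()),
    ("dt", DecisionTreeClassifier(random_state=6)),
    ("svm", SVC(probability=True, random_state=6)),
    ("ada", AdaBoostClassifier(random_state=6)),
]
stack = StackingClassifier(estimators=base,
                           final_estimator=LogisticRegression(max_iter=1000))
stack.fit(X_tr, y_tr)
print(classification_report(y_te, stack.predict(X_te)))
```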

    A Study of Deep Learning for Network Traffic Data Forecasting

    We present a study of deep learning applied to the domain of network traffic data forecasting. This is a very important ingredient for network traffic engineering, e.g., intelligent routing, which can optimize network performance, especially in large networks. In a nutshell, we wish to predict, in advance, the bit rate of a transmission, based on the low-dimensional connection metadata ("flows") that is available whenever a communication is initiated. Our study has several genuinely new points. First, it is performed on a large dataset (~50 million flows), which requires a new training scheme that operates on successive blocks of data, since the whole dataset is too large for in-memory processing. Additionally, we are the first to propose and perform a more fine-grained prediction that distinguishes between low, medium and high bit rates instead of just "mice" and "elephant" flows. Lastly, we apply state-of-the-art visualization and clustering techniques to flow data and show that visualizations are insightful despite the heterogeneous and non-metric nature of the data. We developed a processing pipeline to handle the highly non-trivial acquisition process and allow for proper data preprocessing so that DNNs can be applied to network traffic data. We conduct DNN hyper-parameter optimization as well as feature selection experiments, which clearly show that fine-grained network traffic forecasting is feasible and that domain-dependent data enrichment and augmentation strategies can improve results. An outlook on the fundamental challenges presented by network traffic analysis (high data throughput, unbalanced and dynamic classes, changing statistics, outlier detection) concludes the article.

    Comment: 16 pages, 12 figures, 28th International Conference on Artificial Neural Networks (ICANN 2019).
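
    The block-wise training idea can be sketched with an incrementally trained classifier, as below; the paper trains deep neural networks on real flow metadata, so the SGD model, synthetic blocks and three bit-rate classes here are stand-ins only.

```python
# Sketch of block-wise training when the flow dataset is too large for memory:
# stream successive blocks and update the model incrementally via partial_fit.
# The paper trains DNNs; this SGD classifier and the synthetic low/medium/high
# bit-rate labels are stand-ins only.
import numpy as np
from sklearn.linear_model import SGDClassifier
from sklearn.metrics import accuracy_score

rng = np.random.default_rng(7)
classes = np.array([0, 1, 2])                     # low / medium / high bit rate

def next_block(n=10000, d=12):
    """Stand-in for reading the next block of flow metadata from disk."""
    X = rng.normal(size=(n, d))
    y = (X[:, 0] + 0.5 * X[:, 1] > 0).astype(int) + (X[:, 2] > 1).astype(int)
    return X, y

clf = SGDClassifier(loss="log_loss", random_state=7)
for _ in range(20):                               # successive blocks of data
    X_blk, y_blk = next_block()
    clf.partial_fit(X_blk, y_blk, classes=classes)

X_val, y_val = next_block()
print("held-out accuracy:", round(accuracy_score(y_val, clf.predict(X_val)), 3))
```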