
    Incremental Sparse-PCA Feature Extraction For Data Streams

    Intruders attempt to penetrate commercial systems daily and cause considerable financial losses for individuals and organizations. Intrusion detection systems monitor network events to detect computer security threats. An extensive amount of network data must be processed to detect malicious activities. Storing, processing, and analyzing this massive volume of data is costly and indicates the need for efficient network data reduction methods that do not require the data to be captured and stored first. A better approach extracts useful variables from data streams in real time and in a single pass. Removing irrelevant attributes reduces the data fed to the intrusion detection system (IDS) and shortens the analysis time while improving classification accuracy. This dissertation introduces an online, real-time data processing method for knowledge extraction. This incremental feature extraction is based on two approaches. First, Chunk Incremental Principal Component Analysis (CIPCA) detects intrusions in data streams. Then, two novel incremental feature extraction methods, Incremental Structured Sparse PCA (ISSPCA) and Incremental Generalized Power Method Sparse PCA (IGSPCA), find malicious elements. Several metrics were used to compare the performance of all methods. IGSPCA was found to perform as well as or better than CIPCA overall in terms of dimensionality reduction, classification accuracy, and learning time. ISSPCA yielded better results for higher chunk values and greater accumulation ratio thresholds. CIPCA and IGSPCA reduced the IDS dataset to 10 principal components, as opposed to 14 eigenvectors for ISSPCA. ISSPCA is more expensive in terms of learning time than the other techniques. This dissertation presents new methods that perform feature extraction from continuous data streams to find the small number of features necessary to express most of the data variance. Data subsets derived from a few important variables are easier to interpret. Another goal of this dissertation was to propose incremental sparse PCA algorithms capable of processing data with concept drift and concept shift. Experiments using the WaveForm and WaveFormNoise datasets confirmed this ability. Similar to CIPCA, ISSPCA and IGSPCA updated the eigen-axes as a function of the accumulation ratio value, forming an informative eigenspace with few eigenvectors.
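
    As a rough illustration of the chunk-incremental idea behind CIPCA, the sketch below runs scikit-learn's IncrementalPCA over simulated stream chunks. The dissertation's sparse variants (ISSPCA, IGSPCA) are not library functions, and the data, chunk size, and feature count here are hypothetical.

        # A minimal sketch of chunk-wise incremental PCA over a data stream,
        # in the spirit of CIPCA: each chunk updates the eigenspace in a single
        # pass, without revisiting stored data. scikit-learn's IncrementalPCA
        # is a generic stand-in, not the dissertation's ISSPCA/IGSPCA.
        import numpy as np
        from sklearn.decomposition import IncrementalPCA

        rng = np.random.default_rng(0)
        n_features, chunk_size = 41, 500        # hypothetical stream dimensions
        ipca = IncrementalPCA(n_components=10)  # 10 components, as in the abstract

        for _ in range(20):                     # simulate 20 chunks of a stream
            chunk = rng.normal(size=(chunk_size, n_features))
            ipca.partial_fit(chunk)             # single-pass, chunk-wise update

        # Accumulation ratio: variance captured by the retained eigenvectors.
        acc_ratio = ipca.explained_variance_ratio_.sum()
        reduced = ipca.transform(rng.normal(size=(5, n_features)))
        print(f"accumulation ratio ~ {acc_ratio:.3f}, reduced shape {reduced.shape}")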

    Using data mining to repurpose German language corpora. An evaluation of data-driven analysis methods for corpus linguistics

    A growing number of studies report interesting insights gained from existing data resources. Among them are analyses of textual data, giving reason to consider such methods for linguistics as well. However, the field of corpus linguistics usually works with purposefully collected, representative language samples that aim to answer only a limited set of research questions. This thesis aims to shed some light on the potential of data-driven analysis based on machine learning and predictive modelling for corpus linguistic studies, investigating the possibility of repurposing existing German language corpora for linguistic inquiry by using methodologies developed for data science and computational linguistics. The study focuses on predictive modelling and machine-learning-based data mining and gives a detailed overview and evaluation of currently popular strategies and methods for analysing corpora with computational methods. After introducing strategies and methods that have already been applied to language data, discussing how they can assist corpus linguistic analysis, and referring to available toolkits and software as well as to state-of-the-art research and further references, the thesis applies the introduced methodological toolset in two differently shaped corpus studies that utilize readily available corpora for German. The first study explores linguistic correlates of holistic text quality ratings of student essays, while the second deals with age-related language features in computer-mediated communication and interprets age prediction models to answer a set of research questions based on previous research in the field. While both studies give linguistic insights that integrate into the current understanding of the investigated phenomena in the German language, they also systematically test the methodological toolset introduced beforehand, allowing a detailed discussion, at the end of the thesis, of the added value and remaining challenges of machine-learning-based data mining methods in corpus linguistics.
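
    To make the predictive-modelling workflow concrete, the sketch below trains a toy age-group classifier on four invented German snippets and inspects its most informative n-grams. The documents and labels are hypothetical placeholders, not the corpora or models used in the thesis.

        # A minimal sketch of interpreting a predictive model for linguistic
        # inquiry: fit a text classifier, then read off which features drive
        # its predictions. The four documents below are invented toy data.
        import numpy as np
        from sklearn.feature_extraction.text import TfidfVectorizer
        from sklearn.linear_model import LogisticRegression
        from sklearn.pipeline import make_pipeline

        docs = [
            "ich bin voll muede lol",         # informal register (toy example)
            "sehr geehrte damen und herren",  # formal register (toy example)
            "hdl bis morgen",
            "mit freundlichen gruessen",
        ]
        labels = ["young", "old", "young", "old"]

        model = make_pipeline(TfidfVectorizer(ngram_range=(1, 2)), LogisticRegression())
        model.fit(docs, labels)

        # Which n-grams push a text toward the "young" class?
        vec = model.named_steps["tfidfvectorizer"]
        clf = model.named_steps["logisticregression"]
        top = np.argsort(clf.coef_[0])[-3:]
        print(vec.get_feature_names_out()[top])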

    An Architecture for a Network Traffic Anomaly Detection System Based on Entropy Analysis

    With the steady increase in reliance on computer networks in all aspects of life, computers and other connected devices have become more vulnerable to attacks, which has exposed them to many major threats, especially in recent years. There are different systems to protect networks from these threats, such as firewalls, antivirus programs, and data encryption, but it is still hard to provide complete protection for networks and their systems against attacks that grow more sophisticated over time. It is therefore necessary to deploy intrusion detection systems (IDS) on a large scale as a second line of defense for computer and network systems, alongside other network security techniques. The main objective of intrusion detection systems is to monitor network traffic and detect internal and external attacks. Intrusion detection systems are an important focus of studies today, because most protection systems, no matter how good they are, can fail when new (unknown, not predefined) types of intrusions emerge. Most existing techniques detect network intrusions by collecting information about known types of attacks (so-called signature-based IDS) and using it to recognize any attempted attack on data or resources. The major problem of this approach is its inability to detect previously unknown attacks, even when these attacks differ only slightly from known ones (the so-called zero-day attack). It is also powerless against encryption-related attacks. Detecting deviations from conventional behavior (anomaly-based IDS), on the other hand, overcomes these limitations. Many scientific studies have therefore aimed to build modern, intelligent systems that detect both known and unknown intrusions. This research introduces an architecture that applies a new entropy-based, anomaly-driven detection technique for IDS. Network behavior analysis relies on profiling legitimate network behavior in order to efficiently detect anomalous traffic deviations that indicate security threats. Entropy-based detection techniques are attractive because of their simplicity and applicability to real-time network traffic, with no need to train the system on labelled data. Although the NetFlow protocol provides only a basic set of information about network communications, it is very useful for identifying zero-day attacks and suspicious behavior in traffic structure. The challenge posed by this limited NetFlow information, combined with the simplicity of the entropy-based approach, is to provide a mechanism efficient and sensitive enough to detect a wide range of anomalies, including those of small intensity. Moreover, a recent study of generic entropy-based anomaly detection reports its vulnerability to deception, whereby spoofed data is introduced to mask the abnormality. Furthermore, most approaches to further classification of anomalies rely on machine learning, which brings additional complexity. These shortcomings and limitations open up a space for exploring new techniques and methodologies for detecting anomalies in network traffic in order to isolate security threats, which is the main subject of the research in this thesis.
    This research addresses all these issues by providing a systematic methodology whose main novelty lies in anomaly detection and classification based on the entropy of flow-count and behavior features extracted from the basic data obtained via the NetFlow protocol. Two new approaches are proposed to address these concerns. The first is an effective protection mechanism against entropy deception, derived from studying changes in several entropy types, such as the Shannon, Rényi, and Tsallis entropies, together with the number of distinct elements in a feature distribution as a new detection metric; the suggested method improves the reliability of entropy approaches. The second is an anomaly classification technique added to the existing entropy-based anomaly detection system. Entropy-based anomaly classification methods were presented and confirmed by tests based on a multivariate analysis of the entropy changes of several features, as well as on aggregation by complex feature combinations. Through an analysis of the most prominent security attacks, generalized models of network traffic behavior were developed to describe various communication patterns. Based on a multivariate analysis of the entropy changes caused by anomalies in each of the modelled classes, anomaly classification rules were proposed and verified experimentally. The concept of behavior features is generalized, while the proposed data partitioning provides greater efficiency in real-time anomaly detection. The practicality of the proposed architecture for implementing an effective anomaly detection and classification system in a general real-world network environment is demonstrated using experimental data.
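
    For concreteness, the sketch below computes the entropy measures named above over a single feature distribution (destination ports in one toy time bin), using the standard definitions of the Shannon, Rényi, and Tsallis entropies together with the distinct-element count. The windowing and detection thresholds of the thesis's detector are not reproduced.

        # A minimal sketch of the entropy metrics used for anomaly detection,
        # computed over one NetFlow feature (destination-port counts).
        import numpy as np
        from collections import Counter

        dst_ports = [80, 80, 443, 53, 80, 22, 443, 80, 53, 80]  # toy flow records
        counts = np.array(list(Counter(dst_ports).values()), dtype=float)
        p = counts / counts.sum()                 # empirical distribution

        shannon = -np.sum(p * np.log2(p))         # H = -sum p log p
        alpha = q = 2.0
        renyi = np.log2(np.sum(p ** alpha)) / (1.0 - alpha)  # Renyi of order alpha
        tsallis = (1.0 - np.sum(p ** q)) / (q - 1.0)         # Tsallis of order q
        distinct = len(counts)  # number of distinct elements: the added metric
        print(shannon, renyi, tsallis, distinct)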

    Data balancing approaches in quality, defect, and pattern analysis

    The imbalanced ratio of data is one of the most significant challenges in various industrial domains, and numerous data-balancing approaches have consequently been proposed over the years. However, most of these methods come with limitations that can affect data-driven decision-making models in critical sectors such as product quality assurance, manufacturing defect identification, and pattern recognition in healthcare diagnostics. This dissertation addresses three research questions related to data-balancing approaches: 1) What is the scope of data-balancing approaches with respect to majority and minority samples? 2) What is the effect of traditional Machine Learning (ML)- and Synthetic Minority Over-sampling Technique (SMOTE)-based data balancing on imbalanced data analysis? 3) How does imbalanced data affect the performance of Deep Learning (DL)-based models? To achieve these objectives, the dissertation thoroughly analyzes existing reference works and identifies their limitations. Most existing data-balancing approaches have several shortcomings, such as creating noise during oversampling, removing important information during undersampling, and performing poorly on multidimensional data. SMOTE-based approaches have been the most widely used, as they create synthetic samples and are easy to implement compared to other techniques. However, SMOTE also has its limitations, so it is necessary to identify whether SMOTE-based oversampling has a significant effect on the performance of ML-based data-driven models. To do so, the study conducts several hypothesis tests covering several popular ML algorithms with and without hyperparameter tuning. Based on these tests, it is found that in many cases, on the reference datasets, there is no significant performance improvement for data-driven ML models once the imbalanced data is balanced using SMOTE approaches. Additionally, the study finds that SMOTE-based synthetic samples often do not follow a Gaussian distribution or the distribution of the original dataset. The study therefore suggests that Generative Adversarial Network (GAN)-based approaches could be a better alternative for generating more realistic samples and might overcome the limitations of SMOTE-based data balancing. However, GANs are often difficult to train and computationally inefficient, and very few studies demonstrate promising outcomes for GAN-based tabular data balancing, since GANs were mainly developed for image data generation. To overcome these limitations, the present study proposes several data-balancing approaches: GAN-based oversampling (GBO), Support Vector Machine (SVM)-SMOTE-GAN (SSG), and Borderline-SMOTE-GAN (BSGAN). The proposed approaches outperform existing SMOTE-based data-balancing approaches on various highly imbalanced tabular datasets and produce realistic samples whose distribution follows that of the original dataset. The dissertation then examines two case scenarios where data-balancing approaches play crucial roles: healthcare diagnostics and additive manufacturing.
    For the healthcare diagnostics scenario, the study considers several chest radiography (X-ray) and Computed Tomography (CT) scan image datasets to detect patients with COVID-19 symptoms. The study employs six Transfer Learning (TL) approaches: Visual Geometry Group (VGG)16, Residual Network (ResNet)50, ResNet101, Inception-ResNet Version 2 (InceptionResNetV2), Mobile Network version 2 (MobileNetV2), and VGG19. Based on the overall analysis, it is observed that, except for the ResNet-based models, most of the TL models detect patients with COVID-19 symptoms with an accuracy of almost 99%. However, one potential drawback of the TL approaches is that the models learn from the wrong regions: instead of focusing on the infected lung regions, the TL-based models focus on non-infected regions. To address this issue, the study updates the TL-based models to reduce wrong localization. Similarly, the study conducts an additional investigation on an imbalanced dataset containing defect and non-defect images of 3D-printed cylinders. The results show that TL-based models are unable to locate the defect regions, highlighting the challenge of detecting defects from imbalanced data. To address this limitation, the study proposes preprocessing-based approaches, including Region of Interest Net (ROIN), Region of Interest and Histogram Equalizer Net (ROIHEN), and Region of Interest with Histogram Equalization and Details Enhancer Net (ROIHEDEN), to improve the models' performance and accurately identify the defect region. Furthermore, the dissertation employs various model interpretation techniques, such as Local Interpretable Model-Agnostic Explanations (LIME), SHapley Additive exPlanations (SHAP), and Gradient-weighted Class Activation Mapping (Grad-CAM), to gain insight into the numerical, categorical, and image features that characterize the models' predictions. These techniques are used across multiple experiments and contribute significantly to a better understanding of the models' decision-making processes. Lastly, the study considers a small mixed dataset containing numerical, categorical, and image data; such diverse data types are often challenging for data-driven ML models. The study proposes a computationally efficient and simple model for these data types by combining a Multilayer Perceptron and a Convolutional Neural Network (MLP-CNN). The proposed MLP-CNN models demonstrate superior accuracy in identifying COVID-19 patients' patterns compared with existing methods. In conclusion, this research proposes various approaches to tackle significant challenges associated with class imbalance, including the sensitivity of ML models to multidimensional imbalanced data, distribution issues arising from data expansion techniques, and the need for model explainability and interpretability. By addressing these issues, the study can help mitigate data-balancing challenges across industries that involve quality, defect, and pattern analysis, such as healthcare diagnostics, additive manufacturing, and product quality assurance. By providing valuable insights into the models' decision-making processes, this research could pave the way for more accurate and robust ML models, thereby improving their performance in real-world applications.
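
    As a point of reference for the baselines discussed above, the sketch below balances a toy imbalanced dataset with SMOTE, Borderline-SMOTE, and SVM-SMOTE from imbalanced-learn. The dissertation's GAN hybrids (GBO, SSG, BSGAN) are its own contributions and are not available as library calls.

        # A minimal sketch of SMOTE-based data balancing with imbalanced-learn.
        from collections import Counter
        from sklearn.datasets import make_classification
        from imblearn.over_sampling import SMOTE, BorderlineSMOTE, SVMSMOTE

        # Toy tabular dataset with roughly a 95/5 class imbalance.
        X, y = make_classification(n_samples=2000, weights=[0.95], random_state=0)
        print("before:", Counter(y))

        for sampler in (SMOTE(random_state=0),
                        BorderlineSMOTE(random_state=0),  # basis of BSGAN
                        SVMSMOTE(random_state=0)):        # basis of SSG
            X_res, y_res = sampler.fit_resample(X, y)
            print(type(sampler).__name__, "after:", Counter(y_res))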

    Improving k-nn search and subspace clustering based on local intrinsic dimensionality

    In several novel applications, such as multimedia and recommender systems, data is often represented as object feature vectors in high-dimensional spaces. High-dimensional data is always a challenge for state-of-the-art algorithms because of the so-called "curse of dimensionality": as the dimensionality increases, the discriminative ability of similarity measures diminishes to the point where many data analysis algorithms that depend on them, such as similarity search and clustering, lose their effectiveness. One way to handle this challenge is to select the most important features, which is essential for providing compact object representations as well as improving overall search and clustering performance. Compact feature vectors can further reduce the storage space and the computational complexity of search and learning tasks. Support-Weighted Intrinsic Dimensionality (support-weighted ID) is a promising new feature selection criterion that estimates the contribution of each feature to the overall intrinsic dimensionality. Support-weighted ID identifies relevant features locally for each object and penalizes features that have locally lower discriminative power as well as higher density; in effect, it measures the ability of each feature to locally discriminate between objects in the dataset. Based on support-weighted ID, this dissertation introduces three main research contributions. First, it proposes NNWID-Descent, a similarity graph construction method that uses the support-weighted ID criterion to identify and retain relevant features locally for each object and to enhance overall graph quality. Second, to improve the accuracy and performance of cluster analysis, it introduces k-LIDoids, a subspace clustering algorithm that extends the utility of support-weighted ID within a clustering framework to gradually select the subset of informative and important features per cluster; k-LIDoids constructs clusters while finding a low-dimensional subspace for each one. Finally, using the compact object and cluster representations from NNWID-Descent and k-LIDoids, the dissertation defines LID-Fingerprint, a new binary fingerprinting and multi-level indexing framework for high-dimensional data. LID-Fingerprint can be used to hide information from passive adversaries as well as to provide efficient and secure similarity search and retrieval for data stored in the cloud. Compared with other state-of-the-art algorithms, the good practical performance of the proposed algorithms provides evidence of their effectiveness for data in high-dimensional spaces.
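
    Since the support-weighted ID criterion builds on local intrinsic dimensionality estimates, the sketch below shows the standard maximum-likelihood LID estimator from k-nearest-neighbor distances (the Levina-Bickel/Amsaleg estimator). The per-feature support weighting itself is the dissertation's own construction and is not reproduced here.

        # A minimal sketch of the maximum-likelihood estimator of local
        # intrinsic dimensionality (LID) from k-NN distances.
        import numpy as np
        from sklearn.neighbors import NearestNeighbors

        def lid_mle(dists):
            """LID estimate from a point's sorted k-NN distances."""
            r_k = dists[-1]                             # distance to k-th neighbor
            return -1.0 / np.mean(np.log(dists / r_k))  # MLE over the k distances

        rng = np.random.default_rng(0)
        X = rng.normal(size=(1000, 20))                 # toy data, ambient dim 20
        k = 50
        nn = NearestNeighbors(n_neighbors=k + 1).fit(X)
        dists, _ = nn.kneighbors(X)                     # column 0 is the point itself
        print("mean LID:", np.mean([lid_mle(d[1:]) for d in dists]))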

    On Improving Generalization of CNN-Based Image Classification with Delineation Maps Using the CORF Push-Pull Inhibition Operator

    Deployed image classification pipelines typically depend on images captured in real-world environments, which means the images may be affected by different sources of perturbation (e.g. sensor noise in low-light environments). The main challenge arises from the fact that image quality directly impacts the reliability and consistency of classification tasks, and it has hence attracted wide interest within the computer vision community. We propose a transformation step that attempts to enhance the generalization ability of CNN models in the presence of noise unseen during training. Concretely, the delineation maps of given images are computed using the CORF push-pull inhibition operator. This operation transforms an input image into a space that is more robust to noise before it is processed by a CNN. We evaluated our approach on the Fashion MNIST data set with an AlexNet model. The proposed CORF-augmented pipeline achieved results on noise-free images comparable to those of a conventional AlexNet classification model without CORF delineation maps, while consistently achieving significantly superior performance on test images perturbed with different levels of Gaussian and uniform noise.
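
    The sketch below shows only the shape of the pipeline, i.e. replacing raw pixels with a contour-like map before the CNN. Since the CORF push-pull operator is not in standard Python libraries, Canny edge detection stands in for it here purely for illustration.

        # A minimal sketch of a delineation-map preprocessing step. Canny is a
        # crude stand-in: the paper uses the CORF push-pull inhibition operator.
        import numpy as np
        from skimage import feature

        def delineation_map(img, sigma=1.0):
            """Stand-in contour map computed before the image reaches the CNN."""
            return feature.canny(img.astype(float), sigma=sigma).astype(np.float32)

        # Toy 28x28 image (Fashion-MNIST size) with additive Gaussian noise.
        rng = np.random.default_rng(0)
        clean = np.zeros((28, 28))
        clean[8:20, 8:20] = 1.0                 # a bright square
        noisy = clean + rng.normal(scale=0.2, size=clean.shape)

        x = delineation_map(noisy)              # input to the classifier
        print(x.shape, x.dtype, int(x.sum()))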

    On the Combination of Game-Theoretic Learning and Multi Model Adaptive Filters

    This paper casts the coordination of a team of robots within the framework of game-theoretic learning algorithms. In particular, a novel variant of fictitious play is proposed in which multi-model adaptive filters are used to estimate the other players' strategies. The proposed algorithm can serve as a coordination mechanism between players when they must make decisions under uncertainty. Each player chooses an action after taking into account the actions of the other players as well as the uncertainty, which can arise either from noisy observations or from the various possible types of the other players. In addition, in contrast to other game-theoretic and heuristic algorithms for distributed optimisation, it is not necessary to find the optimal parameters a priori: various parameter values can be used initially as inputs to different models, so the resulting decisions are aggregate results over all the parameter values. Simulations are used to test the performance of the proposed methodology against other game-theoretic learning algorithms.
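
    The sketch below implements classical fictitious play on a two-action coordination game, where each player best-responds to the empirical frequency of the other's past actions. The paper's variant replaces this frequency estimate with multi-model adaptive filters, which are not reproduced here.

        # A minimal sketch of classical fictitious play for two players.
        import numpy as np

        payoff = np.array([[1.0, 0.0],   # coordination game: both players are
                           [0.0, 1.0]])  # rewarded for choosing the same action

        counts = [np.ones(2), np.ones(2)]  # counts[i]: player i's action history
        for _ in range(200):
            beliefs = [c / c.sum() for c in counts]     # empirical mixed strategies
            a0 = int(np.argmax(payoff @ beliefs[1]))    # player 0 best-responds to 1
            a1 = int(np.argmax(payoff.T @ beliefs[0]))  # player 1 best-responds to 0
            counts[0][a0] += 1
            counts[1][a1] += 1

        print("empirical strategies:",
              counts[0] / counts[0].sum(), counts[1] / counts[1].sum())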

    XXI Workshop de Investigadores en Ciencias de la Computación - WICC 2019: libro de actas

    Papers presented at the XXI Workshop de Investigadores en Ciencias de la Computación (WICC), held in the province of San Juan on 25 and 26 April 2019, organized by the Red de Universidades con Carreras en Informática (RedUNCI) and the Facultad de Ciencias Exactas, Físicas y Naturales of the Universidad Nacional de San Juan.
