6 research outputs found

    Ensemble strategy for insider threat detection from user activity logs

    In the information era, the core business and confidential information of enterprises and organizations is stored in information systems. However, malicious insiders hidden within an organization may, intentionally or unintentionally, misuse their privileges to obtain sensitive company information. Existing approaches to insider threat detection mostly focus on monitoring, detecting, and preventing malicious behavior by users within an organization's system, while ignoring the impact that imbalanced ground-truth insider threat data has on detection. To detect insider threats more effectively, a data processing tool was developed to convert detected user activity into information-use events, and a Data Adjustment (DA) strategy was formulated to adjust the weights of minority and majority samples. An efficient ensemble strategy was then applied, combining the extreme gradient boosting (XGBoost) model with the DA strategy to detect anomalous behavior. The approach was evaluated on the CERT dataset, a real-world dataset with artificially injected insider threat events. The results demonstrate that the proposed approach can effectively detect insider threats, with an accuracy of 99.51% and an average recall of 98.16%; compared with other classifiers, detection performance improves by 8.76%.
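
    As a rough illustration of the weighting idea described above, the sketch below trains XGBoost on a synthetic imbalanced dataset, approximating the Data Adjustment step with the library's standard scale_pos_weight parameter. The dataset, hyperparameters, and weighting ratio are assumptions for illustration, not the paper's exact DA scheme.

    ```python
    # Minimal sketch: class re-weighting + XGBoost on imbalanced data.
    # scale_pos_weight stands in for the paper's DA weighting strategy.
    from sklearn.datasets import make_classification
    from sklearn.model_selection import train_test_split
    from sklearn.metrics import accuracy_score, recall_score
    from xgboost import XGBClassifier

    # Synthetic stand-in for user-activity features, ~1% positives.
    X, y = make_classification(n_samples=20_000, n_features=30,
                               weights=[0.99, 0.01], random_state=0)
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

    # Up-weight the minority (threat) class by the majority/minority ratio.
    pos_weight = (y_tr == 0).sum() / (y_tr == 1).sum()

    clf = XGBClassifier(n_estimators=300, max_depth=6, learning_rate=0.1,
                        scale_pos_weight=pos_weight, eval_metric="logloss")
    clf.fit(X_tr, y_tr)

    pred = clf.predict(X_te)
    print(f"accuracy={accuracy_score(y_te, pred):.4f} "
          f"recall={recall_score(y_te, pred):.4f}")
    ```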

    A SVM framework for fault detection of the braking system in a high speed train

    By April 2015, the number of operating High Speed Trains (HSTs) in the world had reached 3603. An efficient, effective, and highly reliable braking system is critical for trains running at speeds around 300 km/h. Failure of a highly reliable braking system is a rare event, so informative recorded data on fault conditions are scarce. This renders fault detection a classification problem with highly unbalanced data. In this paper, a Support Vector Machine (SVM) framework, including feature selection, feature vector selection, model construction, and decision boundary optimization, is proposed to tackle this problem. Feature vector selection greatly reduces the data size and, thus, the computational burden. The constructed model is a modified version of the least squares SVM, in which a higher cost is assigned to misclassifying faulty conditions than to misclassifying normal conditions. The proposed framework is first validated on a number of public unbalanced datasets; it is then applied to fault detection of braking systems in HSTs, where, in comparison with several SVM approaches for unbalanced datasets, it gives better results.
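
    The cost-asymmetry idea is easy to approximate with off-the-shelf tools. The sketch below uses scikit-learn's standard SVC with per-class weights as a stand-in for the authors' modified least squares SVM; the 20:1 cost ratio and the synthetic data are assumptions for illustration only, and the ratio would in practice be tuned by validation.

    ```python
    # Minimal sketch: cost-sensitive SVM for rare fault detection.
    # class_weight penalizes misclassified faulty samples more heavily.
    from sklearn.datasets import make_classification
    from sklearn.model_selection import train_test_split
    from sklearn.svm import SVC
    from sklearn.metrics import classification_report

    # Synthetic stand-in for braking-system features; faults (class 1) are rare.
    X, y = make_classification(n_samples=5_000, n_features=20,
                               weights=[0.97, 0.03], random_state=1)
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=1)

    # Hypothetical 20x misclassification cost for the faulty class.
    svm = SVC(kernel="rbf", C=1.0, class_weight={0: 1, 1: 20})
    svm.fit(X_tr, y_tr)

    print(classification_report(y_te, svm.predict(X_te),
                                target_names=["normal", "faulty"]))
    ```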

    Imbalanced data classification using data improvement and parameter optimization with restarting genetic algorithm


    Developing and deploying data mining techniques in healthcare

    Improving healthcare is a top priority for all nations. US healthcare expenditure was $3 trillion in 2014, and in the same year the share of GDP assigned to healthcare was 17.5%. These statistics show the importance of improving the healthcare delivery system. In this research, we developed several data mining methods and algorithms to address healthcare problems; these methods can also be applied to problems in other domains.

    The first part of this dissertation addresses the rare item problem in association analysis, which concerns discovering rare rules, i.e., rules that include rare items. We introduce a novel assessment metric, called adjusted support, to address this problem. By applying this metric, we can retrieve rare rules without over-generating association rules. We applied this method to association analysis of diabetes complications.

    The second part develops a clinical decision support system for predicting retinopathy, the leading cause of vision loss among American adults. We analyzed data from more than 1.4 million diabetic patients and developed four sets of predictive models: basic, comorbid, over-sampled, and ensemble models. The results show that incorporating comorbidity data and oversampling improved prediction accuracy. In addition, we developed a novel "confidence margin" ensemble approach that outperformed existing ensemble models; it also resolves ties in voting-based ensembles by comparing the confidence margins of the base predictors.

    The third part addresses imbalanced data learning, a major challenge in machine learning. While a standard machine learning technique may perform well on balanced datasets, its performance deteriorates dramatically on imbalanced ones. This poor performance is especially troublesome when detecting the minority class, which is usually the class of interest. We propose a synthetic informative minority over-sampling (SIMO) algorithm embedded into a support vector machine. Applied to 15 publicly available benchmark datasets and compared against seven existing approaches, SIMO outperformed all of them.
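
    Since SIMO itself is specific to this dissertation, the sketch below illustrates only the general pattern it belongs to: over-sampling the minority class before fitting an SVM. SMOTE from the imbalanced-learn package is used here purely as a generic stand-in for the over-sampling step; the data and settings are assumptions.

    ```python
    # Minimal sketch: minority over-sampling feeding an SVM classifier.
    # SMOTE is a generic stand-in for the dissertation's SIMO algorithm.
    from imblearn.over_sampling import SMOTE
    from sklearn.datasets import make_classification
    from sklearn.model_selection import train_test_split
    from sklearn.svm import SVC
    from sklearn.metrics import f1_score

    X, y = make_classification(n_samples=4_000, n_features=15,
                               weights=[0.95, 0.05], random_state=2)
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=2)

    # Synthesize minority samples in the training fold only, then fit the SVM.
    X_res, y_res = SMOTE(random_state=2).fit_resample(X_tr, y_tr)
    clf = SVC(kernel="rbf").fit(X_res, y_res)

    print(f"minority F1: {f1_score(y_te, clf.predict(X_te)):.3f}")
    ```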

    Application of computer vision to the identification of benthic groups for environmental monitoring of coral reefs

    The composition of coral reefs is an excellent indicator of the health of marine fauna. For this reason, biologists at the Australian Institute of Marine Science (AIMS) monitor it continuously by analyzing photos acquired each year across the Great Barrier Reef. To accelerate the identification of their content, automated pattern recognition algorithms based on artificial intelligence and computer vision are being developed. We optimized each step of one such algorithm for the characteristics of the AIMS database. First, we evaluated various preprocessing methods to compensate for the effects of underwater imaging. Next, we iterated over the size of the analysis window to produce a simple, easily applied segmentation. We then extracted descriptors at several scales and over several color channels, so as to properly exploit the richness of visual information in coral images, and reduced the dimensionality of the descriptor space. Finally, we defined a range of ideal values for the classifier parameters. To complete this stage, we compared the performance of the optimized steps against state-of-the-art algorithms. We subsequently generalized the application to the entire database through cross-validations that established the system's performance limits. We also developed realistic strategies for exploiting the database: we evaluated the compatibility of training and test groups belonging to the same time period but distinct spatial locations, and vice versa, as well as small, large, homogeneous, and diversified groups. The resulting strategy achieved the stated objectives and completed an efficient tool for AIMS marine biologists.
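
    The pipeline this abstract outlines (windowed patches, multi-scale color descriptors, dimensionality reduction, classification) can be sketched as follows. The descriptor below is a trivial multi-scale color histogram, and the patch size, number of classes, and all parameters are assumptions; the thesis's actual descriptors and tuned values are not reproduced.

    ```python
    # Minimal sketch: patch-based benthic classification.
    # Describe each patch at two scales, reduce dimensionality, classify.
    import numpy as np
    from sklearn.decomposition import PCA
    from sklearn.pipeline import make_pipeline
    from sklearn.preprocessing import StandardScaler
    from sklearn.svm import SVC

    def describe_patch(patch, bins=8):
        """Concatenate per-channel color histograms at two scales."""
        feats = []
        for scale in (patch, patch[::2, ::2]):      # full and half resolution
            for c in range(scale.shape[2]):         # one histogram per channel
                h, _ = np.histogram(scale[..., c], bins=bins, range=(0, 255))
                feats.append(h / h.sum())
        return np.concatenate(feats)

    # Hypothetical data: 64x64 RGB patches cropped around labeled points.
    rng = np.random.default_rng(0)
    patches = rng.integers(0, 256, size=(500, 64, 64, 3))
    labels = rng.integers(0, 4, size=500)           # e.g. 4 benthic groups

    X = np.stack([describe_patch(p) for p in patches])
    model = make_pipeline(StandardScaler(), PCA(n_components=20),
                          SVC(kernel="rbf"))
    model.fit(X, labels)
    ```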