Search CORE

173 research outputs found

Mini-batch k-Means versus k-Means to Cluster English Tafseer Text: View of Al-Baqarah Chapter

Author: Hanif Baharin
Mohammed A. Ahmed
Puteri NE. Nohuddin
Publication venue: 'Penerbit UTHM'
Publication date: 19/12/2021
Field of study

Al-Quran is the primary text of Muslims' religion and practise. Millions of Muslims around the world use al-Quran as their reference guide, and so knowledge can be obtained from it by Muslims and Islamic scholars in general. Al-Quran has been reinterpreted to various languages in the world, for example, English and has been written by several translators. Each translator has ideas, comments and statements to translate the verses from which he has obtained (Tafseer). Therefore, this paper tries to cluster the translation of the Tafseer using text clustering. Text clustering is the text mining method that needs to be clustered in the same section of related documents. The study adapted (mini-batch k-means and k-means) algorithms of clustering techniques to explain and to define the link between keywords known as features or concepts for Al-Baqarah chapter of 286 verses. For this dataset, data preprocessing and extraction of features using TF-IDF (Term Frequency-Inverse Document Frequency), and PCA (Principal Component Analysis) applied. Results show two/three-dimensional clustering plotting assigning seven cluster categories (k=7) for the Tafseer. The implementation time of the mini-batch k-means algorithm (0.05485s) outperforms the time of the k-means algorithm (0.23334s). Finally, the features 'god', 'people', and 'believe' was the most frequent features

Journals of Universiti Tun Hussein Onn Malaysia (UTHM)

Clustering Balinese Language Documents using the Balinese Stemmer Method and Mini Batch K-Means with K-Means++

Author: Adnyana I Made Budi
Budiarta Komang
Subali Made Agus Putra
Sugiartha I Gusti Rai Agung
Publication venue: Politeknik Negeri Batam
Publication date: 05/12/2023
Field of study

Clustering aims to categorize data into n groups, where data within each group exhibits maximum similarity, while the similarity between groups is minimized. Among various clustering methods, k-means is widely employed due to its simplicity and ability to yield optimal clustering results. However, the k-means method is susceptible to slow processing in high-dimensional datasets and the clustering outcomes are sensitive to the initial selection of cluster center values. In addressing these limitations, this study employs the k-means mini-batch method to enhance processing speed for high-dimensional data and utilizes the k-means++ method to optimize the selection of initial cluster center values. The dataset for this research comprises 300 news articles in Balinese sourced from the https://balitv.tv/ website. Prior to the clustering process, a stemming procedure is applied using the Balinese stemmer method to enhance recall. The obtained results reveal that a majority of the 300 data instances exhibit a high degree of similarity, as indicated by the clustering results. If the number of clusters (n) exceeds two, the data fails to be distinctly separated due to the high structural similarity among the data instances. This can be attributed to the relatively small number of words or attributes produced. In future research, feature reduction will be implemented, and a clustering method capable of addressing data overlap will be explored.Clustering aims to categorize data into n groups, where data within each group exhibits maximum similarity, while the similarity between groups is minimized. Among various clustering methods, k-means is widely employed due to its simplicity and ability to yield optimal clustering results. However, the k-means method is susceptible to slow processing in high-dimensional datasets and the clustering outcomes are sensitive to the initial selection of cluster center values. In addressing these limitations, this study employs the k-means mini-batch method to enhance processing speed for high-dimensional data and utilizes the k-means++ method to optimize the selection of initial cluster center values. The dataset for this research comprises 300 news articles in Balinese sourced from the https://balitv.tv/ website. Prior to the clustering process, a stemming procedure is applied using the Balinese stemmer method to enhance recall. The obtained results reveal that a majority of the 300 data instances exhibit a high degree of similarity, as indicated by the clustering results. If the number of clusters (n) exceeds two, the data fails to be distinctly separated due to the high structural similarity among the data instances. This can be attributed to the relatively small number of words or attributes produced. In future research, feature reduction will be implemented, and a clustering method capable of addressing data overlap will be explored

Jurnal Politeknik Negeri Batam (PoliBatam)

Selecting Root Exploit Features Using Flying Animal-Inspired Decision

Author: Budiarto Rahmat
Firdaus Ahmad
Kasim Shahreen
Nincarean Danakorn
Razak Mohd Faizal Ab
Sutikno Tole
Wan Din Wan Isni Sofiah
Publication venue: IAES Indonesia Section
Publication date: 02/01/2020
Field of study

Malware is an application that executes malicious activities to a computer system, including mobile devices. Root exploit brings more damages among all types of malware because it is able to run in stealthy mode. It compromises the nucleus of the operating system known as kernel to bypass the Android security mechanisms. Once it attacks and resides in the kernel, it is able to install other possible types of malware to the Android devices. In order to detect root exploit, it is important to investigate its features to assist machine learning to predict it accurately. This study proposes flying animal-inspired (1) bat, 2) firefly, and 3) bee) methods to search automatically the exclusive features, then utilizes these flying animal-inspired decision features to improve the machine learning prediction. Furthermore, a boosting method (Adaboost) boosts the multilayer perceptron (MLP) potential to a stronger classification. The evaluation jotted the best result is from bee search, which recorded 91.48 percent in accuracy, 82.2 percent in true positive rate, and 0.1 percent false positive rate

Indonesian Journal of Electrical Engineering and Informatics (IJEEI)

Forging a deep learning neural network intrusion detection framework to curb the distributed denial of service attack

Author: Ojugo Arnold Adimabua
Yoro Rume Elizabeth
Publication venue: Institute of Advanced Engineering and Science
Publication date: 01/04/2021
Field of study

Today’s popularity of the internet has since proven an effective and efficient means of information sharing. However, this has consequently advanced the proliferation of adversaries who aim at unauthorized access to information being shared over the internet medium. These are achieved via various means one of which is the distributed denial of service attacks-which has become a major threat to the electronic society. These are carefully crafted attacks of large magnitude that possess the capability to wreak havoc at very high levels and national infrastructures. This study posits intelligent systems via the use of machine learning frameworks to detect such. We employ the deep learning approach to distinguish between benign exchange of data and malicious attacks from data traffic. Results shows consequent success in the employment of deep learning neural network to effectively differentiate between acceptable and non-acceptable data packets (intrusion) on a network data traffic

ZENODO

NEUROSURGERY ENTHUSIASTIC WOMEN SOCIETY

Institute of Advanced Engineering and Science

Recommended from our members

Robust behavioral malware detection

Author: Kazdagli Mikhail
Publication venue
Publication date: 13/08/2018
Field of study

Computer security attacks evolve to evade deployed defenses. Recent attacks have ranged from exploiting generic software vulnerabilities in memory-unsafe languages such as buffer overflows and format string vulnerabilities to exploiting logic errors in web applications, through means such as SQL injection and cross-site scripting. Furthermore, recent attacks have focused on escalating privileges and stealing sensitive information by exploiting new hardware or operating system (OS) interfaces. Computer security attacks are also now relying on social engineering techniques to run malicious programs on victims' machines; instances of such abuse include phishing and watering hole attacks, both of which trick people into running malicious code or divulging confidential information. Thus, traditional computer security methods, such as OS confinement and program analysis, will not prevent new attacks that do not violate OS confinement or present illegal program behaviors. Another challenge is that traditional security approaches have large trusted code bases (TCBs), which include hardware, OSs, and other software components that implement authentication and authorization logic across a distributed system. This is a vulnerable area because these components are complex and often contain vulnerabilities that undermine the overall system's integrity or confidentiality. Evasive attacks on vulnerable systems -- especially in instances where trusted components turn malicious -- inspire the creation of defenses that can augment formally specified mechanisms against known threats. Specifically, this thesis advances the state of the art in behavioral malware detection -- detecting previously unknown malware in the very early stages of infection within an enterprise network. Here we assess three fundamental insights of modern-day attacks and then describe a cross-layer defense against such attacks. First, we make a low-level machine state visible to behavioral analysis, significantly minimizing the TCB and its associated vulnerabilities. Specifically, our behavioral detector utilizes an executable code's dynamic properties, with architectural and micro-architectural states as input. Second, we evaluate behavioral detectors against adaptive adversaries. For this purpose, we introduce a new metric to determine a detector's robustness against malware modifications, which serves as a step toward explainability of machine learning-based malware detectors. Finally, we exploit the fact that attacks spread through only a limited number of vectors and propose new techniques to analyze the resulting dynamic correlations created among machines. These insights show that behavioral detectors can efficiently protect both individual devices and end hosts within enterprise networks. We present three types of such behavioral detectors. Sherlock protects resource-constrained devices, such as mobile phones and Internet-of-things (IoT) devices, without modifying the software/hardware stack. Sherlock's supervised and unsupervised versions outperform prior work by 24.7% and 12.5% (area under the curve (AUC) metric), respectively, and detects stealthy malware that often evades static analysis tools. The second behavioral detector, Shape-GD, protects devices within an enterprise network. It monitors devices on the network, aggregates data from weak local detectors, overlays that with network-level information, and then makes early, robust predictions regarding malicious activity. Shape-GD achieves its goals by exploiting latent attack semantics. Specifically, it analyzes communication patterns across multiple devices, partitioning them into neighborhoods. Devices within the same neighborhood are likely to be exposed to the same attack vector. Furthermore, we hypothesize that the conditional distribution of false positives is different from that of true positives; i.e., given a neighborhood of nodes, we can compute the aggregate distributional shape of alert feature vectors from the neighborhood itself and provide robust labels. We evaluate Shape-GD by emulating a large community of Windows systems using the system call traces from a few thousand malicious and benign applications; we simulate both a phishing attack in a corporate email network as well as a watering hole attack through a popular website. In both scenarios, Shape-GD identifies malware early on (~100 infected nodes in a ~100K-node system for watering hole attacks, and ~10 of ~1,000 for phishing attacks) and robustly (with ~100% global true-positive and ~1% global false-positive rates). The third behavioral detector, Centurion, detects malware across machines monitored by an anti-virus company. It is able to analyze behavior from 5 million Symantec client machines in real time and discovers malware by correlating file downloads across multiple machines. Compared with a recent local detector that analyzes metadata from file downloads, Centurion reduced the number of false positives from ~1M to ~110K and increased the true-positive rate by a factor of ~2.5. In addition, on average, Centurion detects malware 345 days earlier than commercial anti-virus products.Electrical and Computer Engineerin

Texas ScholarWorks

Anomaly Detection in Sequential Data: A Deep Learning-Based Approach

Author: Soni Jayesh
Publication venue: FIU Digital Commons
Publication date: 20/06/2022
Field of study

Anomaly Detection has been researched in various domains with several applications in intrusion detection, fraud detection, system health management, and bio-informatics. Conventional anomaly detection methods analyze each data instance independently (univariate or multivariate) and ignore the sequential characteristics of the data. Anomalies in the data can be detected by grouping the individual data instances into sequential data and hence conventional way of analyzing independent data instances cannot detect anomalies. Currently: (1) Deep learning-based algorithms are widely used for anomaly detection purposes. However, significant computational overhead time is incurred during the training process due to static constant batch size and learning rate parameters for each epoch, (2) the threshold to decide whether an event is normal or malicious is often set as static. This can drastically increase the false alarm rate if the threshold is set low or decrease the True Alarm rate if it is set to a remarkably high value, (3) Real-life data is messy. It is impossible to learn the data features by training just one algorithm. Therefore, several one-class-based algorithms need to be trained. The final output is the ensemble of the output from all the algorithms. The prediction accuracy can be increased by giving a proper weight to each algorithm\u27s output. By extending the state-of-the-art techniques in learning-based algorithms, this dissertation provides the following solutions: (i) To address (1), we propose a hybrid, dynamic batch size and learning rate tuning algorithm that reduces the overall training time of the neural network. (ii) As a solution for (2), we present an adaptive thresholding algorithm that reduces high false alarm rates. (iii) To overcome (3), we propose a multilevel hybrid ensemble anomaly detection framework that increases the anomaly detection rate of the high dimensional dataset

DigitalCommons@Florida International University

Neural malware detection

Author: Park Sean
Publication venue: 'Federation University Australia'
Publication date: 01/01/2019
Field of study

At the heart of today’s malware problem lies theoretically infinite diversity created by metamorphism. The majority of conventional machine learning techniques tackle the problem with the assumptions that a sufficiently large number of training samples exist and that the training set is independent and identically distributed. However, the lack of semantic features combined with the models under these wrong assumptions result largely in overfitting with many false positives against real world samples, resulting in systems being left vulnerable to various adversarial attacks. A key observation is that modern malware authors write a script that automatically generates an arbitrarily large number of diverse samples that share similar characteristics in program logic, which is a very cost-effective way to evade detection with minimum effort. Given that many malware campaigns follow this paradigm of economic malware manufacturing model, the samples within a campaign are likely to share coherent semantic characteristics. This opens up a possibility of one-to-many detection. Therefore, it is crucial to capture this non-linear metamorphic pattern unique to the campaign in order to detect these seemingly diverse but identically rooted variants. To address these issues, this dissertation proposes novel deep learning models, including generative static malware outbreak detection model, generative dynamic malware detection model using spatio-temporal isomorphic dynamic features, and instruction cognitive malware detection. A comparative study on metamorphic threats is also conducted as part of the thesis. Generative adversarial autoencoder (AAE) over convolutional network with global average pooling is introduced as a fundamental deep learning framework for malware detection, which captures highly complex non-linear metamorphism through translation invariancy and local variation insensitivity. Generative Adversarial Network (GAN) used as a part of the framework enables oneshot training where semantically isomorphic malware campaigns are identified by a single malware instance sampled from the very initial outbreak. This is a major innovation because, to the best of our knowledge, no approach has been found to this challenging training objective against the malware distribution that consists of a large number of very sparse groups artificially driven by arms race between attackers and defenders. In addition, we propose a novel method that extracts instruction cognitive representation from uninterpreted raw binary executables, which can be used for oneto- many malware detection via one-shot training against frequency spectrum of the Transformer’s encoded latent representation. The method works regardless of the presence of diverse malware variations while remaining resilient to adversarial attacks that mostly use random perturbation against raw binaries. Comprehensive performance analyses including mathematical formulations and experimental evaluations are provided, with the proposed deep learning framework for malware detection exhibiting a superior performance over conventional machine learning methods. The methods proposed in this thesis are applicable to a variety of threat environments here artificially formed sparse distributions arise at the cyber battle fronts.Doctor of Philosoph

Federation ResearchOnline

Multimodal Approach for Malware Detection

Author: Hernandez Jimenez Jarilyn M
Publication venue: The Research Repository @ WVU
Publication date: 01/01/2019
Field of study

Although malware detection is a very active area of research, few works were focused on using physical properties (e.g., power consumption) and multimodal features for malware detection. We designed an experimental testbed that allowed us to run samples of malware and non-malicious software applications and to collect power consumption, network traffic, and system logs data, and subsequently to extract dynamic behavioral-based features. We also extracted code-based static features of both malware and non-malicious software applications. These features were used for malware detection based on: feature level fusion using power consumption and network traffic data, feature level fusion using network traffic data and system logs, and multimodal feature level and decision level fusion. The contributions when using feature level fusion of power consumption and network traffic data are: (1) We focused on detecting real malware using the extracted dynamic behavioral features (both power-based and network traffic-based) and supervised machine learning algorithms, which has not been done by any of the prior works. (2) We ran a large number of machine learning experiments, which allowed us to identify the best performing learner, DC voltage rails that led to the best malware detection performance, and the subset of features that are the best predictors for malware detection. (3) The comparison of malware detection performance was done using a comprehensive set of metrics that reflect different aspects of the quality of malware detection. In the case of the feature level fusion using network traffic data and system logs, the contributions are: (1) Most of the previous works that have used network flows-based features have done classification of the network traffic, while our focus was on classifying the software running in a machine as malware and non-malicious software using the extracted dynamic behavioral features. (2) We experimented with different sizes of the training set (i.e., 90%, 75%, 50%, and 25% of the data) and found that smaller training sets produced very good classification results. This aspect of our work has a practical value because the manual labeling of the training set is a tedious and time consuming process. In this dissertation we present a multimodal deep learning neural network that integrates different modalities (i.e., power consumption, system logs, network traffic, and code-based static data) using decision level fusion. We evaluated the performance of each modality individually, when using feature level fusion, and when using decision level fusion. The contributions of our multimodal approach are as follow: (1) Collecting data from different modalities allowed us to develop a multimodal approach to malware detection, which has not been widely explored by prior works. Even more, none of the previous works compared the performance of feature level fusion with decision level fusion, which is explored in this dissertation. (2) We proposed a multimodal decision level fusion malware detection approach using a deep neural network and compared its performance with the performance of feature level fusion approaches based on deep neural network and standard supervised machine learning algorithms (i.e., Random Forest, J48, JRip, PART, Naive Bayes, and SMO)

The Research Repository @ WVU (West Virginia University)

Toward Building an Intelligent and Secure Network: An Internet Traffic Forecasting Perspective

Author: Saha Sajal
Publication venue: Scholarship@Western
Publication date: 21/08/2023
Field of study

Internet traffic forecast is a crucial component for the proactive management of self-organizing networks (SON) to ensure better Quality of Service (QoS) and Quality of Experience (QoE). Given the volatile and random nature of traffic data, this forecasting influences strategic development and investment decisions in the Internet Service Provider (ISP) industry. Modern machine learning algorithms have shown potential in dealing with complex Internet traffic prediction tasks, yet challenges persist. This thesis systematically explores these issues over five empirical studies conducted in the past three years, focusing on four key research questions: How do outlier data samples impact prediction accuracy for both short-term and long-term forecasting? How can a denoising mechanism enhance prediction accuracy? How can robust machine learning models be built with limited data? How can out-of-distribution traffic data be used to improve the generalizability of prediction models? Based on extensive experiments, we propose a novel traffic forecast/prediction framework and associated models that integrate outlier management and noise reduction strategies, outperforming traditional machine learning models. Additionally, we suggest a transfer learning-based framework combined with a data augmentation technique to provide robust solutions with smaller datasets. Lastly, we propose a hybrid model with signal decomposition techniques to enhance model generalization for out-of-distribution data samples. We also brought the issue of cyber threats as part of our forecast research, acknowledging their substantial influence on traffic unpredictability and forecasting challenges. Our thesis presents a detailed exploration of cyber-attack detection, employing methods that have been validated using multiple benchmark datasets. Initially, we incorporated ensemble feature selection with ensemble classification to improve DDoS (Distributed Denial-of-Service) attack detection accuracy with minimal false alarms. Our research further introduces a stacking ensemble framework for classifying diverse forms of cyber-attacks. Proceeding further, we proposed a weighted voting mechanism for Android malware detection to secure Mobile Cyber-Physical Systems, which integrates the mobility of various smart devices to exchange information between physical and cyber systems. Lastly, we employed Generative Adversarial Networks for generating flow-based DDoS attacks in Internet of Things environments. By considering the impact of cyber-attacks on traffic volume and their challenges to traffic prediction, our research attempts to bridge the gap between traffic forecasting and cyber security, enhancing proactive management of networks and contributing to resilient and secure internet infrastructure

Scholarship@Western