Zero-day Network Intrusion Detection using Machine Learning Approach
Zero-day network attacks are a growing global cybersecurity concern. Attackers exploit vulnerabilities in network systems, making network traffic analysis crucial for detecting and mitigating unauthorized intrusions. Inadequate or ineffective network traffic analysis, however, can lead to prolonged network compromises. To address this, machine learning-based zero-day network intrusion detection systems (ZDNIDS) rely on monitoring and collecting relevant information from network traffic data. Given the voluminous nature of network traffic data, characterized by many attributes, the selection of pertinent features is essential for optimal ZDNIDS performance. Unfortunately, current machine learning models in this field are inefficient at detecting zero-day network attacks, resulting in a high false alarm rate and overall performance degradation. To overcome these limitations, this paper introduces a novel approach combining the anomaly-based extended isolation forest algorithm with the BAT algorithm and Nevergrad. The proposed model was evaluated on 5G network traffic, showcasing its effectiveness in efficiently detecting both known and unknown attacks and reducing false alarms compared to existing systems. This advancement contributes to improved internet security.
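As a rough illustration of the anomaly-based idea behind such a ZDNIDS, the sketch below trains scikit-learn's standard IsolationForest (a stand-in for the paper's extended variant; the BAT and Nevergrad optimization steps are not reproduced) on simulated benign traffic and flags unseen out-of-distribution flows. All data and feature dimensions here are synthetic assumptions, not from the paper:

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(0)
# Simulated "benign" traffic features (e.g. packet size, flow duration).
benign = rng.normal(loc=0.0, scale=1.0, size=(1000, 4))
# A few unseen ("zero-day") flows far from the benign distribution.
attacks = rng.normal(loc=6.0, scale=1.0, size=(20, 4))

# Train only on traffic assumed to be mostly normal.
model = IsolationForest(n_estimators=200, contamination="auto", random_state=0)
model.fit(benign)

# predict() returns +1 for inliers and -1 for anomalies.
flagged = model.predict(attacks) == -1
print(flagged.mean())  # fraction of zero-day flows flagged as anomalous
```

Because the detector models only what normal traffic looks like, it needs no labeled attack examples, which is what makes the anomaly-based formulation suitable for previously unseen attacks.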
Randomized outlier detection with trees
Isolation forest (IF) is a popular outlier detection algorithm that isolates outlier observations from regular observations by building multiple random isolation trees. The average number of comparisons required to isolate a given observation can then be used as a measure of its outlierness. Multiple extensions of this approach have been proposed in the literature, including the extended isolation forest (EIF) and the SCiForest. However, we find a lack of theoretical explanation of why IF, EIF, and SCiForest offer such good practical performance. In this paper, we present a theoretical framework that views these approaches from a distributional viewpoint. Using this viewpoint, we show that isolation-based approaches first accurately approximate the data distribution and then approximate the coefficients of mixture components using the average path length. Using this framework, we derive the generalized isolation forest (GIF), which also trains random isolation trees but moves beyond the average path length when combining them. That is, GIF splits the data into multiple sub-spaces by sampling random splits, as the original IF variants do, and directly estimates the mixture coefficients of a mixture distribution to score the outlierness of entire regions of data. In an extensive evaluation, we compare GIF with 18 state-of-the-art outlier detection methods on 14 different datasets. We show that GIF outperforms three competing tree-based methods and has competitive performance with other nearest-neighbor approaches while having a lower runtime. Last, we highlight a use-case study that uses GIF to detect transaction fraud in financial data.
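The path-length scoring shared by the original IF variants (and which GIF moves beyond) can be written down directly. A minimal sketch of the standard normalization c(n) = 2H(n-1) - 2(n-1)/n, with the harmonic number approximated via the Euler-Mascheroni constant, and the anomaly score s(x, n) = 2^(-E[h(x)]/c(n)):

```python
import math

EULER_MASCHERONI = 0.5772156649

def c(n):
    # Average path length of an unsuccessful BST search on n points,
    # used to normalize path lengths across sample sizes.
    if n <= 1:
        return 0.0
    return 2.0 * (math.log(n - 1) + EULER_MASCHERONI) - 2.0 * (n - 1) / n

def anomaly_score(avg_path_len, n):
    # Scores near 1 indicate outliers (short isolation paths);
    # scores well below 0.5 indicate regular observations.
    return 2.0 ** (-avg_path_len / c(n))

print(anomaly_score(3.0, 256))   # short average path -> high score
print(anomaly_score(12.0, 256))  # long average path -> lower score
```

This makes concrete the paper's point: the score depends on the data only through the average path length, which is exactly the quantity GIF replaces with directly estimated mixture coefficients.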
itsdm: Isolation forest-based presence-only species distribution modelling and explanation in R
Multiple statistical algorithms have been used for species distribution modelling (SDM). Due to shortcomings in species occurrence datasets, presence-only methods (such as MaxEnt) have become increasingly widely used. However, sampling bias remains a challenging issue, particularly for density-based approaches. The Isolation Forest (iForest) algorithm is a presence-only method less sensitive to sampling patterns and over-fitting because it fits the model by describing the unsuitable instead of suitable conditions. Here, we present the itsdm package for species distribution modelling with iForest, which provides a workflow wrapper for the algorithms in the iForest family and convenient tools for model diagnostics and post-modelling analysis. itsdm allows users to fit and evaluate an iForest SDM using presence-only occurrence data. It also helps users understand relationships between species and their living environment using Shapley values, a suggested technique in explainable artificial intelligence (xAI). Additionally, itsdm can make spatial response maps that indicate how species respond to environmental variables across space and detect areas potentially affected by a changing environment. We demonstrated the usage of the itsdm package and compared iForest with other mainstream SDMs using virtual species. The results showed that iForest is an advantageous presence-only SDM when the actual distribution range is unclear. © 2023 The Authors. Methods in Ecology and Evolution published by John Wiley & Sons Ltd on behalf of British Ecological Society.
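The core presence-only idea — fitting a model that describes the environmental conditions around the observed presences and scoring how far candidate sites fall outside them — can be sketched in Python with scikit-learn's IsolationForest (itsdm itself is an R package, and the covariates below are hypothetical):

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(42)
# Hypothetical presence-only records: two standardized environmental
# covariates (e.g. annual mean temperature, precipitation).
presence_env = rng.normal(loc=[0.5, -0.2], scale=0.3, size=(300, 2))

# The forest learns the envelope of conditions where the species occurs.
sdm = IsolationForest(n_estimators=100, random_state=0).fit(presence_env)

# Score candidate sites: higher score_samples() output means conditions
# more similar to the observed presences, i.e. higher modelled suitability.
sites = np.array([[0.5, -0.2],   # inside the presence niche
                  [3.0,  3.0]])  # far outside it
suitability = sdm.score_samples(sites)
print(suitability[0] > suitability[1])
```

Note the inversion relative to anomaly detection: here the "inliers" are the suitable conditions, which is why no absence data is required.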
Comparison of anomaly detection techniques applied to different problems in the telecom industry
Nowadays, with the growth of digital transformation in companies, a huge amount of
data is generated every second as a result of various processes. Often this data contains
important information which, when properly analyzed, can help a company gain a competitive
advantage. One data processing task common to many different applications is
detection of anomalies, that is, data points or groups of data points that stand out from
most of the others. Since it is not feasible to have an operator constantly analyzing the
data to find anomalous values, due to the generally large volumes of data, the focus of
this dissertation is the exploration of a Data Mining area called anomaly detection. In this
dissertation we first develop anomaly detection software in Python that applies 10
different anomaly detection algorithms, after automatically optimizing their parameters,
to an arbitrary dataset. Before applying these algorithms, the software also performs
data scaling and imputation of missing values. It outputs the performance metrics of
each algorithm, the values of the optimized parameters, and visualizations of the
results generated using the t-SNE method. This software was then
applied to three case studies to compare the performance of different anomaly detection
approaches using real-world datasets. These datasets have an increasing level of difficulty,
in terms of the amount of missing data and the uncertainty of the ground truth
regarding the anomalies. In the first case study, we detected fraudulent
bank transactions using a public dataset. Then, in the second case we identified clients of
a telecommunication company who were likely to miss their payment, leading to contract
termination. For this case we used a dataset from a telecommunications company. In
the third case, we detected low quality of internet service, again using a large dataset
with real measurements from a telecommunications company. Finally, we implemented
a state-of-the-art neural network model, specifically applicable to the task of identifying
anomalies in time-series data. We optimized the parameters of the network and applied
it to address the problem of low quality of service.
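The preprocessing-plus-detection pipeline the dissertation describes (missing-value imputation, scaling, an anomaly detector, performance metrics) can be sketched with scikit-learn. The dataset below is synthetic, and the automatic parameter optimization and t-SNE visualization steps are omitted:

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline
from sklearn.ensemble import IsolationForest
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(1)
# Synthetic dataset: 500 normal rows with ~5% missing values, 25 outliers.
X = rng.normal(size=(500, 5))
X[rng.random(X.shape) < 0.05] = np.nan
outliers = rng.normal(loc=5.0, size=(25, 5))
X_all = np.vstack([X, outliers])
y_true = np.r_[np.zeros(500), np.ones(25)]  # ground-truth anomaly labels

# Preprocessing as described: imputation of missing values, then scaling.
pre = make_pipeline(SimpleImputer(strategy="mean"), StandardScaler())
Z = pre.fit_transform(X_all)

# One of the candidate detectors; higher score = more anomalous.
scores = -IsolationForest(random_state=0).fit(Z).score_samples(Z)
print(roc_auc_score(y_true, scores))  # ranking quality vs. ground truth
```

In the dissertation's setting the same preprocessed matrix would be fed to all 10 detectors and their metrics compared side by side.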
SQ-SLAM: Monocular Semantic SLAM Based on Superquadric Object Representation
Object SLAM uses additional semantic information to detect and map objects in
the scene, in order to improve the system's perception and map representation
capabilities. Quadrics and cubes are often used to represent objects, but their
single shape limits the accuracy of the object map and thus affects the application
of downstream tasks. In this paper, we introduce superquadrics (SQ) with shape
parameters into SLAM for representing objects, and propose a separate parameter
estimation method that can accurately estimate object pose and adapt to
different shapes. Furthermore, we present a lightweight data association
strategy for correctly associating semantic observations in multiple views with
object landmarks. We implement a monocular semantic SLAM system with real-time
performance and conduct comprehensive experiments on public datasets. The
results show that our method is able to build an accurate object map and has
advantages in object representation. Code will be released upon acceptance.
Comment: Submitted to ICRA 202
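For readers unfamiliar with superquadrics, their standard implicit inside-outside function — the representation SQ-SLAM fits, though the paper's estimation method is not shown here — can be evaluated directly. This is the textbook formula, not code from the paper:

```python
def sq_inside_outside(p, a=(1.0, 1.0, 1.0), eps=(1.0, 1.0)):
    """Superquadric implicit function F(p) in the object frame:
    F < 1 inside, F == 1 on the surface, F > 1 outside.
    a = (a1, a2, a3) are the axis scales; eps = (eps1, eps2) are the
    shape exponents (eps near 1 gives an ellipsoid; eps -> 0 tends
    toward a box), which is what lets one family cover many objects."""
    x, y, z = (abs(v) for v in p)
    a1, a2, a3 = a
    e1, e2 = eps
    xy = (x / a1) ** (2.0 / e2) + (y / a2) ** (2.0 / e2)
    return xy ** (e2 / e1) + (z / a3) ** (2.0 / e1)

print(sq_inside_outside((0.0, 0.0, 0.0)))  # origin lies inside (F < 1)
print(sq_inside_outside((1.0, 0.0, 0.0)))  # point on the unit surface (F == 1)
```

The two shape exponents are exactly the extra parameters that quadrics and cubes lack, which is why a separate estimation step for them is needed in the paper.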
T2D2: A Time Series Tester, Transformer, and Decomposer Framework for Outlier Detection
The automatic detection of outliers in time series datasets has attracted much attention in the data science community. It is not a simple task, as the data may have several components, such as seasonality, trend, or a combination of the two. Furthermore, to obtain reliable and trustworthy knowledge from the data, the data itself should be understandable. To cope with these challenges, in this paper we introduce a new framework that first tests the stationarity and seasonality of a dataset, then applies a set of Fourier transforms to obtain the Fourier sample frequencies, which support a decomposer component. The proposed framework, namely TTDD (Test, Transform, Decompose, and Detection), implements a decomposer component that splits the dataset into three parts: trend, seasonal, and residual. Finally, the frequency difference detector compares the frequency of the test set to the frequency of the training set to determine periods of frequency discrepancy, which are flagged as outlier periods.
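The decompose-then-detect idea (split the series into trend, seasonal, and residual components, then flag points the first two cannot explain) can be sketched with plain NumPy. This is a generic additive decomposition, not the TTDD implementation, and the Fourier-based frequency comparison is not reproduced:

```python
import numpy as np

rng = np.random.default_rng(7)
period, n = 12, 240
t = np.arange(n)
# Synthetic series: linear trend + seasonal cycle + noise, one injected outlier.
series = 0.05 * t + 2.0 * np.sin(2 * np.pi * t / period) + rng.normal(0, 0.3, n)
series[100] += 8.0

# 1) Trend: centred moving average over one full period (valid region only,
#    to avoid edge artifacts).
kernel = np.ones(period) / period
trend = np.convolve(series, kernel, mode="valid")
idx = t[period // 2 : period // 2 + trend.size]  # indices covered by the trend

# 2) Seasonal component: mean of the detrended series at each phase of the cycle.
detrended = series[idx] - trend
phase = idx % period
seasonal = np.array([detrended[phase == p].mean() for p in range(period)])[phase]

# 3) Residual: what trend + seasonal cannot explain; flag large deviations.
residual = detrended - seasonal
flags = np.abs(residual - residual.mean()) > 3 * residual.std()
print(idx[flags])  # positions flagged as outliers
```

TTDD works analogously but at the level of whole periods, comparing test-set frequencies against training-set frequencies rather than thresholding individual residuals.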