Intrusion detection using decision tree classifier with feature reduction technique
The number of internet users and network services has increased rapidly over the past decade, and large volumes of data are produced and transmitted over networks; the number of security threats to networks has grown accordingly. Although many machine learning approaches are used in intrusion detection systems to detect attacks, they are generally not efficient for large datasets or for real-time detection, and training a classifier on all features of a dataset can reduce its detection accuracy. A feature selection technique that keeps only the features most relevant to detecting an attack can therefore yield higher accuracy. In this paper, we use recursive feature elimination to select the most relevant features and combine it with machine learning classifiers to meet the challenge of detecting attacks in big data. We apply this technique to the NSL-KDD dataset. Results show that using all features for detection increases complexity on large data, and that feature selection improves classifier performance in both efficiency and accuracy.
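The recursive feature elimination loop described above can be sketched as follows. This is a minimal illustration: a correlation-based importance score stands in for the paper's decision-tree importances, and synthetic data stands in for NSL-KDD.

```python
import numpy as np

def feature_importance(X, y):
    # Absolute correlation of each feature with the binary label --
    # a simple stand-in for decision-tree feature importances.
    Xc = X - X.mean(axis=0)
    yc = y - y.mean()
    cov = Xc.T @ yc
    denom = np.sqrt((Xc ** 2).sum(axis=0) * (yc ** 2).sum()) + 1e-12
    return np.abs(cov / denom)

def recursive_feature_elimination(X, y, n_keep):
    """Repeatedly drop the least important feature until n_keep remain."""
    keep = list(range(X.shape[1]))
    while len(keep) > n_keep:
        scores = feature_importance(X[:, keep], y)
        keep.pop(int(np.argmin(scores)))   # eliminate the weakest feature
    return keep

# Synthetic stand-in for NSL-KDD: only features 0 and 2 carry the signal.
rng = np.random.default_rng(0)
y = rng.integers(0, 2, 500)
X = rng.normal(size=(500, 6))
X[:, 0] += 2.0 * y
X[:, 2] -= 1.5 * y

selected = recursive_feature_elimination(X, y, n_keep=2)
print(sorted(selected))  # the informative features survive elimination
```

The ranking step runs once per eliminated feature, which is what makes the procedure "recursive" rather than a single one-shot filter.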
Unsupervised Intrusion Detection with Cross-Domain Artificial Intelligence Methods
Cybercrime is a major concern for corporations, business owners, governments and citizens, and it continues to grow in spite of increasing investments in security and fraud prevention. The main challenges in this research field are: being able to detect unknown attacks, and reducing the false positive ratio. The aim of this research work was to target both problems by leveraging four artificial intelligence techniques.
The first technique is a novel unsupervised learning method based on skip-gram modeling. It was designed, developed and tested against a public dataset with popular intrusion patterns. A high accuracy and a low false positive rate were achieved without prior knowledge of attack patterns.
The second technique is a novel unsupervised learning method based on topic modeling. It was applied to three related domains (network attacks, payments fraud, IoT malware traffic). A high accuracy was achieved in the three scenarios, even though the malicious activity significantly differs from one domain to the other.
The third technique is a novel unsupervised learning method based on deep autoencoders, with feature selection performed by a supervised method, random forest. Obtained results showed that this technique can outperform other similar techniques.
The fourth technique is based on an MLP neural network and is applied to alert reduction in fraud prevention. This method automates manual reviews previously performed by human experts without significantly impacting accuracy.
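A minimal sketch of the third technique's core idea, flagging records by reconstruction error, is shown below. A rank-2 linear autoencoder fitted by SVD stands in for the trained deep autoencoder, and the random-forest feature selection step is omitted.

```python
import numpy as np

rng = np.random.default_rng(1)

# "Normal" records live near a 2-D subspace of a 10-D feature space.
basis = rng.normal(size=(2, 10))
normal = rng.normal(size=(400, 2)) @ basis + 0.05 * rng.normal(size=(400, 10))

# Fit a rank-2 linear autoencoder from the top singular vectors
# (a linear stand-in for a trained deep autoencoder).
mean = normal.mean(axis=0)
_, _, vt = np.linalg.svd(normal - mean, full_matrices=False)
decoder = vt[:2]                     # shape (2, 10)

def reconstruction_error(x):
    centred = x - mean
    recon = centred @ decoder.T @ decoder   # encode then decode
    return np.linalg.norm(centred - recon, axis=-1)

# A record far from the normal subspace gets a large error and is flagged.
attack = rng.normal(size=10) * 3.0
print(reconstruction_error(attack) > reconstruction_error(normal).mean())
```

The detection rule is simply a threshold on reconstruction error: records the model cannot reconstruct well do not resemble the normal traffic it was fitted on.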
Simple low cost causal discovery using mutual information and domain knowledge
PhD thesis. This thesis examines causal discovery within datasets, in particular observational datasets where normal experimental manipulation is not possible. A number of machine learning techniques
are examined in relation to their use of knowledge and the insights they can provide regarding
the situation under study. Their use of prior knowledge and the causal knowledge produced by
the learners are examined. Current causal learning algorithms are discussed in terms of their strengths and limitations. The main contribution of the thesis is a new causal learner, LUMIN, which operates with polynomial time complexity in both the number of variables and the number of records examined. It makes no prior assumptions about the form of the relationships and can make extensive use of available domain information. This learner is compared with a number of current learning algorithms and is shown to be competitive with them.
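The mutual-information building block such a learner relies on can be sketched as follows. This is a generic illustration of scoring candidate dependencies, not LUMIN's actual procedure.

```python
import numpy as np
from collections import Counter

def mutual_information(x, y):
    """Empirical mutual information (in bits) between two discrete series."""
    n = len(x)
    px, py = Counter(x), Counter(y)
    pxy = Counter(zip(x, y))
    mi = 0.0
    for (a, b), c in pxy.items():
        p_ab = c / n
        mi += p_ab * np.log2(p_ab / ((px[a] / n) * (py[b] / n)))
    return mi

rng = np.random.default_rng(2)
cause = rng.integers(0, 2, 2000)
effect = cause ^ (rng.random(2000) < 0.1)   # noisy copy of cause
noise = rng.integers(0, 2, 2000)            # independent variable

# The dependent pair scores much higher than the independent one.
print(mutual_information(cause, effect) > mutual_information(cause, noise))
```

Thresholding such pairwise scores, combined with domain knowledge to orient edges, is one low-cost way to propose candidate causal links from observational data.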
Adaptive learning algorithms for Bayesian network classifiers
PhD in Mathematics. This thesis mainly addresses the development of adaptive learning algorithms
for Bayesian network classifiers (BNCs) in an on-line learning scenario. In this
scenario data arrives at the learning system sequentially. The actual predictive
model must first make a prediction and then update the current model with new
data. This scenario corresponds to Dawid's prequential approach for
statistical validation of models. An efficient adaptive algorithm in a prequential
learning framework must be able, above all, to improve its predictive accuracy
over time while reducing the cost of adaptation. However, in many real-world
situations it may be difficult to improve and adapt to changing environments, a problem known as concept drift. In changing environments, learning algorithms should be provided with control and adaptation mechanisms that strive to adjust quickly to these changes.
We have integrated all the adaptive algorithms into an adaptive prequential
framework for supervised learning called AdPreqFr4SL, which attempts to
handle the cost-performance trade-off and also to cope with concept drift.
The cost-quality trade-off is approached through bias management and
adaptation control. The rationale is as follows. Instead of selecting a particular
class of BNCs and using it during all the learning process, we use the class of
k-Dependence Bayesian classifiers (k-DBCs) and start with the simplest model, Naïve Bayes (obtained by setting the maximum number of allowable attribute dependencies, k, to 0). We can
then improve the performance of Naïve Bayes over time by trading off bias reduction, which leads to the addition of new attribute dependencies, against variance reduction, achieved by estimating the parameters more accurately. However, as the learning process advances, we should place more focus on bias management.
We reduce the bias resulting from the independence assumption by gradually
adding dependencies between the attributes over time. To this end, we
gradually increase k so that at each learning step we can use a class-model of
k-DBCs that better suits the available data. Thus, we can avoid the problems
caused by either too much bias (underfitting) or too much variance (overfitting).
On the other hand, updating the structure of BNCs with new data is a very costly task, so some adaptation control is desirable to decide whether adapting the structure is truly necessary. We reduce the cost of updating by using new data primarily to adapt the parameters; we adapt the structure only when it is detected that the current structure no longer guarantees the desired improvement in performance. To handle concept drift, our framework includes a method based on Statistical Quality Control, which has been shown to be effective at recognizing concept changes.
We experimentally evaluated AdPreqFr4SL on artificial domains and benchmark problems and showed its advantages over its non-adaptive versions.
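The prequential predict-then-update loop at the heart of the framework can be sketched with a sequential Naïve Bayes, the k = 0 case of a k-DBC; the structure adaptation and drift-control machinery of AdPreqFr4SL is omitted here.

```python
from collections import defaultdict

class SequentialNaiveBayes:
    """Naïve Bayes (the k = 0 k-DBC) updated one labelled record at a time."""

    def __init__(self, n_features, classes):
        self.classes = classes
        self.class_counts = defaultdict(int)
        # counts[c][i][v] = times feature i took value v under class c
        self.counts = {c: [defaultdict(int) for _ in range(n_features)]
                       for c in classes}
        self.n = 0

    def predict(self, x):
        def score(c):
            # Laplace-smoothed probabilities (products, not logs, for brevity;
            # binary features, hence the +2 in the denominator).
            p = (self.class_counts[c] + 1) / (self.n + len(self.classes))
            for i, v in enumerate(x):
                p *= (self.counts[c][i][v] + 1) / (self.class_counts[c] + 2)
            return p
        return max(self.classes, key=score)

    def update(self, x, y):
        self.class_counts[y] += 1
        self.n += 1
        for i, v in enumerate(x):
            self.counts[y][i][v] += 1

# Prequential loop: predict first, then learn from the revealed label.
stream = [((0, 1), 'a'), ((0, 1), 'a'), ((1, 0), 'b'),
          ((0, 1), 'a'), ((1, 0), 'b'), ((0, 1), 'a')]
model = SequentialNaiveBayes(n_features=2, classes=['a', 'b'])
correct = 0
for x, y in stream:
    correct += (model.predict(x) == y)   # evaluated before updating
    model.update(x, y)
print(correct, "of", len(stream))
```

Because only counts change, each parameter update is cheap; the thesis's point is that the expensive step, structure adaptation (increasing k), should be triggered only when this kind of parameter-only updating stops improving performance.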
Towards Personalized and Human-in-the-Loop Document Summarization
The ubiquitous availability of computing devices and the widespread use of the internet continuously generate large amounts of data. The amount of available information on any given topic is therefore far beyond humans' capacity to process, causing what is known as information overload. To cope efficiently with large amounts of information and generate content of significant value to users, we need to identify, merge and summarise information. Summaries can gather related information into a shorter format that enables answering complicated questions, gaining new insight and discovering conceptual boundaries.
This thesis focuses on three main challenges to alleviate information
overload using novel summarisation techniques. It further intends to facilitate
the analysis of documents to support personalised information extraction. This
thesis separates the research issues into four areas, covering (i) feature
engineering in document summarisation, (ii) traditional static and inflexible
summaries, (iii) traditional generic summarisation approaches, and (iv) the
need for reference summaries. We propose novel approaches to tackle these challenges by: (i) enabling automatic, intelligent feature engineering; (ii) enabling flexible and interactive summarisation; and (iii) utilising intelligent and personalised summarisation approaches. The experimental results demonstrate the efficiency of the proposed approaches compared with other state-of-the-art models. We further propose solutions to the information overload problem in different domains through summarisation, covering network traffic data, health data and business process data.
Comment: PhD thesis
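A frequency-based extractive summariser illustrates the kind of generic, static baseline the thesis moves beyond. This sketch is a standard textbook illustration, not one of the proposed methods.

```python
import re
from collections import Counter

def extractive_summary(text, n_sentences=1):
    """Score sentences by summed word frequency; return the top scorers."""
    sentences = [s.strip() for s in re.split(r'(?<=[.!?])\s+', text) if s.strip()]
    words = re.findall(r'[a-z]+', text.lower())
    freq = Counter(words)

    def score(sentence):
        # Raw frequency sum; real systems normalise for sentence length.
        return sum(freq[t] for t in re.findall(r'[a-z]+', sentence.lower()))

    chosen = set(sorted(sentences, key=score, reverse=True)[:n_sentences])
    # Preserve the original order of the chosen sentences.
    return ' '.join(s for s in sentences if s in chosen)

doc = ("Network traffic grows quickly. "
       "Summaries of network traffic help analysts monitor network traffic. "
       "Unrelated trivia rarely helps.")
summary = extractive_summary(doc, n_sentences=1)
print(summary)
```

The limitations of such a summariser, fixed features, no interactivity, one generic output for all users, are exactly the four problem areas the thesis targets.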
Study of communications data compression methods
A simple monochrome conditional replenishment system was extended to higher compression and higher motion levels by incorporating spatially adaptive quantizers and field repeating. Conditional replenishment combines intraframe and interframe compression, and both areas are investigated. The gain of conditional replenishment depends on the fraction of the image changing, since only changed parts of the image need to be transmitted. If the transmission rate is set so that only one fourth of the image can be transmitted in each field, greater change fractions will overload the system. A computer simulation was prepared which incorporated (1) field repeat of changes, (2) a variable change threshold, (3) frame repeat for high change, and (4) two-mode, variable-rate Hadamard intraframe quantizers. The field repeat gives 2:1 compression in moving areas without noticeable degradation. A variable change threshold allows some flexibility in dealing with varying change rates, but the threshold variation must be limited for acceptable performance.
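The transmit-only-what-changed rule, including the overload case where changes exceed the per-field budget, can be sketched as follows; the field-repeat, frame-repeat and Hadamard quantiser stages of the study are omitted.

```python
import numpy as np

def conditional_replenishment(prev, curr, threshold, budget):
    """Send only changed 8x8 blocks, up to a per-field budget of blocks.

    Returns the receiver's reconstruction and the number of blocks sent.
    If more blocks change than the budget allows, the extras are simply
    repeated from the previous field (the overload case described above).
    """
    recon = prev.copy()
    changed = []
    for r in range(0, prev.shape[0], 8):
        for c in range(0, prev.shape[1], 8):
            diff = np.abs(curr[r:r+8, c:c+8] - prev[r:r+8, c:c+8]).mean()
            if diff > threshold:
                changed.append((diff, r, c))
    # Transmit the most-changed blocks first until the budget is exhausted.
    changed.sort(reverse=True)
    for _, r, c in changed[:budget]:
        recon[r:r+8, c:c+8] = curr[r:r+8, c:c+8]
    return recon, min(len(changed), budget)

rng = np.random.default_rng(3)
prev = rng.integers(0, 256, (32, 32)).astype(float)
curr = prev.copy()
curr[0:8, 0:8] += 50.0   # one moving region; the rest of the field is static

recon, sent = conditional_replenishment(prev, curr, threshold=5.0, budget=4)
print(sent)              # only the changed block is transmitted
```

With a mostly static scene, almost no blocks cross the threshold, which is the source of conditional replenishment's compression gain; raising the threshold trades bit rate against fidelity, mirroring the variable change threshold in the simulation.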