A traffic classification method using machine learning algorithm
Applying concepts of attack investigation from the IT industry, this work designs
a traffic classification method that uses data mining techniques combined with machine
learning algorithms to classify normal and malicious traffic. This classification will
help in learning about the unknown attacks faced by the IT industry. Traffic classification
is not a new concept; plenty of work has been done to classify network traffic for
heterogeneous applications. Existing techniques (payload-based, port-based,
and statistical) have their own pros and cons, which are discussed later in this
work, but classification using machine learning techniques is still an open field to explore and has provided very promising results so far.
Two-Step Cluster Based Feature Discretization of Naive Bayes for Outlier Detection in Intrinsic Plagiarism Detection
Intrinsic plagiarism detection is the task of analyzing a document for undeclared changes in writing style, which are treated as outliers. Naïve Bayes is often used for outlier detection. However, Naïve Bayes assumes that the values of continuous features are normally distributed; this assumption is strongly violated here, which causes low classification performance. Discretization of continuous features can improve the performance of Naïve Bayes. In this study, feature discretization based on Two-Step Cluster for Naïve Bayes is proposed. The proposed method uses tf-idf and a query language model to create features, together with a False Positive/False Negative (FP/FN) threshold that aims to improve accuracy, and is evaluated on the PAN PC 2009 dataset. The results indicate that the proposed method with discrete features outperforms the continuous-feature version on all evaluation measures: recall, precision, f-measure, and accuracy. Using the FP/FN threshold also affects the results, since it can decrease FP and FN and thus improve all evaluation measures.
Distribution of Mutual Information from Complete and Incomplete Data
Mutual information is widely used, in a descriptive way, to measure the
stochastic dependence of categorical random variables. In order to address
questions such as the reliability of the descriptive value, one must consider
sample-to-population inferential approaches. This paper deals with the
posterior distribution of mutual information, as obtained in a Bayesian
framework by a second-order Dirichlet prior distribution. The exact analytical
expression for the mean, and analytical approximations for the variance,
skewness and kurtosis are derived. These approximations have a guaranteed
accuracy level of the order O(1/n^3), where n is the sample size. Leading order
approximations for the mean and the variance are derived in the case of
incomplete samples. The derived analytical expressions allow the distribution
of mutual information to be approximated reliably and quickly. In fact, the
derived expressions can be computed with the same order of complexity needed
for descriptive mutual information. This makes the distribution of mutual
information become a concrete alternative to descriptive mutual information in
many applications which would benefit from moving to the inductive side. Some
of these prospective applications are discussed, and one of them, namely
feature selection, is shown to perform significantly better when inductive
mutual information is used.
Comment: 26 pages, LaTeX, 5 figures, 4 tables
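For orientation, the descriptive (plug-in) mutual information that the abstract contrasts with its posterior distribution can be sketched as below. This is a minimal illustration, not the paper's method: it computes only the empirical point estimate from a contingency table of counts, and the function name `mutual_information` is a hypothetical choice made here.

```python
import math


def mutual_information(counts):
    """Plug-in mutual information (in nats) of two categorical variables.

    `counts` is a contingency table given as a list of rows, where
    counts[i][j] is the number of samples with X = i and Y = j.
    This is the descriptive estimate only; it carries no information
    about the reliability of the value.
    """
    n = sum(sum(row) for row in counts)
    if n == 0:
        raise ValueError("contingency table is empty")
    row_totals = [sum(row) for row in counts]
    col_totals = [sum(col) for col in zip(*counts)]

    mi = 0.0
    for i, row in enumerate(counts):
        for j, n_ij in enumerate(row):
            if n_ij > 0:  # empty cells contribute nothing (0 * log 0 := 0)
                mi += (n_ij / n) * math.log(n_ij * n / (row_totals[i] * col_totals[j]))
    return mi


if __name__ == "__main__":
    # Independent variables: the estimate is zero.
    print(mutual_information([[10, 10], [10, 10]]))
    # Perfectly dependent binary variables: the estimate is log(2).
    print(mutual_information([[5, 0], [0, 5]]))
```

The cost is one pass over the table, which is the baseline complexity the abstract says its posterior approximations match.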
Discretization of Continuous Attributes
7 pages
In the data mining field, many learning methods, such as association rules, Bayesian networks, and induction rules (Grzymala-Busse & Stefanowski, 2001), can handle only discrete attributes. Therefore, before the machine learning process, it is necessary to re-encode each continuous attribute as a discrete attribute made up of a set of intervals; for example, the age attribute can be transformed into two discrete values representing two intervals: less than 18 (a minor) and 18 and over (of age). This process, known as discretization, is an essential task in data preprocessing, not only because some learning methods do not handle continuous attributes, but also for other important reasons: data transformed into a set of intervals are more cognitively relevant for human interpretation (Liu, Hussain, Tan & Dash, 2002); computation is faster with a reduced level of data, particularly when some attributes are removed from the representation space of the learning problem because no relevant cut can be found (Mittal & Cheong, 2002); discretization can capture non-linear relations, e.g., infants and elderly people are more sensitive to illness
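The age example above can be sketched as follows. This is a minimal illustration under assumptions: the helper name `discretize`, the constants `AGE_CUTS` and `AGE_LABELS`, and the convention that a value equal to a cut point falls in the upper interval are choices made here, not taken from the paper.

```python
from bisect import bisect_right


def discretize(value, cut_points, labels):
    """Map a continuous value to the label of the interval it falls in.

    `cut_points` must be sorted in ascending order; k cut points define
    k + 1 intervals, so `labels` must have exactly k + 1 entries.
    A value equal to a cut point is assigned to the upper interval.
    """
    if len(labels) != len(cut_points) + 1:
        raise ValueError("need exactly one more label than cut points")
    return labels[bisect_right(cut_points, value)]


# Hypothetical encoding of the age attribute: one cut at 18.
AGE_CUTS = [18]
AGE_LABELS = ["minor", "of age"]

if __name__ == "__main__":
    for age in (3, 17.9, 18, 64):
        print(age, "->", discretize(age, AGE_CUTS, AGE_LABELS))
```

With more cut points the same function yields more intervals, for instance separate ranges for infants and for the elderly.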
Feature Expansion for Social Media User Characterization
Personality plays an important role in our lives, and psychologists believe that an individual’s behavior can be inferred from their personality. Recently, there have been cases of
influential people in social media spreading misinformation, which is a potentially dangerous action. To prevent it, we need to identify which users will negatively impact the
community, and we might be able to predict such behavior through personality recognition from their social media posts.
This dissertation presents an approach to personality recognition from text. During the
literature review, we learned that a text analysis tool called Linguistic Inquiry and Word Count (LIWC) is repeatedly used
with success for tasks of this type, so we chose the LIWC dictionary as the base feature
set to consider. Also, we have found that Support-Vector Machine classifiers exhibit the
best results. From these two findings, we outlined the following objectives: (i) exploit
machine learning algorithms different from the ones used in related works to find one
that produces better results; (ii) analyze whether extending LIWC’s vocabulary without
supervision improves the classification results.
For training and testing, we used a data set of stream-of-consciousness essays comprised
of 2468 samples annotated with the Big Five personality traits of the writer: openness
to experience, conscientiousness, extraversion, agreeableness, and neuroticism. We used
four machine learning algorithms for classification: Support-Vector Machine, Naive Bayes,
Decision Tree, and Random Forest. Also, we selected two methods for vocabulary expansion: WordNet’s synsets, and Word Embeddings.
The results obtained show that the Random Forest classifier performs similarly to the algorithms used in related works, with an average accuracy of approximately 56.5%, which
is a promising result. The vocabulary expansions we performed allowed the algorithm to match 0.6% more words from the essay data set. However, the changes to the
classification results were not significant; therefore, the vocabulary expansion was not beneficial.
Analysis of Intelligent Classifiers and Enhancing the Detection Accuracy for Intrusion Detection System
In this paper we discuss and analyze some of the intelligent classifiers
that allow automatic detection and classification of network attacks for
any intrusion detection system. We first analyze them
using the WEKA software, working with the classifiers on a well-known IDS
(Intrusion Detection System) dataset, the NSL-KDD dataset. The NSL-KDD dataset
of network attacks was created in a military network by MIT Lincoln Labs. Then
we discuss and experiment with some of the hybrid AI (Artificial Intelligence)
classifiers that can be used for IDS, and finally we develop Java software
with the three most efficient classifiers and compare it with other options. The
outputs show the detection accuracy and efficiency of the single and
combined classifiers used.
What is behind a summary-evaluation decision?
Research in psychology has reported that, among the variety of possibilities for assessment methodologies, summary evaluation offers a particularly adequate context for inferring text comprehension and topic understanding. However, grades obtained in this methodology are hard to quantify objectively. Therefore, we carried out an empirical study to analyze the decisions underlying human summary-grading behavior. The task consisted of expert evaluation of summaries produced in critically relevant contexts of summarization development, and the resulting data were modeled by means of Bayesian networks using an application called Elvira, which allows for graphically observing the predictive power (if any) of the resultant variables. Thus, in this article, we analyzed summary-evaluation decision making in a computational framework.