1,199 research outputs found

    A traffic classification method using machine learning algorithm

    Applying concepts of attack investigation in the IT industry, this work designs a traffic classification method that uses data mining techniques together with machine learning algorithms to classify normal and malicious traffic. This classification will help in learning about the unknown attacks faced by the IT industry. Traffic classification is not a new concept; plenty of work has been done to classify network traffic for heterogeneous applications. Existing techniques (payload-based, port-based, and statistical) have their own pros and cons, which are discussed later in this work, but classification using machine learning techniques is still an open field to explore and has provided very promising results so far.

    Two-Step Cluster Based Feature Discretization of Naive Bayes for Outlier Detection in Intrinsic Plagiarism Detection

    Intrinsic plagiarism detection is the task of analyzing a document for undeclared changes in writing style, which are treated as outliers. Naïve Bayes is often used for outlier detection. However, Naïve Bayes assumes that the values of continuous features are normally distributed; this assumption is strongly violated here, which causes low classification performance. Discretization of continuous features can improve the performance of Naïve Bayes. In this study, feature discretization based on Two-Step Cluster for Naïve Bayes is proposed. The proposed method uses tf-idf and a query language model as feature creators, together with a False Positive/False Negative (FP/FN) threshold that aims to improve accuracy, and is evaluated on the PAN PC 2009 dataset. The results indicate that the proposed method with discrete features outperforms the continuous-feature version on all evaluation measures: recall, precision, F-measure, and accuracy. Using the FP/FN threshold also affects the results, since it can decrease FP and FN and thus improve all evaluation measures.

    Distribution of Mutual Information from Complete and Incomplete Data

    Mutual information is widely used, in a descriptive way, to measure the stochastic dependence of categorical random variables. In order to address questions such as the reliability of the descriptive value, one must consider sample-to-population inferential approaches. This paper deals with the posterior distribution of mutual information, as obtained in a Bayesian framework by a second-order Dirichlet prior distribution. The exact analytical expression for the mean, and analytical approximations for the variance, skewness and kurtosis are derived. These approximations have a guaranteed accuracy level of the order O(1/n^3), where n is the sample size. Leading order approximations for the mean and the variance are derived in the case of incomplete samples. The derived analytical expressions allow the distribution of mutual information to be approximated reliably and quickly. In fact, the derived expressions can be computed with the same order of complexity needed for descriptive mutual information. This makes the distribution of mutual information a concrete alternative to descriptive mutual information in many applications which would benefit from moving to the inductive side. Some of these prospective applications are discussed, and one of them, namely feature selection, is shown to perform significantly better when inductive mutual information is used. Comment: 26 pages, LaTeX, 5 figures, 4 tables
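
    As a point of reference for the "descriptive" quantity mentioned in this abstract, the following is a minimal sketch of the plug-in mutual information computed from a contingency table of counts. It is not the paper's posterior mean or variance; the function name and the example tables are assumptions made for illustration only.

```python
import math


def descriptive_mutual_information(counts):
    """Plug-in mutual information (in nats) from a table of joint counts.

    counts[i][j] is how often category i of X co-occurred with category j of Y.
    This is the descriptive (empirical) value, not a Bayesian posterior quantity.
    """
    n = sum(sum(row) for row in counts)
    row_totals = [sum(row) for row in counts]
    col_totals = [sum(col) for col in zip(*counts)]
    mi = 0.0
    for i, row in enumerate(counts):
        for j, n_ij in enumerate(row):
            if n_ij > 0:  # empty cells contribute nothing to the sum
                mi += (n_ij / n) * math.log(n_ij * n / (row_totals[i] * col_totals[j]))
    return mi


# Hypothetical tables: one with independent variables, one fully dependent.
independent = [[10, 10], [10, 10]]
dependent = [[20, 0], [0, 20]]
print(descriptive_mutual_information(independent))
print(descriptive_mutual_information(dependent))
```

    The computation visits each cell of the table once, which is the order of complexity the abstract refers to when comparing against descriptive mutual information.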

    Discretization of Continuous Attributes

    7 pages. In the data mining field, many learning methods, such as association rules, Bayesian networks, and induction rules (Grzymala-Busse & Stefanowski, 2001), can handle only discrete attributes. Therefore, before the machine learning process, it is necessary to re-encode each continuous attribute as a discrete attribute made up of a set of intervals. For example, the age attribute can be transformed into two discrete values representing two intervals: less than 18 (a minor) and 18 and more (of age). This process, known as discretization, is an essential data preprocessing task, not only because some learning methods do not handle continuous attributes, but also for other important reasons: data transformed into a set of intervals are more cognitively relevant for human interpretation (Liu, Hussain, Tan & Dash, 2002); computation goes faster with a reduced level of data, particularly when some attributes are removed from the representation space of the learning problem because no relevant cut can be found (Mittal & Cheong, 2002); and discretization can capture non-linear relations, e.g., infants and elderly people are more sensitive to illness
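
    The age example in this abstract can be sketched as follows. This is a minimal illustration that assumes the cut points are already known (here the single cut at 18 taken from the abstract); the function name and the sample ages are hypothetical, and no method for finding cut points is implied.

```python
from bisect import bisect_right


def discretize(value, cut_points, labels):
    """Map a continuous value to the label of the interval it falls in.

    cut_points must be sorted in ascending order. A value equal to a cut
    point is placed in the interval that starts at that cut point.
    """
    if len(labels) != len(cut_points) + 1:
        raise ValueError("need exactly one more label than cut points")
    return labels[bisect_right(cut_points, value)]


# Hypothetical ages, split at the single cut point 18 from the abstract.
ages = [3, 17, 18, 42]
age_classes = [discretize(a, [18], ["minor", "of age"]) for a in ages]
print(age_classes)
```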

    Classifiers for educational technology

    Peer reviewed

    Feature Expansion for Social Media User Characterization

    Personality plays an impactful role in our lives, and psychologists believe that an individual's behavior can be inferred from their personality. Recently, there have been cases of influential people on social media spreading misinformation, which is a potentially dangerous action. To prevent it, we need to identify which users will negatively impact the community, and we might be able to predict such behavior through personality recognition from their social media posts. This dissertation presents an approach to personality recognition from text. During the literature review, we learned that a text analysis tool called Linguistic Inquiry and Word Count (LIWC) is repeatedly used with success for tasks of this type, so we chose the LIWC dictionary as the base feature set. We also found that Support-Vector Machine classifiers exhibit the best results. From these two findings, we outlined the following objectives: (i) explore machine learning algorithms different from the ones used in related works to find one that produces better results; (ii) analyze whether extending LIWC's vocabulary without supervision improves the classification results. For training and testing, we used a data set of stream-of-consciousness essays comprising 2468 samples annotated with the Big Five personality traits of the writer: openness to experience, conscientiousness, extraversion, agreeableness, and neuroticism. We used four machine learning algorithms for classification: Support-Vector Machine, Naive Bayes, Decision Tree, and Random Forest. We also selected two methods for vocabulary expansion: WordNet's synsets and Word Embeddings. The results show that the Random Forest classifier performs similarly to the algorithms used in related works, with an average accuracy of approximately 56.5%, which is a promising result. The vocabulary expansions allowed the algorithm to match 0.6% more words from the essay data set. However, the changes to the classification results were not significant, so the vocabulary expansion was not beneficial.

    Analysis of Intelligent Classifiers and Enhancing the Detection Accuracy for Intrusion Detection System

    In this paper we discuss and analyze some of the intelligent classifiers that allow automatic detection and classification of network attacks for any intrusion detection system. We first analyze them using the WEKA software, applying the classifiers to a well-known IDS (Intrusion Detection System) dataset, NSL-KDD. The NSL-KDD dataset of network attacks was created in a military network by MIT Lincoln Labs. We then discuss and experiment with some of the hybrid AI (Artificial Intelligence) classifiers that can be used for IDS, and finally we develop Java software with the three most efficient classifiers and compare it with other options. The outputs show the detection accuracy and efficiency of the single and combined classifiers used.

    What is behind a summary-evaluation decision?

    Research in psychology has reported that, among the variety of possibilities for assessment methodologies, summary evaluation offers a particularly adequate context for inferring text comprehension and topic understanding. However, grades obtained in this methodology are hard to quantify objectively. Therefore, we carried out an empirical study to analyze the decisions underlying human summary-grading behavior. The task consisted of expert evaluation of summaries produced in critically relevant contexts of summarization development, and the resulting data were modeled by means of Bayesian networks using an application called Elvira, which allows for graphically observing the predictive power (if any) of the resultant variables. Thus, in this article, we analyzed summary-evaluation decision making in a computational framework.

    Speaker Prediction based on Head Orientations
