
    Abstraction, aggregation and recursion for generating accurate and simple classifiers

    An important goal of inductive learning is to generate accurate and compact classifiers from data. In a typical inductive learning scenario, instances in a data set are simply represented as ordered tuples of attribute values. In our research, we explore three methodologies for improving the accuracy and compactness of classifiers: abstraction, aggregation, and recursion.

    Firstly, abstraction is aimed at the design and analysis of algorithms that generate and use taxonomies for the construction of compact and robust classifiers. In many applications of the data-driven knowledge discovery process, taxonomies have been shown to be useful in constructing compact, robust, and comprehensible classifiers. However, in many application domains, human-designed taxonomies are unavailable. We introduce algorithms for the automated, inductive construction of taxonomies from both structured data (such as the UCI Repository) and unstructured data (such as text and biological sequences). We introduce AVT-Learner, an algorithm for the automated construction of attribute value taxonomies (AVTs) from data, and Word Taxonomy Learner (WTL), an algorithm for the automated construction of word taxonomies from text and sequence data. We describe experiments on the UCI data sets and compare the performance of AVT-NBL (an AVT-guided Naive Bayes Learner) with that of the standard Naive Bayes Learner (NBL). Our results show that the AVTs generated by AVT-Learner are competitive with human-generated AVTs (in cases where such AVTs are available). AVT-NBL using AVTs generated by AVT-Learner achieves classification accuracies that are comparable to or higher than those obtained by NBL, and the resulting classifiers are significantly more compact than those generated by NBL. Similarly, our experimental results for WTL and WTNBL on protein localization sequences and the Reuters newswire text categorization data sets show that the proposed algorithms can generate Naive Bayes classifiers that are more compact and often more accurate than those produced by the standard Naive Bayes learner for the multinomial model.

    Secondly, we apply aggregation to construct features as multisets of values for the intrusion detection task. For this task, we propose a bag of system calls representation for system call traces and describe misuse and anomaly detection results on the University of New Mexico (UNM) and MIT Lincoln Lab (MIT LL) system call sequences with the proposed representation. With this feature representation as input, we compare the performance of several machine learning techniques for misuse detection and show experimental results on anomaly detection. The results show that standard machine learning and clustering techniques using the simple bag of system calls representation, based on the system call traces generated by the operating system's kernel, are effective and often perform better than approaches that use foreign contiguous sequences in detecting intrusive behaviors of compromised processes.

    Finally, we construct a set of classifiers by recursive application of the Naive Bayes learning algorithm. The Naive Bayes (NB) classifier relies on the assumption that the instances in each class can be described by a single generative model. This assumption can be restrictive in many real-world classification tasks. We describe the recursive Naive Bayes learner (RNBL), which relaxes this assumption by constructing a tree of Naive Bayes classifiers for sequence classification, where each individual NB classifier in the tree is based on an event model (one model for each class at each node in the tree). In our experiments on protein sequences, Reuters newswire documents, and UC Irvine benchmark data sets, we observe that RNBL substantially outperforms the NB classifier. Furthermore, our experiments on the protein sequences and the text documents show that RNBL outperforms the C4.5 decision tree learner (using tests on sequence composition statistics as the splitting criterion) and yields accuracies comparable to those of support vector machines (SVMs) using similar information.
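
    As an illustration of the bag of system calls representation described above, the following minimal Python sketch counts how often each system call occurs in a trace to form a fixed-length feature vector; the call vocabulary and the trace are invented for illustration and are not taken from the UNM or MIT LL data sets.

```python
from collections import Counter

def bag_of_system_calls(trace, vocabulary):
    """Map a system call trace (a list of call names) to a count vector.

    Each position in the returned vector holds the number of times the
    corresponding call from `vocabulary` occurs in the trace; ordering
    information within the trace is deliberately discarded.
    """
    counts = Counter(trace)
    return [counts[call] for call in vocabulary]

# Hypothetical vocabulary and trace, for illustration only.
vocabulary = ["open", "read", "write", "close", "execve", "mmap"]
trace = ["open", "read", "read", "write", "close"]

print(bag_of_system_calls(trace, vocabulary))  # [1, 2, 1, 1, 0, 0]
```

    The resulting count vectors can then be passed to any standard classifier or clustering algorithm, as the abstract describes.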

    Automatic log parser to support forensic analysis

    Event log parsing is the process of splitting and labeling each field in a log entry. Existing approaches commonly use regular expressions or parsing rules to extract the fields. However, such techniques are time-consuming, as a forensic investigator needs to define a new rule for each log file type. In this paper, we present a tool, namely nerlogparser, that parses log entries automatically, where log parsing is modeled as a named entity recognition problem. We use a deep learning technique, specifically bidirectional long short-term memory networks, as the underlying architecture for this purpose. Unlike existing tools, nerlogparser is fully automatic, as investigators do not need to define any parsing rules, and it is generic, as a single model parses various types of log files. Experimental results show that nerlogparser achieves superior performance compared with traditional machine learning methods.
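
    As a rough sketch of the kind of architecture the paper describes (log parsing cast as named entity recognition with a bidirectional LSTM), the following Keras model assigns one field label to each token of a log entry. The vocabulary size, label count, sequence length, and layer sizes are placeholder assumptions, not the actual nerlogparser configuration.

```python
import tensorflow as tf
from tensorflow.keras import layers, models

VOCAB_SIZE = 5000   # assumed size of the log-token vocabulary
NUM_LABELS = 12     # assumed number of field labels (timestamp, hostname, message, ...)
MAX_LEN = 40        # assumed maximum number of tokens per log entry

# Token ids go in; a field label is predicted for every token position.
inputs = layers.Input(shape=(MAX_LEN,), dtype="int32")
x = layers.Embedding(VOCAB_SIZE, 64, mask_zero=True)(inputs)
x = layers.Bidirectional(layers.LSTM(64, return_sequences=True))(x)
outputs = layers.TimeDistributed(layers.Dense(NUM_LABELS, activation="softmax"))(x)

model = models.Model(inputs, outputs)
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
model.summary()
```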

    Modelling Digital Media Objects

    Sentiment Analysis of News Tweets

    Sentiment analysis is the process of extracting information from a large amount of data and classifying it into different classes called sentiments. Python is a simple yet powerful, high-level, interpreted, dynamic programming language that is well known for processing natural language data through NLTK (the Natural Language Toolkit). NLTK is a Python library that provides a basis for building programs and classifying data; it also provides graphical demonstrations for representing results and trends, as well as sample data for training and testing classifiers. Sentiment classification aims to automatically predict the sentiment polarity of users publishing sentiment data. Although traditional classification algorithms can be used to train sentiment classifiers from manually labeled text data, the labeling work can be time-consuming and expensive. Meanwhile, users often use different words when they express sentiment in different domains. If we directly apply a classifier trained in one domain to other domains, the performance will be very low because of the differences between these domains. In this work, we develop a general solution to sentiment classification for the case where we have no labels in the target domain but do have some labeled data in a different domain, regarded as the source domain. The purpose of this study is to analyze the tweets of popular local and international news agencies and classify the tweeted news into positive, negative, or neutral categories.
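
    A minimal sketch of the NLTK-based classification workflow the abstract refers to, assuming a Naive Bayes classifier over simple bag-of-words features; the three labeled examples are invented placeholders rather than the labeled source-domain data used in the study.

```python
from nltk.classify import NaiveBayesClassifier

def word_features(text):
    # Bag-of-words feature dictionary in the form NLTK classifiers expect.
    return {word.lower(): True for word in text.split()}

# Hypothetical labeled tweets standing in for the source-domain training data.
train = [
    ("great progress on the new hospital project", "positive"),
    ("severe flooding destroys homes in the region", "negative"),
    ("the city council meets again on monday", "neutral"),
]
train_set = [(word_features(text), label) for text, label in train]

classifier = NaiveBayesClassifier.train(train_set)
print(classifier.classify(word_features("great win for the local team")))
```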

    An automated email classification system for the Ashesi Support Center

    Applied project submitted to the Department of Computer Science, Ashesi University, in partial fulfillment of the Bachelor of Science degree in Computer Science, April 2019. The widespread usage of the internet has made email an indispensable tool for communication within organizations. Today, email is used by support centers as one of the mediums for providing solutions to the daily internal problems organizations face. An example is the Ashesi Support Center, which is the hub for solutions to all problems and questions relating to IT, facilities, logistics, and other issues on the Ashesi University campus. In dealing with problems, the Ashesi Support Center classifies emails as either an IT-related issue or an operations-related issue. However, the support center does not have a way to automatically classify the emails; hence, support personnel manually sift through the emails to group them. This can be a cumbersome process, considering the support center receives over 40 emails daily during peak periods. Harnessing the power of machine learning, a classification model is built to automatically group the emails the Ashesi Support Center receives.
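
    The abstract does not name the model used, so the following scikit-learn sketch only illustrates one plausible setup for the task it describes: TF-IDF features feeding a linear classifier that separates IT-related from operations-related emails. The example emails, labels, and query are invented.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Hypothetical training emails and labels, for illustration only.
emails = [
    "my laptop cannot connect to the campus wifi",
    "please reset my student portal password",
    "the projector in the lecture hall needs to be replaced",
    "the air conditioner in the lab is leaking",
]
labels = ["IT", "IT", "operations", "operations"]

pipeline = make_pipeline(TfidfVectorizer(), LogisticRegression())
pipeline.fit(emails, labels)

print(pipeline.predict(["I forgot my email password"]))
```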