966 research outputs found

    Network problems detection and classification by analyzing syslog data

    Network troubleshooting is an important process with a wide research field. The first step in troubleshooting is to collect information in order to diagnose the problems. Syslog messages, which are sent by almost all network devices, contain a massive amount of data related to network problems. Many previous studies have analyzed syslog data as a guide to network problems and their causes. Detecting network problems could be more efficient if the detected problems were classified in terms of network layers. Classifying syslog data requires identifying the syslog messages that describe the network problems for each layer, taking into account the different syslog formats used by different vendors’ devices. This study provides a method to classify syslog messages that indicate network problems in terms of network layers. The method used a data mining tool to classify the syslog messages, and the description part of each syslog message was used for the classification process. Related syslog messages were identified; features were then selected to train the classifiers. Six classification algorithms were trained: LibSVM, SMO, KNN, Naïve Bayes, J48, and Random Forest. A real data set obtained from the network devices of Universiti Utara Malaysia (UUM) was used for the prediction stage. Results indicate that SVM shows the best performance during the training and prediction stages. This study contributes to the fields of network troubleshooting and text data classification.
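
The idea in the abstract above (classify a syslog message by network layer from its description part) can be sketched roughly as follows. This is a minimal, dependency-free illustration and not the authors' pipeline: it uses a small multinomial Naive Bayes (one of the six algorithm families listed, chosen here only for brevity), the training messages and layer labels are invented for illustration, and the `%FACILITY-SEVERITY-MNEMONIC:` prefix handling assumes a Cisco-style message layout.

```python
import math
import re
from collections import Counter, defaultdict

# Hypothetical training data: (description text, network layer label).
# Both the messages and the layer assignments are illustrative assumptions.
TRAINING = [
    ("Interface GigabitEthernet0/1 changed state to down", "physical"),
    ("Interface GigabitEthernet0/3 link down cable unplugged", "physical"),
    ("Line protocol on Interface FastEthernet0/2 changed state to down", "data-link"),
    ("Spanning tree topology change detected on Vlan20", "data-link"),
    ("Duplicate address 10.0.0.5 detected on Vlan10", "network"),
    ("OSPF neighbor 10.0.0.2 went from FULL to DOWN", "network"),
]

def description_part(message):
    """Return the free-text description after a Cisco-style tag, if present."""
    match = re.search(r"%[A-Z0-9_]+-\d+-[A-Z0-9_]+:\s*(.*)", message)
    return match.group(1) if match else message

def tokenize(text):
    """Lowercase the text and keep alphabetic runs only (digits are dropped)."""
    return re.findall(r"[a-z]+", text.lower())

def train(examples):
    """Count words per class for a multinomial Naive Bayes model."""
    word_counts = defaultdict(Counter)
    class_counts = Counter()
    vocab = set()
    for text, label in examples:
        class_counts[label] += 1
        for token in tokenize(text):
            word_counts[label][token] += 1
            vocab.add(token)
    return word_counts, class_counts, vocab

def predict(model, text):
    """Return the most probable label using add-one (Laplace) smoothing."""
    word_counts, class_counts, vocab = model
    total_docs = sum(class_counts.values())
    best_label, best_score = None, -math.inf
    for label, doc_count in class_counts.items():
        score = math.log(doc_count / total_docs)
        total_words = sum(word_counts[label].values())
        for token in tokenize(text):
            if token not in vocab:
                continue  # ignore words never seen in training
            score += math.log(
                (word_counts[label][token] + 1) / (total_words + len(vocab))
            )
        if score > best_score:
            best_label, best_score = label, score
    return best_label

model = train(TRAINING)
raw = "%LINK-3-UPDOWN: Interface GigabitEthernet0/1, changed state to down"
print(predict(model, description_part(raw)))
```

With a real data set, the same structure would apply, but the labels would come from manually identified messages per layer, and a library classifier such as an SVM could replace the hand-written model.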

    Data Masking, Encryption, and their Effect on Classification Performance: Trade-offs Between Data Security and Utility

    As data mining increasingly shapes organizational decision-making, the quality of its results must be questioned to ensure trust in the technology. Inaccuracies can mislead decision-makers and cause costly mistakes. With more data collected for analytical purposes, privacy is also a major concern. Data security policies and regulations are increasingly put in place to manage risks, but these policies and regulations often employ technologies that substitute and/or suppress sensitive details contained in the data sets being mined. Data masking and substitution and/or data encryption and suppression of sensitive attributes from data sets can limit access to important details. It is believed that the use of data masking and encryption can impact the quality of data mining results. This dissertation investigated and compared the causal effects of data masking and encryption on classification performance as a measure of the quality of knowledge discovery. A review of the literature found a gap in the body of knowledge, indicating that this problem had not been studied before in an experimental setting. The objective of this dissertation was to gain an understanding of the trade-offs between data security and utility in the field of analytics and data mining. The research used a nationally recognized cancer incidence database to show how masking and encryption of potentially sensitive demographic attributes, such as patients’ marital status, race/ethnicity, origin, and year of birth, could have a statistically significant impact on the patients’ predicted survival. Performance parameters measured by four different classifiers showed sizable variations in the range of 9% to 10% between a control group, where the selected attributes were untouched, and two experimental groups, where the attributes were substituted or suppressed to simulate the effects of the data protection techniques. In practice, this corroborated the potential risk involved in basing medical treatment decisions on data mining applications where attributes in the data sets are masked or encrypted for patient privacy and security reasons.

    A Survey of Email Spam Filtering Methods

    E-mail is one of the most secure media for online communication and for transferring data or messages through the web. With its ever-growing popularity, the volume of unsolicited messages has also increased rapidly. To filter such messages, different approaches exist that automatically detect and remove them. There are several email spam filtering techniques, such as knowledge-based techniques, clustering techniques, learning-based techniques, heuristic processes, and so on. This paper surveys existing email spam filtering systems based on Machine Learning Techniques (MLT) such as Naive Bayes, SVM, K-Nearest Neighbor, Bayes Additive Regression, KNN Tree, and rules. We present the classification, evaluation, and comparison of different email spam filtering systems. Keywords: e-mail spam, spam filtering methods, machine learning technique, classification, SVM, AN

    Concept graphs: Applications to biomedical text categorization and concept extraction

    As science advances, the underlying literature grows rapidly, providing valuable knowledge mines for researchers and practitioners. The text content that makes up these knowledge collections is often unstructured and, thus, extracting relevant or novel information can be nontrivial and costly. In addition, human knowledge and expertise are being transformed into structured digital information in the form of vocabulary databases and ontologies. These knowledge bases hold substantial hierarchical and semantic relationships of common domain concepts. Consequently, automated learning tasks could be reinforced with those knowledge bases by constructing human-like representations of knowledge. This allows developing algorithms that simulate the human reasoning tasks of content perception, concept identification, and classification. This study explores the representation of text documents using concept graphs that are constructed with the help of a domain ontology. In particular, the target data sets are collections of biomedical text documents, and the domain ontology is a collection of predefined biomedical concepts and the relationships among them. The proposed representation preserves those relationships and allows the structural features of graphs to be used in text mining and learning algorithms. Those features emphasize the significance of the underlying relationship information that exists in the text content behind the interrelated topics and concepts of a text document. The experiments presented in this study include text categorization and concept extraction applied to biomedical data sets. The experimental results demonstrate how the relationships extracted from text and captured in graph structures can be used to improve the performance of the aforementioned applications. The discussed techniques can be used in creating and maintaining digital libraries through enhanced indexing, retrieval, and management of documents, as well as in a broad range of domain-specific applications such as drug discovery, hypothesis generation, and the analysis of molecular structures in chemoinformatics.

    Predictive analysis of incidents based on software deployments

    Many IT organizations have problems when deploying their services. This, together with the large number of services that organizations handle daily, makes the Incident Management (IM) process quite demanding. An effective IM system needs to enable decision makers to detect problems easily; otherwise, organizations can face unscheduled system downtime and/or unplanned costs. By predicting these problems, decision makers can better allocate resources and mitigate costs. Therefore, this research aims to help predict those problems by looking at the history of past deployments and incident ticket creation and relating them, using machine learning algorithms, to predict the number of incidents of a given deployment. This research aims to analyze the results with the most used algorithms found in the literature.

    Automatic Classification of the Berliner Handreichungen zur Bibliotheks- und Informationswissenschaft

    Classification systems are one of the most established methods of knowledge organization, with many advantages, and yet the collection of the Berliner Handreichungen zur Bibliotheks- und Informationswissenschaft (BHR) is missing a classification scheme. Therefore, an objective of the thesis at hand is to achieve a classification system for the collection and to potentially use Machine Learning (ML) methods for the automatic allocation of the BHR documents to the obtained classification system. The research questions to be answered are whether the JITA Classification System of Library and Information Science (JITA) is an appropriate classification system for the BHR, and whether automatic classification with ML can be applied to allocate the documents of the collection to a classification system without using BHR data in the training dataset. To evaluate JITA, an evaluation checklist was created based on recommendations of the cited literature. Using this checklist, it was concluded that JITA is not suitable as a classification system for the BHR. Thus, using the same checklist as a reference, a new classification system was created. Neither expert evaluations nor user studies were conducted, which is a clear limitation of the thesis at hand. After a suitable classification scheme for the BHR was created, titles and abstracts of documents from different sources were scraped to use them as the training set for the ML experiments. Naïve Bayes, SVM, and Logistic Regression classifiers as well as Deep Learning classifiers, using the FLAIR framework, were tested. None of the obtained models yielded satisfying results, which is why no further experiments classifying the BHR documents were conducted. It was concluded that an automatic classification of the BHR documents is not possible without a BHR training set. Several limitations, especially during the creation of the training set, could have led to the unsatisfactory results; these are discussed in this thesis, which offers a basis for future studies that aim to evaluate classification schemes or to conduct further text classification experiments.

    Analysis of Twitter Data Using Deep Learning Approach: LSTM

    Sentiment analysis is the process of computationally identifying and categorizing opinions expressed in a piece of text, especially in order to determine whether the writer’s attitude toward a particular topic, product, etc. is positive, negative, or neutral [1]. Nowadays, the growth of social websites, blogging services, and electronic media contributes a large amount of user-generated messages, including customer reviews, comments, and opinions. Sentiment analysis is an important term that refers to gathering information from a source by using NLP, computational [2] linguistics, and text analysis, and to making decisions from subjective information by extracting and analyzing opinions, identifying positive and negative opinions, and measuring how positively or negatively an entity (public, organization, product) is regarded. In the past decade, researchers have performed sentiment analysis using machine learning techniques such as support vector machines, naive Bayes, the maximum entropy method, etc. Sentiment analysis on social media text has received a lot of attention because it contains suggestions and recommendations. Recently, deep learning methods like long short-term memory (LSTM) and convolutional neural networks (CNN) have gained popularity by showing promising results for speech and image processing and for tasks in NLP, by learning feature-rich deep representations from the data automatically.