A traffic classification method using machine learning algorithm

Author: Chishti Hamayoun Rauf
Publication venue: University of Bedfordshire
Publication date: 01/01/2013

Applying concepts of attack investigation in the IT industry, this work designs a traffic classification method that uses data mining techniques and machine learning algorithms to classify normal and malicious traffic. This classification will help to learn about the unknown attacks faced by the IT industry. Traffic classification is not a new concept; plenty of work has been done to classify network traffic for heterogeneous applications. Existing techniques (payload-based, port-based, and statistical) have their own pros and cons, which are discussed later in this work, but classification using machine learning techniques is still an open field to explore and has provided very promising results so far.

Using Machine Learning to Forecast Future Earnings

Authors: Cui Xinyue; Xu Zhaoyu; Zhou Yue
Publication date: 26/05/2020

In this essay, we evaluate the feasibility and suitability of adopting machine learning models to forecast corporate fundamentals (i.e. earnings), comparing the prediction results of our method with both analysts' consensus estimates and traditional statistical models. Our model proves capable of serving as a useful auxiliary tool for analysts to make better predictions of company fundamentals. Compared with traditional statistical models widely adopted in the industry, such as logistic regression, our method achieves satisfactory improvements in both prediction accuracy and speed. We are also confident that there is still great potential for this model to evolve, and we hope that in the near future machine learning models could perform even better than professional analysts.

arXiv.org e-Print Archive

A review of associative classification mining

Author: Thabtah Fadi
Publication date: 01/01/2007

Associative classification mining is a promising approach in data mining that uses association rule discovery techniques to construct classification systems, also known as associative classifiers. In the last few years, a number of associative classification algorithms have been proposed, e.g. CPAR, CMAR, MCAR, MMAC and others. These algorithms employ several different rule discovery, rule ranking, rule pruning, rule prediction and rule evaluation methods. This paper surveys and compares the state-of-the-art associative classification techniques with regard to the above criteria. Finally, future directions in associative classification, such as incremental learning and mining low-quality data sets, are also highlighted.

CiteSeerX

University of Huddersfield Repository

Stochastic Attribute-Value Grammars

Author: Abney Steven
Publication date: 23/10/1996

Probabilistic analogues of regular and context-free grammars are well-known in computational linguistics, and currently the subject of intensive research. To date, however, no satisfactory probabilistic analogue of attribute-value grammars has been proposed: previous attempts have failed to define a correct parameter-estimation algorithm. In the present paper, I define stochastic attribute-value grammars and give a correct algorithm for estimating their parameters. The estimation algorithm is adapted from Della Pietra, Della Pietra, and Lafferty (1995). To estimate model parameters, it is necessary to compute the expectations of certain functions under random fields. In the application discussed by Della Pietra, Della Pietra, and Lafferty (representing English orthographic constraints), Gibbs sampling can be used to estimate the needed expectations. The fact that attribute-value grammars generate constrained languages makes Gibbs sampling inapplicable, but I show how a variant of Gibbs sampling, the Metropolis-Hastings algorithm, can be used instead. Comment: 23 pages, 21 Postscript figures, uses rotate.st
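
The abstract above relies on Metropolis-Hastings sampling to estimate expectations. As a generic illustration only (not the paper's grammar-specific sampler), the following minimal sketch draws samples from an unnormalized distribution over a small discrete state space and averages them. The function name `metropolis_hastings`, the toy weight function, the wrap-around proposal, and the seed are all assumptions made for this example.

```python
import random


def metropolis_hastings(weight, n_states, n_samples, seed=0):
    """Sample states 0..n_states-1 in proportion to weight(state) > 0.

    Uses a symmetric proposal (step -1 or +1 with wrap-around), so the
    acceptance probability reduces to min(1, weight(new) / weight(old)).
    """
    rng = random.Random(seed)
    state = 0
    samples = []
    for _ in range(n_samples):
        proposal = (state + rng.choice((-1, 1))) % n_states
        accept_prob = min(1.0, weight(proposal) / weight(state))
        if rng.random() < accept_prob:
            state = proposal
        samples.append(state)
    return samples


# Toy target (hypothetical): probability of state x proportional to x + 1.
samples = metropolis_hastings(lambda x: x + 1, n_states=5, n_samples=20000)
estimated_mean = sum(samples) / len(samples)
# For this toy target the exact mean is 40 / 15.
```

Only ratios of weights are needed, so the normalizing constant of the target distribution never has to be computed.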

arXiv.org e-Print Archive

CiteSeerX

An Approach to Find Missing Values in Medical Datasets

Authors: Bai B. Mathura; Mangathayaru N.; Rani B. Padmaja
Publication venue: 'Association for Computing Machinery (ACM)'
Publication date: 25/04/2016

Mining medical datasets is a challenging problem for data mining researchers, as these datasets have several hidden challenges compared to conventional datasets. Starting from the collection of samples through field experiments and clinical trials to performing classification, there are numerous challenges at every stage of the mining process. The preprocessing phase is itself a challenging issue when working on medical datasets. One of the prime challenges in mining medical datasets is handling missing values, which is part of the preprocessing phase. In this paper, we address the issue of handling missing values in medical datasets consisting of categorical attribute values. The main contribution of this research is to use the proposed imputation measure to estimate and fix the missing values. We discuss a case study to demonstrate the working of the proposed measure. Comment: 7 pages, ACM Digital Library, ICEMIS September 201

arXiv.org e-Print Archive

ELKI: A large open-source library for data analysis - ELKI Release 0.7.5 "Heidelberg"

Authors: Schubert Erich; Zimek Arthur
Publication date: 10/02/2019

This paper documents the release of the ELKI data mining framework, version 0.7.5. ELKI is open source (AGPLv3) data mining software written in Java. The focus of ELKI is research in algorithms, with an emphasis on unsupervised methods in cluster analysis and outlier detection. In order to achieve high performance and scalability, ELKI offers data index structures such as the R*-tree that can provide major performance gains. ELKI is designed to be easy to extend for researchers and students in this domain, and welcomes contributions of additional methods. ELKI aims at providing a large collection of highly parameterizable algorithms, in order to allow easy and fair evaluation and benchmarking of algorithms. We first outline the motivation for this release and the plans for the future, and then give a brief overview of the new functionality in this version. We also include an appendix presenting an overview of the overall implemented functionality.

arXiv.org e-Print Archive

Modeling Social Networks with Node Attributes using the Multiplicative Attribute Graph Model

Authors: Kim Myunghwan; Leskovec Jure
Publication date: 01/01/2011

Networks arising from social, technological and natural domains exhibit rich connectivity patterns, and nodes in such networks are often labeled with attributes or features. We address the question of modeling the structure of networks where nodes have attribute information. We present a Multiplicative Attribute Graph (MAG) model that considers nodes with categorical attributes and models the probability of an edge as the product of individual attribute link formation affinities. We develop a scalable variational expectation maximization parameter estimation method. Experiments show that the MAG model reliably captures network connectivity and provides insights into how different attributes shape the network structure. Comment: 15 pages, 7 figures, 7 table
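
The edge rule described in this abstract (edge probability as a product of per-attribute affinities) can be sketched as follows. This is a minimal illustration, assuming binary attributes and one 2x2 affinity table per attribute; the function name `mag_edge_probability` and all numeric values are hypothetical and not taken from the paper.

```python
from math import prod


def mag_edge_probability(attrs_u, attrs_v, affinities):
    """Return the product over attributes k of affinities[k][a_u][a_v].

    attrs_u, attrs_v: sequences of categorical attribute values (here 0 or 1).
    affinities: one table per attribute, indexed by the two nodes' values.
    """
    return prod(
        table[a_u][a_v]
        for table, a_u, a_v in zip(affinities, attrs_u, attrs_v)
    )


# Two binary attributes with made-up affinity tables.
affinities = [
    [[0.9, 0.2], [0.2, 0.5]],  # attribute 0
    [[0.8, 0.4], [0.4, 0.1]],  # attribute 1
]

# Nodes agree on attribute 0 (both 0) and differ on attribute 1 (1 vs 0),
# so the probability is 0.9 * 0.4.
p = mag_edge_probability((0, 1), (0, 0), affinities)
```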

arXiv.org e-Print Archive

CiteSeerX

Stale View Cleaning: Getting Fresh Answers from Stale Materialized Views

Authors: Franklin Michael J.; Goldberg Ken; Kraska Tim; Krishnan Sanjay; Wang Jiannan
Publication venue: 'VLDB Endowment'
Publication date: 24/09/2015

Materialized views (MVs), stored pre-computed results, are widely used to facilitate fast queries on large datasets. When new records arrive at a high rate, it is infeasible to continuously update (maintain) MVs and a common solution is to defer maintenance by batching updates together. Between batches the MVs become increasingly stale with incorrect, missing, and superfluous rows leading to increasingly inaccurate query results. We propose Stale View Cleaning (SVC) which addresses this problem from a data cleaning perspective. In SVC, we efficiently clean a sample of rows from a stale MV, and use the clean sample to estimate aggregate query results. While approximate, the estimated query results reflect the most recent data. As sampling can be sensitive to long-tailed distributions, we further explore an outlier indexing technique to give increased accuracy when the data distributions are skewed. SVC complements existing deferred maintenance approaches by giving accurate and bounded query answers between maintenance. We evaluate our method on a generated dataset from the TPC-D benchmark and a real video distribution application. Experiments confirm our theoretical results: (1) cleaning an MV sample is more efficient than full view maintenance, (2) the estimated results are more accurate than using the stale MV, and (3) SVC is applicable for a wide variety of MVs
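
The core idea in this abstract, cleaning only a sample of stale rows and using it to estimate an aggregate, can be illustrated with a deliberately simplified sketch. It assumes the view is a mapping from key to numeric value, that a hypothetical `clean_row` callback returns the up-to-date value for a key, and that no rows were inserted or deleted since the last maintenance. It is not the estimator from the paper, and it omits the outlier index.

```python
import random


def estimate_sum(stale_view, clean_row, sample_size, seed=0):
    """Estimate the fresh SUM over a view by cleaning a uniform row sample.

    stale_view: dict mapping row key -> stale value.
    clean_row: callable returning the fresh value for a key (assumed given).
    """
    rng = random.Random(seed)
    sampled_keys = rng.sample(sorted(stale_view), sample_size)
    cleaned_values = [clean_row(key) for key in sampled_keys]
    # Scale the sample mean up to the full view size.
    return len(stale_view) * sum(cleaned_values) / sample_size


# Toy data: every stale row holds 10; the fresh rows have drifted upward.
stale_view = {key: 10 for key in range(100)}
fresh_rows = {key: 10 + key % 3 for key in range(100)}

approx_sum = estimate_sum(stale_view, fresh_rows.__getitem__, sample_size=20)
exact_sum = estimate_sum(stale_view, fresh_rows.__getitem__, sample_size=100)
```

Cleaning 20 rows costs a fifth of full maintenance in this toy setting; with the sample size equal to the view size the estimate coincides with the true fresh sum.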

arXiv.org e-Print Archive

Cloud Service Provider Evaluation System using Fuzzy Rough Set Technique

Authors: Anjana Parwat Singh; Badiwal Priyanka; Rao C. Raghavendra; Wankar Rajeev
Publication venue: 'Institute of Electrical and Electronics Engineers (IEEE)'
Publication date: 27/10/2018

Cloud Service Providers (CSPs) offer a wide variety of scalable, flexible, and cost-efficient services to cloud users on an on-demand, pay-per-utilization basis. However, the vast diversity of available cloud service providers makes it challenging for users to determine and select the most suitable service. In addition, users sometimes need to hire the required services from multiple CSPs, which introduces difficulties in managing interfaces, accounts, security, support, and Service Level Agreements (SLAs). To circumvent such problems, a Cloud Service Broker (CSB) that is aware of service offerings and users' Quality of Service (QoS) requirements will benefit both CSPs and users. In this work, we propose a Fuzzy Rough Set based Cloud Service Brokerage Architecture, which is responsible for ranking and selecting services based on users' QoS requirements and, finally, monitoring the service execution. We use the fuzzy rough set technique for dimension reduction and weighted Euclidean distance to rank the CSPs. To prioritize user QoS requests, we use user-assigned weights and also incorporate system-assigned weights to give relative importance to QoS attributes. We compare the proposed ranking technique with an existing method based on system response time. The case study experiment results show that the proposed approach is scalable and resilient, and produces better results with less searching time. Comment: 12 pages, 7 figures, and 8 table
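
The ranking step mentioned in this abstract can be illustrated with a small sketch of weighted Euclidean distance. The abstract does not say what the distance is measured against, so this example assumes it is the user's requested QoS vector, with a smaller distance ranking higher. The provider names, QoS values, and weights are invented for illustration.

```python
from math import sqrt


def weighted_euclidean(values, target, weights):
    """Return sqrt(sum of w_i * (x_i - t_i)^2) over the QoS attributes."""
    return sqrt(
        sum(w * (x - t) ** 2 for w, x, t in zip(weights, values, target))
    )


def rank_providers(providers, target, weights):
    """Order provider names from closest to farthest from the target."""
    return sorted(
        providers,
        key=lambda name: weighted_euclidean(providers[name], target, weights),
    )


# Hypothetical normalized QoS scores: (availability, throughput).
providers = {
    "csp_a": (0.90, 0.50),
    "csp_b": (0.60, 0.90),
    "csp_c": (0.95, 0.85),
}
target = (1.0, 1.0)   # assumed user requirement
weights = (0.7, 0.3)  # assumed attribute weights

ranking = rank_providers(providers, target, weights)
```

With these numbers `csp_c` is closest to the target, followed by `csp_a` and then `csp_b`, because the heavier weight on the first attribute penalizes `csp_b` most.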

arXiv.org e-Print Archive

Stable Multiple Time Step Simulation/Prediction from Lagged Dynamic Network Regression Models

Authors: Almquist Zack W.; Mallik Abhirup
Publication date: 23/07/2018

Recent developments in computers and automated data collection strategies have greatly increased the interest in statistical modeling of dynamic networks. Many of the statistical models employed for inference on large-scale dynamic networks suffer from limited forward simulation/prediction ability. A major problem with many of the forward simulation procedures is the tendency for the model to become degenerate in only a few time steps, i.e., the simulation/prediction procedure results in either null graphs or complete graphs. Here, we describe an algorithm for simulating a sequence of networks generated from lagged dynamic network regression models DNR(V), a sub-family of TERGMs. We introduce a smoothed estimator for forward prediction based on smoothing of the change statistics obtained for a dynamic network regression model. We focus on the implementation of the algorithm, providing a series of motivating examples with comparisons to dynamic network models from the literature. We find that our algorithm significantly improves multi-step prediction/simulation over standard DNR(V) forecasting. Furthermore, we show that our method performs comparably to existing, more complex dynamic network analysis frameworks (SAOM and STERGMs) for small networks over short time periods, and significantly outperforms these approaches over long time intervals and/or large networks.

arXiv.org e-Print Archive