43,635 research outputs found
A traffic classification method using machine learning algorithm
Applying concepts of attack investigation from the IT industry, this work designs
a traffic classification method that applies data mining techniques at their
intersection with machine learning to separate normal from malicious traffic. This classification will
help to learn about the unknown attacks faced by the IT industry. The notion of traffic classification
is not new; plenty of work has been done to classify the network traffic of
heterogeneous applications. Existing techniques (payload-based, port-based,
and statistical) each have their own pros and cons, which are discussed later in this
work, but classification using machine learning techniques is still an open field to explore and has so far provided very promising results
Using Machine Learning to Forecast Future Earnings
In this essay, we comprehensively evaluate the feasibility and
suitability of applying machine learning models to forecasting
corporate fundamentals (i.e., earnings), comparing the predictions of
our method with both analysts' consensus
estimates and traditional statistical models. As a result, our model
proves capable of serving as a useful auxiliary tool for
analysts making predictions on company fundamentals. Compared with
traditional statistical models widely adopted in the industry,
such as logistic regression, our method achieves a satisfactory
improvement in both prediction accuracy and speed. We are also
confident that there is still considerable potential for this model to
evolve, and we hope that in the near future machine learning models
could outperform professional analysts
A review of associative classification mining
Associative classification mining is a promising approach in data mining that utilizes the
association rule discovery techniques to construct classification systems, also known as
associative classifiers. In the last few years, a number of associative classification algorithms
have been proposed, e.g., CPAR, CMAR, MCAR, MMAC and others. These algorithms
employ several different rule discovery, rule ranking, rule pruning, rule prediction and rule
evaluation methods. This paper focuses on surveying and comparing the state-of-the-art associative
classification techniques with regard to the above criteria. Finally, future directions in associative
classification, such as incremental learning and mining low-quality data sets, are also
highlighted in this paper
Stochastic Attribute-Value Grammars
Probabilistic analogues of regular and context-free grammars are well-known
in computational linguistics, and currently the subject of intensive research.
To date, however, no satisfactory probabilistic analogue of attribute-value
grammars has been proposed: previous attempts have failed to define a correct
parameter-estimation algorithm.
In the present paper, I define stochastic attribute-value grammars and give a
correct algorithm for estimating their parameters. The estimation algorithm is
adapted from Della Pietra, Della Pietra, and Lafferty (1995). To estimate model
parameters, it is necessary to compute the expectations of certain functions
under random fields. In the application discussed by Della Pietra, Della
Pietra, and Lafferty (representing English orthographic constraints), Gibbs
sampling can be used to estimate the needed expectations. The fact that
attribute-value grammars generate constrained languages makes Gibbs sampling
inapplicable, but I show how a variant of Gibbs sampling, the
Metropolis-Hastings algorithm, can be used instead.
Comment: 23 pages, 21 Postscript figures, uses rotate.st
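For readers unfamiliar with the sampler named in this abstract, the following is a minimal, generic Metropolis-Hastings sketch. It is not the paper's estimation algorithm: the five-state toy target, the `toy_weight` function, and the symmetric `ring_proposal` are all assumptions chosen only to show how an expectation can be estimated from an unnormalized weight function.

```python
import math
import random


def metropolis_hastings(weight, propose, start, n_steps, seed=0):
    """Generic Metropolis-Hastings chain for a SYMMETRIC proposal.

    weight  -- unnormalized, strictly positive target weight of a state
    propose -- function (state, rng) -> candidate state
    """
    rng = random.Random(seed)
    state = start
    chain = []
    for _ in range(n_steps):
        candidate = propose(state, rng)
        # With a symmetric proposal the acceptance ratio reduces to w(y) / w(x).
        accept_prob = min(1.0, weight(candidate) / weight(state))
        if rng.random() < accept_prob:
            state = candidate
        chain.append(state)
    return chain


# Hypothetical toy target over the states 0..4 (not taken from the paper).
N_STATES = 5


def toy_weight(x):
    return math.exp(-0.5 * x)


def ring_proposal(x, rng):
    # Step left or right on a ring; the move is equally likely in both directions.
    return (x + rng.choice((-1, 1))) % N_STATES


chain = metropolis_hastings(toy_weight, ring_proposal, start=0, n_steps=50_000)
estimate = sum(chain) / len(chain)  # Monte Carlo estimate of E[x]
exact = (sum(x * toy_weight(x) for x in range(N_STATES))
         / sum(toy_weight(x) for x in range(N_STATES)))
```

The sampler only ever evaluates weight ratios, so the normalizing constant of the target is never needed.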
An Approach to Find Missing Values in Medical Datasets
Mining medical datasets is a challenging problem for data mining
researchers, as these datasets pose several hidden challenges compared to
conventional datasets. From the collection of samples through field
experiments and clinical trials to performing classification, there are numerous
challenges at every stage of the mining process. The preprocessing phase of the
mining process is itself challenging when working with medical datasets.
One of the prime challenges in mining medical datasets is handling missing
values, which is part of the preprocessing phase. In this paper, we address the
issue of handling missing values in medical datasets consisting of categorical
attribute values. The main contribution of this research is a proposed
imputation measure used to estimate and fix the missing values. We discuss a case
study to demonstrate the working of the proposed measure.
Comment: 7 pages, ACM Digital Library, ICEMIS September 201
ELKI: A large open-source library for data analysis - ELKI Release 0.7.5 "Heidelberg"
This paper documents the release of the ELKI data mining framework, version
0.7.5.
ELKI is an open source (AGPLv3) data mining software written in Java. The
focus of ELKI is research in algorithms, with an emphasis on unsupervised
methods in cluster analysis and outlier detection. In order to achieve high
performance and scalability, ELKI offers data index structures such as the
R*-tree that can provide major performance gains. ELKI is designed to be easy
to extend for researchers and students in this domain, and welcomes
contributions of additional methods. ELKI aims at providing a large collection
of highly parameterizable algorithms, in order to allow easy and fair
evaluation and benchmarking of algorithms.
We first outline the motivation for this release and the plans for the
future, and then give a brief overview of the new functionality in this
version. We also include an appendix presenting an overview of the overall
implemented functionality
Modeling Social Networks with Node Attributes using the Multiplicative Attribute Graph Model
Networks arising from social, technological and natural domains exhibit rich
connectivity patterns and nodes in such networks are often labeled with
attributes or features. We address the question of modeling the structure of
networks where nodes have attribute information. We present a Multiplicative
Attribute Graph (MAG) model that considers nodes with categorical attributes
and models the probability of an edge as the product of individual attribute
link formation affinities. We develop a scalable variational expectation
maximization parameter estimation method. Experiments show that the MAG model
reliably captures network connectivity and provides insights into how
different attributes shape the network structure.
Comment: 15 pages, 7 figures, 7 table
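The edge model described in this abstract (edge probability as a product of per-attribute affinities) can be sketched in a few lines. This is an illustration under stated assumptions, not the authors' implementation: the affinity values, the two binary attributes, and the undirected sampling in `sample_graph` are all hypothetical.

```python
import random
from itertools import combinations


def edge_probability(attrs_u, attrs_v, affinities):
    """Multiply, over attributes, the affinity for the pair of attribute values."""
    p = 1.0
    for theta, a_u, a_v in zip(affinities, attrs_u, attrs_v):
        p *= theta[a_u][a_v]
    return p


def sample_graph(node_attrs, affinities, seed=0):
    """Draw each undirected edge independently (a simplifying assumption)."""
    rng = random.Random(seed)
    edges = []
    for u, v in combinations(range(len(node_attrs)), 2):
        if rng.random() < edge_probability(node_attrs[u], node_attrs[v], affinities):
            edges.append((u, v))
    return edges


# Hypothetical affinity matrices, one per binary attribute.
affinities = [
    [[0.9, 0.2], [0.2, 0.5]],
    [[0.8, 0.4], [0.4, 0.6]],
]
# Hypothetical categorical attribute vectors for three nodes.
node_attrs = [(0, 0), (0, 1), (1, 1)]

p01 = edge_probability(node_attrs[0], node_attrs[1], affinities)  # 0.9 * 0.4
edges = sample_graph(node_attrs, affinities)
```

Because the probability is a product, a single low affinity on any one attribute is enough to make an edge unlikely, regardless of how well the other attributes match.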
Stale View Cleaning: Getting Fresh Answers from Stale Materialized Views
Materialized views (MVs), stored pre-computed results, are widely used to
facilitate fast queries on large datasets. When new records arrive at a high
rate, it is infeasible to continuously update (maintain) MVs and a common
solution is to defer maintenance by batching updates together. Between batches
the MVs become increasingly stale with incorrect, missing, and superfluous rows
leading to increasingly inaccurate query results. We propose Stale View
Cleaning (SVC) which addresses this problem from a data cleaning perspective.
In SVC, we efficiently clean a sample of rows from a stale MV, and use the
clean sample to estimate aggregate query results. While approximate, the
estimated query results reflect the most recent data. As sampling can be
sensitive to long-tailed distributions, we further explore an outlier indexing
technique to give increased accuracy when the data distributions are skewed.
SVC complements existing deferred maintenance approaches by giving accurate and
bounded query answers between maintenance. We evaluate our method on a
generated dataset from the TPC-D benchmark and a real video distribution
application. Experiments confirm our theoretical results: (1) cleaning an MV
sample is more efficient than full view maintenance, (2) the estimated results
are more accurate than using the stale MV, and (3) SVC is applicable for a wide
variety of MVs
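The core idea in this abstract, cleaning only a sample of view rows and scaling up the aggregate, can be illustrated with a deliberately simplified sketch. It is not the SVC estimator itself: the hash-based sampler `in_sample`, the `fetch_fresh` callback, and the toy data are assumptions, and the sketch omits the error bounds and outlier indexing the abstract mentions.

```python
import zlib


def in_sample(key, ratio):
    """Deterministic hash-based sampling: keep roughly `ratio` of all keys."""
    return zlib.crc32(str(key).encode()) % 10_000 < ratio * 10_000


def estimate_sum(candidate_keys, fetch_fresh, ratio):
    """Estimate SUM over the up-to-date view by cleaning only sampled rows.

    candidate_keys -- keys in the stale view plus keys touched by pending updates
    fetch_fresh    -- hypothetical callback returning the current value for a key,
                      or None if the row no longer belongs in the view
    """
    total = 0.0
    for key in candidate_keys:
        if in_sample(key, ratio):
            value = fetch_fresh(key)  # the expensive "cleaning" step
            if value is not None:
                total += value
    return total / ratio  # scale the sample total up to the full view


# Toy data: row 2 is incorrect, row 3 is superfluous, row 4 is missing.
stale_view = {1: 10, 2: 20, 3: 30}
fresh_view = {1: 10, 2: 25, 4: 40}
candidates = set(stale_view) | set(fresh_view)

stale_answer = sum(stale_view.values())
full_clean_answer = estimate_sum(candidates, fresh_view.get, ratio=1.0)
sampled_answer = estimate_sum(candidates, fresh_view.get, ratio=0.5)
```

With `ratio=1.0` every row is cleaned and the estimate equals the exact fresh answer; smaller ratios trade accuracy for less cleaning work.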
Cloud Service Provider Evaluation System using Fuzzy Rough Set Technique
Cloud Service Providers (CSPs) offer a wide variety of scalable, flexible,
and cost-efficient services to cloud users on an on-demand, pay-per-utilization
basis. However, the vast diversity of available cloud service providers makes it
challenging for users to determine and select the most suitable
service. In addition, users sometimes need to hire the required services from multiple
CSPs, which introduces difficulties in managing interfaces, accounts, security,
support, and Service Level Agreements (SLAs). To circumvent such problems,
having a Cloud Service Broker (CSB) that is aware of service offerings and users'
Quality of Service (QoS) requirements will benefit both the CSPs and the
users. In this work, we propose a Fuzzy Rough Set based Cloud Service
Brokerage Architecture, which is responsible for ranking and selecting services
based on users' QoS requirements and, finally, monitoring the service execution. We
use the fuzzy rough set technique for dimension reduction and weighted
Euclidean distance to rank the CSPs. To prioritize user QoS requests, we
use user-assigned weights and also incorporate system-assigned weights
to give relative importance to the QoS attributes. We compare the proposed
ranking technique with an existing method based on the system response time.
The case study experiment results show that the proposed approach is scalable
and resilient, and produces better results with less searching time.
Comment: 12 pages, 7 figures, and 8 table
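The ranking step mentioned in this abstract, weighted Euclidean distance between a provider's offering and the user's QoS request, can be sketched as follows. The provider names, QoS values, and weights are hypothetical, and the abstract does not say how user-assigned and system-assigned weights are combined, so a single weight vector is assumed here.

```python
import math


def weighted_euclidean(offer, request, weights):
    """Weighted Euclidean distance between an offering and a QoS request."""
    return math.sqrt(sum(w * (o - r) ** 2
                         for o, r, w in zip(offer, request, weights)))


def rank_providers(offers, request, weights):
    """Return provider names ordered from closest to farthest from the request."""
    return sorted(offers,
                  key=lambda name: weighted_euclidean(offers[name], request, weights))


# Hypothetical normalized QoS vectors (e.g. availability, throughput, security).
offers = {
    "csp_a": [0.9, 0.5, 0.7],
    "csp_b": [0.6, 0.9, 0.8],
    "csp_c": [0.8, 0.8, 0.9],
}
request = [0.8, 0.8, 0.9]
weights = [0.5, 0.3, 0.2]  # assumed combined importance of each QoS attribute

ranking = rank_providers(offers, request, weights)
```

Here `csp_c` matches the request exactly and ranks first; raising the weight of an attribute penalizes providers that deviate on it more heavily.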
Stable Multiple Time Step Simulation/Prediction from Lagged Dynamic Network Regression Models
Recent developments in computers and automated data collection strategies
have greatly increased the interest in statistical modeling of dynamic
networks. Many of the statistical models employed for inference on large-scale
dynamic networks suffer from limited forward simulation/prediction ability. A
major problem with many of the forward simulation procedures is the tendency
for the model to become degenerate in only a few time steps, i.e., the
simulation/prediction procedure results in either null graphs or complete
graphs. Here, we describe an algorithm for simulating a sequence of networks
generated from lagged dynamic network regression models, DNR(V), a sub-family of
TERGMs. We introduce a smoothed estimator for forward prediction based on
smoothing of the change statistics obtained for a dynamic network regression
model. We focus on the implementation of the algorithm, providing a series of
motivating examples with comparisons to dynamic network models from the
literature. We find that our algorithm significantly improves multi-step
prediction/simulation over standard DNR(V) forecasting. Furthermore, we show
that our method performs comparably to existing more complex dynamic network
analysis frameworks (SAOM and STERGMs) for small networks over short time
periods, and significantly outperforms these approaches over long time
intervals and/or large networks