Optimal Prefix Codes with Fewer Distinct Codeword Lengths are Faster to Construct
A new method for constructing minimum-redundancy binary prefix codes is
described. Our method does not explicitly build a Huffman tree; instead it uses
a property of optimal prefix codes to compute the codeword lengths
corresponding to the input weights. Let $n$ be the number of weights and $k$ be
the number of distinct codeword lengths as produced by the algorithm for the
optimum codes. The running time of our algorithm is $O(k \cdot n)$. Following
our previous work in \cite{be}, no algorithm can possibly construct optimal
prefix codes in $o(k \cdot n)$ time. When the given weights are presorted our
algorithm performs comparisons.
Comment: 23 pages, a preliminary version appeared in STACS 200
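For orientation, the codeword lengths mentioned in the abstract can be computed with the classical heap-based Huffman procedure. The sketch below is that textbook baseline, not the method described in the paper (which avoids building a Huffman tree); the function name `huffman_code_lengths` and the sample weights are illustrative assumptions.

```python
import heapq
from itertools import count


def huffman_code_lengths(weights):
    """Return one optimal prefix-code length per input weight.

    Classical Huffman merging, shown only as a baseline. Tracking the
    symbols of each subtree in a list keeps the code short but makes
    this sketch quadratic in the worst case.
    """
    n = len(weights)
    if n == 0:
        return []
    if n == 1:
        return [1]  # a lone symbol still needs one bit
    tiebreak = count()  # keeps heap entries comparable when weights tie
    heap = [(w, next(tiebreak), [i]) for i, w in enumerate(weights)]
    heapq.heapify(heap)
    lengths = [0] * n
    while len(heap) > 1:
        w1, _, s1 = heapq.heappop(heap)
        w2, _, s2 = heapq.heappop(heap)
        # Every symbol under the merged node moves one level deeper.
        for i in s1 + s2:
            lengths[i] += 1
        heapq.heappush(heap, (w1 + w2, next(tiebreak), s1 + s2))
    return lengths


if __name__ == "__main__":
    lengths = huffman_code_lengths([1, 1, 2, 4, 8])
    print(lengths)            # [4, 4, 3, 2, 1]
    print(len(set(lengths)))  # 4 distinct codeword lengths
```

The number of distinct values in the returned list is the quantity the abstract's running-time bound depends on.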
X-TREPAN: a multi class regression and adapted extraction of comprehensible decision tree in artificial neural networks
In this work, the TREPAN algorithm is enhanced and extended for extracting
decision trees from neural networks. We empirically evaluated the performance
of the algorithm on a set of databases from real world events. This benchmark
enhancement was achieved by adapting Single-test TREPAN and C4.5 decision tree
induction algorithms to analyze the datasets. The models are then compared with
X-TREPAN for comprehensibility and classification accuracy. Furthermore, we
validate the experiments by applying statistical methods. Finally, the
modified algorithm is extended to work with multi-class regression problems and
the ability to comprehend generalized feed forward networks is achieved.
Comment: 17 Pages, 8 Tables, 8 Figures, 6 Equation
Approximate k-nearest neighbour based spatial clustering using k-d tree
Spatial objects with varying characteristics, from domains such as
molecular biology and geography, are represented in spatial areas. Methods to
organize, manage, and maintain those objects in a structured manner are
required. Data mining has produced various techniques to meet these
requirements. Data mining has many major tasks, but the most widely used
task is clustering. Data points within the same cluster share common features that
give each cluster its characteristics. In this paper, an implementation of an
approximate kNN-based spatial clustering algorithm using the k-d tree is
proposed. The major contribution of this research is the use of the
k-d tree data structure for spatial clustering, and a comparison of its performance
with the brute-force approach. The results of the work performed in this paper
show better performance using the k-d tree, compared to the traditional
brute-force approach.
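The k-d tree versus brute-force comparison in this abstract can be illustrated with a small nearest-neighbour search. This is a minimal sketch under stated assumptions: it performs an exact single-nearest-neighbour query, not the approximate kNN clustering of the paper, and all names (`build_kdtree`, `nearest`, `brute_force_nearest`) are hypothetical.

```python
import math
import random


def build_kdtree(points, depth=0):
    """Recursively build a k-d tree by splitting on the median of one axis."""
    if not points:
        return None
    axis = depth % len(points[0])
    points = sorted(points, key=lambda p: p[axis])
    mid = len(points) // 2
    return {
        "point": points[mid],
        "axis": axis,
        "left": build_kdtree(points[:mid], depth + 1),
        "right": build_kdtree(points[mid + 1:], depth + 1),
    }


def nearest(node, query, best=None):
    """Return (distance, point) of the tree point closest to query."""
    if node is None:
        return best
    d = math.dist(node["point"], query)
    if best is None or d < best[0]:
        best = (d, node["point"])
    diff = query[node["axis"]] - node["point"][node["axis"]]
    near, far = (node["left"], node["right"]) if diff < 0 else (node["right"], node["left"])
    best = nearest(near, query, best)
    # Visit the far side only if the splitting plane is closer than the best hit.
    if abs(diff) < best[0]:
        best = nearest(far, query, best)
    return best


def brute_force_nearest(points, query):
    """Reference answer: scan every point."""
    return min((math.dist(p, query), p) for p in points)


if __name__ == "__main__":
    random.seed(0)
    pts = [(random.random(), random.random()) for _ in range(1000)]
    tree = build_kdtree(pts)
    q = (0.5, 0.5)
    print(nearest(tree, q)[0] == brute_force_nearest(pts, q)[0])
```

The pruning test on the splitting plane is what lets the tree skip most points, while the brute-force scan always touches all of them.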
Tree models for difference and change detection in a complex environment
A new family of tree models is proposed, which we call "differential trees."
A differential tree model is constructed from multiple data sets and aims to
detect distributional differences between them. The new methodology differs
from the existing difference and change detection techniques in its
nonparametric nature, model construction from multiple data sets, and
applicability to high-dimensional data. Through a detailed study of an arson
case in New Zealand, where an individual is known to have been laying
vegetation fires within a certain time period, we illustrate how these models
can help detect changes in the frequencies of event occurrences and uncover
unusual clusters of events in a complex environment.
Comment: Published in at http://dx.doi.org/10.1214/12-AOAS548 the Annals of
Applied Statistics (http://www.imstat.org/aoas/) by the Institute of
Mathematical Statistics (http://www.imstat.org
On Relationship between Primal-Dual Method of Multipliers and Kalman Filter
Recently the primal-dual method of multipliers (PDMM), a novel distributed
optimization method, was proposed for solving a general class of decomposable
convex optimizations over graphical models. In this work, we first study the
convergence properties of PDMM for decomposable quadratic optimizations over
tree-structured graphs. We show that with proper parameter selection, PDMM
converges to its optimal solution in a finite number of iterations. We then apply
PDMM to the causal estimation problem over a statistical linear state-space
model. We show that PDMM and the Kalman filter have the same update
expressions, where PDMM can be interpreted as solving a sequence of quadratic
optimizations over a growing chain graph.
Comment: 11 page
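The link between Kalman updates and quadratic optimization can be seen in the simplest scalar case. The sketch below implements only the standard Kalman filter, not PDMM, and checks that for a static state (transition 1, no process noise) the recursive estimate equals the minimizer of one batch quadratic objective. All function and parameter names are illustrative assumptions.

```python
def kalman_step(x_est, p_est, z, a=1.0, c=1.0, q=0.0, r=1.0):
    """One predict/update cycle of a scalar Kalman filter.

    Model assumed here: x_t = a * x_{t-1} + noise(var q), z_t = c * x_t + noise(var r).
    """
    # Predict the state and its variance one step ahead.
    x_pred = a * x_est
    p_pred = a * p_est * a + q
    # Correct the prediction with measurement z.
    gain = p_pred * c / (c * p_pred * c + r)
    x_new = x_pred + gain * (z - c * x_pred)
    p_new = (1.0 - gain * c) * p_pred
    return x_new, p_new


def batch_quadratic_minimizer(x0, p0, measurements, r=1.0):
    """Minimizer of (x - x0)^2 / p0 + sum_i (z_i - x)^2 / r (static state only)."""
    num = x0 / p0 + sum(z / r for z in measurements)
    den = 1.0 / p0 + len(measurements) / r
    return num / den


if __name__ == "__main__":
    x, p = 0.0, 1.0
    zs = [1.0, 2.0, 3.0]
    for z in zs:
        x, p = kalman_step(x, p, z)
    print(x, batch_quadratic_minimizer(0.0, 1.0, zs))
```

Each new measurement adds one quadratic term to the objective, which is the growing-chain picture the abstract describes, here reduced to a single unknown.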
Mining Education Data to Predict Student's Retention: A comparative Study
The main objective of higher education is to provide quality education to
students. One way to achieve the highest level of quality in the higher education
system is by discovering knowledge for prediction regarding enrolment of
students in a course. This paper presents a data mining project to generate
predictive models for student retention management. Given new records of
incoming students, these predictive models can produce short, accurate
prediction lists identifying the students who most need support from the
student retention program. This paper examines the quality of the
predictive models generated by the machine learning algorithms. The results
show that some of the machine learning algorithms are able to establish
effective predictive models from the existing student retention data.
Comment: 5 pages. arXiv admin note: substantial text overlap with
arXiv:1202.481
Forest Floor Visualizations of Random Forests
We propose a novel methodology, forest floor, to visualize and interpret
random forest (RF) models. RF is a popular and useful tool for non-linear
multi-variate classification and regression, which yields a good trade-off
between robustness (low variance) and adaptiveness (low bias). Direct
interpretation of a RF model is difficult, as the explicit ensemble model of
hundreds of deep trees is complex. Nonetheless, it is possible to visualize a
RF model fit by its mapping from feature space to prediction space. Hereby the
user is first presented with the overall geometrical shape of the model
structure, and when needed one can zoom in on local details. Dimensional
reduction by projection is used to visualize high dimensional shapes. The
traditional method to visualize RF model structure, partial dependence plots,
achieves this by averaging multiple parallel projections. We suggest first
using feature contributions, a method to decompose trees by splitting features,
and then performing projections. The advantage of forest floor over
partial dependence plots is that interactions are not masked by averaging. As a
consequence, it is possible to locate interactions that are not visualized in
a given projection. Furthermore, we introduce a goodness-of-visualization
measure, the use of colour gradients to identify interactions, and an out-of-bag
cross-validated variant of feature contributions.
Comment: 25 pages, 12 figures, supplementary materials. v2->v3: minor
proofing, moderated comments on ICE-plots, replaced \psi-operator with the
subset named H in equation 13 and 14 to improve simplicit
Data Mining: A Prediction for Performance Improvement of Engineering Students using Classification
Nowadays the amount of data stored in educational databases is increasing
rapidly. These databases contain hidden information for improvement of
students' performance. Educational data mining is used to study the data
available in the educational field and bring out the hidden knowledge from it.
Classification methods such as decision trees and Bayesian networks can be applied
to educational data to predict students' performance in
examinations. This prediction will help to identify the weak students and help
them to score better marks. The C4.5, ID3 and CART decision tree algorithms are
applied to engineering students' data to predict their performance in the final
exam. The outcome of the decision tree predicted the number of students who are
likely to pass, fail, or be promoted to the next year. The results provide steps to
improve the performance of the students who were predicted to fail or be promoted.
After the declaration of the results of the final examination, the marks
obtained by the students were fed into the system and the results were analyzed
for the next session. The comparative analysis of the results indicates that the
prediction helped the weaker students to improve and led to better
results.
Comment: 6 pages, 3 Figures. arXiv admin note: substantial text overlap with
arXiv:1202.481
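Decision tree learners such as ID3 choose each split by a purity measure; ID3 uses information gain. The sketch below shows that calculation on a tiny, entirely hypothetical student table (the feature names `attendance` and `assignment` and the labels are invented for illustration and do not come from the paper's data).

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((c / total) * math.log2(c / total) for c in Counter(labels).values())


def information_gain(rows, labels, feature):
    """Entropy reduction obtained by splitting the rows on one feature."""
    total = len(labels)
    groups = {}
    for row, label in zip(rows, labels):
        groups.setdefault(row[feature], []).append(label)
    remainder = sum(len(g) / total * entropy(g) for g in groups.values())
    return entropy(labels) - remainder


if __name__ == "__main__":
    # Hypothetical toy records, not taken from the paper.
    rows = [
        {"attendance": "high", "assignment": "yes"},
        {"attendance": "high", "assignment": "no"},
        {"attendance": "low", "assignment": "yes"},
        {"attendance": "low", "assignment": "no"},
    ]
    labels = ["pass", "pass", "fail", "fail"]
    for feature in ("attendance", "assignment"):
        print(feature, information_gain(rows, labels, feature))
```

In this toy table `attendance` separates the classes perfectly and so has the larger gain; an ID3-style learner would split on it first. C4.5 and CART use related criteria (gain ratio and Gini impurity, respectively).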
Transfer Learning, Soft Distance-Based Bias, and the Hierarchical BOA
An automated technique has recently been proposed to transfer learning in the
hierarchical Bayesian optimization algorithm (hBOA) based on distance-based
statistics. The technique enables practitioners to improve hBOA efficiency by
collecting statistics from probabilistic models obtained in previous hBOA runs
and using the obtained statistics to bias future hBOA runs on similar problems.
The purpose of this paper is threefold: (1) test the technique on several
classes of NP-complete problems, including MAXSAT, spin glasses and minimum
vertex cover; (2) demonstrate that the technique is effective even when
previous runs were done on problems of different size; (3) provide empirical
evidence that combining transfer learning with other efficiency enhancement
techniques can often yield nearly multiplicative speedups.
Comment: Accepted at Parallel Problem Solving from Nature (PPSN XII), 10
pages. arXiv admin note: substantial text overlap with arXiv:1201.224
Steering plasmodium with light: Dynamical programming of Physarum machine
A plasmodium of Physarum polycephalum is a very large cell visible to the unaided
eye. The plasmodium is capable of distributed sensing, parallel information
processing, and decentralized optimization. It is an ideal substrate for future
and emerging bio-computing devices. We study the space-time dynamics of the plasmodium's
reaction to localised illumination, and provide analogies between propagating
plasmodium and travelling wave-fragments in excitable media. We show how
plasmodium-based computing devices can be precisely controlled and shaped by
planar domains of illumination.
Comment: Accepted for publication in New Mathematics and Natural Computation
Journal (April, 2009