k-Nearest Neighbour Classifiers: 2nd Edition (with Python examples)
Perhaps the most straightforward classifier in the arsenal of machine
learning techniques is the Nearest Neighbour Classifier -- classification is
achieved by identifying the nearest neighbours to a query example and using
those neighbours to determine the class of the query. This approach to
classification is of particular importance because poor run-time
performance is less of a problem these days, given the computational power that
is available. This paper presents an overview of techniques for Nearest
Neighbour classification, focusing on mechanisms for assessing similarity
(distance), computational issues in identifying nearest neighbours, and
mechanisms for reducing the dimension of the data.
This paper is the second edition of a paper previously published as a
technical report. Sections on similarity measures for time-series, retrieval
speed-up and intrinsic dimensionality have been added. An Appendix is included
providing access to Python code for the key methods.
Comment: 22 pages, 15 figures: An updated edition of an older tutorial on kN
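The nearest-neighbour procedure summarised in this abstract (find the examples closest to a query, then let them vote on its class) can be illustrated with a minimal sketch. This is not the code from the paper's appendix: the function name `knn_classify`, the choice of Euclidean distance, simple majority voting, and the toy data are all assumptions made for illustration.

```python
from collections import Counter
import math


def knn_classify(query, examples, labels, k=3):
    """Return the majority class among the k stored examples closest to query."""
    # Euclidean distance from the query to every stored example (an assumed metric).
    distances = [math.dist(query, x) for x in examples]
    # Indices of the k smallest distances.
    nearest = sorted(range(len(examples)), key=lambda i: distances[i])[:k]
    # Simple majority vote over the labels of those neighbours.
    votes = Counter(labels[i] for i in nearest)
    return votes.most_common(1)[0][0]


# Toy two-class data, invented for this example.
train_X = [(1.0, 1.0), (1.2, 0.8), (0.9, 1.1), (5.0, 5.0), (5.2, 4.9), (4.8, 5.1)]
train_y = ["a", "a", "a", "b", "b", "b"]
prediction = knn_classify((1.1, 1.0), train_X, train_y, k=3)
```

The brute-force scan over all examples is the run-time cost the abstract refers to; the speed-up techniques it mentions would replace that scan.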
Development of signal processing algorithms for ultrasonic detection of coal seam interfaces
A pattern recognition system is presented for determining the thickness of coal remaining on the roof and floor of a coal seam. The system was developed to recognize reflected pulse-echo signals that are generated by an acoustical transducer and reflected from the coal seam interface. The flexibility of the system, however, should enable it to identify pulse-echo signals generated by radar or other techniques. The main difference would be the specific features extracted from the recorded data as a basis for pattern recognition.
Classification hardness for supervised learners on 20 years of intrusion detection data
This article consolidates analysis of established (NSL-KDD) and new intrusion detection datasets (ISCXIDS2012, CICIDS2017, CICIDS2018) through the use of supervised machine learning (ML) algorithms. The uniformity of the analysis procedure makes it possible to compare the obtained results. It also provides a stronger foundation for conclusions about the efficacy of supervised learners on the main classification task in network security. This research is motivated in part by the lack of adoption of these modern datasets. The analysis starts with a broad scope, covering classification by algorithms from different families on both established and new datasets, in order to expand the existing foundation and reveal the most opportune avenues for further inquiry. After obtaining baseline results, the classification task was increased in difficulty by reducing the data available to learn from, both horizontally and vertically. The data reduction has been included as a stress test to verify whether the very high baseline results hold up under increasingly harsh constraints. Ultimately, this work contains the most comprehensive set of results on the topic of intrusion detection through supervised machine learning. Researchers working on algorithmic improvements can compare their results to this collection, knowing that all results reported here were gathered through a uniform framework. This work's main contributions are the outstanding classification results on the current state-of-the-art datasets for intrusion detection and the conclusion that these methods show remarkable resilience in classification performance even when the amount of data to learn from is aggressively reduced.
Online and Offline Character Recognition Using Alignment to Prototypes
Nearest neighbor classifiers are simple to implement, yet they can model complex non-parametric distributions, and provide state-of-the-art recognition accuracy in OCR databases. At the same time, they may be too slow for practical character recognition, especially when they rely on similarity measures that require computationally expensive pairwise alignments between characters. This paper proposes an efficient method for computing an approximate similarity score between two characters based on their exact alignment to a small number of prototypes. The proposed method is applied to both online and offline character recognition, where similarity is based on widely used and computationally expensive alignment methods, i.e., Dynamic Time Warping and the Hungarian method respectively. In both cases significant recognition speedup is obtained at the expense of only a minor increase in recognition error.
Office of Naval Research (N00014-03-1-0108); National Science Foundation (IIS-0308213, EIA-0202067
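To make the idea of replacing expensive pairwise alignments with alignments to a few prototypes more concrete, here is a rough sketch. It is not the paper's algorithm: it uses a simpler stand-in in which each sequence is aligned exactly to every prototype with textbook Dynamic Time Warping, and two sequences are then compared through those prototype distances alone. The names `dtw_distance`, `embed`, and `approx_distance`, the L1 comparison, and the toy sequences are all assumptions.

```python
def dtw_distance(a, b):
    """Textbook O(len(a) * len(b)) dynamic time warping cost for 1-D sequences."""
    n, m = len(a), len(b)
    inf = float("inf")
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            step = abs(a[i - 1] - b[j - 1])
            # Extend the cheapest of the three admissible predecessor alignments.
            cost[i][j] = step + min(cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1])
    return cost[n][m]


def embed(seq, prototypes):
    """Exact DTW cost from seq to each prototype (expensive, but done once per item)."""
    return [dtw_distance(seq, p) for p in prototypes]


def approx_distance(emb_a, emb_b):
    """Cheap proxy for dissimilarity: L1 distance between prototype embeddings."""
    return sum(abs(x - y) for x, y in zip(emb_a, emb_b))


# Toy data, invented for this example: two prototypes and three query sequences.
prototypes = [[0, 1, 2], [2, 1, 0]]
rising_1 = [0, 1, 1, 2]
rising_2 = [0, 0, 1, 2]
falling = [2, 2, 1, 0]
```

Under this stand-in, comparing two stored items needs no new alignment, only a vector comparison, which is where a speedup of the kind the abstract reports would come from; the accuracy trade-off depends on how the prototypes are chosen.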
An Easy to Use Repository for Comparing and Improving Machine Learning Algorithm Usage
The results from most machine learning experiments are used for a specific
purpose and then discarded. This results in a significant loss of information
and requires rerunning experiments to compare learning algorithms. It also
requires implementing another algorithm for comparison, which may not
always be implemented correctly. By storing the results from previous
experiments, machine learning algorithms can be compared easily and the
knowledge gained from them can be used to improve their performance. The
purpose of this work is to provide easy access to previous experimental results
for learning and comparison. These stored results are comprehensive -- storing
the prediction for each test instance as well as the learning algorithm,
hyperparameters, and training set that were used. Previous results are
particularly important for meta-learning, which, in a broad sense, is the
process of learning from previous machine learning results such that the
learning process is improved. While other experiment databases do exist, one of
our focuses is on easy access to the data. We provide meta-learning data sets
that are ready to be downloaded for meta-learning experiments. In addition,
queries to the underlying database can be made if specific information is
desired. We also differ from previous experiment databases in that our
database is designed at the instance level, where an instance is an example in
a data set. We store the predictions of a learning algorithm trained on a
specific training set for each instance in the test set. Data set level
information can then be obtained by aggregating the results from the instances.
The instance level information can be used for many tasks such as determining
the diversity of a classifier or algorithmically determining the optimal subset
of training instances for a learning algorithm.
Comment: 7 pages, 1 figure, 6 table
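The step from instance-level predictions to data-set-level information can be sketched as a simple aggregation. The record layout, the function name `dataset_accuracy`, and the sample rows below are hypothetical and do not reflect the repository's actual schema; the sketch only assumes that each stored row pairs a prediction with the true label of one test instance.

```python
from collections import defaultdict

# Hypothetical instance-level records:
# (data set, learning algorithm, instance id, predicted label, true label)
records = [
    ("toy", "knn", 0, "a", "a"),
    ("toy", "knn", 1, "b", "a"),
    ("toy", "knn", 2, "b", "b"),
    ("toy", "knn", 3, "a", "b"),
    ("toy", "tree", 0, "a", "a"),
    ("toy", "tree", 1, "a", "a"),
    ("toy", "tree", 2, "b", "b"),
    ("toy", "tree", 3, "a", "b"),
]


def dataset_accuracy(rows):
    """Aggregate per-instance predictions into accuracy per (data set, algorithm)."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for dataset, algorithm, _instance_id, predicted, actual in rows:
        key = (dataset, algorithm)
        total[key] += 1
        correct[key] += int(predicted == actual)
    return {key: correct[key] / total[key] for key in total}


accuracy = dataset_accuracy(records)
```

Because the per-instance rows are kept, other summaries (for example, which instances two algorithms disagree on) could be computed from the same records without rerunning any experiment.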