k-Nearest Neighbour Classifiers: 2nd Edition (with Python examples)
Perhaps the most straightforward classifier in the arsenal of machine
learning techniques is the Nearest Neighbour Classifier -- classification is
achieved by identifying the nearest neighbours to a query example and using
those neighbours to determine the class of the query. This approach to
classification is of particular importance today because its historically
poor run-time performance is much less of a problem given the computational
power now available. This paper presents an overview of techniques for
Nearest Neighbour classification, focusing on: mechanisms for assessing
similarity (distance), computational issues in identifying nearest
neighbours, and mechanisms for reducing the dimension of the data.
This paper is the second edition of a paper previously published as a
technical report. Sections on similarity measures for time series, retrieval
speed-up, and intrinsic dimensionality have been added. An Appendix is
included providing access to Python code for the key methods. Comment: 22
pages, 15 figures; an updated edition of an older tutorial on kNN.
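As an illustration of the basic technique the abstract describes (not the paper's own code), a minimal nearest-neighbour classifier can be sketched in pure Python; the toy data, the choice of Euclidean distance, and k=3 are assumptions for the example:

```python
from collections import Counter
import math

def knn_predict(train, query, k=3):
    """Classify `query` by majority vote among its k nearest training points.

    `train` is a list of (feature_vector, label) pairs; similarity is assessed
    with Euclidean distance, the simplest of the distance measures surveyed.
    """
    neighbours = sorted((math.dist(x, query), label) for x, label in train)
    votes = Counter(label for _, label in neighbours[:k])
    return votes.most_common(1)[0][0]

# Toy 2-D data: two clusters labelled "a" and "b" (illustrative only).
train = [((0.0, 0.0), "a"), ((0.1, 0.2), "a"),
         ((1.0, 1.0), "b"), ((0.9, 1.1), "b")]
print(knn_predict(train, (0.05, 0.1), k=3))  # two of the 3 nearest are "a"
```

Identifying the neighbours by a full sort is the naive O(n log n) route; the speed-up mechanisms the paper surveys exist precisely to avoid this exhaustive comparison.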
An Intelligent System For Arabic Text Categorization
Text Categorization (classification) is the process of assigning documents to a predefined set of categories based on their content. In this paper, an intelligent Arabic text categorization system is presented. Machine learning algorithms are used in this system, and several algorithms for stemming and feature selection are tried. Moreover, the documents are represented using several term weighting schemes, and finally the k-nearest neighbor and Rocchio classifiers are used for the classification process. Experiments are performed over a self-collected data corpus, and the results show that the suggested hybrid method of statistical and light stemmers is the most suitable stemming algorithm for Arabic. The results also show that a hybrid approach of document frequency and information gain is the preferable feature selection criterion and that normalized tf-idf is the best weighting scheme. Finally, the Rocchio classifier outperforms the k-nearest neighbor classifier in the classification process. The experimental results illustrate that the proposed model is efficient, with a generalization accuracy of about 98%.
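The weighting-plus-classification pipeline sketched in the abstract (tf-idf representation, then a centroid-based Rocchio decision) can be illustrated as follows; the tokenized toy corpus, the helper names, and this particular tf-idf variant are assumptions for the example, not the paper's implementation:

```python
import math
from collections import Counter, defaultdict

def tfidf(docs):
    """Normalized tf-idf vectors (as sparse dicts) for tokenized documents."""
    n = len(docs)
    df = Counter(t for doc in docs for t in set(doc))
    vecs = []
    for doc in docs:
        tf = Counter(doc)
        v = {t: c * math.log(n / df[t]) for t, c in tf.items()}
        norm = math.sqrt(sum(w * w for w in v.values())) or 1.0
        vecs.append({t: w / norm for t, w in v.items()})
    return vecs

def rocchio_predict(vecs, labels, query):
    """Assign `query` to the class whose per-class centroid scores highest
    under the dot product -- the Rocchio rule."""
    centroids = defaultdict(lambda: defaultdict(float))
    counts = Counter(labels)
    for v, y in zip(vecs, labels):
        for t, w in v.items():
            centroids[y][t] += w / counts[y]
    def dot(a, b):
        return sum(a.get(t, 0.0) * w for t, w in b.items())
    return max(centroids, key=lambda y: dot(query, centroids[y]))

# Toy corpus: weight training documents and the query with the same idf stats.
docs = [["cat", "meow"], ["cat", "purr"], ["dog", "bark"], ["dog", "woof"]]
labels = ["pets_cat", "pets_cat", "pets_dog", "pets_dog"]
*train_vecs, query_vec = tfidf(docs + [["cat", "meow"]])
print(rocchio_predict(train_vecs, labels, query_vec))
```

For Arabic text, the stemming step the paper evaluates would run before tokenization; it is omitted here to keep the sketch self-contained.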
An agent-driven semantical identifier using radial basis neural networks and reinforcement learning
Due to the huge availability of documents in digital form, and the growing
possibility of deception bound up with the nature of digital documents and
the way they are spread, the authorship attribution problem has steadily
gained relevance. Nowadays, authorship attribution, for both information
retrieval and analysis, has gained great importance in the context of
security, trust, and copyright preservation. This work proposes an
innovative multi-agent-driven machine learning technique developed for
authorship attribution. By means of a preprocessing step for word grouping
and time-period-related analysis of the common lexicon, we determine a bias
reference level for the recurrence frequency of the words within the
analysed texts, and then train a Radial Basis Neural Network (RBPNN)-based
classifier to identify the correct author. The main advantage of the
proposed approach lies in the generality of the semantic analysis, which can
be applied to different contexts and lexical domains without requiring any
modification. Moreover, the proposed system is able to incorporate an
external input, meant to tune the classifier, and then self-adjust by means
of continuous reinforcement learning. Comment: Published in: Proceedings of
the XV Workshop "Dagli Oggetti agli Agenti" (WOA 2014), Catania, Italy,
September 25-26, 2014.
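A generic radial-basis classifier of the kind the abstract refers to can be sketched as below; the Gaussian kernel, the fixed prototype centres, and the hand-set read-out weights are illustrative assumptions (the paper's RBPNN architecture and its training procedure are not reproduced here):

```python
import math

def rbf_activations(x, centers, gamma=1.0):
    """Gaussian radial-basis activations of input x w.r.t. prototype centers."""
    return [math.exp(-gamma * sum((a - b) ** 2 for a, b in zip(x, c)))
            for c in centers]

def rbf_classify(x, centers, weights, gamma=1.0):
    """Linear read-out over the RBF layer; weights[k][j] connects hidden
    unit j to class (author) k. Returns the highest-scoring class index."""
    phi = rbf_activations(x, centers, gamma)
    scores = [sum(w * p for w, p in zip(row, phi)) for row in weights]
    return max(range(len(scores)), key=scores.__getitem__)

# Two toy "author profiles" as prototype centres, with an identity read-out.
centers = [(0.0, 0.0), (1.0, 1.0)]
weights = [[1.0, 0.0], [0.0, 1.0]]
print(rbf_classify((0.1, 0.1), centers, weights))  # class 0, nearer centre 0
```

In the paper's setting the inputs would be word-recurrence features relative to the bias reference level, and the read-out weights would be learned rather than fixed.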
A Novel Feature Selection and Extraction Technique for Classification
This paper presents a versatile technique for the purpose of feature
selection and extraction - Class Dependent Features (CDFs). We use CDFs to
improve the accuracy of classification and at the same time control
computational expense by tackling the curse of dimensionality. In order to
demonstrate the generality of this technique, it is applied to handwritten
digit recognition and text categorization. Comment: 2 pages, 2 tables; published at IEEE SMC 201
Similarity Learning for High-Dimensional Sparse Data
A good measure of similarity between data points is crucial to many tasks in
machine learning. Similarity and metric learning methods learn such measures
automatically from data, but they do not scale well with respect to the
dimensionality of the data. In this paper, we propose a method that can
efficiently learn a similarity measure from high-dimensional sparse data. The core idea
is to parameterize the similarity measure as a convex combination of rank-one
matrices with specific sparsity structures. The parameters are then optimized
with an approximate Frank-Wolfe procedure to maximally satisfy relative
similarity constraints on the training data. Our algorithm greedily
incorporates one pair of features at a time into the similarity measure,
providing an efficient way to control the number of active features and thus
reduce overfitting. It enjoys very appealing convergence guarantees and its
time and memory complexity depend on the sparsity of the data instead of the
dimension of the feature space. Our experiments on real-world high-dimensional
datasets demonstrate its potential for classification, dimensionality reduction
and data exploration. Comment: 14 pages; Proceedings of the 18th International Conference on
Artificial Intelligence and Statistics (AISTATS 2015). Matlab code:
https://github.com/bellet/HDS
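The core parameterization above -- a bilinear similarity built from a convex combination of sparse rank-one atoms -- can be sketched as follows; the greedy selection rule shown is a simplified stand-in for the paper's approximate Frank-Wolfe update, and the data are toy assumptions (the authors' Matlab code at the linked repository is the reference implementation):

```python
def similarity(atoms, x, y):
    """x^T M y with M = sum_k w_k * u_k u_k^T; vectors are sparse dicts and
    each atom is a (weight, u) pair with u itself a sparse dict."""
    total = 0.0
    for w, u in atoms:
        ux = sum(v * x.get(i, 0.0) for i, v in u.items())
        uy = sum(v * y.get(i, 0.0) for i, v in u.items())
        total += w * ux * uy
    return total

def greedy_step(atoms, triplets, candidates, step=0.5):
    """Add the candidate atom that best satisfies the relative constraints
    'x is more similar to y than to z', then take a convex-combination step
    (a simplified surrogate for the approximate Frank-Wolfe rule)."""
    def margin_sum(u):
        trial = [(1.0, u)]
        return sum(similarity(trial, x, y) - similarity(trial, x, z)
                   for x, y, z in triplets)
    best = max(candidates, key=margin_sum)
    # Shrink the existing weights and give `step` mass to the new atom, so
    # the weights remain a convex combination.
    return [(w * (1.0 - step), u) for w, u in atoms] + [(step, best)]

# One triplet constraint over sparse feature vectors (toy data).
x, y, z = {0: 1.0, 1: 1.0}, {0: 1.0}, {1: 1.0}
candidates = [{0: 1.0}, {1: 1.0}]      # single-feature rank-one atoms
atoms = greedy_step([], [(x, y, z)], candidates)
print(similarity(atoms, x, y) > similarity(atoms, x, z))
```

Because each atom touches only a few features, the active feature set -- and hence the cost per similarity evaluation -- grows one atom at a time, which is the overfitting-control mechanism the abstract describes.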
Cross validation of bi-modal health-related stress assessment
This study explores the feasibility of objective and ubiquitous stress assessment. 25 post-traumatic stress disorder patients participated in a controlled storytelling (ST) study and an ecologically valid reliving (RL) study. The two studies were meant to represent an early and a late therapy session, and each consisted of a "happy" and a "stress-triggering" part. Two instruments were chosen to assess the stress level of the patients at various points in time during therapy: (i) speech, used as an objective and ubiquitous stress indicator, and (ii) the subjective unit of distress (SUD), a clinically validated Likert scale. In total, 13 statistical parameters were derived from each of five speech features: amplitude, zero-crossings, power, high-frequency power, and pitch. To model the emotional state of the patients, 28 parameters were selected from this set by means of a linear regression model and, subsequently, compressed into 11 principal components. The SUD and speech model were cross-validated using 3 machine learning algorithms. Between 90% (2 SUD levels) and 39% (10 SUD levels) correct classification was achieved. The two sessions could be discriminated in 89% (for ST) and 77% (for RL) of the cases. This report fills a gap between laboratory and clinical studies, and its results emphasize the usefulness of Computer Aided Diagnostics (CAD) for mental health care.
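Statistical parameters of the kind derived from each speech feature can be sketched as below; the particular four statistics and the toy signal are assumptions for the example (the study derived 13 parameters per feature):

```python
import statistics

def speech_stats(signal):
    """A few statistics of the kind derived per speech feature: mean, spread,
    average power, and a zero-crossing count."""
    return {
        "mean": statistics.fmean(signal),
        "std": statistics.pstdev(signal),
        "power": statistics.fmean(s * s for s in signal),
        "zero_crossings": sum(1 for a, b in zip(signal, signal[1:])
                              if (a < 0) != (b < 0)),
    }

print(speech_stats([1.0, -1.0, 1.0, -1.0]))
```

In the study's pipeline, vectors of such statistics would then pass through feature selection (28 of 65 parameters) and PCA compression (11 components) before classification against the SUD labels.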
Mining learning preferences in web-based instruction: Holists vs. Serialists
Web-based instruction programs are used by learners with diverse knowledge, skills, and needs. These differences determine their preferences for the design of Web-based instruction programs and ultimately influence learners' success in using them. Cognitive style has been found to significantly affect learners' preferences for Web-based instruction programs. However, the majority of previous studies focus on Field Dependence/Independence. Pask's Holist/Serialist dimension has conceptual links with Field Dependence/Independence, but it is left mostly unstudied. Therefore, this study focuses on identifying how this dimension of cognitive style affects learner preferences for Web-based instruction programs. A data mining approach is used to illustrate the difference in preferences between Holists and Serialists. The findings show that there are clear differences with regard to content presentation and navigation support. A set of design features was then produced to help designers incorporate cognitive styles into the development of Web-based instruction programs to ensure that they can accommodate learners' different preferences. This work is partially funded by the National Science Council, Taiwan, ROC (NSC 98-2511-S-008-012-MY3; NSC 99-2511-S-008-003-MY2; NSC 99-2631-S-008-001).
Neural network setups for a precise detection of the many-body localization transition: finite-size scaling and limitations
Determining phase diagrams and phase transitions semi-automatically using
machine learning has received a lot of attention recently, with results in good
agreement with more conventional approaches in most cases. When it comes to
more quantitative predictions, such as the identification of universality class
or precise determination of critical points, the task is more challenging. As
an exacting test-bed, we study the Heisenberg spin-1/2 chain in a random
external field that is known to display a transition from a many-body localized
to a thermalizing regime, whose nature is not entirely characterized. We
introduce different neural network structures and dataset setups to achieve a
finite-size scaling analysis with the least possible physical bias (no assumed
knowledge of the phase transition and directly inputting wave-function
coefficients), using state-of-the-art input data simulating chains of sizes up
to L=24. In particular, we use domain adversarial techniques to ensure that the
network learns scale-invariant features. We find that the output varies
with network and training parameters, resulting in relatively large
uncertainties on the final estimates of the critical point and
correlation-length exponent, which tend to be larger than the values
obtained from conventional approaches. We put the emphasis on interpretability
throughout the paper and discuss what the network appears to learn for the
various architectures used. Our findings show that a quantitative analysis
of phase transitions of unknown nature remains a difficult task for neural
networks when using minimally engineered physical input. Comment: v2:
published version.