The Inverse Bagging Algorithm: Anomaly Detection by Inverse Bootstrap Aggregating
For data sets populated by a very well modeled process and by another process of unknown probability density function (PDF), a desirable feature when manipulating the fraction of the unknown process (either enhancing or suppressing it) is to avoid modifying the kinematic distributions of the well modeled one. A bootstrap technique is used to identify sub-samples rich in the well modeled process, and to classify each event according to the frequency with which it is part of such sub-samples. Comparisons with general MVA algorithms are shown, as well as a study of the asymptotic properties of the method, using a public domain data set that models a typical search for new physics as performed at hadronic colliders such as the Large Hadron Collider (LHC).
Comment: 8 pages, 5 figures. Proceedings of the XIIth Quark Confinement and Hadron Spectrum conference, 28/8-2/9 2016, Thessaloniki, Greece
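The core idea can be illustrated with a toy sketch: resample with replacement, tag sub-samples whose test statistic looks background-like, and score each event by how often it enters such sub-samples. The test statistic (sample mean near the known background mean) and its threshold are assumptions for illustration, not the paper's actual choices.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data: a well-modeled "background" (known PDF) plus a small admixture
# from an unknown "signal" process shifted away from the background mean.
background = rng.normal(0.0, 1.0, size=900)
signal = rng.normal(2.0, 0.5, size=100)
x = np.concatenate([background, signal])
n = len(x)

n_boot = 2000
counts = np.zeros(n)       # times each event enters a background-rich sub-sample
appearances = np.zeros(n)  # times each event is drawn at all

for _ in range(n_boot):
    idx = rng.integers(0, n, size=n)  # bootstrap resample with replacement
    # Assumed statistic: sub-samples whose mean is close to the known
    # background mean (0 here) are tagged as rich in the well-modeled process.
    if abs(x[idx].mean()) < 0.15:
        np.add.at(counts, idx, 1)
    np.add.at(appearances, idx, 1)

# Events that rarely end up in background-rich sub-samples are anomaly
# candidates: low score suggests membership in the unknown process.
score = counts / np.maximum(appearances, 1)
```

On this toy mixture the signal events (the last 100) receive lower scores on average than the background events, which is the separation the method exploits.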
Localized Regression
The main problem with localized discriminant techniques is the curse of dimensionality, which seems to restrict their use to the case of few variables. This restriction does not hold if localization is combined with a reduction of dimension. In particular, it is shown that localization yields powerful classifiers even in higher dimensions if it is combined with locally adaptive selection of predictors. A robust localized logistic regression (LLR) method is developed for which all tuning parameters are chosen data-adaptively. In an extended simulation study we evaluate the potential of the proposed procedure for various types of data and compare it to other classification procedures. In addition, we demonstrate that automatic choice of localization, predictor selection and penalty parameters based on cross validation works well. Finally, the method is applied to real data sets and its real-world performance is compared to alternative procedures.
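A minimal sketch of the localization idea: a kernel centred on the query point down-weights distant training samples before a logistic model is fitted. The fixed bandwidth and the plain gradient-ascent fit are simplifications; the paper chooses the localization parameters data-adaptively and adds predictor selection, which this sketch omits.

```python
import numpy as np

def localized_logit_predict(X, y, x_query, bandwidth=1.0, n_iter=500, lr=0.1):
    """Predict the class at x_query with a kernel-weighted logistic fit."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    x_query = np.asarray(x_query, dtype=float)

    # Gaussian kernel weights: training points far from the query count less.
    d2 = np.sum((X - x_query) ** 2, axis=1)
    w = np.exp(-d2 / (2.0 * bandwidth ** 2))

    # Weighted logistic regression by gradient ascent (intercept + slopes).
    Xb = np.hstack([np.ones((len(X), 1)), X])
    beta = np.zeros(Xb.shape[1])
    for _ in range(n_iter):
        z = np.clip(Xb @ beta, -30.0, 30.0)  # clip to avoid exp overflow
        p = 1.0 / (1.0 + np.exp(-z))
        beta += lr * Xb.T @ (w * (y - p)) / w.sum()

    p_query = 1.0 / (1.0 + np.exp(-np.clip(beta[0] + beta[1:] @ x_query, -30.0, 30.0)))
    return int(p_query > 0.5)
```

Because the weights effectively shrink the training set to a neighbourhood of the query, the fitted model is local even though the logistic form itself is global.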
Ensembles of Randomized Time Series Shapelets Provide Improved Accuracy while Reducing Computational Costs
Shapelets are discriminative time series subsequences that allow generation
of interpretable classification models, which provide faster and generally
better classification than the nearest neighbor approach. However, the shapelet
discovery process requires the evaluation of all possible subsequences of all
time series in the training set, making it extremely computationally intensive.
Consequently, shapelet discovery for large time series datasets quickly becomes
intractable. A number of improvements have been proposed to reduce the training
time. These techniques use approximation or discretization and often lead to
reduced classification accuracy compared to the exact method.
We propose the use of ensembles of shapelet-based classifiers obtained
using random sampling of the shapelet candidates. Random sampling reduces
the number of evaluated candidates and consequently the required computational
cost, while the classification accuracy of the resulting models is not
significantly different from that of the exact algorithm. The combination of
randomized classifiers rectifies the inaccuracies of individual models because
of the diversity of the solutions. Based on the experiments performed, it is
shown that the proposed approach of using an ensemble of inexpensive
classifiers provides better classification accuracy than the exact
method at a significantly lower computational cost.
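A toy sketch of the random-sampling idea: instead of searching every subsequence, each ensemble member draws one random shapelet candidate, turns it into a distance feature, and fits a simple threshold classifier; members then vote. The decision-stump base learner and ensemble size are assumptions for illustration, not the paper's exact configuration.

```python
import numpy as np

rng = np.random.default_rng(1)

def min_dist(series, shapelet):
    """Distance of a shapelet to a series: best match over all windows."""
    L = len(shapelet)
    return min(np.linalg.norm(series[i:i + L] - shapelet)
               for i in range(len(series) - L + 1))

def fit_stump(d, y):
    """Best distance threshold and polarity on the training distances."""
    best = (-1.0, 0.0, 1)
    for t in d:
        for pol in (1, -1):
            pred = (d <= t).astype(int) if pol == 1 else (d > t).astype(int)
            acc = (pred == y).mean()
            if acc > best[0]:
                best = (acc, t, pol)
    return best[1], best[2]

def train_ensemble(train, y, n_members=25, shapelet_len=8):
    """Each member uses ONE randomly sampled shapelet candidate instead of
    evaluating all subsequences -- the source of the computational saving."""
    members = []
    for _ in range(n_members):
        i = rng.integers(len(train))                        # random series
        j = rng.integers(len(train[i]) - shapelet_len + 1)  # random offset
        shp = train[i][j:j + shapelet_len].copy()
        d = np.array([min_dist(s, shp) for s in train])
        t, pol = fit_stump(d, y)
        members.append((shp, t, pol))
    return members

def predict(members, series):
    votes = []
    for shp, t, pol in members:
        hit = min_dist(series, shp) <= t
        votes.append(int(hit) if pol == 1 else int(not hit))
    return int(np.mean(votes) > 0.5)

# Toy data: class 1 contains a bump that class 0 lacks.
def make_series(has_bump):
    s = rng.normal(0.0, 0.1, size=30)
    if has_bump:
        s[10:20] += 5.0
    return s

labels = np.array([0] * 10 + [1] * 10)
train = [make_series(c) for c in labels]
members = train_ensemble(train, labels)
train_acc = np.mean([predict(members, s) == c for s, c in zip(train, labels)])
```

Members whose sampled shapelet happens to overlap the discriminative bump separate the classes well; the uninformative members are averaged out by the vote, which is the diversity argument the abstract makes.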
An Optimal k Nearest Neighbours Ensemble for Classification Based on Extended Neighbourhood Rule with Features subspace
To minimize the effect of outliers, kNN ensembles identify a set of closest
observations to a new sample point to estimate its unknown class by using
majority voting in the labels of the training instances in the neighbourhood.
Ordinary kNN based procedures determine k closest training observations in the
neighbourhood region (enclosed by a sphere) by using a distance formula. The k
nearest neighbours procedure may not work in a situation where sample points in
the test data follow the pattern of the nearest observations that lie on a
certain path not contained in the given sphere of nearest neighbours.
Furthermore, these methods combine hundreds of base kNN learners and many of
them might have high classification errors thereby resulting in poor ensembles.
To overcome these problems, an optimal extended neighbourhood rule based
ensemble is proposed where the neighbours are determined in k steps. It starts
from the first nearest sample point to the unseen observation. The second
nearest data point is identified that is closest to the previously selected
data point. This process is continued until the required number of k
observations is obtained. Each base model in the ensemble is constructed on a
bootstrap sample in conjunction with a random subset of features. After
building a sufficiently large number of base models, the optimal models are
then selected based on their performance on out-of-bag (OOB) data.
Comment: 12 pages
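The chained neighbour selection described above can be sketched as follows. This shows only the extended neighbourhood rule itself; the bootstrap sampling, random feature subspaces, and OOB-based model selection that make up the full ensemble are omitted.

```python
import numpy as np
from collections import Counter

def extended_neighbourhood(X, query, k):
    """Select k neighbours in k steps: the first is the point nearest to the
    query; each subsequent one is the not-yet-selected point nearest to the
    previously selected point, so the neighbourhood can follow a path of
    observations rather than stay inside a fixed sphere around the query."""
    X = np.asarray(X, dtype=float)
    anchor = np.asarray(query, dtype=float)
    selected = []
    remaining = list(range(len(X)))
    for _ in range(k):
        d = [np.linalg.norm(X[i] - anchor) for i in remaining]
        j = remaining.pop(int(np.argmin(d)))
        selected.append(j)
        anchor = X[j]  # the next step chains off the point just selected
    return selected

def enr_predict(X, y, query, k=3):
    """Majority vote over the labels of the extended neighbourhood."""
    idx = extended_neighbourhood(X, query, k)
    return Counter(y[i] for i in idx).most_common(1)[0][0]
```

For points lying along a path, the chain follows the path even when later points fall outside the sphere that ordinary kNN would draw around the query.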
Multi-sensor data fusion and modelling in mobile devices for enhanced user experience
COVID-19: Symptoms Clustering and Severity Classification Using Machine Learning Approach
COVID-19 is an extremely contagious illness that causes conditions ranging from the common cold to more chronic illnesses or even death. The constant emergence of new COVID-19 variants makes it important to identify the symptoms of COVID-19 in order to contain the infection. Clustering and classification in machine learning are in mainstream use across many areas of research, and in recent years they have been applied to generate useful knowledge on the COVID-19 outbreak. Many researchers have shared their COVID-19 data in public databases and many studies have been carried out. However, the merit of a dataset is often unknown, and analyses need to be carried out by researchers to check its reliability. The dataset used in this work was sourced from the Kaggle website. The data was obtained through a survey collected from participants of various genders and ages who had been to at least ten countries. There are four levels of severity based on COVID-19 symptoms, developed in accordance with World Health Organization (WHO) and Indian Ministry of Health and Family Welfare recommendations. This paper presents an inquiry into the dataset utilising supervised and unsupervised machine learning approaches in order to better comprehend it. In this study, the analysis of the severity groups based on COVID-19 symptoms using supervised learning techniques employed a total of seven classifiers, namely K-NN, Linear SVM, Naive Bayes, Decision Tree (J48), AdaBoost, Bagging, and Stacking. For the unsupervised learning techniques, the clustering algorithms utilized in this work are Simple K-Means and Expectation-Maximization. From the results obtained with both supervised and unsupervised learning techniques, we observed that the analysis yielded relatively poor classification and clustering results.
The findings for the dataset analysed in this study do not appear to provide correct results for the symptoms categorized against the severity levels, which raises concerns about the validity and reliability of the dataset.
Pulsar Star Detection: A Comparative Analysis of Classification Algorithms using SMOTE
A pulsar is a highly magnetized rotating compact star whose magnetic poles emit beams of radiation. Pulsar stars have great application in the field of astronomical study. For example, the existence of gravitational radiation can be indirectly confirmed from observations of pulsars in a binary neutron star system. Therefore, the identification of pulsars is necessary for the study of gravitational waves and general relativity. Detection of pulsars in the universe can help research in the field of astrophysics. At present, there are millions of pulsar candidates to be searched through. Machine learning techniques can help detect pulsars among such a large number of candidates. The paper discusses nine common classification algorithms for the prediction of pulsar stars and then compares their performances using various classification metrics such as classification accuracy, precision and recall, ROC score and F-score on both balanced and unbalanced data. The SMOTE technique is used to balance the data for better results. Among the nine algorithms, the XGBoost algorithm achieved the best results. The paper concludes with prospects of machine learning for pulsar detection in the field of astronomy.
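The SMOTE balancing step mentioned above can be sketched in a few lines. This is only the oversampling core (interpolating between a minority point and one of its nearest minority-class neighbours); in practice a library implementation such as imbalanced-learn's SMOTE would be used.

```python
import numpy as np

def smote_oversample(X_min, n_new, k=5, seed=0):
    """Core SMOTE idea: each synthetic minority sample is a random
    interpolation between a minority-class point and one of its k nearest
    minority-class neighbours, so new points stay inside the minority
    region instead of being exact duplicates."""
    rng = np.random.default_rng(seed)
    X_min = np.asarray(X_min, dtype=float)
    synthetic = np.empty((n_new, X_min.shape[1]))
    for m in range(n_new):
        i = rng.integers(len(X_min))
        d = np.linalg.norm(X_min - X_min[i], axis=1)
        neighbours = np.argsort(d)[1:k + 1]  # skip the point itself
        j = rng.choice(neighbours)
        lam = rng.random()                   # interpolation factor in [0, 1)
        synthetic[m] = X_min[i] + lam * (X_min[j] - X_min[i])
    return synthetic
```

Because each synthetic point is a convex combination of two minority points, every coordinate stays within the per-dimension range of the original minority samples.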