Instance selection of linear complexity for big data
Over recent decades, database sizes have grown considerably. Larger sizes present new challenges, because machine learning algorithms are not prepared to process such large volumes of information. Instance selection methods can alleviate this problem when the size of the data set is medium to large. However, even these methods face similar problems with very large-to-massive data sets.
In this paper, two new algorithms with linear complexity for instance selection purposes are presented. Both algorithms use locality-sensitive hashing to find similarities between instances. While the complexity of conventional methods (usually quadratic, O(n²), or log-linear, O(n log n)) means that they are unable to process large-sized data sets, the new proposal shows competitive results in terms of accuracy. Even more remarkably, it shortens execution time, as the proposal reduces complexity to linear with respect to the data set size. The new proposal has been compared with some of the best-known instance selection methods and has also been evaluated on large data sets (up to a million instances).
Supported by Research Projects TIN 2011-24046 and TIN 2015-67534-P from the Spanish Ministry of Economy and Competitiveness
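As a rough illustration of the idea described above (not the paper's actual algorithms), the sketch below uses random-hyperplane locality-sensitive hashing to bucket similar instances and keeps one representative per bucket and class in a single pass over the data, which is what makes the cost linear in the data set size. The function name and parameters are assumptions.

```python
import numpy as np

def lsh_instance_selection(X, y, n_planes=8, seed=0):
    """Sketch of LSH-based instance selection: instances that collide in
    the same hash bucket are treated as similar, and only the first
    instance per (bucket, class) pair is kept. One pass -> O(n)."""
    rng = np.random.default_rng(seed)
    planes = rng.normal(size=(X.shape[1], n_planes))
    # The sign pattern of the random projections is the bucket key.
    codes = (X @ planes > 0).astype(np.uint8)
    keep, seen = [], set()
    for i, (code, label) in enumerate(zip(codes, y)):
        key = (code.tobytes(), label)
        if key not in seen:  # first instance seen in this bucket/class
            seen.add(key)
            keep.append(i)
    return np.array(keep)

# Usage: select a subset of a toy data set.
X = np.random.default_rng(1).normal(size=(1000, 5))
y = (X[:, 0] > 0).astype(int)
idx = lsh_instance_selection(X, y)
```

With 8 hyperplanes there are at most 256 buckets per class, so the selected subset is bounded regardless of how large the input grows; more planes give finer (less aggressive) selection.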
Probabilistic Value Selection for Space Efficient Model
An alternative to current mainstream preprocessing methods is proposed: Value Selection (VS). Unlike existing methods such as feature selection, which removes features, and instance selection, which eliminates instances, value selection eliminates the values (with respect to each feature) in the dataset with two purposes: reducing the model size and preserving its accuracy. Two probabilistic methods based on an information-theoretic metric are proposed: PVS and P+VS. Extensive experiments were conducted on benchmark datasets of various sizes, and the results were compared with existing preprocessing methods such as feature selection, feature transformation, and instance selection. The experimental results show that value selection can achieve a balance between accuracy and model size reduction.
Comment: Accepted at the 21st IEEE International Conference on Mobile Data Management (July 2020)
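The per-value scoring idea behind value selection can be sketched as follows. This is a minimal illustration using plain information gain for each distinct value of a feature, not the paper's exact PVS/P+VS metric, and the function names are assumptions.

```python
import numpy as np
from collections import Counter

def entropy(labels):
    # Shannon entropy of a label array, in bits.
    counts = np.array(list(Counter(labels.tolist()).values()), dtype=float)
    p = counts / counts.sum()
    return float(-(p * np.log2(p)).sum())

def value_scores(column, y):
    # Score each distinct value of one feature by how much it reduces
    # label entropy among the instances carrying that value (an assumed
    # information-theoretic proxy; the paper's PVS metric may differ).
    base = entropy(y)
    return {v: base - entropy(y[column == v]) for v in np.unique(column)}

col = np.array(["a", "a", "b", "b"])
y = np.array([0, 0, 1, 1])
scores = value_scores(col, y)  # each value fully determines the label here
```

Values with scores below some threshold would then be dropped from the model, shrinking it while keeping the informative values that drive accuracy.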
Improving the classification performance on imbalanced data sets via new hybrid parameterisation model
The aim of this work is to analyse the performance of the proposed hybrid parameterisation model in handling problematic data. Three types of problematic data are highlighted in this paper: i) big data sets, ii) uncertain and inconsistent data sets, and iii) imbalanced data sets. The proposed hybrid model integrates three main phases: data decomposition, parameter reduction, and parameter selection. Soft set and rough set theories were implemented to reduce and to select the optimised parameter set, while a neural network was used to classify the optimised data set. The proposed model can process a data set that may contain uncertain, inconsistent, and imbalanced data. The additional data decomposition phase is executed after the pre-processing task is completed in order to manage the big data issue. Imbalanced data sets were used to evaluate the capability of the proposed hybrid model in handling problematic data. The experimental results demonstrate that the proposed hybrid model has the potential to be applied to any type of data set in a classification task, especially complex data sets
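Since the decomposition phase is described only at a high level above, here is a hypothetical sketch of it: the big data set is split into smaller partitions so that each partition can be passed through the reduction and selection phases separately. The function name and the random-split strategy are assumptions, not the paper's actual criterion.

```python
import numpy as np

def decompose(X, y, n_parts, seed=0):
    # Hypothetical decomposition phase: shuffle the instance indices and
    # split them into n_parts partitions of near-equal size, so each
    # partition is small enough for the later reduction/selection phases.
    order = np.random.default_rng(seed).permutation(len(X))
    return [(X[i], y[i]) for i in np.array_split(order, n_parts)]

# Usage: split a toy data set of 10 instances into 3 partitions.
X = np.arange(20.0).reshape(10, 2)
y = np.arange(10)
parts = decompose(X, y, n_parts=3)
```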
Multistage feature selection methods for data classification
In the data analysis process, a good decision can be made with the assistance of several sub-processes and methods, the most common of which are feature selection and classification. Various methods have been proposed to address the issues faced by decision-makers, such as low classification accuracy and long processing times. The analysis process becomes more complicated when dealing with complex datasets, in particular large and problematic ones. One solution is to employ an effective feature selection method to reduce the data processing time, decrease the memory used, and increase the accuracy of decisions. However, not all existing methods are capable of dealing with these issues. The aim of this research was to assist the classifier in achieving better performance on problematic datasets by generating an optimised attribute set. The proposed method comprises two stages of feature selection: a correlation-based feature selection method using a best first search algorithm (CFS-BFS), and a soft set and rough set parameter selection method (SSRS). CFS-BFS is used to eliminate uncorrelated attributes in a dataset, while SSRS is utilised to manage problematic values, such as uncertainty, in a dataset. Several benchmarking feature selection methods, such as classifier subset evaluation (CSE) and principal component analysis (PCA), and different classifiers, such as support vector machine (SVM) and neural network (NN), were used to validate the obtained results. ANOVA and t-tests were also conducted to verify the results. The averages obtained for two experimental works show that the proposed method matched the performance of the other benchmarking methods in assisting the classifier to achieve high classification performance on complex datasets.
The average obtained for another experimental work shows that the proposed method outperformed the other benchmarking methods. In conclusion, the proposed method is a significant alternative feature selection method, able to assist classifiers in achieving better accuracy in the classification process, especially when dealing with problematic datasets
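A much-simplified sketch of the first, correlation-based stage is given below: it keeps only features whose absolute Pearson correlation with the class label exceeds a threshold. Note that CFS-BFS itself also penalises feature-feature correlation and searches attribute subsets with best-first search, and the SSRS stage is omitted entirely; the function name and threshold are assumptions.

```python
import numpy as np

def correlation_filter(X, y, threshold=0.5):
    # Keep the indices of features whose absolute Pearson correlation
    # with the target exceeds the threshold (a simplification of the
    # correlation-based first stage; CFS-BFS also weighs redundancy
    # between the features themselves).
    corrs = np.array([abs(np.corrcoef(X[:, j], y)[0, 1])
                      for j in range(X.shape[1])])
    return np.where(corrs > threshold)[0]

# Usage: only feature 0 is informative, so only it should survive.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 4))
y = X[:, 0] + 0.1 * rng.normal(size=200)
selected = correlation_filter(X, y)
```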
A study of instance selection methods
This thesis presents a study of instance selection techniques, analysing the state of the art and developing new methods to cover some areas that had not received due attention until now.
The first two chapters present new instance selection methods for regression, a topic little studied to date in the literature. The third chapter studies how combining instance selection algorithms for regression offers better results than the individual methods on their own.
The last chapter presents a novel idea: the use of locality-sensitive hash functions to design two new instance selection algorithms for classification. The advantage of this solution is that both algorithms have linear complexity.
The results of this thesis have been published in four articles in first-quartile JCR journals.
Supported by the Ministerio de Economía, Industria y Competitividad, the Junta de Castilla y León and the European Regional Development Fund, projects TIN 2011-24046, TIN 2015-67534-P (MINECO/FEDER) and BU085P17 (JCyL/FEDER)