166 research outputs found

    Simple but Not Simplistic: Reducing the Complexity of Machine Learning Methods

    Programa Oficial de Doutoramento en Computación. 5009V01 [Abstract] The advent of Big Data and the explosion of the Internet of Things have brought unprecedented challenges to Machine Learning researchers, making the learning task more complex. Real-world machine learning problems usually have inherent complexities, such as the intrinsic characteristics of the data, a large number of instances, high input dimensionality, dataset shift, etc. All these aspects matter, and call for new models that can confront these situations. Thus, in this thesis, we have addressed all these issues, simplifying the machine learning process in the current scenario. First, we carry out a complexity analysis to see how complexity influences the classification models, and whether a prior feature selection step might reduce that complexity. Then, we address the process of simplifying learning with the divide-and-conquer philosophy of the distributed approach. Later, we aim to reduce the complexity of the feature selection preprocessing through the same philosophy. Finally, we opt for a different approach following the current philosophy of Edge Computing, which allows the data produced by Internet of Things devices to be processed closer to where they were created. The proposed approaches have demonstrated their capability to reduce the complexity of traditional machine learning algorithms, and thus it is expected that the contribution of this thesis will open the doors to the development of new machine learning methods that are simpler, more robust, and more computationally efficient.
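The divide-and-conquer idea applied to feature selection can be illustrated with a minimal sketch. It is not the thesis's actual algorithm: the horizontal partitioning by stride, the mutual-information relevance score, the voting rule, and every function and parameter name (`distributed_selection`, `n_parts`, `k`, `min_votes`) are assumptions made for illustration only.

```python
from collections import Counter
from math import log2


def mutual_information(xs, ys):
    """Mutual information (in bits) between two discrete sequences."""
    n = len(xs)
    px, py, pxy = Counter(xs), Counter(ys), Counter(zip(xs, ys))
    return sum(
        (c / n) * log2((c / n) / ((px[x] / n) * (py[y] / n)))
        for (x, y), c in pxy.items()
    )


def top_k_features(rows, labels, k):
    """Rank the features of one partition by relevance and keep the k best."""
    n_features = len(rows[0])
    scores = [
        mutual_information([r[j] for r in rows], labels)
        for j in range(n_features)
    ]
    return sorted(range(n_features), key=lambda j: scores[j], reverse=True)[:k]


def distributed_selection(rows, labels, n_parts, k, min_votes):
    """Split the samples into n_parts blocks, select on each, merge by voting."""
    votes = Counter()
    for p in range(n_parts):
        # Hypothetical partitioning scheme: every n_parts-th sample.
        votes.update(top_k_features(rows[p::n_parts], labels[p::n_parts], k))
    return sorted(j for j, v in votes.items() if v >= min_votes)


# Toy data: feature 0 copies the label, features 1 and 2 carry no signal.
labels = [0, 1] * 6
rows = [(y, i % 3, 0) for i, y in enumerate(labels)]
selected = distributed_selection(rows, labels, n_parts=3, k=1, min_votes=2)
print(selected)
```

Each partition could be processed on a separate node, since only the small per-partition feature lists need to be merged.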

    An efficient multivariate feature ranking method for gene selection in high-dimensional microarray data

    Classification of microarray data plays a significant role in the diagnosis and prediction of cancer. However, its high dimensionality (> tens of thousands) compared to the number of observations (< tens of hundreds) may lead to poor classification accuracy. In addition, only a fraction of the genes is really important for classifying a given cancer, so feature selection is essential in this field. Because of the time and memory burden of processing high-dimensional data, univariate feature ranking methods are widely used in gene selection. However, most of them are not very accurate because they consider only the relevance of features to the target and ignore the redundancy among features. In this study, we propose a novel multivariate feature ranking method to improve the quality of gene selection and, ultimately, the accuracy of microarray data classification. The method can be applied efficiently to high-dimensional microarray data. We embedded the formal definition of relevance into a Markov blanket (MB) to create a new feature ranking method. Using a few microarray datasets, we demonstrated the practicability of MB-based feature ranking, which achieves high accuracy and good efficiency. The method outperformed commonly used univariate ranking methods and, owing to its data efficiency, also yielded better results than the other multivariate feature ranking method.

    Low-Precision Feature Selection on Microarray Data: An Information Theoretic Approach

    Funded for open access publication: Universidade da Coruña/CISUG. [Abstract] The number of interconnected devices surrounding us every day, such as personal wearables, cars, and smart homes, has recently increased. Internet of Things devices monitor many processes and are capable of using machine learning models for pattern recognition, and even for making decisions, with the added advantage of reducing network congestion by allowing computations close to the data sources. The main restriction is the low computation capacity of these devices. Thus, machine learning algorithms are needed that maintain accuracy while using mechanisms that exploit certain characteristics, such as low-precision versions. In this paper, low-precision mutual information-based feature selection algorithms are employed over DNA microarray datasets, showing that 16-bit and sometimes even 8-bit representations of these algorithms can be used without significant variations in the final classification results achieved. This work has been supported by the grant Machine Learning on the Edge - Ayudas Fundación BBVA a Equipos de Investigación Científica 2019. It has also been possible thanks to the support received from the National Plan for Scientific and Technical Research and Innovation of the Spanish Government (Grant PID2019-109238GB-C2), and from the Xunta de Galicia (Grant ED431C 2018/34) with the European Union ERDF funds. CITIC, as a Research Center accredited by the Galician University System, is funded by “Consellería de Cultura, Educación e Universidades from Xunta de Galicia”, 80% through ERDF Funds, ERDF Operational Programme Galicia 2014-2020, and the remaining 20% by “Secretaría Xeral de Universidades” (Grant ED431G 2019/01). Open Access funding provided thanks to the CRUE-CSIC agreement with Springer Nature. Xunta de Galicia; ED431C 2018/34. Xunta de Galicia; ED431G 2019/0
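To give a feel for why reduced precision may leave a mutual-information ranking unchanged, the sketch below computes the same score twice: once in ordinary double precision and once with every intermediate value rounded to a 16-bit float. This is only an illustration under stated assumptions: the use of IEEE 754 half precision (via the `struct` format code `"e"`) as the 16-bit representation, the toy data, and the helper names `to_float16`, `mutual_information`, and `rank` are not taken from the paper, whose actual number format may differ.

```python
import struct
from collections import Counter
from math import log2


def to_float16(x):
    """Round a Python float to the nearest IEEE 754 half-precision value."""
    return struct.unpack("e", struct.pack("e", x))[0]


def identity(x):
    return x


def mutual_information(xs, ys, quantize=identity):
    """Mutual information in bits; every intermediate passes through quantize."""
    n = len(xs)
    px, py, pxy = Counter(xs), Counter(ys), Counter(zip(xs, ys))
    total = 0.0
    for (x, y), c in pxy.items():
        p_xy = quantize(c / n)
        p_x = quantize(px[x] / n)
        p_y = quantize(py[y] / n)
        term = quantize(p_xy * quantize(log2(p_xy / quantize(p_x * p_y))))
        total = quantize(total + term)
    return total


def rank(columns, labels, quantize=identity):
    """Order feature indices from most to least relevant."""
    scores = [mutual_information(col, labels, quantize) for col in columns]
    return sorted(range(len(columns)), key=lambda j: scores[j], reverse=True)


# Toy data: a perfect feature, a noisy feature, and an irrelevant one.
labels = [0] * 8 + [1] * 8
columns = [
    list(labels),
    [0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0],
    [0, 1] * 8,
]
full = mutual_information(columns[1], labels)
half = mutual_information(columns[1], labels, to_float16)
print(rank(columns, labels), rank(columns, labels, to_float16), full, half)
```

On this toy input the two rankings coincide even though the individual scores differ slightly, which is the kind of behaviour the abstract reports for real microarray data.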

    An efficient statistical feature selection approach for classification of gene expression data

    Classification of gene expression data plays a significant role in the prediction and diagnosis of diseases. Gene expression data has the special characteristic that the number of genes greatly exceeds the number of samples. Not all genes contribute to the efficient classification of samples. A robust feature selection algorithm is required to identify the important genes that help classify the samples efficiently. Many feature selection algorithms have been introduced in the past to select informative genes (features) based on relevance and redundancy characteristics. Most of the earlier algorithms require a computationally expensive search strategy to find an optimal feature subset. Existing feature selection methods are also sensitive to the evaluation measures. The paper introduces a novel and efficient feature selection approach, termed ERGS (Effective Range based Gene Selection), based on the statistically defined effective range of each feature for every class. The basic principle behind ERGS is that a higher weight is given to a feature that discriminates the classes clearly. Experimental results on well-known gene expression datasets illustrate the effectiveness of the proposed approach. Two popular classifiers, the Naive Bayes Classifier (NBC) and the Support Vector Machine (SVM), have been used for classification. The proposed feature selection algorithm can be helpful in ranking the genes and is also capable of identifying the most relevant genes responsible for diseases such as leukemia, colon tumor, lung cancer, diffuse large B-cell lymphoma (DLBCL), and prostate cancer.
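The principle that a feature separating the classes clearly should receive a higher weight can be sketched as follows. This is a simplified illustration, not the published ERGS formulas: defining each class range as mean ± `gamma` standard deviations, scoring a feature as one minus the overlapping fraction of those ranges, and the names `class_ranges`, `overlap_fraction`, and `range_weights` are all assumptions.

```python
from itertools import combinations
from statistics import mean, pstdev


def class_ranges(values, labels, gamma=1.0):
    """Per-class interval [mean - gamma*std, mean + gamma*std] (assumed form)."""
    ranges = {}
    for c in sorted(set(labels)):
        vs = [v for v, y in zip(values, labels) if y == c]
        m, s = mean(vs), pstdev(vs)
        ranges[c] = (m - gamma * s, m + gamma * s)
    return ranges


def overlap_fraction(ranges):
    """Total pairwise overlap of the class intervals divided by their span."""
    lo = min(r[0] for r in ranges.values())
    hi = max(r[1] for r in ranges.values())
    span = hi - lo
    if span == 0:
        return 1.0  # all classes collapse to one point: no discrimination
    overlap = sum(
        max(0.0, min(a[1], b[1]) - max(a[0], b[0]))
        for a, b in combinations(ranges.values(), 2)
    )
    return overlap / span


def range_weights(columns, labels, gamma=1.0):
    """Higher weight means the class ranges of that feature overlap less."""
    return [
        1.0 - overlap_fraction(class_ranges(col, labels, gamma))
        for col in columns
    ]


# Toy data: gene_a separates the two classes, gene_b does not.
labels = [0, 0, 0, 1, 1, 1]
gene_a = [1.0, 1.1, 0.9, 5.0, 5.1, 4.9]
gene_b = [1.0, 5.0, 3.0, 1.1, 4.9, 3.1]
weights = range_weights([gene_a, gene_b], labels)
print(weights)
```

With more than two classes the summed pairwise overlap can exceed the span, so these weights are best read as a ranking score rather than a bounded quantity.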