
    Imputation method based on recurrent neural networks for the internet of things

    The Internet of Things (IoT) refers to a new technological paradigm in which sensors and common objects, such as household appliances, connect to and interact through the Internet. This paradigm, combined with Artificial Intelligence (AI) and modern data analysis techniques, powers the development of smart products and services that promise to revolutionize industry and the way humans live. Nonetheless, many issues must be solved before reliable IoT-based products and services can be built. Among them, the problem of missing data poses a serious threat to the applicability of AI and data analysis in IoT applications. This manuscript presents an analysis of the missing data problem in the context of the IoT, as well as the current imputation methods proposed to solve it. The analysis leads to the conclusion that current solutions are very limited when one considers how broad the context of IoT applications can be. It also shows that there is no common experimental setup in which authors have tested their proposed imputation methods; moreover, the experiments found in the literature lack reproducibility and do not carefully consider how the missing data problem presents itself in the IoT. Consequently, this manuscript makes two main proposals: i) an experimental setup to properly evaluate imputation methods in the context of the IoT; and ii) an imputation method general enough to be applied across several IoT scenarios. The latter is based on Recurrent Neural Networks, a family of supervised learning methods that excel at exploiting patterns in sequential data and intrinsic associations between variables.
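    As a rough, hypothetical sketch of the kind of RNN-based imputation the abstract describes (not the author's actual method), the example below trains a GRU for one-step-ahead prediction on the observed values of a multivariate sensor series and uses its predictions to fill the gaps. The architecture, masking scheme, and all names (GRUImputer, impute) are illustrative assumptions.

```python
# Illustrative sketch only: a GRU predicts the next multivariate sensor reading,
# and its one-step-ahead predictions replace missing values. Shapes, masking,
# and training details are simplifying assumptions, not the thesis method.
import torch
import torch.nn as nn

class GRUImputer(nn.Module):
    def __init__(self, n_features, hidden=64):
        super().__init__()
        self.gru = nn.GRU(n_features, hidden, batch_first=True)
        self.out = nn.Linear(hidden, n_features)

    def forward(self, x):
        h, _ = self.gru(x)          # (batch, time, hidden)
        return self.out(h)          # one-step-ahead prediction per time step

def impute(model, series, mask):
    """series: (time, n_features) tensor with NaNs at missing points;
    mask: boolean tensor, True where the value was observed."""
    x = torch.nan_to_num(series).unsqueeze(0)        # zero-fill as initial guess
    with torch.no_grad():
        pred = model(x).squeeze(0)
    filled = series.clone()
    # keep observed values; fill gaps with the prediction from the previous step
    # (the first time step has no prediction and is left as-is)
    filled[1:][~mask[1:]] = pred[:-1][~mask[1:]]
    return filled

# toy usage with synthetic data
T, F = 100, 3
data = torch.randn(T, F)
mask = torch.rand(T, F) > 0.2                        # ~20% missing at random
data_missing = data.clone()
data_missing[~mask] = float("nan")

model = GRUImputer(n_features=F)
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.MSELoss()
x = torch.nan_to_num(data_missing).unsqueeze(0)
for _ in range(200):                                 # train on observed targets only
    pred = model(x).squeeze(0)[:-1]
    target, m = data_missing[1:], mask[1:]
    loss = loss_fn(pred[m], target[m])
    opt.zero_grad()
    loss.backward()
    opt.step()

imputed = impute(model, data_missing, mask)
```

    In practice one would also feed the observation mask and the time since the last observation to the network (as GRU-D-style models do); that is omitted here for brevity.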

    Multilevel Weighted Support Vector Machine for Classification on Healthcare Data with Missing Values

    This work is motivated by the needs of predictive analytics on healthcare data as represented by Electronic Medical Records. Such data is invariably problematic: noisy, with missing entries, and with imbalance in the classes of interest, leading to serious bias in predictive modeling. Since standard data mining methods often produce poor performance measures, we argue for the development of specialized data-preprocessing and classification techniques. In this paper, we propose a new method to simultaneously classify large datasets and reduce the effects of missing values. It is based on a multilevel framework combining the cost-sensitive SVM with an expectation-maximization imputation method for missing values, which relies on iterated regression analyses. We compare classification results of multilevel SVM-based algorithms on public benchmark datasets with imbalanced classes and missing values, as well as on real data from health applications, and show that our multilevel SVM-based method produces fast, more accurate, and more robust classification results.
    Comment: arXiv admin note: substantial text overlap with arXiv:1503.0625
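    As a rough, hypothetical baseline for the combination described above (not the paper's multilevel algorithm), the sketch below pairs iterated-regression imputation with a cost-sensitive SVM using scikit-learn; the synthetic dataset, missingness rate, and parameters are illustrative assumptions.

```python
# Baseline sketch: iterated-regression imputation of missing values followed by
# a cost-sensitive SVM. This is not the paper's multilevel framework, only an
# illustration of the imputation + cost-sensitive-classification combination.
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.model_selection import train_test_split
from sklearn.metrics import balanced_accuracy_score
from sklearn.datasets import make_classification

# imbalanced synthetic data with ~10% of the entries knocked out at random
X, y = make_classification(n_samples=2000, n_features=20,
                           weights=[0.9, 0.1], random_state=0)
rng = np.random.default_rng(0)
X[rng.random(X.shape) < 0.1] = np.nan

X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

clf = make_pipeline(
    IterativeImputer(max_iter=10, random_state=0),  # iterated regression imputation
    StandardScaler(),
    SVC(class_weight="balanced", C=1.0),            # cost-sensitive SVM
)
clf.fit(X_tr, y_tr)
print("balanced accuracy:", balanced_accuracy_score(y_te, clf.predict(X_te)))
```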

    Geometry- and Accuracy-Preserving Random Forest Proximities with Applications

    Many machine learning algorithms use calculated distances or similarities between data observations to make predictions, cluster similar data, visualize patterns, or generally explore the data. Most distance or similarity measures do not incorporate known data labels and are thus considered unsupervised. Supervised distance measures exist which incorporate data labels and thereby exaggerate the separation between data points of different classes; this approach tends to distort the natural structure of the data. Instead of following similar approaches, we leverage a popular algorithm for making data-driven predictions, the random forest, to naturally incorporate data labels into similarity measures known as random forest proximities. In this dissertation, we explore previously defined random forest proximities and demonstrate their weaknesses in popular proximity-based applications. Additionally, we develop a new proximity definition that can be used to recreate the random forest's predictions. We call these Random Forest Geometry- and Accuracy-Preserving proximities, or RF-GAP. We show by proof and empirical demonstration that RF-GAP proximities can be used to perfectly reconstruct the random forest's predictions and, as a result, we argue that RF-GAP proximities provide a truer representation of the random forest's learning when used in proximity-based applications. We provide evidence that RF-GAP proximities improve applications including imputing missing data, detecting outliers, and visualizing the data. We also introduce a new random forest proximity-based technique for generating 2- or 3-dimensional data representations that can be used as a tool to visually explore the data. We show that this method does well at portraying the relationship between data variables and the data labels, and we show quantitatively and qualitatively that it surpasses other existing methods for this task.
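    For orientation, the sketch below computes the classic (Breiman-style) random forest proximity, the fraction of trees in which two observations share a leaf, which is the baseline definition that RF-GAP refines; it does not implement the RF-GAP weighting itself, and the dataset and parameters are illustrative.

```python
# Classic random forest proximity: fraction of trees in which two observations
# land in the same leaf. RF-GAP additionally weights leaf co-occurrence using
# in-bag/out-of-bag information, so this is only the baseline definition.
import numpy as np
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

X, y = load_iris(return_X_y=True)
rf = RandomForestClassifier(n_estimators=200, random_state=0).fit(X, y)

leaves = rf.apply(X)                       # (n_samples, n_trees) leaf indices
n = X.shape[0]
prox = np.zeros((n, n))
for t in range(leaves.shape[1]):
    same_leaf = leaves[:, t][:, None] == leaves[:, t][None, :]
    prox += same_leaf
prox /= leaves.shape[1]                    # proximity in [0, 1], 1 on the diagonal

# e.g. the forest's five nearest neighbors of sample 0
print(np.argsort(-prox[0])[:5])
```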