10 research outputs found

    Certainty of outlier and boundary points processing in data mining

    Full text link
    Data certainty is one of the issues in real-world applications, caused by unwanted noise in data. Recently, more attention has been paid to overcoming this problem. We propose a new method based on neutrosophic set (NS) theory to detect boundary and outlier points, which are challenging for clustering methods. First, a certainty value is assigned to each data point based on the proposed definition in NS. Then, a certainty set is presented for the proposed cost function in the NS domain by considering a set of main clusters and a noise cluster. Next, the proposed cost function is minimized by gradient descent, and data points are clustered according to their membership degrees: outlier points are assigned to the noise cluster, while boundary points are assigned to the main clusters with nearly equal membership degrees. To show the effectiveness of the proposed method, two types of datasets are used: three scatter-type datasets and four UCI datasets. Results demonstrate that the proposed cost function handles boundary and outlier points with more accurate membership degrees and outperforms existing state-of-the-art clustering methods. Comment: Conference paper, 6 pages.
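    The membership scheme described above can be sketched with a fixed-distance noise cluster in the style of classical noise clustering; the paper's actual NS cost function and its gradient-descent minimization are not reproduced here, and `delta` and the fuzzifier `m` are illustrative defaults, not values from the paper:

```python
import numpy as np

def noise_fuzzy_memberships(X, centroids, delta=2.0, m=2.0):
    """Fuzzy memberships over k main clusters plus one noise cluster.

    The noise cluster sits at a fixed distance `delta` from every point,
    so outliers (far from all centroids) end up mostly in the noise
    cluster, while boundary points split their membership almost
    equally between the nearby main clusters.
    """
    d = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)  # (n, k)
    d = np.hstack([d, np.full((len(X), 1), delta)])   # append noise distance
    d = np.maximum(d, 1e-12)                          # guard exact-centroid hits
    inv = d ** (-2.0 / (m - 1.0))
    return inv / inv.sum(axis=1, keepdims=True)       # rows sum to 1
```

A point equidistant from two centroids receives equal main-cluster memberships, while a far-away point is absorbed by the noise column.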

    Parallel implementation of fuzzy minimals clustering algorithm

    Get PDF
    Clustering aims to classify different patterns into groups called clusters. Many algorithms for both hard and fuzzy clustering have been developed to deal with exploratory data analysis in many contexts such as image processing, pattern recognition, etc. However, we are witnessing the era of big data computing, in which computing resources are becoming the main bottleneck when dealing with such large datasets. In this context, sequential algorithms need to be redesigned, and even rethought, to fully leverage the emergent massively parallel architectures. In this paper, we propose a parallel implementation of the fuzzy minimals clustering algorithm, called Parallel Fuzzy Minimals (PFM). Our experimental results reveal a linear speed-up of PFM compared to its sequential counterpart, while maintaining very good classification quality.
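    The data-parallel pattern behind PFM (partition the data set, let each worker compute a partial statistic, then reduce) can be sketched as follows; the scatter statistic and all names are illustrative stand-ins, not the actual FM cost terms:

```python
import numpy as np
from concurrent.futures import ThreadPoolExecutor

def partial_scatter(chunk, mean):
    # Each worker computes its chunk's contribution to the global
    # scatter (sum of squared deviations from the global mean).
    return float(((chunk - mean) ** 2).sum())

def parallel_scatter(X, n_workers=4):
    mean = X.mean(axis=0)                     # one cheap global pass
    chunks = np.array_split(X, n_workers)     # coarse-grain data partition
    with ThreadPoolExecutor(max_workers=n_workers) as ex:
        parts = ex.map(partial_scatter, chunks, [mean] * n_workers)
    return sum(parts)                         # reduction step
```

Because each partial result depends only on its own chunk, the speed-up of this map-reduce shape scales with the number of workers, which is the property the PFM experiments report.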

    High-throughput fuzzy clustering on heterogeneous architectures

    Full text link
    [EN] The Internet of Things (IoT) is pushing the next economic revolution, in which the main players are data and immediacy. IoT is increasingly producing large amounts of data that are now classified as "dark data" because most are created but never analyzed. The efficient analysis of this data deluge is becoming mandatory in order to transform it into meaningful information. Among the techniques available for this purpose, clustering techniques, which classify different patterns into groups, have proven very useful for obtaining knowledge from data. However, clustering algorithms are computationally hard, especially on large data sets, and therefore require the most powerful computing platforms on the market. In this paper, we investigate coarse- and fine-grain parallelization strategies on Intel and Nvidia architectures for the fuzzy minimals (FM) algorithm, a fuzzy clustering technique that has shown very good results in the literature. We provide an in-depth performance analysis of FM's main bottlenecks, reporting a speed-up factor of up to 40x compared to the sequential counterpart version. This work was partially supported by the Fundación Séneca del Centro de Coordinación de la Investigación de la Región de Murcia under Project 20813/PI/18, and by the Spanish Ministry of Science, Innovation and Universities under grants TIN2016-78799-P (AEI/FEDER, UE), RTI2018-096384-B-I00, RTI2018-098156-B-C53 and RTC-2017-6389-5. Cebrian, J. M.; Imbernón, B.; Soto, J.; García, J. M.; Cecilia-Canales, J. M. (2020). High-throughput fuzzy clustering on heterogeneous architectures. Future Generation Computer Systems, 106, 401-411. https://doi.org/10.1016/j.future.2020.01.022
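    The fine-grain parallelism exploited on Intel vector units and Nvidia GPUs can be illustrated with NumPy as a stand-in for the paper's AVX/CUDA kernels: the point-to-centroid distance matrix that dominates fuzzy clustering is a purely data-parallel computation, so expressing it as array operations exposes it to parallel execution lanes.

```python
import numpy as np

def distances_scalar(X, C):
    # Naive loops: each (i, j, t) iteration is independent, which is the
    # fine-grain parallelism a SIMD unit or a GPU thread block exploits.
    n, k, dim = len(X), len(C), X.shape[1]
    out = np.empty((n, k))
    for i in range(n):
        for j in range(k):
            out[i, j] = sum((X[i, t] - C[j, t]) ** 2 for t in range(dim)) ** 0.5
    return out

def distances_vectorized(X, C):
    # The same computation as array operations; NumPy (and, analogously,
    # AVX lanes or CUDA threads) processes the lanes in parallel.
    return np.linalg.norm(X[:, None, :] - C[None, :, :], axis=2)
```

Both functions produce identical results; only the second form maps onto the wide execution units whose exploitation yields the reported 40x speed-up.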

    Event processing in web of things

    Get PDF
    The incoming digital revolution has the potential to drastically improve our productivity, reduce operational costs and improve product quality. However, realizing these promises requires the convergence of technologies — from edge computing to cloud, artificial intelligence, and the Internet of Things — blurring the lines between the physical and digital worlds. Although these technologies evolved independently over time, they are becoming increasingly intertwined. Their convergence will create an unprecedented level of automation, achieved via massive machine-to-machine interactions whose cornerstone is event processing. This thesis explores the intersection of these technologies through an in-depth analysis of their role in the life cycle of event processing tasks, including their creation, placement and execution. First, it surveys existing Web standards, Internet drafts, and design patterns used in the creation of cloud-based event processing. Then, it investigates the reasons why event processing is starting to shift towards the edge, along with the standards necessary for a smooth transition. Finally, this work proposes the use of deep reinforcement learning methods for the placement and distribution of event processing tasks at the edge. The results show that the proposed neural event-placement method is capable of obtaining (near-)optimal solutions in several scenarios, and they provide hints about future research directions. The research published in this work was supported by the Portuguese Foundation for Science and Technology (FCT) through CEOT (Center for Electronic, Optoelectronic and Telecommunications) funding (UID/MULTI/00631/2020) and by an FCT Ph.D. grant to Andriy Mazayev (SFRH/BD/138836/2018).
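    A toy tabular Q-learning formulation hints at how placement can be cast as reinforcement learning; the thesis uses deep RL on a far richer state space, so everything here (the latency table, the reward, and the hyperparameters) is an illustrative simplification:

```python
import random

def q_learn_placement(latency, episodes=2000, alpha=0.5, eps=0.2, seed=0):
    """Toy tabular Q-learning for task-to-node placement.

    `latency[t][n]` is the cost of running task t on node n; the reward
    is its negative. Each episode picks a task, chooses a node
    epsilon-greedily, and nudges Q towards the observed reward.
    """
    rng = random.Random(seed)
    n_tasks, n_nodes = len(latency), len(latency[0])
    Q = [[0.0] * n_nodes for _ in range(n_tasks)]
    for _ in range(episodes):
        t = rng.randrange(n_tasks)                      # pick a task
        if rng.random() < eps:                          # explore
            a = rng.randrange(n_nodes)
        else:                                           # exploit
            a = max(range(n_nodes), key=lambda n: Q[t][n])
        Q[t][a] += alpha * (-latency[t][a] - Q[t][a])   # one-step update
    # Greedy policy: best node per task after learning.
    return [max(range(n_nodes), key=lambda n: Q[t][n]) for t in range(n_tasks)]
```

After enough episodes the greedy policy assigns each task to its lowest-latency node, which is the (near-)optimal placement behaviour the thesis evaluates at larger scale.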

    Diagnóstico de consumos anómalos de energia: abordagem por classificação

    Get PDF
    During the operating life of an electrical installation, several anomalies can occur. While many of them are identified only belatedly, others are never identified as a potential problem. Timely identification of these anomalies enables a diagnosis that leads to the correction of their causes, thus avoiding the associated waste and losses. Anomalous consumption can be identified, automatically or semi-automatically, through support systems capable of signalling faults or abnormal behaviour. The work presented in this dissertation aims to provide such signalling solely through the analysis of consumption data measured in real time and compared with historical data, using a classification-based approach built on clustering methods. Different approaches were tested in three distinct cases: two concerning residential consumers for whom consumption records existed over an extended period, and one concerning a sports facility whose consumption-management system can be accessed in real time via the web. The implemented system provides several types of information to the user, allowing a potential anomaly to be visualised graphically whenever the disparity between the classification of the current consumption and the reference consumption class is significant.
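    A minimal sketch of the signalling idea, assuming daily consumption profiles as feature vectors: cluster the historical profiles, then flag a current profile whose distance to its nearest historical class is large relative to that class's typical spread. The clustering method, the `tol` threshold and the k-means details are illustrative, not the dissertation's exact procedure:

```python
import numpy as np

def anomaly_flag(history, current, n_clusters=3, tol=2.0):
    """Flag a consumption profile far from every historical class."""
    history = np.asarray(history, float)
    current = np.asarray(current, float)
    C = history[:n_clusters].copy()          # deterministic init: first k profiles
    for _ in range(20):                      # tiny Lloyd's k-means
        d = np.linalg.norm(history[:, None] - C[None], axis=2)
        lbl = np.argmin(d, axis=1)
        C = np.array([history[lbl == j].mean(axis=0) if np.any(lbl == j) else C[j]
                      for j in range(n_clusters)])
    d = np.linalg.norm(history[:, None] - C[None], axis=2)
    lbl = np.argmin(d, axis=1)
    # Typical spread of each class = mean member-to-centroid distance.
    spread = np.array([d[lbl == j, j].mean() if np.any(lbl == j) else 0.0
                       for j in range(n_clusters)])
    dc = np.linalg.norm(current - C, axis=1)
    j = int(np.argmin(dc))
    return bool(dc[j] > tol * max(spread[j], 1e-9))
```

A profile close to one of the learned classes is considered normal; one far from all of them raises the flag that the dissertation's system displays graphically.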

    Desarrollo eficiente de algoritmos de clasificación difusa en entornos Big Data

    Get PDF
    We are witnessing a period of transition in which "data" are the main protagonists. Nowadays, a huge amount of information is generated every day, in what is known as the Big Data era. Decision-making based on these data, their structuring and organization, as well as their correct integration and analysis, constitute a key factor for many strategic sectors of society. When processing large quantities of data, the storage and analysis techniques associated with Big Data provide great help. Prominent among these techniques are machine learning algorithms, essential for predictive analysis over large amounts of data. Within the field of machine learning, fuzzy classification algorithms are frequently employed to solve a wide variety of problems, mainly those related to the control of complex industrial processes, decision systems in general, problem solving and data compression. Classification systems are also widespread in everyday technology, for example in digital cameras, air-conditioning systems, etc. The success of machine learning techniques is limited by the restrictions of current computational resources, especially when working with large data sets and real-time requirements. In this context, these algorithms need to be redesigned, and even rethought, in order to take full advantage of the massively parallel architectures that offer maximum performance today. This doctoral thesis is framed within this context, computationally analysing the current landscape of classification algorithms and proposing parallel classification algorithms that can offer adequate solutions in a reduced time frame.
Specifically, an in-depth study of well-known machine learning techniques was carried out through a practical application case. This application predicts the ozone level in different areas of the Region of Murcia. The analysis was based on the collection of various pollution parameters for each day during the years 2013 and 2014. The study revealed that the technique obtaining the best results was Random Forest, and a regionalization into two large zones was obtained from the processed data. The focus then shifted to fuzzy classification algorithms. In this case, a modification of the Fuzzy C-Means (FCM) algorithm, mFCM, was used as a discretization technique in order to convert the input data from continuous to discrete. This process is particularly important because certain algorithms need discrete values in order to work, and even techniques that do work with continuous data obtain better results with discrete data. This technique was validated through its application to Anderson's well-known Iris data set, where it was statistically compared with K-Means (KM), yielding better results. Once the study of fuzzy classification algorithms was completed, it was found that these techniques are sensitive to the amount of data, which increases their computational time, so efficiency in the programming of these algorithms is a critical factor for their applicability to Big Data. Therefore, the parallelization of a fuzzy classification algorithm is proposed so that the application becomes faster as the degree of parallelism in the system increases. To this end, the Parallel Fuzzy Minimals (PFM) fuzzy classification algorithm was proposed and compared with the FCM and Fuzzy Minimals (FM) algorithms on different data sets.
In terms of quality, the classification was similar for the three algorithms; however, in terms of scalability, the parallelized PFM algorithm obtained a linear speed-up with respect to the number of processors employed. Having identified the need for these techniques to be developed in massively parallel environments, a high-performance hardware and software infrastructure is proposed to process, in real time, data obtained from several vehicles concerning variables related to pollution and traffic problems. The results showed adequate system performance when working with large amounts of data and, in terms of scalability, the executions were satisfactory. Great challenges lie ahead in identifying other applications in Big Data environments and in using these techniques for prediction in areas as relevant as pollution, traffic and smart cities.
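    The discretization role given to mFCM can be sketched with plain fuzzy c-means: fit k fuzzy prototypes to a continuous feature and replace each value by the index of its highest-membership prototype. The mFCM modification itself is not reproduced; the quantile initialization and all parameters are illustrative:

```python
import numpy as np

def fcm_discretize(x, k=3, m=2.0, iters=50):
    """Discretize a 1-D continuous feature with fuzzy c-means."""
    x = np.asarray(x, float).reshape(-1, 1)
    # Spread the initial prototypes across the value range.
    v = np.quantile(x, np.linspace(0, 1, k)).reshape(-1, 1)
    for _ in range(iters):
        d = np.maximum(np.abs(x - v.T), 1e-12)        # (n, k) distances
        u = d ** (-2.0 / (m - 1.0))
        u /= u.sum(axis=1, keepdims=True)             # fuzzy memberships
        w = u ** m
        v = (w.T @ x) / w.sum(axis=0).reshape(-1, 1)  # update prototypes
    order = np.argsort(v.ravel())                     # label 0 = lowest level
    rank = np.empty(k, dtype=int)
    rank[order] = np.arange(k)
    return rank[np.argmax(u, axis=1)]
```

Each continuous value is thus mapped to one of k ordered discrete levels, which is the form expected by the downstream algorithms the thesis mentions.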

    Estudo e comparação de modelos de canal rádio para sistemas MIMO

    Get PDF
    Master's in Electronic and Telecommunications Engineering. Due to the constant need to develop systems with a high transmission rate and quality of service using a limited spectrum and a regulated radiated power, it is fundamental to characterize the radio channel adequately in order to exploit its potential efficiently. This work focuses on the study of some parameters that characterize the radio propagation channel, obtained by applying the high-resolution SAGE (Space-Alternating Generalized Expectation Maximization) algorithm to the signal captured by a synthetic array used for sounding that channel. Initially, a study of the propagation channel was carried out, namely of outdoor propagation models, the multipath channel and its parameters. The MIMO system was then introduced through a theoretical study of channel capacity, after which the MIMO channel models (analytical and physical) were presented. Using the parameters estimated by the SAGE algorithm, several scenarios were studied in order to obtain some propagation-channel parameters experimentally. With the introduction of the clustering algorithm, it was possible to aggregate the multipath replicas into families of rays with spatial affinity. The graphical display of the clustering allowed the analysis of the cluster layout as well as the delay and azimuth parameters of the multipath components. Finally, a statistical study of the clusters was carried out, focusing on the delay, azimuth and impulse-response measurements, which culminated in a set of graphs representing the various trials and allowing a better understanding of the propagation channel.
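    The ray-grouping step can be sketched as single-linkage grouping in a scaled delay-azimuth plane: rays closer than a threshold end up in the same cluster (family). The scales and the threshold are illustrative, not values estimated from the SAGE measurements:

```python
import numpy as np

def cluster_rays(delays_ns, azimuths_deg, d_scale=50.0, a_scale=20.0, thr=1.0):
    """Group multipath components into clusters by delay/azimuth affinity."""
    pts = np.column_stack([np.asarray(delays_ns, float) / d_scale,
                           np.asarray(azimuths_deg, float) / a_scale])
    n = len(pts)
    labels = [-1] * n
    cur = 0
    for i in range(n):
        if labels[i] != -1:
            continue
        stack = [i]
        labels[i] = cur
        while stack:                       # flood-fill over nearby rays
            j = stack.pop()
            for t in range(n):
                if labels[t] == -1 and np.linalg.norm(pts[j] - pts[t]) < thr:
                    labels[t] = cur
                    stack.append(t)
        cur += 1
    return labels
```

Rays with similar delay and azimuth share a label, mirroring the families of spatially related multipath replicas analysed in the thesis.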

    Parallel Computing for Biological Data

    Get PDF
    In the 1990s, a number of technological innovations appeared that revolutionized biology, and 'Bioinformatics' became a new scientific discipline. Microarrays can measure the abundance of tens of thousands of mRNA species, data on the complete genomic sequences of many different organisms are available, and other technologies make it possible to study various processes at the molecular level. In Bioinformatics and Biostatistics, current research and computations are limited by the available computer hardware. However, this problem can be solved using high-performance computing resources. There are several reasons for the increased focus on high-performance computing: larger data sets, increased computational requirements stemming from more sophisticated methodologies, and the latest developments in computer chip production. The open-source programming language 'R' was developed to provide a powerful and extensible environment for statistical and graphical techniques. There are many good reasons for preferring R to other software or programming languages for scientific computations (in statistics and biology). However, the development of the R language was not aimed at providing software for parallel or high-performance computing. Nonetheless, during the last decade, a great deal of research has been conducted on using parallel computing techniques with R. This PhD thesis demonstrates the usefulness of the R language and parallel computing for biological research. It introduces parallel computing with R, and reviews and evaluates existing techniques and R packages for parallel computing on computer clusters, on multi-core systems, and in grid computing. From a computer-science point of view, the packages were examined with regard to their reusability in biological applications, and some upgrades were proposed. Furthermore, parallel applications for next-generation sequence data and for the preprocessing of microarray data were developed.
    Microarray data are characterized by high levels of noise and bias. As these perturbations have to be removed, the preprocessing of raw data has been a high-priority research topic over the past few years. A new Bioconductor package called affyPara for the parallelized preprocessing of high-density oligonucleotide microarray data was developed and published. The data can be partitioned over the arrays using a block-cyclic partition, which makes the parallelization of algorithms directly possible. Existing statistical algorithms and data structures had to be adjusted and reformulated for use in parallel computing. Using the new parallel infrastructure, normalization methods could be enhanced and new methods became available. The partition of the data and its distribution to several nodes or processors solves the main-memory problem and accelerates the methods by a factor of up to fifteen for 300 or more arrays. The final part of the thesis presents a large cancer study analysing more than 7000 microarrays from a publicly available database and estimating gene-interaction networks. For this purpose, a new R package for microarray data management was developed, and various challenges regarding the analysis of this amount of data are discussed. The comparison of gene networks for different pathways and different cancer entities in this large data collection partly confirms already established forms of gene interaction.
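    The block-cyclic partition of arrays over workers can be sketched as follows (in Python rather than affyPara's R); the `block` size and the round-robin dealing are the essential idea, while the names are illustrative:

```python
def block_cyclic_partition(items, n_workers, block=2):
    """Deal consecutive blocks of `block` items to workers in round-robin order.

    This balances load across workers while keeping each block
    contiguous, so per-worker preprocessing touches neighbouring arrays.
    """
    parts = [[] for _ in range(n_workers)]
    for b, start in enumerate(range(0, len(items), block)):
        parts[b % n_workers].extend(items[start:start + block])
    return parts
```

For example, eight arrays dealt in blocks of two to two workers gives each worker two non-adjacent blocks, which is the distribution that lets each node hold only its share of the data in main memory.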