8 research outputs found

    On Distributed Fuzzy Decision Trees for Big Data

    Fuzzy decision trees (FDTs) have been shown to be an effective solution in the framework of fuzzy classification. The approaches to FDT learning proposed so far, however, have generally neglected time and space requirements. In this paper, we propose a distributed FDT learning scheme, shaped according to the MapReduce programming model, for generating both binary and multiway FDTs from big data. The scheme relies on a novel distributed fuzzy discretizer that generates a strong fuzzy partition for each continuous attribute based on fuzzy information entropy. The fuzzy partitions are then used as input to the FDT learning algorithm, which employs fuzzy information gain for selecting the attributes at the decision nodes. We have implemented the FDT learning scheme on the Apache Spark framework. We have used ten real-world, publicly available big datasets to evaluate the behavior of the scheme along three dimensions: 1) performance in terms of classification accuracy, model complexity, and execution time; 2) scalability when varying the number of computing units; and 3) ability to efficiently accommodate an increasing dataset size. We demonstrate that the proposed scheme is suitable for managing big datasets even with modest commodity hardware. Finally, for comparative analysis, we have used the distributed decision tree learning algorithm implemented in the MLlib library and Chi-FRBCS-BigData, a MapReduce distributed fuzzy rule-based classification system.
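The entropy-based criterion described in this abstract can be illustrated with a small sketch. This is not the paper's implementation: the triangular partition shape, the default `k`, and all function names are assumptions chosen for clarity. A strong fuzzy partition means the memberships of the fuzzy sets sum to 1 at every point, and fuzzy information gain replaces crisp counts with membership-weighted frequencies:

```python
import math

def triangular_partition(values, k=3):
    """Strong triangular fuzzy partition with k fuzzy sets over the
    observed range: memberships sum to 1 at every point in [lo, hi].
    Returns mu[i][j] = membership of values[j] in fuzzy set i."""
    lo, hi = min(values), max(values)
    step = (hi - lo) / (k - 1)
    centers = [lo + i * step for i in range(k)]
    return [[max(0.0, 1.0 - abs(x - c) / step) for x in values]
            for c in centers]

def fuzzy_entropy(memberships, labels):
    """Fuzzy information entropy of one fuzzy set: class frequencies
    are weighted by membership degree instead of crisp counts."""
    total = sum(memberships)
    if total == 0.0:
        return 0.0
    entropy = 0.0
    for c in set(labels):
        p = sum(m for m, y in zip(memberships, labels) if y == c) / total
        if p > 0.0:
            entropy -= p * math.log2(p)
    return entropy

def fuzzy_information_gain(values, labels, k=3):
    """Crisp entropy of the whole node minus the cardinality-weighted
    fuzzy entropy of each fuzzy set in the partition."""
    mu = triangular_partition(values, k)
    parent = fuzzy_entropy([1.0] * len(values), labels)
    total = sum(sum(row) for row in mu)
    return parent - sum(sum(row) / total * fuzzy_entropy(row, labels)
                        for row in mu)
```

On a toy attribute whose low values belong to one class and high values to the other, the gain comes out well above zero; an attribute whose values are unrelated to the labels drives it toward zero, which is what lets the learner rank candidate attributes at each decision node.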

    Shared Nearest-Neighbor Quantum Game-Based Attribute Reduction with Hierarchical Coevolutionary Spark and Its Application in Consistent Segmentation of Neonatal Cerebral Cortical Surfaces

    The unprecedented increase in data volume has become a severe challenge for conventional data mining and learning systems tasked with handling big data. The recently introduced Spark platform is a new processing framework for big data analysis and related learning systems, and it has attracted increasing attention from both the scientific community and industry. In this paper, we propose a shared nearest-neighbor quantum game-based attribute reduction (SNNQGAR) algorithm that incorporates a hierarchical coevolutionary Spark model. We first present a shared coevolutionary nearest-neighbor hierarchy with self-evolving compensation that considers the features of nearest-neighborhood attribute subsets and calculates the similarity between attribute subsets according to the shared-neighbor information of attribute sample points. We then present a novel attribute weight tensor model to generate ranking vectors of attributes and apply them to balance the relative contributions of different neighborhood attribute subsets. To optimize the model, we propose an embedded quantum equilibrium game paradigm (QEGP) to ensure that noisy attributes do not degrade the big data reduction results. A combination of the hierarchical coevolutionary Spark model and an improved MapReduce framework is then constructed so that the SNNQGAR can be parallelized to efficiently determine the preferred reduction solutions of the distributed attribute subsets. Experimental comparisons demonstrate the superior performance of the SNNQGAR, which outperforms most state-of-the-art attribute reduction algorithms. Moreover, the results indicate that the SNNQGAR can be successfully applied to segment overlapping and interdependent fuzzy cerebral tissues, and that it exhibits stable and consistent segmentation performance for neonatal cerebral cortical surfaces.
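The core idea of "similarity according to shared-neighbor information" can be shown with the generic shared nearest-neighbor measure; the SNNQGAR's actual similarity over attribute subsets is considerably more elaborate, so treat this as a minimal sketch of the underlying notion (function names are invented): two items are similar to the extent that their k-nearest-neighbor sets overlap.

```python
def knn_indices(points, i, k):
    """Indices of the k nearest neighbours of points[i]
    (squared Euclidean distance, ties broken by index)."""
    dists = sorted(
        (sum((a - b) ** 2 for a, b in zip(points[i], points[j])), j)
        for j in range(len(points)) if j != i)
    return {j for _, j in dists[:k]}

def snn_similarity(points, i, j, k=3):
    """Shared nearest-neighbour similarity of two items: the size of
    the overlap of their k-neighbourhoods, normalised to [0, 1]."""
    shared = knn_indices(points, i, k) & knn_indices(points, j, k)
    return len(shared) / k
```

Because the measure depends on neighborhood overlap rather than raw distance, it stays meaningful in dense, high-dimensional settings where absolute distances concentrate, which is presumably why a shared-neighbor notion is attractive for comparing attribute subsets at scale.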

    Computational learning algorithms for large-scale datasets

    Official Doctoral Programme in Computing. 5009V01. [Abstract] Nowadays we are engulfed in a flood of data. This fact has fundamentally changed the way information is shared, and has made it clear that efficient methods for processing and storing vast amounts of data must be put forward. Computational learning is the area of artificial intelligence devoted to studying algorithms that can learn from data, make predictions, or build accurate representations based on observations. In this context, where the amount of data grows faster than the speed of processors, the capabilities of traditional machine learning algorithms are limited by computational time rather than by sample size. Besides, when dealing with large quantities of data, learning algorithms can degrade in performance due to over-fitting, and their efficiency declines with size. Therefore, the scalability of learning algorithms has turned from a desirable property into a crucial one when very large datasets are envisioned. There exist, basically, three intersecting approaches to ensure the scalability of algorithms as datasets continue to grow in size and complexity: online learning, non-iterative learning, and distributed learning. This thesis develops new efficient and scalable machine learning methods following the three previous approaches. Specifically, four new algorithms are developed: (1) The first one performs online feature selection and classification at the same time, by adapting a classical filter method and modifying an online learning algorithm for one-layer neural networks. (2) The next one is a new fast and efficient one-class classifier based on a non-iterative cost function for autoassociative neural networks that performs dimensionality reduction in the hidden layer by means of Singular Value Decomposition. (3) The third method is a new one-class convex hull-based classifier for distributed environments that reduces the dimensionality of the problem, and hence its complexity, by means of random projections. (4) Finally, an online version of the previous one-class classification algorithm is presented.
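A one-class, convex hull-based classifier that tames dimensionality with random projections, as in method (3) of the thesis abstract, can be sketched as follows. This is an illustrative reconstruction, not the thesis implementation: the class and function names are invented, and the decision rule (accept a query only if it falls inside the hull of the training data in every random 2-D projection) is one common way to approximate a high-dimensional convex hull.

```python
import random

def cross(o, a, b):
    """2-D cross product of vectors OA and OB."""
    return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])

def hull_2d(pts):
    """Andrew's monotone-chain convex hull, returned counter-clockwise."""
    pts = sorted(set(pts))
    if len(pts) <= 2:
        return pts
    def build(seq):
        h = []
        for p in seq:
            while len(h) >= 2 and cross(h[-2], h[-1], p) <= 0:
                h.pop()
            h.append(p)
        return h
    lower, upper = build(pts), build(reversed(pts))
    return lower[:-1] + upper[:-1]

def inside(hull, p):
    """A point is inside a CCW convex polygon iff it lies left of
    (or on) every edge."""
    n = len(hull)
    return all(cross(hull[i], hull[(i + 1) % n], p) >= 0 for i in range(n))

class RandomProjectionOneClass:
    """One-class classifier: accept a query point only if it falls
    inside the convex hull of the training data in every random
    2-D projection."""

    def __init__(self, n_projections=10, seed=0):
        self.n_projections = n_projections
        self.rng = random.Random(seed)

    def fit(self, X):
        d = len(X[0])
        # Each projection is a random 2 x d Gaussian matrix.
        self.planes = [[[self.rng.gauss(0.0, 1.0) for _ in range(d)]
                        for _ in range(2)]
                       for _ in range(self.n_projections)]
        self.hulls = [hull_2d([self._project(P, x) for x in X])
                      for P in self.planes]
        return self

    def _project(self, P, x):
        return (sum(a * b for a, b in zip(P[0], x)),
                sum(a * b for a, b in zip(P[1], x)))

    def predict(self, x):
        return all(inside(h, self._project(P, x))
                   for P, h in zip(self.planes, self.hulls))
```

Rejecting an outlier only takes one projection that excludes it, so false acceptances become rarer as the number of projections grows; the pay-off is that each 2-D hull is cheap to build and store, sidestepping the cost of an exact d-dimensional hull.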

    Bio-inspired computation for big data fusion, storage, processing, learning and visualization: state of the art and future directions

    This overview gravitates on research achievements that have recently emerged from the confluence between Big Data technologies and bio-inspired computation. A manifold of reasons can be identified for the profitable synergy between these two paradigms, all rooted in the adaptability, intelligence and robustness that biologically inspired principles can provide to technologies aimed to manage, retrieve, fuse and process Big Data efficiently. We delve into this research field by first analyzing in depth the existing literature, with a focus on advances reported in the last few years. This prior literature analysis is complemented by an identification of the new trends and open challenges in Big Data that remain unsolved to date, and that can be effectively addressed by bio-inspired algorithms. As a second contribution, this work elaborates on how bio-inspired algorithms need to be adapted for their use in a Big Data context, in which data fusion becomes crucial as a previous step to allow processing and mining several and potentially heterogeneous data sources. This analysis allows exploring and comparing the scope and efficiency of existing approaches across different problems and domains, with the purpose of identifying new potential applications and research niches. Finally, this survey highlights open issues that remain unsolved to date in this research avenue, alongside a prescription of recommendations for future research.
    This work has received funding support from the Basque Government (Eusko Jaurlaritza) through the Consolidated Research Group MATHMODE (IT1294-19), EMAITEK and ELKARTEK programs. D. Camacho also acknowledges support from the Spanish Ministry of Science and Education under PID2020-117263GB-100 grant (FightDIS), the Comunidad Autónoma de Madrid under S2018/TCS-4566 grant (CYNAMON), and the CHIST-ERA 2017 BDSI PACMEL Project (PCI2019-103623, Spain).