Predicting model training time to optimize distributed machine learning applications
Despite major advances in recent years, the field of Machine Learning continues to face research and technical challenges. These stem mostly from big data and streaming data, which require models to be frequently updated or re-trained at the expense of significant computational resources. One solution is the use of distributed learning algorithms, which can learn in a distributed manner from distributed datasets. In this paper, we describe CEDEs, a distributed learning system in which models are heterogeneous distributed Ensembles, i.e., complex models composed of different base models trained with different, distributed subsets of data. Specifically, we address the issue of predicting the training time of a given model, given its characteristics and those of the data. Since the creation of an Ensemble may imply the training of hundreds of base models, information about the predicted duration of each of these individual tasks is paramount for efficient management of the cluster's computational resources and for minimizing makespan, i.e., the time it takes to train the whole Ensemble. Results show that the proposed approach is able to predict the training time of Decision Trees with an average error of 0.103 s, and the training time of Neural Networks with an average error of 21.263 s. We also show how results depend significantly on the hyperparameters of the model and on the characteristics of the input data. This work has been supported by national funds through FCT – Fundação para a Ciência e Tecnologia through projects UIDB/04728/2020, EXPL/CCI-COM/0706/2021, and CPCA-IAC/AV/475278/2022.
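The core idea, estimating a task's training time from model hyperparameters and dataset characteristics, can be sketched as a simple regression. The features (row count, feature count, tree depth), the linear model, and the synthetic, exactly linear timing samples below are illustrative assumptions, not the paper's actual meta-model:

```python
def fit_linear(X, y):
    """Ordinary least squares via the normal equations (pure Python)."""
    n, d = len(X), len(X[0])
    xtx = [[sum(X[k][i] * X[k][j] for k in range(n)) for j in range(d)] for i in range(d)]
    xty = [sum(X[k][i] * y[k] for k in range(n)) for i in range(d)]
    # Gaussian elimination with partial pivoting
    for col in range(d):
        pivot = max(range(col, d), key=lambda r: abs(xtx[r][col]))
        xtx[col], xtx[pivot] = xtx[pivot], xtx[col]
        xty[col], xty[pivot] = xty[pivot], xty[col]
        for r in range(col + 1, d):
            f = xtx[r][col] / xtx[col][col]
            for c in range(col, d):
                xtx[r][c] -= f * xtx[col][c]
            xty[r] -= f * xty[col]
    w = [0.0] * d
    for r in range(d - 1, -1, -1):
        w[r] = (xty[r] - sum(xtx[r][c] * w[c] for c in range(r + 1, d))) / xtx[r][r]
    return w

def predict(w, x):
    return sum(wi * xi for wi, xi in zip(w, x))

# Hypothetical features: [bias, rows/1e4, n_features, max_depth] -> seconds.
# Timings here are synthetic and exactly linear, purely for illustration.
X = [[1, 1.0, 10, 5], [1, 2.0, 10, 5], [1, 1.0, 20, 5],
     [1, 1.0, 10, 10], [1, 4.0, 40, 10], [1, 3.0, 30, 8]]
y = [0.15, 0.20, 0.19, 0.20, 0.47, 0.36]

w = fit_linear(X, y)
est = predict(w, [1, 2.0, 20, 8])  # estimate for an unseen configuration
print(round(est, 3))  # → 0.27 on this synthetic data
```

In practice the meta-model would be trained on measured wall-clock times and would likely need a nonlinear learner; the linear fit merely illustrates the mapping from task features to predicted duration.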
Metaheuristic design of feedforward neural networks: a review of two decades of research
Over the past two decades, feedforward neural network (FNN) optimization has been a key interest among researchers and practitioners of multiple disciplines. FNN optimization is often viewed from various perspectives: the optimization of weights, network architecture, activation nodes, learning parameters, learning environment, etc. Researchers adopted such different viewpoints mainly to improve the FNN's generalization ability. Gradient-descent algorithms such as backpropagation have been widely applied to optimize FNNs, and their success is evident from the FNN's application to numerous real-world problems. However, due to the limitations of gradient-based optimization methods, metaheuristic algorithms, including evolutionary algorithms, swarm intelligence, etc., are still being widely explored by researchers aiming to obtain a generalized FNN for a given problem. This article attempts to summarize a broad spectrum of FNN optimization methodologies, including conventional and metaheuristic approaches. It also tries to connect the various research directions that have emerged from FNN optimization practices, such as evolving neural networks (NN), cooperative coevolution NN, complex-valued NN, deep learning, extreme learning machines, quantum NN, etc. Additionally, it provides interesting research challenges for future research to cope with the present information-processing era.
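As a minimal illustration of the metaheuristic viewpoint, the sketch below trains a tiny FNN without gradients using a (1+1) evolution strategy on XOR. The network size, mutation step, and iteration budget are arbitrary demonstration choices, not drawn from any surveyed method:

```python
import math, random

random.seed(0)

def forward(w, x):
    # 2-2-1 feedforward net with tanh hidden units; w holds all 9 weights/biases
    h1 = math.tanh(w[0]*x[0] + w[1]*x[1] + w[2])
    h2 = math.tanh(w[3]*x[0] + w[4]*x[1] + w[5])
    return w[6]*h1 + w[7]*h2 + w[8]

DATA = [((0, 0), 0), ((0, 1), 1), ((1, 0), 1), ((1, 1), 0)]  # XOR

def loss(w):
    return sum((forward(w, x) - t) ** 2 for x, t in DATA)

# (1+1)-ES: mutate every weight with Gaussian noise, keep the child only if
# it is no worse (elitist selection), so the loss never increases
w = [random.uniform(-1, 1) for _ in range(9)]
initial = loss(w)
for _ in range(3000):
    child = [wi + random.gauss(0, 0.2) for wi in w]
    if loss(child) <= loss(w):
        w = child
final = loss(w)
print(final <= initial)  # elitism guarantees no regression → True
```

The same loop works for any fitness function, which is precisely the appeal of metaheuristics when gradients are unavailable or unreliable.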
CM-CASL: Comparison-based Performance Modeling of Software Systems via Collaborative Active and Semisupervised Learning
Configuration tuning for large software systems is generally challenging due to the complex configuration space and expensive performance evaluation. Most existing approaches follow a two-phase process, first learning a regression-based performance prediction model on available samples and then searching for the configurations with satisfactory performance using the learned model. Such regression-based models often suffer from the scarcity of samples due to the enormous time and resources required to run a large software system with a specific configuration. Moreover, previous studies have shown that even a highly accurate regression-based model may fail to discern the relative merit between two configurations, whereas performance comparison is actually one fundamental strategy for configuration tuning. To address these issues, this paper proposes CM-CASL, a Comparison-based performance Modeling approach for software systems via Collaborative Active and Semisupervised Learning. CM-CASL learns a classification model that compares the performance of two given configurations, and enhances the samples through a collaborative labeling process by both human experts and classifiers, using an integration of active and semisupervised learning. Experimental results demonstrate that CM-CASL outperforms two state-of-the-art performance modeling approaches in terms of both classification accuracy and rank accuracy, and thus provides a better performance model for the subsequent work of configuration tuning.
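The comparison-based idea (classify which of two configurations performs better, rather than regress absolute performance) can be illustrated with a toy pairwise model. The sketch below is not CM-CASL: it uses a perceptron over configuration feature differences and a synthetic ground-truth performance function, both assumptions for demonstration:

```python
import random

random.seed(1)

def true_perf(c):
    # Hidden ground-truth performance (hypothetical); lower is better
    return 3.0*c[0] + 1.5*c[1] + 0.5*c[2]

def label(a, b):
    return 1 if true_perf(a) < true_perf(b) else -1  # +1 means a outperforms b

def rand_cfg():
    return [random.random() for _ in range(3)]

pairs = [(rand_cfg(), rand_cfg()) for _ in range(400)]

# Perceptron over the feature-difference vector a - b: a comparison is just
# a binary classification of the difference between two configurations
w = [0.0, 0.0, 0.0]
for _ in range(20):
    for a, b in pairs:
        d = [ai - bi for ai, bi in zip(a, b)]
        y = label(a, b)
        if y * sum(wi*di for wi, di in zip(w, d)) <= 0:  # misclassified
            w = [wi + y*di for wi, di in zip(w, d)]

# Rank accuracy on unseen configuration pairs
test_pairs = [(rand_cfg(), rand_cfg()) for _ in range(200)]
correct = sum(1 for a, b in test_pairs
              if (sum(wi*(ai - bi) for wi, ai, bi in zip(w, a, b)) > 0)
              == (label(a, b) == 1))
print(round(correct / 200, 2))
```

Note that the model never predicts an absolute performance value; it only orders configurations, which is exactly what a tuning search needs.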
Optimizing Data-Intensive Computing with Efficient Configuration Tuning
As the complexity of distributed analytics systems evolves over time, more configuration parameters are exposed for tuning. While these numerous parameters allow users more control over how their workloads are executed, this flexibility comes at a cost, since finding the right configurations for such systems in a cost-effective way becomes challenging. In practice, several factors contribute to the complexity of tuning the configuration of these systems: the large configuration space, the diversity of the served workloads (each workload possibly requiring a different resource allocation strategy to run optimally), and the dynamic characteristics of these systems' environment (e.g., increases in input data size, changes in the allocation of resources). Paradoxically, existing solutions for workload tuning either assume a static tuning environment or workloads that are inexpensive to run (i.e., requiring hundreds of execution samples). Recently, Bayesian Optimisation (BO) strategies have been applied as a solution to enable efficient autotuning. They build a probabilistic model incrementally to predict the impact of the parameters on performance using a small number of execution samples. The incrementally constructed BO model is used to guide the tuning process and accelerate convergence to a near-optimal configuration. Unfortunately, for distributed analytics systems, the configuration space is too large to construct a good model using traditional BO, which fails to provide quick convergence in high-dimensional configuration spaces.
I argue that cost-effective tuning strategies can only be developed when taking into account the frequent changes that can happen in the analytics workload/environment, the amortization of tuning costs and how this influences tuning profitability, the high dimensionality of the configuration space, and the need to cater for diverse workloads. To tackle these challenges, I propose Tuneful, an efficient configuration tuning framework for such expensive-to-tune systems. It works efficiently both initially (when little data is available) and later (as more tuning knowledge is acquired). It starts by learning workload-specific influential parameters incrementally and tunes those only; then, when more tuning knowledge becomes available, it detects similarity across workloads and uses multitask BO to share tuning knowledge across similar workloads. I show how augmenting the BO approach with parameter significance and workload similarity characteristics enables efficient configuration tuning in high-dimensional configuration spaces. Over diverse analytics workloads, this significantly accelerates both configuration tuning and cost amortization, saving search time by 2.7-3.7X at median compared to state-of-the-art approaches.
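The dimensionality-reduction step (identify the influential parameters first, then tune only those) can be sketched as follows. This is a simplified illustration, not Tuneful's algorithm: the synthetic cost function stands in for an expensive workload execution, and plain random search replaces multitask BO:

```python
import random

random.seed(2)
DIM = 10

def run_workload(cfg):
    # Synthetic "execution cost": only parameters 0 and 3 really matter
    return (cfg[0] - 0.7)**2 + 2.0*(cfg[3] - 0.2)**2 + 0.001*sum(cfg)

base = [0.5] * DIM

# Phase 1: one-at-a-time screening to rank each parameter's influence
influence = []
for i in range(DIM):
    vals = []
    for v in (0.0, 0.25, 0.5, 0.75, 1.0):
        cfg = list(base)
        cfg[i] = v
        vals.append(run_workload(cfg))
    influence.append(max(vals) - min(vals))  # spread = sensitivity proxy

top = sorted(range(DIM), key=lambda i: influence[i], reverse=True)[:2]

# Phase 2: search only over the influential parameters, leaving the rest fixed
best_cost = run_workload(base)
for _ in range(200):
    cfg = list(base)
    for i in top:
        cfg[i] = random.random()
    best_cost = min(best_cost, run_workload(cfg))
print(sorted(top))  # the two truly influential parameters: [0, 3]
```

Shrinking a 10-dimensional search to 2 dimensions is what makes the sample budget of a model-based tuner affordable; the screening phase itself costs only a handful of executions per parameter.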
Bio-inspired computation for big data fusion, storage, processing, learning and visualization: state of the art and future directions
This overview focuses on research achievements that have recently emerged from the confluence between Big Data technologies and bio-inspired computation. A manifold of reasons can be identified for the profitable synergy between these two paradigms, all rooted in the adaptability, intelligence and robustness that biologically inspired principles can provide to technologies aimed at managing, retrieving, fusing and processing Big Data efficiently. We delve into this research field by first analyzing in depth the existing literature, with a focus on advances reported in the last few years. This prior literature analysis is complemented by an identification of the new trends and open challenges in Big Data that remain unsolved to date, and that can be effectively addressed by bio-inspired algorithms. As a second contribution, this work elaborates on how bio-inspired algorithms need to be adapted for their use in a Big Data context, in which data fusion becomes crucial as a previous step to allow processing and mining several, potentially heterogeneous data sources. This analysis allows exploring and comparing the scope and efficiency of existing approaches across different problems and domains, with the purpose of identifying new potential applications and research niches. Finally, this survey highlights open issues that remain unsolved to date in this research avenue, alongside a prescription of recommendations for future research. This work has received funding support from the Basque Government (Eusko Jaurlaritza) through the Consolidated Research Group MATHMODE (IT1294-19), EMAITEK and ELK ARTEK programs. D. Camacho also acknowledges support from the Spanish Ministry of Science and Education under PID2020-117263GB-100 grant (FightDIS), the Comunidad Autonoma de Madrid under S2018/TCS-4566 grant (CYNAMON), and the CHIST ERA 2017 BDSI PACMEL Project (PCI2019-103623, Spain).
Bio-inspired computation: where we stand and what's next
In recent years, the research community has witnessed an explosion of literature dealing with the adaptation of behavioral patterns and social phenomena observed in nature towards efficiently solving complex computational tasks. This trend has been especially dramatic in what relates to optimization problems, mainly due to the unprecedented complexity of problem instances, arising from a diverse spectrum of domains such as transportation, logistics, energy, climate, social networks, health and industry 4.0, among many others. Notwithstanding this upsurge of activity, research in this vibrant topic should be steered towards certain areas that, despite their eventual value and impact on the field of bio-inspired computation, still remain insufficiently explored to date. The main purpose of this paper is to outline the state of the art and to identify open challenges concerning the most relevant areas within bio-inspired optimization. An analysis and discussion are also carried out over the general trajectory followed in recent years by the community working in this field, thereby highlighting the need for reaching a consensus and joining forces towards achieving valuable insights into the understanding of this family of optimization techniques.
Deep learning-based predictive models for massive time series data
Doctoral Programme in Biotechnology, Engineering and Chemical Technology. Research Line: Engineering, Data Science and Bioinformatics. Programme Code: DBI. Line Code: 111. Advances in hardware have revolutionized the field of artificial intelligence, opening new fronts and areas that until recently were limited. The field of deep learning is perhaps one of the most affected by this advance, since these models require great computational capacity due to the number and complexity of their operations, which is why they had fallen into disuse until recent years.
This doctoral thesis has been presented as a compendium of publications, with a total of ten scientific contributions in international conferences and in journals with a high impact factor in the Journal of Citation Reports (JCR). It gathers research oriented to the study, analysis and development of the deep learning architectures most widespread in the literature for time series forecasting, mainly in the energy domain, such as electricity demand and solar power generation. In addition, much of the research focused on the optimization of these models, an essential task for obtaining a reliable predictive model.
In a first phase, the thesis focuses on the development of deep learning-based predictive models for time series forecasting applied to two real data sources.
First, a methodology was designed for multi-step forecasting with a feed-forward model, whose results were published at the International Work-Conference on the Interplay Between Natural and Artificial Computation (IWINAC). The same methodology was then applied and compared with other classical models, implemented in a distributed manner, with results published at the 14th International Work-Conference on Artificial Neural Networks (IWANN). Given the difference in computation time and scalability between the deep learning method and the other models compared, a distributed version was designed, whose results were published in two Q1-indexed journals, Integrated Computer-Aided Engineering and Information Sciences. All of these contributions were tested on a Spanish electricity demand dataset. In parallel, and in order to verify the generality of the methodology, the same approach was applied to a dataset of solar power generation in Australia, in two versions: univariate, with results published at the International Conference on Soft Computing Models in Industrial and Environmental Applications (SOCO), and multivariate, published in the Q2-indexed journal Expert Systems.
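The general windowing technique behind multi-step forecasting (an assumption about the approach, not the thesis's exact methodology) can be sketched as follows: the series is cut into pairs mapping a history window to an h-step horizon, and a model predicts all h future values at once. A toy mean-predictor stands in for the feed-forward network:

```python
def make_windows(series, w, h):
    """Build supervised pairs: w past values -> next h values."""
    X, Y = [], []
    for i in range(len(series) - w - h + 1):
        X.append(series[i:i+w])       # history window
        Y.append(series[i+w:i+w+h])   # multi-step target
    return X, Y

def predict_mean(x, h):
    # Toy stand-in model: forecast every horizon step as the window mean
    m = sum(x) / len(x)
    return [m] * h

series = [float(i % 4) for i in range(20)]  # repeating pattern 0,1,2,3,...
X, Y = make_windows(series, w=4, h=2)
print(len(X), X[0], Y[0])  # → 15 [0.0, 1.0, 2.0, 3.0] [0.0, 1.0]
```

Emitting the whole horizon in one shot avoids the error accumulation of feeding one-step predictions back into the model, which is one common motivation for this formulation.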
Despite the good results obtained, the model optimization strategy was not suitable for big data environments due to its exhaustive nature and computational cost. Motivated by this, the second phase of the doctoral thesis focused on the optimization of deep learning models.
A random search strategy applied to the methodology proposed in the first phase was designed, with results published at IWANN. Attention then turned to heuristic-based optimization, where a genetic algorithm was developed to optimize the feed-forward model. The results of this research were published in the Q2-indexed journal Applied Sciences. In addition, influenced by the 2020 pandemic, a heuristic based on the COVID-19 propagation model was designed and implemented. This optimization strategy was integrated with a Long Short-Term Memory network, yielding highly competitive results that were published in Big Data, a JCR Q1-indexed journal.
To conclude the thesis work, all the information and knowledge acquired were compiled in a survey article, published in the Q1-indexed journal Big Data. Universidad Pablo de Olavide de Sevilla. Departamento de Deporte e Informática.
Data distribution and task scheduling for distributed computing of all-to-all comparison problems
This research studied distributed computing of all-to-all comparison problems with big data sets. The thesis formalised the problem and developed a high-performance, scalable computing framework with a programming model, data distribution strategies, and task scheduling policies to solve it. The study considered storage usage, data locality, and load balancing for performance improvement. The research outcomes can be applied in bioinformatics, biometrics, data mining, and other domains in which all-to-all comparisons are a typical computing pattern.
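The computing pattern itself is easy to sketch: n items generate n*(n-1)/2 pairwise comparison tasks that must be spread across workers. The round-robin assignment below balances task counts only; the thesis's actual strategies additionally account for storage usage and data locality:

```python
from itertools import combinations

def distribute_pairs(items, n_workers):
    """Round-robin assignment of all pairwise comparison tasks to workers."""
    assignment = {w: [] for w in range(n_workers)}
    for k, pair in enumerate(combinations(items, 2)):
        assignment[k % n_workers].append(pair)
    return assignment

items = list(range(9))            # 9 items -> 9*8/2 = 36 comparison tasks
plan = distribute_pairs(items, 4)
loads = [len(v) for v in plan.values()]
print(sum(loads), loads)          # → 36 [9, 9, 9, 9]
```

Round-robin balances task counts but ignores which worker already holds which item's data; a locality-aware policy would instead try to co-locate the pairs that share an input, which is where the data distribution strategy matters.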