3 research outputs found
A survey on bandwidth‑aware geo‑distributed frameworks for big‑data analytics
In the era of global-scale services, organisations produce huge volumes of data, often
distributed across multiple data centres, separated by vast geographical distances.
While cluster computing applications, such as MapReduce and Spark, have been
widely deployed in data centres to support commercial applications and scientific research, they are not designed for running jobs across geo-distributed data centres.
The need to utilise such infrastructure introduces new challenges for the data-analytics process, owing to the limited bandwidth of inter-data-centre communication. In this article, we discuss these challenges and survey the latest geo-distributed big-data analytics frameworks and schedulers (based on MapReduce and Spark) with WAN-bandwidth awareness.
Predictive modelling of MapReduce job performance in cloud environments using machine learning techniques
Within the Hadoop ecosystem, MapReduce stands as a cornerstone for managing, processing, and mining large-scale datasets. Yet the absence of efficient solutions for precisely estimating job execution times poses a persistent challenge, impacting task allocation and distribution within Hadoop clusters. In this study, we present a comprehensive machine learning approach for predicting the execution time of MapReduce jobs, encompassing data collection, preprocessing, feature engineering, and model evaluation. Leveraging a rich dataset derived from comprehensive Hadoop MapReduce job traces, we explore the intricate relationship between cluster parameters and job performance. Through a comparative analysis of machine learning models, including linear regression, decision tree, random forest, and gradient-boosted regression trees, we identify the random forest model as the most effective, demonstrating superior predictive accuracy and robustness. Our findings underscore the critical role of features such as data size and resource allocation in determining job performance. With this work, we aim to enhance resource-management efficiency and enable more effective utilisation of cloud-based Hadoop clusters for large-scale data-processing tasks.
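The model comparison this abstract describes can be sketched as follows. This is an illustrative reconstruction, not the paper's code: the features (input size, map/reduce slots) and the synthetic execution-time relationship are assumptions standing in for the paper's real Hadoop job traces.

```python
# Illustrative sketch (not the paper's code): comparing regression models
# for predicting MapReduce job execution time from cluster/job features.
import numpy as np
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_absolute_error

rng = np.random.default_rng(0)
n = 500
# Hypothetical features: input size (GB), map slots, reduce slots.
X = np.column_stack([
    rng.uniform(1, 100, n),    # data size
    rng.integers(2, 32, n),    # map slots
    rng.integers(1, 16, n),    # reduce slots
])
# Synthetic execution time: grows with data size, shrinks with parallelism.
y = 60 * X[:, 0] / (X[:, 1] + X[:, 2]) + rng.normal(0, 5, n)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
models = {
    "linear": LinearRegression(),
    "tree": DecisionTreeRegressor(random_state=0),
    "forest": RandomForestRegressor(random_state=0),
    "gbrt": GradientBoostingRegressor(random_state=0),
}
for name, model in models.items():
    model.fit(X_tr, y_tr)
    print(name, round(mean_absolute_error(y_te, model.predict(X_te)), 2))
```

On a nonlinear target like this, the tree ensembles typically achieve a much lower error than linear regression, which mirrors the abstract's finding that random forest performed best.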
Feature selection methods and genomic big data: a systematic review
In the era of accelerating growth of genomic data, feature-selection techniques are
believed to become a game changer that can help substantially reduce the complexity
of the data, thus making it easier to analyze and translate into useful information. It
is expected that within the next decade, researchers will head towards analyzing the
genomes of all living creatures, making genomics the main generator of data. In the
absence of a thorough investigation of the field, it is almost impossible for researchers
to get an idea of how their work relates to existing studies or how it contributes to the
research community. In this paper, we present a systematic and structured literature
review of the feature-selection techniques used in studies related to big genomic data
analytics.
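As a minimal sketch of the kind of technique this review covers, the following applies univariate (filter-style) feature selection to a synthetic gene-expression matrix. The data, the number of informative genes, and the choice of an ANOVA F-test filter are all illustrative assumptions, not methods taken from the review itself.

```python
# Illustrative sketch (assumed setup, not from the review): univariate
# feature selection on a synthetic "gene expression" matrix.
import numpy as np
from sklearn.feature_selection import SelectKBest, f_classif

rng = np.random.default_rng(1)
n_samples, n_genes = 100, 5000
X = rng.normal(size=(n_samples, n_genes))   # expression levels
y = rng.integers(0, 2, n_samples)           # case/control labels
# Make the first 10 "genes" informative about the label.
X[:, :10] += y[:, None] * 1.5

# Rank genes by ANOVA F-score and keep the top 50.
selector = SelectKBest(f_classif, k=50)
X_reduced = selector.fit_transform(X, y)
print(X_reduced.shape)  # (100, 50)
```

This is the dimensionality-reduction effect the abstract alludes to: the downstream analysis operates on 50 selected features instead of 5000, while the filter reliably recovers the genes that actually carry signal.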