Workflow-Based Big Data Analytics in the Cloud Environment: Present Research Status and Future Prospects
Workflow is a common term used to describe a systematic breakdown of tasks
that need to be performed to solve a problem. The concept has found its best use
in scientific and business applications for streamlining and improving the
performance of the underlying processes targeted towards achieving an outcome.
The growing complexity of big data analytical problems has invited the use of
scientific workflows for performing complex tasks for specific domain
applications. This research investigates the efficacy of workflow-based big
data analytics in the cloud environment, giving insights into the research
already performed in the area and the possible future research directions in
the field.
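The task-breakdown idea is easy to make concrete. Below is a minimal sketch of a workflow expressed as a dependency graph and executed in topological order; the task names and bodies are hypothetical, chosen only to illustrate the concept.

```python
# Minimal sketch: a workflow as a DAG of tasks executed in dependency order.
# Task names and functions are hypothetical, for illustration only.
from graphlib import TopologicalSorter

def ingest():  print("ingest raw data")
def clean():   print("clean and validate")
def analyze(): print("run analytics")
def report():  print("produce report")

# Each task maps to the set of tasks it depends on.
workflow = {
    "clean":   {"ingest"},
    "analyze": {"clean"},
    "report":  {"analyze"},
}
tasks = {"ingest": ingest, "clean": clean, "analyze": analyze, "report": report}

for name in TopologicalSorter(workflow).static_order():
    tasks[name]()   # runs ingest -> clean -> analyze -> report
```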
Big Data Analytics in Bioinformatics: A Machine Learning Perspective
Bioinformatics research is characterized by voluminous and incremental
datasets and complex data analytics methods. The machine learning methods used
in bioinformatics are iterative and parallel. These methods can be scaled to
handle big data using distributed and parallel computing technologies.
However, big data tools usually perform computation in batch mode and are not
optimized for iterative processing and high data dependency among operations.
In recent years, parallel, incremental, and multi-view machine learning
algorithms have been proposed. Similarly, graph-based architectures and
in-memory big data tools have been developed to minimize I/O cost and optimize
iterative processing.
However, standard big data architectures and tools are still lacking for many
important bioinformatics problems, such as the fast construction of co-expression
and regulatory networks and salient module identification, detection of
complexes over growing protein-protein interaction data, fast analysis of
massive DNA, RNA, and protein sequence data, and fast querying on incremental
and heterogeneous disease networks. This paper addresses the issues and
challenges posed by several big data problems in bioinformatics, and gives an
overview of the state of the art and the future research opportunities.
Comment: 20-page survey paper on big data analytics in bioinformatics
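As an illustration of the iterative, in-memory pattern described above, the following sketch caches a dataset in cluster memory once and then iterates over it without repeated disk I/O. It assumes PySpark is installed; the file path and the toy one-dimensional estimation loop are illustrative, not taken from the paper.

```python
# Sketch of the in-memory pattern the survey describes: cache the working
# dataset once, then iterate over it without re-reading from disk.
# Assumes PySpark; the path and the toy loop are hypothetical.
from pyspark import SparkContext

sc = SparkContext(appName="iterative-bio-ml")

# e.g. one expression value per line; parse once and keep in cluster memory.
data = sc.textFile("hdfs:///expression/values.txt") \
         .map(float) \
         .cache()                # avoids repeated I/O across iterations

center = 0.0
for _ in range(10):              # toy gradient steps toward the data mean
    grad = data.map(lambda x, c=center: c - x).mean()  # reuses cached partitions
    center -= 0.5 * grad
print("estimated center:", center)
```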
Analytical Cost Metrics: Days of Future Past
As we move towards the exascale era, new architectures must be capable of
running massive computational problems efficiently. Scientists and
researchers are continuously investing in tuning the performance of
extreme-scale computational problems. These problems arise in almost all areas
of computing, ranging from big data analytics, artificial intelligence, search,
machine learning, virtual/augmented reality, computer vision, image/signal
processing to computational science and bioinformatics. With Moore's law
driving the evolution of hardware platforms towards exascale, the dominant
performance metric (time efficiency) has now expanded to also incorporate
power/energy efficiency. Therefore, the major challenge that we face in
computing systems research is: "how to solve massive-scale computational
problems in the most time/power/energy efficient manner?"
Architectures are constantly evolving, making current performance-optimization
strategies less applicable and requiring new strategies to be invented. The
solution is for the new architectures, new programming models, and applications
to go forward together. Doing this is, however, extremely hard. There are too
many design choices in too many dimensions. We propose the following strategy
to solve the problem: (i) Models - Develop accurate analytical models (e.g.
execution time, energy, silicon area) to predict the cost of executing a given
program, and (ii) Complete System Design - Simultaneously optimize all the cost
models for the programs (computational problems) to obtain the most
time/area/power/energy efficient solution. Such an optimization problem evokes
the notion of codesign.
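A minimal sketch of the proposed strategy, under assumed model coefficients: build simple analytical models for execution time and power over a small design space, then jointly optimize them, here by minimizing the energy-delay product.

```python
# Hedged sketch of the two-step strategy: (i) analytical cost models for time
# and power, (ii) joint optimization over a design space (codesign).
# All coefficients and design points below are illustrative, not measured.

design_points = [(cores, ghz) for cores in (8, 16, 32) for ghz in (1.0, 2.0, 3.0)]

WORK = 1e12          # total operations in the program (assumed)
DYN_COEFF = 0.5      # dynamic power coefficient, W per core per GHz^3 (assumed)
STATIC_W = 10.0      # static power per core in watts (assumed)

def exec_time(cores, ghz):
    # Amdahl-style model: 5% serial fraction, rest perfectly parallel.
    serial, parallel = 0.05 * WORK, 0.95 * WORK
    ops_per_sec = ghz * 1e9
    return serial / ops_per_sec + parallel / (cores * ops_per_sec)

def power(cores, ghz):
    return cores * (STATIC_W + DYN_COEFF * ghz ** 3)

# Optimize all cost models at once via the energy-delay product (EDP).
best = min(design_points, key=lambda p: power(*p) * exec_time(*p) ** 2)
print("best (cores, GHz):", best)
```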
Big Data Analytics for Dynamic Energy Management in Smart Grids
The smart electricity grid enables a two-way flow of power and data between
suppliers and consumers in order to facilitate the power flow optimization in
terms of economic efficiency, reliability and sustainability. This
infrastructure permits the consumers and the micro-energy producers to take a
more active role in the electricity market and the dynamic energy management
(DEM). The most important challenge in a smart grid (SG) is how to take
advantage of the users' participation in order to reduce the cost of power.
However, effective DEM depends critically on load and renewable production
forecasting. This calls for intelligent methods and solutions for the real-time
exploitation of the large volumes of data generated by a vast amount of smart
meters. Hence, robust data analytics, high performance computing, efficient
data network management, and cloud computing techniques are critical towards
the optimized operation of SGs. This research aims to highlight the big data
issues and challenges faced by the DEM employed in SG networks. It also
provides a brief description of the most commonly used data processing methods
in the literature, and proposes a promising direction for future research in
the field.
Comment: Published in Elsevier Big Data Research
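To make the forecasting step concrete, here is a minimal sketch of an autoregressive load forecast of the kind DEM relies on; the synthetic meter series and the 24-hour lag order are assumptions for illustration.

```python
# Toy autoregressive forecast of aggregated smart-meter load.
# The synthetic daily-cycle series and the lag order p are assumed.
import numpy as np

rng = np.random.default_rng(0)
hours = np.arange(24 * 30)                 # 30 days of hourly readings
load = 50 + 20 * np.sin(2 * np.pi * hours / 24) + rng.normal(0, 2, hours.size)

p = 24                                     # use the previous day as lags
n = load.size
X = np.column_stack([load[p - k: n - k] for k in range(1, p + 1)])
y = load[p:]
coef, *_ = np.linalg.lstsq(X, y, rcond=None)   # fit AR(p) by least squares

# One-step-ahead forecast from the most recent p observations.
next_hour = load[-1:-p - 1:-1] @ coef
print(f"forecast for next hour: {next_hour:.1f}")
```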
Cloud Computing - Architecture and Applications
In the era of the Internet of Things, with the explosive worldwide growth of
electronic data volume and the associated need for processing, analysis, and
storage of such a humongous volume of data, it has now become mandatory to
exploit the power of massively parallel architectures for fast computation.
Cloud computing provides a cheap source of such a computing framework for
large volumes of data in real-time applications. It is, therefore, not surprising to
see that cloud computing has become a buzzword in the computing fraternity over
the last decade. This book presents some critical applications in cloud
frameworks along with some innovation design of algorithms and architecture for
deployment in cloud environment. It is a valuable source of knowledge for
researchers, engineers, practitioners, and graduate and doctoral students
working in the field of cloud computing. It will also be useful for faculty
members of graduate schools and universities.
Comment: Edited volume published by Intech Publishers, Croatia, June 2017. 138
pages. ISBN 978-953-51-3244-8, Print ISBN 978-953-51-3243-1. Link:
https://www.intechopen.com/books/cloud-computing-architecture-and-application
A Survey of Parallel Sequential Pattern Mining
With the growing popularity of shared resources, large volumes of complex
data of different types are collected automatically. Traditional data mining
algorithms generally face challenges including huge memory cost,
low processing speed, and inadequate hard disk space. As a fundamental task of
data mining, sequential pattern mining (SPM) is used in a wide variety of
real-life applications. However, it is more complex and challenging than other
pattern mining tasks, such as frequent itemset mining and association rule
mining, and also suffers from the above challenges when handling the
large-scale data. To solve these problems, mining sequential patterns in a
parallel or distributed computing environment has emerged as an important issue
with many applications. In this paper, we provide an in-depth survey of the
current status of parallel sequential pattern mining (PSPM), including a
detailed categorization of traditional serial SPM approaches and of
state-of-the-art parallel SPM. We review the related work on parallel
sequential pattern mining in detail, including partition-based algorithms for
PSPM, Apriori-based PSPM, pattern growth based PSPM, and hybrid algorithms for
PSPM, and give an in-depth description (characteristics, advantages,
disadvantages, and a summary) of these parallel approaches to PSPM. Some
advanced topics for PSPM, including parallel quantitative/weighted/utility
sequential pattern mining, PSPM from uncertain data and stream data, hardware
acceleration for PSPM, are further reviewed in detail. We also review some
well-known open-source software for PSPM. Finally, we summarize some
challenges and opportunities of PSPM in the big data era.
Comment: Accepted by ACM Trans. on Knowl. Discov. Data, 33 pages
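The count-distribution idea behind partition-based PSPM can be sketched in a few lines: each worker counts a candidate pattern's support on its own partition of the sequence database, and the local counts are summed. The toy database and pattern below are illustrative assumptions.

```python
# Sketch of count distribution in partition-based PSPM: workers count a
# candidate pattern's support on disjoint partitions, then counts are summed.
from multiprocessing import Pool

def is_subsequence(pattern, sequence):
    """True if pattern occurs in sequence with gaps allowed."""
    it = iter(sequence)
    return all(item in it for item in pattern)

def local_support(args):
    pattern, partition = args
    return sum(is_subsequence(pattern, seq) for seq in partition)

if __name__ == "__main__":
    db = [["a", "b", "c"], ["a", "c"], ["b", "a", "c"], ["a", "b"]]  # toy data
    pattern = ["a", "c"]
    partitions = [db[:2], db[2:]]            # two workers, two partitions
    with Pool(2) as pool:
        counts = pool.map(local_support, [(pattern, p) for p in partitions])
    print("support:", sum(counts))           # 3 of 4 sequences contain <a, c>
```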
A Taxonomy and Survey on eScience as a Service in the Cloud
Cloud computing has recently evolved as a popular computing infrastructure
for many applications. Scientific computing, which was mainly hosted in private
clusters and grids, has started to migrate development and deployment to the
public cloud environment. eScience as a service becomes an emerging and
promising direction for science computing. We review recent efforts in
developing and deploying scientific computing applications in the cloud. In
particular, we introduce a taxonomy specifically designed for scientific
computing in the cloud, and further review the taxonomy with four major kinds
of science applications, including life sciences, physics sciences, social and
humanities sciences, and climate and earth sciences. Our major finding is that,
despite existing efforts in developing cloud-based eScience, eScience still has
a long way to go to fully unlock the power of the cloud computing paradigm.
Therefore, we present the challenges and opportunities in the future
development of cloud-based eScience services, and call for collaborations and
innovations from both the scientific and computer system communities to address
those challenges.
Resource Management and Scheduling for Big Data Applications in Cloud Computing Environments
This chapter presents the software architectures of big data processing
platforms. It provides in-depth knowledge of the resource management
techniques involved in deploying big data processing systems in cloud
environments. It starts from the very basics and gradually introduces the
core components of resource management, which we have divided into multiple
layers. It covers state-of-the-art practices and research in SLA-based
resource management, with a specific focus on job scheduling mechanisms.
Comment: 27 pages, 9 figures
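As a toy illustration of SLA-based scheduling of the kind the chapter surveys, the sketch below assigns jobs in earliest-deadline-first order to the machine that frees up soonest and flags SLA violations; the job sizes, deadlines, and uniform machine speed are all assumptions.

```python
# Hedged sketch: earliest-deadline-first (EDF) assignment of jobs onto the
# least-loaded machine, checking each finish time against its SLA deadline.
import heapq

jobs = [  # (deadline_s, job_id, work_units) -- made-up figures
    (100, "etl",   300),
    (60,  "query",  90),
    (200, "train", 800),
]
machines = [0.0, 0.0]          # next-free time per machine, in seconds
SPEED = 5.0                    # work units per second (assumed uniform)

heapq.heapify(jobs)            # order by SLA deadline (EDF)
while jobs:
    deadline, job_id, work = heapq.heappop(jobs)
    m = min(range(len(machines)), key=machines.__getitem__)
    finish = machines[m] + work / SPEED
    machines[m] = finish
    status = "meets" if finish <= deadline else "VIOLATES"
    print(f"{job_id}: machine {m}, finish {finish:.0f}s, {status} SLA {deadline}s")
```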
DAME: A Distributed Data Mining & Exploration Framework within the Virtual Observatory
Nowadays, many scientific areas share the same broad requirements of being
able to deal with massive and distributed datasets while, when possible, being
integrated with services and applications. To close the growing gap between
the incremental generation of data and our understanding of it, we must know
how to access, retrieve, analyze, mine, and integrate data from
disparate sources. A fundamental requirement for any new-generation data
mining tool or package that aims to become a service for the community is
that it can be used within complex workflows, which each user can fine-tune
to match the specific demands of their scientific goals. These workflows
often need to access different resources (data, providers, computing
facilities, and packages) and require strict interoperability on (at least)
the client side. The DAME (DAta Mining & Exploration) project arises
from these requirements by providing a distributed web-based data mining
infrastructure specialized in massive data set exploration with soft computing
methods. Originally designed for astrophysical use cases, where early
scientific applications demonstrated its effectiveness, the DAME Suite has
evolved into a multi-disciplinary, platform-independent tool fully compliant
with modern KDD (Knowledge Discovery in Databases) requirements and
Information & Communication Technology trends.
Comment: 20 pages, INGRID 2010 - 5th International Workshop on Distributed
Cooperative Laboratories: "Instrumenting" the Grid, May 12-14, 2010, Poznan,
Poland; Volume Remote Instrumentation for eScience and Related Aspects, 2011,
F. Davoli et al. (eds.), SPRINGER N
Coinami: A Cryptocurrency with DNA Sequence Alignment as Proof-of-work
The rate of growth of data generated by high-throughput sequencing (HTS)
platforms now exceeds the growth stipulated by Moore's Law. HTS data volumes
are expected to surpass those of other "big data" domains, such as astronomy,
before the year 2025. In addition to sequencing genomes for research
purposes, genome and exome sequencing in clinical settings will be a routine
part of health care. The analysis of such large amounts of data, however, is
not without computational challenges. This burden is further increased by
periodic updates to reference genomes, which typically require re-analysis
of existing data. Here we propose Coin-Application Mediator Interface (Coinami)
to distribute the workload for mapping reads to reference genomes using a
volunteer grid computing approach similar to Berkeley Open Infrastructure for
Network Computing (BOINC). However, since HTS read mapping requires substantial
computational resources and fast analysis turnaround is desired, Coinami uses
HTS read mapping as proof-of-work to generate valid blocks to maintain its own
cryptocurrency system, which may help motivate volunteers to dedicate more
resources. The Coinami protocol includes mechanisms to ensure that jobs
performed by volunteers are correct, and provides genomic data privacy. The
prototype implementation of Coinami is available at http://coinami.github.io/
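The following sketch illustrates the general idea, not Coinami's actual protocol: a block is accepted only if a spot-check of the submitted read alignments verifies, so useful mapping work plays the role that hash puzzles play in ordinary proof-of-work. The reference string, reads, and exact-match check are toy assumptions.

```python
# Illustrative sketch (not Coinami's protocol): a block is valid only if a
# random spot-check confirms the claimed read alignments.
import hashlib
import random

reference = "ACGTACGTTAGCACGT"   # toy reference genome

def verify_mapping(read, pos):
    """Re-check one claimed alignment by exact match (toy model)."""
    return reference[pos:pos + len(read)] == read

def block_valid(alignments, sample=2):
    checked = random.sample(alignments, min(sample, len(alignments)))
    return all(verify_mapping(read, pos) for read, pos in checked)

alignments = [("ACGT", 0), ("TAGC", 8), ("ACGT", 12)]   # (read, position)
if block_valid(alignments):
    block_id = hashlib.sha256(repr(alignments).encode()).hexdigest()
    print("block accepted:", block_id[:16])
```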