Workflow-Based Big Data Analytics in the Cloud Environment: Present Research Status and Future Prospects
Workflow is a common term used to describe a systematic breakdown of tasks
that need to be performed to solve a problem. The concept has found its best use
in scientific and business applications for streamlining and improving the
performance of the underlying processes targeted towards achieving an outcome.
The growing complexity of big data analytical problems has invited the use of
scientific workflows for performing complex tasks for specific domain
applications. This research investigates the efficacy of workflow-based big
data analytics in the cloud environment, giving insights into the research
already performed in the area and the possible future research directions in
the field.
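The task-breakdown idea is easy to make concrete. Below is a minimal sketch of a workflow expressed as a dependency graph and executed in topological order; the task names and bodies are hypothetical, chosen only to illustrate the concept.

```python
# Minimal sketch: a workflow as a DAG of tasks executed in dependency order.
# Task names and functions are hypothetical, for illustration only.
from graphlib import TopologicalSorter

def ingest():  print("ingest raw data")
def clean():   print("clean and validate")
def analyze(): print("run analytics")
def report():  print("produce report")

# Each task maps to the set of tasks it depends on.
workflow = {
    "clean":   {"ingest"},
    "analyze": {"clean"},
    "report":  {"analyze"},
}
tasks = {"ingest": ingest, "clean": clean, "analyze": analyze, "report": report}

for name in TopologicalSorter(workflow).static_order():
    tasks[name]()   # runs ingest -> clean -> analyze -> report
```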
Big Data Analytics in Bioinformatics: A Machine Learning Perspective
Bioinformatics research is characterized by voluminous and incremental
datasets and complex data analytics methods. The machine learning methods used
in bioinformatics are iterative and parallel. These methods can be scaled to
handle big data using distributed and parallel computing technologies.
However, big data tools usually perform computation in batch mode and are not
optimized for iterative processing and high data dependency among operations.
In recent years, parallel, incremental, and multi-view machine learning
algorithms have been proposed. Similarly, graph-based architectures and
in-memory big data tools have been developed to minimize I/O cost and optimize
iterative processing.
However, standard big data architectures and tools are still lacking for many
important bioinformatics problems, such as the fast construction of co-expression
and regulatory networks and salient module identification, detection of
complexes over growing protein-protein interaction data, fast analysis of
massive DNA, RNA, and protein sequence data, and fast querying on incremental
and heterogeneous disease networks. This paper addresses the issues and
challenges posed by several big data problems in bioinformatics, and gives an
overview of the state of the art and the future research opportunities.
Comment: 20-page survey paper on big data analytics in bioinformatics
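As an illustration of the iterative, in-memory pattern described above, the following sketch caches a dataset in cluster memory once and then iterates over it without repeated disk I/O. It assumes PySpark is installed; the file path and the toy one-dimensional estimation loop are illustrative, not taken from the paper.

```python
# Sketch of the in-memory pattern the survey describes: cache the working
# dataset once, then iterate over it without re-reading from disk.
# Assumes PySpark; the path and the toy loop are hypothetical.
from pyspark import SparkContext

sc = SparkContext(appName="iterative-bio-ml")

# e.g. one expression value per line; parse once and keep in cluster memory.
data = sc.textFile("hdfs:///expression/values.txt") \
         .map(float) \
         .cache()                # avoids repeated I/O across iterations

center = 0.0
for _ in range(10):              # toy gradient steps toward the data mean
    grad = data.map(lambda x, c=center: c - x).mean()  # reuses cached partitions
    center -= 0.5 * grad
print("estimated center:", center)
```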
Analytical Cost Metrics: Days of Future Past
As we move towards the exascale era, new architectures must be capable of
running massive computational problems efficiently. Scientists and
researchers are continuously investing in tuning the performance of
extreme-scale computational problems. These problems arise in almost all areas
of computing, ranging from big data analytics, artificial intelligence, search,
machine learning, virtual/augmented reality, computer vision, image/signal
processing to computational science and bioinformatics. With Moore's law
driving the evolution of hardware platforms towards exascale, the dominant
performance metric (time efficiency) has now expanded to also incorporate
power/energy efficiency. Therefore, the major challenge that we face in
computing systems research is: "how to solve massive-scale computational
problems in the most time/power/energy efficient manner?"
Architectures are constantly evolving, making current performance-optimization
strategies less applicable and requiring new strategies to be invented. The
solution is for the new architectures, new programming models, and applications
to go forward together. Doing this is, however, extremely hard. There are too
many design choices in too many dimensions. We propose the following strategy
to solve the problem: (i) Models - Develop accurate analytical models (e.g.
execution time, energy, silicon area) to predict the cost of executing a given
program, and (ii) Complete System Design - Simultaneously optimize all the cost
models for the programs (computational problems) to obtain the most
time/area/power/energy efficient solution. Such an optimization problem evokes
the notion of codesign.
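A minimal sketch of the proposed strategy, under assumed model coefficients: build simple analytical models for execution time and power over a small design space, then jointly optimize them, here by minimizing the energy-delay product.

```python
# Hedged sketch of the two-step strategy: (i) analytical cost models for time
# and power, (ii) joint optimization over a design space (codesign).
# All coefficients and design points below are illustrative, not measured.

design_points = [(cores, ghz) for cores in (8, 16, 32) for ghz in (1.0, 2.0, 3.0)]

WORK = 1e12          # total operations in the program (assumed)
DYN_COEFF = 0.5      # dynamic power coefficient, W per core per GHz^3 (assumed)
STATIC_W = 10.0      # static power per core in watts (assumed)

def exec_time(cores, ghz):
    # Amdahl-style model: 5% serial fraction, rest perfectly parallel.
    serial, parallel = 0.05 * WORK, 0.95 * WORK
    ops_per_sec = ghz * 1e9
    return serial / ops_per_sec + parallel / (cores * ops_per_sec)

def power(cores, ghz):
    return cores * (STATIC_W + DYN_COEFF * ghz ** 3)

# Optimize all cost models at once via the energy-delay product (EDP).
best = min(design_points, key=lambda p: power(*p) * exec_time(*p) ** 2)
print("best (cores, GHz):", best)
```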
Big Data Analytics for Dynamic Energy Management in Smart Grids
The smart electricity grid enables a two-way flow of power and data between
suppliers and consumers in order to facilitate the power flow optimization in
terms of economic efficiency, reliability and sustainability. This
infrastructure permits the consumers and the micro-energy producers to take a
more active role in the electricity market and the dynamic energy management
(DEM). The most important challenge in a smart grid (SG) is how to take
advantage of the users' participation in order to reduce the cost of power.
However, effective DEM depends critically on load and renewable production
forecasting. This calls for intelligent methods and solutions for the real-time
exploitation of the large volumes of data generated by a vast amount of smart
meters. Hence, robust data analytics, high performance computing, efficient
data network management, and cloud computing techniques are critical towards
the optimized operation of SGs. This research aims to highlight the big data
issues and challenges faced by the DEM employed in SG networks. It also
provides a brief description of the most commonly used data processing methods
in the literature, and proposes a promising direction for future research in
the field.
Comment: Published in Elsevier Big Data Research
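To make the forecasting step concrete, here is a minimal sketch of an autoregressive load forecast of the kind DEM relies on; the synthetic meter series and the 24-hour lag order are assumptions for illustration.

```python
# Toy autoregressive forecast of aggregated smart-meter load.
# The synthetic daily-cycle series and the lag order p are assumed.
import numpy as np

rng = np.random.default_rng(0)
hours = np.arange(24 * 30)                 # 30 days of hourly readings
load = 50 + 20 * np.sin(2 * np.pi * hours / 24) + rng.normal(0, 2, hours.size)

p = 24                                     # use the previous day as lags
n = load.size
X = np.column_stack([load[p - k: n - k] for k in range(1, p + 1)])
y = load[p:]
coef, *_ = np.linalg.lstsq(X, y, rcond=None)   # fit AR(p) by least squares

# One-step-ahead forecast from the most recent p observations.
next_hour = load[-1:-p - 1:-1] @ coef
print(f"forecast for next hour: {next_hour:.1f}")
```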
Cloud Computing - Architecture and Applications
In the era of the Internet of Things, with the explosive worldwide growth of
electronic data volume and the associated need for processing, analysis, and
storage of such a humongous volume of data, it has now become mandatory to
exploit the power of massively parallel architectures for fast computation.
Cloud computing provides a cheap source of such a computing framework for
large volumes of data in real-time applications. It is, therefore, not surprising to
see that cloud computing has become a buzzword in the computing fraternity over
the last decade. This book presents some critical applications in cloud
frameworks along with some innovation design of algorithms and architecture for
deployment in cloud environment. It is a valuable source of knowledge for
researchers, engineers, practitioners, and graduate and doctoral students
working in the field of cloud computing. It will also be useful for faculty
members of graduate schools and universities.
Comment: Edited volume published by Intech Publishers, Croatia, June 2017. 138
pages. ISBN 978-953-51-3244-8, Print ISBN 978-953-51-3243-1. Link:
https://www.intechopen.com/books/cloud-computing-architecture-and-application
A Survey of Parallel Sequential Pattern Mining
With the growing popularity of shared resources, large volumes of complex
data of different types are collected automatically. Traditional data mining
algorithms generally face challenges including huge memory cost,
low processing speed, and inadequate hard disk space. As a fundamental task of
data mining, sequential pattern mining (SPM) is used in a wide variety of
real-life applications. However, it is more complex and challenging than other
pattern mining tasks, such as frequent itemset mining and association rule
mining, and also suffers from the above challenges when handling the
large-scale data. To solve these problems, mining sequential patterns in a
parallel or distributed computing environment has emerged as an important issue
with many applications. In this paper, we provide an in-depth survey of the
current status of parallel sequential pattern mining (PSPM), including a
detailed categorization of traditional serial SPM approaches and of
state-of-the-art parallel SPM. We review the related work on parallel
sequential pattern mining in detail, including partition-based algorithms for
PSPM, Apriori-based PSPM, pattern growth based PSPM, and hybrid algorithms for
PSPM, and give an in-depth description (characteristics, advantages,
disadvantages, and a summary) of these parallel approaches to PSPM. Some
advanced topics for PSPM, including parallel quantitative/weighted/utility
sequential pattern mining, PSPM from uncertain data and stream data, hardware
acceleration for PSPM, are further reviewed in detail. We also review some
well-known open-source software for PSPM. Finally, we summarize some
challenges and opportunities of PSPM in the big data era.
Comment: Accepted by ACM Trans. on Knowl. Discov. Data, 33 pages
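The count-distribution idea behind partition-based PSPM can be sketched in a few lines: each worker counts a candidate pattern's support on its own partition of the sequence database, and the local counts are summed. The toy database and pattern below are illustrative assumptions.

```python
# Sketch of count distribution in partition-based PSPM: workers count a
# candidate pattern's support on disjoint partitions, then counts are summed.
from multiprocessing import Pool

def is_subsequence(pattern, sequence):
    """True if pattern occurs in sequence with gaps allowed."""
    it = iter(sequence)
    return all(item in it for item in pattern)

def local_support(args):
    pattern, partition = args
    return sum(is_subsequence(pattern, seq) for seq in partition)

if __name__ == "__main__":
    db = [["a", "b", "c"], ["a", "c"], ["b", "a", "c"], ["a", "b"]]  # toy data
    pattern = ["a", "c"]
    partitions = [db[:2], db[2:]]            # two workers, two partitions
    with Pool(2) as pool:
        counts = pool.map(local_support, [(pattern, p) for p in partitions])
    print("support:", sum(counts))           # 3 of 4 sequences contain <a, c>
```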
A Taxonomy and Survey on eScience as a Service in the Cloud
Cloud computing has recently evolved as a popular computing infrastructure
for many applications. Scientific computing, which was mainly hosted in private
clusters and grids, has started to migrate development and deployment to the
public cloud environment. eScience as a service becomes an emerging and
promising direction for science computing. We review recent efforts in
developing and deploying scientific computing applications in the cloud. In
particular, we introduce a taxonomy specifically designed for scientific
computing in the cloud, and further review the taxonomy with four major kinds
of science applications, including life sciences, physics sciences, social and
humanities sciences, and climate and earth sciences. Our major finding is that,
despite existing efforts in developing cloud-based eScience, eScience still has
a long way to go to fully unlock the power of the cloud computing paradigm.
Therefore, we present the challenges and opportunities in the future
development of cloud-based eScience services, and call for collaborations and
innovations from both the scientific and computer system communities to address
those challenges.
Resource Management and Scheduling for Big Data Applications in Cloud Computing Environments
This chapter presents the software architectures of big data processing
platforms. It provides in-depth knowledge of the resource management
techniques involved in deploying big data processing systems in cloud
environments. It starts from the very basics and gradually introduces the
core components of resource management, which we have divided into multiple
layers. It covers state-of-the-art practices and research in SLA-based
resource management, with a specific focus on job scheduling mechanisms.
Comment: 27 pages, 9 figures
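As a toy illustration of SLA-based scheduling of the kind the chapter surveys, the sketch below assigns jobs in earliest-deadline-first order to the machine that frees up soonest and flags SLA violations; the job sizes, deadlines, and uniform machine speed are all assumptions.

```python
# Hedged sketch: earliest-deadline-first (EDF) assignment of jobs onto the
# least-loaded machine, checking each finish time against its SLA deadline.
import heapq

jobs = [  # (deadline_s, job_id, work_units) -- made-up figures
    (100, "etl",   300),
    (60,  "query",  90),
    (200, "train", 800),
]
machines = [0.0, 0.0]          # next-free time per machine, in seconds
SPEED = 5.0                    # work units per second (assumed uniform)

heapq.heapify(jobs)            # order by SLA deadline (EDF)
while jobs:
    deadline, job_id, work = heapq.heappop(jobs)
    m = min(range(len(machines)), key=machines.__getitem__)
    finish = machines[m] + work / SPEED
    machines[m] = finish
    status = "meets" if finish <= deadline else "VIOLATES"
    print(f"{job_id}: machine {m}, finish {finish:.0f}s, {status} SLA {deadline}s")
```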
DAME: A Distributed Data Mining & Exploration Framework within the Virtual Observatory
Nowadays, many scientific areas share the same broad requirements of being
able to deal with massive and distributed datasets while, when possible, being
integrated with services and applications. To close the growing gap between
the incremental generation of data and our understanding of it, we must know
how to access, retrieve, analyze, mine, and integrate data from
disparate sources. A fundamental requirement for any new-generation data
mining tool or package that aims to become a service for the community is
that it can be used within complex workflows, which each user can fine-tune
to match the specific demands of their scientific goals. These workflows
often need to access different resources (data, providers, computing
facilities, and packages) and require strict interoperability on (at least)
the client side. The DAME (DAta Mining & Exploration) project arises
from these requirements by providing a distributed web-based data mining
infrastructure specialized in massive data set exploration with soft computing
methods. Originally designed for astrophysical use cases, where early
scientific applications demonstrated its effectiveness, the DAME Suite has
evolved into a multi-disciplinary, platform-independent tool fully compliant
with modern KDD (Knowledge Discovery in Databases) requirements and
Information & Communication Technology trends.
Comment: 20 pages, INGRID 2010 - 5th International Workshop on Distributed
Cooperative Laboratories: "Instrumenting" the Grid, May 12-14, 2010, Poznan,
Poland; Volume Remote Instrumentation for eScience and Related Aspects, 2011,
F. Davoli et al. (eds.), SPRINGER N
Coinami: A Cryptocurrency with DNA Sequence Alignment as Proof-of-work
The rate of growth of data generated by high-throughput sequencing (HTS)
platforms now exceeds the growth stipulated by Moore's Law. HTS data volumes
are expected to surpass those of other "big data" domains, such as astronomy,
before the year 2025. In addition to sequencing genomes for research
purposes, genome and exome sequencing in clinical settings will be a routine
part of health care. The analysis of such large amounts of data, however, is
not without computational challenges. This burden is further increased by
periodic updates to reference genomes, which typically require re-analysis
of existing data. Here we propose Coin-Application Mediator Interface (Coinami)
to distribute the workload for mapping reads to reference genomes using a
volunteer grid computing approach similar to Berkeley Open Infrastructure for
Network Computing (BOINC). However, since HTS read mapping requires substantial
computational resources and fast analysis turnaround is desired, Coinami uses
HTS read mapping as proof-of-work to generate valid blocks to maintain its own
cryptocurrency system, which may help motivate volunteers to dedicate more
resources. The Coinami protocol includes mechanisms to ensure that jobs
performed by volunteers are correct, and provides genomic data privacy. The
prototype implementation of Coinami is available at http://coinami.github.io/
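The following sketch illustrates the general idea, not Coinami's actual protocol: a block is accepted only if a spot-check of the submitted read alignments verifies, so useful mapping work plays the role that hash puzzles play in ordinary proof-of-work. The reference string, reads, and exact-match check are toy assumptions.

```python
# Illustrative sketch (not Coinami's protocol): a block is valid only if a
# random spot-check confirms the claimed read alignments.
import hashlib
import random

reference = "ACGTACGTTAGCACGT"   # toy reference genome

def verify_mapping(read, pos):
    """Re-check one claimed alignment by exact match (toy model)."""
    return reference[pos:pos + len(read)] == read

def block_valid(alignments, sample=2):
    checked = random.sample(alignments, min(sample, len(alignments)))
    return all(verify_mapping(read, pos) for read, pos in checked)

alignments = [("ACGT", 0), ("TAGC", 8), ("ACGT", 12)]   # (read, position)
if block_valid(alignments):
    block_id = hashlib.sha256(repr(alignments).encode()).hexdigest()
    print("block accepted:", block_id[:16])
```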