NebulOS: A Big Data Framework for Astrophysics
We introduce NebulOS, a Big Data platform that allows a cluster of Linux
machines to be treated as a single computer. With NebulOS, the process of
writing a massively parallel program for a datacenter is no more complicated
than writing a Python script for a desktop computer. The platform enables most
pre-existing data analysis software to be used, at scale, in a datacenter
without modification. The shallow learning curve and compatibility with
existing software greatly reduce the time required to develop distributed data
analysis pipelines. The platform is built upon industry-standard, open-source
Big Data technologies, from which it inherits several fault tolerance features.
NebulOS enhances these technologies by adding an intuitive user interface,
automated task monitoring, and other usability features. We present a summary
of the architecture, provide usage examples, and discuss the system's
performance scaling.
Comment: 15 pages, 13 figures
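
The abstract does not show the NebulOS interface itself, but the workflow it targets (driving an unmodified analysis program over many input files from a short Python script) can be sketched on a single machine with only the standard library; the analyze_spectrum tool and the data/*.fits layout below are hypothetical stand-ins, not part of NebulOS.

    # Single-machine analogy of the workflow described above: map an unmodified
    # command-line analysis tool over many input files in parallel.
    import glob
    import subprocess
    from concurrent.futures import ProcessPoolExecutor

    def analyze(path):
        # Run a pre-existing analysis program on one input file, unmodified.
        result = subprocess.run(["analyze_spectrum", path],   # hypothetical tool
                                capture_output=True, text=True)
        return path, result.returncode

    if __name__ == "__main__":
        files = glob.glob("data/*.fits")          # hypothetical input layout
        with ProcessPoolExecutor() as pool:
            for path, status in pool.map(analyze, files):
                print(path, "ok" if status == 0 else "failed")
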
A Big data analytical framework for portfolio optimization
With the advent of Web 2.0, various types of data are being produced every
day. This has led to the big data revolution. Huge amounts of structured and
unstructured data are produced in financial markets. Processing these data
could help an investor to make an informed investment decision. In this paper,
a framework has been developed to incorporate both structured and unstructured
data for portfolio optimization. Portfolio optimization consists of three
processes: Asset selection, Asset weighting and Asset management. This
framework proposes to achieve the first two processes using a 5-stage
methodology. The stages include shortlisting stocks using Data Envelopment
Analysis (DEA), incorporation of the qualitative factors using text mining,
stock clustering, stock ranking and optimizing the portfolio using heuristics.
This framework would help investors select appropriate assets to construct a
portfolio, invest in them to minimize risk and maximize return, and
monitor their performance.
Comment: Workshop on Internet and BigData Finance (WIBF 14
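
As a rough illustration of the five-stage methodology, the following sketch runs the pipeline on synthetic data; the DEA stage is replaced by a naive return-to-risk efficiency ratio, the text-mining scores are random stand-ins, and all counts and thresholds are assumptions rather than the paper's settings.

    # Simplified five-stage portfolio pipeline on synthetic data.
    import numpy as np
    from sklearn.cluster import KMeans

    rng = np.random.default_rng(0)
    n_stocks = 50
    returns = rng.normal(0.05, 0.2, n_stocks)     # outputs (e.g., annual return)
    risk = rng.uniform(0.1, 0.5, n_stocks)        # inputs (e.g., volatility)

    # Stage 1: shortlist stocks (naive stand-in for DEA efficiency scores).
    efficiency = returns / risk
    shortlist = np.argsort(efficiency)[-20:]

    # Stage 2: qualitative scores (stand-in for text mining of news/filings).
    sentiment = rng.uniform(0, 1, n_stocks)

    # Stage 3: cluster shortlisted stocks on (return, risk, sentiment).
    features = np.column_stack([returns, risk, sentiment])[shortlist]
    labels = KMeans(n_clusters=4, n_init=10, random_state=0).fit_predict(features)

    # Stage 4: rank within each cluster and pick the best stock per cluster.
    portfolio = [shortlist[labels == c][np.argmax(efficiency[shortlist][labels == c])]
                 for c in range(4)]

    # Stage 5: heuristic weighting (here: normalized inverse-risk weights).
    weights = (1 / risk[portfolio]) / (1 / risk[portfolio]).sum()
    print(list(zip(portfolio, weights.round(3))))
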
Decentralized Online Big Data Classification - a Bandit Framework
Distributed, online data mining systems have emerged as a result of
applications requiring analysis of large amounts of correlated and
high-dimensional data produced by multiple distributed data sources. We propose
a distributed online data classification framework where data is gathered by
distributed data sources and processed by a heterogeneous set of distributed
learners which learn online, at run-time, how to classify the different data
streams either by using their locally available classification functions or by
helping each other by classifying each other's data. Importantly, since the
data is gathered at different locations, sending the data to another learner to
process incurs additional costs such as delays, and hence this is only
beneficial if the gain obtained from a better classification exceeds
the cost. We assume that the classification functions available to each
processing element are fixed, but their prediction accuracy for various types
of incoming data is unknown and can change dynamically over time, and thus
they need to be learned online. We model the problem of joint classification by
the distributed and heterogeneous learners from multiple data sources as a
distributed contextual bandit problem where each data instance is characterized by a
specific context. We develop distributed online learning algorithms for which
we can prove that they have sublinear regret. Compared to prior work in
distributed online data mining, our work is the first to provide analytic
regret results characterizing the performance of the proposed algorithms.
Comment: arXiv admin note: substantial text overlap with arXiv:1307.078
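
The paper's algorithms are not reproduced here, but the per-learner decision it describes (choose a local classifier or pay a cost to forward data to a peer, under unknown, context-dependent accuracies) can be sketched with a simple UCB-style bandit; the accuracy table and forwarding cost below are synthetic assumptions.

    # Minimal UCB-style sketch of choosing among local classifiers or a peer.
    import numpy as np

    rng = np.random.default_rng(1)
    contexts = 3                                  # discretized context types
    arms = ["local_svm", "local_tree", "forward_to_peer"]
    true_acc = np.array([[0.6, 0.8, 0.7],         # accuracy of each arm per context
                         [0.9, 0.5, 0.7],
                         [0.4, 0.6, 0.9]])
    forward_cost = 0.1                            # cost of sending data to a peer

    counts = np.ones((contexts, len(arms)))
    rewards = np.zeros((contexts, len(arms)))

    for t in range(1, 5001):
        ctx = rng.integers(contexts)
        ucb = rewards[ctx] / counts[ctx] + np.sqrt(2 * np.log(t) / counts[ctx])
        arm = int(np.argmax(ucb))
        correct = rng.random() < true_acc[ctx, arm]
        reward = float(correct) - (forward_cost if arms[arm] == "forward_to_peer" else 0.0)
        counts[ctx, arm] += 1
        rewards[ctx, arm] += reward

    print(np.round(rewards / counts, 2))          # learned net value per context/arm
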
Development of a Big Data Framework for Connectomic Research
This paper outlines research and development of a new Hadoop-based
architecture for distributed processing and analysis of electron microscopy of
brains. We present the development of a new C++ library for implementing 3D image
analysis techniques, and its deployment in a distributed map/reduce framework. We
demonstrate our new framework on a subset of the Kasthuri11 dataset from the
Open Connectome Project.
Comment: 6 pages, 9 figures
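
The C++/Hadoop implementation is not shown in the abstract; the sketch below is a single-process Python analogy of the map/reduce decomposition it describes, with a simple thresholding step standing in for a real 3D analysis routine.

    # Map/reduce-style decomposition of a 3D volume into blocks (analogy only).
    import numpy as np
    from itertools import product

    volume = np.random.default_rng(2).random((256, 256, 256))  # synthetic 3D stack
    block = 64

    def split(vol, b):
        # Yield (block origin, block data) pairs covering the whole volume.
        for z, y, x in product(range(0, vol.shape[0], b),
                               range(0, vol.shape[1], b),
                               range(0, vol.shape[2], b)):
            yield (z, y, x), vol[z:z+b, y:y+b, x:x+b]

    def map_block(item):
        # Per-block "map": count bright voxels as a stand-in for detection.
        (z, y, x), data = item
        return (z, y, x), int((data > 0.9).sum())

    mapped = map(map_block, split(volume, block))
    total = sum(count for _, count in mapped)     # "reduce": aggregate results
    print("candidate voxels:", total)
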
A Hierarchical Distributed Processing Framework for Big Image Data
This paper introduces an effective processing framework named ICP (Image
Cloud Processing) to cope with the data explosion in the image
processing field. While most previous research focuses on optimizing image
processing algorithms to gain higher efficiency, our work is dedicated to
providing a general framework for those image processing algorithms, which can
be implemented in parallel so as to achieve a boost in time efficiency without
compromising result quality as the image scale increases. The
proposed ICP framework consists of two mechanisms, i.e. SICP (Static ICP) and
DICP (Dynamic ICP). Specifically, SICP is aimed at processing the big image
data pre-stored in the distributed system, while DICP is proposed for dynamic
input. To accomplish SICP, two novel data representations named P-Image and
Big-Image are designed to cooperate with MapReduce to achieve more optimized
configuration and higher efficiency. DICP is implemented through a parallel
processing procedure working with the traditional processing mechanism of the
distributed system. Representative results of comprehensive experiments on the
challenging ImageNet dataset are selected to validate the capacity of our
proposed ICP framework over traditional state-of-the-art methods, both in
time efficiency and quality of results.
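
The abstract does not detail the P-Image and Big-Image representations, so the sketch below only illustrates the two modes it names: SICP as a parallel batch map over pre-stored images, and DICP as per-item handling of dynamic input; the histogram task and image identifiers are hypothetical placeholders.

    # Schematic sketch of the two ICP modes (representations not modeled).
    from multiprocessing import Pool
    from queue import Queue

    def process(image_id):
        # Placeholder for a real image-processing algorithm.
        return image_id, f"histogram({image_id})"

    def sicp(stored_ids):
        # Static ICP: batch-process pre-stored image data in parallel.
        with Pool() as pool:
            return dict(pool.map(process, stored_ids))

    def dicp(incoming: Queue):
        # Dynamic ICP: process images as they arrive.
        while not incoming.empty():
            yield process(incoming.get())

    if __name__ == "__main__":
        print(sicp([f"img_{i:04d}" for i in range(8)]))
        q = Queue()
        for i in range(3):
            q.put(f"stream_img_{i}")
        print(list(dicp(q)))
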
Big Data Quality: A systematic literature review and future research directions
One of the most significant problems of Big Data is extracting knowledge
from the huge amount of data. The usefulness of the extracted information
depends strongly on data quality. Despite its importance, data quality
has only recently been taken into consideration by the big data community, and
no comprehensive review has been conducted in this area. Therefore, the purpose
of this study is to review and present the state of the art on the quality of
big data research through a hierarchical framework. The dimensions of the
proposed framework cover various aspects in the quality assessment of Big Data
including 1) the processing types of big data, i.e. stream, batch, and hybrid,
2) the main task, and 3) the method used to conduct the task. We compare and
critically review all of the studies reported during the last ten years through
our proposed framework to identify which of the available data quality
assessment methods have been successfully adopted by the big data community.
Finally, we provide a critical discussion on the limitations of existing
methods and offer suggestions on potential valuable research directions that
can be taken in future research in this domain.
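
The framework's three dimensions can be pictured as a small classification record; the entries below are hypothetical placeholders, not studies from the survey.

    # Tiny sketch of the review's three classification dimensions.
    from dataclasses import dataclass

    @dataclass
    class QualityStudy:
        processing_type: str   # "stream", "batch", or "hybrid"
        main_task: str         # e.g., quality assessment, data cleaning
        method: str            # technique used to conduct the task

    catalog = [
        QualityStudy("stream", "quality assessment", "rule-based profiling"),
        QualityStudy("batch", "data cleaning", "machine-learning repair"),
    ]
    print({s.processing_type: s.main_task for s in catalog})
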
Sleep Stage Classification: Scalability Evaluations of Distributed Approaches
Processing and analyzing massive clinical data is resource intensive and
time consuming with traditional analytic tools. Electroencephalography (EEG) is
one of the major technologies for detecting and diagnosing various brain
disorders, and it produces a huge volume of data to process. In this study, we
propose a big data framework to diagnose sleep disorders by classifying the
sleep stages from EEG signals. The framework is developed with open-source
Spark MLlib libraries. We also tested and evaluated the proposed framework by
measuring the scalability of well-known classification algorithms on
PhysioNet sleep records.
Comment: Proceedings of The Third International Conference on Data Mining,
Internet Computing, and Big Data, Konya, Turkey 201
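
A minimal sketch of what such a pipeline might look like with Spark MLlib, assuming EEG epochs have already been reduced to tabular features with numeric stage labels; the file name, feature columns, and the choice of a random forest are assumptions, since the abstract does not name the classifiers evaluated.

    # Hypothetical Spark MLlib sleep-stage classification sketch.
    from pyspark.sql import SparkSession
    from pyspark.ml.feature import VectorAssembler
    from pyspark.ml.classification import RandomForestClassifier
    from pyspark.ml.evaluation import MulticlassClassificationEvaluator

    spark = SparkSession.builder.appName("sleep-stages").getOrCreate()

    # Assumed input: one row per EEG epoch with precomputed band-power features.
    df = spark.read.csv("eeg_features.csv", header=True, inferSchema=True)
    feature_cols = ["delta_power", "theta_power", "alpha_power", "beta_power"]
    df = VectorAssembler(inputCols=feature_cols, outputCol="features").transform(df)

    train, test = df.randomSplit([0.8, 0.2], seed=42)
    model = RandomForestClassifier(labelCol="stage", featuresCol="features",
                                   numTrees=50).fit(train)

    accuracy = MulticlassClassificationEvaluator(
        labelCol="stage", predictionCol="prediction",
        metricName="accuracy").evaluate(model.transform(test))
    print("test accuracy:", accuracy)
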
United Statistical Algorithm, Small and Big Data: Future OF Statistician
This article discusses the role of big idea statisticians in the future of Big
Data Science. We describe the 'United Statistical Algorithms' framework for
comprehensive unification of traditional and novel statistical methods for
modeling Small Data and Big Data, especially mixed data (discrete, continuous).
BigOP: Generating Comprehensive Big Data Workloads as a Benchmarking Framework
Big Data is considered a proprietary asset of companies, organizations, and
even nations. Turning big data into real treasure requires the support of big
data systems. A variety of commercial and open source products have been
released for big data storage and processing. While big data users are facing
the choice of which system best suits their needs, big data system developers
are facing the question of how to evaluate their systems with regard to general
big data processing needs. System benchmarking is the classic way of meeting
the above demands. However, existing big data benchmarks either fail to
represent the variety of big data processing requirements, or target only one
specific platform, e.g. Hadoop.
In this paper, with our industrial partners, we present BigOP, an end-to-end
system benchmarking framework, featuring the abstraction of representative
Operation sets, workload Patterns, and prescribed tests. BigOP is part of an
open-source big data benchmarking project, BigDataBench. BigOP's abstraction
model not only guides the development of BigDataBench, but also enables
automatic generation of tests with comprehensive workloads.
We illustrate the feasibility of BigOP by implementing an automatic test
generation tool and benchmarking against three widely used big data processing
systems, i.e. Hadoop, Spark and MySQL Cluster. Three tests targeting three
different application scenarios are prescribed. The tests involve relational
data, text data and graph data, as well as all operations and workload
patterns. We report results following test specifications.
Comment: 10 pages, 3 figures
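
The abstraction the abstract describes, combining an operation set with workload patterns to generate tests automatically, might look roughly as follows; the concrete operation and pattern names are illustrative assumptions, not BigOP's actual vocabulary.

    # Schematic sketch: enumerate test cases from operations x workload patterns.
    from itertools import product

    operations = {
        "relational": ["select", "join", "aggregate"],
        "text": ["grep", "wordcount"],
        "graph": ["pagerank", "connected_components"],
    }
    patterns = ["single_pass_batch", "iterative", "interactive_query"]

    def generate_tests(data_types):
        """Yield (data type, operation, workload pattern) test specifications."""
        for dtype in data_types:
            for op, pattern in product(operations[dtype], patterns):
                yield {"data": dtype, "operation": op, "pattern": pattern}

    for spec in generate_tests(["relational", "graph"]):
        print(spec)
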
A Data Colocation Grid Framework for Big Data Medical Image Processing - Backend Design
When processing large medical imaging studies, adopting high performance grid
computing resources rapidly becomes important. We recently presented a "medical
image processing-as-a-service" grid framework that offers promise in utilizing
the Apache Hadoop ecosystem and HBase for data colocation by moving computation
close to medical image storage. However, the framework has not yet proven to be
easy to use in a heterogeneous hardware environment. Furthermore, the system
has not yet been validated for the variety of multi-level analyses in
medical imaging. Our target criteria are (1) improving the framework's
performance in a heterogeneous cluster, (2) performing population based summary
statistics on large datasets, and (3) introducing a table design scheme for
rapid NoSQL query. In this paper, we present a backend application programming
interface (API) design for Hadoop & HBase for Medical Image Processing. The
API includes: Upload, Retrieve, Remove, Load balancer and MapReduce templates.
A dataset summary statistic model is discussed and implemented using the
MapReduce paradigm. We introduce an HBase table scheme for fast data query to better
utilize the MapReduce model. Briefly, 5153 T1 images were retrieved from a
university secure database and used to empirically assess an in-house grid with
224 heterogeneous CPU cores. Results from three empirical experiments are presented
and discussed: (1) the load balancer yields a 1.5-fold wall-time improvement compared
with a framework using the built-in data allocation strategy, (2) the summary
statistic model is empirically verified on the grid framework and compared with
a cluster deployed with a standard Sun Grid Engine, yielding an 8-fold reduction
in wall clock time and a 14-fold reduction in resource time, and (3) the proposed HBase
table scheme improves MapReduce computation with a 7-fold reduction in wall time
compared with a naïve scheme when datasets are relatively small.
Comment: Accepted and awaiting publication at SPIE Medical Imaging,
International Society for Optics and Photonics, 201
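
A schematic sketch of the API surface named in the abstract (Upload, Retrieve, Remove) together with one plausible composite row-key layout for prefix scans; the project/subject/session/scan key and the in-memory store are illustration-only assumptions, not the paper's actual HBase table scheme.

    # Hypothetical backend API skeleton with a composite row-key helper.
    class ImagingBackend:
        def __init__(self):
            self._store = {}                       # in-memory stand-in for HBase

        @staticmethod
        def row_key(project, subject, session, scan):
            # Composite key so a prefix scan retrieves a whole project/subject.
            return f"{project}|{subject}|{session}|{scan}"

        def upload(self, key, image_bytes):
            self._store[key] = image_bytes

        def retrieve(self, key):
            return self._store.get(key)

        def remove(self, key):
            self._store.pop(key, None)

        def scan_prefix(self, prefix):
            # Emulates an HBase prefix scan used for rapid per-cohort queries.
            return {k: v for k, v in self._store.items() if k.startswith(prefix)}

    backend = ImagingBackend()
    key = ImagingBackend.row_key("proj01", "sub001", "ses01", "T1w")
    backend.upload(key, b"...image bytes...")
    print(list(backend.scan_prefix("proj01|sub001")))
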