NebulOS: A Big Data Framework for Astrophysics
We introduce NebulOS, a Big Data platform that allows a cluster of Linux
machines to be treated as a single computer. With NebulOS, the process of
writing a massively parallel program for a datacenter is no more complicated
than writing a Python script for a desktop computer. The platform enables most
pre-existing data analysis software to be used, at scale, in a datacenter
without modification. The shallow learning curve and compatibility with
existing software greatly reduce the time required to develop distributed data
analysis pipelines. The platform is built upon industry-standard, open-source
Big Data technologies, from which it inherits several fault tolerance features.
NebulOS enhances these technologies by adding an intuitive user interface,
automated task monitoring, and other usability features. We present a summary
of the architecture, provide usage examples, and discuss the system's
performance scaling.
Comment: 15 pages, 13 figures
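
The abstract does not show the NebulOS interface itself, but the workflow it targets (driving an unmodified analysis program over many input files from a short Python script) can be sketched on a single machine with only the standard library; the analyze_spectrum tool and the data/*.fits layout below are hypothetical stand-ins, not part of NebulOS.

    # Single-machine analogy of the workflow described above: map an unmodified
    # command-line analysis tool over many input files in parallel.
    import glob
    import subprocess
    from concurrent.futures import ProcessPoolExecutor

    def analyze(path):
        # Run a pre-existing analysis program on one input file, unmodified.
        result = subprocess.run(["analyze_spectrum", path],   # hypothetical tool
                                capture_output=True, text=True)
        return path, result.returncode

    if __name__ == "__main__":
        files = glob.glob("data/*.fits")          # hypothetical input layout
        with ProcessPoolExecutor() as pool:
            for path, status in pool.map(analyze, files):
                print(path, "ok" if status == 0 else "failed")
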
A Big data analytical framework for portfolio optimization
With the advent of Web 2.0, various types of data are being produced every
day. This has led to the big data revolution. Huge amounts of structured and
unstructured data are produced in financial markets. Processing these data
could help an investor to make an informed investment decision. In this paper,
a framework has been developed to incorporate both structured and unstructured
data for portfolio optimization. Portfolio optimization consists of three
processes: Asset selection, Asset weighting and Asset management. This
framework proposes to achieve the first two processes using a 5-stage
methodology. The stages include shortlisting stocks using Data Envelopment
Analysis (DEA), incorporation of the qualitative factors using text mining,
stock clustering, stock ranking and optimizing the portfolio using heuristics.
This framework would help investors select appropriate assets to construct a
portfolio, invest in them to minimize risk and maximize return, and
monitor their performance.
Comment: Workshop on Internet and BigData Finance (WIBF 14
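
As a rough illustration of the five-stage methodology, the following sketch runs the pipeline on synthetic data; the DEA stage is replaced by a naive return-to-risk efficiency ratio, the text-mining scores are random stand-ins, and all counts and thresholds are assumptions rather than the paper's settings.

    # Simplified five-stage portfolio pipeline on synthetic data.
    import numpy as np
    from sklearn.cluster import KMeans

    rng = np.random.default_rng(0)
    n_stocks = 50
    returns = rng.normal(0.05, 0.2, n_stocks)     # outputs (e.g., annual return)
    risk = rng.uniform(0.1, 0.5, n_stocks)        # inputs (e.g., volatility)

    # Stage 1: shortlist stocks (naive stand-in for DEA efficiency scores).
    efficiency = returns / risk
    shortlist = np.argsort(efficiency)[-20:]

    # Stage 2: qualitative scores (stand-in for text mining of news/filings).
    sentiment = rng.uniform(0, 1, n_stocks)

    # Stage 3: cluster shortlisted stocks on (return, risk, sentiment).
    features = np.column_stack([returns, risk, sentiment])[shortlist]
    labels = KMeans(n_clusters=4, n_init=10, random_state=0).fit_predict(features)

    # Stage 4: rank within each cluster and pick the best stock per cluster.
    portfolio = [shortlist[labels == c][np.argmax(efficiency[shortlist][labels == c])]
                 for c in range(4)]

    # Stage 5: heuristic weighting (here: normalized inverse-risk weights).
    weights = (1 / risk[portfolio]) / (1 / risk[portfolio]).sum()
    print(list(zip(portfolio, weights.round(3))))
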
Decentralized Online Big Data Classification - a Bandit Framework
Distributed, online data mining systems have emerged as a result of
applications requiring analysis of large amounts of correlated and
high-dimensional data produced by multiple distributed data sources. We propose
a distributed online data classification framework where data is gathered by
distributed data sources and processed by a heterogeneous set of distributed
learners which learn online, at run-time, how to classify the different data
streams either by using their locally available classification functions or by
helping each other by classifying each other's data. Importantly, since the
data is gathered at different locations, sending the data to another learner to
process incurs additional costs such as delays, and hence this is only
beneficial if the gain obtained from a better classification exceeds
the cost. We assume that the classification functions available to each
processing element are fixed, but their prediction accuracy for various types
of incoming data is unknown and can change dynamically over time, and thus
they need to be learned online. We model the problem of joint classification by
the distributed and heterogeneous learners from multiple data sources as a
distributed contextual bandit problem where each data instance is characterized by a
specific context. We develop distributed online learning algorithms for which
we can prove that they have sublinear regret. Compared to prior work in
distributed online data mining, our work is the first to provide analytic
regret results characterizing the performance of the proposed algorithms.
Comment: arXiv admin note: substantial text overlap with arXiv:1307.078
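
The paper's algorithms are not reproduced here, but the per-learner decision it describes (choose a local classifier or pay a cost to forward data to a peer, under unknown, context-dependent accuracies) can be sketched with a simple UCB-style bandit; the accuracy table and forwarding cost below are synthetic assumptions.

    # Minimal UCB-style sketch of choosing among local classifiers or a peer.
    import numpy as np

    rng = np.random.default_rng(1)
    contexts = 3                                  # discretized context types
    arms = ["local_svm", "local_tree", "forward_to_peer"]
    true_acc = np.array([[0.6, 0.8, 0.7],         # accuracy of each arm per context
                         [0.9, 0.5, 0.7],
                         [0.4, 0.6, 0.9]])
    forward_cost = 0.1                            # cost of sending data to a peer

    counts = np.ones((contexts, len(arms)))
    rewards = np.zeros((contexts, len(arms)))

    for t in range(1, 5001):
        ctx = rng.integers(contexts)
        ucb = rewards[ctx] / counts[ctx] + np.sqrt(2 * np.log(t) / counts[ctx])
        arm = int(np.argmax(ucb))
        correct = rng.random() < true_acc[ctx, arm]
        reward = float(correct) - (forward_cost if arms[arm] == "forward_to_peer" else 0.0)
        counts[ctx, arm] += 1
        rewards[ctx, arm] += reward

    print(np.round(rewards / counts, 2))          # learned net value per context/arm
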
Development of a Big Data Framework for Connectomic Research
This paper outlines research and development of a new Hadoop-based
architecture for distributed processing and analysis of electron microscopy of
brains. We present the development of a new C++ library for implementing 3D image
analysis techniques, and its deployment in a distributed map/reduce framework. We
demonstrate our new framework on a subset of the Kasthuri11 dataset from the
Open Connectome Project.
Comment: 6 pages, 9 figures
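
The C++/Hadoop implementation is not shown in the abstract; the sketch below is a single-process Python analogy of the map/reduce decomposition it describes, with a simple thresholding step standing in for a real 3D analysis routine.

    # Map/reduce-style decomposition of a 3D volume into blocks (analogy only).
    import numpy as np
    from itertools import product

    volume = np.random.default_rng(2).random((256, 256, 256))  # synthetic 3D stack
    block = 64

    def split(vol, b):
        # Yield (block origin, block data) pairs covering the whole volume.
        for z, y, x in product(range(0, vol.shape[0], b),
                               range(0, vol.shape[1], b),
                               range(0, vol.shape[2], b)):
            yield (z, y, x), vol[z:z+b, y:y+b, x:x+b]

    def map_block(item):
        # Per-block "map": count bright voxels as a stand-in for detection.
        (z, y, x), data = item
        return (z, y, x), int((data > 0.9).sum())

    mapped = map(map_block, split(volume, block))
    total = sum(count for _, count in mapped)     # "reduce": aggregate results
    print("candidate voxels:", total)
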
A Hierarchical Distributed Processing Framework for Big Image Data
This paper introduces an effective processing framework named ICP (Image
Cloud Processing) to cope with the data explosion in the image
processing field. While most previous research focuses on optimizing image
processing algorithms to gain higher efficiency, our work is dedicated to
providing a general framework for those image processing algorithms, which can
be implemented in parallel so as to achieve a boost in time efficiency without
compromising result quality as the image scale increases. The
proposed ICP framework consists of two mechanisms, i.e. SICP (Static ICP) and
DICP (Dynamic ICP). Specifically, SICP is aimed at processing the big image
data pre-stored in the distributed system, while DICP is proposed for dynamic
input. To accomplish SICP, two novel data representations named P-Image and
Big-Image are designed to cooperate with MapReduce to achieve more optimized
configuration and higher efficiency. DICP is implemented through a parallel
processing procedure working with the traditional processing mechanism of the
distributed system. Representative results of comprehensive experiments on the
challenging ImageNet dataset are selected to validate the capacity of our
proposed ICP framework over traditional state-of-the-art methods, both in
time efficiency and quality of results.
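
The abstract does not detail the P-Image and Big-Image representations, so the sketch below only illustrates the two modes it names: SICP as a parallel batch map over pre-stored images, and DICP as per-item handling of dynamic input; the histogram task and image identifiers are hypothetical placeholders.

    # Schematic sketch of the two ICP modes (representations not modeled).
    from multiprocessing import Pool
    from queue import Queue

    def process(image_id):
        # Placeholder for a real image-processing algorithm.
        return image_id, f"histogram({image_id})"

    def sicp(stored_ids):
        # Static ICP: batch-process pre-stored image data in parallel.
        with Pool() as pool:
            return dict(pool.map(process, stored_ids))

    def dicp(incoming: Queue):
        # Dynamic ICP: process images as they arrive.
        while not incoming.empty():
            yield process(incoming.get())

    if __name__ == "__main__":
        print(sicp([f"img_{i:04d}" for i in range(8)]))
        q = Queue()
        for i in range(3):
            q.put(f"stream_img_{i}")
        print(list(dicp(q)))
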
Big Data Quality: A systematic literature review and future research directions
One of the most significant problems of Big Data is extracting knowledge
from the huge amount of data. The usefulness of the extracted information
depends strongly on data quality. Despite its importance, data quality
has only recently been taken into consideration by the big data community, and
no comprehensive review has been conducted in this area. Therefore, the purpose
of this study is to review and present the state of the art on the quality of
big data research through a hierarchical framework. The dimensions of the
proposed framework cover various aspects in the quality assessment of Big Data
including 1) the processing types of big data, i.e. stream, batch, and hybrid,
2) the main task, and 3) the method used to conduct the task. We compare and
critically review all of the studies reported during the last ten years through
our proposed framework to identify which of the available data quality
assessment methods have been successfully adopted by the big data community.
Finally, we provide a critical discussion on the limitations of existing
methods and offer suggestions on potential valuable research directions that
can be taken in future research in this domain.
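
The framework's three dimensions can be pictured as a small classification record; the entries below are hypothetical placeholders, not studies from the survey.

    # Tiny sketch of the review's three classification dimensions.
    from dataclasses import dataclass

    @dataclass
    class QualityStudy:
        processing_type: str   # "stream", "batch", or "hybrid"
        main_task: str         # e.g., quality assessment, data cleaning
        method: str            # technique used to conduct the task

    catalog = [
        QualityStudy("stream", "quality assessment", "rule-based profiling"),
        QualityStudy("batch", "data cleaning", "machine-learning repair"),
    ]
    print({s.processing_type: s.main_task for s in catalog})
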
Sleep Stage Classification: Scalability Evaluations of Distributed Approaches
Processing and analyzing massive clinical data is resource intensive and
time consuming with traditional analytic tools. Electroencephalography (EEG) is
one of the major technologies for detecting and diagnosing various brain
disorders, and it produces a huge volume of data to process. In this study, we
propose a big data framework to diagnose sleep disorders by classifying the
sleep stages from EEG signals. The framework is developed with open-source
Spark MLlib libraries. We also tested and evaluated the proposed framework by
measuring the scalability of well-known classification algorithms on
PhysioNet sleep records.
Comment: Proceedings of The Third International Conference on Data Mining,
Internet Computing, and Big Data, Konya, Turkey 201
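
A minimal sketch of what such a pipeline might look like with Spark MLlib, assuming EEG epochs have already been reduced to tabular features with numeric stage labels; the file name, feature columns, and the choice of a random forest are assumptions, since the abstract does not name the classifiers evaluated.

    # Hypothetical Spark MLlib sleep-stage classification sketch.
    from pyspark.sql import SparkSession
    from pyspark.ml.feature import VectorAssembler
    from pyspark.ml.classification import RandomForestClassifier
    from pyspark.ml.evaluation import MulticlassClassificationEvaluator

    spark = SparkSession.builder.appName("sleep-stages").getOrCreate()

    # Assumed input: one row per EEG epoch with precomputed band-power features.
    df = spark.read.csv("eeg_features.csv", header=True, inferSchema=True)
    feature_cols = ["delta_power", "theta_power", "alpha_power", "beta_power"]
    df = VectorAssembler(inputCols=feature_cols, outputCol="features").transform(df)

    train, test = df.randomSplit([0.8, 0.2], seed=42)
    model = RandomForestClassifier(labelCol="stage", featuresCol="features",
                                   numTrees=50).fit(train)

    accuracy = MulticlassClassificationEvaluator(
        labelCol="stage", predictionCol="prediction",
        metricName="accuracy").evaluate(model.transform(test))
    print("test accuracy:", accuracy)
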
United Statistical Algorithm, Small and Big Data: Future OF Statistician
This article discusses the role of big idea statisticians in the future of Big
Data Science. We describe the 'United Statistical Algorithms' framework for
comprehensive unification of traditional and novel statistical methods for
modeling Small Data and Big Data, especially mixed data (discrete, continuous).
BigOP: Generating Comprehensive Big Data Workloads as a Benchmarking Framework
Big Data is considered a proprietary asset of companies, organizations, and
even nations. Turning big data into real treasure requires the support of big
data systems. A variety of commercial and open source products have been
released for big data storage and processing. While big data users are facing
the choice of which system best suits their needs, big data system developers
are facing the question of how to evaluate their systems with regard to general
big data processing needs. System benchmarking is the classic way of meeting
the above demands. However, existing big data benchmarks either fail to
represent the variety of big data processing requirements, or target only one
specific platform, e.g. Hadoop.
In this paper, with our industrial partners, we present BigOP, an end-to-end
system benchmarking framework, featuring the abstraction of representative
Operation sets, workload Patterns, and prescribed tests. BigOP is part of an
open-source big data benchmarking project, BigDataBench. BigOP's abstraction
model not only guides the development of BigDataBench, but also enables
automatic generation of tests with comprehensive workloads.
We illustrate the feasibility of BigOP by implementing an automatic test
generation tool and benchmarking against three widely used big data processing
systems, i.e. Hadoop, Spark and MySQL Cluster. Three tests targeting three
different application scenarios are prescribed. The tests involve relational
data, text data and graph data, as well as all operations and workload
patterns. We report results following test specifications.
Comment: 10 pages, 3 figures
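
The abstraction the abstract describes, combining an operation set with workload patterns to generate tests automatically, might look roughly as follows; the concrete operation and pattern names are illustrative assumptions, not BigOP's actual vocabulary.

    # Schematic sketch: enumerate test cases from operations x workload patterns.
    from itertools import product

    operations = {
        "relational": ["select", "join", "aggregate"],
        "text": ["grep", "wordcount"],
        "graph": ["pagerank", "connected_components"],
    }
    patterns = ["single_pass_batch", "iterative", "interactive_query"]

    def generate_tests(data_types):
        """Yield (data type, operation, workload pattern) test specifications."""
        for dtype in data_types:
            for op, pattern in product(operations[dtype], patterns):
                yield {"data": dtype, "operation": op, "pattern": pattern}

    for spec in generate_tests(["relational", "graph"]):
        print(spec)
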
A Data Colocation Grid Framework for Big Data Medical Image Processing - Backend Design
When processing large medical imaging studies, adopting high performance grid
computing resources rapidly becomes important. We recently presented a "medical
image processing-as-a-service" grid framework that offers promise in utilizing
the Apache Hadoop ecosystem and HBase for data colocation by moving computation
close to medical image storage. However, the framework has not yet proven to be
easy to use in a heterogeneous hardware environment. Furthermore, the system
has not yet been validated for the variety of multi-level analyses in
medical imaging. Our target criteria are (1) improving the framework's
performance in a heterogeneous cluster, (2) performing population based summary
statistics on large datasets, and (3) introducing a table design scheme for
rapid NoSQL query. In this paper, we present a backend application programming
interface (API) design for Hadoop & HBase for Medical Image Processing. The
API includes: Upload, Retrieve, Remove, Load balancer and MapReduce templates.
A dataset summary statistic model is discussed and implemented using the
MapReduce paradigm. We introduce an HBase table scheme for fast data query to better
utilize the MapReduce model. Briefly, 5153 T1 images were retrieved from a
university secure database and used to empirically assess an in-house grid with
224 heterogeneous CPU cores. Results from three empirical experiments are presented
and discussed: (1) the load balancer yields a 1.5-fold wall-time improvement compared
with a framework using the built-in data allocation strategy, (2) the summary
statistic model is empirically verified on the grid framework and compared with
a cluster deployed with a standard Sun Grid Engine, yielding an 8-fold reduction
in wall clock time and a 14-fold reduction in resource time, and (3) the proposed HBase
table scheme improves MapReduce computation with a 7-fold reduction in wall time
compared with a naïve scheme when datasets are relatively small.
Comment: Accepted and awaiting publication at SPIE Medical Imaging,
International Society for Optics and Photonics, 201
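
A schematic sketch of the API surface named in the abstract (Upload, Retrieve, Remove) together with one plausible composite row-key layout for prefix scans; the project/subject/session/scan key and the in-memory store are illustration-only assumptions, not the paper's actual HBase table scheme.

    # Hypothetical backend API skeleton with a composite row-key helper.
    class ImagingBackend:
        def __init__(self):
            self._store = {}                       # in-memory stand-in for HBase

        @staticmethod
        def row_key(project, subject, session, scan):
            # Composite key so a prefix scan retrieves a whole project/subject.
            return f"{project}|{subject}|{session}|{scan}"

        def upload(self, key, image_bytes):
            self._store[key] = image_bytes

        def retrieve(self, key):
            return self._store.get(key)

        def remove(self, key):
            self._store.pop(key, None)

        def scan_prefix(self, prefix):
            # Emulates an HBase prefix scan used for rapid per-cohort queries.
            return {k: v for k, v in self._store.items() if k.startswith(prefix)}

    backend = ImagingBackend()
    key = ImagingBackend.row_key("proj01", "sub001", "ses01", "T1w")
    backend.upload(key, b"...image bytes...")
    print(list(backend.scan_prefix("proj01|sub001")))
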