Enabling Smart Data: Noise filtering in Big Data classification
In any knowledge discovery process the value of extracted knowledge is
directly related to the quality of the data used. Big Data problems, arising
from the massive growth in data scale observed in recent years, are no
exception. A common problem affecting data quality is the presence of
noise, particularly in classification problems, where label noise refers to the
incorrect labeling of training instances, and is known to be a very disruptive
feature of data. However, in this Big Data era, the massive growth in the scale
of the data poses a challenge to traditional proposals created to tackle noise,
as they have difficulties coping with such a large amount of data. New
algorithms need to be proposed to treat the noise in Big Data problems,
providing high quality and clean data, also known as Smart Data. In this paper,
two Big Data preprocessing approaches to remove noisy examples are proposed: a
homogeneous ensemble filter and a heterogeneous ensemble filter, with special
emphasis on their scalability and performance traits. The results show
that these proposals enable practitioners to efficiently obtain a Smart
Dataset from any Big Data classification problem.
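
As a rough illustration of the homogeneous-ensemble idea (not the authors'
Spark implementation), the Python sketch below trains several classifiers of
the same type on cross-validation folds and discards training instances whose
labels a majority of rounds contradict; the function name, classifier choice,
and thresholds are all hypothetical.

    # Hypothetical sketch of a homogeneous ensemble noise filter: several
    # instances of the same learner vote on whether each training label
    # looks wrong; majority-suspected examples are removed.
    import numpy as np
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.model_selection import KFold

    def homogeneous_ensemble_filter(X, y, n_folds=5, n_votes=3):
        miss = np.zeros(len(y))
        for seed in range(n_votes):
            kf = KFold(n_splits=n_folds, shuffle=True, random_state=seed)
            for train_idx, test_idx in kf.split(X):
                clf = RandomForestClassifier(n_estimators=50, random_state=seed)
                clf.fit(X[train_idx], y[train_idx])
                miss[test_idx] += clf.predict(X[test_idx]) != y[test_idx]
        keep = miss <= n_votes / 2  # keep examples most rounds consider clean
        return X[keep], y[keep]

A heterogeneous variant would combine different learner types rather than
different random seeds of the same learner.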
Implementing a Cloud Platform for Autonomous Driving
Autonomous driving clouds provide essential services to support autonomous
vehicles. Today these services include, but are not limited to, distributed
simulation tests for new algorithm deployment, offline deep learning model
training, and High-Definition (HD) map generation. These services require
infrastructure support including distributed computing, distributed storage, as
well as heterogeneous computing. In this paper, we present the details of how
we implement a unified autonomous driving cloud infrastructure, and how we
support these services on top of this infrastructure.
A Distributed Learning Architecture for Scientific Imaging Problems
Current trends in scientific imaging are challenged by the emerging need of
integrating sophisticated machine learning with Big Data analytics platforms.
This work proposes an in-memory distributed learning architecture for enabling
sophisticated learning and optimization techniques on scientific imaging
problems, which are characterized by combining heterogeneous information
from different origins. We apply the resulting Spark-compliant architecture
on two emerging use cases from the scientific imaging domain, namely: (a) the
space variant deconvolution of galaxy imaging surveys (astrophysics), (b) the
super-resolution based on coupled dictionary training (remote sensing). We
conduct evaluation studies on relevant datasets, and the results show at
least a 60% improvement in response time over conventional computing
solutions. Ultimately, the discussion offers useful practical insights into
the impact of key Spark tuning parameters on the achieved speedup and on the
memory/disk footprint.
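
For concreteness, the snippet below shows the kind of Spark tuning parameters
such a discussion typically concerns; the configuration keys are standard
Spark settings, while the values and application name are purely illustrative.

    # Illustrative PySpark setup: these knobs (executor memory, cores,
    # parallelism, memory fraction) are the sort of parameters whose
    # effect on speedup and memory/disk footprint the paper discusses.
    from pyspark.sql import SparkSession

    spark = (SparkSession.builder
             .appName("scientific-imaging-demo")
             .config("spark.executor.memory", "8g")
             .config("spark.executor.cores", "4")
             .config("spark.default.parallelism", "128")
             .config("spark.memory.fraction", "0.6")
             .getOrCreate())

    # Caching keeps the working set in memory across iterative
    # optimization steps instead of recomputing it on each pass.
    patches = spark.sparkContext.parallelize(range(1000), 128).cache()
    print(patches.count())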
A Parallel Patient Treatment Time Prediction Algorithm and its Applications in Hospital Queuing-Recommendation in a Big Data Environment
Effective patient queue management to minimize patient wait delays and
patient overcrowding is one of the major challenges faced by hospitals.
Unnecessary and annoying waits for long periods result in substantial human
resource and time wastage and increase the frustration endured by patients.
For each patient in the queue, the wait time is the total treatment time of
all patients ahead of them. It would be convenient and preferable if the
patients could receive the most efficient treatment plan and know the predicted
waiting time through a mobile application that updates in real-time. Therefore,
we propose a Patient Treatment Time Prediction (PTTP) algorithm to predict the
waiting time for each treatment task for a patient. We use realistic patient
data from various hospitals to obtain a patient treatment time model for each
task. Based on this large-scale, realistic dataset, the treatment time for each
patient in the current queue of each task is predicted. Based on the predicted
waiting time, a Hospital Queuing-Recommendation (HQR) system is developed. HQR
calculates and predicts an efficient and convenient treatment plan recommended
for the patient. Because of the large-scale, realistic dataset and the
requirement for real-time response, the PTTP algorithm and HQR system mandate
efficiency and low-latency response. We use an Apache Spark-based cloud
implementation at the National Supercomputing Center in Changsha (NSCC) to
achieve the aforementioned goals. Extensive experimentation and simulation
results demonstrate the effectiveness and applicability of our proposed model
to recommend an effective treatment plan for patients to minimize their wait
times in hospitals.
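
The queuing logic itself reduces to a simple sum, as a toy example makes
clear; the task names and predicted times below are invented, and the real
system derives the per-task predictions from its learned treatment-time
models.

    # Toy illustration: a newcomer's expected wait at a task is the sum of
    # the predicted treatment times of the patients already queued there,
    # and a recommendation can rank tasks by that expected wait.
    predicted_minutes = {"CT scan": [12.5, 9.0, 14.2], "blood test": [4.0, 5.5]}

    def expected_wait(task):
        return sum(predicted_minutes[task])

    plan = sorted(predicted_minutes, key=expected_wait)
    print([(t, expected_wait(t)) for t in plan])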
A Disease Diagnosis and Treatment Recommendation System Based on Big Data Mining and Cloud Computing
It is crucial to provide compatible treatment schemes for a disease according
to various symptoms at different stages. However, most classification methods
might be ineffective in accurately classifying a disease that holds the
characteristics of multiple treatment stages, various symptoms, and
multi-pathogenesis. Moreover, there are limited exchanges and cooperative
actions in disease diagnoses and treatments between different departments and
hospitals. Thus, when new diseases occur with atypical symptoms, inexperienced
doctors might have difficulty in identifying them promptly and accurately.
Therefore, to maximize the utilization of the advanced medical technology of
developed hospitals and the rich medical knowledge of experienced doctors, a
Disease Diagnosis and Treatment Recommendation System (DDTRS) is proposed in
this paper. First, to identify disease symptoms more accurately, a
Density-Peaked Clustering Analysis (DPCA) algorithm is introduced for
disease-symptom clustering. In addition, association analyses on
Disease-Diagnosis (D-D) rules and Disease-Treatment (D-T) rules are conducted
by the Apriori algorithm separately. The appropriate diagnosis and treatment
schemes are recommended for patients and inexperienced doctors, even if they
are in a limited therapeutic environment. Moreover, to reach the goals of high
performance and low latency response, we implement a parallel solution for
DDTRS using the Apache Spark cloud platform. Extensive experimental results
demonstrate that the proposed DDTRS realizes disease-symptom clustering
effectively and derives disease treatment recommendations intelligently and
accurately.
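
To make the rule-mining step concrete, here is a small, hedged sketch of
Apriori-based Disease-Treatment rule extraction using the mlxtend library;
the records and thresholds are invented, and the paper's own implementation
runs in parallel on Spark rather than on a single machine.

    # Hypothetical Disease-Treatment (D-T) rule mining with Apriori:
    # frequent co-occurrences of diseases and treatments become
    # association rules ranked by confidence.
    import pandas as pd
    from mlxtend.preprocessing import TransactionEncoder
    from mlxtend.frequent_patterns import apriori, association_rules

    records = [["flu", "rest", "fluids"],
               ["flu", "antiviral", "rest"],
               ["strep", "antibiotic", "rest"]]
    te = TransactionEncoder()
    df = pd.DataFrame(te.fit_transform(records), columns=te.columns_)

    frequent = apriori(df, min_support=0.5, use_colnames=True)
    rules = association_rules(frequent, metric="confidence", min_threshold=0.8)
    print(rules[["antecedents", "consequents", "confidence"]])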
biggy: An Implementation of Unified Framework for Big Data Management System
Various tools, software packages, and systems have been proposed and implemented to tackle
the challenges in big data on different emphases, e.g., data analysis, data
transaction, data query, data storage, data visualization, data privacy. In
this paper, we propose datar, a new unified framework for Big Data Management
System (BDMS), approached from the standpoint of system architecture by
leveraging ideas from mainstream computer architecture. We introduce five key
components of datar by reviewing the current status of BDMS. Datar features a
configuration chain of pluggable engines, automatic dataflow on job
pipelines, intelligent self-driving system management, and interactive user
interfaces. Moreover, we present biggy as an implementation of datar with
manipulation details demonstrated by four running examples. Evaluations on
efficiency and scalability are carried out to show the performance. Our work
argues that the envisioned datar is a feasible solution to the unified
framework of BDMS, which can manage big data pluggably, automatically, and
intelligently with specific functionalities, where these functionalities
refer to the input, storage, computation, control, and output of big data.
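
Purely as an illustration of the "configuration chain of pluggable engines"
idea, the sketch below wires the five components named above into a chain
where each stage's engine can be swapped by editing its entry; every name in
it is hypothetical.

    # Hypothetical configuration chain: one pluggable engine per datar
    # stage (input, storage, computation, control, output).
    pipeline = [
        {"stage": "input",       "engine": "kafka"},
        {"stage": "storage",     "engine": "hdfs"},
        {"stage": "computation", "engine": "spark"},
        {"stage": "control",     "engine": "yarn"},
        {"stage": "output",      "engine": "dashboard"},
    ]

    def run(chain, job):
        for step in chain:  # swapping an engine means editing one entry
            print(f"{step['stage']:>11}: dispatching {job} to {step['engine']}")

    run(pipeline, job="demo")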
Scaling-Up Reasoning and Advanced Analytics on BigData
BigDatalog is an extension of Datalog that achieves performance and
scalability on both Apache Spark and multicore systems to the point that its
graph analytics outperform those written in GraphX. Looking back, we see how
this realizes the ambitious goal pursued by deductive database researchers
beginning forty years ago: combining the rigor and power of logic in
expressing queries and reasoning with the performance and scalability with
which relational databases managed Big Data. This goal led to Datalog, which
is based on Horn clauses like Prolog but employs implementation techniques,
such as semi-naive fixpoint evaluation and Magic Sets, that extend the
bottom-up computation model of relational systems and thus obtain the
performance and scalability that relational systems had achieved, as far
back as the 80s, using data parallelization on shared-nothing architectures.
But this goal proved difficult to achieve because of major issues at (i) the
language level and (ii) the system level. The paper describes how (i) was
addressed by simple rules
under which the fixpoint semantics extends to programs using count, sum and
extrema in recursion, and (ii) was tamed by parallel compilation techniques
that achieve scalability on multicore systems and Apache Spark. This paper is
under consideration for acceptance in Theory and Practice of Logic Programming
(TPLP).
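
The semi-naive fixpoint technique mentioned above is easy to see on the
classic transitive-closure program; the Python sketch below is a
single-machine toy, not BigDatalog's parallel implementation.

    # Semi-naive evaluation of  tc(X,Y) :- edge(X,Y).
    #                           tc(X,Y) :- tc(X,Z), edge(Z,Y).
    # Each round joins only the facts derived in the previous round
    # (the delta), unlike naive evaluation, which re-joins everything.
    edges = {(1, 2), (2, 3), (3, 4)}

    tc = set(edges)      # all facts derived so far
    delta = set(edges)   # facts that are new this round
    while delta:
        new = {(x, w) for (x, y) in delta for (z, w) in edges if y == z} - tc
        tc |= new
        delta = new

    print(sorted(tc))    # every reachable pair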
Multiple K Means++ Clustering of Satellite Image Using Hadoop MapReduce and Spark
Clustering is one of the important steps in mining satellite images.
In our experiment we have simultaneously run multiple K-means algorithms with
different initial centroids and values of k in the same iteration of MapReduce
jobs. To initialize the centroids, we have implemented the Scalable
K-Means++ MapReduce (MR) job [1]. We have also run the Simplified Silhouette
Index validation algorithm [2] on the multiple clustering outputs, again in the
same iteration of MR jobs. This paper explores the behavior of the
above-mentioned clustering algorithms when run on big data platforms such as
MapReduce and Spark. Spark has been chosen as it is popular for fast
processing, particularly
where iterations are involved.
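
As a hedged sketch of the validation step, the function below computes a
simplified silhouette of the kind described in [2], using each point's
distance to its own centroid as cohesion and its distance to the nearest
other centroid as separation; it is a toy, single-machine rendering rather
than the authors' MR job.

    # Simplified silhouette: (b - a) / max(a, b) per point, where a is the
    # distance to the point's own centroid and b the distance to the
    # nearest other centroid; the mean score compares clusterings.
    import numpy as np

    def simplified_silhouette(X, labels, centroids):
        scores = []
        for x, k in zip(X, labels):
            d = np.linalg.norm(centroids - x, axis=1)
            a = d[k]
            b = np.min(np.delete(d, k))
            scores.append((b - a) / max(a, b))
        return float(np.mean(scores))  # higher = better-separated clusters

Among the multiple K-means outputs, the clustering (and hence the value of k)
with the highest mean score would be preferred.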
A survey of systems for massive stream analytics
The immense growth of data demands a switch from traditional data processing
solutions to systems that can process a continuous stream of real-time data.
Various applications employ stream processing systems to provide solutions to
emerging Big Data problems. Open-source solutions such as Storm, Spark
Streaming, and S4 are attempts to answer key stream processing questions.
The recent introduction of commercial real-time stream processing solutions
such as Amazon Kinesis and IBM InfoSphere Streams reflects industry
requirements. The system- and application-related challenges of handling
massive streams of real-time data are an active field of research.
In this paper, we present a comparative analysis of the existing
state-of-the-art stream processing solutions. We also cover various
application domains that are transforming their business models to benefit
from these large-scale stream processing systems.
Portfolio-driven Resource Management for Transient Cloud Servers
Cloud providers have begun to offer their surplus capacity in the form of
low-cost transient servers, which can be revoked unilaterally at any time.
While the low cost of transient servers makes them attractive for a wide range
of applications, such as data processing and scientific computing, failures due
to server revocation can severely degrade application performance. Since
different transient server types offer different cost and availability
tradeoffs, we present the notion of server portfolios, which is based on
financial portfolio modeling. Server portfolios enable construction of an
"optimal" mix of servers matched to an application's sensitivity to cost and
revocation risk. We implement model-driven portfolios in a system called
ExoSphere, and show how diverse applications can use portfolios and
application-specific policies to gracefully handle transient servers. We show
that ExoSphere enables widely-used parallel applications such as Spark, MPI,
and BOINC to be made transiency-aware with modest effort. Our experiments show
that, when applications use suitable transiency-aware policies, ExoSphere
achieves 80% cost savings compared to on-demand servers and greatly reduces
revocation risk compared to existing approaches.
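
In the spirit of the financial modeling the paper borrows from, a server
portfolio can be sketched as a mean-variance trade-off between expected cost
and revocation risk; all numbers below are illustrative, and the brute-force
search merely stands in for a proper optimizer.

    # Hypothetical portfolio selection over three transient server types:
    # minimize expected cost plus a risk-aversion-weighted revocation
    # variance, subject to the weights summing to one.
    import numpy as np
    from itertools import product

    cost = np.array([0.10, 0.12, 0.15])        # $/hour per server type
    risk = np.array([[0.040, 0.010, 0.000],    # revocation covariance
                     [0.010, 0.020, 0.005],
                     [0.000, 0.005, 0.010]])
    alpha = 2.0                                # sensitivity to revocation risk

    best = None
    for w in product(np.linspace(0, 1, 21), repeat=3):
        w = np.array(w)
        if abs(w.sum() - 1.0) > 1e-9:
            continue                           # stay on the simplex
        objective = cost @ w + alpha * (w @ risk @ w)
        if best is None or objective < best[0]:
            best = (objective, w)

    print("weights:", best[1], "objective:", round(best[0], 4))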