
    Enabling Smart Data: Noise filtering in Big Data classification

    In any knowledge discovery process, the value of the extracted knowledge is directly related to the quality of the data used. Big Data problems, generated by the massive growth in the scale of data observed in recent years, follow the same dictate. A common problem affecting data quality is the presence of noise, particularly in classification problems, where label noise refers to the incorrect labeling of training instances and is known to be a very disruptive feature of data. In the Big Data era, however, the massive scale of the data poses a challenge to the traditional proposals created to tackle noise, as they have difficulty coping with such large amounts of data. New algorithms are needed to treat noise in Big Data problems, providing high-quality, clean data, also known as Smart Data. In this paper, two Big Data preprocessing approaches to remove noisy examples are proposed, a homogeneous ensemble filter and a heterogeneous ensemble filter, with special emphasis on their scalability and performance traits. The results show that these proposals enable practitioners to efficiently obtain a Smart Dataset from any Big Data classification problem.
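    As a minimal single-machine sketch of the heterogeneous ensemble filter idea (not the authors' Spark implementation; the choice of base learners is an assumption), several different classifiers each vote, via cross-validation, on whether a training instance looks mislabeled, and instances flagged by a majority are removed. The homogeneous variant would use multiple instances of a single learner instead.

```python
# Sketch of a heterogeneous ensemble noise filter, assuming X and y are
# NumPy arrays. Each base learner votes on every training instance via
# cross-validated predictions; instances misclassified by a majority of
# learners are treated as label noise and discarded.
import numpy as np
from sklearn.model_selection import cross_val_predict
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier

def ensemble_filter(X, y, cv=5):
    learners = [                          # illustrative choices, not the paper's
        RandomForestClassifier(n_estimators=100),
        LogisticRegression(max_iter=1000),
        KNeighborsClassifier(n_neighbors=5),
    ]
    votes = np.zeros(len(y), dtype=int)   # one "noise vote" per misclassification
    for clf in learners:
        pred = cross_val_predict(clf, X, y, cv=cv)
        votes += (pred != y).astype(int)
    keep = votes <= len(learners) // 2    # majority vote to discard
    return X[keep], y[keep]
```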

    Implementing a Cloud Platform for Autonomous Driving

    Autonomous driving clouds provide essential services to support autonomous vehicles. Today these services include, but are not limited to, distributed simulation tests for new algorithm deployment, offline deep learning model training, and High-Definition (HD) map generation. These services require infrastructure support including distributed computing, distributed storage, and heterogeneous computing. In this paper, we present the details of how we implement a unified autonomous driving cloud infrastructure and how we support these services on top of it. Comment: 8 pages, 12 figures.
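    To make the unified-infrastructure idea concrete, here is a hypothetical dispatch sketch (all names and capacities are illustrative, not from the paper): one scheduler routes the three service types to shared CPU and GPU resource pools.

```python
# Hypothetical sketch of routing autonomous-driving workloads to shared
# resource pools: training jobs prefer GPUs, simulation and HD-map jobs
# run on CPUs. Capacities and job shapes are purely illustrative.
from dataclasses import dataclass

@dataclass
class Job:
    kind: str   # "simulation", "training", or "hd_map"
    cpus: int
    gpus: int

POOLS = {"cpu": 1024, "gpu": 64}   # illustrative cluster capacities

def dispatch(job: Job) -> str:
    """Pick a pool for a job based on its kind and resource demands."""
    if job.kind == "training" and job.gpus <= POOLS["gpu"]:
        return "gpu"
    if job.cpus <= POOLS["cpu"]:
        return "cpu"
    raise RuntimeError("insufficient capacity")

print(dispatch(Job("training", cpus=8, gpus=4)))   # -> gpu
```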

    A Distributed Learning Architecture for Scientific Imaging Problems

    Current trends in scientific imaging are challenged by the emerging need to integrate sophisticated machine learning with Big Data analytics platforms. This work proposes an in-memory distributed learning architecture that enables sophisticated learning and optimization techniques on scientific imaging problems, which are characterized by the combination of variant information from different origins. We apply the resulting Spark-compliant architecture to two emerging use cases from the scientific imaging domain: (a) the space-variant deconvolution of galaxy imaging surveys (astrophysics), and (b) super-resolution based on coupled dictionary training (remote sensing). We conduct evaluation studies on relevant datasets, and the results show at least a 60% improvement in response time over conventional computing solutions. Finally, the discussion provides useful practical insights into the impact of key Spark tuning parameters on the achieved speedup and the memory/disk footprint.
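    The general pattern, keeping image tiles in memory as a Spark RDD and applying a per-tile operator in parallel, can be sketched as follows. The Richardson-Lucy iteration here is a stand-in for the paper's space-variant deconvolution, and the tile sizes and PSF are illustrative.

```python
# Minimal PySpark sketch: cache image tiles as an in-memory RDD and apply
# one deconvolution step per tile in parallel. Not the authors' code.
import numpy as np
from pyspark import SparkContext
from scipy.signal import fftconvolve

def rl_step(image, psf, estimate):
    """One Richardson-Lucy deconvolution iteration for a single tile."""
    conv = fftconvolve(estimate, psf, mode="same")
    ratio = image / np.maximum(conv, 1e-12)          # avoid division by zero
    return estimate * fftconvolve(ratio, psf[::-1, ::-1], mode="same")

sc = SparkContext(appName="distributed-deconvolution")
psf = np.ones((5, 5)) / 25.0                         # toy point-spread function
tiles = [np.random.rand(64, 64) for _ in range(100)] # stand-in image tiles
rdd = sc.parallelize(tiles).cache()                  # in-memory, reusable across iterations
result = rdd.map(lambda t: rl_step(t, psf, t.copy())).collect()
```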

    A Parallel Patient Treatment Time Prediction Algorithm and its Applications in Hospital Queuing-Recommendation in a Big Data Environment

    Effective patient queue management to minimize patient wait delays and overcrowding is one of the major challenges faced by hospitals. Unnecessary, annoying waits for long periods result in substantial wasted human resources and time and increase the frustration endured by patients. For each patient in the queue, the total treatment time of all patients ahead of them is the time they must wait. It would be convenient and preferable if patients could receive the most efficient treatment plan and know the predicted waiting time through a mobile application that updates in real time. Therefore, we propose a Patient Treatment Time Prediction (PTTP) algorithm to predict the waiting time of each treatment task for a patient. We use realistic patient data from various hospitals to build a patient treatment time model for each task. Based on this large-scale, realistic dataset, the treatment time for each patient in the current queue of each task is predicted. Based on the predicted waiting times, a Hospital Queuing-Recommendation (HQR) system is developed, which calculates and recommends an efficient and convenient treatment plan for the patient. Because of the large-scale, realistic dataset and the requirement for real-time response, the PTTP algorithm and HQR system demand efficiency and low-latency response. We use an Apache Spark-based cloud implementation at the National Supercomputing Center in Changsha (NSCC) to achieve these goals. Extensive experimentation and simulation results demonstrate the effectiveness and applicability of our proposed model in recommending effective treatment plans that minimize patient wait times in hospitals.
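    The queue-time idea reduces to two steps that can be sketched compactly: fit one regression model per treatment task on historical data, then predict a new patient's wait as the sum of predicted treatment times of everyone ahead in that task's queue. The RandomForestRegressor below is a stand-in; the paper's actual PTTP model, features, and Spark parallelization are not reproduced.

```python
# Simplified sketch of per-task treatment-time prediction and queue-wait
# estimation, assuming feature matrices are NumPy arrays.
import numpy as np
from sklearn.ensemble import RandomForestRegressor

def fit_task_model(X_hist, t_hist):
    """X_hist: historical patient features; t_hist: observed treatment times."""
    return RandomForestRegressor(n_estimators=200).fit(X_hist, t_hist)

def predicted_wait(model, queue_features):
    """Wait = sum of predicted treatment times of all patients ahead in the queue."""
    if len(queue_features) == 0:
        return 0.0
    return float(np.sum(model.predict(queue_features)))
```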

    A Disease Diagnosis and Treatment Recommendation System Based on Big Data Mining and Cloud Computing

    It is crucial to provide compatible treatment schemes for a disease according to its various symptoms at different stages. However, most classification methods are ineffective at accurately classifying a disease that exhibits multiple treatment stages, various symptoms, and multi-pathogenesis. Moreover, there are limited exchanges and little cooperation in disease diagnosis and treatment between different departments and hospitals. Thus, when new diseases occur with atypical symptoms, inexperienced doctors may have difficulty identifying them promptly and accurately. Therefore, to maximize the utilization of the advanced medical technology of developed hospitals and the rich medical knowledge of experienced doctors, a Disease Diagnosis and Treatment Recommendation System (DDTRS) is proposed in this paper. First, to identify disease symptoms more accurately, a Density-Peaked Clustering Analysis (DPCA) algorithm is introduced for disease-symptom clustering. In addition, association analyses of Disease-Diagnosis (D-D) and Disease-Treatment (D-T) rules are conducted separately with the Apriori algorithm. Appropriate diagnosis and treatment schemes are then recommended to patients and inexperienced doctors, even in a limited therapeutic environment. Moreover, to achieve high performance and low-latency response, we implement a parallel solution for DDTRS on the Apache Spark cloud platform. Extensive experimental results demonstrate that DDTRS performs disease-symptom clustering effectively and derives disease treatment recommendations intelligently and accurately.
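    The core of density-peak clustering (Rodriguez and Laio, 2014), on which DPCA builds, fits in a short single-machine sketch: points with high local density rho and a large distance delta to any denser point become cluster centers, and remaining points follow their nearest denser neighbor. The cutoff dc and center count below are illustrative; the paper's Spark-parallel version and its Apriori rule mining are not reproduced.

```python
# Compact sketch of density-peak clustering, assuming X is a NumPy array.
import numpy as np
from scipy.spatial.distance import cdist

def density_peaks(X, dc=1.0, n_centers=3):
    d = cdist(X, X)
    rho = np.exp(-(d / dc) ** 2).sum(axis=1)          # Gaussian local density
    delta = np.empty(len(X))
    for i in range(len(X)):
        denser = rho > rho[i]
        delta[i] = d[i, denser].min() if denser.any() else d[i].max()
    centers = np.argsort(rho * delta)[-n_centers:]    # decision-graph peaks
    labels = np.full(len(X), -1)
    labels[centers] = np.arange(n_centers)
    # Assign points in decreasing-density order to the cluster of their
    # nearest denser neighbor, which is already labeled by construction.
    for i in np.argsort(-rho):
        if labels[i] == -1:
            denser = np.where(rho > rho[i])[0]
            if denser.size:
                labels[i] = labels[denser[np.argmin(d[i, denser])]]
            else:                                     # global density maximum
                labels[i] = labels[centers[np.argmin(d[i, centers])]]
    return labels
```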

    biggy: An Implementation of Unified Framework for Big Data Management System

    Various tools, software packages, and systems have been proposed and implemented to tackle the challenges of big data with different emphases, e.g., data analysis, data transaction, data query, data storage, data visualization, and data privacy. In this paper, we propose datar, a new, prospective, and unified framework for a Big Data Management System (BDMS), approached from the point of view of system architecture by leveraging ideas from mainstream computer organization. We introduce the five key components of datar by reviewing the current status of BDMSs. Datar features a configuration chain of pluggable engines, automatic dataflow on job pipelines, intelligent self-driving system management, and interactive user interfaces. Moreover, we present biggy as an implementation of datar, with manipulation details demonstrated by four running examples. Evaluations of efficiency and scalability are carried out to show its performance. Our work argues that the envisioned datar is a feasible solution for a unified BDMS framework, one that can manage big data pluggably, automatically, and intelligently with specific functionalities, where specific functionalities refer to the input, storage, computation, control, and output of big data.
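    The "configuration chain of pluggable engines" idea can be illustrated with a hypothetical registry-and-pipeline sketch; all component names here are illustrative, not biggy's actual API.

```python
# Hypothetical sketch: engines register under a role ("storage",
# "computation", ...) and a pipeline is a declared chain of (role, engine)
# pairs executed in order, each stage feeding the next.
ENGINES = {}

def register(role):
    def wrap(fn):
        ENGINES.setdefault(role, {})[fn.__name__] = fn
        return fn
    return wrap

@register("storage")
def hdfs_read(cfg):                     # stand-in for a storage engine
    return ["raw-record-1", "raw-record-2"]

@register("computation")
def spark_count(data):                  # stand-in for a computation engine
    return {"records": len(data)}

def run_pipeline(chain, cfg=None):
    data = cfg
    for role, name in chain:
        data = ENGINES[role][name](data)
    return data

print(run_pipeline([("storage", "hdfs_read"), ("computation", "spark_count")]))
```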

    Scaling-Up Reasoning and Advanced Analytics on BigData

    BigDatalog is an extension of Datalog that achieves performance and scalability on both Apache Spark and multicore systems, to the point that its graph analytics outperform those written in GraphX. Looking back, we see how this realizes the ambitious goal pursued by deductive database researchers beginning forty years ago: combining the rigor and power of logic in expressing queries and reasoning with the performance and scalability with which relational databases managed Big Data. This goal led to Datalog, which is based on Horn clauses like Prolog but employs implementation techniques, such as semi-naive fixpoint and magic sets, that extend the bottom-up computation model of relational systems, thus obtaining the performance and scalability that relational systems had achieved, as far back as the 80s, using data parallelization on shared-nothing architectures. But this goal proved difficult to achieve because of major issues (i) at the language level and (ii) at the system level. The paper describes how (i) was addressed by simple rules under which the fixpoint semantics extends to programs using count, sum, and extrema in recursion, and how (ii) was tamed by parallel compilation techniques that achieve scalability on multicore systems and Apache Spark. This paper is under consideration for acceptance in Theory and Practice of Logic Programming (TPLP).
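    The semi-naive fixpoint the abstract mentions is easy to see on the classic transitive-closure program, tc(X,Y) <- edge(X,Y) and tc(X,Y) <- tc(X,Z), edge(Z,Y): at each iteration only the newly derived facts (the "delta") are joined with the base relation, which is what makes bottom-up evaluation scale. A small Python sketch, not BigDatalog's implementation:

```python
# Semi-naive bottom-up evaluation of transitive closure over a set of edges.
def transitive_closure(edges):
    tc = set(edges)       # facts derived so far
    delta = set(edges)    # facts derived in the last iteration
    while delta:
        # Join only the delta with edge/2, instead of re-joining all of tc.
        new = {(x, z) for (x, y) in delta for (y2, z) in edges if y == y2}
        delta = new - tc  # keep only genuinely new facts
        tc |= delta
    return tc

print(sorted(transitive_closure({(1, 2), (2, 3), (3, 4)})))
# -> [(1, 2), (1, 3), (1, 4), (2, 3), (2, 4), (3, 4)]
```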

    Multiple K Means++ Clustering of Satellite Image Using Hadoop MapReduce and Spark

    Clustering of images is one of the important steps in mining satellite images. In our experiments, we simultaneously run multiple K-means algorithms with different initial centroids and values of k within the same iteration of MapReduce jobs. To initialize the centroids, we implemented a Scalable K-Means++ MapReduce (MR) job [1]. We also run a Simplified Silhouette Index validation algorithm [2] on the multiple clustering outputs, again within the same iteration of MR jobs. This paper explores the behavior of the above clustering algorithms when run on big data platforms such as MapReduce and Spark. Spark was chosen because it is popular for fast processing, particularly where iterations are involved. Comment: 9 pages. Keywords: Distributed Computing, Satellite Images, Clustering. Published in the International Journal of Advanced Studies in Computer Science and Engineering (IJASCSE), volume 5, issue 4, 201
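    The experiment's core loop, running k-means++ for several values of k and validating each clustering with the Simplified Silhouette Index (which replaces pairwise distances with distances to centroids), can be sketched on a single machine as follows; the MapReduce/Spark parallelization from the paper is not reproduced.

```python
# Run k-means++ for several k and pick the k with the best Simplified
# Silhouette score, assuming X is a NumPy array of feature vectors.
import numpy as np
from sklearn.cluster import KMeans

def simplified_silhouette(X, labels, centers):
    d = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
    a = d[np.arange(len(X)), labels]      # distance to own centroid
    d[np.arange(len(X)), labels] = np.inf
    b = d.min(axis=1)                     # distance to nearest other centroid
    return float(np.mean((b - a) / np.maximum(a, b)))

def best_k(X, ks=(2, 3, 4, 5, 6)):
    scores = {}
    for k in ks:
        km = KMeans(n_clusters=k, init="k-means++", n_init=10).fit(X)
        scores[k] = simplified_silhouette(X, km.labels_, km.cluster_centers_)
    return max(scores, key=scores.get), scores
```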

    A survey of systems for massive stream analytics

    The immense growth of data demands a switch from traditional data processing solutions to systems that can process a continuous stream of real-time data. Various applications employ stream processing systems to solve emerging Big Data problems. Open-source solutions such as Storm, Spark Streaming, and S4 are attempts to answer key stream processing questions. The recent introduction of commercial real-time stream processing solutions, such as Amazon Kinesis and IBM InfoSphere Streams, reflects industry requirements. The system- and application-related challenges of handling massive streams of real-time data for analytics are an active field of research. In this paper, we present a comparative analysis of existing state-of-the-art stream processing solutions. We also cover various application domains that are transforming their business models to benefit from these large-scale stream processing systems.
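    For concreteness, the canonical "hello world" of the surveyed systems is a streaming word count; below is the Spark Streaming (DStream) version, with hostname and port being illustrative.

```python
# Classic Spark Streaming word count over 1-second micro-batches.
from pyspark import SparkContext
from pyspark.streaming import StreamingContext

sc = SparkContext(appName="stream-wordcount")
ssc = StreamingContext(sc, batchDuration=1)         # 1-second micro-batches
lines = ssc.socketTextStream("localhost", 9999)     # illustrative text source
counts = (lines.flatMap(lambda line: line.split())
               .map(lambda word: (word, 1))
               .reduceByKey(lambda a, b: a + b))
counts.pprint()                                     # print each batch's counts
ssc.start()
ssc.awaitTermination()
```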

    Portfolio-driven Resource Management for Transient Cloud Servers

    Cloud providers have begun to offer their surplus capacity in the form of low-cost transient servers, which can be revoked unilaterally at any time. While the low cost of transient servers makes them attractive for a wide range of applications, such as data processing and scientific computing, failures due to server revocation can severely degrade application performance. Since different transient server types offer different cost and availability tradeoffs, we present the notion of server portfolios, based on financial portfolio modeling. Server portfolios enable the construction of an "optimal" mix of servers to match an application's sensitivity to cost and revocation risk. We implement model-driven portfolios in a system called ExoSphere and show how diverse applications can use portfolios and application-specific policies to gracefully handle transient servers. We show that ExoSphere allows widely used parallel applications such as Spark, MPI, and BOINC to be made transiency-aware with modest effort. Our experiments show that, when applications use suitable transiency-aware policies, ExoSphere achieves 80% cost savings compared to on-demand servers and greatly reduces revocation risk compared to existing approaches.
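    A toy mean-variance sketch captures the server-portfolio idea: choose a mix of transient server types that trades expected cost against revocation risk, in the spirit of financial portfolio modeling. The prices, the risk matrix, and the risk model itself are illustrative assumptions, not ExoSphere's.

```python
# Toy portfolio: minimize (expected cost + alpha * revocation "variance")
# over server-type weights that sum to 1, with alpha as risk aversion.
import numpy as np
from scipy.optimize import minimize

cost = np.array([0.10, 0.12, 0.09])         # $/hour per server type (illustrative)
risk = np.array([[0.040, 0.002, 0.001],     # revocation covariance (illustrative)
                 [0.002, 0.030, 0.003],
                 [0.001, 0.003, 0.050]])

def portfolio(alpha):
    """Higher alpha (risk aversion) buys a safer but pricier server mix."""
    objective = lambda w: cost @ w + alpha * (w @ risk @ w)
    constraints = ({"type": "eq", "fun": lambda w: w.sum() - 1},)
    res = minimize(objective, np.ones(3) / 3,
                   bounds=[(0, 1)] * 3, constraints=constraints)
    return res.x

print(np.round(portfolio(alpha=5.0), 3))    # weights per server type
```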