348 research outputs found

    Collaborative Cloud Computing Framework for Health Data with Open Source Technologies

    Full text link
    The proliferation of sensor technologies and advancements in data collection methods have enabled the accumulation of very large amounts of data. Increasingly, these datasets are considered for scientific research. However, designing a system architecture that achieves high performance in terms of parallelization, query processing time, and aggregation of heterogeneous data types (e.g., time series, images, and structured data), while keeping scientific research reproducible, remains a major challenge. This is specifically true for health sciences research, where the systems must be i) easy to use, with the flexibility to manipulate data at the most granular level, ii) agnostic of the programming language kernel, iii) scalable, and iv) compliant with the HIPAA privacy law. In this paper, we review the existing literature on such big data systems for scientific research in health sciences and identify the gaps in the current system landscape. We propose a novel architecture for a software-hardware-data ecosystem using open source technologies such as Apache Hadoop, Kubernetes and JupyterHub in a distributed environment. We also evaluate the system using a large clinical data set of 69M patients.
    Comment: This paper is accepted in ACM-BCB 202
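    The abstract gives no code, but a minimal sketch may help picture the access pattern such an ecosystem enables: a notebook running in a JupyterHub pod querying clinical data on HDFS. PySpark is shown for concreteness (the architecture itself is kernel-agnostic), and the HDFS path and column names are hypothetical.

```python
# Minimal sketch of notebook-driven access to clinical data on HDFS,
# as might run inside a JupyterHub pod on a Hadoop/Kubernetes stack.
# The path and column names below are hypothetical.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = (
    SparkSession.builder
    .appName("clinical-cohort-query")
    .getOrCreate()
)

# Heterogeneous data types live side by side on HDFS: structured
# encounter records here, with time series and images in sibling paths.
encounters = spark.read.parquet("hdfs:///warehouse/clinical/encounters")

# Granular, ad hoc manipulation: distinct patients per diagnosis code.
(encounters
 .groupBy("icd10_code")
 .agg(F.countDistinct("patient_id").alias("patients"))
 .orderBy(F.desc("patients"))
 .show(20))
```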

    A Service Oriented Architecture For Automated Machine Learning At Enterprise-Scale

    Get PDF
    This thesis presents a solution architecture for productizing machine learning models in an enterprise context and tracking each model's performance to gain insights into how and when to retrain it. The thesis deals with two challenges. First, machine learning models need to be retrained regularly to incorporate unseen data and maintain their performance, which gives rise to the need for machine learning model management. Second, there is an overhead in deploying machine learning models into production with respect to support and operations; there is scope to reduce the time to production for a machine learning model and thus offer cost-effective solutions. These two challenges are addressed through the introduction of three services under ScienceOps, called ModelDeploy, ModelMonitor and DataMonitor. ModelDeploy brings down the time to production for a machine learning model. ModelMonitor and DataMonitor help gain insights into how and when a model should be retrained. Finally, the time to production for the proposed architecture on two cloud platforms is evaluated and compared against a rudimentary approach. The monitoring services give insight into the model's performance and how the statistics of the data change over time.
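    The abstract names the services but not their APIs. As one illustration of the kind of check a DataMonitor-style service might run to decide when retraining is due, the sketch below computes a Population Stability Index over one feature; the technique, threshold, and feature are my assumptions, not taken from the thesis.

```python
# Hypothetical sketch of a DataMonitor-style drift check: compare the
# live feature distribution against the training baseline and flag
# when a model should be retrained. Threshold and data are illustrative.
import numpy as np

def psi(baseline: np.ndarray, live: np.ndarray, bins: int = 10) -> float:
    """Population Stability Index between two samples of one feature."""
    edges = np.quantile(baseline, np.linspace(0, 1, bins + 1))
    edges[0], edges[-1] = -np.inf, np.inf               # cover the full range
    b = np.histogram(baseline, edges)[0] / len(baseline)
    l = np.histogram(live, edges)[0] / len(live)
    b, l = np.clip(b, 1e-6, None), np.clip(l, 1e-6, None)  # avoid log(0)
    return float(np.sum((l - b) * np.log(l / b)))

rng = np.random.default_rng(0)
train_ages = rng.normal(45, 12, 10_000)                 # training-time sample
live_ages = rng.normal(52, 12, 2_000)                   # drifted live sample

score = psi(train_ages, live_ages)
# A common rule of thumb: PSI > 0.2 suggests significant drift.
print(f"PSI = {score:.3f};", "retrain" if score > 0.2 else "ok")
```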

    On the performance of SQL scalable systems on Kubernetes: a comparative study

    Get PDF
    The popularization of Hadoop as the de-facto standard platform for data analytics in the context of Big Data applications has led to the upsurge of SQL-on-Hadoop systems, which provide scalable query execution engines allowing the use of SQL queries on data stored in HDFS. In this context, Kubernetes appears as the leading choice to simplify the deployment and scaling of containerized applications; however, there is a lack of studies about the performance of SQL-on-Hadoop systems deployed on Kubernetes, and this is the gap we intend to fill in this paper. We present an experimental study involving four representative SQL scalable platforms: Apache Drill, Apache Hive, Apache Spark SQL and Trino. Concretely, we analyze the performance of these systems when they are deployed on a Hadoop cluster with Kubernetes, using the TPC-H benchmark. The results of our study can help practitioners and users understand what they can expect in terms of performance if they plan to use the advantages of Kubernetes to deploy applications on the analyzed SQL scalable platforms.
    Open Access funding provided thanks to the CRUE-CSIC agreement with Springer Nature. Funding for open access charge: Universidad de Málaga / CBUA. This work has been partially funded by the Spanish Ministry of Science and Innovation via Grant PID2020-112540RB-C41 (AEI/FEDER, UE), the Andalusian PAIDI program with grant P18-RT-2799, and by the project "Evolución y desarrollo de la plataforma DOP de Big Data" (702C2000044) under the Andalusian "Programa de Apoyo a la I+D+i Empresarial".
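    The paper reports TPC-H timings but the abstract does not describe a harness. A minimal sketch of how one such measurement might be taken, timing an abbreviated TPC-H Q1 against a Trino coordinator exposed by the Kubernetes cluster, could look like the following; the host, catalog, and schema names are assumptions.

```python
# Hypothetical timing harness for one TPC-H query against Trino.
# Host, port, and catalog/schema names are placeholders.
import time
import trino  # pip install trino

TPCH_Q1 = """
SELECT l_returnflag, l_linestatus,
       sum(l_quantity)      AS sum_qty,
       sum(l_extendedprice) AS sum_base_price,
       count(*)             AS count_order
FROM lineitem
WHERE l_shipdate <= DATE '1998-09-02'
GROUP BY l_returnflag, l_linestatus
ORDER BY l_returnflag, l_linestatus
"""

conn = trino.dbapi.connect(
    host="trino.example.svc.cluster.local", port=8080,
    user="bench", catalog="hive", schema="tpch_sf100",
)
cur = conn.cursor()

start = time.perf_counter()
cur.execute(TPCH_Q1)
rows = cur.fetchall()            # drain results before stopping the clock
elapsed = time.perf_counter() - start
print(f"TPC-H Q1: {len(rows)} groups in {elapsed:.2f}s")
```

    The same DB-API-style loop would apply to Hive or Spark SQL via their respective Python clients, which is what makes a four-engine comparison on one cluster practical.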

    LEAN DATA ENGINEERING. COMBINING STATE OF THE ART PRINCIPLES TO PROCESS DATA EFFICIENTLY

    Get PDF
    The present work was developed during an internship, under the Erasmus+ Traineeship program, at Fieldwork Robotics, a Cambridge-based company that develops robots to operate in agricultural fields. They collect data from commercial greenhouses with sensors and RealSense cameras, as well as with gripper cameras placed on the robotic arms. These data are recorded mainly in bag files, consisting of unstructured data, such as images, and semi-structured data, such as metadata associated with both the conditions in which the images were taken and information about the robot itself. Data was uploaded, extracted, cleaned and labelled manually before being used to train Artificial Intelligence (AI) algorithms to identify raspberries during the harvesting process. The amount of available data quickly escalates with every trip to the fields, which creates an ever-growing need for an automated process. This problem was addressed via the creation of a data engineering platform encompassing a data lake, a data warehouse and the needed processing capabilities. This platform was created following a series of principles entitled Lean Data Engineering Principles (LDEP); the systems that follow them are called Lean Data Engineering Systems (LDES). These principles urge one to start with the end in mind: process incoming batch or real-time data with no resource wasting, limiting costs to what is absolutely necessary for job completion, in other words being as lean as possible. The LDEP principles are a combination of state-of-the-art ideas stemming from several fields, such as data engineering, software engineering and DevOps, leveraging cloud technologies at their core. The proposed custom-made solution enabled the company to scale its data operations, being able to label images almost ten times faster while reducing over 99.9% of the associated costs in comparison to the previous process. In addition, the data lifecycle time has been reduced from weeks to hours while maintaining coherent data quality results, being able, for instance, to correctly identify 94% of the labels in comparison to a human counterpart.
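    The abstract describes the platform only at a high level. A minimal, standard-library-only sketch of the kind of lean, idempotent batch step it implies, moving raw bag files from a landing zone into a partitioned data lake while logging metadata, might look like this; the directory layout and metadata schema are assumptions.

```python
# Hypothetical sketch of a lean batch-ingestion step: move raw field
# recordings into a date-partitioned data lake and append one metadata
# record per file. Directory layout and schema are illustrative.
import hashlib
import json
import shutil
from datetime import datetime, timezone
from pathlib import Path

LANDING = Path("landing")             # raw uploads from the robots
LAKE = Path("lake/raw")               # partitioned, immutable storage
CATALOG = Path("lake/catalog.jsonl")  # append-only metadata log

def ingest() -> int:
    ingested = 0
    for src in LANDING.glob("*.bag"):
        day = datetime.now(timezone.utc).strftime("%Y-%m-%d")
        dest_dir = LAKE / f"ingest_date={day}"
        dest_dir.mkdir(parents=True, exist_ok=True)
        digest = hashlib.sha256(src.read_bytes()).hexdigest()
        dest = dest_dir / f"{digest[:12]}_{src.name}"
        if not dest.exists():         # idempotent: no re-work, no waste
            shutil.move(str(src), dest)
            record = {"file": str(dest), "sha256": digest,
                      "bytes": dest.stat().st_size, "ingested_at": day}
            with CATALOG.open("a") as f:
                f.write(json.dumps(record) + "\n")
            ingested += 1
    return ingested

if __name__ == "__main__":
    print(f"ingested {ingest()} new bag files")
```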

    ARM Wrestling with Big Data: A Study of Commodity ARM64 Server for Big Data Workloads

    Full text link
    ARM processors have dominated the mobile device market in the last decade due to their favorable computing-to-energy ratio. In this age of Cloud data centers and Big Data analytics, the focus is increasingly on power-efficient processing, rather than just high-throughput computing. ARM's first commodity server-grade processor is the recent AMD A1100-series processor, based on a 64-bit ARM Cortex A57 architecture. In this paper, we study the performance and energy efficiency of a server based on this ARM64 CPU, relative to a comparable server running an AMD Opteron 3300-series x64 CPU, for Big Data workloads. Specifically, we study these for Intel's HiBench suite of web, query and machine learning benchmarks on Apache Hadoop v2.7 in a pseudo-distributed setup, for data sizes up to 20 GB files, 5M web pages and 500M tuples. Our results show that the ARM64 server's runtime performance is comparable to the x64 server for integer-based workloads like Sort and Hive queries, and only lags behind for floating-point-intensive benchmarks like PageRank, when they do not exploit data parallelism adequately. We also see that the ARM64 server takes one-third the energy, and has an Energy Delay Product (EDP) that is 50-71% lower than the x64 server. These results hold promise for ARM64 data centers hosting Big Data workloads to reduce their operational costs, while opening up opportunities for further analysis.
    Comment: Accepted for publication in the Proceedings of the 24th IEEE International Conference on High Performance Computing, Data, and Analytics (HiPC), 201
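    For readers unfamiliar with the metric: the Energy Delay Product is simply energy consumed times elapsed runtime, so a slower machine can still win on EDP if it is frugal enough. The toy calculation below uses made-up numbers, not figures from the paper, to show how a one-third-energy machine with a modest slowdown lands inside the reported 50-71% EDP-reduction band.

```python
# Toy EDP arithmetic; the numbers are illustrative, not measurements
# from the paper. EDP = energy (J) * runtime (s); lower is better.
x64_energy, x64_time = 9000.0, 300.0    # hypothetical x64 run
arm_energy = x64_energy / 3             # "one-third the energy"
arm_time = x64_time * 1.25              # modest runtime lag

edp_x64 = x64_energy * x64_time
edp_arm = arm_energy * arm_time
reduction = 1 - edp_arm / edp_x64

print(f"x64 EDP = {edp_x64:.0f} J*s, ARM64 EDP = {edp_arm:.0f} J*s")
print(f"EDP reduction = {reduction:.0%}")   # 1 - (1/3 * 1.25) ~ 58%
```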

    ANALYZING THE SYSTEM FEATURES, USABILITY, AND PERFORMANCE OF A CONTAINERIZED APPLICATION ON CLOUD COMPUTING SYSTEMS

    Get PDF
    This study analyzed the system features, usability, and performance of three serverless cloud computing platforms: Google Cloud's Cloud Run, Amazon Web Services' App Runner, and Microsoft Azure's Container Apps. The analysis was conducted on a containerized mobile application designed to track real-time bus locations for San Antonio public buses on specific routes and provide estimated arrival times for selected bus stops. The study evaluated various system-related features, including service configuration, pricing, and memory and CPU capacity, along with performance metrics such as container latency, Distance Matrix API response time, and CPU utilization for each service. Usability was also evaluated by assessing the quality of documentation, the learning curve for beginner users, and the scale-to-zero factor. The results of the analysis revealed that Google's Cloud Run demonstrated better performance and usability when compared to AWS's App Runner and Microsoft Azure's Container Apps. Cloud Run exhibited lower latency and faster response times for distance matrix queries. These findings provide valuable insights for selecting an appropriate serverless cloud service for similar containerized web applications.
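    The study's measurement harness is not shown in the abstract. A minimal sketch of how the per-platform latency comparison might be collected, probing the same containerized app deployed on each service, could be the following; the URLs are placeholders, not the study's real endpoints.

```python
# Hypothetical latency probe for the three deployed services. URLs are
# placeholders; each would point at the same containerized bus-tracking
# app deployed on the respective platform.
import statistics
import time
import urllib.request

SERVICES = {
    "Cloud Run": "https://bus-tracker.example.run.app/arrivals",
    "App Runner": "https://bus-tracker.example.awsapprunner.com/arrivals",
    "Container Apps": "https://bus-tracker.example.azurecontainerapps.io/arrivals",
}

def probe(url: str, n: int = 50) -> list[float]:
    samples = []
    for _ in range(n):
        start = time.perf_counter()
        with urllib.request.urlopen(url, timeout=10) as resp:
            resp.read()                  # include the full response body
        samples.append((time.perf_counter() - start) * 1000)  # ms
    return samples

for name, url in SERVICES.items():
    ms = probe(url)
    p95 = sorted(ms)[int(0.95 * len(ms)) - 1]
    print(f"{name:>14}: median {statistics.median(ms):.0f} ms, p95 {p95:.0f} ms")
```

    Probing a scale-to-zero service also surfaces cold-start latency in the first samples, which is one reason the study treats scale-to-zero as a usability factor.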

    ClouNS - A Cloud-native Application Reference Model for Enterprise Architects

    Full text link
    The capability to operate cloud-native applications can generate enormous business growth and value. But enterprise architects should be aware that cloud-native applications are vulnerable to vendor lock-in. We investigated cloud-native application design principles, public cloud service providers, and industrial cloud standards. All results indicate that most cloud service categories seem to foster vendor lock-in situations, which might be especially problematic for enterprise architectures. This might sound disillusioning at first. However, we present a reference model for cloud-native applications that relies only on a small subset of well-standardized IaaS services. The reference model can be used for codifying cloud technologies. It can guide technology identification, classification, adoption, research and development processes for cloud-native applications and for vendor-lock-in-aware enterprise architecture engineering methodologies.