348 research outputs found
Collaborative Cloud Computing Framework for Health Data with Open Source Technologies
The proliferation of sensor technologies and advancements in data collection
methods have enabled the accumulation of very large amounts of data.
Increasingly, these datasets are considered for scientific research. However,
designing a system architecture that achieves high performance in terms of
parallelization, query processing time, and aggregation of heterogeneous data
types (e.g., time series, images, and structured data), while keeping
scientific research reproducible, remains a major challenge. This is
particularly true for health sciences research, where systems must be i) easy
to use, with the flexibility to manipulate data at the most granular level,
ii) agnostic of the programming language kernel, iii) scalable, and iv)
compliant with the HIPAA privacy law. In this paper, we review the existing
literature on such big data systems for scientific research in the health
sciences and identify the gaps in the current system landscape. We propose a
novel architecture for a software-hardware-data ecosystem built on open source
technologies such as Apache Hadoop, Kubernetes, and JupyterHub in a
distributed environment. We also evaluate the system using a large clinical
data set of 69M patients.
Comment: This paper is accepted in ACM-BCB 202
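The paper's concrete stack is not reproduced in this listing, so the following is only a minimal sketch, assuming a PySpark session launched from a JupyterHub notebook against de-identified clinical tables on HDFS; all paths, table names, and columns are hypothetical, not from the paper.

```python
# Hypothetical sketch: querying de-identified clinical data on HDFS from a
# JupyterHub notebook via PySpark. Paths and column names are illustrative.
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("clinical-cohort-query")
         .getOrCreate())

# Structured data (e.g., patient demographics) stored as Parquet on HDFS.
patients = spark.read.parquet("hdfs:///warehouse/patients")

# Time-series data (e.g., vitals) kept in a separate, partitioned table.
vitals = spark.read.parquet("hdfs:///warehouse/vitals")

# Join and aggregate heterogeneous data types at the row level; Spark
# parallelizes the join and aggregation across the cluster.
cohort = (patients.filter(patients.age >= 65)
          .join(vitals, "patient_id")
          .groupBy("patient_id")
          .avg("heart_rate"))

cohort.show(10)
```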
A Service Oriented Architecture For Automated Machine Learning At Enterprise-Scale
This thesis presents a solution architecture for productizing machine learning models in an enterprise context and for tracking a model's performance to gain insight into how and when to retrain it. The thesis deals with two challenges. First, machine learning models need to be retrained regularly to incorporate unseen data and maintain their performance, which gives rise to the need for machine learning model management. Second, deploying machine learning models into production carries an overhead in support and operations; there is scope to reduce a model's time to production and thus offer cost-effective solutions. These two challenges are addressed through the introduction of three services under ScienceOps, called ModelDeploy, ModelMonitor and DataMonitor. ModelDeploy brings down the time to production for a machine learning model, while ModelMonitor and DataMonitor help gain insight into how and when a model should be retrained. Finally, the time to production of the proposed architecture on two cloud platforms is evaluated and compared against a rudimentary approach. The monitoring services give insight into the model's performance and into how the statistics of the data change over time.
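The abstract does not describe the internals of ModelMonitor or DataMonitor, so the sketch below is only a hedged illustration of the retraining-insight idea: flagging feature drift between training data and live traffic with a two-sample Kolmogorov-Smirnov test. The function name, threshold, and sample data are assumptions, not the thesis's design.

```python
# Illustrative drift check: compare the distribution of an incoming feature
# against its training-time distribution and flag when retraining may be
# needed. Threshold and data are made up for demonstration.
import numpy as np
from scipy.stats import ks_2samp

def drift_detected(train_sample: np.ndarray,
                   live_sample: np.ndarray,
                   alpha: float = 0.05) -> bool:
    """Two-sample Kolmogorov-Smirnov test for feature drift."""
    statistic, p_value = ks_2samp(train_sample, live_sample)
    return p_value < alpha  # low p-value: distributions likely differ

rng = np.random.default_rng(0)
train = rng.normal(0.0, 1.0, 5_000)   # training-time feature values
live = rng.normal(0.4, 1.0, 5_000)    # shifted live traffic

if drift_detected(train, live):
    print("Data drift detected: schedule model retraining.")
```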
On the performance of SQL scalable systems on Kubernetes: a comparative study
The popularization of Hadoop as the de-facto standard platform for data analytics in the context of Big Data applications
has led to the upsurge of SQL-on-Hadoop systems, which provide scalable query execution engines allowing the use of
SQL queries on data stored in HDFS. In this context, Kubernetes appears as the leading choice to simplify the deployment
and scaling of containerized applications; however, there is a lack of studies about the performance of SQL-on-Hadoop
systems deployed on Kubernetes, and this is the gap we intend to fill in this paper. We present an experimental study
involving four representative SQL scalable platforms: Apache Drill, Apache Hive, Apache Spark SQL and Trino. Concretely, we analyze the performance of these systems when they are deployed on a Hadoop cluster with Kubernetes by
using the TPC-H benchmark. The results of our study can help practitioners and users understand what they can expect in terms
of performance if they plan to use the advantages of Kubernetes to deploy applications using the analyzed SQL scalable
platforms.
Open Access funding provided thanks to the CRUE-CSIC agreement with Springer Nature. Funding for open access charge: Universidad de Málaga / CBUA. This work has been partially funded by the Spanish Ministry of Science and Innovation via Grant PID2020-112540RB-C41 (AEI/FEDER, UE), the Andalusian PAIDI program with grant P18-RT-2799, and by project "Evolución y desarrollo de la plataforma DOP de Big Data" (702C2000044) under the Andalusian "Programa de Apoyo a la I+D+i Empresarial".
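As a hedged illustration of the kind of workload measured in the study, the sketch below submits a simplified TPC-H Q1 to a Trino coordinator running inside a Kubernetes cluster, using the trino Python client; the host, catalog, and schema values are placeholders rather than the paper's actual deployment.

```python
# Sketch: issuing a TPC-H-style query against a Trino coordinator exposed
# as an in-cluster Kubernetes Service. Connection values are illustrative.
import trino

conn = trino.dbapi.connect(
    host="trino.default.svc.cluster.local",  # in-cluster Service DNS name
    port=8080,
    user="benchmark",
    catalog="hive",
    schema="tpch",
)
cur = conn.cursor()

# Simplified TPC-H Q1 (pricing summary report), representative of the
# benchmark queries used in such studies.
cur.execute("""
    SELECT l_returnflag, l_linestatus,
           SUM(l_quantity) AS sum_qty,
           AVG(l_extendedprice) AS avg_price
    FROM lineitem
    WHERE l_shipdate <= DATE '1998-09-02'
    GROUP BY l_returnflag, l_linestatus
    ORDER BY l_returnflag, l_linestatus
""")
for row in cur.fetchall():
    print(row)
```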
LEAN DATA ENGINEERING. COMBINING STATE OF THE ART PRINCIPLES TO PROCESS DATA EFFICIENTLY
The present work was developed during an internship, under the Erasmus+
Traineeship program, at Fieldwork Robotics, a Cambridge-based company that
develops robots to operate in agricultural fields. They collect data from
commercial greenhouses with sensors and RealSense cameras, as well as with
gripper cameras mounted on the robotic arms. This data is recorded mainly in
bag files, consisting of unstructured data, such as images, and
semi-structured data, such as metadata associated with both the conditions in
which the images were taken and information about the robot itself.
Data was uploaded, extracted, cleaned and labelled manually before being used to
train Artificial Intelligence (AI) algorithms to identify raspberries during the harvesting
process. The amount of available data quickly escalates with every trip to the fields, which
creates an ever-growing need for an automated process.
This problem was addressed via the creation of a data engineering platform
encompassing a data lake, a data warehouse and the needed processing
capabilities. This platform was created following a series of principles
entitled Lean Data Engineering Principles (LDEP); the systems that follow them
are called Lean Data Engineering Systems (LDES). These principles urge one to
start with the end in mind: process incoming batch or real-time data with no
resources wasted, limiting costs to what is absolutely necessary for job
completion, in other words, being as lean as possible.
The LDEP principles combine state-of-the-art ideas stemming from several
fields, such as data engineering, software engineering and DevOps, leveraging
cloud technologies at their core.
The proposed custom-made solution enabled the company to scale its data
operations, labelling images almost ten times faster while reducing the
associated costs by over 99.9% in comparison to the previous process. In
addition, the data lifecycle time has been reduced from weeks to hours while
maintaining coherent data quality results, being able, for instance, to
correctly identify 94% of the labels in comparison to a human counterpart.
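The platform's code is not part of this abstract; the following stdlib-only sketch is a hypothetical illustration of one LDEP-flavored step, batch-ingesting raw field recordings into a date-partitioned data lake with sidecar metadata. The directory layout and metadata schema are assumptions, not the thesis's implementation.

```python
# Hypothetical sketch: lean batch ingestion of raw robot recordings from a
# landing zone into a date-partitioned data lake, with sidecar metadata so
# files are discoverable without being reopened.
import json
import shutil
from datetime import datetime, timezone
from pathlib import Path

LANDING = Path("landing")          # raw uploads from the field
DATA_LAKE = Path("datalake/raw")   # immutable, partitioned storage

def ingest(landing: Path = LANDING, lake: Path = DATA_LAKE) -> None:
    for bag_file in landing.glob("*.bag"):
        stamp = datetime.now(timezone.utc)
        partition = lake / stamp.strftime("%Y/%m/%d")
        partition.mkdir(parents=True, exist_ok=True)

        target = partition / bag_file.name
        shutil.move(str(bag_file), target)

        # Record provenance alongside the data itself.
        meta = {"source": bag_file.name,
                "ingested_at": stamp.isoformat(),
                "size_bytes": target.stat().st_size}
        target.with_suffix(".json").write_text(json.dumps(meta, indent=2))

if __name__ == "__main__":
    ingest()
```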
ARM Wrestling with Big Data: A Study of Commodity ARM64 Server for Big Data Workloads
ARM processors have dominated the mobile device market in the last decade due
to their favorable computing to energy ratio. In this age of Cloud data centers
and Big Data analytics, the focus is increasingly on power efficient
processing, rather than just high throughput computing. One of the first
commodity server-grade ARM processors is the recent AMD A1100-series, based on
the 64-bit ARM Cortex-A57 architecture. In this paper, we study the performance and
energy efficiency of a server based on this ARM64 CPU, relative to a comparable
server running an AMD Opteron 3300-series x64 CPU, for Big Data workloads.
Specifically, we study these for Intel's HiBench suite of web, query and
machine learning benchmarks on Apache Hadoop v2.7 in a pseudo-distributed
setup, for data sizes up to files, web pages and tuples. Our
results show that the ARM64 server's runtime performance is comparable to the
x64 server for integer-based workloads like Sort and Hive queries, and only
lags behind for floating-point intensive benchmarks like PageRank, when they do
not exploit data parallelism adequately. We also see that the ARM64 server
takes the energy, and has an Energy Delay Product (EDP) that
is lower than the x64 server. These results hold promise for ARM64
data centers hosting Big Data workloads to reduce their operational costs,
while opening up opportunities for further analysis.
Comment: Accepted for publication in the Proceedings of the 24th IEEE International Conference on High Performance Computing, Data, and Analytics (HiPC), 201
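The Energy Delay Product cited above is simply energy multiplied by runtime (equivalently, average power times runtime squared). A tiny worked example, with made-up numbers rather than the paper's measurements:

```python
# Energy Delay Product (EDP): EDP = energy * runtime
#                                 = (average power * runtime) * runtime.
# The sample figures below are hypothetical, for illustration only.
def edp(avg_power_watts: float, runtime_s: float) -> float:
    energy_joules = avg_power_watts * runtime_s
    return energy_joules * runtime_s  # joule-seconds

arm64 = edp(avg_power_watts=40.0, runtime_s=1200.0)  # hypothetical ARM64 run
x64 = edp(avg_power_watts=90.0, runtime_s=900.0)     # hypothetical x64 run

print(f"ARM64 EDP: {arm64:.2e} J*s")
print(f"x64   EDP: {x64:.2e} J*s")
print(f"ARM64/x64 ratio: {arm64 / x64:.2f}")
```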
ANALYZING THE SYSTEM FEATURES, USABILITY, AND PERFORMANCE OF A CONTAINERIZED APPLICATION ON CLOUD COMPUTING SYSTEMS
This study analyzed the system features, usability, and performance of three serverless cloud computing platforms: Google Cloud's Cloud Run, Amazon Web Service's App Runner, and Microsoft Azure's Container Apps. The analysis was conducted on a containerized mobile application designed to track real-time bus locations for San Antonio public buses on specific routes and provide estimated arrival times for selected bus stops. The study evaluated various system-related features, including service configuration, pricing, and memory & CPU capacity, along with performance metrics such as container latency, Distance Matrix API response time, and CPU utilization for each service. Usability was also evaluated by assessing the quality of documentation, the learning curve for beginner users, and a scale-to-zero factor. The results of the analysis revealed that Google's Cloud Run demonstrated better performance and usability when compared to AWS's App Runner and Microsoft Azure's Container Apps. Cloud Run exhibited lower latency and faster response times for distance matrix queries. These findings provide valuable insights for selecting an appropriate serverless cloud service for similar containerized web applications.
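The study's measurement harness is not included in the abstract; as a rough sketch of how such container-latency comparisons can be scripted, the code below times repeated requests against the same app deployed on each platform. The endpoint URLs are placeholders, not the study's actual deployments.

```python
# Illustrative latency comparison across serverless container platforms:
# send repeated GET requests to each deployment and report the median.
import time
import statistics
import requests

ENDPOINTS = {
    "Cloud Run": "https://bus-tracker-xxxx.a.run.app/arrivals",
    "App Runner": "https://xxxx.awsapprunner.com/arrivals",
    "Container Apps": "https://bus-tracker.xxxx.azurecontainerapps.io/arrivals",
}

def median_latency_ms(url: str, n: int = 20) -> float:
    samples = []
    for _ in range(n):
        start = time.perf_counter()
        requests.get(url, timeout=10)
        samples.append((time.perf_counter() - start) * 1000)
    return statistics.median(samples)

for platform, url in ENDPOINTS.items():
    print(f"{platform}: {median_latency_ms(url):.1f} ms median latency")
```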
ClouNS - A Cloud-native Application Reference Model for Enterprise Architects
The capability to operate cloud-native applications can generate enormous
business growth and value. But enterprise architects should be aware that
cloud-native applications are vulnerable to vendor lock-in. We investigated
cloud-native application design principles, public cloud service providers, and
industrial cloud standards. All results indicate that most cloud service
categories seem to foster vendor lock-in situations which might be especially
problematic for enterprise architectures. This might sound disillusioning at
first. However, we present a reference model for cloud-native applications that
relies only on a small subset of well standardized IaaS services. The reference
model can be used for codifying cloud technologies. It can guide technology
identification, classification, adoption, and research and development
processes for cloud-native applications and for vendor-lock-in-aware
enterprise architecture engineering methodologies.