LEAN DATA ENGINEERING. COMBINING STATE OF THE ART PRINCIPLES TO PROCESS DATA EFFICIENTLY
The present work was developed during an internship, under Erasmus+ Traineeship
program, in Fieldwork Robotics, a Cambridge based company that develops robots to
operate in agricultural fields. The company collects data from commercial greenhouses with
sensors and Intel RealSense cameras, as well as with gripper cameras mounted on the robotic
arms. This data is recorded mainly in bag files, consisting of unstructured data, such as
images, and semi-structured data, such as metadata about both the conditions in which the
images were taken and the robot itself.
Data was uploaded, extracted, cleaned and labelled manually before being used to
train Artificial Intelligence (AI) algorithms to identify raspberries during the harvesting
process. The amount of available data quickly escalates with every trip to the fields, which
creates an ever-growing need for an automated process.
This problem was addressed by creating a data engineering platform encompassing a data
lake, a data warehouse and the required processing capabilities. This platform
was created following a series of principles entitled Lean Data Engineering Principles
(LDEP), and the systems that follow them are called Lean Data Engineering Systems
(LDES). These principles urge practitioners to start with the end in mind: process incoming
batch or real-time data without wasting resources, limiting costs to what is strictly
necessary to complete the job, in other words, to be as lean as possible.
The LDEP principles combine state-of-the-art ideas from several fields, such as data
engineering, software engineering and DevOps, with cloud technologies at their core.
The proposed custom-made solution enabled the company to scale its data operations,
labelling images almost ten times faster while cutting the associated costs by over
99.9% compared with the previous process. In addition, the data lifecycle time has been
reduced from weeks to hours while maintaining consistent data quality, being able,
for instance, to correctly identify 94% of the labels in comparison to a human counterpart.
Team One Carbon Catcher Design Report
Overview
The burning of fossil fuels is a major contributor to the rise of CO2 in the atmosphere; the US Department of Transportation alone contributed almost 6 million metric tons of carbon dioxide emissions in 2018 (EIA). This report therefore proposes recycling captured CO2 into a base for cleaner-burning fuel, reducing emissions from the transportation industry and many others.
Extraction of atmospheric CO2 is possible through a membrane filtration system based on traditional nitrogen generation. The passive filtration system autonomously separates CO2 from other air components, reducing energy consumption. The system's sensors and actuators use similar energy-saving strategies, such as distributing cloud-computing services across multiple servers and mainframes to reduce computing power. Air movement is directed by a scalable fan device with a modular design that allows fan parts to be customized to specific size and installation requirements. As an integrated device, Team 1’s Carbon Catcher operates at high efficiency to maximize the commercial opportunity of converting captured CO2 into cleaner fuel while reducing CO2 emissions and the greenhouse effect.
Goal
The goal of Team 1’s Carbon Catcher project proposal is to design a cost-effective, scalable, and modular atmospheric carbon dioxide removal system that can be deployed in a range of urban environments and fit a variety of customer requirements.
Software Engineering for Real-Time NoSQL Systems-centric Big Data Analytics
Recent advances in Big Data Analytics (BDA) have stimulated widespread interest in integrating BDA capabilities into all aspects of a business. Before these advances, companies had spent time optimizing the software development process and the best practices associated with application development, including project management structures and how to deliver new application features to customers efficiently. While these processes are significant for application development, they cannot be utilized effectively for the software development of Big Data Analytics. Instead, certain practices and technologies enable automation and monitoring across the full lifecycle, from design to deployment and operation of analytics. This paper builds on those practices and technologies and introduces a highly scalable framework for Big Data Analytics development operations. The framework builds on the best-known processes associated with DevOps, demonstrated on a NoSQL cloud-based platform that consumes and processes structured and unstructured real-time data. As a result, the framework produces scalable, timely, and accurate analytics in real time, which can be easily adjusted or enhanced to meet the needs of a business and its customers.
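A consumer that ingests structured and unstructured records side by side, as the framework above targets, can be sketched in a few lines. The record layout, the field name `value`, and the window size below are illustrative assumptions, not the paper's API.

```python
import json
from collections import deque

class RollingAnalytics:
    """Minimal sketch of a real-time consumer: structured records
    (JSON with a numeric "value" field) update a rolling average,
    while anything unparseable is counted as unstructured input."""

    def __init__(self, window: int = 100):
        self.values = deque(maxlen=window)   # bounded memory for real-time use
        self.unstructured_seen = 0

    def consume(self, record: bytes) -> None:
        try:
            payload = json.loads(record)               # structured path
            self.values.append(float(payload["value"]))
        except (ValueError, KeyError, TypeError):
            self.unstructured_seen += 1                # unstructured path

    def snapshot(self) -> dict:
        avg = sum(self.values) / len(self.values) if self.values else 0.0
        return {"rolling_avg": avg, "unstructured": self.unstructured_seen}
```

In a deployed system the `consume` loop would be fed by a message broker and the snapshots served to dashboards; the bounded window is what keeps the per-record cost constant as volume grows.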
Adaptive Big Data Pipeline
Over the past three decades, data has evolved from a simple software by-product into one of a company's most important assets, used to understand customers and foresee trends. Deep learning has demonstrated that large volumes of clean data generally provide more flexibility and accuracy when modeling a phenomenon. However, handling ever-increasing data volumes entails new challenges: the lack of expertise to select the appropriate big data tools for the processing pipelines, and the speed at which engineers can take such pipelines into production reliably, leveraging the cloud. We introduce a system called Adaptive Big Data Pipelines: a platform to automate data pipeline creation. It provides an interface to capture the data sources, transformations, destinations and execution schedule. The system builds up the cloud infrastructure, schedules and fine-tunes the transformations, and creates the data lineage graph. This system has been tested on data sets of 50 gigabytes, processing them in just a few minutes without user intervention.
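The spec-to-pipeline idea above, capturing sources, transformations and destinations declaratively and then deriving both an execution order and a lineage graph, can be sketched compactly. The spec format here (a mapping from each step to its inputs) is an invented illustration, not the system's actual interface.

```python
from graphlib import TopologicalSorter

def build_pipeline(spec: dict) -> tuple:
    """Turn a declarative spec {step: [inputs]} into
    (execution order, lineage edges).

    Lineage edges point from each input to the step that consumes it;
    the topological sort guarantees every step runs after its inputs.
    """
    edges = [(src, step) for step, inputs in spec.items() for src in inputs]
    order = list(TopologicalSorter(spec).static_order())
    return order, edges
```

Usage: `build_pipeline({"clean": ["raw_events"], "aggregate": ["clean"]})` yields an order in which `raw_events` precedes `clean`, which precedes `aggregate`, plus the lineage edges connecting them. A real platform would attach cloud resources and schedules to each step, but the dependency skeleton is the same.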
Scalable Software Platform Architecture for the Power Distribution Protection and Analysis
This thesis explores the benefits of microservice architecture over traditional monolithic application architecture, and of deploying software to the cloud or the edge rather than to traditional environments. A microservice architecture consists of multiple services that each serve a single purpose, with every separate function of the application running in its own container: an isolated environment based on the Linux kernel. This thesis was done for ABB (ASEA Brown Boveri) Distribution Solutions to modernize one of their existing applications.
The main goal of this thesis is to describe the transition from a monolithic application architecture to a microservice architecture. However, during the case study we encountered problems that prevented us from completing the project. The most significant of these was the high degree of dependence between different parts of the monolithic application. The intended end result of the project was proof-of-concept-level software, which we could not achieve.
We used design science to guide our decision-making, choosing Action Design Research (ADR) as our methodology because we found it supported interactive work. This fit our situation well, as we were conducting this research daily at ABB’s office. Design science primarily aims at the end result, which in our case would have been to transplant the old application into the new architecture.
One of our most important results is that we identified critical issues that need to be addressed before moving from a monolithic to a microservice architecture. These findings included technological debt accumulated over the years, incomplete knowledge of the legacy application, and internal system dependencies; these dependencies represent a significant challenge in restructuring the monolith into a microservice architecture. As a fourth finding, the available resources, such as time, experts and funding, must be sufficient to produce an appropriate result.
As a theoretical contribution, we produced our own version of the Action Design Research method. We combined its first two steps so that while the customer organization was defining the problem, our research team provided solutions to it. Of these solutions, the client organization chose the one that suited them best. This process was possible because we had an open and continuing discussion with ABB's development unit.
Collaborative Cloud Computing Framework for Health Data with Open Source Technologies
The proliferation of sensor technologies and advancements in data collection
methods have enabled the accumulation of very large amounts of data.
Increasingly, these datasets are considered for scientific research. However,
designing a system architecture that achieves high performance in
parallelization, query processing time, and aggregation of heterogeneous data
types (e.g., time series, images, and structured data), while keeping
scientific research reproducible, remains a major challenge. This is specifically
true for health sciences research, where the systems must be i) easy to use
with the flexibility to manipulate data at the most granular level, ii)
agnostic of programming language kernel, iii) scalable, and iv) compliant with
the HIPAA privacy law. In this paper, we review the existing literature for
such big data systems for scientific research in health sciences and identify
the gaps of the current system landscape. We propose a novel architecture for
software-hardware-data ecosystem using open source technologies such as Apache
Hadoop, Kubernetes and JupyterHub in a distributed environment. We also
evaluate the system using a large clinical data set of 69M patients.
Comment: This paper is accepted in ACM-BCB 202
A context-aware multiple Blockchain architecture for managing low memory devices
Blockchain technology constitutes a paradigm shift in the way we conceive
distributed architectures. A Blockchain system lets us build platforms where
data are immutable and tamper-proof, with some constraints on the throughput
and the amount of memory required to store the ledger. This paper aims to solve
the issue of memory and performance requirements developing a multiple
Blockchain architecture that mixes the benefits deriving from a public and a
private Blockchain. This kind of approach enables small sensors - with memory
and performance constraints - to join the network without worrying about the
amount of data to store. The development is proposed following a context-aware
approach, to make the architecture scalable and easy to use in different
scenarios
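The low-memory node motivated above can be illustrated with a sensor that keeps only the head hash of its private chain, rather than the ledger itself, and periodically anchors that hash to a public chain. Everything below (the class, its fields, the list standing in for the public chain) is a hypothetical sketch of this pattern, not the paper's architecture.

```python
import hashlib

def chain_hash(prev_hash: str, payload: bytes) -> str:
    """Each block commits to its predecessor, making history tamper-evident."""
    return hashlib.sha256(prev_hash.encode() + payload).hexdigest()

class LightSensor:
    """Constant-memory participant: stores one hash, however long the history.

    Periodically publishing (anchoring) the head to a public chain lets
    anyone later verify the private history against an immutable record."""

    def __init__(self):
        self.head = "genesis"

    def record(self, reading: bytes) -> None:
        self.head = chain_hash(self.head, reading)

    def anchor(self, public_chain: list) -> None:
        public_chain.append(self.head)
```

A verifier holding the full private history can replay `chain_hash` over it and compare the result with the anchored value; any altered reading changes every subsequent hash, so the comparison fails.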
Resource Management and Scheduling for Big Data Applications in Cloud Computing Environments
This chapter presents the software architectures of big data processing
platforms. It provides in-depth knowledge of the resource management
techniques involved in deploying big data processing systems in cloud
environments. It starts from the very basics and gradually introduces the core
components of resource management, which we have divided into multiple layers. It
covers state-of-the-art practices and research in SLA-based resource
management, with a specific focus on job scheduling mechanisms.
Comment: 27 pages, 9 figures
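One of the simplest SLA-aware job scheduling policies of the kind surveyed above is earliest-deadline-first: always run the job whose deadline expires soonest and count the SLA misses. The job fields (`name`, `runtime`, `deadline`) below are illustrative, not from the chapter.

```python
import heapq

def edf_schedule(jobs: list) -> tuple:
    """Earliest-deadline-first scheduling on a single resource.

    Returns (execution order, number of SLA violations), where a
    violation means the job finished after its deadline."""
    queue = [(j["deadline"], j["name"], j["runtime"]) for j in jobs]
    heapq.heapify(queue)  # min-heap keyed on deadline
    order, clock, violations = [], 0, 0
    while queue:
        deadline, name, runtime = heapq.heappop(queue)
        clock += runtime
        if clock > deadline:
            violations += 1  # SLA miss
        order.append(name)
    return order, violations
```

Real cluster schedulers layer preemption, multiple resources, and cost models on top of a core ordering decision like this one, which is why SLA-based resource management is usually treated layer by layer.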