
    LEAN DATA ENGINEERING. COMBINING STATE OF THE ART PRINCIPLES TO PROCESS DATA EFFICIENTLY

    The present work was developed during an internship, under the Erasmus+ Traineeship programme, at Fieldwork Robotics, a Cambridge-based company that develops robots to operate in agricultural fields. The robots collect data from commercial greenhouses with sensors and RealSense cameras, as well as with gripper cameras mounted on the robotic arms. This data is recorded mainly in bag files, consisting of unstructured data, such as images, and semi-structured data, such as metadata about the conditions in which the images were taken and about the robot itself. Data was uploaded, extracted, cleaned and labelled manually before being used to train Artificial Intelligence (AI) algorithms to identify raspberries during the harvesting process. The amount of available data quickly escalates with every trip to the fields, which creates an ever-growing need for an automated process. This problem was addressed by creating a data engineering platform encompassing a data lake, a data warehouse and the processing capabilities they require. The platform was built following a series of principles entitled Lean Data Engineering Principles (LDEP); systems that follow them are called Lean Data Engineering Systems (LDES). These principles urge practitioners to start with the end in mind: process incoming batch or real-time data without wasting resources, limiting costs to what is strictly necessary to complete the job, in other words, to be as lean as possible. The LDEP are a combination of state-of-the-art ideas stemming from several fields, such as data engineering, software engineering and DevOps, with cloud technologies at their core. The proposed custom-made solution enabled the company to scale its data operations, labelling images almost ten times faster while cutting the associated costs by over 99.9% compared to the previous process. In addition, the data lifecycle time has been reduced from weeks to hours while maintaining consistent data quality, for instance correctly identifying 94% of the labels produced by a human counterpart.
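    As a rough illustration of the kind of automated ingest step such a platform replaces manual work with, the sketch below extracts camera frames and their metadata from a ROS bag file. It assumes a ROS 1 environment (rosbag, cv_bridge, OpenCV); the topic name, paths and CSV schema are invented for the example, since the company's actual pipeline is not published.

```python
# Sketch only: topic name, paths and CSV schema are assumptions for illustration.
import csv
from pathlib import Path

import cv2
import rosbag
from cv_bridge import CvBridge

IMAGE_TOPIC = "/camera/color/image_raw"   # assumed topic; the real recordings are not public

def extract_bag(bag_path: str, out_dir: str) -> None:
    """Dump every frame on IMAGE_TOPIC to PNG and record its metadata for the warehouse."""
    bridge = CvBridge()
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    rows = []
    with rosbag.Bag(bag_path) as bag:
        for _, msg, stamp in bag.read_messages(topics=[IMAGE_TOPIC]):
            frame = bridge.imgmsg_to_cv2(msg, desired_encoding="bgr8")
            name = f"{stamp.to_nsec()}.png"
            cv2.imwrite(str(out / name), frame)
            rows.append({"file": name, "stamp_ns": stamp.to_nsec(),
                         "height": msg.height, "width": msg.width})
    # The per-frame metadata becomes the semi-structured side of the load (warehouse rows).
    with open(out / "frames.csv", "w", newline="") as fh:
        writer = csv.DictWriter(fh, fieldnames=["file", "stamp_ns", "height", "width"])
        writer.writeheader()
        writer.writerows(rows)
```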

    Software Engineering for Real-Time NoSQL Systems-centric Big Data Analytics

    Recent advances in Big Data Analytics (BDA) have stimulated widespread interest in integrating BDA capabilities into all aspects of a business. Before these advances, companies spent their time optimizing the software development process and the best practices associated with application development. These processes include project management structures and ways to deliver new application features to customers efficiently. While these processes are significant for application development, they cannot be applied effectively to the software development of Big Data Analytics. Instead, a different set of practices and technologies is needed to enable automation and monitoring across the full lifecycle, from design to the deployment and operation of analytics. This paper builds on those practices and technologies and introduces a highly scalable framework for Big Data Analytics development operations. The framework builds on the best-known processes associated with DevOps. These practices are then demonstrated on a NoSQL cloud-based platform that consumes and processes structured and unstructured real-time data. As a result, the framework produces scalable, timely and accurate analytics in real time, which can be easily adjusted or enhanced to meet the needs of a business and its customers.
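    The abstract does not name the underlying stack, so the following sketch only illustrates the general pattern it describes: a consumer that ingests real-time data and routes structured and unstructured records into a NoSQL store. Kafka (kafka-python) and MongoDB (pymongo) are assumptions made for the example.

```python
# Illustrative only: Kafka topic, Mongo URI and the routing rule are assumptions.
import json

from kafka import KafkaConsumer
from pymongo import MongoClient

def run_stream(bootstrap: str = "localhost:9092", topic: str = "events") -> None:
    consumer = KafkaConsumer(topic, bootstrap_servers=bootstrap,
                             value_deserializer=lambda b: json.loads(b.decode("utf-8")))
    analytics = MongoClient("mongodb://localhost:27017")["analytics"]
    for record in consumer:
        doc = record.value
        # Structured events go to one collection, raw payloads to another, so
        # downstream analytics can query each at the granularity it needs.
        if isinstance(doc, dict) and "schema" in doc:
            analytics["structured"].insert_one(doc)
        else:
            analytics["raw"].insert_one({"payload": doc, "offset": record.offset})
```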

    Adaptive Big Data Pipeline

    Over the past three decades, data has evolved from a simple software by-product into one of a company's most important assets, used to understand customers and foresee trends. Deep learning has demonstrated that large volumes of clean data generally provide more flexibility and accuracy when modeling a phenomenon. However, handling ever-increasing data volumes entails new challenges: the lack of expertise to select the appropriate big data tools for the processing pipelines, as well as the speed at which engineers can take such pipelines into production reliably while leveraging the cloud. We introduce a system called Adaptive Big Data Pipelines: a platform to automate the creation of data pipelines. It provides an interface to capture the data sources, transformations, destinations and execution schedule. The system builds up the cloud infrastructure, schedules and fine-tunes the transformations, and creates the data lineage graph. The system has been tested on data sets of 50 gigabytes, processing them in just a few minutes without user intervention.
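    A hypothetical sketch of the kind of declarative interface the abstract describes, capturing sources, transformations, destinations and an execution schedule; the names (PipelineSpec, submit) are illustrative, not the system's actual API.

```python
# Hypothetical spec and submit() call; the real platform's interface is not published.
from dataclasses import dataclass, field

@dataclass
class PipelineSpec:
    source: str                                                # e.g. "s3://raw/sales/*.csv"
    transformations: list[str] = field(default_factory=list)   # applied in order
    destination: str = ""                                       # e.g. "warehouse.sales_clean"
    schedule: str = "0 2 * * *"                                 # cron-style execution schedule

def submit(spec: PipelineSpec) -> dict:
    """Echo an execution plan; a real system would also provision infrastructure,
    fine-tune the transformations and record the lineage graph."""
    return {"steps": [f"read {spec.source}", *spec.transformations, f"write {spec.destination}"],
            "schedule": spec.schedule}

plan = submit(PipelineSpec(source="s3://raw/sales/*.csv",
                           transformations=["drop_nulls", "dedupe_by_id"],
                           destination="warehouse.sales_clean"))
print(plan)
```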

    Scalable Software Platform Architecture for the Power Distribution Protection and Analysis

    This thesis explores the benefits of microservice architecture over traditional monolithic application architecture, and of modern environments for deploying software to the cloud or the edge. A microservice architecture consists of multiple services that each serve a single purpose, with every separate function of the application packaged in its own container. Containers are isolated environments built on the Linux kernel. The thesis was done for ABB (ASEA Brown Boveri) Distribution Solutions to modernize one of their existing applications. Its main goal is to describe the transition from a monolithic application architecture to a microservice architecture. During the case study, however, we encountered problems that prevented us from completing the project; the most significant was the high degree of dependence between different parts of the monolithic application. The intended end result of the project was proof-of-concept-level software, which we could not achieve. We used design science as a methodology to guide our decision-making, and chose Action Design Research (ADR) because we found it supported interactive work. This fitted our situation very well, as we were doing the research daily at ABB's office. Design science primarily aims at the end result, which in our case would have been to transplant the old application into the new architecture. One of our most important results is that we were able to identify critical issues that need to be addressed before moving from a monolithic to a microservice architecture. These findings included technical debt accumulated over the years, incomplete knowledge of the legacy application, and internal system dependencies; these dependencies represent a significant challenge in restructuring the monolith into a microservice architecture. As a fourth finding, the available resources, such as time, experts and funding, must be sufficient to produce an appropriate result. As a theoretical contribution, we produced our own version of the Action Design Research method: we combined its first two steps so that while the customer organization was defining the problem, our research team proposed solutions, from which the client organization chose the one that suited them best. This process was possible because we had an open and continuing discussion with ABB's development unit.
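    The dependency problem highlighted above can be made concrete with a small, generic helper (not from the thesis): mapping intra-package imports exposes the coupling hot spots that make carving services out of a monolith difficult. The sketch assumes a Python code base purely for illustration.

```python
# Generic helper, not from the thesis: build a module-level import graph for one package.
import ast
from collections import defaultdict
from pathlib import Path

def internal_imports(package_root: str) -> dict[str, set[str]]:
    """Return {module: internal modules it imports}; module names are simplified to file stems."""
    root = Path(package_root)
    names = {p.stem for p in root.rglob("*.py")}
    graph: dict[str, set[str]] = defaultdict(set)
    for path in root.rglob("*.py"):
        tree = ast.parse(path.read_text(encoding="utf-8"))
        for node in ast.walk(tree):
            if isinstance(node, ast.Import):
                targets = [alias.name.split(".")[0] for alias in node.names]
            elif isinstance(node, ast.ImportFrom) and node.module:
                targets = [node.module.split(".")[0]]
            else:
                continue
            graph[path.stem].update(t for t in targets if t in names)
    return graph

# Modules with many inbound edges are the ones that resist being split into services.
```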

    Collaborative Cloud Computing Framework for Health Data with Open Source Technologies

    The proliferation of sensor technologies and advancements in data collection methods have enabled the accumulation of very large amounts of data, and these datasets are increasingly considered for scientific research. However, designing a system architecture that achieves high performance in terms of parallelization, query processing time and aggregation of heterogeneous data types (e.g., time series, images and structured data), while keeping scientific research reproducible, remains a major challenge. This is especially true for health sciences research, where systems must be i) easy to use, with the flexibility to manipulate data at the most granular level, ii) agnostic of the programming language kernel, iii) scalable, and iv) compliant with the HIPAA privacy law. In this paper, we review the existing literature on such big data systems for scientific research in the health sciences and identify the gaps in the current system landscape. We propose a novel architecture for a software-hardware-data ecosystem using open source technologies such as Apache Hadoop, Kubernetes and JupyterHub in a distributed environment. We also evaluate the system using a large clinical data set of 69M patients.
    Comment: This paper is accepted in ACM-BCB 202
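    As an illustration of the kind of kernel-agnostic, scalable analysis the proposed ecosystem targets, the sketch below runs a cohort query with PySpark over data on the Hadoop layer; the path and column names are invented for the example.

```python
# Sketch only: HDFS path and column names are invented; assumes PySpark is available.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("cohort-count").getOrCreate()

# Patient-level records on the Hadoop layer; behind JupyterHub any notebook kernel
# could run the same query.
patients = spark.read.parquet("hdfs:///warehouse/patients")

cohort = (patients
          .filter(F.col("age") >= 65)
          .groupBy("primary_diagnosis")
          .agg(F.countDistinct("patient_id").alias("n_patients"))
          .orderBy(F.col("n_patients").desc()))

cohort.show(20)
```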

    A context-aware multiple Blockchain architecture for managing low memory devices

    Blockchain technology constitutes a paradigm shift in the way we conceive distributed architectures. A Blockchain system lets us build platforms where data are immutable and tamper-proof, at the cost of constraints on throughput and on the amount of memory required to store the ledger. This paper addresses the memory and performance requirements by developing a multiple-Blockchain architecture that mixes the benefits of a public and a private Blockchain. This approach enables small sensors, which have memory and performance constraints, to join the network without worrying about the amount of data to store. The development follows a context-aware approach, making the architecture scalable and easy to use in different scenarios.
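    A toy sketch (not the paper's protocol) of how a memory-constrained sensor might participate: it keeps only a bounded window of anchor hashes and delegates ledger storage to a gateway that writes to the private chain and periodically anchors batches to the public one. The gateway endpoint and anchor policy are made up for the example.

```python
# Toy model, not the paper's protocol: gateway URL and anchor policy are invented.
import hashlib
import json
import time

class LightSensorClient:
    def __init__(self, gateway_url: str, max_anchors: int = 64):
        self.gateway_url = gateway_url      # private-chain gateway the sensor trusts
        self.max_anchors = max_anchors
        self.anchors: list[str] = []        # bounded list of hashes instead of a full ledger

    def submit(self, reading: dict) -> str:
        payload = json.dumps({"ts": time.time(), **reading}, sort_keys=True)
        digest = hashlib.sha256(payload.encode("utf-8")).hexdigest()
        # A real client would POST payload+digest to the gateway, which batches
        # transactions on the private chain and anchors batch roots to the public one.
        self.anchors.append(digest)
        self.anchors = self.anchors[-self.max_anchors:]
        return digest

client = LightSensorClient("https://gateway.example/api")
print(client.submit({"sensor": "temp-01", "celsius": 21.4}))
```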

    Resource Management and Scheduling for Big Data Applications in Cloud Computing Environments

    This chapter presents the software architectures of big data processing platforms. It provides in-depth knowledge of the resource management techniques involved in deploying big data processing systems in cloud environments. It starts from the very basics and gradually introduces the core components of resource management, which we have divided into multiple layers. It covers state-of-the-art practices and research in SLA-based resource management, with a specific focus on job scheduling mechanisms.
    Comment: 27 pages, 9 figures
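    As a minimal illustration of SLA-based job scheduling of the kind the chapter surveys, the sketch below admits jobs in earliest-deadline-first order while cluster capacity remains; real resource managers such as YARN or Kubernetes are of course far richer.

```python
# Minimal earliest-deadline-first admission, for illustration of SLA-aware scheduling.
import heapq
from dataclasses import dataclass, field

@dataclass(order=True)
class Job:
    deadline_s: float                                    # SLA deadline drives the ordering
    name: str = field(compare=False)
    demand_cores: int = field(compare=False, default=1)

def schedule(jobs: list[Job], free_cores: int) -> list[str]:
    """Admit jobs in deadline order while capacity remains; the rest wait or trigger scale-out."""
    heap = list(jobs)
    heapq.heapify(heap)
    placed = []
    while heap and free_cores > 0:
        job = heapq.heappop(heap)
        if job.demand_cores <= free_cores:
            free_cores -= job.demand_cores
            placed.append(job.name)
    return placed

print(schedule([Job(60, "etl", 4), Job(10, "dashboard", 2), Job(30, "train", 8)], free_cores=8))
```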