
    LEAN DATA ENGINEERING. COMBINING STATE OF THE ART PRINCIPLES TO PROCESS DATA EFFICIENTLY

    The present work was developed during an internship, under the Erasmus+ Traineeship programme, at Fieldwork Robotics, a Cambridge-based company that develops robots to operate in agricultural fields. The robots collect data from commercial greenhouses with sensors and RealSense cameras, as well as with gripper cameras mounted on the robotic arms. This data is recorded mainly in bag files, consisting of unstructured data, such as images, and semi-structured data, such as metadata about the conditions in which the images were taken and about the robot itself. Data was uploaded, extracted, cleaned and labelled manually before being used to train Artificial Intelligence (AI) algorithms to identify raspberries during the harvesting process. The amount of available data quickly escalates with every trip to the fields, which creates an ever-growing need for an automated process. This problem was addressed by creating a data engineering platform encompassing a data lake, a data warehouse and the processing capabilities they require. The platform was built following a series of principles entitled Lean Data Engineering Principles (LDEP); systems that follow them are called Lean Data Engineering Systems (LDES). These principles urge practitioners to start with the end in mind: process incoming batch or real-time data without wasting resources, limiting costs to what is strictly necessary to complete the job, in other words, to be as lean as possible. The LDEP are a combination of state-of-the-art ideas stemming from several fields, such as data engineering, software engineering and DevOps, with cloud technologies at their core. The proposed custom-made solution enabled the company to scale its data operations, labelling images almost ten times faster while cutting the associated costs by over 99.9% compared to the previous process. In addition, the data lifecycle time has been reduced from weeks to hours while maintaining consistent data quality, for instance correctly identifying 94% of the labels produced by a human counterpart.
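    As a rough illustration of the kind of automated ingest step such a platform replaces manual work with, the sketch below extracts camera frames and their metadata from a ROS bag file. It assumes a ROS 1 environment (rosbag, cv_bridge, OpenCV); the topic name, paths and CSV schema are invented for the example, since the company's actual pipeline is not published.

```python
# Sketch only: topic name, paths and CSV schema are assumptions for illustration.
import csv
from pathlib import Path

import cv2
import rosbag
from cv_bridge import CvBridge

IMAGE_TOPIC = "/camera/color/image_raw"   # assumed topic; the real recordings are not public

def extract_bag(bag_path: str, out_dir: str) -> None:
    """Dump every frame on IMAGE_TOPIC to PNG and record its metadata for the warehouse."""
    bridge = CvBridge()
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    rows = []
    with rosbag.Bag(bag_path) as bag:
        for _, msg, stamp in bag.read_messages(topics=[IMAGE_TOPIC]):
            frame = bridge.imgmsg_to_cv2(msg, desired_encoding="bgr8")
            name = f"{stamp.to_nsec()}.png"
            cv2.imwrite(str(out / name), frame)
            rows.append({"file": name, "stamp_ns": stamp.to_nsec(),
                         "height": msg.height, "width": msg.width})
    # The per-frame metadata becomes the semi-structured side of the load (warehouse rows).
    with open(out / "frames.csv", "w", newline="") as fh:
        writer = csv.DictWriter(fh, fieldnames=["file", "stamp_ns", "height", "width"])
        writer.writeheader()
        writer.writerows(rows)
```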

    Software Engineering for Real-Time NoSQL Systems-centric Big Data Analytics

    Recent advances in Big Data Analytics (BDA) have stimulated widespread interest in integrating BDA capabilities into all aspects of a business. Before these advances, companies spent their time optimizing the software development process and the best practices associated with application development. These processes include project management structures and ways to deliver new application features to customers efficiently. While these processes are significant for application development, they cannot be applied effectively to the software development of Big Data Analytics. Instead, a different set of practices and technologies is needed to enable automation and monitoring across the full lifecycle, from design to the deployment and operation of analytics. This paper builds on those practices and technologies and introduces a highly scalable framework for Big Data Analytics development operations. The framework builds on the best-known processes associated with DevOps. These practices are then demonstrated on a NoSQL cloud-based platform that consumes and processes structured and unstructured real-time data. As a result, the framework produces scalable, timely and accurate analytics in real time, which can be easily adjusted or enhanced to meet the needs of a business and its customers.
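    The abstract does not name the underlying stack, so the following sketch only illustrates the general pattern it describes: a consumer that ingests real-time data and routes structured and unstructured records into a NoSQL store. Kafka (kafka-python) and MongoDB (pymongo) are assumptions made for the example.

```python
# Illustrative only: Kafka topic, Mongo URI and the routing rule are assumptions.
import json

from kafka import KafkaConsumer
from pymongo import MongoClient

def run_stream(bootstrap: str = "localhost:9092", topic: str = "events") -> None:
    consumer = KafkaConsumer(topic, bootstrap_servers=bootstrap,
                             value_deserializer=lambda b: json.loads(b.decode("utf-8")))
    analytics = MongoClient("mongodb://localhost:27017")["analytics"]
    for record in consumer:
        doc = record.value
        # Structured events go to one collection, raw payloads to another, so
        # downstream analytics can query each at the granularity it needs.
        if isinstance(doc, dict) and "schema" in doc:
            analytics["structured"].insert_one(doc)
        else:
            analytics["raw"].insert_one({"payload": doc, "offset": record.offset})
```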

    Adaptive Big Data Pipeline

    Over the past three decades, data has evolved from a simple software by-product into one of a company's most important assets, used to understand customers and foresee trends. Deep learning has demonstrated that large volumes of clean data generally provide more flexibility and accuracy when modeling a phenomenon. However, handling ever-increasing data volumes entails new challenges: the lack of expertise to select the appropriate big data tools for the processing pipelines, as well as the speed at which engineers can take such pipelines into production reliably while leveraging the cloud. We introduce a system called Adaptive Big Data Pipelines: a platform to automate the creation of data pipelines. It provides an interface to capture the data sources, transformations, destinations and execution schedule. The system builds up the cloud infrastructure, schedules and fine-tunes the transformations, and creates the data lineage graph. The system has been tested on data sets of 50 gigabytes, processing them in just a few minutes without user intervention.
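    A hypothetical sketch of the kind of declarative interface the abstract describes, capturing sources, transformations, destinations and an execution schedule; the names (PipelineSpec, submit) are illustrative, not the system's actual API.

```python
# Hypothetical spec and submit() call; the real platform's interface is not published.
from dataclasses import dataclass, field

@dataclass
class PipelineSpec:
    source: str                                                # e.g. "s3://raw/sales/*.csv"
    transformations: list[str] = field(default_factory=list)   # applied in order
    destination: str = ""                                       # e.g. "warehouse.sales_clean"
    schedule: str = "0 2 * * *"                                 # cron-style execution schedule

def submit(spec: PipelineSpec) -> dict:
    """Echo an execution plan; a real system would also provision infrastructure,
    fine-tune the transformations and record the lineage graph."""
    return {"steps": [f"read {spec.source}", *spec.transformations, f"write {spec.destination}"],
            "schedule": spec.schedule}

plan = submit(PipelineSpec(source="s3://raw/sales/*.csv",
                           transformations=["drop_nulls", "dedupe_by_id"],
                           destination="warehouse.sales_clean"))
print(plan)
```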

    Scalable Software Platform Architecture for the Power Distribution Protection and Analysis

    This thesis explores the benefits of microservice architecture over traditional monolithic application architecture, and of modern environments for deploying software to the cloud or the edge. A microservice architecture consists of multiple services that each serve a single purpose, with every separate function of the application packaged in its own container. Containers are isolated environments built on the Linux kernel. The thesis was done for ABB (ASEA Brown Boveri) Distribution Solutions to modernize one of their existing applications. Its main goal is to describe the transition from a monolithic application architecture to a microservice architecture. During the case study, however, we encountered problems that prevented us from completing the project; the most significant was the high degree of dependence between different parts of the monolithic application. The intended end result of the project was proof-of-concept-level software, which we could not achieve. We used design science as a methodology to guide our decision-making, and chose Action Design Research (ADR) because we found it supported interactive work. This fitted our situation very well, as we were doing the research daily at ABB's office. Design science primarily aims at the end result, which in our case would have been to transplant the old application into the new architecture. One of our most important results is that we were able to identify critical issues that need to be addressed before moving from a monolithic to a microservice architecture. These findings included technical debt accumulated over the years, incomplete knowledge of the legacy application, and internal system dependencies; these dependencies represent a significant challenge in restructuring the monolith into a microservice architecture. As a fourth finding, the available resources, such as time, experts and funding, must be sufficient to produce an appropriate result. As a theoretical contribution, we produced our own version of the Action Design Research method: we combined its first two steps so that while the customer organization was defining the problem, our research team proposed solutions, from which the client organization chose the one that suited them best. This process was possible because we had an open and continuing discussion with ABB's development unit.
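    The dependency problem highlighted above can be made concrete with a small, generic helper (not from the thesis): mapping intra-package imports exposes the coupling hot spots that make carving services out of a monolith difficult. The sketch assumes a Python code base purely for illustration.

```python
# Generic helper, not from the thesis: build a module-level import graph for one package.
import ast
from collections import defaultdict
from pathlib import Path

def internal_imports(package_root: str) -> dict[str, set[str]]:
    """Return {module: internal modules it imports}; module names are simplified to file stems."""
    root = Path(package_root)
    names = {p.stem for p in root.rglob("*.py")}
    graph: dict[str, set[str]] = defaultdict(set)
    for path in root.rglob("*.py"):
        tree = ast.parse(path.read_text(encoding="utf-8"))
        for node in ast.walk(tree):
            if isinstance(node, ast.Import):
                targets = [alias.name.split(".")[0] for alias in node.names]
            elif isinstance(node, ast.ImportFrom) and node.module:
                targets = [node.module.split(".")[0]]
            else:
                continue
            graph[path.stem].update(t for t in targets if t in names)
    return graph

# Modules with many inbound edges are the ones that resist being split into services.
```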

    Collaborative Cloud Computing Framework for Health Data with Open Source Technologies

    The proliferation of sensor technologies and advancements in data collection methods have enabled the accumulation of very large amounts of data, and these datasets are increasingly considered for scientific research. However, designing a system architecture that achieves high performance in terms of parallelization, query processing time and aggregation of heterogeneous data types (e.g., time series, images and structured data), while keeping scientific research reproducible, remains a major challenge. This is especially true for health sciences research, where systems must be i) easy to use, with the flexibility to manipulate data at the most granular level, ii) agnostic of the programming language kernel, iii) scalable, and iv) compliant with the HIPAA privacy law. In this paper, we review the existing literature on such big data systems for scientific research in the health sciences and identify the gaps in the current system landscape. We propose a novel architecture for a software-hardware-data ecosystem using open source technologies such as Apache Hadoop, Kubernetes and JupyterHub in a distributed environment. We also evaluate the system using a large clinical data set of 69M patients.
    Comment: This paper is accepted in ACM-BCB 202
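    As an illustration of the kind of kernel-agnostic, scalable analysis the proposed ecosystem targets, the sketch below runs a cohort query with PySpark over data on the Hadoop layer; the path and column names are invented for the example.

```python
# Sketch only: HDFS path and column names are invented; assumes PySpark is available.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("cohort-count").getOrCreate()

# Patient-level records on the Hadoop layer; behind JupyterHub any notebook kernel
# could run the same query.
patients = spark.read.parquet("hdfs:///warehouse/patients")

cohort = (patients
          .filter(F.col("age") >= 65)
          .groupBy("primary_diagnosis")
          .agg(F.countDistinct("patient_id").alias("n_patients"))
          .orderBy(F.col("n_patients").desc()))

cohort.show(20)
```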

    A context-aware multiple Blockchain architecture for managing low memory devices

    Blockchain technology constitutes a paradigm shift in the way we conceive distributed architectures. A Blockchain system lets us build platforms where data are immutable and tamper-proof, at the cost of constraints on throughput and on the amount of memory required to store the ledger. This paper addresses the memory and performance requirements by developing a multiple-Blockchain architecture that mixes the benefits of a public and a private Blockchain. This approach enables small sensors, which have memory and performance constraints, to join the network without worrying about the amount of data to store. The development follows a context-aware approach, making the architecture scalable and easy to use in different scenarios.
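    A toy sketch (not the paper's protocol) of how a memory-constrained sensor might participate: it keeps only a bounded window of anchor hashes and delegates ledger storage to a gateway that writes to the private chain and periodically anchors batches to the public one. The gateway endpoint and anchor policy are made up for the example.

```python
# Toy model, not the paper's protocol: gateway URL and anchor policy are invented.
import hashlib
import json
import time

class LightSensorClient:
    def __init__(self, gateway_url: str, max_anchors: int = 64):
        self.gateway_url = gateway_url      # private-chain gateway the sensor trusts
        self.max_anchors = max_anchors
        self.anchors: list[str] = []        # bounded list of hashes instead of a full ledger

    def submit(self, reading: dict) -> str:
        payload = json.dumps({"ts": time.time(), **reading}, sort_keys=True)
        digest = hashlib.sha256(payload.encode("utf-8")).hexdigest()
        # A real client would POST payload+digest to the gateway, which batches
        # transactions on the private chain and anchors batch roots to the public one.
        self.anchors.append(digest)
        self.anchors = self.anchors[-self.max_anchors:]
        return digest

client = LightSensorClient("https://gateway.example/api")
print(client.submit({"sensor": "temp-01", "celsius": 21.4}))
```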

    Resource Management and Scheduling for Big Data Applications in Cloud Computing Environments

    This chapter presents the software architectures of big data processing platforms. It provides in-depth knowledge of the resource management techniques involved in deploying big data processing systems in cloud environments. It starts from the very basics and gradually introduces the core components of resource management, which we have divided into multiple layers. It covers state-of-the-art practices and research in SLA-based resource management, with a specific focus on job scheduling mechanisms.
    Comment: 27 pages, 9 figures
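    As a minimal illustration of SLA-based job scheduling of the kind the chapter surveys, the sketch below admits jobs in earliest-deadline-first order while cluster capacity remains; real resource managers such as YARN or Kubernetes are of course far richer.

```python
# Minimal earliest-deadline-first admission, for illustration of SLA-aware scheduling.
import heapq
from dataclasses import dataclass, field

@dataclass(order=True)
class Job:
    deadline_s: float                                    # SLA deadline drives the ordering
    name: str = field(compare=False)
    demand_cores: int = field(compare=False, default=1)

def schedule(jobs: list[Job], free_cores: int) -> list[str]:
    """Admit jobs in deadline order while capacity remains; the rest wait or trigger scale-out."""
    heap = list(jobs)
    heapq.heapify(heap)
    placed = []
    while heap and free_cores > 0:
        job = heapq.heappop(heap)
        if job.demand_cores <= free_cores:
            free_cores -= job.demand_cores
            placed.append(job.name)
    return placed

print(schedule([Job(60, "etl", 4), Job(10, "dashboard", 2), Job(30, "train", 8)], free_cores=8))
```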