Rule based ETL (RETL) approach for GEO spatial data warehouse
This paper presents the use of Service Oriented Architecture (SOA) for integrating multi-source heterogeneous geospatial data in order to feed a geospatial data warehouse. In this study, the Rule Based ETL (RETL) concept is adapted in order to extract, transform and load data from a variety of heterogeneous data sources: the ETL process transforms the data into a common schematic format and loads it into the geospatial data warehouse. By using a rule-based technique, the distribution of parallel ETL pipelines becomes more efficient at large data scales and overcomes data bottlenecks and performance overhead. This eases disaster management and enables planners to monitor disaster emergency response in an efficient manner.
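The core RETL idea, mapping each heterogeneous source into a common warehouse schema via per-source rules, can be sketched as follows. The rule definitions, source names and field names below are illustrative assumptions; the paper's actual rules are not specified in the abstract.

```python
# Minimal sketch of a rule-based transform step in an ETL pipeline.
# Each rule maps a source-specific record into the common warehouse schema.
RULES = {
    "sensor_a": lambda r: {"lat": r["latitude"], "lon": r["longitude"], "value": r["reading"]},
    "sensor_b": lambda r: {"lat": r["y"], "lon": r["x"], "value": r["val"]},
}

def transform(source, record):
    """Apply the rule registered for `source`; unknown sources are rejected."""
    rule = RULES.get(source)
    if rule is None:
        raise ValueError(f"no rule for source {source!r}")
    return rule(record)

def etl(batches):
    """Extract-transform over (source, record) pairs; 'load' is a list append here."""
    warehouse = []
    for source, record in batches:
        warehouse.append(transform(source, record))
    return warehouse

rows = etl([
    ("sensor_a", {"latitude": 3.1, "longitude": 101.7, "reading": 42}),
    ("sensor_b", {"x": 100.5, "y": 5.4, "val": 7}),
])
print(rows)
```

Because each rule is independent, batches from different sources can be dispatched to parallel workers, which is the property the paper exploits to avoid data bottlenecks.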
LEAN DATA ENGINEERING. COMBINING STATE OF THE ART PRINCIPLES TO PROCESS DATA EFFICIENTLY
The present work was developed during an internship, under the Erasmus+ Traineeship program, at Fieldwork Robotics, a Cambridge-based company that develops robots to operate in agricultural fields. The company collects data from commercial greenhouses with sensors and RealSense cameras, as well as with gripper cameras mounted on the robotic arms. These data are recorded mainly in bag files, consisting of unstructured data, such as images, and semi-structured data, such as metadata about both the conditions under which the images were taken and the robot itself.
Data was uploaded, extracted, cleaned and labelled manually before being used to train Artificial Intelligence (AI) algorithms to identify raspberries during the harvesting process. The amount of available data grows quickly with every trip to the fields, creating an ever-growing need for an automated process.
This problem was addressed via the creation of a data engineering platform encompassing a data lake, a data warehouse and the required processing capabilities. This platform was created following a series of principles entitled Lean Data Engineering Principles (LDEP); the systems that follow them are called Lean Data Engineering Systems (LDES). These principles urge starting with the end in mind: process incoming batch or real-time data without wasting resources, limiting costs to the absolute minimum necessary to complete the job, in other words, being as lean as possible.
The LDEP principles are a combination of state-of-the-art ideas stemming from several fields, such as data engineering, software engineering and DevOps, with cloud technologies at their core.
The proposed custom-made solution enabled the company to scale its data operations, labelling images almost ten times faster while cutting over 99.9% of the associated costs in comparison to the previous process. In addition, the data lifecycle time has been reduced from weeks to hours while maintaining coherent data quality results, being able,
for instance, to correctly identify 94% of the labels in comparison to a human counterpart.
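The automated upload-extract-clean-label pipeline described above can be sketched as a minimal batch flow. The stage behaviour, record fields and the stand-in model below are illustrative assumptions, not the company's actual LDEP implementation, which runs on cloud infrastructure.

```python
# Sketch of an automated batch pipeline replacing the manual process.
def extract(bag_files):
    """Pull image/metadata records out of raw bag files (simulated here)."""
    return [rec for f in bag_files for rec in f["records"]]

def clean(records):
    """Drop records missing an image payload."""
    return [r for r in records if r.get("image") is not None]

def label(records, model):
    """Attach a model-predicted label to each record."""
    return [{**r, "label": model(r["image"])} for r in records]

# A stand-in "model" that labels any non-empty image as a raspberry.
toy_model = lambda img: "raspberry" if img else "unknown"

files = [{"records": [{"image": "img0"}, {"image": None}, {"image": "img1"}]}]
labelled = label(clean(extract(files)), toy_model)
print(len(labelled))  # 2
```

Chaining pure stage functions like this keeps each step independently testable and replaceable, which is one way to realize the "no resource wasting" goal of the LDEP principles.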
Automated credit assessment framework using ETL process and machine learning
In the current business scenario, real-time analysis of enterprise data through Business Intelligence (BI) is crucial for supporting operational activities and making strategic decisions. An automated ETL (extraction, transformation, and load) process ensures data ingestion into the data warehouse in near real-time, and insights are generated through the BI process based on real-time data. In this paper, we concentrate on automated credit risk assessment in the financial domain based on a machine learning approach. Machine learning-based classification techniques can furnish a self-regulating process to categorize data. Establishing an automated credit decision-making system helps lending institutions manage risk, increase operational efficiency and comply with regulators. An empirical approach is taken for credit risk assessment using logistic regression and neural network classification methods in compliance with Basel II standards; here, the Basel II standards are adopted to calculate the expected loss. The data integration required for building the machine learning models is done through an automated ETL process. We conclude this research work by evaluating the new methodology for credit risk assessment.
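The Basel II expected-loss calculation the paper relies on is the product of probability of default (PD), loss given default (LGD) and exposure at default (EAD), with PD typically supplied by a classifier such as logistic regression. A hedged sketch follows; the weights, bias and feature values are illustrative, not fitted parameters from the paper.

```python
import math

def logistic_pd(features, weights, bias):
    """Probability of default from a (pre-trained) logistic regression model."""
    z = bias + sum(w * x for w, x in zip(weights, features))
    return 1.0 / (1.0 + math.exp(-z))

def expected_loss(pd, lgd, ead):
    """Basel II expected loss: EL = PD * LGD * EAD."""
    return pd * lgd * ead

# Illustrative borrower features and model parameters (assumptions).
pd = logistic_pd([0.4, 1.2], weights=[0.8, -0.5], bias=-1.0)
el = expected_loss(pd, lgd=0.45, ead=10_000.0)
print(f"PD={pd:.4f}, EL={el:.2f}")
```

In the paper's framework the feature vectors would arrive via the automated ETL process, and a neural network could replace `logistic_pd` without changing the expected-loss step.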
An integrated personalization framework for SaaS-based cloud services
Software as a Service (SaaS) has recently emerged as one of the most popular service delivery models in cloud computing. The number of SaaS services and their users is continuously increasing, and new SaaS service providers emerge on a regular basis. As users are exposed to a wide range of SaaS services, they may soon become more demanding when receiving/consuming such services. As with web and/or mobile applications, personalization can play a critical role in modern SaaS-based cloud services. This paper introduces a fully designed, cloud-enabled personalization framework to facilitate the collection of preferences and the delivery of corresponding SaaS services. The approach we adopt in the design and development of the proposed framework is to synthesize various models and techniques in a novel way. The objective is to provide an integrated and structured environment wherein SaaS services can be provisioned with enhanced personalization quality and performance.
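The preference-collection side of such a framework can be sketched as a small store that resolves per-user preferences against service defaults. The class and method names below are assumptions for illustration; the abstract does not specify the framework at this level of detail.

```python
# Sketch of a preference store for SaaS personalization (names are assumed).
class PreferenceStore:
    """Collects per-user preferences and resolves them against service defaults."""

    def __init__(self, defaults):
        self.defaults = dict(defaults)
        self._prefs = {}  # user_id -> {key: value}

    def set_preference(self, user_id, key, value):
        self._prefs.setdefault(user_id, {})[key] = value

    def resolve(self, user_id):
        """Effective service configuration = defaults overridden by user preferences."""
        return {**self.defaults, **self._prefs.get(user_id, {})}

store = PreferenceStore({"theme": "light", "locale": "en"})
store.set_preference("u1", "theme", "dark")
print(store.resolve("u1"))  # {'theme': 'dark', 'locale': 'en'}
print(store.resolve("u2"))  # {'theme': 'light', 'locale': 'en'}
```

Keeping defaults and overrides separate lets the provider evolve service-wide defaults without touching stored user preferences.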
Cost-Based Optimization of Integration Flows
Integration flows are increasingly used to specify and execute data-intensive integration tasks between heterogeneous systems and applications. There are many application areas, such as real-time ETL and data synchronization between operational systems. Due to the increasing amount of data, highly distributed IT infrastructures, and high requirements for data consistency and up-to-date query results, many instances of integration flows are executed over time. Because of this high load and the blocking of synchronous source systems, the performance of the central integration platform is crucial for an IT infrastructure. To meet these high performance requirements, we introduce the concept of cost-based optimization of imperative integration flows, which relies on incremental statistics maintenance and inter-instance plan re-optimization. As a foundation, we introduce the concept of periodical re-optimization, including novel cost-based optimization techniques that are tailor-made for integration flows. Furthermore, we refine periodical re-optimization into on-demand re-optimization in order to overcome the problems of many unnecessary re-optimization steps and of adaptation delays, during which we miss optimization opportunities. This approach ensures low optimization overhead and fast workload adaptation.
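The combination of incremental statistics maintenance with on-demand re-optimization can be sketched as follows. The moving-average update, the cheapest-first plan choice and the 20% drift trigger are stand-in assumptions; the paper's cost model and trigger condition are richer.

```python
# Sketch of on-demand re-optimization driven by incremental statistics.
class FlowOptimizer:
    def __init__(self, operators, threshold=0.2):
        self.operators = operators          # op name -> cost assumed at last optimization
        self.threshold = threshold          # relative drift that triggers re-optimization
        self.stats = dict(operators)        # incrementally maintained observed costs
        self.plan = self._optimize()

    def _optimize(self):
        # Cheapest-first ordering as a stand-in for a full cost-based plan choice.
        return sorted(self.operators, key=self.stats.get)

    def record_cost(self, op, observed, alpha=0.5):
        """Incremental statistics maintenance via an exponential moving average."""
        self.stats[op] = (1 - alpha) * self.stats[op] + alpha * observed
        # On-demand re-optimization: re-plan only when stats drift past the threshold,
        # avoiding the unnecessary steps of purely periodical re-optimization.
        drift = abs(self.stats[op] - self.operators[op]) / self.operators[op]
        if drift > self.threshold:
            self.operators[op] = self.stats[op]
            self.plan = self._optimize()

opt = FlowOptimizer({"lookup": 5.0, "filter": 1.0, "join": 8.0})
print(opt.plan)                 # ['filter', 'lookup', 'join']
opt.record_cost("lookup", 20.0) # lookup turned out far more expensive than assumed
print(opt.plan)                 # ['filter', 'join', 'lookup']
```

Because many flow instances share one plan, a single drift-triggered re-optimization amortizes across all subsequent instances, which is what keeps the optimization overhead low.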
Big Data Now, 2015 Edition
Now in its fifth year, O’Reilly’s annual Big Data Now report recaps the trends, tools, applications, and forecasts we’ve talked about over the past year. For 2015, we’ve included a collection of blog posts, authored by leading thinkers and experts in the field, that reflect a unique set of themes we’ve identified as gaining significant attention and traction.
Our list of 2015 topics includes:
Data-driven cultures
Data science
Data pipelines
Big data architecture and infrastructure
The Internet of Things and real time
Applications of big data
Security, ethics, and governance
Is your organization on the right track? Get a hold of this free report now and stay in tune with the latest significant developments in big data