467 research outputs found

    BigDimETL with NoSQL Database

    Get PDF
    In the last decade, we have witnessed an explosion in the volume of data available on the Web, driven by rapid technological advances, the availability of smart devices, and social networks such as Twitter, Facebook, and Instagram. The concept of Big Data emerged to face this constant increase. Many domains must take this growth of data into consideration, especially Business Intelligence (BI), which is full of knowledge that is crucial for effective decision making. At the same time, new problems and challenges have appeared that Decision Support Systems must address. Accordingly, the purpose of this paper is to adapt Extract-Transform-Load (ETL) processes to Big Data technologies in order to support decision making and knowledge discovery. We propose a new approach called Big Dimensional ETL (BigDimETL) that deals with the ETL development process while taking the multidimensional structure into account. In addition, to accelerate data handling, we use the MapReduce paradigm and HBase as a distributed storage mechanism that provides data warehousing capabilities. Experimental results show that our adaptation of ETL operations performs well, especially for the join operation.
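    As a rough illustration of the MapReduce-style join the abstract highlights, the sketch below implements a reduce-side join between a dimension table and a fact table in plain Python. The table names, fields, and the in-memory map/shuffle/reduce simulation are illustrative assumptions, not the BigDimETL implementation.

```python
from collections import defaultdict

# Reduce-side join in the MapReduce style: records from two sources are
# tagged in the map phase, grouped by join key in the shuffle, and combined
# in the reduce phase. Table and column names are illustrative only.

def map_phase(source_tag, records, key_field):
    """Emit (join_key, (source_tag, record)) pairs."""
    for rec in records:
        yield rec[key_field], (source_tag, rec)

def shuffle(mapped_pairs):
    """Group mapped pairs by join key, as the framework's shuffle would."""
    groups = defaultdict(list)
    for key, tagged in mapped_pairs:
        groups[key].append(tagged)
    return groups

def reduce_phase(groups):
    """Join fact rows with their matching dimension row for each key."""
    for _, tagged_records in groups.items():
        dims = [r for tag, r in tagged_records if tag == "dim"]
        facts = [r for tag, r in tagged_records if tag == "fact"]
        for d in dims:
            for f in facts:
                yield {**d, **f}

if __name__ == "__main__":
    dim_customer = [{"cust_id": 1, "country": "FR"}, {"cust_id": 2, "country": "TN"}]
    fact_sales = [{"cust_id": 1, "amount": 40.0}, {"cust_id": 1, "amount": 15.5}]
    pairs = list(map_phase("dim", dim_customer, "cust_id"))
    pairs += list(map_phase("fact", fact_sales, "cust_id"))
    for row in reduce_phase(shuffle(pairs)):
        print(row)
```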

    The potential of semantic paradigm in warehousing of big data

    Get PDF
    Big data has analytical potential that was hard to realize with previously available technologies. After new storage paradigms intended for big data, such as NoSQL databases, emerged, traditional systems were pushed out of focus. Current research is focused on reconciling them at different levels or on replacing one paradigm with the other. Similarly, the emergence of NoSQL databases has started to push traditional (relational) data warehouses out of research and even practical focus. Data warehousing is known for its strict modelling process, capturing the essence of the business processes. For that reason, a mere integration to bridge the NoSQL gap is not enough; it is necessary to deal with this issue at a higher abstraction level, during the modelling phase. NoSQL databases generally lack a clear, unambiguous schema, which makes comprehending their contents difficult and their integration and analysis harder. This motivates involving semantic web technologies to enrich NoSQL database contents with additional meaning and context. This paper reviews the application of semantics in data integration and data warehousing and analyses its potential for integrating NoSQL data and traditional data warehouses, with some focus on document stores. It also proposes future research directions for the modelling phases of big data warehouses.
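    A minimal sketch of the kind of semantic enrichment the abstract discusses: annotating a schema-less document-store record with explicit classes and properties as RDF triples. It assumes the rdflib library and an illustrative example vocabulary; the paper's actual mapping rules and ontology are not specified in the abstract.

```python
from rdflib import Graph, Literal, Namespace, URIRef
from rdflib.namespace import RDF

# Illustrative vocabulary and document; not the paper's ontology.
EX = Namespace("http://example.org/schema/")

def document_to_triples(doc: dict, doc_class: URIRef, graph: Graph) -> None:
    """Annotate one schema-less JSON document with explicit RDF semantics."""
    subject = URIRef(f"http://example.org/resource/{doc['_id']}")
    graph.add((subject, RDF.type, doc_class))        # give the record a class
    for field, value in doc.items():
        if field == "_id":
            continue
        graph.add((subject, EX[field], Literal(value)))  # field -> property

if __name__ == "__main__":
    g = Graph()
    g.bind("ex", EX)
    order = {"_id": "o-17", "customer": "ACME", "total": 129.90, "status": "shipped"}
    document_to_triples(order, EX.Order, g)
    print(g.serialize(format="turtle"))
```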

    Big Data Mining and Semantic Technologies: Challenges and Opportunities

    Get PDF
    Big data, a term coined due to the explosion in the quantity and diversity of high-frequency digital data with the potential for valuable insights, has drawn considerable attention in research and development. Converting big data into actionable insights requires an in-depth understanding of big data, its characteristics, its challenges, and current technological trends. The rise of big data is changing existing data storage, management, processing, and analytical mechanisms and is leading to new architectures and ecosystems for handling big data applications. This paper covers the findings of our research study on big data characteristics, the various types of analysis associated with big data, and the basic big data types. First, we present the study from a data mining and analysis perspective and discuss the challenges; next, we present the results of our study on the meaningful use of big data in the context of semantic technologies. Moreover, we discuss several case studies related to social media analysis and recent development trends to identify potential research directions for big data with semantic technologies. DOI: 10.17762/ijritcc2321-8169.150711

    On-Demand Big Data Integration: A Hybrid ETL Approach for Reproducible Scientific Research

    Full text link
    Scientific research requires access to, analysis of, and sharing of data that is distributed across various heterogeneous data sources at the scale of the Internet. An eager ETL process constructs an integrated data repository as its first step, integrating and loading data in its entirety from the data sources. Bootstrapping this process is not efficient for scientific research that requires access to data from very large and typically numerous distributed data sources. A lazy ETL process, in contrast, loads only the metadata, but it still does so eagerly. Lazy ETL is faster to bootstrap; however, queries on the integrated data repository of eager ETL perform faster, because the entire data is available beforehand. In this paper, we propose a novel ETL approach for scientific data integration as a hybrid of the eager and lazy ETL approaches, applied to both data and metadata. In this way, hybrid ETL supports incremental integration and loading of metadata and data from the data sources. We incorporate a human-in-the-loop approach to enhance the hybrid ETL, with selective data integration driven by user queries and sharing of integrated data between users. We implement our hybrid ETL approach in a prototype platform, Obidos, and evaluate it in the context of data sharing for medical research. Obidos outperforms both the eager and lazy ETL approaches for scientific research data integration and sharing through its selective loading of data and metadata, while storing the integrated data in a scalable integrated data repository. Comment: Pre-print submitted to the DMAH Special Issue of the Springer DAPD Journal.
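    The sketch below illustrates the hybrid idea described above: metadata is integrated eagerly at bootstrap, while data is pulled and cached only when a user query touches it. Class and method names are illustrative assumptions, not the Obidos API.

```python
from typing import Callable, Dict, List

class HybridETL:
    """Hybrid eager/lazy ETL sketch: metadata is loaded eagerly at bootstrap,
    data is fetched and cached on demand, driven by user queries."""

    def __init__(self, sources: Dict[str, Callable[[str], List[dict]]]):
        self.sources = sources                      # source name -> fetch(dataset)
        self.catalog: Dict[str, List[str]] = {}     # eagerly loaded metadata
        self.cache: Dict[tuple, List[dict]] = {}    # lazily integrated data

    def bootstrap(self, list_datasets: Callable[[str], List[str]]) -> None:
        """Eager step: load only metadata (dataset listings) from every source."""
        for name in self.sources:
            self.catalog[name] = list_datasets(name)

    def query(self, source: str, dataset: str) -> List[dict]:
        """Lazy step: integrate a dataset the first time a query needs it."""
        key = (source, dataset)
        if key not in self.cache:                   # selective, query-driven load
            self.cache[key] = self.sources[source](dataset)
        return self.cache[key]

if __name__ == "__main__":
    fake_store = {"imaging": {"scans-2018": [{"id": 1}, {"id": 2}]}}
    etl = HybridETL({"imaging": lambda ds: fake_store["imaging"][ds]})
    etl.bootstrap(lambda src: list(fake_store[src].keys()))
    print(etl.catalog)                       # metadata available immediately
    print(etl.query("imaging", "scans-2018"))  # data loaded on first access
```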

    A unified view of data-intensive flows in business intelligence systems: a survey

    Get PDF
    Data-intensive flows are central processes in today's business intelligence (BI) systems, deploying different technologies to deliver data, from a multitude of data sources, in user-preferred and analysis-ready formats. To meet the complex requirements of next-generation BI systems, we often need an effective combination of the traditionally batched extract-transform-load (ETL) processes that populate a data warehouse (DW) from integrated data sources and more real-time, operational data flows that integrate source data at runtime. Both academia and industry thus need a clear understanding of the foundations of data-intensive flows and of the challenges of moving towards next-generation BI environments. In this paper we present a survey of today's research on data-intensive flows and the related fundamental fields of database theory. The study is based on a proposed set of dimensions describing the important challenges of data-intensive flows in the next-generation BI setting. As a result of this survey, we envision an architecture of a system for managing the lifecycle of data-intensive flows. The results further provide a comprehensive understanding of data-intensive flows, recognizing the challenges that still need to be addressed and showing how current solutions can be applied to address them.

    Evaluation of Hadoop/MapReduce Framework Migration Tools

    Get PDF
    In distributed systems, database migration is not an easy task. Companies encounter challenges when moving data, including legacy data, to big data platforms. This paper reviews several tools for migrating from traditional databases to a big data platform and, based on this review, suggests a migration model.

    An approach for testing the extract-transform-load process in data warehouse systems

    Get PDF
    Enterprises use data warehouses to accumulate data from multiple sources for data analysis and research. Since organizational decisions are often made based on the data stored in a data warehouse, all its components must be rigorously tested. In this thesis, we first present a comprehensive survey of data warehouse testing approaches, and then develop and evaluate an automated testing approach for validating the Extract-Transform-Load (ETL) process, which is a common activity in data warehousing. In the survey we present a classification framework that categorizes the testing and evaluation activities applied to the different components of data warehouses. These approaches include dynamic analysis as well as static evaluation and manual inspections. The classification framework uses information related to what is tested, in terms of the data warehouse component that is validated, and how it is tested, in terms of the various types of testing and evaluation approaches. We discuss the specific challenges and open problems for each component and propose research directions. The ETL process involves extracting data from source databases, transforming it into a form suitable for research and analysis, and loading it into a data warehouse. ETL processes can use complex one-to-one, many-to-one, and many-to-many transformations involving sources and targets that use different schemas, databases, and technologies. Since faulty implementations in any of the ETL steps can result in incorrect information in the target data warehouse, ETL processes must be thoroughly validated. In this thesis, we propose automated balancing tests that check for discrepancies between the data in the source databases and that in the target warehouse. Balancing tests ensure that the data obtained from the source databases is not lost or incorrectly modified by the ETL process. First, we categorize and define a set of properties to be checked in balancing tests. We identify the various types of discrepancies that may exist between the source and the target data, and formalize three categories of properties, namely completeness, consistency, and syntactic validity, that must be checked during testing. Next, we automatically identify source-to-target mappings from the ETL transformation rules provided in the specifications. We identify one-to-one, many-to-one, and many-to-many mappings for the tables, records, and attributes involved in the ETL transformations. We then automatically generate test assertions to verify the properties for balancing tests, using the source-to-target mappings to generate assertions corresponding to each property. The assertions compare the data in the target data warehouse with the corresponding data in the sources to verify the properties. We evaluate our approach on a health data warehouse that uses data sources with different data models running on different platforms, and demonstrate that it can find previously undetected real faults in the ETL implementation. We also provide an automatic mutation testing approach to evaluate the fault-finding ability of our balancing tests. Using mutation analysis, we demonstrate that our auto-generated assertions can detect faults in the data inside the target data warehouse when faulty ETL scripts execute on mock source data.
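    A minimal sketch of the kind of balancing assertion the thesis describes, checking completeness (record counts preserved) and consistency (a numeric measure preserved in aggregate) between a source table and a target warehouse table. It uses sqlite3 with illustrative table names; the thesis's automated mapping extraction and assertion generation are not reproduced here.

```python
import sqlite3

# Balancing-test sketch: the assertions compare source data with the data
# loaded into the target warehouse. Schema and names are illustrative.

def assert_completeness(src, tgt, src_table, tgt_table):
    """Completeness: no records lost or spuriously added by the ETL step."""
    n_src = src.execute(f"SELECT COUNT(*) FROM {src_table}").fetchone()[0]
    n_tgt = tgt.execute(f"SELECT COUNT(*) FROM {tgt_table}").fetchone()[0]
    assert n_src == n_tgt, f"completeness violated: {n_src} source vs {n_tgt} target rows"

def assert_consistency(src, tgt, src_table, tgt_table, column):
    """Consistency: a numeric attribute is not altered in aggregate."""
    s = src.execute(f"SELECT SUM({column}) FROM {src_table}").fetchone()[0]
    t = tgt.execute(f"SELECT SUM({column}) FROM {tgt_table}").fetchone()[0]
    assert s == t, f"consistency violated: sum({column}) {s} vs {t}"

if __name__ == "__main__":
    src = sqlite3.connect(":memory:")
    tgt = sqlite3.connect(":memory:")
    src.execute("CREATE TABLE visit (id INTEGER, charge REAL)")
    src.executemany("INSERT INTO visit VALUES (?, ?)", [(1, 100.0), (2, 250.0)])
    # A trivial stand-in for the ETL step under test.
    tgt.execute("CREATE TABLE fact_visit (visit_id INTEGER, charge REAL)")
    tgt.executemany("INSERT INTO fact_visit VALUES (?, ?)", [(1, 100.0), (2, 250.0)])
    assert_completeness(src, tgt, "visit", "fact_visit")
    assert_consistency(src, tgt, "visit", "fact_visit", "charge")
    print("balancing assertions passed")
```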