74 research outputs found

Analyzing performance of Apache Tez and MapReduce with hadoop multinode cluster on Amazon cloud

Author: A Thusoo
AF Gates
Publication venue: 'Springer Science and Business Media LLC'

Data Integration over NoSQL Stores Using Access Path Based Mappings

Author: A. Thusoo
C. Olston
E. Rahm
J. Dean
M. Kifer
Publication venue: 'Springer Science and Business Media LLC'
Publication date: 01/01/2011

Due to the large amount of data generated by user interactions on the Web, some companies are currently innovating in the domain of data management by designing their own systems. Many of them are referred to as NoSQL databases, standing for 'Not only SQL'. With their wide adoption, new needs will emerge, and data integration will certainly be one of them. In this paper, we adapt a framework used for the integration of relational data to a broader context where both NoSQL and relational databases can be integrated. One important extension consists in the efficient answering of queries expressed over these data sources. The highly denormalized nature of NoSQL databases results in varying performance costs for the several possible query translations. Thus, a data integration system targeting NoSQL databases needs to generate an optimized translation for a given query. Our contributions are to propose (i) an access path based mapping solution that takes advantage of the design choices of each data source, (ii) the integration of preferences to handle conflicts between sources, and (iii) a query language that bridges the gap between the SQL query expressed by the user and the query language of the data sources. We also present a prototype implementation, where the target schema is represented as a set of relations and which enables the integration of two of the most popular NoSQL database models, namely document and column family stores.

Crossref

Hal-Diderot

HAL-Ecole des Ponts ParisTech

HAL - UPEC / UPEM

Big Data Analysis

Author: A Acquisti
A Thusoo
B Glavic
C-C Lee
D Bollier
JG Koomey
M Li
M Strohbach
N Marz
Publication venue: 'Springer Science and Business Media LLC'
Publication date: 01/01/2016

The value of big data is predicated on the ability to detect trends and patterns and, more generally, to make sense of large volumes of data that often comprise a heterogeneous mix of formats, structures, and semantics. Big data analysis is the component of the big data value chain that focuses on transforming raw acquired data into a coherent, usable resource suitable for analysis. Using a range of interviews with key stakeholders in small and large companies and academia, this chapter outlines key insights, state of the art, emerging trends, future requirements, and sectorial case studies for data analysis.

OAPEN Library

TUbiblio

Crossref

Springer - Publisher Connector

Open Research Online (The Open University)

Queensland University of Technology ePrints Archive

DI-fusion

Directory of Open Access Books (DOAB)

White Rose Research Online

Just-In-Time Data Distribution for Analytical Query Processing

Author: A. Thusoo
C. Plattner
D. Jiang
D. Kossmann
M. Ivanova
Publication venue: 'Springer Science and Business Media LLC'
Publication date: 01/01/2012

Distributed processing commonly requires data to be spread across machines using a priori static or hash-based data allocation. In this paper, we explore an alternative approach that starts from a master node in control of the complete database and a variable number of worker nodes for delegated query processing. Data is shipped just-in-time to the worker nodes using a need-to-know policy and is reused, if possible, in subsequent queries. A bidding mechanism among the workers yields a schedule with the most efficient reuse of previously shipped data, minimizing the data transfer costs. Just-in-time data shipment allows our system to benefit from locally available idle resources to boost overall performance. The system is maintenance-free and allocation is fully transparent to users. Our experiments show that the proposed adaptive distributed architecture is a viable and flexible alternative for small-scale MapReduce-type settings.
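The bidding idea in this abstract can be illustrated with a minimal sketch. It assumes, purely for illustration, that each worker bids the number of bytes it would still need to receive and that the master picks the lowest bid; the names `Worker`, `bid`, and `schedule` and the table sizes are hypothetical and are not taken from the paper, which may also weigh idle resources and other factors.

```python
from dataclasses import dataclass, field


@dataclass
class Worker:
    """A worker node that remembers which tables were already shipped to it."""
    name: str
    cached: set = field(default_factory=set)

    def bid(self, needed, sizes):
        # The bid is the transfer cost: bytes of needed tables not yet cached.
        return sum(sizes[t] for t in needed if t not in self.cached)


def schedule(workers, needed, sizes):
    """Assign the query to the lowest bidder and ship the missing tables."""
    winner = min(workers, key=lambda w: w.bid(needed, sizes))
    cost = winner.bid(needed, sizes)
    winner.cached |= set(needed)  # shipped data stays available for reuse
    return winner.name, cost


# Hypothetical table sizes (arbitrary units) and two initially empty workers.
sizes = {"orders": 500, "customers": 200, "items": 300}
workers = [Worker("w1"), Worker("w2")]

first = schedule(workers, {"orders", "customers"}, sizes)
second = schedule(workers, {"orders", "items"}, sizes)
```

In this toy run the second query goes to the worker that already holds `orders`, so only `items` has to be shipped. Ties are broken by list order here, which is an arbitrary choice of the sketch.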

Crossref

CWI's Institutional Repository

International Migration, Integration and Social Cohesion online publications

Supply chain hybrid simulation: From Big Data to distributions and approaches comparison

Author: Antonio A. C. Vieira
Blanco
Cha-Ume
Chen
Cheng
Costa
Di Tria
Dias
Elmasri
Finke
Fornasiero
Golfarelli
Goss
Grover
Guilherme A. B. Pereira
Jahangirian
José A. Oliveira
Kagermann
Kırılmaz
Kv
Lasi
Lee
Longo
Luís M. S. Dias
Macchion
Madden
Maribel Y. Santos
Masoud
Mishra
Mohanty
Nodarakis
Pires
Ponte
Sahoo
Santos
Schmitt
Schwede
Simchi-Levi
Thun
Thusoo
Tiwari
Vieira
Zhong
Zikopoulos
Publication venue: 'Elsevier BV'
Publication date: 01/01/2019

The uncertainty and variability of Supply Chains pave the way for simulation to be employed to mitigate such risks. Due to the amounts of data generated by the systems used to manage relevant Supply Chain processes, it is widely recognized that Big Data technologies may bring benefits to Supply Chain simulation models. Nevertheless, a simulation model should also consider statistical distributions, which allow it to be used for purposes such as testing risk scenarios or for prediction. However, when Supply Chains are complex and of huge scale, performing distribution fitting may not be feasible, which often results in users focusing on subsets of problems or selecting samples of elements, such as suppliers or materials. This paper proposes a hybrid simulation model that runs using data stored in a Big Data Warehouse, statistical distributions, or a combination of both approaches. The results show that the former approach brings benefits to the simulations and is essential when setting the model to run based on statistical distributions. Furthermore, this paper also compares these approaches, emphasizing the pros and cons of each, as well as their differences in computational requirements, hence establishing a milestone for future research in this domain. This work has been supported by national funds through FCT - Fundacao para a Ciencia e Tecnologia within the Project Scope: UID/CEC/00319/2019 and by the Doctoral scholarship PDE/BDE/114566/2016 funded by FCT, the Portuguese Ministry of Science, Technology and Higher Education, through national funds, and co-financed by the European Social Fund (ESF) through the Operational Programme for Human Capital (POCH).

Universidade do Minho: RepositoriUM

Crossref

Challenges in managing real-time data in health information system (HIS)

Author: A Thusoo
D Apiletti
J Dean
K Kaur
K Rabbi
L-C Huang
M Hussain
N Peek
P Gorp Van
R Cattell
W Raghupathi
W-S Jian
Publication venue: 'Springer Science and Business Media LLC'
Publication date: 01/01/2016

© Springer International Publishing Switzerland 2016. In this paper, we discuss the challenges in handling real-time medical big data collection and storage in health information systems (HIS). Based on these challenges, we propose a model for real-time analysis of medical big data. We exemplify the approach through Spark Streaming and Apache Kafka, using the processing of a health big data stream. Apache Kafka works very well in transporting data among different systems such as relational databases, Apache Hadoop, and non-relational databases. However, Apache Kafka cannot analyze the stream itself, whereas the Spark Streaming framework can perform some operations on the stream. We identify the challenges in current real-time systems and propose our solution to cope with medical big data streams.
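The division of labour described here (a transport layer delivers records, a streaming layer computes on them in small batches) can be sketched in plain Python. This is only a stand-in under stated assumptions: it does not use the Kafka or Spark APIs, and the record layout `(timestamp, patient, value)`, the function names, and the sample readings are all hypothetical.

```python
from collections import defaultdict


def micro_batches(records, batch_seconds):
    """Group (timestamp, patient, value) records into fixed time windows."""
    batches = defaultdict(list)
    for ts, patient, value in records:
        batches[ts // batch_seconds].append((patient, value))
    return [batches[k] for k in sorted(batches)]


def mean_per_patient(batch):
    """Compute the mean reading per patient within one micro-batch."""
    sums = defaultdict(lambda: [0.0, 0])
    for patient, value in batch:
        sums[patient][0] += value
        sums[patient][1] += 1
    return {p: total / count for p, (total, count) in sums.items()}


# Hypothetical heart-rate readings that a transport layer might deliver.
records = [
    (0, "p1", 70), (1, "p2", 80), (2, "p1", 74),
    (5, "p1", 90), (6, "p2", 82),
]
results = [mean_per_patient(b) for b in micro_batches(records, 5)]
```

With a five-second window the readings fall into two batches, and each batch is summarized independently, which mirrors the micro-batch style of stream processing in a very reduced form.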

ZU Scholars (Zayed University)

Crossref

On the use of simulation as a Big Data semantic validator for supply chain management

Author: António AC Vieira
Bokrantz
Bottani
Correia
Costa
Dias
Domingos
Ehmke
Ghadge
Goss
Grover
Guilherme AB Pereira
Gupta
Jahangirian
José A Oliveira
Kagermann
Kaisler
Kugu
Kv
Laranjeiro
Lasi
Li
Luís MS Dias
Madden
Maribel Y Santos
Masoud
Mohanty
Nageshwaraniyer
Pires
Popovics
Rabe
Santos
Sargent
Schmidt
Simchi-Levi
Thun
Thusoo
Tiwari
Truong
Vieira
Wang
Zhong
Zikopoulos
Publication venue: 'Elsevier BV'
Publication date: 01/01/2020

Simulation stands out as an appropriate method for the Supply Chain Management (SCM) field. Nevertheless, to produce accurate simulations of Supply Chains (SCs), several business processes must be considered. Thus, when using real data in these simulation models, Big Data concepts and technologies become necessary, as the involved data sources generate data at increasing volume, velocity and variety, in what is known as a Big Data context. While developing such a solution, several data issues were found, with simulation proving to be more efficient than traditional data profiling techniques in identifying them. Thus, this paper proposes the use of simulation as a semantic validator of the data, proposes a classification for such issues, and quantifies their impact on the volume of data used in the final solution. This paper concludes that, while SC simulations using Big Data concepts and technologies are within the grasp of organizations, their data models still require considerable improvements in order to produce perfect mimics of their SCs. In fact, it was also found that simulation can help in identifying and bypassing some of these issues. This work has been supported by FCT (Fundacao para a Ciencia e Tecnologia) within the Project Scope: UID/CEC/00319/2019 and by the Doctoral scholarship PDE/BDE/114566/2016 funded by FCT, the Portuguese Ministry of Science, Technology and Higher Education, through national funds, and co-financed by the European Social Fund (ESF) through the Operational Programme for Human Capital (POCH).

Universidade do Minho: RepositoriUM

Crossref

An insight into imbalanced Big Data classification: outcomes and challenges

Author: A Fernández
A Thusoo
B Krawczyk
C Bunkhumpornpat
CP Chen
D Lyubimov
E Elsebakhi
E Ramentol
F Hu
G Haixiang
GEAPA Batista
GM Weiss
H He
H Yu
I Triguero
J Alcalá-Fdez
J Dean
J Huang
J Li
JA Sáez
JM Tomczak
K Kambatla
L Rokach
M Galar
M Wasikowski
NV Chawla
PC Zikopoulos
R Baeza-Yates
R Barandela
R Blagus
RC Prati
S Alshomrani
S Barua
S Elhag
S Kamal
S Owen
S Río
S-H Park
T Jo
T White
V García
V López
X Meng
X Wu
Y Guo
Y Sun
Y-S Chen
Publication venue: 'Springer Science and Business Media LLC'
Publication date: 01/01/2017

Big Data applications have been emerging in recent years, and researchers from many disciplines are aware of the great advantages of extracting knowledge from this type of problem. However, traditional learning approaches cannot be directly applied due to scalability issues. To overcome this issue, the MapReduce framework has arisen as a “de facto” solution. Basically, it carries out a “divide-and-conquer” distributed procedure in a fault-tolerant way suited to commodity hardware. As this is still a recent discipline, little research has been conducted on imbalanced classification for Big Data. The reasons behind this are mainly the difficulties in adapting standard techniques to the MapReduce programming style. Additionally, intrinsic problems of imbalanced data, namely lack of data and small disjuncts, are accentuated when the data is partitioned to fit the MapReduce programming style. This paper rests on three main pillars. First, to present the first outcomes for imbalanced classification in Big Data problems, introducing the current research state of this area. Second, to analyze the behavior of standard pre-processing techniques in this particular framework. Finally, taking into account the experimental results obtained throughout this work, we carry out a discussion on the challenges and future directions for the topic. This work has been partially supported by the Spanish Ministry of Science and Technology under Projects TIN2014-57251-P and TIN2015-68454-R, the Andalusian Research Plan P11-TIC-7765, the Foundation BBVA Project 75/2016 BigDaPTOOLS, and the National Science Foundation (NSF) Grant IIS-1447795.
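The point about partitioning accentuating the lack of minority data can be made concrete with a small sketch. It assumes a made-up dataset of 1,000 labels with 10 minority instances and a round-robin split into 8 partitions; the map and reduce steps are ordinary Python functions standing in for a MapReduce job, not the API of any framework, and the numbers are illustrative rather than taken from the paper.

```python
from collections import Counter
from functools import reduce


def map_partition(labels):
    """Map step: each partition counts its own class labels independently."""
    return Counter(labels)


def reduce_counts(a, b):
    """Reduce step: merge two partial class counts."""
    return a + b


# Hypothetical imbalanced dataset: 10 minority (1) vs 990 majority (0) labels.
labels = [1] * 10 + [0] * 990
num_partitions = 8

# Round-robin split, so every partition holds 125 instances.
partitions = [labels[i::num_partitions] for i in range(num_partitions)]

per_partition = [map_partition(p) for p in partitions]
total = reduce(reduce_counts, per_partition)
minority_per_partition = [c[1] for c in per_partition]
```

Globally there are 10 minority instances, but no single partition sees more than two of them. A learner or pre-processing step running inside one map task therefore works with even scarcer minority data than the full dataset suggests.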

Crossref

LAReferencia - Red Federada de Repositorios Institucionales de Publicaciones Científicas Latinoamericanas

Springer - Publisher Connector

Repositorio Institucional Universidad de Granada