4,105 research outputs found
A bi-objective cost model for optimizing database queries in a multi-cloud environment
Cost models are broadly used in query processing to drive the query optimization process, accurately predict query execution time, schedule database query tasks, apply admission control, and derive resource requirements, to name a few applications. The main role of cost models is to estimate the time needed to run a query on a specific machine. In a multi-cloud environment, cost models should be easily calibrated for a wide range of different physical machines, and time estimates need to be complemented with monetary cost information, since both the economic cost and the performance are of primary importance. This work aims to serve as the first proposal for a bi-objective query cost model suitable for queries executed over resources provided by potentially multiple cloud providers. We leverage existing calibration-based modeling techniques for time estimates and we couple such estimates with monetary cost information covering the main charging options for using cloud resources. Moreover, we explain how the cost model can become part of an optimizer. Our approach is applicable to more generic data flow graphs, the execution plans of which do not necessarily comprise relational operators. Finally, we give a concrete example of the usage of our proposal and we validate its accuracy through real case studies.
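As an illustration of what a bi-objective (time, money) comparison of plans can look like, here is a minimal sketch. It is not the model from the paper: the `PlanEstimate` class, the per-hour charging scheme, and all numbers are hypothetical, and per-hour pricing is only one of the charging options a real model would need to cover.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PlanEstimate:
    """Hypothetical estimate for one execution plan on one machine type."""
    name: str
    seconds: float         # estimated execution time
    price_per_hour: float  # assumed hourly price of the machine

    @property
    def money(self) -> float:
        # Assumption: the provider charges proportionally to running time.
        return self.seconds / 3600.0 * self.price_per_hour


def pareto_front(plans):
    """Keep the plans that no other plan beats on both time and money."""
    front = []
    for p in plans:
        dominated = any(
            q.seconds <= p.seconds and q.money <= p.money
            and (q.seconds < p.seconds or q.money < p.money)
            for q in plans
        )
        if not dominated:
            front.append(p)
    return front


plans = [
    PlanEstimate("small-vm", 7200.0, 0.10),  # slow but cheap
    PlanEstimate("large-vm", 1800.0, 0.80),  # fast but more expensive
    PlanEstimate("wasteful", 7200.0, 0.50),  # as slow as small-vm, costs more
]
front_names = [p.name for p in pareto_front(plans)]
```

In this sketch neither of the first two plans dominates the other, so an optimizer would still need a user preference or budget to choose between them.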
Towards the Exploration of Task and Workflow Scheduling Methods and Mechanisms in Cloud Computing Environment
Cloud computing sets up a domain- and application-specific distributed environment to distribute services and resources among users. Numerous heterogeneous VMs are available in the environment to handle user requests, each of which is defined with a specific deadline. Scheduling methods determine the order in which requests are executed in the cloud environment, and they fall into two main categories: task scheduling and workflow scheduling. This paper is a study of the work performed on task and workflow scheduling. Various feature-processing, constraint-restricted, and priority-driven methods are discussed. The paper also discusses various optimization methods to improve scheduling performance and reliability in the cloud environment, together with the relevant constraints and performance parameters.
Intermediate Results Materialization Selection and Format for Data-Intensive Flows
Data-intensive flows deploy a variety of complex data transformations to build information pipelines from data sources to different end users. As data are processed, these workflows generate large intermediate results, typically pipelined from one operator to the following ones. Materializing intermediate results, shared among multiple flows, brings benefits not only in terms of performance but also in resource usage and consistency. Similar ideas have been proposed in the context of data warehouses, which are studied under the materialized view selection problem. With the rise of Big Data systems, new challenges emerge due to new quality metrics captured by service level agreements which must be taken into account. Moreover, the way such results are stored must be reconsidered, as different data layouts can be used to reduce the I/O cost. In this paper, we propose a novel approach for automatic selection of multi-objective materialization of intermediate results in data-intensive flows, which can tackle multiple and conflicting quality objectives. In addition, our approach chooses the optimal storage data format for selected materialized intermediate results based on subsequent access patterns. The experimental results show that our approach provides 40% better average speedup with respect to the current state-of-the-art, as well as an improvement on disk access time of 18% as compared to fixed format solutions
Storage Solutions for Big Data Systems: A Qualitative Study and Comparison
Big data systems development is full of challenges in view of the variety of
application areas and domains that this technology promises to serve.
Typically, fundamental design decisions involved in big data systems design
include choosing appropriate storage and computing infrastructures. In this age
of heterogeneous systems that integrate different technologies into an optimized
solution to a specific real-world problem, big data systems are no exception
to this rule. As far as the storage aspect of any big data system is
concerned, the primary facet in this regard is a storage infrastructure and
NoSQL seems to be the right technology that fulfills its requirements. However,
every big data application has variable data characteristics and thus, the
corresponding data fits into a different data model. This paper presents a
feature and use case analysis and comparison of the four main data models,
namely document-oriented, key-value, graph, and wide-column. Moreover, a feature
analysis of 80 NoSQL solutions has been provided, elaborating on the criteria
and points that a developer must consider while making a possible choice.
Typically, big data storage needs to communicate with the execution engine and
other processing and visualization technologies to create a comprehensive
solution. This brings the second facet of big data storage, big data file
formats, into the picture. The second half of the research paper compares the
advantages, shortcomings and possible use cases of available big data file
formats for Hadoop, which is the foundation for most big data computing
technologies. Decentralized storage and blockchain are seen as the next
generation of big data storage, and their challenges and future prospects are
also discussed.
Data Warehousing Modernization: Big Data Technology Implementation
Considering the challenges posed by Big Data, the cost of scaling traditional data warehouses is high and their performance would be inadequate to meet the growing volume, variety, and velocity of data. The Hadoop ecosystem addresses both shortcomings. Hadoop can store and analyze large data sets in parallel in a distributed environment, but it cannot replace existing data warehouses and RDBMS systems because of its own limitations, which are explained in this paper. In this paper, I identify the reasons why many enterprises fail or struggle to adapt to Big Data technologies. A brief outline of two different technologies for handling Big Data is presented: IBM’s PureData System for Analytics (Netezza), usually used in reporting, and Hadoop with Hive, which is used in analytics. This paper also covers the enterprise architecture, consisting of Hadoop running alongside a massively parallel processing data warehouse, that successful companies are adopting to analyze, filter, process, and store data. Despite having the technology to support and process Big Data, industries are still struggling to meet their goals due to the lack of skilled personnel to study and analyze the data, in short, data scientists and statisticians.
From Traditional Adaptive Data Caching to Adaptive Context Caching: A Survey
Context data is in demand more than ever with the rapid increase in the
development of many context-aware Internet of Things applications. Research in
context and context-awareness is being conducted to broaden its applicability
in light of many practical and technical challenges. One of the challenges is
improving performance when responding to a large number of context queries.
Context Management Platforms that infer and deliver context to applications
measure this problem using Quality of Service (QoS) parameters. Although
caching is a proven way to improve QoS, the transiency of context and features such
as the variability and heterogeneity of context queries pose an additional real-time
cost management problem. This paper presents a critical survey of
the state of the art in adaptive data caching with the objective of developing a
body of knowledge in cost- and performance-efficient adaptive caching
strategies. We comprehensively survey a large number of research publications
and evaluate, compare, and contrast different techniques, policies, approaches,
and schemes in adaptive caching. Our critical analysis is motivated by the
focus on adaptively caching context as a core research problem. A formal
definition for adaptive context caching is then proposed, followed by
identified features and requirements of a well-designed, objective optimal
adaptive context caching strategy. Comment: This paper is currently under review with ACM Computing Surveys
Journal at this time of publishing in arxiv.or
A time efficient and accurate retrieval of range aggregate queries using fuzzy clustering means (FCM) approach
Massive growth in big data makes it difficult to analyse and retrieve useful information from the available data. Statistical analysis: Existing approaches cannot guarantee efficient retrieval of data from the database. In existing work, stratified sampling is used to partition the tables in terms of static variables. However, the k-means clustering algorithm cannot guarantee efficient retrieval, because choosing centroids in a large volume of data is difficult, and limited knowledge about the static variables might lead to less efficient partitioning of tables. Findings: This problem is overcome in the proposed methodology by introducing FCM clustering instead of k-means clustering, which can cluster large volumes of data that are similar in nature. The stratification problem is overcome by introducing a post-stratification approach, which leads to efficient selection of the static variables. Improvements: This methodology leads to an efficient retrieval process for user queries, with less time and more accuracy.
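For readers unfamiliar with how fuzzy c-means differs from k-means, the sketch below shows the standard FCM membership step on one-dimensional data: each point receives a degree of membership in every cluster instead of a single hard assignment. This is the textbook formula, not the paper's implementation; the function name `fcm_memberships`, the fuzzifier `m = 2`, and the sample values are assumptions chosen for illustration.

```python
def fcm_memberships(points, centers, m=2.0):
    """Standard fuzzy c-means membership update for 1-D points.

    u[i][j] = 1 / sum_k (d_ij / d_ik) ** (2 / (m - 1)),
    where d_ij is the distance from point i to center j.
    """
    exponent = 2.0 / (m - 1.0)
    rows = []
    for x in points:
        dists = [abs(x - c) for c in centers]
        if 0.0 in dists:
            # The point sits exactly on a center: give it full membership
            # there (split evenly if several centers coincide).
            hits = [1.0 if d == 0.0 else 0.0 for d in dists]
            total = sum(hits)
            rows.append([h / total for h in hits])
            continue
        rows.append(
            [1.0 / sum((dj / dk) ** exponent for dk in dists) for dj in dists]
        )
    return rows


# Hypothetical data: three values and two cluster centers.
memberships = fcm_memberships([1.0, 5.0, 9.0], centers=[0.0, 10.0])
```

A point midway between two centers ends up with equal membership in both, whereas k-means would have to assign it to exactly one cluster.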