2,161 research outputs found
Storage Solutions for Big Data Systems: A Qualitative Study and Comparison
Big data systems development is full of challenges in view of the variety of
application areas and domains that this technology promises to serve.
Typically, fundamental design decisions involved in big data systems design
include choosing appropriate storage and computing infrastructures. In this age
of heterogeneous systems that integrate different technologies for optimized
solution to a specific real world problem, big data system are not an exception
to any such rule. As far as the storage aspect of any big data system is
concerned, the primary facet in this regard is a storage infrastructure and
NoSQL seems to be the right technology that fulfills its requirements. However,
every big data application has variable data characteristics and thus, the
corresponding data fits into a different data model. This paper presents
feature and use case analysis and comparison of the four main data models
namely document oriented, key value, graph and wide column. Moreover, a feature
analysis of 80 NoSQL solutions has been provided, elaborating on the criteria
and points that a developer must consider while making a possible choice.
Typically, big data storage needs to communicate with the execution engine and
other processing and visualization technologies to create a comprehensive
solution. This brings forth second facet of big data storage, big data file
formats, into picture. The second half of the research paper compares the
advantages, shortcomings and possible use cases of available big data file
formats for Hadoop, which is the foundation for most big data computing
technologies. Decentralized storage and blockchain are seen as the next
generation of big data storage and its challenges and future prospects have
also been discussed
An Approach to Designing Clusters for Large Data Processing
Cloud computing is increasingly being adopted due to its cost savings and abilities to scale. As data continues to grow rapidly, an increasing amount of institutions are adopting non standard SQL clusters to address the storage and processing demands of large data. However, evaluating and modelling non SQL clusters presents many challenges. In order to address some of these challenges, this thesis proposes a methodology for designing and modelling large scale processing configurations that respond to the end user requirements. Firstly, goals are established for the big data cluster. In this thesis, we use performance and cost as our goals. Secondly, the data is transformed from relational data schema to an appropriate HBase schema. In the third step, we iteratively deploy different clusters. We then model the clusters and evaluate different topologies (size of instances, number of instances, number of clusters, etc.). We use HBase as the large data processing cluster and we evaluate our methodology on traffic data from a large city and on a distributed community cloud infrastructure
Efficient Multi-way Theta-Join Processing Using MapReduce
Multi-way Theta-join queries are powerful in describing complex relations and
therefore widely employed in real practices. However, existing solutions from
traditional distributed and parallel databases for multi-way Theta-join queries
cannot be easily extended to fit a shared-nothing distributed computing
paradigm, which is proven to be able to support OLAP applications over immense
data volumes. In this work, we study the problem of efficient processing of
multi-way Theta-join queries using MapReduce from a cost-effective perspective.
Although there have been some works using the (key,value) pair-based
programming model to support join operations, efficient processing of multi-way
Theta-join queries has never been fully explored. The substantial challenge
lies in, given a number of processing units (that can run Map or Reduce tasks),
mapping a multi-way Theta-join query to a number of MapReduce jobs and having
them executed in a well scheduled sequence, such that the total processing time
span is minimized. Our solution mainly includes two parts: 1) cost metrics for
both single MapReduce job and a number of MapReduce jobs executed in a certain
order; 2) the efficient execution of a chain-typed Theta-join with only one
MapReduce job. Comparing with the query evaluation strategy proposed in [23]
and the widely adopted Pig Latin and Hive SQL solutions, our method achieves
significant improvement of the join processing efficiency.Comment: VLDB201
How can SMEs benefit from big data? Challenges and a path forward
Big data is big news, and large companies in all sectors are making significant advances in their customer relations, product selection and development and consequent profitability through using this valuable commodity. Small and medium enterprises (SMEs) have proved themselves to be slow adopters of the new technology of big data analytics and are in danger of being left behind. In Europe, SMEs are a vital part of the economy, and the challenges they encounter need to be addressed as a matter of urgency. This paper identifies barriers to SME uptake of big data analytics and recognises their complex challenge to all stakeholders, including national and international policy makers, IT, business management and data science communities.
The paper proposes a big data maturity model for SMEs as a first step towards an SME roadmap to data analytics. It considers the âstate-of-the-artâ of IT with respect to usability and usefulness for SMEs and discusses how SMEs can overcome the barriers preventing them from adopting existing solutions. The paper then considers management perspectives and the role of maturity models in enhancing and structuring the adoption of data analytics in an organisation. The history of total quality management is reviewed to inform the core aspects of implanting a new paradigm. The paper concludes with recommendations to help SMEs develop their big data capability and enable them to continue as the engines of European industrial and business success. Copyright © 2016 John Wiley & Sons, Ltd.Peer ReviewedPostprint (author's final draft
A Business Intelligence Solution, based on a Big Data Architecture, for processing and analyzing the World Bank data
The rapid growth in data volume and complexity has needed the adoption of advanced technologies to extract valuable insights for decision-making. This project aims to address this need by developing a comprehensive framework that combines Big Data processing, analytics, and visualization techniques to enable effective analysis of World Bank data. The problem addressed in this study is the need for a scalable and efficient Business Intelligence solution that can handle the vast amounts of data generated by the World Bank. Therefore, a Big Data architecture is implemented on a real use case for the International Bank of Reconstruction and Development. The findings of this project demonstrate the effectiveness of the proposed solution. Through the integration of Apache Spark and Apache Hive, data is processed using Extract, Transform and Load techniques, allowing for efficient data preparation. The use of Apache Kylin enables the construction of a multidimensional model, facilitating fast and interactive queries on the data. Moreover, data visualization techniques are employed to create intuitive and informative visual representations of the analysed data. The key conclusions drawn from this project highlight the advantages of a Big Data-driven Business Intelligence solution in processing and analysing World Bank data. The implemented framework showcases improved scalability, performance, and flexibility compared to traditional approaches. In conclusion, this bachelor thesis presents a Business Intelligence solution based on a Big Data architecture for processing and analysing the World Bank data. The project findings emphasize the importance of scalable and efficient data processing techniques, multidimensional modelling, and data visualization for deriving valuable insights. The application of these techniques contributes to the field by demonstrating the potential of Big Data Business Intelligence solutions in addressing the challenges associated with large-scale data analysis
User-centric Visualization of Data Provenance
The need to understand and track files (and inherently, data) in cloud computing systems is in high demand. Over the past years, the use of logs and data representation using graphs have become the main method for tracking and relating information to the cloud users. While it is still in use, tracking and relating information with âData Provenanceâ (i.e. series of chronicles and the derivation history of data on meta-data) is the new trend for cloud users. However, there is still much room for improving representation of data activities in cloud systems for end-users.
In this thesis, we propose âUVisP (User-centric Visualization of Data Provenance with Gestalt)â, a novel user-centric visualization technique for data provenance. This technique aims to facilitate the missing link between data movements in cloud computing environments and the end-usersâ uncertain queries over their filesâ security and life cycle within cloud systems.
The proof of concept for the UVisP technique integrates D3 (an open-source visualization API) with Gestaltsâ theory of perception to provide a range of user-centric visualizations. UVisP allows users to transform and visualize provenance (logs) with implicit prior knowledge of âGestaltsâ theory of perception.â We presented the initial development of the UVisP technique and our results show that the integration of Gestalt and the existence of âperceptual key(s)â in provenance visualization allows end-users to enhance their visualizing capabilities, extract useful knowledge and understand the visualizations better. This technique also enables end-users to develop certain methods and preferences when sighting different visualizations. For example, having the prior knowledge of Gestaltâs theory of perception and integrated with the types of visualizations offers the user-centric experience when using different visualizations. We also present significant future work that will help profile new user-centric visualizations for cloud users
Essays on Business Value Creation in Digital Platform Ecosystems
Digital platforms and the surrounding ecosystems have garnered great interest from researchers and practitioners. Notwithstanding this attention, it remains unclear how and when digital platforms create business value for platform owners and complementors. This three-essay dissertation focuses on understanding business value creation in digital platform ecosystems. The first essay reviews and synthesizes literature across disciplines and offers an integrative framework of digital platform business value. Advised by the findings from the review, the second and third essays focus on the value creation for platform complementors. The second essay examines how IT startups entering a platform ecosystem at different times can strategically design their products (i.e., product diversification across platform architectural layers and product differentiation) to gain competitive advantages. Longitudinal evidence from the Hadoop ecosystem demonstrates that product diversification has an inverted U-shaped relationship with complementors success, and such an effect is more salient for earlier entrants than later entrants. Earlier entrants should develop products that are similar to other ecosystem competitors to reduce uncertainty whereas later entrants are advised to explore market niche and differentiate their products.The third essay investigates how platform complementors strategies and products co-evolve over time in the co-created ecosystem network environment. Our longitudinal analysis of the Hadoop ecosystem indicates that complementors technological architecture coverage and alliance exploration strategies increase their product evolution rate. In turn, complementors with faster product evolution are more likely to explore new partners but less likely to cover a wider range of the focal platforms technological layers in subsequent periods. Network density, co-created by all platform complementors, weakens the effects of complementors strategies on their product evolution but amplifies the effects of past product evolutions on strategies.This three-essay dissertation uncovers various understudied competitive strategies in the digital platform context and enriches our understanding of business value creation in digital platform ecosystems
- âŠ