2,161 research outputs found

    Storage Solutions for Big Data Systems: A Qualitative Study and Comparison

    Big data systems development is full of challenges, given the variety of application areas and domains that this technology promises to serve. Typically, the fundamental design decisions in big data systems design include choosing appropriate storage and computing infrastructures. In this age of heterogeneous systems that integrate different technologies into an optimized solution to a specific real-world problem, big data systems are no exception. As far as the storage aspect of a big data system is concerned, the primary facet is the storage infrastructure, and NoSQL seems to be the right technology to fulfill its requirements. However, every big data application has different data characteristics, and thus the corresponding data fits a different data model. This paper presents a feature and use case analysis and comparison of the four main data models, namely document-oriented, key-value, graph and wide-column. Moreover, a feature analysis of 80 NoSQL solutions is provided, elaborating on the criteria and points that a developer must consider when making a choice. Typically, big data storage needs to communicate with the execution engine and other processing and visualization technologies to create a comprehensive solution. This brings the second facet of big data storage, big data file formats, into the picture. The second half of the paper compares the advantages, shortcomings and possible use cases of the available big data file formats for Hadoop, which is the foundation for most big data computing technologies. Decentralized storage and blockchain are seen as the next generation of big data storage, and their challenges and future prospects are also discussed.

    An Approach to Designing Clusters for Large Data Processing

    Cloud computing is increasingly being adopted due to its cost savings and ability to scale. As data continues to grow rapidly, an increasing number of institutions are adopting non-standard SQL clusters to address the storage and processing demands of large data. However, evaluating and modelling non-SQL clusters presents many challenges. To address some of these challenges, this thesis proposes a methodology for designing and modelling large-scale processing configurations that respond to end-user requirements. Firstly, goals are established for the big data cluster. In this thesis, we use performance and cost as our goals. Secondly, the data is transformed from a relational data schema to an appropriate HBase schema. In the third step, we iteratively deploy different clusters. We then model the clusters and evaluate different topologies (size of instances, number of instances, number of clusters, etc.). We use HBase as the large data processing cluster, and we evaluate our methodology on traffic data from a large city and on a distributed community cloud infrastructure.

    Efficient Multi-way Theta-Join Processing Using MapReduce

    Multi-way Theta-join queries are powerful in describing complex relations and are therefore widely employed in practice. However, existing solutions from traditional distributed and parallel databases for multi-way Theta-join queries cannot be easily extended to fit a shared-nothing distributed computing paradigm, which has been proven able to support OLAP applications over immense data volumes. In this work, we study the problem of efficiently processing multi-way Theta-join queries using MapReduce from a cost-effective perspective. Although there have been some works using the (key,value) pair-based programming model to support join operations, efficient processing of multi-way Theta-join queries has never been fully explored. The substantial challenge lies in, given a number of processing units (that can run Map or Reduce tasks), mapping a multi-way Theta-join query to a number of MapReduce jobs and having them executed in a well-scheduled sequence, such that the total processing time span is minimized. Our solution mainly includes two parts: 1) cost metrics for both a single MapReduce job and a number of MapReduce jobs executed in a certain order; 2) the efficient execution of a chain-typed Theta-join with only one MapReduce job. Compared with the query evaluation strategy proposed in [23] and the widely adopted Pig Latin and Hive SQL solutions, our method achieves a significant improvement in join processing efficiency. Comment: VLDB201

    How can SMEs benefit from big data? Challenges and a path forward

    Big data is big news, and large companies in all sectors are making significant advances in their customer relations, product selection and development, and consequent profitability through using this valuable commodity. Small and medium enterprises (SMEs) have proved themselves to be slow adopters of the new technology of big data analytics and are in danger of being left behind. In Europe, SMEs are a vital part of the economy, and the challenges they encounter need to be addressed as a matter of urgency. This paper identifies barriers to SME uptake of big data analytics and recognises the complex challenge they pose to all stakeholders, including national and international policy makers and the IT, business management and data science communities. The paper proposes a big data maturity model for SMEs as a first step towards an SME roadmap to data analytics. It considers the ‘state-of-the-art’ of IT with respect to usability and usefulness for SMEs and discusses how SMEs can overcome the barriers preventing them from adopting existing solutions. The paper then considers management perspectives and the role of maturity models in enhancing and structuring the adoption of data analytics in an organisation. The history of total quality management is reviewed to inform the core aspects of implanting a new paradigm. The paper concludes with recommendations to help SMEs develop their big data capability and enable them to continue as the engines of European industrial and business success. Copyright © 2016 John Wiley & Sons, Ltd. Peer Reviewed. Postprint (author's final draft)

    A Business Intelligence Solution, based on a Big Data Architecture, for processing and analyzing the World Bank data

    The rapid growth in data volume and complexity has necessitated the adoption of advanced technologies to extract valuable insights for decision-making. This project aims to address this need by developing a comprehensive framework that combines Big Data processing, analytics, and visualization techniques to enable effective analysis of World Bank data. The problem addressed in this study is the need for a scalable and efficient Business Intelligence solution that can handle the vast amounts of data generated by the World Bank. Therefore, a Big Data architecture is implemented on a real use case for the International Bank for Reconstruction and Development. The findings of this project demonstrate the effectiveness of the proposed solution. Through the integration of Apache Spark and Apache Hive, data is processed using Extract, Transform and Load techniques, allowing for efficient data preparation. The use of Apache Kylin enables the construction of a multidimensional model, facilitating fast and interactive queries on the data. Moreover, data visualization techniques are employed to create intuitive and informative visual representations of the analysed data. The key conclusions drawn from this project highlight the advantages of a Big Data-driven Business Intelligence solution in processing and analysing World Bank data. The implemented framework showcases improved scalability, performance, and flexibility compared to traditional approaches. In conclusion, this bachelor thesis presents a Business Intelligence solution based on a Big Data architecture for processing and analysing the World Bank data. The project findings emphasize the importance of scalable and efficient data processing techniques, multidimensional modelling, and data visualization for deriving valuable insights. The application of these techniques contributes to the field by demonstrating the potential of Big Data Business Intelligence solutions in addressing the challenges associated with large-scale data analysis.

    User-centric Visualization of Data Provenance

    There is a growing need to understand and track files (and inherently, data) in cloud computing systems. Over the past years, logs and graph-based data representations have become the main method for tracking and relating information to cloud users. While this method is still in use, tracking and relating information with ‘Data Provenance’ (i.e. series of chronicles and the derivation history of data on meta-data) is the new trend for cloud users. However, there is still much room for improving the representation of data activities in cloud systems for end-users. In this thesis, we propose “UVisP (User-centric Visualization of Data Provenance with Gestalt)”, a novel user-centric visualization technique for data provenance. This technique aims to provide the missing link between data movements in cloud computing environments and end-users’ uncertain queries about their files’ security and life cycle within cloud systems. The proof of concept for the UVisP technique integrates D3 (an open-source visualization API) with the Gestalt theory of perception to provide a range of user-centric visualizations. UVisP allows users to transform and visualize provenance (logs) with implicit prior knowledge of the Gestalt theory of perception. We present the initial development of the UVisP technique, and our results show that the integration of Gestalt and the existence of ‘perceptual key(s)’ in provenance visualization allow end-users to enhance their visualizing capabilities, extract useful knowledge and understand the visualizations better. This technique also enables end-users to develop certain methods and preferences when viewing different visualizations. For example, prior knowledge of the Gestalt theory of perception, integrated with the types of visualizations, offers a user-centric experience when using different visualizations. We also present significant future work that will help profile new user-centric visualizations for cloud users.

    Essays on Business Value Creation in Digital Platform Ecosystems

    Digital platforms and the surrounding ecosystems have garnered great interest from researchers and practitioners. Notwithstanding this attention, it remains unclear how and when digital platforms create business value for platform owners and complementors. This three-essay dissertation focuses on understanding business value creation in digital platform ecosystems. The first essay reviews and synthesizes literature across disciplines and offers an integrative framework of digital platform business value. Informed by the findings from the review, the second and third essays focus on value creation for platform complementors. The second essay examines how IT startups entering a platform ecosystem at different times can strategically design their products (i.e., product diversification across platform architectural layers and product differentiation) to gain competitive advantages. Longitudinal evidence from the Hadoop ecosystem demonstrates that product diversification has an inverted U-shaped relationship with complementors' success, and that this effect is more salient for earlier entrants than for later entrants. Earlier entrants should develop products that are similar to those of other ecosystem competitors to reduce uncertainty, whereas later entrants are advised to explore market niches and differentiate their products. The third essay investigates how platform complementors' strategies and products co-evolve over time in the co-created ecosystem network environment. Our longitudinal analysis of the Hadoop ecosystem indicates that complementors' technological architecture coverage and alliance exploration strategies increase their product evolution rate. In turn, complementors with faster product evolution are more likely to explore new partners but less likely to cover a wider range of the focal platform's technological layers in subsequent periods. Network density, co-created by all platform complementors, weakens the effects of complementors' strategies on their product evolution but amplifies the effects of past product evolution on strategies. This three-essay dissertation uncovers various understudied competitive strategies in the digital platform context and enriches our understanding of business value creation in digital platform ecosystems.