    Apache Calcite: A Foundational Framework for Optimized Query Processing Over Heterogeneous Data Sources

    Apache Calcite is a foundational software framework that provides query processing, optimization, and query language support to many popular open-source data processing systems such as Apache Hive, Apache Storm, Apache Flink, Druid, and MapD. Calcite's architecture consists of a modular and extensible query optimizer with hundreds of built-in optimization rules, a query processor capable of processing a variety of query languages, an adapter architecture designed for extensibility, and support for heterogeneous data models and stores (relational, semi-structured, streaming, and geospatial). This flexible, embeddable, and extensible architecture is what makes Calcite an attractive choice for adoption in big-data frameworks. It is an active project that continues to introduce support for new data sources, query languages, and approaches to query processing and optimization. (Comment: SIGMOD'18)
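
    To make the rule-based, adapter-oriented design concrete, the following is a minimal Python sketch of the general idea behind an optimization rule: a logical plan represented as a small tree of nodes and a single rewrite that pushes a filter beneath a projection. This is a toy illustration only, with invented node and rule names; it is not Calcite's actual (Java) API.

```python
# Toy illustration of rule-based plan rewriting (not Calcite's real API).
from dataclasses import dataclass

@dataclass
class Scan:               # read a table from some adapter/source
    table: str

@dataclass
class Project:            # keep only the named columns
    columns: list
    child: object

@dataclass
class Filter:             # keep rows satisfying a predicate (kept abstract here)
    predicate: str
    child: object

def push_filter_below_project(node):
    """If a Filter sits directly on top of a Project (and only references
    projected columns), swap them so the filter runs closer to the source."""
    if isinstance(node, Filter) and isinstance(node.child, Project):
        proj = node.child
        return Project(proj.columns, Filter(node.predicate, proj.child))
    return node

plan = Filter("amount > 100", Project(["id", "amount"], Scan("orders")))
print(push_filter_below_project(plan))
# Project(columns=['id', 'amount'], child=Filter(..., child=Scan(table='orders')))
```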

    A Business Intelligence Solution, based on a Big Data Architecture, for processing and analyzing the World Bank data

    The rapid growth in data volume and complexity has necessitated the adoption of advanced technologies to extract valuable insights for decision-making. This project aims to address this need by developing a comprehensive framework that combines Big Data processing, analytics, and visualization techniques to enable effective analysis of World Bank data. The problem addressed in this study is the need for a scalable and efficient Business Intelligence solution that can handle the vast amounts of data generated by the World Bank. Therefore, a Big Data architecture is implemented for a real use case of the International Bank for Reconstruction and Development. The findings of this project demonstrate the effectiveness of the proposed solution. Through the integration of Apache Spark and Apache Hive, data is processed using Extract, Transform, and Load (ETL) techniques, allowing for efficient data preparation. The use of Apache Kylin enables the construction of a multidimensional model, facilitating fast and interactive queries on the data. Moreover, data visualization techniques are employed to create intuitive and informative visual representations of the analysed data. The key conclusions drawn from this project highlight the advantages of a Big Data-driven Business Intelligence solution in processing and analysing World Bank data. The implemented framework showcases improved scalability, performance, and flexibility compared to traditional approaches. In conclusion, this bachelor thesis presents a Business Intelligence solution based on a Big Data architecture for processing and analysing World Bank data. The project findings emphasize the importance of scalable and efficient data processing techniques, multidimensional modelling, and data visualization for deriving valuable insights. The application of these techniques contributes to the field by demonstrating the potential of Big Data Business Intelligence solutions in addressing the challenges associated with large-scale data analysis.
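
    As a rough sketch of the Spark-plus-Hive ETL step described above, the PySpark snippet below reads a raw indicator file, applies simple transformations, and writes the result to a Hive table that a cube engine such as Kylin could later build on. The file path, column names, and table name are placeholders for illustration, not details taken from the thesis.

```python
# Minimal ETL sketch with PySpark and Hive (paths, columns, and table names are placeholders).
from pyspark.sql import SparkSession, functions as F

spark = (SparkSession.builder
         .appName("worldbank-etl")
         .enableHiveSupport()     # allows writing managed Hive tables
         .getOrCreate())

# Extract: raw indicator data in a hypothetical CSV layout.
raw = spark.read.option("header", True).csv("/data/raw/worldbank_indicators.csv")

# Transform: cast types and drop incomplete rows to get a tidy fact-table shape.
fact = (raw
        .withColumn("year", F.col("year").cast("int"))
        .withColumn("value", F.col("value").cast("double"))
        .dropna(subset=["country_code", "indicator_code", "year", "value"]))

# Load: persist as a Hive table that downstream tools (e.g. Kylin) can model.
fact.write.mode("overwrite").saveAsTable("dw.fact_indicator_value")
```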

    Rumble: Data Independence for Large Messy Data Sets

    This paper introduces Rumble, an engine that executes JSONiq queries on large, heterogeneous, and nested collections of JSON objects, leveraging the parallel capabilities of Spark so as to provide a high degree of data independence. The design is based on two key insights: (i) how to map JSONiq expressions to Spark transformations on RDDs and (ii) how to map JSONiq FLWOR clauses to Spark SQL on DataFrames. We have developed a working implementation of these mappings showing that JSONiq can efficiently run on Spark to query billions of objects, into at least the terabyte range. The JSONiq code is concise in comparison to Spark's host languages while seamlessly supporting the nested, heterogeneous data sets that Spark SQL does not. The ability to process this kind of input, which is commonly encountered in practice, is paramount for data cleaning and curation. The experimental analysis indicates that there is no excessive performance loss, and occasionally even a gain, over Spark SQL for structured data, as well as a performance gain over PySpark. This demonstrates that a language such as JSONiq is a simple and viable approach to large-scale querying of denormalized, heterogeneous, arborescent data sets, in the same way that SQL can be leveraged for structured data sets. The results also illustrate that Codd's concept of data independence makes as much sense for heterogeneous, nested data sets as it does for highly structured tables. (Comment: Preprint, 9 pages)
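
    To illustrate the kind of mapping the paper describes, the sketch below pairs a small FLWOR-style query (shown as a comment; the exact built-in function names in Rumble may differ) with a PySpark DataFrame pipeline that performs the equivalent filter-and-project over a collection of JSON objects. The path and field names are placeholders.

```python
# JSONiq, shown for illustration (Rumble's built-in names may differ):
#   for $p in json-lines("hdfs:///data/people.json")
#   where $p.age ge 18
#   return {"name": $p.name}
#
# A roughly equivalent DataFrame pipeline in PySpark:
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("jsoniq-vs-spark").getOrCreate()

people = spark.read.json("hdfs:///data/people.json")   # one JSON object per line
adults = people.where(F.col("age") >= 18).select("name")
adults.show()
```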

    Enhancing Big Data Warehousing and Analytics for Spatio-Temporal Massive Data

    The amount of data generated by earth observation missions such as Copernicus and NASA Earth Data, and by climate stations, is growing rapidly. Every day, terabytes of data are collected from these resources for different environmental applications. This massive amount of data must therefore be effectively managed and processed to support decision-makers. In this paper, we propose an information system based on a low-latency spatio-temporal data warehouse that aims to improve drought monitoring analytics and to support the decision-making process. The proposed framework consists of four main modules: (1) data collection, (2) data preprocessing, (3) data loading and storage, and (4) visualization and interpretation. The data are heterogeneous and collected from multiple sources, such as remote sensing instruments, biophysical sensors, and climate sensors, which allows drought to be studied along several dimensions. Experiments were carried out on a real case of drought monitoring in China between 2000 and 2020.
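
    The four-module structure lends itself to a thin pipeline skeleton. The sketch below only shows how such modules might be chained in Python; all function names, fields, and bodies are invented for illustration and are not taken from the paper.

```python
# Skeleton of a four-stage spatio-temporal warehousing pipeline (all names are illustrative).

def collect(sources):
    """Module 1: gather raw observations (e.g. satellite scenes, station readings)."""
    return [obs for src in sources for obs in src]

def preprocess(observations):
    """Module 2: clean and filter observations before loading."""
    return [o for o in observations if o.get("quality_flag") == "ok"]

def load(records, warehouse):
    """Module 3: append records to the (here, in-memory) data warehouse."""
    warehouse.extend(records)

def visualize(warehouse):
    """Module 4: summarize for dashboards/maps (here, a simple count per year)."""
    counts = {}
    for r in warehouse:
        counts[r["year"]] = counts.get(r["year"], 0) + 1
    return counts

warehouse = []
raw = collect([[{"year": 2019, "quality_flag": "ok"}],
               [{"year": 2020, "quality_flag": "bad"}]])
load(preprocess(raw), warehouse)
print(visualize(warehouse))   # {2019: 1}
```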

    Quality of Service Aware Data Stream Processing for Highly Dynamic and Scalable Applications

    Huge amounts of georeferenced data streams arrive daily at data stream management systems that are deployed to serve highly scalable and dynamic applications. There are innumerable ways in which such streams can be exploited to gain deep insights in various domains, and decision makers require interactive visualization of the data in the form of maps and dashboards for decision making and strategic planning. Data streams normally exhibit fluctuation and oscillation in arrival rates as well as skewness; these are the two predominant factors that greatly impact overall quality of service. Data stream management systems must therefore be attuned to those factors, in addition to the spatial shape of the data, which may exacerbate their negative impact. Current systems do not natively support services with quality guarantees for dynamic scenarios, leaving the handling of those logistics to the user, which is challenging and cumbersome. Three workloads are predominant for any data stream: batch processing, scalable storage, and stream processing. In this thesis, we have designed a quality-of-service-aware system, SpatialDSMS, that comprises several subsystems covering those workloads and any mixed load that results from combining them. Most importantly, we have natively incorporated quality-of-service optimizations for processing avalanches of georeferenced data streams in highly dynamic application scenarios. This has been achieved transparently on top of the codebases of de facto standard, best-in-class systems, relieving users at the presentation layer from having to reason about those services. Instead, users express their queries with quality goals, and our system optimizer compiles them down into query plans with embedded quality guarantees, leaving the logistics to the underlying layers. We have developed standards-compliant prototypes for all the subsystems that constitute SpatialDSMS.
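
    The thesis's own query interface is not reproduced here, but the general pattern of attaching a quality goal to a streaming query and letting an optimizer choose a plan can be sketched as follows. Every name in this snippet is hypothetical and only illustrates the idea of compiling quality goals into plan choices; it is not SpatialDSMS's actual API.

```python
# Hypothetical sketch of QoS-goal-driven plan selection (not SpatialDSMS's real interface).
from dataclasses import dataclass

@dataclass
class QosGoal:
    max_latency_ms: int    # end-to-end latency the user will tolerate
    min_accuracy: float    # acceptable accuracy for approximate answers (0..1)

def choose_plan(arrival_rate_per_s: float, goal: QosGoal) -> str:
    """Pick an execution strategy intended to keep latency within the goal.
    Under heavy load, fall back to sampling if the accuracy goal allows it."""
    if arrival_rate_per_s > 50_000 and goal.min_accuracy <= 0.95:
        return "approximate aggregation over a 1% sample"
    if arrival_rate_per_s > 50_000:
        return "exact aggregation, scaled out over more partitions"
    return "exact aggregation, single stage"

print(choose_plan(80_000, QosGoal(max_latency_ms=500, min_accuracy=0.9)))
# approximate aggregation over a 1% sample
```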

    Big Data Management Challenges, Approaches, Tools and their limitations

    Big Data is the buzzword everyone talks about. Independently of the application domain, today there is a consensus about the V's characterizing Big Data: Volume, Variety, and Velocity. By focusing on data management issues and past experience in the area of database systems, this chapter examines the main challenges involved in the three V's of Big Data. It then reviews the main characteristics of existing solutions for addressing each of the V's (e.g., NoSQL, parallel RDBMS, stream data management systems, and complex event processing systems). Finally, it provides a classification of the different functions offered by NewSQL systems and discusses their benefits and limitations for processing Big Data.

    Storage and Analysis of Big Data Tools for Sessionized Data

    The Oracle database currently used to mine data at PEGGY is approaching end-of-life, and an infrastructure overhaul is required. A critical business requirement has also been identified: the need to load and store very large historical data sets. These data sets contain raw electronic consumer events and interactions from a website, such as page views, clicks, downloads, return visits, length of time spent on pages, and how visitors got to the site (their origin). This project is focused on finding a tool to analyze and measure sessionized data, a unit of measurement in web analytics that captures either a user's actions within a particular time period, or the process of segmenting the activity of each user into sessions, each representing a single visit to the site. Sessionized data can be used as the input for a variety of data mining tasks such as clustering, association rule mining, and sequence mining (Ansari, 2011). This sessionized data must be delivered in a reorganized and readable format quickly enough to make informed go-to-market decisions as they relate to current industry trends. It is also pertinent to understand any development work required and the burden on resources. Legacy on-premise data warehouse solutions are becoming more expensive, less efficient, less dynamic, and less scalable when compared to current cloud Infrastructure as a Service (IaaS) offerings that provide real-time, on-demand, pay-as-you-go solutions. Therefore, this study examines the total cost of ownership (TCO) by considering, researching, and analyzing the following factors against a system-wide upgrade of the current on-premise Oracle Real Application Cluster (RAC) system: (1) high performance, i.e., real-time (or as close as possible) query speed against sessionized data; (2) SQL compliance; (3) cloud-based, or at least hybrid (on-premise paired with cloud), deployment; (4) security, with encryption preferred; and (5) cost structure, i.e., a cost-effective pay-as-you-go pricing model and the resources required for migration and operations. The technologies analyzed against the current Oracle database are Amazon Redshift, Google BigQuery, Hadoop, and Hadoop + Hive. The cost of building an on-premise data warehouse is substantial. The project will determine the performance capabilities and affordability of Amazon Redshift, compared to other emerging highly ranked solutions, for running e-commerce standard analytics queries on terabytes of sessionized data. Rather than redesigning, upgrading, or over-purchasing infrastructure at high cost for an on-premise data warehouse, this project considers data warehousing solutions delivered through cloud-based Infrastructure as a Service (IaaS) offerings. The proposed objective of this project is to determine the most cost-effective high performer among Amazon Redshift, Apache Hadoop, and Google BigQuery when running e-commerce standard analytics queries on terabytes of sessionized data.
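
    Since sessionization is the core workload being benchmarked, it may help to show one common way of computing it. The hedged PySpark sketch below groups events per user and starts a new session whenever the gap since the previous event exceeds a 30-minute threshold; the column names, sample data, and threshold are assumptions for illustration, not values from the study.

```python
# Sessionization sketch in PySpark: a new session starts after 30 minutes of inactivity.
# Column names (user_id, event_ts), sample rows, and the threshold are illustrative.
from pyspark.sql import SparkSession, functions as F, Window

spark = SparkSession.builder.appName("sessionize").getOrCreate()

events = spark.createDataFrame(
    [("u1", "2024-01-01 10:00:00"),
     ("u1", "2024-01-01 10:10:00"),
     ("u1", "2024-01-01 11:30:00"),   # > 30 min gap -> new session
     ("u2", "2024-01-01 09:00:00")],
    ["user_id", "event_ts"],
).withColumn("event_ts", F.to_timestamp("event_ts"))

w = Window.partitionBy("user_id").orderBy("event_ts")

sessions = (events
    .withColumn("prev_ts", F.lag("event_ts").over(w))
    .withColumn("gap_s", F.col("event_ts").cast("long") - F.col("prev_ts").cast("long"))
    .withColumn("new_session",
                F.when(F.col("prev_ts").isNull() | (F.col("gap_s") > 1800), 1).otherwise(0))
    .withColumn("session_no", F.sum("new_session").over(w))      # running count of session starts
    .withColumn("session_id", F.concat_ws("-", "user_id", "session_no")))

sessions.select("user_id", "event_ts", "session_id").show(truncate=False)
```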