11 research outputs found

    Polyglot Analytical Query Engine (Motor de Consultas Analíticas Políglota)

    Modern enterprises handle large amounts of data coming from various data sources such as the company's website, different social networks, or the company's own information systems.
Currently, no single data management system is capable of storing and processing all the different types of data that these sources generate. This means that, for instance, a financial institution uses a relational database to store the data of its clients' accounts in order to guarantee transactional semantics and data consistency. The same organization may also need a NoSQL database to store and process information on opinions or other events relevant to the company that occur on Twitter-like networks. NoSQL data management systems allow this information to be stored with more flexible models than the relational one and also tend to scale better than relational databases. These properties are obtained at the cost of sacrificing transactional semantics in data access. For several years there has been an explosion of different types of NoSQL data management systems (key-value, document-oriented, graph, etc.), which, unlike relational databases, lack a standard query language: each datastore provides its own. This fact, added to the large amount of data generated every day, makes data processing in modern enterprises increasingly complex. Polyglot query processing arises as a solution to this problem by providing a query engine capable of interacting with different types of data managers and even combining query results across them, for instance through join operations. The language offered by the polyglot query engine should: (i) preserve the native expressiveness of each data management system's query language, and (ii) take advantage of the parallelism offered by the underlying data management systems to perform both parallel data retrieval and parallel query processing, and thus be able to work with large amounts of information.
The main objectives of this thesis address both points by developing a polyglot data manager that provides the CloudMdsQL query language to express native queries combined with SQL statements. The polyglot analytical query engine has been implemented by extending LeanXcale's distributed query engine, LeanXcale being a highly scalable relational database. The engine parallelizes queries taking into account that each data manager can, in turn, be a distributed data manager. In addition, it implements optimization techniques, such as the bind join, which can improve the performance of selective join operations. The performance of the polyglot query engine has been evaluated using industry benchmarks (TPC-H) in a variety of scenarios. Another objective of this thesis has been the design of a novel architecture that allows analytics both on historical data and on data obtained in real time, without having to wait for ETL-type processes. The architecture presented in this thesis allows federated queries to be processed over a combined current and historical dataset while maintaining data consistency, even while the data is being moved to the data warehouse responsible for the analytical processing.
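The bind join mentioned above can be sketched in a few lines: fetch the outer relation first, then ship only its distinct join-key values to the inner data store as a filter, instead of scanning the inner relation in full. This is a minimal illustration of the general technique, not LeanXcale code; `fetch_inner_by_keys` and the toy `INNER` list are hypothetical stand-ins for a call to the inner data store.

```python
def bind_join(outer_rows, fetch_inner_by_keys, key):
    keys = {row[key] for row in outer_rows}      # distinct bind values
    inner_rows = fetch_inner_by_keys(keys)       # e.g. ... WHERE id IN (...)
    index = {}
    for row in inner_rows:                       # hash the (small) inner side
        index.setdefault(row[key], []).append(row)
    return [{**o, **i}
            for o in outer_rows
            for i in index.get(o[key], [])]

# Toy stand-in for the inner data store: only bound keys are retrieved.
INNER = [{"id": 1, "name": "a"}, {"id": 2, "name": "b"}, {"id": 9, "name": "z"}]

def fetch_inner_by_keys(keys):
    return [r for r in INNER if r["id"] in keys]

outer = [{"id": 1, "v": 10}, {"id": 2, "v": 20}]
result = bind_join(outer, fetch_inner_by_keys, "id")
```

The benefit appears when the outer relation is selective: only two key values cross to the inner store here, so the inner side is never fully scanned.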

    Parallel efficient data loading

    In this paper we discuss how we architected and developed a parallel data loader for the LeanXcale database. The loader is characterized by its efficiency and parallelism. LeanXcale can scale up and scale out to very large configurations, and loading data in the traditional way does not exploit its full potential in terms of the loading rate it can reach. For this reason, we have created a parallel loader that can reach the maximum insertion rate LeanXcale can handle. LeanXcale also exhibits a dual interface, key-value and SQL, which has been exploited by the parallel loader. Basically, the loading leverages the key-value API and results in a highly efficient process that avoids the overhead of SQL processing. Finally, in order to guarantee parallelism, we have developed a data sampler that samples the data to generate a histogram of the data distribution and uses it to pre-split the regions across LeanXcale instances, guaranteeing that all instances receive an even amount of data during loading and thus achieving the peak loading capability of the deployment.
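The sampling step described above can be sketched as follows: draw a sample of the keys to be loaded, sort it, and pick region boundaries at evenly spaced ranks (an equi-depth histogram), so that each pre-split region receives roughly the same number of rows. This is only an illustration of the idea under a uniform toy key distribution, not the actual loader code.

```python
import random

def split_points(sample_keys, num_regions):
    """Derive region boundaries from a sample so that each region
    receives roughly the same number of rows (equi-depth histogram)."""
    s = sorted(sample_keys)
    # Pick num_regions - 1 boundaries at evenly spaced ranks.
    return [s[(i * len(s)) // num_regions] for i in range(1, num_regions)]

random.seed(0)
# Hypothetical stand-in for sampling ~1% of the keys to be loaded.
sample = [random.randint(0, 1_000_000) for _ in range(10_000)]
bounds = split_points(sample, 4)
# Rows with key < bounds[0] go to the first region, keys in
# [bounds[0], bounds[1]) to the second, and so on.
```

With skewed real-world keys the same procedure still balances the regions, because the boundaries follow the sampled distribution rather than splitting the key range uniformly.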

    Parallel Query Processing in a Polystore

    The blooming of different data stores has made polystores a major topic in the cloud and big data landscape. As the amount of data grows rapidly, it becomes critical to exploit the inherent parallel processing capabilities of the underlying data stores and data processing platforms. To fully achieve this, a polystore should: (i) preserve the expressivity of each data store's native query or scripting language and (ii) leverage a distributed architecture to enable parallel data integration, i.e. joins, on top of parallel retrieval of the underlying partitioned datasets. In this paper, we address these points by: (i) using the polyglot approach of the CloudMdsQL query language, which allows native queries to be expressed as inline scripts and combined with SQL statements for ad-hoc integration, and (ii) incorporating this approach within the LeanXcale distributed query engine, thus allowing native scripts to be processed in parallel at the data store shards. In addition, (iii) efficient optimization techniques, such as the bind join, can take place to improve the performance of selective joins. We evaluate the performance benefits of exploiting parallelism in combination with high expressivity and optimization through our experimental validation.
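The parallel retrieval in point (ii) can be sketched as follows: the engine issues one sub-query per data store shard concurrently and concatenates the partial results before running further operators such as a join. The shard contents and `scan_shard` function are hypothetical stand-ins for native per-shard queries, not the actual engine API.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical shard contents; in a real engine each worker would run a
# native query against one data store shard and stream back its rows.
SHARDS = [
    [{"id": 1, "city": "Madrid"}, {"id": 4, "city": "Paris"}],
    [{"id": 2, "city": "Sofia"}, {"id": 3, "city": "Lisbon"}],
]

def scan_shard(shard):
    """Stand-in for executing a native sub-query on one shard."""
    return shard

def parallel_scan(shards):
    """Retrieve all shards concurrently and concatenate the partial
    results, ready for downstream operators (e.g. a join)."""
    with ThreadPoolExecutor(max_workers=len(shards)) as pool:
        parts = list(pool.map(scan_shard, shards))
    return [row for part in parts for row in part]

rows = parallel_scan(SHARDS)
```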

    Parallel Polyglot Query Processing on Heterogeneous Cloud Data Stores with LeanXcale

    The blooming of different cloud data stores has turned polystore systems into a major topic in today's cloud landscape. In particular, as the amount of processed data grows rapidly each year, much attention is being paid to taking advantage of the parallel processing capabilities of the underlying data stores. To provide data federation, a typical polystore solution defines a common data model and query language with translations to API calls or queries for each data store. However, this may lead to losing important querying capabilities. The polyglot approach of the CloudMdsQL query language allows data store native queries to be expressed as inline scripts and combined with regular SQL statements in ad-hoc integration queries. Moreover, efficient optimization techniques, such as the bind join, can still take place to improve the performance of selective joins. In this paper, we introduce the distributed architecture of the LeanXcale query engine, which processes polyglot queries in the CloudMdsQL query language while allowing native scripts to be handled in parallel at the data store shards, so that efficient and scalable parallel joins take place at the query engine level. The experimental evaluation of the LeanXcale parallel query engine on various join queries illustrates the performance benefits of exploiting the parallelism of the underlying data management technologies in combination with the high expressivity provided by their scripting/querying frameworks.

    D4.1 – AI-BASED DATA OPERATIONS V1

    This is the first of a series of deliverables related to the activities of WP4 ("AI-based Data Management for Green Data Operations"). Following the MobiSpaces Reference Architecture defined under the scope of T2.1 ("Design of Reference Architecture") and its current release reported in D2.1 ("Conceptual Model & Reference Architecture v1"), this document gives more details about one of the major architectural pillars, the AI-based Data Operations Toolbox.

The overall activities conducted under the scope of WP4 that are reported in this document mostly focus on this particular pillar, with the exception of T4.4 ("Privacy-driven Data Aggregation"). The latter is considered part of the Trustworthy Data Governance Services; however, its progress is reported in this series of deliverables, which summarize the activities of the whole WP4. In this document, we present the individual software components that are part of the AI-based Data Operations Toolbox, give details of their interactions and of the background technologies they are currently built upon, and provide a more detailed description of their internal building blocks.

WP4 focuses both on the data management aspects of MobiSpaces and on the data operations of the platform, in terms of automating the definition of AI workflows in a declarative manner, their runtime deployment, and the orchestration of their entire data lifecycle. The first category of components consists of the Data Management Toolset of the integrated solution, which offers a variety of different but complementary data management systems to be exploited by data users and application developers. For the second category of components, we provide the tools and algorithms for automating the definition and execution of complex AI workflows, consuming data from the aforementioned Data Management Toolset in a transparent manner. The target objective is to execute these workflows in an energy-efficient manner, using our novel resource allocator to reduce carbon emissions.

The duration of WP4 spans from M04 to M34. This deliverable reports the work conducted until M10, which accomplishes the milestone MS04 ("Software prototypes - Iteration I"). At this phase of the project, we have identified the internal building blocks of the AI-based Data Operations Toolbox and the details of their interactions, and we have delivered the first release of the corresponding prototypes. In this report our primary focus is on the individual evaluation of the components, while D2.7 ("AI-based Data Operations Toolbox v1") will later focus on the integrated solution based on the prototypes described here, to be evaluated by the project's use cases. Given the different maturity levels of the WP4 components at this moment, in this document we provide either initial evaluation results or a concrete evaluation plan to be followed during the next period. Two additional versions are planned for submission in M22 and M34, where the second and third releases of the prototypes will be available, giving more details of the implementation and the final evaluation and implementing all target objectives of WP4.

    D4.6 REUSABLE MODEL & ANALYTICAL TOOLS: SOFTWARE PROTOTYPE 3

    This document is the third and last software demonstrator deliverable of PolicyCLOUD, submitted at M34 (October 2022) of the project, and is intended for the reviewers of the software deliverables. It provides the description of the final software demonstrator for the components of the Integrated Data Acquisition and Analytics (DAA) Layer, which provides the analytical capabilities of the PolicyCLOUD platform. The components include the DAA API Gateway (responsible for the overall orchestration and the layer API), built-in analytical tools such as the new Trend Analysis tool as well as the analytical tools already presented in D4.4 [6] (Data Cleaning and Interoperability, Situational Knowledge, Opinion Mining & Sentiment Analysis, and Social Dynamics & Behavioural Data Analysis), and the Operational Data Repository. The full integration of the DAA API Gateway was demonstrated during the review of June 2021 for two use cases. As of the publication of this deliverable, this has been extended in multiple ways:

- Additional ingest analytics capabilities, such as Trend Analysis, were added, as detailed in sections 2.1.3 and 2.1.4.
- The Politika framework was newly integrated with PolicyCLOUD, as detailed in section 2.2.5 of D4.5 [3] and further detailed in sections 2.1.5.4 and 2.1.6 of this deliverable. The integrated Politika framework was itself used for two use cases, as detailed in section 2.2.6.2.
- The novel seamless architecture, introduced in D4.4 [6], which presents users with single logical datasets that can be explored with SQL, has been further enhanced in the past year with performance improvements to the SQL JOIN, as detailed in section 4.4.4 of D4.5.

The impact of using the seamless technology is minimal, as users do not even know in which storage tier their datasets are stored. The only impact is a change to the SQL statement, which now makes use of a table function that nonetheless accepts standard SQL statements, as detailed in section 2.2.8.3. This deliverable has been submitted to the EC and is not yet approved.

    D3.1 – Integrated Data Governance Operations v1

    This report is the first deliverable (D3.1) of Work Package 3 "Trustworthy and Transparent Data Governance" of the MobiSpaces project. It covers the MobiSpaces offerings with regard to data management at the Edge, data services for accessing trustworthy and reliable data, security enforcement and privacy compliance, as well as data sharing, exchange and interoperability.

The scope of this deliverable is to describe the basic data services that constitute the Data Governance Platform, one of the main outputs of MobiSpaces. The offered data services cover essential parts of data governance regarding all aspects of the data life cycle, the so-called "data path" in MobiSpaces terms. Starting from data acquisition, the MobiSpaces data services offer data cleaning operations, semantic modeling and representation in RDF (the de-facto standard on the WWW) in compliance with an ontology that covers the maritime and urban mobility domains, and data interlinking with external data sets (e.g., weather data). In this way, MobiSpaces offers a trustworthy and reliable process to generate curated, enriched data sets that respect the FAIR principles. Moreover, a dedicated data service tracks the provenance of data as soon as it enters the MobiSpaces Data Governance Platform. Also, to support interoperability and standardized data exchange with third parties and external organizations, MobiSpaces follows and complies with reference architectures for data spaces.

Apart from the necessary data services for trustworthy data governance, this deliverable presents data management solutions for processing data at the Edge, an innovative aspect of the project, as mobility data is typically generated in a decentralized way. This makes several of the offered data services applicable on edge devices, with important benefits related both to efficiency (by means of decentralized, in-situ processing) and to privacy preservation (by suppressing private information and avoiding its centralized assembly). Furthermore, a set of useful security tools is provided both for design time and for runtime: a security risk modeler helps to assess risks and propose mitigation strategies at design time, while a privacy compliance engine collects runtime information and detects vulnerabilities at runtime.

In summary, this deliverable presents a holistic approach to data governance, tailored to the mobility domain and respecting key principles of data spaces. The data governance services generate reliable and trustworthy mobility data, which can be shared with external organizations in a secure and trusted manner. This added value of generating enriched mobility data sets is the first key contribution of this work. To better fit the domain of mobility data management, the deliverable also offers infrastructure support for edge processing and storage, making our data services applicable to diverse settings where decentralized devices collect mobility data and our work appealing to urban and maritime use cases. This is another innovation and key contribution of our work.

    D5.1 – MOBILITY-AWARE DECENTRALIZED ANALYTICS v1

    The "Mobility-aware Decentralized Analytics v1" of the MobiSpaces project, hereafter referred to as Deliverable D5.1, reports the work performed under WP5 during its first reporting period (M4 – M10) regarding (i) mobility-aware learning at the edge, (ii) Federated Learning (FL) on spatiotemporal data, (iii) eXplainable AI (XAI) techniques for trustworthiness, fairness and explainability of the created models, (iv) visual analytics, and (v) an authorization and access control framework.

In particular, during the aforementioned period, the work of the participating partners, organised under five (5) tasks (Task 5.1 to Task 5.5), resulted in methods, tools, and services related to mobility data and aimed at their efficient large-scale processing and analytics. Some of the developments are considered preparatory infrastructure actions, intended to serve (coordinate, integrate, validate, etc.) results that will follow in the next reporting periods, while others constitute our first self-standing research results in mobility data analytics. In particular:

− the group of preparatory infrastructure actions includes a mobility-aware benchmarking pipeline tool (under Task 5.1), a stream processors and worker containers tool for FL client model management (under Task 5.2), a privacy-aware Visual Analytics (VA) backend/frontend tool (under Task 5.4), and a coordinated authorization and access control mechanism (under Task 5.5), whereas

− a simulated FL environment and a cross-silo federated forecasting method called FedVRF (both under Task 5.2), and an XAI library for time series (under Task 5.3) are considered self-standing research results.

Regarding the relation to the use cases, specs and data from all four (4) use cases were exploited, including time series and movement data (UC1/2), along with air quality measurements and traffic emissions (UC2) and vessel tracking AIS data (UC3/4). The majority of the above results are oriented to feed the Edge Analytics Suite under development in MobiSpaces, hence they are demonstrated in deliverable D2.10; the only exceptions are the benchmarking pipeline tool and the coordinated authorization and access control mechanism, which are considered parts of the Green and Environmental Dimensioning Workbench and the Data Governance Platform, respectively (and are thus demonstrated in deliverables D2.16 and D2.4, respectively).

    PolicyCLOUD: Analytics as a Service Facilitating Efficient Data-Driven Public Policy Management

    While several application domains are exploiting the added value of analytics over various datasets to obtain actionable insights and drive decision making, the public policy management domain has not yet taken advantage of the full potential of such analytics and data models. Diverse and heterogeneous datasets are being generated from various sources, which could be utilized across the complete policy lifecycle (i.e. modelling, creation, evaluation and optimization) to realize efficient policy management. To this end, in this paper we present the overall architecture of a cloud-based environment that facilitates data retrieval and analytics, as well as policy modelling, creation and optimization. The environment enables data collection from heterogeneous sources, linking and aggregation, complemented with data cleaning and interoperability techniques, in order to make the data ready for use. An innovative approach for analytics as a service is introduced and linked with a policy development toolkit, an integrated web-based environment that fulfils the requirements of the public policy ecosystem stakeholders.