Differentiated Multiple Aggregations in Multidimensional Databases
Many models have been proposed for modeling multidimensional data warehouses, and most consider a single function to determine how measure values are aggregated across different levels of data detail. We provide a conceptual model that supports (1) multiple aggregations, associating with the same measure a different aggregation function according to analysis axes or hierarchies, and (2) differentiated aggregation, allowing specific aggregations at each detail level. Our model is based on a graphical formalism that allows controlling the validity of aggregation functions (distributive, algebraic, or holistic). We also show how conceptual modeling can be used, in an R-OLAP environment, for building lattices of pre-computed aggregates.
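As a rough illustration of the multiple-aggregation idea (not the paper's graphical formalism), the following Python/pandas sketch attaches a different aggregation function to the same measure depending on which analysis axis is rolled up; the data, column names, and aggregation map are all hypothetical.

# Illustrative sketch only: one measure ("stock_level"), with a different
# aggregation function declared for each axis that can be rolled up.
import pandas as pd

sales = pd.DataFrame({
    "store": ["S1", "S1", "S2", "S2"],
    "month": ["Jan", "Feb", "Jan", "Feb"],
    "stock_level": [10, 12, 7, 9],   # the measure
})

# Hypothetical per-axis aggregation map: summing stock over stores is valid,
# but rolling up over time should average (an algebraic function), not sum.
AGG_BY_AXIS = {"store": "sum", "month": "mean"}

def roll_up(df, drop_axis, keep_axis, measure="stock_level"):
    """Remove drop_axis, aggregating the measure with the function declared for that axis."""
    return df.groupby(keep_axis)[measure].agg(AGG_BY_AXIS[drop_axis])

print(roll_up(sales, drop_axis="month", keep_axis="store"))  # average stock per store
print(roll_up(sales, drop_axis="store", keep_axis="month"))  # total stock per month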
An Integrated Big and Fast Data Analytics Platform for Smart Urban Transportation Management
Smart urban transportation management can be considered a multifaceted big data challenge. It relies strongly on the information collected from multiple, widespread, and heterogeneous data sources, as well as on the ability to extract actionable insights from them. Besides data, full-stack (from platform to services and applications) Information and Communications Technology (ICT) solutions need to be specifically adopted to address smart city challenges. Smart urban transportation management is one of the key use cases addressed in the context of the EUBra-BIGSEA (Europe-Brazil Collaboration of Big Data Scientific Research through Cloud-Centric Applications) project. This paper specifically focuses on the City Administration Dashboard, a public transport analytics application that has been developed on top of the EUBra-BIGSEA platform and used by stakeholders of the Municipality of Curitiba, Brazil, to tackle urban traffic data analysis and planning challenges. The solution proposed in this paper joins together a scalable big and fast data analytics platform, a flexible and dynamic cloud infrastructure, data quality and entity matching algorithms, as well as security and privacy techniques. By exploiting an interoperable programming framework based on a Python Application Programming Interface (API), it allows easy, rapid, and transparent development of smart city applications.
This work was supported by the European Commission through the Cooperation Programme under EUBra-BIGSEA Horizon 2020 Grant 690116 [this project results from the 3rd BR-EU Coordinated Call in Information and Communication Technologies (ICT), announced by the Ministry of Science, Technology and Innovation (MCTI)].
Fiore, S.; Elia, D.; Pires, C. E.; Mestre, D. G.; Cappiello, C.; Vitali, M.; Andrade, N.; et al. (2019). An Integrated Big and Fast Data Analytics Platform for Smart Urban Transportation Management. IEEE Access, 7:117652-117677. https://doi.org/10.1109/ACCESS.2019.2936941
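The abstract mentions a Python-API-based programming framework; the following is a generic PySpark-style sketch of the kind of transport analytics such a platform supports. The dataset path and column names are hypothetical, and the actual EUBra-BIGSEA API is not shown.

# Generic PySpark sketch (hypothetical data and column names).
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("bus-delay-analytics").getOrCreate()

# Hypothetical GPS/ticketing feed for city buses.
trips = spark.read.parquet("s3://example-bucket/curitiba/bus_trips/")

# Average departure delay (seconds) per route and hour of day.
avg_delay = (trips
    .groupBy("route_id", F.hour("scheduled_departure").alias("hour"))
    .agg(F.avg(F.col("actual_departure").cast("long")
               - F.col("scheduled_departure").cast("long")).alias("avg_delay_s")))

avg_delay.orderBy(F.desc("avg_delay_s")).show(10)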
Enabling Efficient and General Subpopulation Analytics in Multidimensional Data Streams
Today's large-scale services (e.g., video streaming platforms, data centers, sensor grids) need diverse real-time summary statistics across multiple subpopulations of multidimensional datasets. However, state-of-the-art frameworks do not offer general and accurate analytics in real time at reasonable costs. The root cause is the combinatorial explosion of data subpopulations and the diversity of summary statistics we need to monitor simultaneously. We present Hydra, an efficient framework for multidimensional analytics based on a novel combination of a "sketch of sketches", which avoids the overhead of monitoring exponentially many subpopulations, and universal sketching, which ensures accurate estimates for multiple statistics. We build Hydra as an Apache Spark plugin and address practical system challenges to minimize overheads at scale. Across multiple real-world and synthetic multidimensional datasets, we show that Hydra achieves robust error bounds and is an order of magnitude more efficient in terms of operational cost and memory footprint than existing frameworks (e.g., Spark, Druid), while ensuring interactive estimation times.
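A toy illustration of the "sketch of sketches" idea described above (not Hydra's actual implementation): a fixed grid of buckets is indexed by hashing the subpopulation key, and each bucket holds its own inner sketch, so memory stays bounded no matter how many subpopulations appear; colliding subpopulations share (and slightly pollute) an inner sketch. The inner sketch here is a plain Count-Min rather than the universal sketch Hydra relies on, and all names are illustrative.

# Simplified Python sketch; not Hydra's code.
import hashlib

class CountMin:
    def __init__(self, rows=4, cols=512):
        self.rows, self.cols = rows, cols
        self.table = [[0] * cols for _ in range(rows)]
    def _idx(self, item, r):
        h = hashlib.blake2b(f"{r}:{item}".encode(), digest_size=8).digest()
        return int.from_bytes(h, "little") % self.cols
    def add(self, item, count=1):
        for r in range(self.rows):
            self.table[r][self._idx(item, r)] += count
    def estimate(self, item):
        return min(self.table[r][self._idx(item, r)] for r in range(self.rows))

class SketchOfSketches:
    """Outer grid of buckets, each holding an inner sketch for the subpopulations hashed to it."""
    def __init__(self, rows=2, cols=64):
        self.rows, self.cols = rows, cols
        self.grid = [[CountMin() for _ in range(cols)] for _ in range(rows)]
    def _idx(self, key, r):
        h = hashlib.blake2b(f"{r}:{key}".encode(), digest_size=8).digest()
        return int.from_bytes(h, "little") % self.cols
    def add(self, subpop_key, item):
        for r in range(self.rows):
            self.grid[r][self._idx(subpop_key, r)].add(item)
    def estimate(self, subpop_key, item):
        return min(self.grid[r][self._idx(subpop_key, r)].estimate(item)
                   for r in range(self.rows))

# Usage: per-item frequency estimate within the ("us-east", "video") subpopulation.
s = SketchOfSketches()
s.add(("us-east", "video"), "user42")
print(s.estimate(("us-east", "video"), "user42"))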
An Artificial Intelligence Framework for Supporting Coarse-Grained Workload Classification in Complex Virtual Environments
Cloud-based machine learning tools for enhanced Big Data applications: the main idea is that of predicting the "next" workload occurring against the target Cloud infrastructure via an innovative ensemble-based approach that combines the effectiveness of different well-known classifiers in order to enhance the overall accuracy of the final classification, which is very relevant now in the specific context of Big Data. The so-called workload categorization problem plays a critical role in improving the efficiency and reliability of Cloud-based big data applications. Implementation-wise, our method proposes deploying the Cloud entities that participate in the distributed classification approach on top of virtual machines, which represent classical "commodity" settings for Cloud-based big data applications. Given a number of known reference workloads and an unknown workload, in this paper we deal with the problem of finding the reference workload which is most similar to the unknown one. The depicted scenario turns out to be useful in a plethora of modern information system applications. We name this problem coarse-grained workload classification because, instead of characterizing the unknown workload in terms of finer behaviors, such as CPU-, memory-, disk-, or network-intensive patterns, we classify the whole unknown workload as one of the (possible) reference workloads. Reference workloads represent a category of workloads that are relevant in a given applicative environment. In particular, we focus our attention on the classification problem described above in the special case represented by virtualized environments. Today, Virtual Machines (VMs) have become very popular because they offer important advantages to modern computing environments such as cloud computing or server farms. In virtualization frameworks, workload classification is very useful for accounting, security reasons, or user profiling. Hence, our research makes particular sense in such environments, and it turns out to be very useful in a special context like Cloud Computing, which is emerging now. In this respect, our approach consists of running several machine learning-based classifiers on different workload models, and then deriving the best classifier produced by Dempster-Shafer Fusion, in order to magnify the accuracy of the final classification. Experimental assessment and analysis clearly confirm the benefits derived from our classification framework. The running programs which produce the unknown workloads to be classified are treated in a similar way. A fundamental aspect of this paper concerns the successful use of data fusion in workload classification. Different types of metrics are in fact fused together using the Dempster-Shafer theory of evidence combination, giving a classification accuracy of slightly less than …. The acquisition of data from the running process, the pre-processing algorithms, and the workload classification are described in detail. Various classical algorithms have been used to classify the workloads, and the results are compared.
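As a minimal sketch of how Dempster-Shafer evidence combination can fuse classifier outputs (not the paper's implementation), the following assumes each classifier assigns mass only to singleton classes, in which case Dempster's rule reduces to a normalized product of the agreeing masses; the reference-workload names and mass values are hypothetical.

# Minimal Dempster-Shafer fusion sketch (singleton-mass special case).
def dempster_combine(m1, m2):
    """Combine two mass functions defined on the same singleton classes."""
    combined = {c: m1[c] * m2[c] for c in m1}   # mass on agreeing hypotheses
    conflict = 1.0 - sum(combined.values())     # mass lost to conflicting evidence
    if conflict >= 1.0:
        raise ValueError("Total conflict: sources are incompatible")
    return {c: v / (1.0 - conflict) for c, v in combined.items()}

# Hypothetical per-reference-workload masses from two base classifiers.
clf_a = {"web-server": 0.6, "batch-analytics": 0.3, "db-oltp": 0.1}
clf_b = {"web-server": 0.5, "batch-analytics": 0.4, "db-oltp": 0.1}

fused = dempster_combine(clf_a, clf_b)
print(max(fused, key=fused.get), fused)   # fused decision and its normalized masses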