Emerging approaches for data-driven innovation in Europe: Sandbox experiments on the governance of data and technology
Europe’s digital transformation of the economy and society is one of the priorities of the current Commission and is framed by the European strategy for data. This strategy aims to create a single market for data through the establishment of a common European data space, built in turn on domain-specific data spaces in strategic sectors such as environment, agriculture, industry, health and transportation. Acknowledging the key role that emerging technologies and innovative approaches to data sharing and use can play in making European data spaces a reality, this document presents a set of experiments that explore emerging technologies and tools for data-driven innovation, and also examines in depth the socio-technical factors and forces at work in data-driven innovation. The experimental results yield lessons learned and practical recommendations towards the establishment of European data spaces.
Predicting model training time to optimize distributed machine learning applications
Despite major advances in recent years, the field of Machine Learning continues to face research and technical challenges. Most of these stem from big data and streaming data, which require models to be frequently updated or re-trained, at the expense of significant computational resources. One solution is the use of distributed learning algorithms, which can learn in a distributed manner, from distributed datasets. In this paper, we describe CEDEs—a distributed learning system in which models are heterogeneous distributed Ensembles, i.e., complex models constituted by different base models, trained with different and distributed subsets of data. Specifically, we address the issue of predicting the training time of a given model, given its characteristics and the characteristics of the data. Given that the creation of an Ensemble may imply the training of hundreds of base models, information about the predicted duration of each of these individual tasks is paramount for the efficient management of the cluster’s computational resources and for minimizing makespan, i.e., the time it takes to train the whole Ensemble. Results show that the proposed approach is able to predict the training time of Decision Trees with an average error of 0.103 s, and the training time of Neural Networks with an average error of 21.263 s. We also show how results depend significantly on the hyperparameters of the model and on the characteristics of the input data. This work has been supported by national funds through FCT – Fundação para a Ciência e Tecnologia through projects UIDB/04728/2020, EXPL/CCI-COM/0706/2021, and CPCA-IAC/AV/475278/2022.
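The idea of estimating a base model's training time from model and data characteristics can be sketched as a simple regression over historical runs. This is only a minimal illustration, not the CEDEs implementation: the single meta-feature (rows × tree depth), the `fit_line` helper, and the synthetic timings below are all illustrative assumptions.

```python
# Hypothetical sketch: predict Decision Tree training time from one
# meta-feature (rows * max_depth) using ordinary least squares on
# previously observed (meta-feature, seconds) pairs.
def fit_line(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    slope = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
             / sum((x - mx) ** 2 for x in xs))
    return slope, my - slope * mx

# synthetic history of past training runs (illustrative numbers)
history = [(1e4, 0.12), (5e4, 0.55), (1e5, 1.10), (2e5, 2.21)]
xs, ys = zip(*history)
a, b = fit_line(xs, ys)

def predict_training_time(meta_feature):
    """Estimated seconds to train a model with this meta-feature."""
    return a * meta_feature + b

print(predict_training_time(1.5e5))
```

A scheduler could then sort pending base-model training tasks by their predicted durations before placing them on cluster nodes, which is the kind of makespan-oriented decision the abstract motivates.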
FBsim and the Fully Buffered DIMM Memory System Architecture
As DRAM device data rates increase to keep pace with ever-increasing memory request rates, parallel bus limitations and cost constraints require a sharp decrease in load on the multi-drop buses between the devices and the memory controller, thus limiting the memory system's scalability and failing to meet the capacity requirements of modern server and workstation applications.
A new technology, the Fully Buffered DIMM (FB-DIMM) architecture, is currently being introduced to address these challenges. FB-DIMM uses narrower, faster, buffered point-to-point channels to meet memory capacity and throughput requirements at the price of latency.
This study provides a detailed look at the proposed architecture and its adoption, introduces an FB-DIMM simulation model - the FBsim simulator - and uses it to explore the design space of this new technology, identifying and experimentally demonstrating some of its strengths, weaknesses and limitations, and uncovering future paths of academic research into the field.
PiCo: A Domain-Specific Language for Data Analytics Pipelines
In the world of Big Data analytics, there is a series of tools aiming at simplifying the programming of applications to be executed on clusters. Although each tool claims to provide better programming, data and execution models—for which only informal (and often confusing) semantics is generally provided—all share a common underlying model, namely, the Dataflow model. Using this model as a starting point, it is possible to categorize and analyze almost all aspects of Big Data analytics tools from a high-level perspective. This analysis can be considered a first step toward a formal model to be exploited in the design of a (new) framework for Big Data analytics. By putting clear separations between all levels of abstraction (i.e., from the runtime to the user API), it is easier for a programmer or software designer to avoid mixing low-level with high-level aspects, as is often seen in state-of-the-art Big Data analytics frameworks.
From the user-level perspective, we think that a clear and simple semantics is preferable, together with a strong separation of concerns. For this reason, we use the Dataflow model as a starting point to build a programming environment with a simplified programming model implemented as a Domain-Specific Language, built on top of a stack of layers that constitute a prototypical framework for Big Data analytics.
The contribution of this thesis is twofold: first, we show that the proposed model is (at least) as general as existing batch and streaming frameworks (e.g., Spark, Flink, Storm, Google Dataflow), thus making it easier to understand high-level data-processing applications written in such frameworks. As a result of this analysis, we provide a layered model that can represent tools and applications following the Dataflow paradigm, and we show how the analyzed tools fit in each level.
Second, we propose a programming environment based on such a layered model in the form of a Domain-Specific Language (DSL) for processing data collections, called PiCo (Pipeline Composition). The main entity of this programming model is the Pipeline, basically a DAG-composition of processing elements. This model is intended to give the user a unique interface for both stream and batch processing, completely hiding data management and focusing only on operations, which are represented by Pipeline stages. Our DSL will be built on top of the FastFlow library, exploiting both shared and distributed parallelism, and implemented in C++11/14 with the aim of porting C++ into the Big Data world.
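The Pipeline-as-DAG idea can be illustrated with a toy sketch. Note this is written in Python rather than the C++11/14 of the actual DSL, and `Pipeline`, `then`, `fmap`, and `ffilter` are illustrative names, not PiCo's API; it only shows the shape of composing operator stages while keeping data management out of user code.

```python
# Hypothetical sketch of pipeline composition: a Pipeline is a linear DAG
# of stages; the user composes operators and never touches data storage.
class Pipeline:
    def __init__(self, *stages):
        self.stages = list(stages)

    def then(self, stage):
        # compose a new pipeline with one more stage appended
        return Pipeline(*self.stages, stage)

    def run(self, collection):
        # batch execution: thread the collection through every stage
        for stage in self.stages:
            collection = stage(collection)
        return list(collection)

def fmap(f):
    return lambda items: map(f, items)

def ffilter(p):
    return lambda items: filter(p, items)

p = Pipeline(fmap(lambda x: x * x)).then(ffilter(lambda x: x % 2 == 0))
print(p.run([1, 2, 3, 4]))  # → [4, 16]
```

Because stages are just functions from collections to collections, the same composition could in principle be driven by a streaming source instead of a list, which is the unified batch/stream interface the abstract describes.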
Intermediary XML schemas
The methodology of intermediary XML schemas is introduced and its application to complex metadata environments is explored. Intermediary schemas are designed to mediate to other ‘referent’ schemas: instances conforming to these are not generally intended for dissemination but must usually be realized by XSLT transformations for delivery. In some cases, these schemas may also generate instances conforming to themselves. Three subsidiary methods of this methodology are introduced. The first is application-specific schemas that act as intermediaries to established schemas which are problematic by virtue of their over-complexity or flexibility. The second employs the METS packaging standard as a template for navigating instances of a complex schema by defining an abstract map of its instances. The third employs the METS structural map to define templates or conceptual models from which instances of metadata for complex applications may be realized by XSLT transformations. The first method is placed in the context of earlier approaches to semantic interoperability such as crosswalks, switching across, derivation and application profiles. The second is discussed in the context of such methods for mapping complex objects as OAI-ORE and the Fedora Content Model Architecture. The third is examined in relation to earlier approaches to templating within XML architectures. The relevance of these methods to contemporary research is discussed in three areas: digital ecosystems, archival description and Linked Open Data in digital asset management and preservation. Their relevance to future research is discussed in the form of suggested enhancements to each, a possible synthesis of the second and third to overcome possible problems of interoperability presented by the first, and their potential role in future developments in digital preservation. 
This methodology offers an original approach to resolving issues of interoperability and the management of complex metadata environments; it significantly extends earlier techniques and does so entirely within XML architectures.
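The core mediation step, realizing an intermediary-schema instance as an instance of a referent schema, can be sketched as a field-by-field transformation. Real deployments use XSLT, as the abstract states; the Python stand-in below, with its made-up `record`/`referent` element names and mapping table, is only an illustration of the idea.

```python
# Hypothetical sketch: realize an intermediary-schema instance as a
# referent-schema record by mapping element names (XSLT would do this
# declaratively in production).
import xml.etree.ElementTree as ET

intermediary = ET.fromstring(
    "<record><name>Sample Title</name><maker>A. Author</maker></record>"
)

def realize(src):
    # illustrative mapping from intermediary tags to referent tags
    mapping = {"name": "title", "maker": "creator"}
    out = ET.Element("referent")
    for child in src:
        ET.SubElement(out, mapping[child.tag]).text = child.text
    return out

result = realize(intermediary)
print(ET.tostring(result).decode())
```

Keeping the mapping in one place mirrors the methodology's point that intermediary instances are not disseminated directly but are realized by a transformation for delivery.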
Ontologies and datasets for energy measurement and validation interoperability
This document presents a final report of the work carried out as part of work package 3 of the READY4SmartCities project, whose goal is to identify the knowledge and data resources that support interoperability for energy measurement and validation. The document is divided into two parts.
Ontologies and datasets for energy management system interoperability
This document presents a final report of the work carried out as part of work package 2 of the READY4SmartCities project (R4SC), whose goal is to identify the knowledge and data resources that support interoperability for energy management systems. The document is divided into two parts.