279 research outputs found
Resiliency in numerical algorithm design for extreme scale simulations
This work is based on the seminar titled ‘Resiliency in Numerical Algorithm Design for Extreme Scale Simulations’ held March 1–6, 2020, at Schloss Dagstuhl, that was attended by all the authors. Advanced supercomputing is characterized by very high computation speeds at the cost of involving an enormous amount of resources and costs. A typical large-scale computation running for 48 h on a system consuming 20 MW, as predicted for exascale systems, would consume a million kWh, corresponding to about 100k Euro in energy cost for executing 1023 floating-point operations. It is clearly unacceptable to lose the whole computation if any of the several million parallel processes fails during the execution. Moreover, if a single operation suffers from a bit-flip error, should the whole computation be declared invalid? What about the notion of reproducibility itself: should this core paradigm of science be revised and refined for results that are obtained by large-scale simulation? Naive versions of conventional resilience techniques will not scale to the exascale regime: with a main memory footprint of tens of Petabytes, synchronously writing checkpoint data all the way to background storage at frequent intervals will create intolerable overheads in runtime and energy consumption. Forecasts show that the mean time between failures could be lower than the time to recover from such a checkpoint, so that large calculations at scale might not make any progress if robust alternatives are not investigated. More advanced resilience techniques must be devised. The key may lie in exploiting both advanced system features as well as specific application knowledge. Research will face two essential questions: (1) what are the reliability requirements for a particular computation and (2) how do we best design the algorithms and software to meet these requirements? While the analysis of use cases can help understand the particular reliability requirements, the construction of remedies is currently wide open. One avenue would be to refine and improve on system- or application-level checkpointing and rollback strategies in the case an error is detected. Developers might use fault notification interfaces and flexible runtime systems to respond to node failures in an application-dependent fashion. Novel numerical algorithms or more stochastic computational approaches may be required to meet accuracy requirements in the face of undetectable soft errors. These ideas constituted an essential topic of the seminar. The goal of this Dagstuhl Seminar was to bring together a diverse group of scientists with expertise in exascale computing to discuss novel ways to make applications resilient against detected and undetected faults. In particular, participants explored the role that algorithms and applications play in the holistic approach needed to tackle this challenge. This article gathers a broad range of perspectives on the role of algorithms, applications and systems in achieving resilience for extreme scale simulations. The ultimate goal is to spark novel ideas and encourage the development of concrete solutions for achieving such resilience holistically.Peer Reviewed"Article signat per 36 autors/es: Emmanuel Agullo, Mirco Altenbernd, Hartwig Anzt, Leonardo Bautista-Gomez, Tommaso Benacchio, Luca Bonaventura, Hans-Joachim Bungartz, Sanjay Chatterjee, Florina M. Ciorba, Nathan DeBardeleben, Daniel Drzisga, Sebastian Eibl, Christian Engelmann, Wilfried N. Gansterer, Luc Giraud, Dominik G ̈oddeke, Marco Heisig, Fabienne Jezequel, Nils Kohl, Xiaoye Sherry Li, Romain Lion, Miriam Mehl, Paul Mycek, Michael Obersteiner, Enrique S. Quintana-Ortiz,
Francesco Rizzi, Ulrich Rude, Martin Schulz, Fred Fung, Robert Speck, Linda Stals, Keita Teranishi, Samuel Thibault, Dominik Thonnes, Andreas Wagner and Barbara Wohlmuth"Postprint (author's final draft
High performance data processing
Dissertação de mestrado em Informatics EngeneeringÀ medida que as aplicações atingem uma maior quantidade de utilizadores, precisam de processar uma crescente quantidade de pedidos. Para além disso, precisam de muitas vezes satisfazer pedidos de utilizadores de diferentes partes do globo, onde
as latências de rede têm um impacto significativo no desempenho em instalações
monolíticas. Portanto, distribuição é uma solução muito procurada para melhorar a
performance das camadas aplicacional e de dados. Contudo, distribuir dados não é
uma tarefa simples se pretendemos assegurar uma forte consistência. Isto leva a que
muitos sistemas de base de dados dependam de protocolos de sincronização pesados,
como two-phase commit, consenso distribuído, bloqueamento distribuído, entre outros,
enquanto que outros sistemas dependem em consistência fraca, não viável para alguns
casos de uso.
Esta tese apresenta o design, implementação e avaliação de duas soluções que
têm como objetivo reduzir o impacto de assegurar garantias de forte consistência
em sistemas de base de dados, especialmente aqueles distribuídos pelo globo. A
primeira é o Primary Semi-Primary, uma arquitetura de base de dados distribuída
com total replicação que permite que as réplicas evoluam independentemente, para
evitar que os clientes precisem de esperar que escritas precedentes que não geram
conflitos sejam propagadas. Apesar das réplicas poderem processar tanto leituras
como escritas, melhorando a escalabilidade, o sistema continua a oferecer garantias de
consistência forte, através do envio da certificação de transações para um nó central.
O seu design é independente de modelos de dados, mas a sua implementação pode
tirar partido do controlo de concorrência nativo oferecido por algumas base de dados,
como é mostrado na implementação usando PostgreSQL e o seu Snapshot Isolation.
Os resultados apresentam várias vantagens tanto em ambientes locais como globais. A
segunda solução são os Multi-Record Values, uma técnica que particiona dinâmicamente
valores numéricos em múltiplos registros, permitindo que escritas concorrentes possam
executar com uma baixa probabilidade de colisão, reduzindo a taxa de abortos e/ou
contenção na adquirição de locks. Garantias de limites inferiores, exigido por objetos
como saldos bancários ou inventários, são assegurados por esta estratégia, ao contrário
de muitas outras alternativas. O seu design é também indiferente do modelo de dados,
sendo que as suas vantagens podem ser encontradas em sistemas SQL e NoSQL, bem
como distribuídos ou centralizados, tal como apresentado na secção de avaliação.As applications reach an wider audience that ever before, they must process larger and larger amounts of requests. In addition, they often must be able to serve users all over the globe, where network latencies have a significant negative impact on
monolithic deployments. Therefore, distribution is a well sought-after solution to
improve performance of both applicational and database layers. However, distributing
data is not an easy task if we want to ensure strong consistency guarantees. This leads
many databases systems to rely on expensive synchronization controls protocols such
as two-phase commit, distributed consensus, distributed locking, among others, while
other systems rely on weak consistency, unfeasible for some use cases.
This thesis presents the design, implementation and evaluation of two solutions
aimed at reducing the impact of ensuring strong consistency guarantees on database
systems, especially geo-distributed ones. The first is the Primary Semi-Primary, a full replication distributed database architecture that allows different replicas to evolve
independently, to avoid that clients wait for preceding non-conflicting updates. Al though replicas can process both reads and writes, improving scalability, the system
still ensures strong consistency guarantees, by relaying transactions’ certifications
to a central node. Its design is independent of the underlying data model, but its
implementation can take advantage of the native concurrency control offered by some
systems, as is exemplified by an implementation using PostgreSQL and its Snapshot
Isolation. The results present several advantages in both throughput and response time,
when comparing to other alternative architectures, in both local and geo-distributed
environments. The second solution is the Multi-Record Values, a technique that dynami cally partitions numeric values into multiple records, allowing concurrent writes to
execute with low conflict probability, reducing abort rate and/or locking contention.
Lower limit guarantees, required by objects such as balances or stocks, are ensure by
this strategy, unlike many other similar alternatives. Its design is also data model
agnostic, given its advantages can be found in both SQL and NoSQL systems, as well
as both centralized and distributed database, as presented in the evaluation section
A Systematic Approach to Constructing Incremental Topology Control Algorithms Using Graph Transformation
Communication networks form the backbone of our society. Topology control
algorithms optimize the topology of such communication networks. Due to the
importance of communication networks, a topology control algorithm should
guarantee certain required consistency properties (e.g., connectivity of the
topology), while achieving desired optimization properties (e.g., a bounded
number of neighbors). Real-world topologies are dynamic (e.g., because nodes
join, leave, or move within the network), which requires topology control
algorithms to operate in an incremental way, i.e., based on the recently
introduced modifications of a topology. Visual programming and specification
languages are a proven means for specifying the structure as well as
consistency and optimization properties of topologies. In this paper, we
present a novel methodology, based on a visual graph transformation and graph
constraint language, for developing incremental topology control algorithms
that are guaranteed to fulfill a set of specified consistency and optimization
constraints. More specifically, we model the possible modifications of a
topology control algorithm and the environment using graph transformation
rules, and we describe consistency and optimization properties using graph
constraints. On this basis, we apply and extend a well-known constructive
approach to derive refined graph transformation rules that preserve these graph
constraints. We apply our methodology to re-engineer an established topology
control algorithm, kTC, and evaluate it in a network simulation study to show
the practical applicability of our approachComment: This document corresponds to the accepted manuscript of the
referenced journal articl
Recommended from our members
Compiler and system for resilient distributed heterogeneous graph analytics
Graph analytics systems are used in a wide variety of applications including health care, electronic circuit design, machine learning, and cybersecurity. Graph analytics systems must handle very large graphs such as the Facebook friends graph, which has more than a billion nodes and 200 billion edges. Since machines have limited main memory, distributed-memory clusters with sufficient memory and computation power are required for processing of these graphs. In distributed graph analytics, the graph is partitioned among the machines in a cluster, and communication between partitions is implemented using a substrate like MPI. However, programming distributed-memory systems are not easy and the recent trend towards the processor heterogeneity has added to this complexity. To simplify the programming of graph applications on such platforms, this dissertation first presents a compiler called Abelian that translates shared-memory descriptions of graph algorithms written in the Galois programming model into efficient code for distributed-memory platforms with heterogeneous processors. An important runtime parameter to the compiler-generated distributed code is the partitioning policy. We present an experimental study of partitioning strategies for distributed work-efficient graph analytics applications on different CPU architecture clusters at large scale (up to 256 machines). Based on the study we present a simple rule of thumb to select among myriad policies. Another challenge of distributed graph analytics that we address in this dissertation is to deal with machine fail-stop failures, which is an important concern especially for long-running graph analytics applications on large clusters. We present a novel communication and synchronization substrate called Phoenix that leverages the algorithmic properties of graph analytics applications to recover from faults with zero overheads during fault-free execution and show that Phoenix is 24x faster than previous state-of-the-art systems. In this dissertation, we also look at the new opportunities for graph analytics on massive datasets brought by a new kind of byte-addressable memory technology with higher density and lower cost than DRAM such as intel Optane DC Persistent Memory. This enables the design of affordable systems that support up to 6TB of randomly accessible memory. In this dissertation, we present key runtime and algorithmic principles to consider when performing graph analytics on massive datasets on Optane DC Persistent Memory as well as highlight ideas that apply to graph analytics on all large-memory platforms. Finally, we show that our distributed graph analytics infrastructure can be used for a new domain of applications, in particular, embedding algorithms such as Word2Vec. Word2Vec trains the vector representations of words (also known as word embeddings) on large text corpus and resulting vector embeddings have been shown to capture semantic and syntactic relationships among words. Other examples include Node2Vec, Code2Vec, Sequence2Vec, etc (collectively known as Any2Vec) with a wide variety of uses. We formulate the training of such applications as a graph problem and present GraphAny2Vec, a distributed Any2Vec training framework that leverages the state-of-the-art distributed heterogeneous graph analytics infrastructure developed in this dissertation to scale Any2Vec training to large distributed clusters. GraphAny2Vec also demonstrates a novel way of combining model gradients during training, which allows it to scale without losing accuracyComputer Science
Seeing the City Digitally
This book explores what's happening to ways of seeing urban spaces in the contemporary moment, when so many of the technologies through which cities are visualised are digital. Cities have always been pictured, in many media and for many different purposes. This edited collection explores how that picturing is changing in an era of digital visual culture. Analogue visual technologies like film cameras were understood as creating some sort of a trace of the real city. Digital visual technologies, in contrast, harvest and process digital data to create images that are constantly refreshed, modified and circulated. Each of the chapters in this volume examines a different example of this processual visuality is reconfiguring the spatial and temporal organisation of urban life
Seeing the City Digitally
This book explores what's happening to ways of seeing urban spaces in the contemporary moment, when so many of the technologies through which cities are visualised are digital. Cities have always been pictured, in many media and for many different purposes. This edited collection explores how that picturing is changing in an era of digital visual culture. Analogue visual technologies like film cameras were understood as creating some sort of a trace of the real city. Digital visual technologies, in contrast, harvest and process digital data to create images that are constantly refreshed, modified and circulated. Each of the chapters in this volume examines a different example of this processual visuality is reconfiguring the spatial and temporal organisation of urban life
Efficient Passive Clustering and Gateways selection MANETs
Passive clustering does not employ control packets to collect topological information in ad hoc networks. In our proposal, we avoid making frequent changes in cluster architecture due to repeated election and re-election of cluster heads and gateways. Our primary objective has been to make Passive Clustering more practical by employing optimal number of gateways and reduce the number of rebroadcast packets
Recommended from our members
Proceedings of the 1st International Conference on Live Coding
Open Access peer reviewed papers on live coding published at the 1st International Conference on Live Coding (ICLC) in Leeds
Recommended from our members
Improving System Reliability for Cyber-Physical Systems
Cyber-physical systems (CPS) are systems featuring a tight combination of, and coordination between, the system's computational and physical elements. Cyber-physical systems include systems ranging from critical infrastructure such as a power grid and transportation system to health and biomedical devices. System reliability, i.e., the ability of a system to perform its intended function under a given set of environmental and operational conditions for a given period of time, is a fundamental requirement of cyber-physical systems. An unreliable system often leads to disruption of service, financial cost and even loss of human life. An important and prevalent type of cyber-physical system meets the following criteria: processing large amounts of data; employing software as a system component; running online continuously; having operator-in-the-loop because of human judgment and an accountability requirement for safety critical systems. This thesis aims to improve system reliability for this type of cyber-physical system. To improve system reliability for this type of cyber-physical system, I present a system evaluation approach entitled automated online evaluation (AOE), which is a data-centric runtime monitoring and reliability evaluation approach that works in parallel with the cyber-physical system to conduct automated evaluation along the workflow of the system continuously using computational intelligence and self-tuning techniques and provide operator-in-the-loop feedback on reliability improvement. For example, abnormal input and output data at or between the multiple stages of the system can be detected and flagged through data quality analysis. As a result, alerts can be sent to the operator-in-the-loop. The operator can then take actions and make changes to the system based on the alerts in order to achieve minimal system downtime and increased system reliability. One technique used by the approach is data quality analysis using computational intelligence, which applies computational intelligence in evaluating data quality in an automated and efficient way in order to make sure the running system perform reliably as expected. Another technique used by the approach is self-tuning which automatically self-manages and self-configures the evaluation system to ensure that it adapts itself based on the changes in the system and feedback from the operator. To implement the proposed approach, I further present a system architecture called autonomic reliability improvement system (ARIS). This thesis investigates three hypotheses. First, I claim that the automated online evaluation empowered by data quality analysis using computational intelligence can effectively improve system reliability for cyber-physical systems in the domain of interest as indicated above. In order to prove this hypothesis, a prototype system needs to be developed and deployed in various cyber-physical systems while certain reliability metrics are required to measure the system reliability improvement quantitatively. Second, I claim that the self-tuning can effectively self-manage and self-configure the evaluation system based on the changes in the system and feedback from the operator-in-the-loop to improve system reliability. Third, I claim that the approach is efficient. It should not have a large impact on the overall system performance and introduce only minimal extra overhead to the cyberphysical system. Some performance metrics should be used to measure the efficiency and added overhead quantitatively. Additionally, in order to conduct efficient and cost-effective automated online evaluation for data-intensive CPS, which requires large volumes of data and devotes much of its processing time to I/O and data manipulation, this thesis presents COBRA, a cloud-based reliability assurance framework. COBRA provides automated multi-stage runtime reliability evaluation along the CPS workflow using data relocation services, a cloud data store, data quality analysis and process scheduling with self-tuning to achieve scalability, elasticity and efficiency. Finally, in order to provide a generic way to compare and benchmark system reliability for CPS and to extend the approach described above, this thesis presents FARE, a reliability benchmark framework that employs a CPS reliability model, a set of methods and metrics on evaluation environment selection, failure analysis, and reliability estimation. The main contributions of this thesis include validation of the above hypotheses and empirical studies of ARIS automated online evaluation system, COBRA cloud-based reliability assurance framework for data-intensive CPS, and FARE framework for benchmarking reliability of cyber-physical systems. This work has advanced the state of the art in the CPS reliability research, expanded the body of knowledge in this field, and provided some useful studies for further research
Anti-fragile ICT Systems
This book introduces a novel approach to the design and operation of large ICT systems. It views the technical solutions and their stakeholders as complex adaptive systems and argues that traditional risk analyses cannot predict all future incidents with major impacts. To avoid unacceptable events, it is necessary to establish and operate anti-fragile ICT systems that limit the impact of all incidents, and which learn from small-impact incidents how to function increasingly well in changing environments. The book applies four design principles and one operational principle to achieve anti-fragility for different classes of incidents. It discusses how systems can achieve high availability, prevent malware epidemics, and detect anomalies. Analyses of Netflix’s media streaming solution, Norwegian telecom infrastructures, e-government platforms, and Numenta’s anomaly detection software show that cloud computing is essential to achieving anti-fragility for classes of events with negative impacts
- …