
    Coded Computing for Fault-Tolerant Parallel QR Decomposition

    QR decomposition is an essential operation for solving linear equations and obtaining least-squares solutions. In high-performance computing systems, large-scale parallel QR decomposition often faces node faults. We address this issue by proposing a fault-tolerant algorithm that incorporates 'coded computing' into the parallel Gram-Schmidt method, commonly used for QR decomposition. Coded computing introduces error-correcting codes into computational processes to enhance resilience against intermediate failures. While traditional coding strategies cannot preserve the orthogonality of Q, recent work has proven a post-orthogonalization condition that allows low-cost restoration of the degraded orthogonality. In this paper, we construct a checksum-generator matrix for multiple-node failures that satisfies the post-orthogonalization condition and prove that our code satisfies the maximum-distance separable (MDS) property with high probability. Furthermore, we consider an in-node checksum storage setting where checksums are stored in the original nodes. We obtain the minimal number of checksums required to be resilient to any f failures under in-node checksum storage, and also propose an in-node systematic MDS coding strategy that achieves this lower bound. Extensive experiments validate our theories and showcase the negligible overhead of our coded computing framework for fault-tolerant QR decomposition.
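    The Python sketch below illustrates the general coded-computing idea this abstract builds on: appending checksum blocks formed as random linear combinations of the data blocks, so that lost node contributions can be recovered from the survivors. It is a generic, hedged illustration with made-up block sizes and a random generator matrix, not the paper's checksum-generator construction or its post-orthogonalization step.

```python
# Minimal sketch (not the paper's construction): checksum-row coding for a
# block-row-distributed matrix. Rows of A are split across p worker nodes and
# f extra checksum blocks are formed as random combinations of the data blocks,
# so up to f lost blocks can be recovered with high probability (an MDS-style
# property). All names and sizes here are illustrative.
import numpy as np

rng = np.random.default_rng(0)
p, f, rows_per_node, n = 4, 2, 3, 5       # data nodes, tolerated faults, block shape

A = rng.standard_normal((p * rows_per_node, n))
blocks = np.split(A, p)                    # block rows held by the p data nodes

# Checksum generator: each checksum block is a random combination of data blocks.
G = rng.standard_normal((f, p))
checksums = [sum(G[i, j] * blocks[j] for j in range(p)) for i in range(f)]

# Simulate the loss of up to f data nodes.
lost = {1, 3}
survivors = [j for j in range(p) if j not in lost]

# Recover the lost blocks from the checksums: subtract the surviving blocks'
# contributions and solve the remaining linear system in the lost blocks.
C = G[:, sorted(lost)]                                     # (f, |lost|)
rhs = np.stack([checksums[i] - sum(G[i, j] * blocks[j] for j in survivors)
                for i in range(f)])                        # (f, rows, n)
recovered = np.einsum('lf,frn->lrn', np.linalg.pinv(C), rhs)

for k, j in enumerate(sorted(lost)):
    assert np.allclose(recovered[k], blocks[j])
```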

    Training Quantum Kernels for Clustering Algorithms

    Clustering is an exceptionally important branch of machine learning, from which applications that deal with large amounts of data can particularly benefit. In these applications, clustering is used to find structure in data and to group similar data points. However, similarity is not a well-defined measure in this context and depends strongly on the situation and the dataset at hand. This situation dependence is also the reason why there is not just a single clustering algorithm; rather, there are different models of what a cluster can be, and correspondingly different algorithms that find clusters based on these models. Moreover, many of these algorithms use interchangeable kernel functions, which define the similarity between two data points. These kernel functions need not be computed exclusively on classical computers: recent research in quantum machine learning shows that quantum computers can also be used to evaluate kernel functions. Quantum computers are mainly known for being able to efficiently solve certain problems that are intractable or hard to solve classically. The hope is that there are quantum kernel functions that cannot be computed classically but offer an advantage over classical kernel functions. Quantum kernel functions can furthermore be combined with variational quantum circuits. These circuits are very similar to classical neural networks, since both have tunable parameters that can be optimized to realize a given objective function. Combining a quantum kernel function with a variational quantum circuit, also called a variational quantum kernel, allows the kernel function to be adapted to the dataset through training. This idea was first introduced by the authors of [Hub+21], who combined the kernel function with a support vector machine. In contrast, this work proposes combining variational quantum kernels with classical clustering algorithms. Since clustering generally belongs to unsupervised machine learning, optimizing the parameters without labeled training data is difficult. This work therefore considers two different ways of training the parameters: fully unsupervised, where no labeled training data are available, and semi-supervised, where small amounts of labeled data can be used for training. The trained quantum kernels are then combined with different classical clustering algorithms that can make use of kernel functions. To evaluate this approach, a training and testing pipeline was implemented that uses simulated quantum computers on classical hardware. Extensive experiments show that variational quantum kernels can be trained with different semi-supervised cost functions. Furthermore, these trained quantum kernels achieve good results in classical clustering methods. In most cases, the clustering results with trained quantum kernels are better than with the commonly used classical radial basis function (RBF) kernel. However, no good results could be achieved with the proposed unsupervised cost functions, which are therefore currently not suitable for training a variational quantum kernel.
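    As a purely classical illustration of the semi-supervised training idea described above, the following Python sketch trains a parameterized kernel by kernel-target alignment on a small labeled subset and then feeds the resulting kernel matrix to a classical, kernel-based clustering algorithm. The parameterized Gaussian kernel, the alignment cost, and all names are stand-ins chosen for illustration; the thesis instead trains variational quantum kernels on simulated quantum hardware with its own cost functions.

```python
# Minimal sketch, not the thesis implementation: semi-supervised kernel training
# via kernel-target alignment, followed by kernel-based classical clustering.
# A parameterized Gaussian kernel stands in for the variational quantum kernel.
import numpy as np
from sklearn.cluster import SpectralClustering
from sklearn.datasets import make_blobs

def kernel_matrix(X, Y, theta):
    """Parameterized kernel; theta plays the role of the trainable circuit parameters."""
    d2 = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(-1)
    return np.exp(-np.exp(theta) * d2)           # exp(theta) keeps the width positive

def alignment(K, y):
    """Kernel-target alignment between K and the ideal kernel y y^T (labels in {-1, +1})."""
    T = np.outer(y, y)
    return (K * T).sum() / (np.linalg.norm(K) * np.linalg.norm(T))

X, labels = make_blobs(n_samples=200, centers=2, random_state=0)
idx = np.random.default_rng(0).choice(len(X), size=20, replace=False)   # small labeled subset
X_l, y_l = X[idx], 2 * labels[idx] - 1

# Crude, gradient-free training loop: keep the parameter with the best alignment.
theta = max(np.linspace(-3, 3, 61), key=lambda t: alignment(kernel_matrix(X_l, X_l, t), y_l))

# Use the trained kernel as a precomputed affinity for a classical clustering algorithm.
K = kernel_matrix(X, X, theta)
pred = SpectralClustering(n_clusters=2, affinity='precomputed', random_state=0).fit_predict(K)
```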

    Resiliency in numerical algorithm design for extreme scale simulations

    This work is based on the seminar titled ‘Resiliency in Numerical Algorithm Design for Extreme Scale Simulations’ held March 1–6, 2020, at Schloss Dagstuhl, which was attended by all the authors. Advanced supercomputing is characterized by very high computation speeds at the cost of enormous resource usage and expense. A typical large-scale computation running for 48 h on a system consuming 20 MW, as predicted for exascale systems, would consume a million kWh, corresponding to about 100k Euro in energy cost for executing 10^23 floating-point operations. It is clearly unacceptable to lose the whole computation if any of the several million parallel processes fails during the execution. Moreover, if a single operation suffers from a bit-flip error, should the whole computation be declared invalid? What about the notion of reproducibility itself: should this core paradigm of science be revised and refined for results that are obtained by large-scale simulation? Naive versions of conventional resilience techniques will not scale to the exascale regime: with a main memory footprint of tens of petabytes, synchronously writing checkpoint data all the way to background storage at frequent intervals will create intolerable overheads in runtime and energy consumption. Forecasts show that the mean time between failures could be lower than the time to recover from such a checkpoint, so that large calculations at scale might not make any progress if robust alternatives are not investigated. More advanced resilience techniques must be devised. The key may lie in exploiting both advanced system features and specific application knowledge. Research will face two essential questions: (1) what are the reliability requirements for a particular computation, and (2) how do we best design the algorithms and software to meet these requirements? While the analysis of use cases can help understand the particular reliability requirements, the construction of remedies is currently wide open. One avenue would be to refine and improve on system- or application-level checkpointing and rollback strategies in case an error is detected. Developers might use fault notification interfaces and flexible runtime systems to respond to node failures in an application-dependent fashion. Novel numerical algorithms or more stochastic computational approaches may be required to meet accuracy requirements in the face of undetectable soft errors. These ideas constituted an essential topic of the seminar. The goal of this Dagstuhl Seminar was to bring together a diverse group of scientists with expertise in exascale computing to discuss novel ways to make applications resilient against detected and undetected faults. In particular, participants explored the role that algorithms and applications play in the holistic approach needed to tackle this challenge. This article gathers a broad range of perspectives on the role of algorithms, applications and systems in achieving resilience for extreme scale simulations. The ultimate goal is to spark novel ideas and encourage the development of concrete solutions for achieving such resilience holistically.
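    To make the checkpoint-and-rollback avenue mentioned above concrete, here is a minimal Python sketch of application-level checkpointing for an iterative solver: the application periodically saves its own state and, when a fault is detected, rolls back to the last checkpoint instead of losing the whole run. The solver, the checkpoint interval, and the injected fault are illustrative placeholders, not a scheme proposed in the article.

```python
# Minimal sketch of application-level checkpoint/rollback for an iterative method.
# The solver, the checkpoint interval, and the simulated fault are placeholders.
import numpy as np

def jacobi_step(A, b, x):
    """One Jacobi iteration for A x = b (assumes A is diagonally dominant)."""
    D = np.diag(A)
    return (b - (A @ x - D * x)) / D

def run_with_checkpoints(A, b, iters=200, interval=25, fault_detected=lambda k: False):
    x = np.zeros_like(b)
    checkpoint = (0, x.copy())                     # (iteration, state)
    k = 0
    while k < iters:
        if k % interval == 0:
            checkpoint = (k, x.copy())             # lightweight in-memory checkpoint
        x_new = jacobi_step(A, b, x)
        if fault_detected(k):                      # e.g. a node failure or a flagged soft error
            k, x = checkpoint[0], checkpoint[1].copy()   # roll back to last consistent state
            continue
        x, k = x_new, k + 1
    return x

faults = {60}                                      # inject a single simulated fault at iteration 60

def one_shot_fault(k):
    if k in faults:
        faults.discard(k)                          # each simulated fault fires only once
        return True
    return False

A = np.array([[4.0, 1.0], [1.0, 3.0]])
b = np.array([1.0, 2.0])
x = run_with_checkpoints(A, b, fault_detected=one_shot_fault)
assert np.allclose(A @ x, b, atol=1e-6)
```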

    Coding for Privacy in Distributed Computing

    In a distributed computing network, multiple devices combine their resources to solve a problem. Thereby the network can achieve more than the sum of its parts: cooperation can enable the devices to compute more efficiently than each device could on its own, and even to solve problems that none of them could solve alone. However, devices taking exceptionally long to finish their tasks can severely increase the overall latency of the computation. This so-called straggler effect can arise from random events such as memory access and tasks running in the background on the devices. The effect typically stalls the whole network because most devices must wait for the stragglers to finish. Furthermore, sharing data and results among devices can severely strain the communication network. Especially in a wireless network where devices have to share a common channel, e.g., in edge computing and federated learning, the communication links often become the bottleneck. Last but not least, offloading data to untrusted devices raises privacy concerns. A participant in the distributed computing network might be wary of sharing personal data with other devices without adequately protecting sensitive information. This thesis analyses how ideas from coding theory can mitigate the straggler effect, reduce the communication load, and guarantee data privacy in distributed computing. In particular, Part A gives background on edge computing and federated learning, two popular instances of distributed computing; on linear regression, a common problem to be solved by distributed computing; and on the specific ideas from coding theory that are proposed to tackle the problems arising in distributed computing. Part B contains papers on the research performed in the framework of this thesis. The papers propose schemes that combine the introduced coding-theory ideas to minimize the overall latency while preserving data privacy in edge computing and federated learning. The proposed schemes significantly outperform state-of-the-art schemes. For example, a scheme from Paper I achieves an 8% speed-up for edge computing compared to a recently proposed non-private scheme while guaranteeing data privacy, whereas the schemes from Paper II achieve a speed-up factor of up to 18 for federated learning compared to current schemes in the literature for the considered scenarios.
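    As a hedged illustration of the privacy ingredient mentioned above, the following Python sketch shows Shamir-style secret sharing, one standard coding-theoretic way to offload data so that no small group of workers learns anything about it. The prime, the parameters, and the helper names are chosen for illustration; the thesis' schemes combine such privacy codes with straggler-mitigation codes in ways not reproduced here.

```python
# Minimal sketch of secret sharing for private offloading: any t shares reveal
# nothing about the secret, while any t+1 shares suffice to reconstruct it.
import random

P = 2_147_483_647                      # a Mersenne prime, used as the field size

def share(secret, n, t):
    """Split `secret` into n shares using a random degree-t polynomial."""
    coeffs = [secret] + [random.randrange(P) for _ in range(t)]
    return [(i, sum(c * pow(i, k, P) for k, c in enumerate(coeffs)) % P)
            for i in range(1, n + 1)]

def reconstruct(shares):
    """Lagrange interpolation at x = 0 from any t+1 shares."""
    secret = 0
    for i, (xi, yi) in enumerate(shares):
        num, den = 1, 1
        for j, (xj, _) in enumerate(shares):
            if i != j:
                num = num * (-xj) % P
                den = den * (xi - xj) % P
        secret = (secret + yi * num * pow(den, P - 2, P)) % P
    return secret

secret = 123456789
shares = share(secret, n=5, t=2)           # tolerate 2 colluding workers
assert reconstruct(shares[:3]) == secret   # any 3 of the 5 shares suffice
assert reconstruct(shares[1:4]) == secret
```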

    Investigating the behavior of deep convolution networks in image recognition

    This research project investigates the role of key factors that led to the resurgence of deep CNNs and their success in classifying large datasets of natural images. Our investigation covered the role of new network components, the role of the training data, and the role of data augmentation. Investigating the role of data augmentation led to the successful implementation of a deep CNN that can be trained using a variable input size, which increased the amount of allowable scale augmentation and led to much better single-view performance. Our analysis of the role of the training data shows the ability of deep CNNs to break a large hierarchical dataset down along its hierarchical lines into smaller components and learn all of them with great efficiency. This might help explain why deep CNNs are very effective in classifying large and dense datasets of natural images, which tend to have a hierarchical structure. Our investigation of core network components shows that the shared normalisation statistics of batch normalisation (BN) allowed us to alter the behaviour of the network by controlling the structure of the training batches. We used this observation to obtain large conditional gains by training and testing the network using balanced batches. Finally, we implemented a successful multitasking network that was able to outperform the corresponding single-task networks. Our model used the normalisation statistics of BN to separate the tasks, and our analysis shows that using a whole dataset per task increases the gains of the multitasking network by increasing the transfer of knowledge between the tasks.
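    The following PyTorch sketch illustrates the BN-based task separation described above: a multitask CNN that shares its convolutional weights but keeps separate BatchNorm layers (and classifier heads) per task, so each task runs through its own normalisation statistics. The layer sizes, task count, and class counts are illustrative assumptions, not the author's actual architecture.

```python
# Minimal sketch (assumed architecture, not the thesis code): shared conv weights,
# per-task BatchNorm layers and per-task heads for a multitask CNN.
import torch
import torch.nn as nn

class MultiTaskCNN(nn.Module):
    def __init__(self, num_tasks=2, num_classes=(10, 100)):
        super().__init__()
        self.conv1 = nn.Conv2d(3, 32, 3, padding=1)       # shared across tasks
        self.conv2 = nn.Conv2d(32, 64, 3, padding=1)      # shared across tasks
        self.bn1 = nn.ModuleList(nn.BatchNorm2d(32) for _ in range(num_tasks))
        self.bn2 = nn.ModuleList(nn.BatchNorm2d(64) for _ in range(num_tasks))
        self.heads = nn.ModuleList(nn.Linear(64, c) for c in num_classes)
        self.pool = nn.AdaptiveAvgPool2d(1)

    def forward(self, x, task):
        x = torch.relu(self.bn1[task](self.conv1(x)))     # task-specific BN statistics
        x = torch.relu(self.bn2[task](self.conv2(x)))
        x = self.pool(x).flatten(1)
        return self.heads[task](x)                        # task-specific classifier

model = MultiTaskCNN()
logits_task0 = model(torch.randn(8, 3, 32, 32), task=0)   # batch drawn from task 0's dataset
logits_task1 = model(torch.randn(8, 3, 32, 32), task=1)
```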

    A Construction Kit for Efficient Low Power Neural Network Accelerator Designs

    Implementing embedded neural network processing at the edge requires efficient hardware acceleration that couples high computational performance with low power consumption. Driven by the rapid evolution of network architectures and their algorithmic features, accelerator designs are constantly updated and improved. To evaluate and compare hardware design choices, designers can refer to a myriad of accelerator implementations in the literature. Surveys provide an overview of these works but are often limited to system-level and benchmark-specific performance metrics, making it difficult to quantitatively compare the individual effect of each utilized optimization technique. This complicates the evaluation of optimizations for new accelerator designs, slowing down research progress. This work provides a survey of neural network accelerator optimization approaches that have been used in recent works and reports their individual effects on edge processing performance. It presents the list of optimizations and their quantitative effects as a construction kit, allowing designers to assess the design choices for each building block separately. Reported optimizations range from up to 10,000x memory savings to 33x energy reductions, providing chip designers an overview of design choices for implementing efficient low-power neural network accelerators.
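    One building block such construction kits typically quantify is reduced arithmetic precision. As a purely illustrative, hedged example (not taken from the surveyed works), the Python sketch below quantizes a weight tensor to 8-bit integers and reports the resulting memory reduction and rounding error.

```python
# Illustrative sketch of one common accelerator optimization: post-training,
# symmetric per-tensor weight quantization to int8 (4x smaller than float32).
import numpy as np

weights = np.random.default_rng(0).standard_normal((256, 256)).astype(np.float32)

scale = np.abs(weights).max() / 127.0              # map the dynamic range onto int8
q = np.clip(np.round(weights / scale), -127, 127).astype(np.int8)
dequantized = q.astype(np.float32) * scale

memory_saving = weights.nbytes / q.nbytes          # 4x: float32 -> int8
max_error = np.abs(weights - dequantized).max()    # bounded by scale / 2
print(f"memory saving: {memory_saving:.0f}x, max abs error: {max_error:.4f}")
```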