7,759 research outputs found

    SWARM Parallelism: Training Large Models Can Be Surprisingly Communication-Efficient

    Full text link
    Many deep learning applications benefit from using large models with billions of parameters. Training these models is notoriously expensive due to the need for specialized HPC clusters. In this work, we consider alternative setups for training large models: using cheap "preemptible" instances or pooling existing resources from multiple regions. We analyze the performance of existing model-parallel algorithms in these conditions and find configurations where training larger models becomes less communication-intensive. Based on these findings, we propose SWARM parallelism, a model-parallel training algorithm designed for poorly connected, heterogeneous and unreliable devices. SWARM creates temporary randomized pipelines between nodes that are rebalanced in case of failure. We empirically validate our findings and compare SWARM parallelism with existing large-scale training approaches. Finally, we combine our insights with compression strategies to train a large Transformer language model with 1B shared parameters (approximately 13B before sharing) on preemptible T4 GPUs with less than 200Mb/s network bandwidth.
    Comment: Accepted to International Conference on Machine Learning (ICML) 2023. 25 pages, 8 figures
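
    As a rough illustration of the scheme described in the abstract, the sketch below keeps a pool of interchangeable workers per pipeline stage, samples a temporary pipeline for each microbatch, and rebalances the pools when a worker is preempted. The class name, data layout, and rebalancing heuristic are illustrative assumptions, not the authors' implementation.

        # Schematic sketch of randomized pipelines with rebalancing (illustrative only).
        import random

        class StochasticWiring:
            def __init__(self, workers_per_stage):
                # workers_per_stage: dict {stage_index: [worker ids]}
                self.pools = {s: list(ws) for s, ws in workers_per_stage.items()}

            def sample_pipeline(self):
                """Build a temporary pipeline by sampling one live worker per stage."""
                return [random.choice(self.pools[s]) for s in sorted(self.pools)]

            def report_failure(self, stage, worker):
                """Drop a preempted worker; if a stage runs dry, move a worker over."""
                if worker in self.pools[stage]:
                    self.pools[stage].remove(worker)
                if not self.pools[stage]:
                    # Borrow a worker from the currently largest pool (toy heuristic).
                    donor = max(self.pools, key=lambda s: len(self.pools[s]))
                    self.pools[stage].append(self.pools[donor].pop())

        # Usage: route each microbatch through a freshly sampled pipeline.
        wiring = StochasticWiring({0: ["a0", "a1"], 1: ["b0"], 2: ["c0", "c1"]})
        print(wiring.sample_pipeline())
        wiring.report_failure(1, "b0")
        print(wiring.sample_pipeline())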

    Modular lifelong machine learning

    Get PDF
    Deep learning has drastically improved the state-of-the-art in many important fields, including computer vision and natural language processing (LeCun et al., 2015). However, it is expensive to train a deep neural network on a machine learning problem. The overall training cost further increases when one wants to solve additional problems. Lifelong machine learning (LML) develops algorithms that aim to efficiently learn to solve a sequence of problems, which become available one at a time. New problems are solved with fewer resources by transferring previously learned knowledge. At the same time, an LML algorithm needs to retain good performance on all encountered problems, thus avoiding catastrophic forgetting. Current approaches do not possess all the desired properties of an LML algorithm. First, they primarily focus on preventing catastrophic forgetting (Diaz-Rodriguez et al., 2018; Delange et al., 2021). As a result, they neglect some knowledge transfer properties. Furthermore, they assume that all problems in a sequence share the same input space. Finally, scaling these methods to a large sequence of problems remains a challenge. Modular approaches to deep learning decompose a deep neural network into sub-networks, referred to as modules. Each module can then be trained to perform an atomic transformation, specialised in processing a distinct subset of inputs. This modular approach to storing knowledge makes it easy to reuse only the subset of modules which are useful for the task at hand. This thesis introduces a line of research which demonstrates the merits of a modular approach to lifelong machine learning and its ability to address the aforementioned shortcomings of other methods. Compared to previous work, we show that a modular approach can be used to achieve more LML properties than previously demonstrated. Furthermore, we develop tools which allow modular LML algorithms to scale in order to retain said properties on longer sequences of problems. First, we introduce HOUDINI, a neurosymbolic framework for modular LML. HOUDINI represents modular deep neural networks as functional programs and accumulates a library of pre-trained modules over a sequence of problems. Given a new problem, we use program synthesis to select a suitable neural architecture, as well as a high-performing combination of pre-trained and new modules. We show that our approach has most of the properties desired from an LML algorithm. Notably, it can perform forward transfer, avoid negative transfer and prevent catastrophic forgetting, even across problems with disparate input domains and problems which require different neural architectures. Second, we produce a modular LML algorithm which retains the properties of HOUDINI but can also scale to longer sequences of problems. To this end, we fix the choice of a neural architecture and introduce a probabilistic search framework, PICLE, for searching through different module combinations. To apply PICLE, we introduce two probabilistic models over neural modules which allow us to efficiently identify promising module combinations. Third, we phrase the search over module combinations in modular LML as black-box optimisation, which allows one to make use of methods from the setting of hyperparameter optimisation (HPO). We then develop a new HPO method which marries a multi-fidelity approach with model-based optimisation. We demonstrate that this leads to improved anytime performance in the HPO setting and discuss how this can in turn be used to augment modular LML methods. Overall, this thesis identifies a number of important LML properties, which have not all been attained in past methods, and presents an LML algorithm which can achieve all of them, apart from backward transfer.
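
    For intuition about the module-reuse idea above, here is a minimal sketch that keeps a per-layer library of frozen, pre-trained modules and searches over combinations of library modules and freshly initialized ones for a new problem. The exhaustive scoring loop is a deliberately simplified stand-in for the program synthesis (HOUDINI) and probabilistic search (PICLE) developed in the thesis; all names and signatures are illustrative assumptions.

        # Toy search over combinations of pre-trained and new modules (illustrative only).
        from itertools import product

        def search_module_combination(library, new_module_factory, num_layers, evaluate):
            """library: dict {layer_index: [frozen candidate modules]}
            new_module_factory: callable(layer_index) -> freshly initialized module
            evaluate: callable(list_of_modules) -> validation score on the new problem
            """
            choices_per_layer = [
                library.get(layer, []) + [new_module_factory(layer)]
                for layer in range(num_layers)
            ]
            best_combo, best_score = None, float("-inf")
            for combo in product(*choices_per_layer):
                score = evaluate(list(combo))
                if score > best_score:
                    best_combo, best_score = list(combo), score
            return best_combo, best_score

        # Usage (hypothetical): best, score = search_module_combination(lib, make_module, 3, eval_fn)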

    Graft-Specific Surgical and Rehabilitation Considerations for Anterior Cruciate Ligament Reconstruction with the Quadriceps Tendon Autograft

    Get PDF
    Anterior cruciate ligament reconstruction (ACLR) with a bone-patellar tendon-bone (BPTB) or hamstring tendon (HT) autograft has traditionally been the preferred surgical treatment for patients returning to Level 1 sports. More recently, the quadriceps tendon (QT) autograft has gained popularity internationally for primary and revision ACLR. Recent literature suggests that ACLR with the QT may yield less donor site morbidity than the BPTB and better patient-reported outcomes than the HT. Additionally, anatomic and biomechanical studies have highlighted the robust properties of the QT itself, with superior collagen density, length, size, and load-to-failure strength compared to the BPTB. Although previous literature has described rehabilitation considerations for the BPTB and HT autografts, less has been published with respect to the QT. Given the known impact of the various ACLR surgical techniques on postoperative rehabilitation, the purpose of this clinical commentary is to present the procedure-specific surgical and rehabilitation considerations for ACLR with the QT, as well as further highlight the need for procedure-specific rehabilitation strategies after ACLR by comparing the QT to the BPTB and HT autografts.
    Level of Evidence: Level

    Learning disentangled speech representations

    Get PDF
    A variety of informational factors are contained within the speech signal, and a single short recording of speech reveals much more than the spoken words. The best method to extract and represent informational factors from the speech signal ultimately depends on which informational factors are desired and how they will be used. In addition, some methods capture more than one informational factor at the same time, such as speaker identity, spoken content, and speaker prosody. The goal of this dissertation is to explore different ways to deconstruct the speech signal into abstract representations that can be learned and later reused in various speech technology tasks. This task of deconstruction, also known as disentanglement, is a form of distributed representation learning. As a general approach to disentanglement, there are some guiding principles that elaborate what a learned representation should contain as well as how it should function. In particular, learned representations should contain all of the requisite information in a more compact manner, be interpretable, remove nuisance factors of irrelevant information, be useful in downstream tasks, and be independent of the task at hand. The learned representations should also be able to answer counterfactual questions. In some cases, learned speech representations can be re-assembled in different ways according to the requirements of downstream applications. For example, in a voice conversion task, the speech content is retained while the speaker identity is changed. In a content-privacy task, some targeted content may be concealed without affecting how surrounding words sound. While there is no single best method to disentangle all types of factors, some end-to-end approaches demonstrate a promising degree of generalization to diverse speech tasks. This thesis explores a variety of use-cases for disentangled representations, including phone recognition, speaker diarization, linguistic code-switching, voice conversion, and content-based privacy masking. Speech representations can also be utilised for automatically assessing the quality and authenticity of speech, such as automatic MOS ratings or detecting deep fakes. The meaning of the term "disentanglement" is not well defined in previous work, and it has acquired several meanings depending on the domain (e.g. image vs. speech). Sometimes the term "disentanglement" is used interchangeably with the term "factorization". This thesis proposes that disentanglement of speech is distinct, and offers a viewpoint of disentanglement that can be considered both theoretically and practically.
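
    As a toy illustration of recombining disentangled factors, the sketch below (in PyTorch) uses separate content and speaker encoders and a decoder that reassembles them, so voice conversion pairs the content of one utterance with the speaker code of another. The architecture, dimensions, and names are assumptions for illustration only, not a model from the thesis.

        # Toy disentangled autoencoder: content and speaker codes recombined by a decoder.
        import torch
        import torch.nn as nn

        class DisentangledAutoencoder(nn.Module):
            def __init__(self, n_mels=80, content_dim=64, speaker_dim=32):
                super().__init__()
                self.content_enc = nn.GRU(n_mels, content_dim, batch_first=True)
                self.speaker_enc = nn.GRU(n_mels, speaker_dim, batch_first=True)
                self.decoder = nn.GRU(content_dim + speaker_dim, n_mels, batch_first=True)

            def forward(self, mel_content, mel_speaker):
                content, _ = self.content_enc(mel_content)    # frame-level content codes
                _, spk_state = self.speaker_enc(mel_speaker)  # utterance-level speaker code
                spk = spk_state[-1].unsqueeze(1).expand(-1, content.size(1), -1)
                recon, _ = self.decoder(torch.cat([content, spk], dim=-1))
                return recon

        # Voice conversion: content from utterance A, speaker identity from utterance B.
        model = DisentangledAutoencoder()
        mel_a, mel_b = torch.randn(1, 120, 80), torch.randn(1, 90, 80)
        converted = model(mel_a, mel_b)   # shape (1, 120, 80)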

    HDG methods and data-driven techniques for the van Roosbroeck model and its applications

    Get PDF
    Noninvasive estimation of doping inhomogeneities in semiconductors is relevant for many industrial applications. The goal is to estimate experimentally the unknown doping profile of a semiconductor by means of reproducible, indirect and non-destructive measurements. A number of technologies (such as LBIC, EBIC and LPS) have been developed which allow the indirect detection of doping variations via photovoltaic effects. The idea is to illuminate the sample at several positions while measuring the resulting voltage drop or current at the contacts. These technologies lead to inverse problems for which we still do not have a complete theoretical framework. In this thesis, we present three different data-driven approaches based on least squares, multilayer perceptrons, and residual neural networks. We compare the three strategies after having optimized the relevant hyperparameters, and we measure the robustness of our approaches with respect to noise. The methods are trained on synthetic datasets (pairs of discrete doping profiles and corresponding photovoltage signals at different illumination positions) which are generated by a numerical solution of the forward problem using a physics-preserving finite volume method stabilized with the Scharfetter-Gummel scheme. In view of the need to generate larger datasets for training, we study the possibility of applying high-order Discontinuous Galerkin methods to the forward problem while preserving the stability properties of the Scharfetter-Gummel scheme. We prove that Hybridizable Discontinuous Galerkin (HDG) methods, a family of high-order DG methods, are equivalent to the Scharfetter-Gummel scheme on uniform unidimensional grids for a specific choice of the HDG stabilization parameter. This result is generalized to two and three dimensions using an approach based on weighted scalar products and on local Slotboom changes of variables (W-HDG). We show that the proposed numerical scheme is well-posed, and numerically validate that it has the same properties as classical HDG methods, including optimal convergence and superconvergence of postprocessed solutions. For polynomial degree zero, dimension one, and vanishing HDG stabilization parameter, W-HDG coincides with the Scharfetter-Gummel stabilized finite volume scheme (i.e., it produces the same system matrix).
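
    For reference, the Scharfetter-Gummel flux that this work builds on can be sketched as below for electrons on a uniform 1D grid, using the Bernoulli function B(x) = x / (exp(x) - 1). Sign conventions and scaling differ between references; the constants and names here are illustrative, not taken from the thesis.

        # Minimal sketch of the Scharfetter-Gummel flux on a uniform 1D grid (one common convention).
        import numpy as np

        def bernoulli(x):
            """B(x) = x / (exp(x) - 1), evaluated stably near x = 0."""
            x = np.asarray(x, dtype=float)
            small = np.abs(x) < 1e-12
            safe = np.where(small, 1.0, x)          # avoid division by ~0 in the generic branch
            return np.where(small, 1.0 - 0.5 * x, x / np.expm1(safe))

        def sg_flux(n, psi, h, D=1.0, V_T=0.0259):
            """Scharfetter-Gummel electron flux at cell interfaces.

            n   : nodal electron densities, shape (N,)
            psi : nodal electrostatic potential, shape (N,)
            h   : grid spacing; D, V_T : diffusion coefficient and thermal voltage
            """
            dpsi = np.diff(psi) / V_T               # normalized potential jump per edge
            return (D / h) * (bernoulli(dpsi) * n[1:] - bernoulli(-dpsi) * n[:-1])

        # Tiny usage example with made-up values.
        psi = np.linspace(0.0, 0.3, 5)
        n = np.ones(5)
        print(sg_flux(n, psi, h=0.25))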

    Understanding Deep Learning Optimization via Benchmarking and Debugging

    Get PDF
    The central paradigm of machine learning (ML) is the idea that computers can learn the strategies needed to solve a task without being explicitly programmed to do so. The hope is that, given data, computers can recognize underlying patterns and figure out how to perform tasks without extensive human oversight. To achieve this, many machine learning problems are framed as minimizing a loss function, which makes optimization methods a core part of training ML models. While machine learning, and in particular deep learning, is often perceived as a cutting-edge technology, the underlying optimization algorithms tend to resemble rather simplistic, even archaic methods. Crucially, they rely on extensive human intervention to successfully train modern neural networks. One reason for this tedious, finicky, and lengthy training process lies in our insufficient understanding of optimization methods in the challenging deep learning setting. As a result, training neural nets, to this day, has the reputation of being more of an art form than a science and requires a level of human assistance that runs counter to the core principle of ML. Although hundreds of optimization algorithms for deep learning have been proposed, there is no widely agreed-upon protocol for evaluating their performance. Without a standardized and independent evaluation protocol, it is difficult to reliably demonstrate the usefulness of novel methods. In this thesis, we present strategies for quantitatively and reproducibly comparing deep learning optimizers in a meaningful way. This protocol considers the unique challenges of deep learning, such as the inherent stochasticity or the crucial distinction between learning and pure optimization. It is formalized and automated in the Python package DeepOBS and allows fairer, faster, and more convincing empirical comparisons of deep learning optimizers. Based on this benchmarking protocol, we compare fifteen popular deep learning optimizers to gain insight into the field's current state. To provide evidence-backed heuristics for choosing among the growing list of optimization methods, we extensively evaluate them with roughly 50,000 training runs. Our benchmark indicates that the comparatively traditional Adam optimizer remains a strong but not dominating contender and that newer methods fail to consistently outperform it. In addition to the optimizer, other causes can impede neural network training, such as inefficient model architectures or hyperparameters. Traditional performance metrics, such as training loss or validation accuracy, can show if a model is learning or not, but not why. To provide this understanding and a glimpse into the black box of neural networks, we developed Cockpit, a debugging tool specifically for deep learning. It combines novel and proven observables into a live monitoring tool for practitioners. Among other findings, Cockpit reveals that well-tuned training runs consistently overshoot the local minimum, at least for significant portions of the training. The use of thorough benchmarking experiments and tailored debugging tools improves our understanding of neural network training. In the absence of theoretical insights, these empirical results and practical tools are essential for guiding practitioners. More importantly, our results show that there is a need and a clear path for fundamentally different optimization methods to make deep learning more accessible, robust, and resource-efficient.
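
    The following is a generic sketch of the kind of benchmarking loop the thesis formalizes: each optimizer gets the same task and budget, repeated over several seeds to account for stochasticity. The toy task and settings are illustrative assumptions and do not reproduce the DeepOBS API or its test problems.

        # Generic optimizer benchmarking loop: same task, same budget, several seeds per optimizer.
        import torch
        import torch.nn as nn

        def run_once(optimizer_cls, seed, steps=200, lr=1e-2):
            torch.manual_seed(seed)
            x = torch.randn(512, 10)
            y = x @ torch.randn(10, 1) + 0.1 * torch.randn(512, 1)   # toy regression task
            model = nn.Sequential(nn.Linear(10, 32), nn.ReLU(), nn.Linear(32, 1))
            opt = optimizer_cls(model.parameters(), lr=lr)
            loss_fn = nn.MSELoss()
            for _ in range(steps):
                opt.zero_grad()
                loss = loss_fn(model(x), y)
                loss.backward()
                opt.step()
            return loss.item()

        results = {
            name: [run_once(cls, seed) for seed in range(3)]
            for name, cls in {"SGD": torch.optim.SGD, "Adam": torch.optim.Adam}.items()
        }
        for name, losses in results.items():
            print(name, sum(losses) / len(losses))   # mean final loss over seeds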

    Exploiting the optical properties of earth abundant cuprous oxide nanocatalysts for energy and health applications

    Get PDF
    In this dissertation, we explore the optical properties of semiconductor materials for energy and photocatalytic applications. In the past, semiconductor materials used in photocatalytic reactions have been known prominently for electron-transfer mechanisms such as redox reactions and localized surface plasmon resonance (LSPR). In this work, we demonstrate a substantial understanding of, and the advantages of, Mie resonance-based photocatalysis. Mie resonance-based photocatalytic mechanisms can find various applications in chemical manufacturing, pollution mitigation, and the pharmaceutical industry. We show that the Mie resonances of metal-oxide nanoparticles are governed by material properties such as the absorption and scattering coefficients and the dielectric permittivity; by physical properties such as the geometry and size of the nanoparticles; and by the wavelength and intensity of the incident light. We experimentally demonstrate that the dielectric Mie resonances in cuprous oxide (Cu2O) spherical and cubical nanostructures can be used to enhance the dye-sensitization rate of methylene blue dye: Cu2O nanostructures exhibiting dielectric Mie resonances achieve up to an order of magnitude higher dye-sensitization and photocatalytic rates than Cu2O nanostructures that do not. We further established structure-property-performance relationships for these nanostructures and found experimental evidence that the rate of dye sensitization is directly proportional to the overlap between the absorption characteristics of the nanocatalyst, the absorption of the dye, and the wavelength of the incident light. This work has the potential to be used in pollution mitigation applications, dye-sensitized solar cells, and related technologies. Building on this understanding of Cu2O nanostructures, we experimentally observed that the selectivity and activity of reactions can be tuned by modulating the wavelength of the incident light. We performed intensity-dependent studies of methylene blue degradation to gain mechanistic insight into selective photocatalysis. We also explored C-C coupling reactions with small molecules, which find applications mainly in the chemical and health industries. Carbon-carbon (C-C) coupling reactions are widely used to produce a range of compounds including pharmaceuticals, aromatic polymers, high-performance materials, and agrochemicals. Industrially, these reactions rely on homogeneous palladium (Pd) catalysts at high temperatures in solvent-intensive processes, and palladium is expensive, toxic, and rare. At the same time, distinguishing truly heterogeneous from homogeneous catalytic conditions remains an ongoing challenge within the field. In this research, we gained insight into the homogeneous versus heterogeneous pathways using various analytical, experimental, and computational techniques. We found that Cu2O nanoparticles can catalyze C-C coupling reactions under ligandless and base-free conditions via a truly heterogeneous pathway, paving the way for the development of highly efficient, robust, and sustainable flow processes.
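
    As a back-of-the-envelope illustration of the size dependence discussed above, the sketch below uses the common rule of thumb that the lowest-order (magnetic dipole) Mie resonance of a high-index dielectric sphere lies near a free-space wavelength of roughly the refractive index times the diameter. The refractive index value for Cu2O and the diameters are illustrative assumptions, not data from the dissertation.

        # Rough estimate of the magnetic dipole Mie resonance of a dielectric sphere (rule of thumb).
        def approx_mie_resonance_nm(diameter_nm, refractive_index=2.7):
            """Approximate free-space wavelength (nm) of the magnetic dipole Mie resonance."""
            return refractive_index * diameter_nm

        for d in (100, 150, 200, 250):
            print(f"d = {d} nm  ->  lambda_MD ~ {approx_mie_resonance_nm(d):.0f} nm")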