6,957 research outputs found

    Heavy elements in Globular Clusters: the role of AGB stars

    Full text link
    Recent observations of heavy elements in Globular Clusters reveal intriguing deviations from the standard paradigm of the early galactic nucleosynthesis. If the r-process contamination is a common feature of halo stars, s-process enhancements are found in a few Globular Clusters only. We show that the combined pollution of AGB stars with mass ranging between 3 to 6 M⊙_\odot may account for most of the features of the s-process overabundance in M4 and M22. In these stars, the s process is a mixture of two different neutron-capture nucleosynthesis episodes. The first is due to the 13C(a,n)16O reaction and takes place during the interpulse periods. The second is due to the 22Ne(a,n)25Mg reaction and takes place in the convective zones generated by thermal pulses. The production of the heaviest s elements (from Ba to Pb) requires the first neutron burst, while the second produces large overabundances of light s (Sr, Y, Zr). The first mainly operates in the less-massive AGB stars, while the second dominates in the more-massive. From the heavy-s/light-s ratio, we derive that the pollution phase should last for 150±50150\pm 50 Myr, a period short enough compared to the formation timescale of the Globular Cluster system, but long enough to explain why the s-process pollution is observed in a few cases only. With few exceptions, our theoretical prediction provides a reasonable reproduction of the observed s-process abundances, from Sr to Hf. However, Ce is probably underproduced by our models, while Rb and Pb are overproduced. Possible solutions are discussed.Comment: Accepted by the Ap

    A formally verified compiler back-end

    Get PDF
    This article describes the development and formal verification (proof of semantic preservation) of a compiler back-end from Cminor (a simple imperative intermediate language) to PowerPC assembly code, using the Coq proof assistant both for programming the compiler and for proving its correctness. Such a verified compiler is useful in the context of formal methods applied to the certification of critical software: the verification of the compiler guarantees that the safety properties proved on the source code hold for the executable compiled code as well

    Accelerating Verified-Compiler Development with a Verified Rewriting Engine

    Get PDF
    Compilers are a prime target for formal verification, since compiler bugs invalidate higher-level correctness guarantees, but compiler changes may become more labor-intensive to implement, if they must come with proof patches. One appealing approach is to present compilers as sets of algebraic rewrite rules, which a generic engine can apply efficiently. Now each rewrite rule can be proved separately, with no need to revisit past proofs for other parts of the compiler. We present the first realization of this idea, in the form of a framework for the Coq proof assistant. Our new Coq command takes normal proved theorems and combines them automatically into fast compilers with proofs. We applied our framework to improve the Fiat Cryptography toolchain for generating cryptographic arithmetic, producing an extracted command-line compiler that is about 1000×\times faster while actually featuring simpler compiler-specific proofs.Comment: 13th International Conference on Interactive Theorem Proving (ITP 2022

    Design and Implementation of a Domain Specific Language for Deep Learning

    Get PDF
    \textit {Deep Learning} (DL) has found great success in well-diversified areas such as machine vision, speech recognition, big data analysis, and multimedia understanding recently. However, the existing state-of-the-art DL frameworks, e.g. Caffe2, Theano, TensorFlow, MxNet, Torch7, and CNTK, are programming libraries with fixed user interfaces, internal representations, and execution environments. Modifying the code of DL layers or data structure is very challenging without in-depth understanding of the underlying implementation. The optimization of the code and execution in these tools is often limited and relies on the specific DL computation graph manipulation and scheduling that lack systematic and universal strategies. Furthermore, most of these tools demand many dependencies beside the tool itself and require to be built to some specific platforms for DL training or inference. \\\\ \noindent This dissertation presents {\it DeepDSL}, a \textit {domain specific language} (DSL) embedded in Scala, that compiles DL networks encoded with DeepDSL to efficient, compact, and portable Java source programs for DL training and inference. DeepDSL represents DL networks as abstract tensor functions, performs symbolic gradient derivations to generate the Intermediate Representation (IR), optimizes the IR expressions, and compiles the optimized IR expressions to cross-platform Java code that is easily modifiable and debuggable. Also, the code directly runs on GPU without additional dependencies except a small set of \textit{JNI} (Java Native Interface) wrappers for invoking the underneath GPU libraries. Moreover, DeepDSL provides static analysis for memory consumption and error detection. \\\\ \noindent DeepDSL\footnote{Our previous results are reported in~\cite{zhao2017}; design and implementation details are summarized in~\cite{Zhao2018}.} has been evaluated with many current state-of-the-art DL networks (e.g. Alexnet, GoogleNet, VGG, Overfeat, and Deep Residual Network). While the DSL code is highly compact with less than 100 lines for each of the network, the Java source code generated by the DeepDSL compiler is highly efficient. Our experiments show that the output java source has very competitive runtime performance and memory efficiency compared to the existing DL frameworks

    Acute: High-level programming language design for distributed computation : Design rationale and language definition

    Get PDF
    This paper studies key issues for distributed programming in high-level languages. We discuss the design space and describe an experimental language, Acute, which we have defined and implemented. Acute extends an OCaml core to support distributed development, deployment, and execution, allowing type-safe interaction between separately-built programs. It is expressive enough to enable a wide variety of distributed infrastructure layers to be written as simple library code above the byte-string network and persistent store APIs, disentangling the language runtime from communication. This requires a synthesis of novel and existing features: (1) type-safe marshalling of values between programs; (2) dynamic loading and controlled rebinding to local resources; (3) modules and abstract types with abstraction boundaries that are respected by interaction; (4) global names, generated either freshly or based on module hashes: at the type level, as runtime names for abstract types; and at the term level, as channel names and other interaction handles; (5) versions and version constraints, integrated with type identity; (6) local concurrency and thread thunkification; and (7) second-order polymorphism with a namecase construct. We deal with the interplay among these features and the core, and develop a semantic definition that tracks abstraction boundaries, global names, and hashes throughout compilation and execution, but which still admits an efficient implementation strategy

    Deployment of Deep Neural Networks on Dedicated Hardware Accelerators

    Get PDF
    Deep Neural Networks (DNNs) have established themselves as powerful tools for a wide range of complex tasks, for example computer vision or natural language processing. DNNs are notoriously demanding on compute resources and as a result, dedicated hardware accelerators for all use cases are developed. Different accelerators provide solutions from hyper scaling cloud environments for the training of DNNs to inference devices in embedded systems. They implement intrinsics for complex operations directly in hardware. A common example are intrinsics for matrix multiplication. However, there exists a gap between the ecosystems of applications for deep learning practitioners and hardware accelerators. HowDNNs can efficiently utilize the specialized hardware intrinsics is still mainly defined by human hardware and software experts. Methods to automatically utilize hardware intrinsics in DNN operators are a subject of active research. Existing literature often works with transformationdriven approaches, which aim to establish a sequence of program rewrites and data-layout transformations such that the hardware intrinsic can be used to compute the operator. However, the complexity this of task has not yet been explored, especially for less frequently used operators like Capsule Routing. And not only the implementation of DNN operators with intrinsics is challenging, also their optimization on the target device is difficult. Hardware-in-the-loop tools are often used for this problem. They use latency measurements of implementations candidates to find the fastest one. However, specialized accelerators can have memory and programming limitations, so that not every arithmetically correct implementation is a valid program for the accelerator. These invalid implementations can lead to unnecessary long the optimization time. This work investigates the complexity of transformation-driven processes to automatically embed hardware intrinsics into DNN operators. It is explored with a custom, graph-based intermediate representation (IR). While operators like Fully Connected Layers can be handled with reasonable effort, increasing operator complexity or advanced data-layout transformation can lead to scaling issues. Building on these insights, this work proposes a novel method to embed hardware intrinsics into DNN operators. It is based on a dataflow analysis. The dataflow embedding method allows the exploration of how intrinsics and operators match without explicit transformations. From the results it can derive the data layout and program structure necessary to compute the operator with the intrinsic. A prototype implementation for a dedicated hardware accelerator demonstrates state-of-the art performance for a wide range of convolutions, while being agnostic to the data layout. For some operators in the benchmark, the presented method can also generate alternative implementation strategies to improve hardware utilization, resulting in a geo-mean speed-up of ×2.813 while reducing the memory footprint. Lastly, by curating the initial set of possible implementations for the hardware-in-the-loop optimization, the median timeto- solution is reduced by a factor of ×2.40. At the same time, the possibility to have prolonged searches due a bad initial set of implementations is reduced, improving the optimization’s robustness by ×2.35
    • …
    corecore