Heavy elements in Globular Clusters: the role of AGB stars
Recent observations of heavy elements in Globular Clusters reveal intriguing
deviations from the standard paradigm of early galactic nucleosynthesis. While
r-process contamination is a common feature of halo stars, s-process
enhancements are found in only a few Globular Clusters. We show that the
combined pollution of AGB stars with masses ranging from 3 to 6 M$_\odot$ may
account for most of the features of the s-process overabundance in M4 and M22.
In these stars, the s process is a mixture of two different neutron-capture
nucleosynthesis episodes. The first is due to the 13C(a,n)16O reaction and
takes place during the interpulse periods. The second is due to the
22Ne(a,n)25Mg reaction and takes place in the convective zones generated by
thermal pulses. The production of the heaviest s elements (from Ba to Pb)
requires the first neutron burst, while the second produces large
overabundances of the light s elements (Sr, Y, Zr). The first mainly operates in the
less-massive AGB stars, while the second dominates in the more-massive ones. From
the heavy-s/light-s ratio, we derive that the pollution phase should last for
Myr, a period short compared to the formation timescale of
the Globular Cluster system, but long enough to explain why s-process
pollution is observed in only a few cases. With few exceptions, our theoretical
prediction provides a reasonable reproduction of the observed s-process
abundances, from Sr to Hf. However, Ce is probably underproduced by our models,
while Rb and Pb are overproduced. Possible solutions are discussed.
Comment: Accepted by the Ap
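The two neutron-source reactions named in the abstract can be written in standard nuclear notation:

```latex
% The two neutron bursts driving the s process in AGB stars:
% radiative burning during the interpulse, then convective burning in the pulse.
\begin{align*}
  {}^{13}\mathrm{C} + \alpha &\rightarrow {}^{16}\mathrm{O} + n
    && \text{(interpulse periods)} \\
  {}^{22}\mathrm{Ne} + \alpha &\rightarrow {}^{25}\mathrm{Mg} + n
    && \text{(convective thermal pulses)}
\end{align*}
```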
A formally verified compiler back-end
This article describes the development and formal verification (proof of
semantic preservation) of a compiler back-end from Cminor (a simple imperative
intermediate language) to PowerPC assembly code, using the Coq proof assistant
both for programming the compiler and for proving its correctness. Such a
verified compiler is useful in the context of formal methods applied to the
certification of critical software: the verification of the compiler guarantees
that the safety properties proved on the source code hold for the executable
compiled code as well.
Accelerating Verified-Compiler Development with a Verified Rewriting Engine
Compilers are a prime target for formal verification, since compiler bugs
invalidate higher-level correctness guarantees, but compiler changes become
more labor-intensive to implement if they must come with proof patches. One
appealing approach is to present compilers as sets of algebraic rewrite rules,
which a generic engine can apply efficiently. Now each rewrite rule can be
proved separately, with no need to revisit past proofs for other parts of the
compiler. We present the first realization of this idea, in the form of a
framework for the Coq proof assistant. Our new Coq command takes normal proved
theorems and combines them automatically into fast compilers with proofs. We
applied our framework to improve the Fiat Cryptography toolchain for generating
cryptographic arithmetic, producing an extracted command-line compiler that is
about 1000× faster while actually featuring simpler compiler-specific proofs.
Comment: 13th International Conference on Interactive Theorem Proving (ITP 2022)
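The idea of a compiler phase expressed as separately provable algebraic rewrite rules can be sketched as follows. This is an illustrative toy engine in Python, not the paper's actual Coq framework or API; the rule set and pattern syntax are assumptions.

```python
# Toy rewriting engine: each rule is an independent (pattern, template) pair,
# applied bottom-up to a fixed point. In the verified setting, each rule
# carries its own correctness proof and no other proofs need revisiting.
# Expressions are nested tuples like ("add", x, y); "?x" is a pattern variable.

RULES = [
    (("add", "?x", 0), "?x"),   # x + 0  ->  x
    (("mul", "?x", 1), "?x"),   # x * 1  ->  x
    (("mul", "?x", 0), 0),      # x * 0  ->  0
]

def match(pattern, expr, env):
    """Try to match expr against pattern, binding ?-variables in env."""
    if isinstance(pattern, str) and pattern.startswith("?"):
        env[pattern] = expr
        return True
    if isinstance(pattern, tuple) and isinstance(expr, tuple):
        return (len(pattern) == len(expr)
                and all(match(p, e, env) for p, e in zip(pattern, expr)))
    return pattern == expr

def substitute(template, env):
    """Instantiate a rule's right-hand side with the matched bindings."""
    if isinstance(template, str) and template.startswith("?"):
        return env[template]
    if isinstance(template, tuple):
        return tuple(substitute(t, env) for t in template)
    return template

def rewrite(expr):
    """Rewrite subterms first, then try every rule at the root."""
    if isinstance(expr, tuple):
        expr = (expr[0],) + tuple(rewrite(e) for e in expr[1:])
    for pattern, template in RULES:
        env = {}
        if match(pattern, expr, env):
            return rewrite(substitute(template, env))
    return expr

print(rewrite(("add", ("mul", "a", 1), 0)))  # -> a
```

The point of the design is visible even in the toy: adding a fourth rule to `RULES` requires no change to `match`, `substitute`, or `rewrite`, mirroring how a new proved theorem slots into the verified engine without touching past proofs.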
Design and Implementation of a Domain Specific Language for Deep Learning
\textit{Deep Learning} (DL) has recently found great success in well-diversified areas such as machine vision, speech recognition, big data analysis, and multimedia understanding. However, the existing state-of-the-art DL frameworks, e.g. Caffe2, Theano, TensorFlow, MxNet, Torch7, and CNTK, are programming libraries with fixed user interfaces, internal representations, and execution environments. Modifying the code of DL layers or data structures is very challenging without an in-depth understanding of the underlying implementation. The optimization of the code and its execution in these tools is often limited and relies on computation-graph manipulation and scheduling specific to each DL framework, lacking systematic and universal strategies. Furthermore, most of these tools demand many dependencies besides the tool itself and must be built for specific platforms for DL training or inference.
\\\\
\noindent This dissertation presents {\it DeepDSL}, a \textit {domain specific language} (DSL) embedded in Scala, that compiles DL networks encoded with DeepDSL to efficient, compact, and portable Java source programs for DL training and inference. DeepDSL represents DL networks as abstract tensor functions, performs symbolic gradient derivations to generate the Intermediate Representation (IR), optimizes the IR expressions, and compiles the optimized IR expressions to cross-platform Java code that is easily modifiable and debuggable. Also, the code directly runs on GPU without additional dependencies except a small set of \textit{JNI} (Java Native Interface) wrappers for invoking the underneath GPU libraries. Moreover, DeepDSL provides static analysis for memory consumption and error detection.
\\\\
\noindent DeepDSL\footnote{Our previous results are reported in~\cite{zhao2017}; design and implementation details are summarized in~\cite{Zhao2018}.} has been evaluated with many current state-of-the-art DL networks (e.g. Alexnet, GoogleNet, VGG, Overfeat, and Deep Residual Network). While the DSL code is highly compact, with fewer than 100 lines for each network, the Java source code generated by the DeepDSL compiler is highly efficient. Our experiments show that the generated Java source has very competitive runtime performance and memory efficiency compared to the existing DL frameworks.
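The core mechanism the abstract describes, representing a network as an expression tree and deriving the gradient symbolically before any code generation, can be sketched in miniature. This is not DeepDSL's actual representation or API; the tuple encoding and function names are illustrative assumptions.

```python
# Minimal symbolic gradient derivation over an expression tree.
# A "network" here is just nested tuples: ("add", a, b) or ("mul", a, b),
# with strings as variables and numbers as constants.

def grad(expr, wrt):
    """Return the symbolic derivative of expr with respect to variable wrt."""
    if isinstance(expr, (int, float)):
        return 0
    if isinstance(expr, str):
        return 1 if expr == wrt else 0
    op, a, b = expr
    if op == "add":                      # sum rule
        return ("add", grad(a, wrt), grad(b, wrt))
    if op == "mul":                      # product rule
        return ("add", ("mul", grad(a, wrt), b), ("mul", a, grad(b, wrt)))
    raise ValueError(f"unknown operator: {op}")

# d/dw (w*x + b) before simplification: (1*x + w*0) + 0
print(grad(("add", ("mul", "w", "x"), "b"), "w"))
```

In a system like DeepDSL, a subsequent optimization pass would simplify the resulting IR (e.g. eliminating the `w*0` and `+0` terms) before compiling it to Java code.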
Acute: High-level programming language design for distributed computation: Design rationale and language definition
This paper studies key issues for distributed programming in high-level languages. We discuss the design space and describe an experimental language, Acute, which we have defined and implemented. Acute extends an OCaml core to support distributed development, deployment, and execution, allowing type-safe interaction between separately-built programs. It is expressive enough to enable a wide variety of distributed infrastructure layers to be written as simple library code above the byte-string network and persistent store APIs, disentangling the language runtime from communication. This requires a synthesis of novel and existing features: (1) type-safe marshalling of values between programs; (2) dynamic loading and controlled rebinding to local resources; (3) modules and abstract types with abstraction boundaries that are respected by interaction; (4) global names, generated either freshly or based on module hashes: at the type level, as runtime names for abstract types; and at the term level, as channel names and other interaction handles; (5) versions and version constraints, integrated with type identity; (6) local concurrency and thread thunkification; and (7) second-order polymorphism with a namecase construct. We deal with the interplay among these features and the core, and develop a semantic definition that tracks abstraction boundaries, global names, and hashes throughout compilation and execution, but which still admits an efficient implementation strategy
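Feature (1), type-safe marshalling between separately-built programs, can be illustrated with a small sketch. This is not Acute's actual syntax or semantics (Acute is an OCaml extension with hash-based type names); the tagging scheme and function names here are simplified assumptions in Python.

```python
# Sketch of type-safe marshalling: a value crosses the byte-string network
# API together with a representation of its type, and the receiver checks
# that representation before using the value.

import json

def marshal(value, type_name):
    """Serialize a value with its type tag into a byte string."""
    return json.dumps({"type": type_name, "value": value}).encode()

def unmarshal(data, expected_type):
    """Deserialize, failing loudly if the sender's type does not match."""
    msg = json.loads(data.decode())
    if msg["type"] != expected_type:
        raise TypeError(f"expected {expected_type}, got {msg['type']}")
    return msg["value"]

wire = marshal([1, 2, 3], "int list")   # sent over the byte-string network API
print(unmarshal(wire, "int list"))      # -> [1, 2, 3]
# unmarshal(wire, "string") would raise TypeError rather than mis-read the data
```

Acute goes much further than a string tag: type identity incorporates module hashes and version constraints, so two programs agree on a type only when they were built against compatible module implementations.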
Deployment of Deep Neural Networks on Dedicated Hardware Accelerators
Deep Neural Networks (DNNs) have established themselves as powerful tools for
a wide range of complex tasks, for example computer vision or natural language
processing. DNNs are notoriously demanding on compute resources and, as a
result, dedicated hardware accelerators are developed for all kinds of use cases.
Different accelerators provide solutions ranging from hyper-scale cloud environments
for the training of DNNs to inference devices in embedded systems. They implement
intrinsics for complex operations directly in hardware. A common example
is intrinsics for matrix multiplication. However, there exists a gap between
the ecosystems of applications for deep learning practitioners and hardware
accelerators. How DNNs can efficiently utilize the specialized hardware intrinsics
is still mainly defined by human hardware and software experts.
Methods to automatically utilize hardware intrinsics in DNN operators are a
subject of active research. Existing literature often works with transformation-driven
approaches, which aim to establish a sequence of program rewrites and
data-layout transformations such that the hardware intrinsic can be used to
compute the operator. However, the complexity of this task has not yet been
explored, especially for less frequently used operators like Capsule Routing.
Not only is the implementation of DNN operators with intrinsics challenging;
their optimization on the target device is also difficult. Hardware-in-the-loop
tools are often used for this problem. They use latency measurements of
implementation candidates to find the fastest one. However, specialized accelerators
can have memory and programming limitations, so that not every arithmetically
correct implementation is a valid program for the accelerator. These invalid
implementations can lead to unnecessarily long optimization times.
This work investigates the complexity of transformation-driven processes to
automatically embed hardware intrinsics into DNN operators. It is explored
with a custom, graph-based intermediate representation (IR). While operators
like Fully Connected Layers can be handled with reasonable effort, increasing
operator complexity or advanced data-layout transformation can lead to scaling issues.
Building on these insights, this work proposes a novel method to embed
hardware intrinsics into DNN operators. It is based on a dataflow analysis.
The dataflow embedding method allows the exploration of how intrinsics and
operators match without explicit transformations. From the results it can derive
the data layout and program structure necessary to compute the operator with
the intrinsic. A prototype implementation for a dedicated hardware accelerator
demonstrates state-of-the-art performance for a wide range of convolutions, while
being agnostic to the data layout. For some operators in the benchmark, the
presented method can also generate alternative implementation strategies to
improve hardware utilization, resulting in a geometric-mean speed-up of 2.813× while
reducing the memory footprint. Lastly, by curating the initial set of possible
implementations for the hardware-in-the-loop optimization, the median
time-to-solution is reduced by a factor of 2.40×. At the same time, the possibility of
prolonged searches due to a bad initial set of implementations is reduced,
improving the optimization’s robustness by 2.35×.
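The gap the abstract describes, mapping a DNN-level operator onto a fixed-size hardware intrinsic, can be made concrete with a small sketch. The 2×2 matrix-multiply-accumulate "intrinsic" below is hypothetical and emulated with NumPy; real accelerators fix the tile shape in hardware, and the tiling loop is exactly the kind of program structure a transformation-driven or dataflow-based method must derive automatically.

```python
# Decomposing a matrix multiplication into calls to a fixed-size
# matrix-multiply-accumulate (MMA) intrinsic, here emulated in software.

import numpy as np

TILE = 2  # the hypothetical intrinsic processes TILE x TILE blocks

def mma_intrinsic(acc, a_tile, b_tile):
    """Stand-in for a hardware MMA intrinsic: acc += a_tile @ b_tile."""
    return acc + a_tile @ b_tile

def tiled_matmul(A, B):
    """Compute A @ B using only TILE-sized intrinsic calls."""
    n, k = A.shape
    k2, m = B.shape
    assert k == k2 and n % TILE == 0 and m % TILE == 0 and k % TILE == 0
    C = np.zeros((n, m))
    for i in range(0, n, TILE):          # rows of the output
        for j in range(0, m, TILE):      # columns of the output
            for p in range(0, k, TILE):  # accumulate along the reduction axis
                C[i:i+TILE, j:j+TILE] = mma_intrinsic(
                    C[i:i+TILE, j:j+TILE],
                    A[i:i+TILE, p:p+TILE],
                    B[p:p+TILE, j:j+TILE])
    return C

A = np.arange(16.0).reshape(4, 4)
B = np.eye(4)
print(np.allclose(tiled_matmul(A, B), A @ B))  # -> True
```

Even this simplified case shows why the problem is hard in general: the loop order, the data layout of `A` and `B`, and the padding needed when shapes do not divide `TILE` all interact, and for operators less regular than matrix multiplication (e.g. Capsule Routing) a valid decomposition may not be obvious at all.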