Scope is all you need: Transforming LLMs for HPC Code
With easier access to powerful compute resources, there is a growing trend in
the field of AI for software development to develop larger and larger language
models (LLMs) to address a variety of programming tasks. Even LLMs applied to
tasks from the high-performance computing (HPC) domain are huge in size (e.g.,
billions of parameters) and demand expensive compute resources for training. We
found this design choice confusing: why do we need large LLMs trained on
natural languages and programming languages unrelated to HPC for HPC-specific
tasks? In this line of work, we aim to question design choices made by existing
LLMs by developing smaller LLMs for specific domains, which we call
domain-specific LLMs. Specifically, we start off with HPC as a domain and
propose a novel tokenizer named Tokompiler, designed specifically for
preprocessing code in HPC and compilation-centric tasks. Tokompiler leverages
knowledge of language primitives to generate language-oriented tokens,
providing a context-aware understanding of code structure while avoiding human
semantics attributed to code structures completely. We applied Tokompiler to
pre-train two state-of-the-art models, SPT-Code and PolyCoder, on a Fortran
code corpus mined from GitHub. We evaluate the performance of these models
against conventional LLMs. Results demonstrate that Tokompiler
significantly enhances code completion accuracy and semantic understanding
compared to traditional tokenizers in normalized-perplexity tests, down to ~1
perplexity score. This research opens avenues for further advancements in
domain-specific LLMs, catering to the unique demands of HPC and compilation
tasks.
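The idea of language-oriented tokens that carry no human semantics can be illustrated with a minimal sketch. This is not the authors' Tokompiler implementation: the keyword set, the regular expression, and the placeholder names `var_N` and `num` are all assumptions made for illustration, and a real tool would work from a proper Fortran parser rather than a regex.

```python
import re

# Hypothetical, deliberately tiny keyword set (assumption for illustration).
FORTRAN_KEYWORDS = {
    "do", "end", "if", "then", "else", "integer", "real",
    "program", "call", "subroutine", "function",
}

# Numbers, identifiers, or any single non-space character (e.g. operators).
TOKEN_RE = re.compile(r"\d+\.\d*|\d+|[A-Za-z_]\w*|\S")


def anonymize(code):
    """Replace user-chosen names and literals with neutral placeholders.

    Keywords and punctuation are kept; identifiers become var_0, var_1, ...
    in order of first appearance; numeric literals become 'num'.
    """
    names = {}
    tokens = []
    for tok in TOKEN_RE.findall(code):
        low = tok.lower()  # Fortran is case-insensitive
        if tok[0].isdigit():
            tokens.append("num")
        elif tok[0].isalpha() or tok[0] == "_":
            if low in FORTRAN_KEYWORDS:
                tokens.append(low)
            else:
                if low not in names:
                    names[low] = f"var_{len(names)}"
                tokens.append(names[low])
        else:
            tokens.append(tok)
    return tokens


print(anonymize("do i = 1, n"))
```

The same identifier always maps to the same placeholder within a snippet, so structure is preserved while naming choices are discarded.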
Devil is Virtual: Reversing Virtual Inheritance in C++ Binaries
Complexities that arise from the implementation of object-oriented concepts in
C++ such as virtual dispatch and dynamic type casting have attracted the
attention of attackers and defenders alike.
Binary-level defenses depend on full and precise recovery of the class
inheritance tree of a given program.
While current solutions focus on recovering single and multiple inheritances
from the binary, they are oblivious to virtual inheritance. Conventional wisdom
among binary-level defenses is that virtual inheritance is uncommon and/or
support for single and multiple inheritances provides implicit support for
virtual inheritance. In this paper, we show neither to be true.
Specifically, (1) we present an efficient technique to detect virtual
inheritance in C++ binaries and show through a study that virtual inheritance
can be found in a non-negligible number (more than 10% on Linux and 12.5% on
Windows) of real-world C++ programs including MySQL and libstdc++. (2) We show
that failure to handle virtual inheritance introduces both false positives and
false negatives in the hierarchy tree. These false positives and negatives
either introduce attack surface when the hierarchy recovered is used to enforce
CFI policies, or make the hierarchy difficult to understand when it is needed
for program understanding (e.g., during decompilation). (3) We present a
solution to recover virtual inheritance from COTS binaries. We recover a
maximum of 95% and 95.5% (GCC -O0) and a minimum of 77.5% and 73.8% (Clang
-O2) of virtual and intermediate bases respectively in the virtual inheritance
tree. Comment: Accepted at CCS20. This is a technical report version.
Entrainer-Based Reactive Distillation for Esterification of Glycerol with Acetic Acid
The applicability of reactive distillation for esterification of glycerol with acetic acid in the presence of Amberlyst-15 as catalyst and ethylene dichloride as an entrainer is evaluated through experiments and simulation. The reaction is studied in both semibatch and continuous reactive distillation systems. The effect of different parameters such as entrainer amount, catalyst loading, and reboiler duty is studied. The results indicate that entrainer-based semibatch reactive distillation can enhance the selectivity toward triacetin to about 100%, which is much greater than that offered by any conventional reactor with stoichiometric mole ratio of reactants. Simulations for both semibatch and continuous reactive distillation are performed, and results agree reasonably well with those obtained by experiments. The best possible design and operating parameters are obtained through detailed simulation using an experimentally validated model. A column configuration is recommended for a continuous process.
Transacetalization of Glycerol with Methylal by Reactive Distillation
The applicability of reactive distillation (RD) for the transacetalization of glycerol with methylal in the presence of Amberlyst-15 is studied by experiments and simulation. On the basis of the batch kinetic runs, a pseudohomogeneous kinetic model is proposed. The experiments are performed on a continuous reactive distillation column and are compared with the predictions of the equilibrium stage model. Various feasible configurations of reactive distillation are identified, and the experimentally validated simulator is used to investigate the effect of different design and operating parameters such as number of rectifying stages, stripping stages, feed mole ratio, reboiler duty, etc. on the performance in each case. The RD process alternatives and the conventional process of reaction followed by distillation are compared.
Acetalization of Glycerol with Formaldehyde by Reactive Distillation
The feasibility of reactive distillation (RD) for the reversible acetalization of glycerol with formaldehyde is evaluated through experiments and simulations. Simultaneous removal of acetal and water from the reactive zone of the RD column helps shift the reaction in the forward direction and achieve close to quantitative conversion levels. The results of laboratory-scale RD experiments performed in this study are compared with those predicted by simulation using the kinetics developed in the present work. Since commercial formaldehyde is available in the form of its aqueous solution, a large amount of water has to be removed to achieve substantial conversion. An experimentally validated simulator is thus used to design an appropriate RD configuration that offers minimum energy consumption. Toluene is used as an entrainer to remove water from the RD column. The process is compared with the reported indirect route of transacetalization of glycerol with methylal.
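Why removing products pushes a reversible reaction toward completion can be seen in a toy calculation. The sketch below is not the kinetic or equilibrium-stage model from the paper: it assumes an idealized equimolar reaction A + B ⇌ C + D at unit initial concentration, a hypothetical equilibrium constant `K`, and a hypothetical `retained_fraction` of each product that stays in the reactive zone. Under those assumptions K = (f·x)² / (1 − x)², which gives x = √K / (f + √K).

```python
import math


def equilibrium_conversion(K, retained_fraction=1.0):
    """Equilibrium conversion x for an idealized A + B <=> C + D.

    Assumes equimolar feed at unit concentration and that only a fraction
    `retained_fraction` (0 < f <= 1) of each product remains in the
    reactive zone. Both parameters are illustrative, not fitted values.
    """
    if K <= 0:
        raise ValueError("K must be positive")
    if not 0 < retained_fraction <= 1:
        raise ValueError("retained_fraction must be in (0, 1]")
    root = math.sqrt(K)
    return root / (retained_fraction + root)


# Illustrative sweep with a hypothetical K = 1.
for f in (1.0, 0.1, 0.01):
    print(f, round(equilibrium_conversion(1.0, f), 3))
```

With no removal (f = 1) the toy system stops at 50% conversion for K = 1; as less product is retained, the conversion approaches 1, which is the qualitative effect the abstract describes.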
Quantitative Detection of PEGylated Biomacromolecules in Biological Fluids by NMR
The accumulation, biodistribution,
and clearance profiles of therapeutic
agents are key factors relevant to their efficacy. Determining these
properties constitutes an ongoing experimental challenge. Many such
therapeutics, including small molecules, peptides, proteins, tissue
scaffolds, and drug delivery vehicles, are conjugated to poly(ethylene
glycol) (PEG) as this improves their bioavailability and in vivo stability.
We demonstrate here that ¹H NMR spectroscopy can be used
to quantify PEGylated species in complex biological fluids directly,
rapidly, and with minimal sample preparation. PEG bears a large number
of spectroscopically equivalent protons exhibiting a narrow NMR line
width while resonating at a ¹H NMR frequency distinct from
most other biochemical signals. We demonstrate that PEG provides a
robust signal allowing detection of concentrations as low as 10 μg/mL
in blood. This PEG detection limit is lowered by another order of
magnitude when background proton signals are minimized using
¹³C-enriched PEG in combination with a double quantum filter
to remove ¹H signals from non-¹³C-labeled species.
Quantitative detection of PEG via these methods is shown in pig blood
and goat serum as examples of complex biological fluids. More practically,
we quantify the blood clearance of ¹³C-PEG and PEGylated-BSA
(bovine serum albumin) following their intravenous injection in live
rats. Given the relative insensitivity of line width to PEG size,
we anticipate that the biodistribution and clearance profiles of virtually
any PEGylated biomacromolecule from biological fluid samples can be
routinely measured by ¹H NMR without any filtering or treatment
steps.