13 research outputs found

    Scope is all you need: Transforming LLMs for HPC Code

    Full text link
    With easier access to powerful compute resources, there is a growing trend in the field of AI for software development to develop larger and larger language models (LLMs) to address a variety of programming tasks. Even LLMs applied to tasks from the high-performance computing (HPC) domain are huge in size (e.g., billions of parameters) and demand expensive compute resources for training. We found this design choice confusing - why do we need large LLMs trained on natural languages and programming languages unrelated to HPC for HPC-specific tasks? In this line of work, we aim to question design choices made by existing LLMs by developing smaller LLMs for specific domains - we call them domain-specific LLMs. Specifically, we start off with HPC as a domain and propose a novel tokenizer named Tokompiler, designed specifically for preprocessing code in HPC and compilation-centric tasks. Tokompiler leverages knowledge of language primitives to generate language-oriented tokens, providing a context-aware understanding of code structure while completely avoiding the human semantics attributed to code structures. We applied Tokompiler to pre-train two state-of-the-art models, SPT-Code and Polycoder, on a Fortran code corpus mined from GitHub. We evaluate the performance of these models against conventional LLMs. Results demonstrate that Tokompiler significantly enhances code completion accuracy and semantic understanding compared to traditional tokenizers in normalized-perplexity tests, down to a perplexity score of ~1. This research opens avenues for further advancements in domain-specific LLMs, catering to the unique demands of HPC and compilation tasks.
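
    For readers unfamiliar with the metric, the sketch below shows how a standard perplexity score is computed from per-token log-probabilities, and why a score near 1 means the model assigns almost all probability mass to the correct next token. It is an illustration only: the function name and the probability values are hypothetical, and the abstract does not say how its "normalized" perplexity is defined, so that variant may differ from the plain definition used here.

```python
import math


def perplexity(token_log_probs):
    """Return exp of the mean negative natural-log probability per token."""
    if not token_log_probs:
        raise ValueError("need at least one token log-probability")
    mean_nll = -sum(token_log_probs) / len(token_log_probs)
    return math.exp(mean_nll)


# Hypothetical per-token probabilities, not data from the paper.
confident = [math.log(0.99)] * 4   # model is almost always right
uncertain = [math.log(0.25)] * 4   # model spreads mass over ~4 choices

print(perplexity(confident))  # close to 1
print(perplexity(uncertain))  # close to 4
```

    A perplexity of k can be read as the model being, on average, as uncertain as a uniform choice among k tokens, which is why values approaching 1 indicate near-deterministic prediction.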

    Devil is Virtual: Reversing Virtual Inheritance in C++ Binaries

    Full text link
    Complexities that arise from the implementation of object-oriented concepts in C++, such as virtual dispatch and dynamic type casting, have attracted the attention of attackers and defenders alike. Binary-level defenses depend on full and precise recovery of the class inheritance tree of a given program. While current solutions focus on recovering single and multiple inheritances from the binary, they are oblivious to virtual inheritance. Conventional wisdom among binary-level defenses is that virtual inheritance is uncommon and/or that support for single and multiple inheritances provides implicit support for virtual inheritance. In this paper, we show neither to be true. Specifically, (1) we present an efficient technique to detect virtual inheritance in C++ binaries and show through a study that virtual inheritance can be found in a non-negligible number (more than 10% on Linux and 12.5% on Windows) of real-world C++ programs including Mysql and libstdc++. (2) We show that failure to handle virtual inheritance introduces both false positives and false negatives in the hierarchy tree. These false positives and negatives either introduce attack surface when the recovered hierarchy is used to enforce CFI policies, or make the hierarchy difficult to understand when it is needed for program understanding (e.g., during decompilation). (3) We present a solution to recover virtual inheritance from COTS binaries. We recover a maximum of 95% and 95.5% (GCC -O0) and a minimum of 77.5% and 73.8% (Clang -O2) of virtual and intermediate bases respectively in the virtual inheritance tree. Comment: Accepted at CCS20. This is a technical report versio

    Entrainer-Based Reactive Distillation for Esterification of Glycerol with Acetic Acid

    No full text
    The applicability of reactive distillation for esterification of glycerol with acetic acid in the presence of Amberlyst-15 as catalyst and ethylene dichloride as an entrainer is evaluated through experiments and simulation. The reaction is studied in both semibatch and continuous reactive distillation systems. The effect of different parameters such as entrainer amount, catalyst loading, and reboiler duty is studied. The results indicate that entrainer-based semibatch reactive distillation can enhance the selectivity toward triacetin to about 100%, which is much greater than that offered by any conventional reactor with stoichiometric mole ratio of reactants. Simulations for both semibatch and continuous reactive distillation are performed, and results agree reasonably well with those obtained by experiments. The best possible design and operating parameters are obtained through detailed simulation using an experimentally validated model. A column configuration is recommended for a continuous process.

    Transacetalization of Glycerol with Methylal by Reactive Distillation

    No full text
    The applicability of reactive distillation (RD) for the transacetalization of glycerol with methylal in the presence of Amberlyst-15 is studied by experiments and simulation. On the basis of the batch kinetic runs, a pseudohomogeneous kinetic model is proposed. The experiments are performed on a continuous reactive distillation column and are compared with the predictions of the equilibrium stage model. Various feasible configurations of reactive distillation are identified, and the experimentally validated simulator is used to investigate the effect of different design and operating parameters such as number of rectifying stages, stripping stages, feed mole ratio, reboiler duty, etc. on the performance in each case. The RD process alternatives and the conventional process of reaction followed by distillation are compared.

    Acetalization of Glycerol with Formaldehyde by Reactive Distillation

    No full text
    The feasibility of reactive distillation (RD) for the reversible acetalization of glycerol with formaldehyde is evaluated through experiments and simulations. Simultaneous removal of acetal and water from the reactive zone of the RD column helps shift the reaction in the forward direction and achieve close to quantitative conversion levels. The results of laboratory-scale RD experiments performed in this study are compared with those predicted by simulation using the kinetics developed in the present work. Since commercial formaldehyde is available as an aqueous solution, a large amount of water has to be removed to achieve substantial conversion. An experimentally validated simulator is thus used to design an appropriate RD configuration that offers minimum energy consumption. Toluene is used as an entrainer to remove water from the RD column. The process is compared with the reported indirect route of transacetalization of glycerol with methylal.

    Quantitative Detection of PEGylated Biomacromolecules in Biological Fluids by NMR

    No full text
    The accumulation, biodistribution, and clearance profiles of therapeutic agents are key factors relevant to their efficacy. Determining these properties constitutes an ongoing experimental challenge. Many such therapeutics, including small molecules, peptides, proteins, tissue scaffolds, and drug delivery vehicles, are conjugated to poly(ethylene glycol) (PEG) as this improves their bioavailability and in vivo stability. We demonstrate here that ¹H NMR spectroscopy can be used to quantify PEGylated species in complex biological fluids directly, rapidly, and with minimal sample preparation. PEG bears a large number of spectroscopically equivalent protons exhibiting a narrow NMR line width while resonating at a ¹H NMR frequency distinct from most other biochemical signals. We demonstrate that PEG provides a robust signal allowing detection of concentrations as low as 10 μg/mL in blood. This PEG detection limit is lowered by another order of magnitude when background proton signals are minimized using ¹³C-enriched PEG in combination with a double quantum filter to remove ¹H signals from non-¹³C-labeled species. Quantitative detection of PEG via these methods is shown in pig blood and goat serum as examples of complex biological fluids. More practically, we quantify the blood clearance of ¹³C-PEG and PEGylated-BSA (bovine serum albumin) following their intravenous injection in live rats. Given the relative insensitivity of line width to PEG size, we anticipate that the biodistribution and clearance profiles of virtually any PEGylated biomacromolecule from biological fluid samples can be routinely measured by ¹H NMR without any filtering or treatment steps.