1,119 research outputs found
On the impact of tokenizer and parameters on N-gram based Code Analysis
Recent research shows that language models, such as n-gram models, are useful at a wide variety of software engineering tasks, e.g., code completion, bug identification, code summarisation, etc. However, such models require the appropriate set of numerous parameters. Moreover, the different ways one can read code essentially yield different models (based on the different sequences of tokens). In this paper, we focus on n- gram models and evaluate how the use of tokenizers, smoothing, unknown threshold and n values impact the predicting ability of these models. Thus, we compare the use of multiple tokenizers and sets of different parameters (smoothing, unknown threshold and n values) with the aim of identifying the most appropriate combinations. Our results show that the Modified Kneser-Ney smoothing technique performs best, while n values are depended on the choice of the tokenizer, with values 4 or 5 offering a good trade-off between entropy and computation time. Interestingly, we find that tokenizers treating the code as simple text are the most robust ones. Finally, we demonstrate that the differences between the tokenizers are of practical importance and have the potential of changing the conclusions of a given experiment
PRETZEL: Opening the Black Box of Machine Learning Prediction Serving Systems
Machine Learning models are often composed of pipelines of transformations.
While this design allows to efficiently execute single model components at
training time, prediction serving has different requirements such as low
latency, high throughput and graceful performance degradation under heavy load.
Current prediction serving systems consider models as black boxes, whereby
prediction-time-specific optimizations are ignored in favor of ease of
deployment. In this paper, we present PRETZEL, a prediction serving system
introducing a novel white box architecture enabling both end-to-end and
multi-model optimizations. Using production-like model pipelines, our
experiments show that PRETZEL is able to introduce performance improvements
over different dimensions; compared to state-of-the-art approaches PRETZEL is
on average able to reduce 99th percentile latency by 5.5x while reducing memory
footprint by 25x, and increasing throughput by 4.7x.Comment: 16 pages, 14 figures, 13th USENIX Symposium on Operating Systems
Design and Implementation (OSDI), 201
2kenize: Tying Subword Sequences for Chinese Script Conversion
Simplified Chinese to Traditional Chinese character conversion is a common
preprocessing step in Chinese NLP. Despite this, current approaches have poor
performance because they do not take into account that a simplified Chinese
character can correspond to multiple traditional characters. Here, we propose a
model that can disambiguate between mappings and convert between the two
scripts. The model is based on subword segmentation, two language models, as
well as a method for mapping between subword sequences. We further construct
benchmark datasets for topic classification and script conversion. Our proposed
method outperforms previous Chinese Character conversion approaches by 6 points
in accuracy. These results are further confirmed in a downstream application,
where 2kenize is used to convert pretraining dataset for topic classification.
An error analysis reveals that our method's particular strengths are in dealing
with code-mixing and named entities.Comment: Accepted to ACL 202
Static malware detection Using Stacked BiLSTM and GPT-2
In recent years, cyber threats and malicious software attacks have been escalated on various platforms. Therefore, it has become essential to develop automated machine learning methods for defending against malware. In the present study, we propose stacked bidirectional long short-term memory (Stacked
BiLSTM) and generative pre-trained transformer based (GPT-2) deep learning language models for detecting malicious code. We developed language models using assembly instructions extracted from .text sections of malicious and benign Portable Executable (PE) files. We treated each instruction as a sentence and each .text section as a document. We also labeled each sentence and document as benign or malicious, according to the file source. We created three datasets from those sentences and documents. The first dataset, composed of documents, was fed into a Document Level Analysis Model (DLAM) based on Stacked BiLSTM. The second dataset, composed of sentences, was used in Sentence Level Analysis
Models (SLAMs) based on Stacked BiLSTM and DistilBERT, Domain Specific Language Model GPT-2
(DSLM-GPT2), and General Language Model GPT-2 (GLM-GPT2). Lastly, we merged all assembly
instructions without labels for creating the third dataset; then we fed a custom pre-trained model with it.
We then compared malware detection performances. The results showed that the pre-trained model improved the DSLM-GPT2 and GLM-GPT2 detection performance. The experiments showed that the DLAM, the SLAM based on DistilBERT, the DSLM-GPT2, and the GLM-GPT2 achieved 98.3%, 70.4%, 86.0%, and 76.2% F1 scores, respectively
CORLEONE - Core Linguistic Entity Online Extraction
This report presents CORLEONE (Core Linguistic Entity Online Extraction) - a
pool of loosely coupled general-purpose basic lightweight linguistic processing resources, which can be independently used
to identify core linguistic entities and their features in free texts. Currently, CORLEONE consists of five processing resources:
(a) a basic tokenizer, (b) a tokenizer which performs fine-grained token classification, (c) a component for performing morphological analysis, and (d) a memory-efficient database-like dictionary look-up component, and (e) sentence splitter. Linguistic
resources for several languages are provided. Additionally, CORLEONE includes a comprehensive library of string distance metrics relevant for the task of name variant matching. CORLEONE has been developed in the Java programming language and heavily deploys
state-of-the-art finite-state techniques.
Noteworthy, CORLEONE components are used as basic linguistic processing resources in ExPRESS, a pattern matching engine based on regular expressions over feature structures and in the real-time news event extraction system, which were
developed by the Web Mining and Intelligence Group of the Support to External Security Unit of IPSC.
This report constitutes an end-user guide for COLREONE and provides scientifically interesting details of how it
was implemented.JRC.G.2-Support to external securit
Outline, Then Details: Syntactically Guided Coarse-To-Fine Code Generation
For a complicated algorithm, its implementation by a human programmer usually
starts with outlining a rough control flow followed by iterative enrichments,
eventually yielding carefully generated syntactic structures and variables in a
hierarchy. However, state-of-the-art large language models generate codes in a
single pass, without intermediate warm-ups to reflect the structured thought
process of "outline-then-detail". Inspired by the recent success of
chain-of-thought prompting, we propose ChainCoder, a program synthesis language
model that generates Python code progressively, i.e. from coarse to fine in
multiple passes. We first decompose source code into layout frame components
and accessory components via abstract syntax tree parsing to construct a
hierarchical representation. We then reform our prediction target into a
multi-pass objective, each pass generates a subsequence, which is concatenated
in the hierarchy. Finally, a tailored transformer architecture is leveraged to
jointly encode the natural language descriptions and syntactically aligned I/O
data samples. Extensive evaluations show that ChainCoder outperforms
state-of-the-arts, demonstrating that our progressive generation eases the
reasoning procedure and guides the language model to generate higher-quality
solutions. Our codes are available at:
https://github.com/VITA-Group/ChainCoder.Comment: Accepted in ICML 202
Sub-Character Tokenization for Chinese Pretrained Language Models
Tokenization is fundamental to pretrained language models (PLMs). Existing
tokenization methods for Chinese PLMs typically treat each character as an
indivisible token. However, they ignore the unique feature of the Chinese
writing system where additional linguistic information exists below the
character level, i.e., at the sub-character level. To utilize such information,
we propose sub-character (SubChar for short) tokenization. Specifically, we
first encode the input text by converting each Chinese character into a short
sequence based on its glyph or pronunciation, and then construct the vocabulary
based on the encoded text with sub-word tokenization. Experimental results show
that SubChar tokenizers have two main advantages over existing tokenizers: 1)
They can tokenize inputs into much shorter sequences, thus improving the
computational efficiency. 2) Pronunciation-based SubChar tokenizers can encode
Chinese homophones into the same transliteration sequences and produce the same
tokenization output, hence being robust to all homophone typos. At the same
time, models trained with SubChar tokenizers perform competitively on
downstream tasks. We release our code at
https://github.com/thunlp/SubCharTokenization to facilitate future work.Comment: This draft supersedes the previous version named "SHUOWEN-JIEZI:
Linguistically Informed Tokenizers For Chinese Language Model Pretraining
Ensemble Learning on Deep Neural Networks for Image Caption Generation
abstract: Capturing the information in an image into a natural language sentence is
considered a difficult problem to be solved by computers. Image captioning involves not just detecting objects from images but understanding the interactions between the objects to be translated into relevant captions. So, expertise in the fields of computer vision paired with natural language processing are supposed to be crucial for this purpose. The sequence to sequence modelling strategy of deep neural networks is the traditional approach to generate a sequential list of words which are combined to represent the image. But these models suffer from the problem of high variance by not being able to generalize well on the training data.
The main focus of this thesis is to reduce the variance factor which will help in generating better captions. To achieve this, Ensemble Learning techniques have been explored, which have the reputation of solving the high variance problem that occurs in machine learning algorithms. Three different ensemble techniques namely, k-fold ensemble, bootstrap aggregation ensemble and boosting ensemble have been evaluated in this thesis. For each of these techniques, three output combination approaches have been analyzed. Extensive experiments have been conducted on the Flickr8k dataset which has a collection of 8000 images and 5 different captions for every image. The bleu score performance metric, which is considered to be the standard for evaluating natural language processing (NLP) problems, is used to evaluate the predictions. Based on this metric, the analysis shows that ensemble learning performs significantly better and generates more meaningful captions compared to any of the individual models used.Dissertation/ThesisMasters Thesis Software Engineering 201
- …