<p>This repository contains the replication package for the paper <em>"The EarlyBIRD Catches the Bug: On Exploiting Early Layers of Encoder Models for More Efficient Code Classification"</em> by Anastasiia Grishina, Max Hort and Leon Moonen, published in the ACM Joint European Software Engineering Conference and Symposium on the Foundations of Software Engineering (ESEC/FSE 2023).</p>
<p>The paper is deposited on <a href="https://arxiv.org/abs/2305.04940">arXiv</a> and available under open access at the publisher's site (<a href="https://doi.org/10.1145/3611643.3616304">ACM</a>); a copy is also included in this repository.</p>
<p>The replication package is archived on Zenodo with DOI: <a href="https://doi.org/10.5281/zenodo.7608802">10.5281/zenodo.7608802</a>. The source code is distributed under the MIT license, and the data is distributed under the CC BY 4.0 license. The source code is also available on GitHub via <a href="https://github.com/secureIT-project/earlybird">https://github.com/secureIT-project/earlybird</a>.</p>
<p> </p>
<p><strong>Citation</strong></p>
<p>If you build on this data or code, please cite this work by referring to the paper:</p>
<pre><code>@inproceedings{grishina2023:earlybird,
title = {The EarlyBIRD Catches the Bug: On Exploiting Early Layers of
Encoder Models for More Efficient Code Classification},
author = {Anastasiia Grishina and Max Hort and Leon Moonen},
booktitle = {ACM Joint European Software Engineering Conference and Symposium
on the Foundations of Software Engineering (ESEC/FSE)},
year = {2023},
publisher = {ACM},
doi = {10.1145/3611643.3616304},
note = {Pre-print on arXiv at https://arxiv.org/abs/2305.04940}
}</code></pre>
<p> </p>
<p><strong>Organization</strong></p>
<p>The replication package is organized as follows:</p>
<ul>
<li>
<p>src - the source code</p>
</li>
<li>
<p>requirements - txt files with Python packages and versions for replication</p>
</li>
<li>
<p>data - all raw datasets used for training</p>
<ul>
<li>raw
<ul>
<li>devign - Devign</li>
<li>reveal - ReVeal</li>
<li>break_it_fix_it - BIFI dataset</li>
<li>exception - Exception Type dataset</li>
</ul>
</li>
</ul>
</li>
<li>
<p>mlruns - results of experiments; the folder is distributed empty and is filled once <code>run.py</code> is executed (see part II)</p>
</li>
<li>
<p>output - results of experiments</p>
<ul>
<li>tables
<ul>
<li>mlflow_&lt;dataset_name&gt;.csv - we used MLflow to log metrics and parameters in our experiments and generated the .csv files with the <code>mlflow experiments csv -x &lt;experiment_number&gt; -o mlflow_&lt;dataset_name&gt;.csv</code> command</li>
</ul>
</li>
<li>figures - figures reported in the paper</li>
<li>runs - folder to store model checkpoints, if the corresponding argument is provided when running the code</li>
</ul>
</li>
<li>
<p>model-checkpoints - models with the best F1-weighted score on each of the four datasets - one model for one dataset. Note that the best model is not always the model with the best average improvement over the baseline reported in the paper, because of possible best-performing outliers. This folder is <a href="https://doi.org/10.5281/zenodo.7608802">distributed</a> as a separate file called <code>EarlyBIRD_model-checkpoints.zip</code> (~4.5GB).</p>
</li>
<li>
<p>notebooks - one Jupyter notebook with code to generate figures and tables with aggregated results as reported in the paper </p>
</li>
</ul>
<p> </p>
<p><strong>Usage</strong></p>
<p>Python version: <code>3.7.9</code> (later versions should also work well); CUDA version: <code>11.6</code>; Git LFS.</p>
<p>Commands below work well on Mac or Linux and should be adapted if you have a Windows machine. </p>
<p><strong><em>I. Set up data, environment and code</em></strong></p>
<p><em>1. Path to project directory</em></p>
<p>Update <code>~/path/to/EarlyBIRD</code> to point at your local copy of EarlyBIRD:</p>
<pre><code>export EarlyBIRD=~/path/to/EarlyBIRD</code></pre>
<p><em>2. Download codebert checkpoint</em></p>
<p>Please install Git LFS: <a href="https://docs.github.com/en/repositories/working-with-files/managing-large-files/installing-git-large-file-storage">https://docs.github.com/en/repositories/working-with-files/managing-large-files/installing-git-large-file-storage</a></p>
<p>Run the following to download the checkpoint into <code>EarlyBIRD/checkpoints/reused/model</code>:</p>
<pre><code>cd $EarlyBIRD
mkdir -p checkpoints/reused/model
cd checkpoints/reused/model
git lfs install
git clone https://huggingface.co/microsoft/codebert-base
cd codebert-base/
git lfs pull
cd $EarlyBIRD</code></pre>
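<p>If Git LFS is not set up correctly, the clone can leave small text pointer files in place of the model weights. The sketch below is one way to check for that. It assumes the weights file is named <code>pytorch_model.bin</code> (an assumption about the checkpoint layout; adjust as needed), and the helper <code>is_lfs_pointer</code> is hypothetical and not part of this package.</p>

```python
from pathlib import Path

# First line of every Git LFS pointer file, per the Git LFS pointer format.
LFS_HEADER = b"version https://git-lfs.github.com/spec/v1"


def is_lfs_pointer(path):
    """Return True if `path` is a Git LFS pointer rather than real content."""
    with open(path, "rb") as handle:
        return handle.read(len(LFS_HEADER)) == LFS_HEADER


if __name__ == "__main__":
    # Assumed file name; adjust if the checkpoint layout differs.
    weights = Path("checkpoints/reused/model/codebert-base/pytorch_model.bin")
    if not weights.exists():
        print(f"{weights} not found; was the repository cloned?")
    elif is_lfs_pointer(weights):
        print("Found an LFS pointer; run `git lfs pull` inside codebert-base/")
    else:
        print(f"Weights look complete ({weights.stat().st_size} bytes)")
```

<p>Run it from within <code>EarlyBIRD/</code> so that the relative path resolves.</p>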
<p><em>3. Set up a virtual environment</em></p>
<pre><code>cd $EarlyBIRD
python -m venv venv
source venv/bin/activate</code></pre>
<p>3.1 No CUDA</p>
<pre><code>python -m pip install -r requirements/requirements_no_cuda.txt</code></pre>
<p>3.2 With CUDA (to run on GPU)</p>
<pre><code>python -m pip install -r requirements/requirements_with_cuda.txt
python -m pip install torch torchvision torchaudio --extra-index-url https://download.pytorch.org/whl/cu116</code></pre>
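<p>Before running with <code>--device cuda</code>, it can help to confirm that the installed PyTorch build actually sees a GPU. This is a minimal sketch; the function name <code>cuda_status</code> is hypothetical and not part of this package.</p>

```python
def cuda_status():
    """Describe whether PyTorch is installed and can see a CUDA device."""
    try:
        import torch  # installed via the requirements files
    except ImportError:
        return "torch is not installed in this environment"
    if torch.cuda.is_available():
        return f"CUDA available: {torch.cuda.get_device_name(0)}"
    return "torch is installed, but no CUDA device is visible (CPU only)"


if __name__ == "__main__":
    print(cuda_status())
```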
<p><em>4. Preprocess data</em></p>
<p>After preprocessing, all datasets are stored in JSON Lines format (readable in Python with, e.g., the <code>jsonlines</code> package). Naming convention: <code>data/preprocessed-final/&lt;dataset_name&gt;/&lt;split&gt;.jsonl</code>, where split is one of <code>'train', 'valid', 'test'</code>. Each line holds one example:</p>
<pre><code>{"src": "def function_1() ...", "label": "Label1"}
{"src": "def function_2() ...", "label": "Label2"}
...</code></pre>
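<p>A preprocessed split can be inspected with a few lines of Python. The sketch below assumes each line is a valid JSON object with <code>src</code> and <code>label</code> fields, as in the example above; the helper names <code>read_jsonl</code> and <code>label_counts</code> are hypothetical and not part of this package.</p>

```python
import json
from collections import Counter
from pathlib import Path


def read_jsonl(path):
    """Yield one dict per non-empty line of a JSON Lines file."""
    with open(path, encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if line:
                yield json.loads(line)


def label_counts(path):
    """Count how often each value of the 'label' field occurs."""
    return Counter(record["label"] for record in read_jsonl(path))


if __name__ == "__main__":
    split = Path("data/preprocessed-final/devign/train.jsonl")
    if split.exists():
        for label, count in label_counts(split).most_common():
            print(f"{label}\t{count}")
    else:
        print(f"{split} not found; run the preprocessing step first")
```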
<p>4.1 Devign</p>
<p>Raw data is downloaded from <a href="https://drive.google.com/file/d/1x6hoF7G-tSYxg8AFybggypLZgMGDNHfF/view">https://drive.google.com/file/d/1x6hoF7G-tSYxg8AFybggypLZgMGDNHfF/view</a>. Test, train, valid txt files are downloaded from <a href="https://github.com/microsoft/CodeXGLUE/tree/main/Code-Code/Defect-detection/dataset">https://github.com/microsoft/CodeXGLUE/tree/main/Code-Code/Defect-detection/dataset</a>. All files are saved in <code>data/raw/devign</code>.</p>
<p>To preprocess raw data and save tokenization statistics with the specified tokenizer:</p>
<pre><code>cd $EarlyBIRD
python -m src.preprocess \
--dataset_name devign \
--shrink_code \
--config_path src/config.yaml \
--tokenizer_path "checkpoints/reused/model/codebert-base"</code></pre>
<p>4.2 ReVeal</p>
<p>Raw data is downloaded from <a href="https://github.com/VulDetProject/ReVeal">https://github.com/VulDetProject/ReVeal</a> under "Our Collected vulnerabilities from Chrome and Debian issue trackers (Often referred as Chrome+Debian or Verum dataset in this project)" and saved in <code>data/raw/reveal</code>.</p>
<p>To preprocess raw data and save tokenization statistics with the specified tokenizer:</p>
<pre><code>cd $EarlyBIRD
python -m src.preprocess \
--dataset_name reveal \
--shrink_code \
--config_path src/config.yaml \
--tokenizer_path "checkpoints/reused/model/codebert-base"</code></pre>
<p>4.3 Break-it-fix-it</p>
<p>Raw data is downloaded as <code>data_minimal.zip</code> from <a href="https://github.com/michiyasunaga/BIFI">https://github.com/michiyasunaga/BIFI</a> under p. 1, unzipped, and the folder <code>orig_bad_code</code> is saved in <code>data/raw/break_it_fix_it</code>.</p>
<p>To preprocess raw data and save tokenization statistics with the specified tokenizer:</p>
<pre><code>cd $EarlyBIRD
python -m src.preprocess \
--dataset_name break_it_fix_it \
--shrink_code \
--ratio_train 0.9 \
--config_path src/config.yaml \
--tokenizer_path "checkpoints/reused/model/codebert-base"</code></pre>
<p>Note: The original dataset contains only train and test splits. Use <code>--ratio_train</code> to specify the fraction of the original train split (orig-train) that is used for training; the rest of orig-train is used for validation during training.</p>
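<p>To illustrate what a train ratio of 0.9 means, here is a minimal sketch of a ratio-based split. It is an assumption-laden illustration, not the code in <code>src.preprocess</code>: the function name <code>split_train_valid</code>, the shuffling, and the seed are all hypothetical, and the package may split differently.</p>

```python
import random


def split_train_valid(records, ratio_train=0.9, seed=42):
    """Shuffle `records` and split them into (train, valid) lists."""
    if not 0.0 < ratio_train < 1.0:
        raise ValueError("ratio_train must be strictly between 0 and 1")
    shuffled = list(records)
    # A dedicated Random instance keeps the split reproducible for a given seed.
    random.Random(seed).shuffle(shuffled)
    n_train = int(len(shuffled) * ratio_train)
    return shuffled[:n_train], shuffled[n_train:]


if __name__ == "__main__":
    train, valid = split_train_valid(range(1000), ratio_train=0.9)
    print(len(train), len(valid))  # 900 100
```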
<p>4.4 Exception Type</p>
<p>Raw data is downloaded from <a href="https://github.com/google-research/google-research/tree/master/cubert">https://github.com/google-research/google-research/tree/master/cubert</a> under "2. Exception classification" (it points to <a href="https://console.cloud.google.com/storage/browser/cubert/20200621_Python/exception_datasets;tab=objects?prefix=&forceOnObjectsSortingFiltering=false">this storage</a>) and saved in <code>data/raw/exception_type</code>.</p>
<p>To preprocess raw data and save tokenization statistics with the specified tokenizer:</p>
<pre><code>cd $EarlyBIRD
python -m src.preprocess \
--dataset_name exception \
--shrink_code \
--config_path src/config.yaml \
--tokenizer_path "checkpoints/reused/model/codebert-base"</code></pre>
<p> </p>
<p><strong><em>II. Run code</em></strong></p>
<p>Activate virtual environment (if not done so yet):</p>
<pre><code>cd $EarlyBIRD
source venv/bin/activate</code></pre>
<p><em>Example run</em></p>
<p>Run experiments on Devign with models pruned to 3 layers (<code>--combination_type cutoff_layers_one_layer_cls</code> together with <code>--hidden_layer_to_use 3</code>), for example:</p>
<pre><code>cd $EarlyBIRD
python -m src.run --help # for help with command line args
python -m src.run \
--config_path src/config.yaml \
--model_name codebert \
--model_path "checkpoints/reused/model/codebert-base" \
--tokenizer_path "checkpoints/reused/model/codebert-base" \
--dataset_name devign \
--benchmark_name acc \
--train \
--test \
-warmup 0 \
--device cuda \
--epochs 10 \
-clf one_linear_layer \
--combination_type cutoff_layers_one_layer_cls \
--hidden_layer_to_use 3 \
--experiment_no 12 \
--seed 42</code></pre>
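<p>To repeat the example run for several layers or seeds, the command can be generated from Python. The sketch below only rebuilds the command shown above with a different <code>--hidden_layer_to_use</code> and <code>--seed</code>; the function name <code>build_run_command</code> and the sweep values are hypothetical, and all other arguments are copied unchanged from the example.</p>

```python
import shlex
import sys


def build_run_command(layer, seed, dataset="devign", epochs=10, device="cuda"):
    """Return the argument list for one `python -m src.run` invocation."""
    checkpoint = "checkpoints/reused/model/codebert-base"
    return [
        sys.executable, "-m", "src.run",
        "--config_path", "src/config.yaml",
        "--model_name", "codebert",
        "--model_path", checkpoint,
        "--tokenizer_path", checkpoint,
        "--dataset_name", dataset,
        "--benchmark_name", "acc",
        "--train", "--test",
        "-warmup", "0",
        "--device", device,
        "--epochs", str(epochs),
        "-clf", "one_linear_layer",
        "--combination_type", "cutoff_layers_one_layer_cls",
        "--hidden_layer_to_use", str(layer),
        "--experiment_no", "12",  # copied from the example run
        "--seed", str(seed),
    ]


if __name__ == "__main__":
    # Hypothetical sweep values; pick the layers and seeds you need.
    for layer in (1, 3, 6):
        for seed in (42, 43):
            command = build_run_command(layer, seed)
            print(" ".join(shlex.quote(part) for part in command))
            # To launch the run from within EarlyBIRD/, uncomment:
            # import subprocess
            # subprocess.run(command, check=True)
```

<p>As written, the script only prints the commands, so they can be reviewed before anything is launched.</p>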
<p>To run experiments on a small subset of data, use the <code>--debug</code> argument. For example:</p>
<pre><code>python -m src.run \
--debug \
--config_path src/config.yaml \
--model_name codebert \
--model_path "checkpoints/reused/model/codebert-base" \
--tokenizer_path "checkpoints/reused/model/codebert-base" \
--dataset_name devign \
--benchmark_name acc \
--train \
--test \
-warmup 0 \
--device cuda \
--epochs 2 \
-clf one_linear_layer \
--combination_type cutoff_layers_one_layer_cls \
--hidden_layer_to_use 3 \
--experiment_no 12 \
--seed 42</code></pre>
<p> </p>
<p><strong>Explore output</strong></p>
<p>Your <code>EarlyBIRD/</code> folder should contain <code>mlruns/</code>. If you started <code>run.py</code> from another location, you will find <code>mlruns/</code> one level below that location.</p>
<pre><code>cd $EarlyBIRD
mlflow ui</code></pre>
<p>Alternatively, find tables in <code>EarlyBIRD/output/tables/</code> with best epoch logs and logs of all epochs. </p>
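<p>The exported tables can also be inspected without MLflow. The sketch below lists each <code>mlflow_*.csv</code> file in <code>output/tables/</code> with its row count and column names. It makes no assumption about which columns are present; the helper name <code>csv_overview</code> is hypothetical.</p>

```python
import csv
from pathlib import Path


def csv_overview(path):
    """Return (column_names, number_of_data_rows) for a CSV file."""
    with open(path, newline="", encoding="utf-8") as handle:
        reader = csv.reader(handle)
        header = next(reader, [])
        n_rows = sum(1 for _ in reader)
    return header, n_rows


if __name__ == "__main__":
    # Run from within EarlyBIRD/ so that the relative path resolves.
    for table in sorted(Path("output/tables").glob("mlflow_*.csv")):
        columns, n_rows = csv_overview(table)
        print(f"{table.name}: {n_rows} rows, {len(columns)} columns")
        print("  " + ", ".join(columns))
```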
<p> </p>
<p><strong>ChangeLog</strong></p>
<ul>
<li>v1.0 - corresponds to the version submitted for review to ESEC/FSE 2023 and contains code for using CodeBERT as a base model for fine-tuning, extensive logging in MLflow and a custom table, as well as replication instructions.</li>
<li>v1.1 - corresponds to the camera-ready submission for ESEC/FSE 2023 and contains the code with configurations adapted to use more models for fine-tuning, logging in MLflow (redundant logging in a custom table is removed), Jupyter notebooks to replicate artifacts in the paper, as well as replication instructions and model checkpoints.</li>
<li>v1.2 - updated code with documentation and type hints; added a link to the public GitHub repository to the README.</li>
</ul>
<p> </p>
<p><strong>Acknowledgement</strong></p>
<p>The work included in this repository was supported by the Research Council of Norway through the secureIT project (IKTPLUSS #288787). Max Hort is supported through the ERCIM 'Alain Bensoussan' Fellowship Programme. The empirical evaluation was performed on the Experimental Infrastructure for Exploration of Exascale Computing (eX3), financially supported by the Research Council of Norway under contract #270053, as well as on resources provided by Sigma2, the National Infrastructure for High Performance Computing and Data Storage in Norway.</p>