Natural language to code generation is an important application area of LLMs
and has received wide attention from the community. The majority of relevant
studies have concentrated exclusively on increasing the quantity and functional
correctness of training sets while disregarding stylistic elements of
programs. More recently, data quality has garnered significant interest, and
multiple works have showcased its importance for improving performance. In this
work, we investigate data quality for code and find that making the code more
structured and readable leads to improved code generation performance. We build
a novel data-cleaning pipeline that uses these principles to
transform existing programs by (1) renaming variables, (2) modularizing and
decomposing complex code into smaller helper sub-functions, and (3) inserting
natural-language plans, using LLM-based transformations (sketched below). We evaluate our
approach on two challenging algorithmic code generation benchmarks and find
that fine-tuning CodeLLaMa-7B on our transformed, modularized programs improves
performance by up to 30% compared to fine-tuning on the original dataset.
Additionally, we demonstrate improved performance from using a smaller amount
of higher-quality data, finding that a model fine-tuned on the entire original
dataset is outperformed by a model trained on 15% of our cleaned dataset. Even
in comparison to closed-source models, our models outperform the much larger
AlphaCode models.
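As a minimal, hypothetical sketch of what the three transformations look like on a toy program, consider the before/after pair below. The task, identifiers, and plan text are illustrative assumptions rather than examples drawn from the benchmarks; in the actual pipeline, an LLM performs such rewrites automatically.

```python
# --- Before: a typical scraped solution (opaque names, monolithic body) ---
def solve(a):
    d = {}
    for x in a:
        d[x] = d.get(x, 0) + 1
    r, b = 0, None
    for k, v in d.items():
        if v > r:
            r, b = v, k
    return b


# --- After: cleaned version (renamed variables, helper sub-functions, plan) ---
# Plan:
# 1. Count how many times each value occurs.
# 2. Return the value with the highest count.
def count_occurrences(values):
    """Map each value to the number of times it appears in `values`."""
    counts = {}
    for value in values:
        counts[value] = counts.get(value, 0) + 1
    return counts


def most_frequent_value(values):
    """Return the value that occurs most often in `values`."""
    counts = count_occurrences(values)
    return max(counts, key=counts.get)


if __name__ == "__main__":
    sample = [3, 1, 3, 2, 3, 1]
    assert solve(sample) == most_frequent_value(sample) == 3
    print(most_frequent_value(sample))  # 3
```

The "after" form is the kind of target the fine-tuned model is trained on: descriptive names, small single-purpose helpers, and a short natural-language plan preceding the code.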