With the development of pre-trained language models, automated code
generation techniques have shown great promise in recent years. However, the
generated code often fails to satisfy the syntactic constraints of the target
language, especially for Turducken-style code, where declarative
code snippets are embedded within imperative programs. In this study, we
distill the difficulty of enforcing syntactic constraints into three significant challenges:
(1) the efficient representation of syntactic constraints, (2) the effective
integration of syntactic information, and (3) the scalable syntax-first
decoding algorithm. To address these challenges, we propose TurduckenGen, a
syntax-guided multi-task learning approach. Specifically, we first explicitly
append type information to the code tokens to capture the representation of
syntactic constraints. Then we formalize code generation with syntactic
constraint representation as an auxiliary task to enable the model to learn the
syntactic constraints of the code. Finally, syntactically correct code is
accurately selected from multiple candidates with the help of compiler
feedback. Extensive experiments and comprehensive analysis demonstrate the
effectiveness and general applicability of our approach in comparison
with six state-of-the-art baselines on two Turducken-style code datasets.
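The first and third steps above can be illustrated with a minimal sketch. The token/type pairing, the type labels (e.g., `KW`, `ID`), and the `compiles` predicate below are hypothetical stand-ins, not the paper's actual annotation scheme or compiler interface:

```python
def annotate_tokens(tokens, types):
    """Append a syntactic type tag to each code token,
    e.g. 'SELECT' with type 'KW' becomes 'SELECT<KW>'."""
    return [f"{tok}<{typ}>" for tok, typ in zip(tokens, types)]

def select_syntactically_valid(candidates, compiles):
    """Pick the first candidate accepted by the compiler check;
    fall back to the top-ranked candidate if none pass."""
    for cand in candidates:
        if compiles(cand):
            return cand
    return candidates[0] if candidates else None

# Example: a declarative (SQL-like) snippet embedded in a host program
tokens = ["SELECT", "name", "FROM", "users"]
types = ["KW", "ID", "KW", "ID"]
print(annotate_tokens(tokens, types))
```

In the actual approach, the type-annotated sequence serves as the auxiliary-task target, and the candidate filter is driven by real compiler feedback rather than a simple predicate.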
Moreover, we conducted a human study and found that the code generated by our
approach is better than that of the baselines in terms of code readability and
semantic similarity.

Comment: Accepted in Empirical Software Engineering