Deep learning (DL) is widely used to uncover hidden patterns in large code
corpora. Doing so requires a representation that captures the relevant
characteristics and features of source code. Graph-based
representations have gained attention for their ability to model structural and
semantic information. However, existing tools lack flexibility in constructing
graphs across different programming languages, limiting their use.
Additionally, the output of these tools often lacks interoperability and
consists of excessively large graphs, which makes training graph neural
networks (GNNs) slower and less scalable.
We introduce CONCORD, a domain-specific language for building customizable graph
representations. It implements reduction heuristics to reduce the size
complexity of the generated graphs. We demonstrate its effectiveness in code
smell detection as an
illustrative use case and show that: first, CONCORD can produce code
representations automatically per the specified configuration, and second, our
heuristics can achieve comparable performance with significantly smaller graphs.
CONCORD will help researchers a) create and experiment with customizable
graph-based code representations for different software engineering tasks
involving DL, b) reduce the engineering work to generate graph representations,
c) address the issue of scalability in GNN models, and d) enhance the
reproducibility of experiments in research through a standardized approach to
code representation and analysis.