Evaluating Language Models for Computer Graphics Code Completion
Evaluation benchmarks are essential for developing and training language models, providing both comparison and optimization targets. Existing code completion benchmarks, often based on standalone Python functions and unit tests, are overly simplistic, suffer from contamination, and fail to reflect real-world scenarios. In this paper, we present ShaderMatch, a novel benchmark for code completion in computer graphics programming. The benchmark is derived from real-world fragment shaders written in the OpenGL Shading Language (GLSL) on the Shadertoy platform, forming a zero-shot function completion task with 467 function headers and user-written comments as input. In addition, we propose a two-step evaluation metric: static code comparison followed by frame rendering comparison. ShaderMatch also introduces eight fine-grained result labels for a more detailed analysis of model behavior. We evaluate over 20 open-source code-specific models and highlight notable performance outliers. Results show that even top models fail to generate working code in 31% of cases, underscoring the challenge posed by GLSL, a low-resource language rarely found in pretraining datasets. ShaderMatch provides a well-annotated, extensible dataset for future research. Data, code, leaderboard, and discussions are available at: https://hf.co/spaces/Vipitis/shadermatch.
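To make the two-step metric concrete, below is a minimal sketch of the comparison flow in Python. It assumes the caller supplies a `render_frames` helper that compiles a full fragment shader and renders it at a few fixed timestamps (for example via a headless GPU context); the function names, the outcome labels, and the exact-equality frame check are illustrative simplifications, not the benchmark's actual implementation.

```python
import re
from typing import Callable, Sequence


def normalize_glsl(source: str) -> str:
    """Strip comments and collapse whitespace for a static text comparison."""
    no_comments = re.sub(r"//[^\n]*|/\*.*?\*/", "", source, flags=re.DOTALL)
    return " ".join(no_comments.split())


def shader_match(generated: str, reference: str,
                 render_frames: Callable[[str], Sequence[bytes]]) -> str:
    """Return a coarse outcome label for one completion.

    `render_frames` is an assumed helper that compiles a full fragment
    shader and returns rendered frames at fixed timestamps; it raises
    if the shader does not compile.
    """
    # Step 1: cheap static comparison of the completed code.
    if normalize_glsl(generated) == normalize_glsl(reference):
        return "code_match"

    # Step 2: render both shaders and compare the resulting frames.
    try:
        gen_frames = render_frames(generated)
    except Exception:
        return "compile_error"
    ref_frames = render_frames(reference)

    # Exact byte equality is a simplification; a tolerance-based image
    # comparison would be more robust in practice.
    if len(gen_frames) == len(ref_frames) and all(
        g == r for g, r in zip(gen_frames, ref_frames)
    ):
        return "frame_match"
    return "mismatch"
```

The cheap static comparison short-circuits the expensive rendering step, which is only needed when the completed code differs textually from the reference.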
Slides: LLM4Code slides_final5.pdf (1.98 MiB)