We gave four frontier coding models the same three game prompts and let them build. No edits, no retries. Kimi K2.7 Code from Moonshot AI, DeepSeek V4 Pro, Qwen3.7 Max from Alibaba, and GLM 5.2 from Z.ai each wrote a self-playing Snake, a self-playing Breakout, and a self-playing Pong, every one a single self-contained HTML file with no libraries. All four run on EmpirioLabs behind one OpenAI-compatible API.
Watch all four build it
How we ran it
Each prompt went to each model as one user message, one shot, rendered exactly as returned with no edits. Reasoning effort was set to max. No temperature override and no system prompt. Maximum output was 32000 tokens. Every prompt asked for a self-playing game as a single self-contained HTML file with all CSS and JavaScript inline, no external libraries, no CDN, and no imports.
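If you want to check the "single self-contained file" constraint mechanically rather than by eye, a rough sketch follows. It is not the check we ran; the names `ExternalRefFinder` and `find_external_refs` are made up for illustration. It only flags `<script src=...>` and `<link href=...>` tags, so it will miss an `import` inside inline JavaScript or an `@import` inside inline CSS.

```python
from html.parser import HTMLParser


class ExternalRefFinder(HTMLParser):
    """Collects <script src> and <link href> values found in an HTML document."""

    def __init__(self):
        super().__init__()
        self.external = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs with lowercased names.
        wanted = {"script": "src", "link": "href"}.get(tag)
        if wanted is None:
            return
        for name, value in attrs:
            if name == wanted and value:
                self.external.append((tag, value))


def find_external_refs(html_text):
    """Return a list of (tag, url) pairs; an empty list means none were found."""
    finder = ExternalRefFinder()
    finder.feed(html_text)
    finder.close()
    return finder.external
```

An empty result is a good sign, not proof: a file can still fetch resources at runtime from inline code.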
The results
All four models returned a working single file game on every prompt on the first try. Here is the size of each answer, in lines of the final HTML file.
| Test | Kimi K2.7 Code | DeepSeek V4 Pro | Qwen3.7 Max | GLM 5.2 |
|---|---|---|---|---|
| Self-playing Snake | 374 lines | 744 lines | 460 lines | 526 lines |
| Self-playing Breakout | 295 lines | 762 lines | 335 lines | 370 lines |
| Self-playing Pong | 240 lines | 640 lines | 258 lines | 321 lines |
What we noticed
Every model shipped a playable game first try, but they got there in very different ways. DeepSeek V4 Pro wrote by far the most code on all three tasks: more than twice the lines of every other model on Breakout, and more than twice the lines of Kimi K2.7 Code and Qwen3.7 Max on Pong. Kimi K2.7 Code was the most concise on every task. Qwen3.7 Max and GLM 5.2 landed in between. More lines is not better or worse on its own, so the thing to watch is how each game actually looks and plays in the clip. We are not naming a winner. Pick the one whose output fits how you like to work.
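To see how the line counts compare, here is a small sketch that divides DeepSeek V4 Pro's count by each other model's count. The numbers are copied from the results table; the helper name `ratios_to` is just for this example.

```python
# Line counts copied from the results table.
line_counts = {
    "Self-playing Snake": {
        "Kimi K2.7 Code": 374, "DeepSeek V4 Pro": 744,
        "Qwen3.7 Max": 460, "GLM 5.2": 526,
    },
    "Self-playing Breakout": {
        "Kimi K2.7 Code": 295, "DeepSeek V4 Pro": 762,
        "Qwen3.7 Max": 335, "GLM 5.2": 370,
    },
    "Self-playing Pong": {
        "Kimi K2.7 Code": 240, "DeepSeek V4 Pro": 640,
        "Qwen3.7 Max": 258, "GLM 5.2": 321,
    },
}


def ratios_to(reference, counts):
    """Return reference_lines / other_lines for every other model in one test."""
    ref = counts[reference]
    return {
        model: round(ref / lines, 2)
        for model, lines in counts.items()
        if model != reference
    }


for test, counts in line_counts.items():
    print(test, ratios_to("DeepSeek V4 Pro", counts))
```

On these numbers the ratio is above 2 in five of the nine pairings: all three on Breakout, two on Pong, and none on Snake, where the closest is 744 against 374.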
Run the same test yourself
All four serve the OpenAI-compatible Chat Completions API, so comparing them is a one-line change. Point base_url at https://api.empiriolabs.ai/v1 and set the model id.
```shell
curl https://api.empiriolabs.ai/v1/chat/completions \
  -H "Authorization: Bearer $EMPIRIOLABS_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "kimi-k2-7-code",
    "messages": [{"role": "user", "content": "Build a self-playing Snake game as a single HTML file, no libraries."}]
  }'
```
Swap "model" to deepseek-v4-pro, qwen3-7-max, or glm-5-2 and run it again. Every frontier model lives behind the same API, so you can compare them on your own prompts without changing your code. You can also run all four side by side in the playground.
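If you would rather script the swap than edit the command by hand, the sketch below sends the same request to all four model ids using only the Python standard library. The endpoint, model ids, prompt, and `EMPIRIOLABS_API_KEY` variable come from the curl example. The function names, the output filenames, and the `choices[0].message.content` response path are assumptions; the last one is the usual Chat Completions shape, so check it against a real response.

```python
import json
import os
import urllib.request

BASE_URL = "https://api.empiriolabs.ai/v1"
MODELS = ["kimi-k2-7-code", "deepseek-v4-pro", "qwen3-7-max", "glm-5-2"]
PROMPT = "Build a self-playing Snake game as a single HTML file, no libraries."


def build_request(model, prompt, api_key):
    """Build the same POST as the curl example, for one model id."""
    body = json.dumps({
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
    }).encode("utf-8")
    return urllib.request.Request(
        f"{BASE_URL}/chat/completions",
        data=body,
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


def run_all(prompt=PROMPT):
    """Send the prompt to each model and save each raw reply to its own file."""
    api_key = os.environ["EMPIRIOLABS_API_KEY"]
    for model in MODELS:
        with urllib.request.urlopen(build_request(model, prompt, api_key)) as resp:
            reply = json.load(resp)
        # Assumed response shape: the standard Chat Completions layout.
        text = reply["choices"][0]["message"]["content"]
        # Hypothetical filename, one file per model id.
        with open(f"{model}.txt", "w", encoding="utf-8") as f:
            f.write(text)
```

Call `run_all()` with the API key set in your environment. The reply is saved as returned, so you may need to pull the HTML out of any surrounding text before opening it in a browser.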
Frequently asked questions
Which coding models were tested?
Kimi K2.7 Code from Moonshot AI, DeepSeek V4 Pro, Qwen3.7 Max from Alibaba, and GLM 5.2 from Z.ai. All four run on EmpirioLabs through one OpenAI-compatible API.
What were the three tasks?
A self-playing Snake, a self-playing Breakout, and a self-playing Pong, each a single self-contained HTML file with no external libraries, that plays itself with no user input.
Was anything edited or retried?
No. Each model got one shot per prompt and we rendered exactly what it returned, working or not.
Which model wrote the most code?
DeepSeek V4 Pro wrote the most lines on all three tasks, and Kimi K2.7 Code wrote the fewest on all three. Line count measures size, not quality, so watch the clip to see how each game plays.
How do I switch between the models?
Change one string. All four serve the OpenAI Chat Completions API at https://api.empiriolabs.ai/v1, so you set the model id and keep the rest of the request unchanged.



