Kimi vs DeepSeek vs Qwen vs GLM: Coding (2026)

Four-way coding test: Kimi K2.7 Code, DeepSeek V4 Pro, Qwen3.7 Max, and GLM 5.2 each rendering a self-playing Breakout game from a single HTML file.

Jun 24, 2026

EmpirioLabs AI

We gave four frontier coding models the same three game prompts and let them build. No edits, no retries. Kimi K2.7 Code from Moonshot AI, DeepSeek V4 Pro, Qwen3.7 Max from Alibaba, and GLM 5.2 from Z.ai each wrote a self-playing Snake, a self-playing Breakout, and a self-playing Pong, every one a single self-contained HTML file with no libraries. All four run on EmpirioLabs behind one OpenAI-compatible API.

Watch all four build it

How we ran it

Each prompt went to each model as one user message, one shot, and the output was rendered exactly as returned with no edits. Reasoning effort was set to max. There was no temperature override and no system prompt. Maximum output was 32,000 tokens. Every prompt asked for a self-playing game as a single self-contained HTML file with all CSS and JavaScript inline, no external libraries, no CDN, and no imports.
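If you want to check the "single self-contained file" constraint on your own outputs, a rough scan like the one below can help. It is a heuristic sketch, not the check used for this test: the names `ExternalRefFinder` and `find_external_refs` are made up for illustration, and it only flags `<script src>`, `<link href>`, and `import` lines inside inline scripts.

```python
import re
from html.parser import HTMLParser


class ExternalRefFinder(HTMLParser):
    """Hypothetical helper: collects references that point outside the file."""

    def __init__(self):
        super().__init__()
        self.external = []        # (kind, value) pairs, in document order
        self._in_script = False   # True while inside an inline <script>

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "script":
            self._in_script = True
            if attrs.get("src"):
                self.external.append(("script", attrs["src"]))
        elif tag == "link" and attrs.get("href"):
            self.external.append(("link", attrs["href"]))

    def handle_endtag(self, tag):
        if tag == "script":
            self._in_script = False

    def handle_data(self, data):
        # Crude heuristic: a line of inline JavaScript that starts with `import`.
        if self._in_script:
            for line in data.splitlines():
                if re.match(r"\s*import\b", line):
                    self.external.append(("import", line.strip()))


def find_external_refs(html_text):
    """Return the list of external references found in an HTML string."""
    finder = ExternalRefFinder()
    finder.feed(html_text)
    finder.close()
    return finder.external
```

An empty list means the scan found nothing external. It does not prove the game works, so you still need to open the file in a browser.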

The results

All four models returned a working single-file game on every prompt on the first try. Here is the size of each answer, in lines of the final HTML file.

Test	Kimi K2.7 Code	DeepSeek V4 Pro	Qwen3.7 Max	GLM 5.2
Self-playing Snake	374 lines	744 lines	460 lines	526 lines
Self-playing Breakout	295 lines	762 lines	335 lines	370 lines
Self-playing Pong	240 lines	640 lines	258 lines	321 lines

What we noticed

Every model shipped a playable game first try, but they got there in very different ways. DeepSeek V4 Pro wrote by far the most code on all three tasks: roughly twice the lines of the others or more on Breakout and Pong, and between 1.4 and 2 times on Snake. Kimi K2.7 Code was the most concise on every task. Qwen3.7 Max and GLM 5.2 landed in between, with Qwen3.7 Max the shorter of the two each time. More lines is not better or worse on its own, so the thing to watch is how each game actually looks and plays in the clip. We are not naming a winner. Pick the one whose output fits how you like to work.
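To put numbers on the size gap, the line counts from the results table can be turned into ratios against the longest answer on each task. This is a small illustrative calculation; the function name `ratios_to_longest` is our own, and the data is copied straight from the table.

```python
# Line counts copied from the results table above.
LINES = {
    "Self-playing Snake": {
        "Kimi K2.7 Code": 374, "DeepSeek V4 Pro": 744,
        "Qwen3.7 Max": 460, "GLM 5.2": 526,
    },
    "Self-playing Breakout": {
        "Kimi K2.7 Code": 295, "DeepSeek V4 Pro": 762,
        "Qwen3.7 Max": 335, "GLM 5.2": 370,
    },
    "Self-playing Pong": {
        "Kimi K2.7 Code": 240, "DeepSeek V4 Pro": 640,
        "Qwen3.7 Max": 258, "GLM 5.2": 321,
    },
}


def ratios_to_longest(counts):
    """For one task, return how many times longer the longest answer is than each model's."""
    longest = max(counts.values())
    return {model: round(longest / n, 2) for model, n in counts.items()}


def totals(lines):
    """Sum each model's line count across all tasks."""
    out = {}
    for counts in lines.values():
        for model, n in counts.items():
            out[model] = out.get(model, 0) + n
    return out


for task, counts in LINES.items():
    print(task, ratios_to_longest(counts))
print("Totals:", totals(LINES))
```

Across the three tasks the totals come to 909 lines for Kimi K2.7 Code, 1,053 for Qwen3.7 Max, 1,217 for GLM 5.2, and 2,146 for DeepSeek V4 Pro.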

Run the same test yourself

All four serve the OpenAI-compatible Chat Completions API, so comparing them is a one-line change. Point base_url at https://api.empiriolabs.ai/v1 and set the model id.

curl https://api.empiriolabs.ai/v1/chat/completions \
  -H "Authorization: Bearer $EMPIRIOLABS_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "kimi-k2-7-code",
    "messages": [{"role": "user", "content": "Build a self-playing Snake game as a single HTML file, no libraries."}]
  }'

Swap "model" to deepseek-v4-pro, qwen3-7-max, or glm-5-2 and run it again. Every frontier model lives behind the same API, so you can compare them on your own prompts without changing your code. You can also run all four side by side in the playground.
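The same swap can be scripted. The sketch below sends one prompt to all four model ids using only the Python standard library. The endpoint and model ids come from the curl example above; the helper names `build_request` and `run_all`, the 600-second timeout, and the assumption that the response follows the usual Chat Completions shape (`choices[0].message.content`) are ours, so adjust them if the API behaves differently.

```python
import json
import urllib.request

BASE_URL = "https://api.empiriolabs.ai/v1"
MODELS = ["kimi-k2-7-code", "deepseek-v4-pro", "qwen3-7-max", "glm-5-2"]


def build_request(model, prompt, api_key):
    """Build the same POST request as the curl example, for one model id."""
    body = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
    }
    return urllib.request.Request(
        BASE_URL + "/chat/completions",
        data=json.dumps(body).encode("utf-8"),
        headers={
            "Authorization": "Bearer " + api_key,
            "Content-Type": "application/json",
        },
        method="POST",
    )


def run_all(prompt, api_key, timeout=600):
    """Send the prompt to every model and return {model id: returned text}.

    Assumes the standard Chat Completions response shape; the timeout is a
    guess, since long generations can take a while.
    """
    results = {}
    for model in MODELS:
        request = build_request(model, prompt, api_key)
        with urllib.request.urlopen(request, timeout=timeout) as response:
            payload = json.load(response)
        results[model] = payload["choices"][0]["message"]["content"]
    return results
```

To use it, call `run_all("Build a self-playing Snake game as a single HTML file, no libraries.", os.environ["EMPIRIOLABS_API_KEY"])` after `import os`, then save each returned string to its own file and open it in a browser.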

Frequently asked questions

Which coding models were tested?

Kimi K2.7 Code from Moonshot AI, DeepSeek V4 Pro, Qwen3.7 Max from Alibaba, and GLM 5.2 from Z.ai. All four run on EmpirioLabs through one OpenAI-compatible API.

What were the three tasks?

A self-playing Snake, a self-playing Breakout, and a self-playing Pong, each a single self-contained HTML file with no external libraries, that plays itself with no user input.

Was anything edited or retried?

No. Each model got one shot per prompt and we rendered exactly what it returned, working or not.

Which model wrote the most code?

DeepSeek V4 Pro wrote the most lines on all three tasks, and Kimi K2.7 Code wrote the fewest. Line count is just a measure of size, not quality, so watch the clip to see how each game plays.

How do I switch between the models?

Change one string. All four serve the OpenAI-compatible Chat Completions API at https://api.empiriolabs.ai/v1, so you set the model id and keep the rest of the request unchanged.

Try it

Open the playground | Browse all models | Pricing
