GLM-5.2 API: Pricing, Quickstart & Limits

Jun 16, 2026

EmpirioLabs AI

GLM-5.2 is Z.ai's flagship reasoning and coding model, released on June 13, 2026. It pairs a one-million-token context window with adjustable reasoning effort, so you can dial how hard the model thinks before it answers, and it ships with built-in web search and tool calling. Z.ai positions it as a coding-first model that keeps strong general reasoning, and recommends maximum reasoning effort for complex, multi-step engineering work.

GLM-5.2 is live on EmpirioLabs today through an OpenAI-compatible API. It supports text input and output, a 1M-token context, up to 128K output tokens, function calling, JSON-mode structured output, and streaming. Try it in the playground or call it from any OpenAI-compatible client. The full spec and current rates live on the GLM-5.2 model page and the API docs.

Pricing

Billing is usage-based with no subscription: input and output tokens are metered per token, and each built-in web search adds a small per-call fee that applies only when a search actually runs. There is no separate cache tier. Current per-token rates always live on the model page and the pricing page, which stay in sync with what you are charged.
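
Since the bill is just tokens plus searches, a rough per-request estimate is simple arithmetic. The sketch below is illustrative only: the Rates class, the estimate_cost helper, the per-million quoting convention, and every number in it are placeholders invented for the example, not GLM-5.2 prices. Copy the real rates from the pricing page before relying on it.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Rates:
    """Hypothetical rate card; fill in real values from the pricing page."""
    input_per_million: float   # USD per 1M input tokens (placeholder)
    output_per_million: float  # USD per 1M output tokens (placeholder)
    per_search: float          # USD per web search that actually runs (placeholder)


def estimate_cost(rates: Rates, input_tokens: int, output_tokens: int,
                  searches: int = 0) -> float:
    """Estimate the charge in USD for a single request."""
    token_cost = (
        input_tokens * rates.input_per_million
        + output_tokens * rates.output_per_million
    ) / 1_000_000
    # The search fee is counted only for searches that ran.
    return token_cost + searches * rates.per_search


# Placeholder numbers for illustration, NOT real GLM-5.2 rates.
example = Rates(input_per_million=1.0, output_per_million=4.0, per_search=0.01)
print(round(estimate_cost(example, input_tokens=200_000,
                          output_tokens=5_000, searches=2), 4))
```

Because a search is billed only when it runs, pass the number of searches the model actually performed, not the number of requests that had search enabled.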

Quickstart

Point any OpenAI SDK at the EmpirioLabs base URL and pass glm-5-2 as the model:

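from openai import OpenAI

# Reuse the standard OpenAI client; only the base URL and key change.
client = OpenAI(
    base_url="https://api.empiriolabs.ai/v1",
    api_key="YOUR_EMPIRIOLABS_API_KEY",
)

# A single-turn chat completion against GLM-5.2.
resp = client.chat.completions.create(
    model="glm-5-2",
    messages=[
        {"role": "user", "content": "Refactor this Python loop into a list comprehension and explain why."},
    ],
    # Reasoning effort is adjustable per request.
    reasoning_effort="high",
)

# The answer text is on the first choice.
print(resp.choices[0].message.content)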
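
Streaming works through the same call. The following is a minimal sketch that assumes the endpoint emits chunks in the standard OpenAI streaming shape, which is the usual meaning of "OpenAI-compatible". The stream_reply helper is a name made up for this example, and it expects a client built as in the quickstart.

```python
from typing import Iterator


def stream_reply(client, prompt: str) -> Iterator[str]:
    """Yield the answer text piece by piece as it arrives.

    `client` is an OpenAI SDK client pointed at the EmpirioLabs base URL.
    """
    stream = client.chat.completions.create(
        model="glm-5-2",
        messages=[{"role": "user", "content": prompt}],
        stream=True,
    )
    for chunk in stream:
        # Some chunks carry no choices or no text, so skip those.
        if chunk.choices and chunk.choices[0].delta.content:
            yield chunk.choices[0].delta.content
```

To print tokens as they land, loop over the generator: for piece in stream_reply(client, "Explain list comprehensions"): print(piece, end="", flush=True).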

Reasoning effort

GLM-5.2 exposes a native reasoning_effort control with levels from minimal up to max, plus none to turn thinking off entirely. The default is max, which Z.ai recommends for complex coding. Lower the effort for faster, cheaper replies on simple turns. For the lowest-latency path, turn thinking off, either with none or by setting enable_thinking to false. When the model does think, its reasoning is returned alongside the answer, so you can show it or hide it.
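
One way to use this is to pick the effort per task rather than per application. The sketch below is an assumption-heavy illustration: the task names, the EFFORT_BY_TASK mapping, and both helpers are invented for the example. It uses only the levels this post names (none, minimal, high, max). The reasoning_content field name is a guess at where the reasoning trace appears on the message, so check the API docs for the actual field.

```python
# Hypothetical task-to-effort mapping; tune it to your own workload.
EFFORT_BY_TASK = {
    "lookup": "none",       # thinking off entirely
    "chat": "minimal",      # quick, cheap replies
    "code_review": "high",
    "refactor": "max",      # the documented default
}


def build_request(prompt: str, task: str = "refactor") -> dict:
    """Build keyword arguments for client.chat.completions.create()."""
    return {
        "model": "glm-5-2",
        "messages": [{"role": "user", "content": prompt}],
        # Unknown tasks fall back to max, matching the model default.
        "reasoning_effort": EFFORT_BY_TASK.get(task, "max"),
    }


def split_reply(message, reasoning_field: str = "reasoning_content"):
    """Return (reasoning, answer) for a response message.

    The field name is an ASSUMPTION; reasoning is None when it is absent.
    """
    return getattr(message, reasoning_field, None), message.content
```

With the quickstart client, a call looks like resp = client.chat.completions.create(**build_request("Summarize this diff", task="chat")), followed by reasoning, answer = split_reply(resp.choices[0].message).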

Web search

Set tool_web_search to true to let GLM-5.2 pull live results into its answer. The model decides when to search, cites what it used, and the per-call fee applies only when a search runs. Leave it off for fully offline reasoning.
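
With the OpenAI Python SDK, fields that are not part of the standard Chat Completions signature cannot be passed as plain keyword arguments; the SDK's extra_body option merges them into the request JSON instead. The sketch below assumes tool_web_search is a top-level field of the request body, which this post does not spell out. build_search_request is a helper name made up for the example.

```python
def build_search_request(prompt: str, web_search: bool = True) -> dict:
    """Build keyword arguments for client.chat.completions.create()."""
    request = {
        "model": "glm-5-2",
        "messages": [{"role": "user", "content": prompt}],
    }
    if web_search:
        # ASSUMPTION: tool_web_search sits at the top level of the JSON body.
        # extra_body is the SDK's hook for provider-specific fields.
        request["extra_body"] = {"tool_web_search": True}
    return request
```

Pass the result straight through, as in client.chat.completions.create(**build_search_request("What changed in the latest Python release?")). Omitting the flag keeps the request search-free, so no per-call fee can apply.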

Good to know

For strict structured output, run with thinking disabled so the model returns clean JSON without an interleaved reasoning trace. GLM-5.2 is text in and text out: for image or video understanding, pick a multimodal model from the catalog. The one-million-token context comfortably holds large codebases, long transcripts, and multi-file refactors in a single request.
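
The structured-output advice can be sketched as follows. It assumes the endpoint accepts the standard OpenAI JSON-mode switch, response_format={"type": "json_object"}, and it disables thinking with the none level described in this post. Both helper names are hypothetical.

```python
import json


def build_json_request(prompt: str) -> dict:
    """Build keyword arguments for a JSON-mode request with thinking off."""
    return {
        "model": "glm-5-2",
        "messages": [
            # JSON mode generally works best when the prompt asks for JSON too.
            {"role": "system",
             "content": "Reply with a single JSON object and nothing else."},
            {"role": "user", "content": prompt},
        ],
        # ASSUMPTION: JSON mode uses the standard OpenAI response_format field.
        "response_format": {"type": "json_object"},
        # Thinking off, so no reasoning trace is interleaved with the JSON.
        "reasoning_effort": "none",
    }


def parse_json_reply(resp) -> dict:
    """Parse the first choice as JSON; raises json.JSONDecodeError if invalid."""
    return json.loads(resp.choices[0].message.content)
```

Letting json.loads raise on malformed output is deliberate: a failed parse is a signal to retry or log the request rather than to pass bad data downstream.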