
Reasoning and coding model with a 1M token context, 128K output, adjustable reasoning effort, native web search, and tool calling.
Pay only for what you use. No subscription lock-ins.
Pricing catalog
EmpirioLabs pricing varies by model and unit: tokens, images, seconds of audio or video, messages, 3D assets, and search requests. The interactive pricing table loads the current catalog, while these representative rates remain readable without client JavaScript.
Image input is $0.05 per image. Video output starts at $0.096 per second for 480p and $0.168 per second for 720p.
Input starts at $0.40 per 1M prompt tokens for a cost-effective long-context multimodal API.
Input starts at $0.30 per 1M prompt tokens for multimodal reasoning, coding, and agent workloads.
Image-to-3D generation is priced per generated 3D asset, with docs for format and resolution controls.
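As a rough illustration of how per-unit rates like these turn into a bill, the sketch below multiplies usage by the representative starting rates quoted above. The function and constant names are hypothetical and not part of any EmpirioLabs API. The source does not name the model behind each rate, so the $0.30 token rate is used only as an example. Actual rates vary by model, so treat the figures as placeholders.

```python
# Representative starting rates quoted in the catalog text (USD).
# Assumption: these are illustrative only; real rates depend on the model.
IMAGE_INPUT_PER_IMAGE = 0.05
VIDEO_OUTPUT_PER_SECOND = {"480p": 0.096, "720p": 0.168}
INPUT_PER_MILLION_TOKENS = 0.30


def token_cost(tokens: int, rate_per_million: float = INPUT_PER_MILLION_TOKENS) -> float:
    """Cost of prompt tokens billed per 1M tokens."""
    return tokens / 1_000_000 * rate_per_million


def video_cost(seconds: float, resolution: str) -> float:
    """Cost of generated video billed per second at a given resolution."""
    return seconds * VIDEO_OUTPUT_PER_SECOND[resolution]


def image_input_cost(images: int) -> float:
    """Cost of image inputs billed per image."""
    return images * IMAGE_INPUT_PER_IMAGE


if __name__ == "__main__":
    # Hypothetical usage: 250K prompt tokens, a 10-second 720p clip, 4 input images.
    total = token_cost(250_000) + video_cost(10, "720p") + image_input_cost(4)
    print(f"Estimated cost: ${total:.3f}")
```

Under these assumed rates, a 10-second 720p clip comes to about $1.68, and 250K prompt tokens to about $0.075.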

Kimi K2.7 Code is Moonshot's trillion-parameter agentic coding model with 256K context, always-on reasoning, and text, image, and video inputs.

Cost-effective Qwen3.7 vision-language model for text, image, video, coding, tool use, GUI understanding, and 1M-context workflows.
MiniMax M3 is a multimodal reasoning model for coding, agents, and long-context analysis with text, image, and video input.

Qwen3.7 Max is a flagship text model for coding, productivity, long-running agents, deep thinking, tools, and 1M-token context.
High-speed M2.7 variant tuned for fast inference, with strong general-purpose performance and agentic capabilities.
Sub-130ms TTFB voice synthesis with 271+ voices across 15 languages, expressive prosody, and real-time SSE streaming for low-latency voice agents.
Broadcast-quality voice synthesis with rich expressive prosody, 271+ voices across 15 languages, and real-time SSE streaming with per-word timestamps.
Long-context Zhipu AI reasoning model with 202K context, 128K output, tool calling, structured output, and cache support.
Kimi K2.6 is a Moonshot multimodal reasoning model with 256K context, strong coding, and text, image, and video inputs.
MiniMax M2.7 is a general-purpose reasoning chat model with interleaved thinking, function calling, and prompt caching.
Qwen3.5 122B-A10B is a multimodal reasoning model with 256K context, efficient sparse MoE inference, and text, image, and video input.
Qwen3.5 397B-A17B is a flagship multimodal reasoning model for language, code, agents, GUI tasks, and image and video understanding.
Qwen3.5 35B-A3B is an efficient native vision-language model with sparse MoE routing, deep thinking, and text, image, and video input.
Qwen3.5 27B is a dense multimodal reasoning model with fast responses, 256K context, and text, image, and video understanding.
Qwen3.6 27B improves agentic coding, STEM reasoning, spatial vision, OCR, and text, image, and video understanding on 256K context.

Fast Qwen3.6 vision-language model for agentic coding, math reasoning, spatial understanding, OCR, and text, image, and video input.

Free lightweight GLM-4.7 text model for coding, reasoning, long-context writing, and general chat.

Free lightweight GLM-4.5 text model for reasoning, coding, long-form chat, and general language tasks.

Free multimodal GLM-4.6V model for image, video, file, and text understanding with native function calling.

Image generation and editing model creating and modifying images from text or image inputs, with inpainting, virtual try-on, and style controls.

Video generation model producing up to 2-minute multi-shot videos from text and optional image prompts with improved quality and consistency.

Speech-to-text transcription using the Nova-3 model with multi-language support and advanced customizable settings for production workloads.

Open-source LLM specialized in formal theorem proving in Lean 4, built on a recursive theorem-proving pipeline.

Open-source Mixture-of-Experts LLM tuned for high-efficiency reasoning, coding, and general language tasks across long-form prompts.

Lightweight MoE model with 284B total / 13B active parameters and native 1M context, tuned for low-latency, cost-effective high-concurrency use.

Flagship MoE LLM with 1.6T total / 49B active parameters and native 1M context for advanced math, logical inference, and specialized coding.


Quick LLM-style answer to a natural-language question, grounded in fresh Exa web search results with inline citations and source links.

Asynchronous research task that explores the web, gathers sources, synthesizes findings, and returns cited answers for in-depth queries.

Web search engine for finding pages, retrieving similar pages, crawling, and dedicated code search across the open web for AI agents.

Low-latency text-to-speech with single- and multi-speaker voices and controllable style, accent, and expressive tone for production apps.

High-quality TTS preview for podcasts, audiobooks, and customer support, with expressive multi-speaker voices across 23+ languages.

Highly controllable TTS with new Audio Tags for precise style, tone, pace, and delivery across narration, assistants, and voice apps.

Open-source vision-language model with 128K context, 140+ languages, improved math/reasoning, structured outputs, and function calling.

Deep-learning detector that flags portions of text likely generated by AI versus human, classifying content as entirely human, AI, or mixed.

Video model offering Text-to-Video, Image-to-Video, Reference-to-Video, and Video Edit modes with high-fidelity, motion-smooth output.

Open-source text-to-image model on a multimodal Mixture-of-Experts architecture with photorealistic detail and strong multilingual text rendering.

Autoregressive framework on the Janus Pro 7B model that unifies multimodal understanding and image generation in one architecture.

Video model in Standard or Pro modes with Text-to-Video, Image-to-Video, Reference-to-Video, editing, native sound, and multi-scene transitions.

Kling 3.0 model that transfers motion from a reference video onto a character from a reference image, with Standard 720p and Pro 1080p tiers.

Iterative AI search that keeps querying when initial results are insufficient, returning more comprehensive answers than Standard mode.

AI-powered web search with detailed overviews and answers, faster than Deep Search. Ranks #1 on the OpenAI SimpleQA benchmark.

Reasoning model tuned for tasks needing longer thought and higher accuracy: legal research, financial forecasting, software, and storytelling.

Cost-efficient language model offering strong reasoning and multimodal performance for general production workloads at competitive latency.

Enterprise-grade model with strong reasoning, coding, and STEM performance, supporting hybrid, on-prem, and in-VPC deployments.

24B-parameter multimodal model with 128K context for image analysis, programming, math, and multilingual tasks, tuned for efficient local inference.

Hybrid model unifying Instruct, Reasoning (Magistral), and Devstral families: 40% lower completion time and 3x throughput vs Small 3.

Low-cost multimodal foundation model for text, images, and video on a 300K context (up to ~30 min video), tuned for speed and affordability.

Fast, cost-effective multimodal reasoning model for text, images, documents, and video on a 1M context (long docs and ~90 min clips).

Text-only foundation model tuned for ultra-low latency and cost on 128K context. Strong for summarization, translation, and chat, with a 44% cache discount.

Most capable model in the family. Multimodal text/image/video on a 1M context with chain-of-thought reasoning across tools and data sources.

Multimodal foundation model balancing accuracy, speed, and cost for text, images, and video on 300K context (up to ~30 min video).

Whisper-1 speech-to-text transcription trained on multilingual supervised audio, with a 25 MB upload limit per file.

Institutional-grade research powered by Claude Opus 4.6 reasoning, with maximum depth, enhanced tool access, and extensive source coverage.

Research model for multi-step retrieval, synthesis, and reasoning, autonomously searching, reading, and evaluating sources across complex topics.

Sonar Pro as an agentic researcher: chains web searches, fetches full pages, and streams live reasoning, adapting strategy for complex queries.

Real-time web search with filtering by domain, language, date, and more. Returns search results, not LLM responses; no file uploads.

Real-time web-connected search with accurate citations and customizable sources for up-to-date AI search integration in production apps.

Search-grounded model with double the citations and a larger context window, tuned for complex queries needing in-depth, nuanced answers.

Reasoning model on the uncensored open-source R1-1776 with web search, outperforming leading search engines and LLMs on the SimpleQA benchmark.

Cinematic video generation in Text-to-Video, Image-to-Video, and Transition modes with high detail, fluid motion, and lifelike animations.

Generates videos from text or 1-2 frame image prompts up to 1080p, multiple aspect ratios, 5-10s durations, with optional synchronized audio.
Unified image generation and editing model with class-leading complex Chinese/English text rendering, realistic textures, and multi-image fusion.
Vision-language model with hybrid linear-attention plus sparse MoE, 1M context, and fast multimodal text/image/video inference.

Cost-efficient omni-modal model handling text, image, audio, and video, with up to 3 hours of audio and 1 hour of video across 90+ languages.

Flagship omni-modal model for text, image, audio, and video. 3h audio, 1h video, 90+ input and 30+ output languages, 55 voice timbres.
Multimodal model with hybrid architecture for efficient deep thinking and visual understanding across text, image, and video on a 1M context.

Largest preview variant in the 3.6 series (text-only): improved coding agent execution, stronger front-end skills, and broader long-tail knowledge.

Vision-language model with major upgrades over 3.5: agentic and front-end coding, multimodal recognition, OCR, and object localization.
256K-context flagship with major improvements in reasoning, instruction following, and multilingual support, plus higher coding/math accuracy.
Preview release with major gains over the 2.5 series in Chinese-English understanding, complex instructions, multilingual ability, and tool use.

Reasoning model with adaptive tool use (search, memory, code interpreter) and test-time scaling for higher accuracy on complex tasks.

Semantic document reranker. Sorts up to 500 candidates per query by relevance, supports 100+ languages, and accepts a custom sorting instruction.

Coding-tuned 256K-context model with strong front-end results and multilingual programming support for AI coding tools and agents.

Balanced general-purpose model for high-frequency enterprise workloads: information processing, content, search, and data analysis.

Latency-focused multimodal model with 256K context, four reasoning effort modes, and image/video understanding for high-concurrency use.

Flagship general model with 256K context for complex reasoning, multimodal understanding, structured generation, and tool-augmented execution.

Speed-optimized 2.0 video variant for cinematic clips with native audio sync, camera control, and stable motion at lower cost per render.

Multimodal video model for cinematic output from text, image, audio, or video inputs, with stable motion and consistent characters.

Unified multimodal image model that reasons through prompts before rendering, producing high-resolution, consistent edits and brand visuals.

Generates audio up to 3 minutes from text prompts, supporting text-to-audio and audio-to-audio with adjustable duration, steps, and CFG scale.

Up-to-3-minute audio from text with text-to-audio, audio-to-audio, and audio inpainting for music production, sound design, and remixing.

Multi-search research assistant that explores a topic, analyzes sources, and produces a detailed research report with citations.

Web search with crawl, extract, and URL mapping for fast, structured retrieval across pages and domains for downstream pipelines.

Multilingual text embedding with selectable output dimensions (64–2048). Up to 8,192 tokens per input.

Speed-optimised multimodal embedding with the same shape as Vision-Plus and 3× cheaper image/video tokens.

Multimodal embedding producing independent vectors for text, image, and video inputs.
Multimodal video generation model for cinematic, multi-shot stories with native audio-visual sync (lip-sync, dialogue, music, SFX).

Multimodal video model supporting T2V, I2V, video editing, and reference-to-video, with high-fidelity output from text, image, or video inputs.

Image generation and editing companion model: text-to-image, bounding-box edits, and cohesive image sets, with up to 4K output on Pro.

Autonomous AI agent that turns a high-level prompt into subtasks, calls tools and APIs, and delivers end-to-end results without manual orchestration.

Image-to-video model that animates a source image with prompt-guided motion, up to 15 seconds at 480p or 720p across seven aspect ratios.

Top-tier model for agentic workflows, complex software engineering, and long-horizon tasks, sustaining work across 1000+ tool calls on 1M context.

Multimodal model with native visual and audio understanding on a 1M context, designed to reason and act across modalities in agentic workflows.

Lightweight, high-speed reasoning model with hybrid attention and multi-token prediction for low-cost inference and strong benchmark scores.