
Qwen3.7 Plus
Alibaba Cloud: Cost-effective Qwen3.7 vision-language model for text, image, video, coding, tool use, GUI understanding, and 1M-context workflows.
Model catalog
Browse text, image, video, audio, 3D, search, and agent endpoints with pay-as-you-go pricing. The interactive catalog loads current availability from EmpirioLabs, and these model docs are crawlable without client JavaScript.
xAI image-to-video with prompt-guided motion, native audio, 480p or 720p output, and up to 15 second clips.
Multimodal video generation for cinematic clips from text, image, audio, or video inputs.
Unified image generation and editing for high-resolution creative, brand, and product visuals.
Cost-effective vision-language model for text, image, video, coding, tools, and 1M-context workflows.
Flagship long-context model for coding, productivity, long-running agents, deep thinking, and tool use.
Multimodal reasoning for coding, agents, long-context analysis, and text, image, and video input.
Moonshot multimodal reasoning with strong coding support, 256K context, and image and video input.
Long-context reasoning with tool calling, structured output, cache support, and 128K output.
Image-to-3D generation that turns a reference image into a textured GLB asset.

MiniMax: MiniMax M3 is a multimodal reasoning model for coding, agents, and long-context analysis with text, image, and video input.

xAI: Image-to-video model that animates a source image with prompt-guided motion, up to 15 seconds at 480p or 720p across seven aspect ratios.

Alibaba Cloud: Qwen3.7 Max is a flagship text model for coding, productivity, long-running agents, deep thinking, tools, and 1M-token context.

Moonshot AI: Kimi K2.6 is a Moonshot multimodal reasoning model with 256K context, strong coding, and text, image, and video inputs.

Z.ai: Long-context Zhipu AI reasoning model with 202K context, 128K output, tool calling, structured output, and cache support.

MiniMax: High-speed M2.7 variant tuned for fast inference, with strong general-purpose performance and agentic capabilities.

Black Forest Labs: Apache-licensed 4B FLUX.2 Klein image generation and editing model with text-to-image, reference-image editing, and creative workflow support.

Amazon: Image generation and editing model that creates and modifies images from text or image inputs, with inpainting, virtual try-on, and style controls.

Tencent: Open-source text-to-image model built on a multimodal Mixture-of-Experts architecture, with photorealistic detail and strong multilingual text rendering.

DeepSeek: Autoregressive framework built on the Janus Pro 7B model that unifies multimodal understanding and image generation in one architecture.

Alibaba Cloud: Unified image generation and editing model with class-leading complex Chinese/English text rendering, realistic textures, and multi-image fusion.

ByteDance: Unified multimodal image model that reasons through prompts before rendering, producing high-resolution, consistent edits and brand visuals.

Amazon: Video generation model that produces multi-shot videos up to 2 minutes long from text and optional image prompts, with improved quality and consistency.

Alibaba Cloud: Video model offering Text-to-Video, Image-to-Video, Reference-to-Video, and Video Edit modes with high-fidelity, motion-smooth output.

Tencent: 8.3B-parameter video model with native 720p output (upscalable to 1080p), strong motion coherence, bilingual prompt understanding, and clips up to 10 seconds.

Kling AI: Video model in Standard or Pro modes with Text-to-Video, Image-to-Video, Reference-to-Video, editing, native sound, and multi-scene transitions.

Kling AI: Kling 3.0 model that transfers motion from a reference video onto a character from a reference image, with Standard 720p and Pro 1080p tiers.

OpenMOSS: Open-source 32B MoE foundation model that generates synchronized video and audio in one inference step with precise dual-tower lip-sync.

ACE-Step: Open-source music generation model for text-to-song and lyric-guided audio, with fast 8-step XL Turbo inference for controllable song iteration.

Inworld: Sub-130 ms TTFB voice synthesis with 271+ voices across 15 languages, expressive prosody, and real-time SSE streaming for low-latency voice agents.

Inworld: Broadcast-quality voice synthesis with rich expressive prosody, 271+ voices across 15 languages, and real-time SSE streaming with per-word timestamps.

Google: Low-latency text-to-speech with single- and multi-speaker voices and controllable style, accent, and expressive tone for production apps.

Google: High-quality TTS preview for podcasts, audiobooks, and customer support, with expressive multi-speaker voices across 23+ languages.

Google: Highly controllable TTS with new Audio Tags for precise style, tone, pace, and delivery across narration, assistants, and voice apps.

Deepgram: Speech-to-text transcription using the Nova-3 model with multi-language support and advanced customizable settings for production workloads.

OpenAI: Whisper-1 speech-to-text transcription trained on multilingual supervised audio, with a 25 MB upload limit per file.

OpenAI: Controlled Whisper Large v3 Turbo transcription with multilingual ASR, translation, VAD, timestamps, subtitles, hotwords, and decoder controls.

Exa: Quick LLM-style answers to natural-language questions, grounded in fresh Exa web search results with inline citations and source links.

Exa: Asynchronous research task that explores the web, gathers sources, synthesizes findings, and returns cited answers for in-depth queries.

Exa: Web search engine for AI agents that finds pages, retrieves similar pages, crawls, and runs dedicated code search across the open web.

Linkup: Iterative AI search that keeps querying when initial results are insufficient, returning more comprehensive answers than Standard mode.

Linkup: AI-powered web search with detailed overviews and answers, faster than Deep Search. Ranks #1 on the OpenAI SimpleQA benchmark.

Perplexity: Institutional-grade research powered by Claude Opus 4.6 reasoning, with maximum depth, enhanced tool access, and extensive source coverage.

Alibaba Cloud: Multilingual text embedding with selectable output dimensions (64–2048) and up to 8,192 tokens per input.

Alibaba Cloud: Speed-optimised multimodal embedding with the same shape as Vision-Plus and 3× cheaper image/video tokens.

Alibaba Cloud: Multimodal embedding producing independent vectors for text, image, and video inputs.

GPTZero: Deep-learning detector that flags portions of text likely generated by AI rather than written by a human, classifying content as entirely human, entirely AI, or mixed.

Manus: Autonomous AI agent that turns a high-level prompt into subtasks, calls tools and APIs, and delivers end-to-end results without manual orchestration.