How to Run Any Hugging Face Model Behind an OpenAI-Compatible API

Run any Hugging Face model on EmpirioLabs GPU Cloud

Jun 8, 2026

EmpirioLabs AI

Open models have caught up fast. The teams behind Qwen, DeepSeek, GLM, Kimi, and Mistral keep shipping strong open text models on Hugging Face, and for a lot of work they are all you need. The hard part has never been the model. It is everything around it: finding a GPU, installing the right drivers and serving stack, loading the weights, exposing a port, and getting a stable connection back to your app.

GPU Cloud removes that work. You deploy a managed GPU, paste a Hugging Face repository id, and EmpirioLabs serves the model behind an OpenAI-compatible endpoint. Because the endpoint speaks the standard OpenAI API, you can point almost any tool at it: a chat frontend like SillyTavern, a coding assistant, your own scripts, or the official OpenAI SDKs. This guide walks through the whole flow, including how to choose a GPU.

Deploy a model

Open the GPU Cloud page in the dashboard, choose to deploy a model, and paste any Hugging Face repository id, for example Qwen/Qwen3.5-4B. Pick a GPU, then deploy. EmpirioLabs downloads the weights, starts a high-throughput serving engine, and exposes the model on an OpenAI-compatible endpoint. Standard safetensors models are served automatically, and quantized GGUF repositories are handled too, so most text models on Hugging Face work with no extra configuration.

You can do the same thing over the API. The deploy call returns an instance id with a status of provisioning. Poll the instance until it is running.

curl https://api.empiriolabs.ai/v1/gpu/instances \
  -H "Authorization: Bearer $EMPIRIOLABS_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "gpu_slug": "rtx-a6000",
    "mode": "model",
    "hf_id": "Qwen/Qwen3.5-4B"
  }'

A model takes a few minutes to come up the first time, since the weights have to download and load into GPU memory. After that, the instance is ready to receive requests.
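
The poll-until-running step can be sketched in Python with only the standard library. This is a minimal sketch, not the official client: the status values provisioning and running come from this guide, but the per-instance URL (the deploy URL plus the instance id) and the "status" field name in the JSON response are assumptions, so check the API reference before relying on them.

```python
import json
import time
import urllib.request

# Deploy endpoint from this guide. Appending the instance id is an ASSUMPTION.
API_ROOT = "https://api.empiriolabs.ai/v1/gpu/instances"


def fetch_status(instance_id, api_key):
    """Return the instance's status string.

    ASSUMPTION: GET {API_ROOT}/{instance_id} returns JSON with a "status" field.
    """
    request = urllib.request.Request(
        f"{API_ROOT}/{instance_id}",
        headers={"Authorization": f"Bearer {api_key}"},
    )
    with urllib.request.urlopen(request, timeout=30) as response:
        return json.load(response)["status"]


def wait_until_running(get_status, interval=10.0, timeout=900.0, sleep=time.sleep):
    """Call get_status() until it returns "running" or the timeout elapses."""
    waited = 0.0
    while True:
        status = get_status()
        if status == "running":
            return status
        if waited >= timeout:
            raise TimeoutError(f"instance still {status!r} after {timeout:.0f}s")
        sleep(interval)
        waited += interval
```

To use it, pass a zero-argument callable, for example wait_until_running(lambda: fetch_status(instance_id, api_key)). Keeping the HTTP call separate from the loop means the loop can be tested without a network connection.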

Choose the right GPU

The main thing that decides which GPU you need is how much memory the model takes. A text model served at 16-bit precision needs roughly 2 GB of GPU memory per billion parameters, plus headroom for the context window and the key-value cache. So a 7B model wants somewhere around 16 to 20 GB, a 30B model wants a large single card, and a 70B model wants the biggest single card or several cards together. Quantized builds use far less memory, which lets a bigger model fit on a smaller GPU.
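
The rule of thumb above can be written as a few lines of arithmetic. This is a rough sketch: the 2 GB per billion parameters figure is the one from this guide, while the 20 percent headroom for context and key-value cache is an illustrative assumption, not a measured value. Real usage depends on context length, batch size, and the serving engine.

```python
def estimate_gpu_memory_gb(params_billion, bits_per_weight=16, headroom=0.2):
    """Rough GPU memory estimate in GB: weights plus fractional headroom.

    headroom=0.2 is an ASSUMED allowance for context and key-value cache.
    """
    weights_gb = params_billion * bits_per_weight / 8
    return weights_gb * (1 + headroom)


for size in (7, 30, 70):
    full = estimate_gpu_memory_gb(size)
    quantized = estimate_gpu_memory_gb(size, bits_per_weight=4)
    print(f"{size}B: ~{full:.0f} GB at 16-bit, ~{quantized:.0f} GB at 4-bit")
```

With these assumptions, a 7B model at 16-bit comes out near 17 GB, inside the 16 to 20 GB range quoted above.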

A good default for small and mid-size text models is the RTX A6000 with 48 GB at 0.65 USD per hour. It comfortably runs models up to roughly 13B at 16-bit precision, and much larger ones when they are quantized. For models around 30B, move up to an A100 or H100 with 80 GB. For 70B and above, use an H200 with 141 GB, or split the model across several GPUs in a single instance, which is how the largest models are served. Keep in mind that the weights of a 70B model at 16-bit already take about 140 GB, so on a single H200 a quantized build leaves far more room for context. The dashboard shows each GPU's memory next to its hourly price, and if a model needs more memory than the GPU you picked, the deploy is blocked before it starts, so you never pay for a GPU the model cannot fit on.
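
If you want a quick local check before opening the dashboard, the sizing advice above reduces to picking the smallest card whose memory covers your estimate. The sketch below uses only the memory sizes quoted in this guide. The function name is hypothetical, and the platform's own fit check remains the authority.

```python
# (name, memory in GB) for the cards named in this guide, smallest first.
GPUS = [("RTX A6000", 48), ("A100", 80), ("H100", 80), ("H200", 141)]


def smallest_fitting_gpu(required_gb, gpus=GPUS):
    """Return the first GPU with enough memory, or None if no single card fits."""
    for name, memory_gb in gpus:
        if memory_gb >= required_gb:
            return name
    return None  # no single card fits: quantize, or split across several GPUs


print(smallest_fitting_gpu(17))
print(smallest_fitting_gpu(72))
print(smallest_fitting_gpu(168))
```

A result of None means the estimate exceeds every single card listed here, which is the case where a quantized build or a multi-GPU instance comes in.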

Use it in any OpenAI-compatible app

This is where it gets useful. A running model instance gives you a connect base URL that looks like this:

https://api.empiriolabs.ai/v1/gpu/connect/$INSTANCE_ID/v1

That URL is a drop-in OpenAI base URL. Anything that can talk to the OpenAI API can talk to your model. You need three values: the base URL above, your EmpirioLabs API key as the bearer token, and the model name, which is the exact repository id you deployed, for example Qwen/Qwen3.5-4B. The endpoint also answers /v1/models, so clients that fetch a model list to fill a dropdown work as expected.

curl https://api.empiriolabs.ai/v1/gpu/connect/$INSTANCE_ID/v1/chat/completions \
  -H "Authorization: Bearer $EMPIRIOLABS_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "Qwen/Qwen3.5-4B",
    "messages": [{ "role": "user", "content": "Hello" }]
  }'
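
The model list route can be queried the same way. Since the connect base URL already ends in /v1, the list lives at the base URL plus /models. The sketch below assumes the response follows the standard OpenAI list shape, an object with a "data" array of entries that each carry an "id". The helper names are hypothetical.

```python
import json
import urllib.request


def models_url(base_url):
    """Join the connect base URL (which already ends in /v1) with the models route."""
    return base_url.rstrip("/") + "/models"


def model_ids(payload):
    """Extract ids from an OpenAI-style list response: {"data": [{"id": ...}, ...]}."""
    return [entry["id"] for entry in payload.get("data", [])]


def list_models(base_url, api_key):
    """Fetch the model list from a running instance and return the ids."""
    request = urllib.request.Request(
        models_url(base_url),
        headers={"Authorization": f"Bearer {api_key}"},
    )
    with urllib.request.urlopen(request, timeout=30) as response:
        return model_ids(json.load(response))
```

Calling list_models with your connect base URL and API key should return a list containing the repository id you deployed, which is the value to pass as the model name.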

If you prefer the OpenAI SDK, point its base URL at the same connect path and everything else stays the same:

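import os

from openai import OpenAI

# Same values as the curl example, read from the environment
# so the API key never ends up in source control.
client = OpenAI(
    base_url=f"https://api.empiriolabs.ai/v1/gpu/connect/{os.environ['INSTANCE_ID']}/v1",
    api_key=os.environ["EMPIRIOLABS_API_KEY"],
)

response = client.chat.completions.create(
    model="Qwen/Qwen3.5-4B",  # the exact repository id you deployed
    messages=[{"role": "user", "content": "Hello"}],
)
print(response.choices[0].message.content)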

The same three values plug into the tools you already use. In a chat frontend such as SillyTavern, Open WebUI, or LibreChat, choose the OpenAI-compatible or custom provider, set the API URL to your connect base URL, paste your EmpirioLabs API key, and enter the model name shown on the instance. Coding assistants that accept a custom OpenAI base URL, like Cline, Continue, and Aider, are configured the same way. Roleplay frontends, agent frameworks, retrieval pipelines, and any in-house service follow the same pattern: base URL, key, model.
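
For scripts built on the official OpenAI Python SDK, one way to apply the pattern without touching code is through environment variables: the SDK reads OPENAI_BASE_URL and OPENAI_API_KEY when the client is created with no arguments. The sketch below prints export lines for a shell. The helper names are hypothetical, and other tools may expect different variable names, so check each tool's documentation.

```python
import shlex


def connect_base_url(instance_id):
    """Build the connect base URL shown in this guide for a given instance id."""
    return f"https://api.empiriolabs.ai/v1/gpu/connect/{instance_id}/v1"


def openai_env(instance_id, api_key):
    """Variables the official OpenAI Python SDK reads when no arguments are passed."""
    return {
        "OPENAI_BASE_URL": connect_base_url(instance_id),
        "OPENAI_API_KEY": api_key,
    }


# Placeholders only: substitute your real instance id and API key.
for name, value in openai_env("INSTANCE_ID", "YOUR_EMPIRIOLABS_API_KEY").items():
    print(f"export {name}={shlex.quote(value)}")
```

The model name is still passed per request, since it is not something the SDK reads from the environment.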

Chat with it in the dashboard

You do not need an external tool to try a model you just deployed. Every instance that serves an OpenAI-compatible API gets a built-in chat page. Open the instance from the GPU Cloud page, click Chat with this model, and start typing. The chat page streams responses and supports a system prompt and the usual sampling controls. It runs over the same secure connection as the API, so there is nothing extra to set up. There is no separate billing either, because the instance is already metered by the second.

Pause when you are not using it

GPU Cloud is billed per second of running time, and the hourly rate is locked when you deploy, so a price change never affects an instance that is already running. When you do not need the model, stop the instance. That releases the GPU and stops billing immediately. Start it again later and EmpirioLabs redeploys the same model on the next available GPU. Stopping clears the instance's ephemeral storage, so treat that storage as scratch space. When you are finished with an instance for good, destroy it. Lifetime and per-instance spend are visible on the GPU Cloud page and through /v1/account/usage.
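
To get a feel for what per-second billing means in practice, the cost of a session is the hourly rate times the running seconds, divided by 3600. The sketch below uses the RTX A6000 rate quoted in this guide. It assumes plain proportional billing with no minimum charge or rounding, so treat the result as an estimate and the usage page as the source of truth.

```python
def running_cost_usd(hourly_rate_usd, seconds_running):
    """Estimated cost of an instance billed per second at a fixed hourly rate.

    ASSUMPTION: purely proportional billing, no minimums or rounding.
    """
    return hourly_rate_usd * seconds_running / 3600


# RTX A6000 at 0.65 USD per hour: a 20-minute test and an 8-hour workday.
print(f"20 minutes: {running_cost_usd(0.65, 20 * 60):.2f} USD")
print(f"8 hours:    {running_cost_usd(0.65, 8 * 3600):.2f} USD")
```

Under these assumptions a 20-minute experiment costs about 0.22 USD, which is why stopping an idle instance is worth the habit.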