Self-hosted vLLM¶
OpenAIProvider talks to any server that implements OpenAI's Chat
Completions wire format (POST /v1/chat/completions) — including
self-hosted vLLM. This page
walks through the configuration nuances specific to vLLM
deployments.
The 30-second version¶
import asyncio
from openarmature.llm import OpenAIProvider, RuntimeConfig, UserMessage
async def main() -> None:
provider = OpenAIProvider(
base_url="http://localhost:8000", # host root only — no /v1
model="meta-llama/Llama-3.1-8B-Instruct",
api_key=None, # vLLM doesn't require auth by default
genai_system="vllm", # surfaces on observability spans
)
messages = [UserMessage(content="hello")]
config = RuntimeConfig(temperature=0.0, max_tokens=128)
# await provider.ready() # pre-flight; needs a live endpoint
# response = await provider.complete(messages, config=config)
# print(response.message.content)
_ = (messages, config) # used once the calls above are uncommented
await provider.aclose()
asyncio.run(main())
That's it for the happy path. The rest of the page covers the config nuances you'll hit in real deployments.
base_url shape — host root only¶
vLLM serves on http://<host>:<port>/v1/chat/completions and
http://<host>:<port>/v1/models. OpenAIProvider appends the
/v1/... paths itself, so the base_url you pass MUST be the host
root — no /v1 suffix:
from openarmature.llm import OpenAIProvider
# Correct — host root only, provider appends /v1/...
provider = OpenAIProvider(
base_url="http://localhost:8000",
model="meta-llama/Llama-3.1-8B-Instruct",
api_key=None,
)
# Rejected at construction time — raises ValueError
try:
OpenAIProvider(
base_url="http://localhost:8000/v1",
model="meta-llama/Llama-3.1-8B-Instruct",
api_key=None,
)
except ValueError as exc:
# "base_url must not end with '/v1' — the provider appends …"
_ = exc
The check exists because httpx joins paths by appending, so an
unprefixed /v1 on base_url produces a doubled /v1/v1/... wire
path that silently 404/405s on most backends. The provider rejects
the misconfiguration synchronously rather than letting it surface as
a wire failure at lifespan startup.
Trailing slashes are stripped; other non-empty paths (proxy prefixes
like /api/openai-proxy) are left intact for intentional reverse-
proxy setups.
Authentication — typically off, optionally on¶
vLLM ships with auth off by default. Pass api_key=None for that
case. To enable auth on the vLLM side, launch with --api-key
<your-key> and pass the same value to OpenAIProvider:
# vLLM server
python -m vllm.entrypoints.openai.api_server \
--model meta-llama/Llama-3.1-8B-Instruct \
--api-key sk-vllm-local-secret
from openarmature.llm import OpenAIProvider
provider = OpenAIProvider(
base_url="http://localhost:8000",
model="meta-llama/Llama-3.1-8B-Instruct",
api_key="sk-vllm-local-secret",
)
A wrong or missing key surfaces as ProviderAuthentication (mapped
from 401/403) — the same error category as OpenAI cloud auth
failures, so retry / surface logic is portable across backends.
genai_system="vllm" for the observability layer¶
The genai_system constructor kwarg sets the OTel gen_ai.system
span attribute. The default "openai" is correct for OpenAI's
hosted API; for self-hosted backends, set it to the backend name so
dashboards / cost-attribution UIs filter the traces correctly:
from openarmature.llm import OpenAIProvider
provider = OpenAIProvider(
base_url="http://localhost:8000",
model="meta-llama/Llama-3.1-8B-Instruct",
api_key=None,
genai_system="vllm",
)
Standard values for other backends running the same wire format:
"vllm", "lmstudio", "llamacpp", "sglang". No base_url
sniffing is done — the same host:port could be any of those servers,
and a wrong inference would be worse than the explicit opt-in.
Older vLLM releases — force_prompt_augmentation_fallback¶
OpenAI's native structured-output path uses the response_format
field on the request body. Older vLLM releases either reject this
field or silently ignore it. If you're targeting structured output
on an older vLLM:
from openarmature.llm import OpenAIProvider
provider = OpenAIProvider(
base_url="http://localhost:8000",
model="meta-llama/Llama-3.1-8B-Instruct",
api_key=None,
genai_system="vllm",
force_prompt_augmentation_fallback=True,
)
With the fallback flag set, the provider injects a JSON-Schema
directive into the system message instead of using
response_format. The wire body never carries response_format;
the model sees the schema in the prompt and is asked to produce
conforming JSON. Validation against the schema still runs on the
returned text — StructuredOutputInvalid surfaces when the model's
output doesn't match.
Recent vLLM releases (>=0.5.x) support response_format natively;
leave the flag at its default False for those.
Readiness probe — GET /v1/models¶
provider.ready() hits GET /v1/models and:
- Matches the bound model against the returned
data[].identries; raisesProviderInvalidModelif absent. - Consults an optional per-entry
statusfield — if it containsloadingornot_loaded, raisesProviderModelNotLoaded. Local servers that report load state (some LM Studio / vLLM builds) get a real not-loaded signal through this path. - Maps 401/403 →
ProviderAuthentication, 5xx / connection error →ProviderUnavailable.
Limitation for vLLM specifically. vLLM's /v1/models doesn't
populate a status field — it returns the configured model with a
200 even during a slow first-load. So the status-based not-loaded
detection above doesn't fire for vLLM; the probe confirms the model
name matches but can't tell warmed from cold. For deployments where
cold-load takes seconds to minutes, layer your own warm-up call
after ready():
from openarmature.llm import OpenAIProvider, RuntimeConfig, UserMessage
async def warm_up(provider: OpenAIProvider) -> None:
await provider.ready()
# Synthetic warm-up — sends a 1-token request to force the model
# to finish loading before lifespan startup completes.
await provider.complete(
[UserMessage(content="ok")],
config=RuntimeConfig(temperature=0.0, max_tokens=1),
)
A more discriminating readiness contract is on the roadmap (see post-release task: harden OpenAIProvider readiness probe).
Tool calling on vLLM¶
vLLM supports OpenAI-style tool calling when launched with
--enable-auto-tool-choice and a tool-parser flag matching the
model family (e.g., --tool-call-parser llama3_json for Llama 3.1
Instruct). The wire shape is identical to OpenAI's; from
OpenAIProvider's perspective, tool calls Just Work. The
fundamentals → tool calling page
covers the OA-side dispatch pattern; no vLLM-specific changes
needed.
# vLLM server — enable tool calling
python -m vllm.entrypoints.openai.api_server \
--model meta-llama/Llama-3.1-8B-Instruct \
--enable-auto-tool-choice \
--tool-call-parser llama3_json
Behaviour to be aware of¶
- Concurrency: vLLM batches requests internally.
OpenAIProvidershares onehttpx.AsyncClientper provider instance; concurrentcomplete()calls share the connection pool and round-trip through vLLM's batcher. - Token counting: vLLM returns OpenAI-shaped
usageblocks. OA records them onResponse.usageand surfaces them asgen_ai.usage.*span attributes per observability §5.5.3. Retry-Afteron 429: vLLM emits 429 when its scheduler queues fill.ProviderRateLimit.retry_afteris populated from the header. Retry middleware handles backoff if wrapped around the node.- Streaming: not supported on
provider.complete()(it's a single-completion call by contract). For streaming, write a capability against the same wire endpoint.