Self-host OpenAI API
LLM inference API · Category: AI / LLM tooling
OpenAI's API gives you GPT-4o / o-series chat completions, embeddings, and tool use behind a single HTTPS endpoint billed per token. The self-hostable replacements split the job in two: a model server (Ollama, vLLM, TGI) that runs an open-weight model on your own GPU, and an OpenAI-compatible router (LiteLLM) that exposes /v1/chat/completions to clients that already speak the OpenAI protocol.
OpenAI API pricing anchor: GPT-4o ~$2.50 input / $10 output per 1M tokens; embeddings ~$0.02 per 1M tokens.
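Across all three options below, the client-side change is the same: keep the official `openai` SDK and point `base_url` at the new endpoint. A minimal Python sketch of the pattern; the host, port, key, and model name are placeholders for whichever backend you deploy.

```python
from openai import OpenAI

# Same SDK you already use against api.openai.com; only base_url and the
# model name change. Values below are placeholders, not a specific backend.
client = OpenAI(
    base_url="http://localhost:8000/v1",   # your self-hosted endpoint
    api_key="placeholder",                 # SDK requires a value; many local servers ignore it
)

resp = client.chat.completions.create(
    model="llama3.1:8b",  # local model name instead of "gpt-4o"
    messages=[{"role": "user", "content": "Summarize our Q3 retro in three bullets."}],
)
print(resp.choices[0].message.content)
```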
Ollama
- GitHub: ★ 171.1k · last commit 1d ago · 3223 open issues
- License: MIT
- Setup time: 5 min, single binary
- Monthly cost: Free on a workstation with a 16GB+ GPU; ~$200/mo for an A10/RTX 4090 cloud GPU; CPU-only works for 7B models but is too slow for production.
Migration sketch. Install with `curl -fsSL https://ollama.com/install.sh | sh`, pull a model with `ollama pull llama3.1:8b` (or `qwen2.5:32b` for closer GPT-4 quality), then point clients at `http://localhost:11434/v1/chat/completions` — Ollama exposes an OpenAI-compatible endpoint so the official `openai` SDK works by setting `base_url`. Replace `gpt-4o` with the local model name in your request payload.
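A sketch of that client-side swap, assuming `llama3.1:8b` has already been pulled; the `api_key` value is a dummy because Ollama ignores it, and streaming works through the same SDK.

```python
from openai import OpenAI

# Ollama's OpenAI-compatible endpoint lives under /v1 on port 11434.
client = OpenAI(base_url="http://localhost:11434/v1", api_key="ollama")  # key is ignored, but the SDK requires one

# Streaming works the same way it does against api.openai.com.
stream = client.chat.completions.create(
    model="llama3.1:8b",  # was "gpt-4o" in the hosted setup
    messages=[{"role": "user", "content": "Explain PagedAttention in two sentences."}],
    stream=True,
)
for chunk in stream:
    delta = chunk.choices[0].delta.content
    if delta:
        print(delta, end="", flush=True)
```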
Good fit for: Single-machine deployments and laptops; the easiest on-ramp from OpenAI for a developer team.
Weak at: Multi-tenant serving and batched throughput — Ollama serializes requests; for concurrent traffic switch to vLLM.
vLLM
- GitHub: ★ 79.5k · last commit today · 4864 open issues
- License: Apache-2.0
- Setup time: 30 min, docker run with --gpus
- Monthly cost: $200-1500/mo depending on GPU class; an A100 80GB handles Llama 3.1 70B with quantized weights and PagedAttention batching; full-precision 70B needs two GPUs.
Migration sketch. Run `docker run --gpus all -p 8000:8000 vllm/vllm-openai --model meta-llama/Llama-3.1-70B-Instruct`. The container exposes `/v1/chat/completions` and `/v1/embeddings` matching the OpenAI schema; point your existing `openai` client's `base_url` at `http://your-host:8000/v1`. Use vLLM's `--api-key` flag to require a bearer token before exposing the endpoint to the internet.
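Client-side sketch, assuming the container above is running and `--api-key` is set; the hostname and token are placeholders.

```python
from openai import OpenAI

# The token must match the value passed to vLLM's --api-key flag;
# host and key here are placeholders for your deployment.
client = OpenAI(
    base_url="http://your-host:8000/v1",
    api_key="change-me",
)

resp = client.chat.completions.create(
    model="meta-llama/Llama-3.1-70B-Instruct",  # vLLM serves models under their HF repo id
    messages=[{"role": "user", "content": "Draft a friendly out-of-office reply."}],
    max_tokens=200,
)
print(resp.choices[0].message.content)
```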
Good fit for: Production inference at scale — vLLM's continuous batching is what you want when 10+ concurrent users hit the endpoint.
Weak at: Single-GPU model fit — large models (70B+) need multi-GPU tensor parallelism and careful VRAM budgeting.
LiteLLM
- GitHub: ★ 46.4k · last commit today · 2949 open issues
- License: MIT
- Setup time: 15 min, docker-compose (proxy + Postgres for usage logs)
- Monthly cost: $5 VPS for the proxy itself; the underlying model server (Ollama / vLLM / OpenAI passthrough) is the real cost line.
Migration sketch. Deploy the LiteLLM proxy with `litellm --config config.yaml`, defining model aliases like `gpt-4` → `ollama/qwen2.5:32b` and `gpt-4o-mini` → `openai/gpt-4o-mini` (real OpenAI passthrough for traffic you still want on the hosted model). Apps keep calling `gpt-4` on the OpenAI SDK with `base_url=http://litellm:4000/v1` — the routing happens in the proxy. Built-in budget caps and per-key rate limits replace OpenAI's dashboard quotas.
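Application-side sketch, assuming the proxy listens at `http://litellm:4000` and a LiteLLM virtual key has been issued (both placeholders); the alias routing lives entirely in the proxy's config.

```python
from openai import OpenAI

# The model name stays "gpt-4"; the LiteLLM proxy decides whether that alias
# resolves to ollama/qwen2.5:32b, a hosted model, or a fallback.
client = OpenAI(
    base_url="http://litellm:4000/v1",   # placeholder proxy address
    api_key="sk-litellm-virtual-key",    # placeholder; issued by LiteLLM, enforces budgets and rate limits
)

resp = client.chat.completions.create(
    model="gpt-4",  # alias defined in config.yaml, not a direct OpenAI call
    messages=[{"role": "user", "content": "Classify this ticket: 'refund not received after 10 days'."}],
)
print(resp.choices[0].message.content)
```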
Good fit for: Teams that want one OpenAI-shaped endpoint in front of many backends (a mix of self-hosted, hosted Anthropic, and hosted OpenAI for fallback).
Weak at: Not a model server itself — you still need Ollama/vLLM/cloud APIs behind it; LiteLLM is glue, not GPU.
In a terminal? `npx os-alt openai-api` prints this table — how the CLI works →