Self-host OpenAI API
LLM inference API · Category: AI / LLM tooling
OpenAI's API gives you GPT-4o / o-series chat completions, embeddings, and tool use behind a single HTTPS endpoint billed per token. The self-hostable replacements split the job in two: a model server (Ollama, vLLM, TGI) that runs an open-weight model on your own GPU, and an OpenAI-compatible router (LiteLLM) that exposes /v1/chat/completions to clients that already speak the OpenAI protocol.
OpenAI API pricing anchor: GPT-4o ~$2.50 input / $10 output per 1M tokens; embeddings ~$0.02 per 1M tokens.
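Across all three options below, the client-side change is the same: keep the official `openai` SDK and point `base_url` at the new endpoint. A minimal Python sketch of the pattern; the host, port, key, and model name are placeholders for whichever backend you deploy.

```python
from openai import OpenAI

# Same SDK you already use against api.openai.com; only base_url and the
# model name change. Values below are placeholders, not a specific backend.
client = OpenAI(
    base_url="http://localhost:8000/v1",   # your self-hosted endpoint
    api_key="placeholder",                 # SDK requires a value; many local servers ignore it
)

resp = client.chat.completions.create(
    model="llama3.1:8b",  # local model name instead of "gpt-4o"
    messages=[{"role": "user", "content": "Summarize our Q3 retro in three bullets."}],
)
print(resp.choices[0].message.content)
```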
Ollama
- GitHub: ★ 171.1k · last commit 1d ago · 3223 open issues
- License: MIT
- Setup time: 5 min, single binary
- Monthly cost: Free on a workstation with a 16GB+ GPU; ~$200/mo for an A10/RTX 4090 cloud GPU; CPU-only works for 7B models but is too slow for production.
Migration sketch. Install with `curl -fsSL https://ollama.com/install.sh | sh`, pull a model with `ollama pull llama3.1:8b` (or `qwen2.5:32b` for closer GPT-4 quality), then point clients at `http://localhost:11434/v1/chat/completions` — Ollama exposes an OpenAI-compatible endpoint so the official `openai` SDK works by setting `base_url`. Replace `gpt-4o` with the local model name in your request payload.
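A sketch of that client-side swap, assuming `llama3.1:8b` has already been pulled; the `api_key` value is a dummy because Ollama ignores it, and streaming works through the same SDK.

```python
from openai import OpenAI

# Ollama's OpenAI-compatible endpoint lives under /v1 on port 11434.
client = OpenAI(base_url="http://localhost:11434/v1", api_key="ollama")  # key is ignored, but the SDK requires one

# Streaming works the same way it does against api.openai.com.
stream = client.chat.completions.create(
    model="llama3.1:8b",  # was "gpt-4o" in the hosted setup
    messages=[{"role": "user", "content": "Explain PagedAttention in two sentences."}],
    stream=True,
)
for chunk in stream:
    delta = chunk.choices[0].delta.content
    if delta:
        print(delta, end="", flush=True)
```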
Good fit for: Single-machine deployments and laptops; the easiest on-ramp from OpenAI for a developer team.
Weak at: Multi-tenant serving and batched throughput — Ollama serializes requests; for concurrent traffic switch to vLLM.
vLLM
- GitHub: ★ 79.5k · last commit today · 4864 open issues
- License: Apache-2.0
- Setup time: 30 min, docker run with --gpus
- Monthly cost: $200-1500/mo depending on GPU class; an A100 80GB handles Llama 3.1 70B with quantized weights and PagedAttention batching; full-precision 70B needs two GPUs.
Migration sketch. Run `docker run --gpus all -p 8000:8000 vllm/vllm-openai --model meta-llama/Llama-3.1-70B-Instruct`. The container exposes `/v1/chat/completions` and `/v1/embeddings` matching the OpenAI schema; point your existing `openai` client's `base_url` at `http://your-host:8000/v1`. Use vLLM's `--api-key` flag to require a bearer token before exposing the endpoint to the internet.
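Client-side sketch, assuming the container above is running and `--api-key` is set; the hostname and token are placeholders.

```python
from openai import OpenAI

# The token must match the value passed to vLLM's --api-key flag;
# host and key here are placeholders for your deployment.
client = OpenAI(
    base_url="http://your-host:8000/v1",
    api_key="change-me",
)

resp = client.chat.completions.create(
    model="meta-llama/Llama-3.1-70B-Instruct",  # vLLM serves models under their HF repo id
    messages=[{"role": "user", "content": "Draft a friendly out-of-office reply."}],
    max_tokens=200,
)
print(resp.choices[0].message.content)
```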
Good fit for: Production inference at scale — vLLM's continuous batching is what you want when 10+ concurrent users hit the endpoint.
Weak at: Single-GPU model fit — large models (70B+) need multi-GPU tensor parallelism and careful VRAM budgeting.
LiteLLM
- GitHub: ★ 46.4k · last commit today · 2949 open issues
- License: MIT
- Setup time: 15 min, docker-compose (proxy + Postgres for usage logs)
- Monthly cost: $5 VPS for the proxy itself; the underlying model server (Ollama / vLLM / OpenAI passthrough) is the real cost line.
Migration sketch. Deploy the LiteLLM proxy with `litellm --config config.yaml`, defining model aliases like `gpt-4` → `ollama/qwen2.5:32b` and `gpt-4o-mini` → `openai/gpt-4o-mini` (real OpenAI passthrough for traffic you still want on the hosted model). Apps keep calling `gpt-4` on the OpenAI SDK with `base_url=http://litellm:4000/v1` — the routing happens in the proxy. Built-in budget caps and per-key rate limits replace OpenAI's dashboard quotas.
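Application-side sketch, assuming the proxy listens at `http://litellm:4000` and a LiteLLM virtual key has been issued (both placeholders); the alias routing lives entirely in the proxy's config.

```python
from openai import OpenAI

# The model name stays "gpt-4"; the LiteLLM proxy decides whether that alias
# resolves to ollama/qwen2.5:32b, a hosted model, or a fallback.
client = OpenAI(
    base_url="http://litellm:4000/v1",   # placeholder proxy address
    api_key="sk-litellm-virtual-key",    # placeholder; issued by LiteLLM, enforces budgets and rate limits
)

resp = client.chat.completions.create(
    model="gpt-4",  # alias defined in config.yaml, not a direct OpenAI call
    messages=[{"role": "user", "content": "Classify this ticket: 'refund not received after 10 days'."}],
)
print(resp.choices[0].message.content)
```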
Good fit for: Teams that want one OpenAI-shaped endpoint in front of many backends (a mix of self-hosted, hosted Anthropic, and hosted OpenAI for fallback).
Weak at: Not a model server itself — you still need Ollama/vLLM/cloud APIs behind it; LiteLLM is glue, not GPU.
In a terminal? `npx os-alt openai-api` prints this table — how the CLI works →