VRAM AI Docs
Everything you need to deploy and operate VRAM AI in production.
Quick Start
Get VRAM running in under 5 minutes using Docker. You need an NVIDIA GPU with CUDA 12.1+ and Docker with the NVIDIA Container Toolkit installed.
docker pull vramai/ghostswap:latest
docker run -d \
  --gpus all \
  -p 8080:8080 \
  -e LICENSE_KEY="GSW1.your-key-here" \
  -e MODE=ghostswap \
  --name ghostswap \
  vramai/ghostswap:latest
curl http://localhost:8080/health
Open http://localhost:8080/xray in your browser to see the X-Ray GPU dashboard.
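Because the container is started in the background, the health endpoint may not answer immediately, for example while models are still loading. The following is a minimal sketch of a readiness poll using only the Python standard library. It assumes that /health returns HTTP 200 once the gateway is ready; the function names, attempt count, and delay are illustrative and not part of VRAM.

```python
import time
import urllib.error
import urllib.request


def http_status(url, timeout=2.0):
    """Return the HTTP status code of a GET request, or 0 if unreachable."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status
    except urllib.error.HTTPError as exc:
        # The server answered, but with an error status (e.g. 503).
        return exc.code
    except (urllib.error.URLError, OSError):
        # Connection refused, DNS failure, timeout, and similar.
        return 0


def wait_for_health(url, attempts=30, delay=2.0, probe=http_status, sleep=time.sleep):
    """Poll `url` until it answers 200 (assumed to mean healthy).

    Returns True on success and False if all attempts are used up.
    """
    for attempt in range(attempts):
        if probe(url) == 200:
            return True
        if attempt < attempts - 1:
            sleep(delay)
    return False


# Usage (needs a running gateway):
# wait_for_health("http://localhost:8080/health")
```

The probe and sleep functions are passed in as parameters so the loop can be exercised without a running gateway.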
Configuration
VRAM is configured via a YAML file mounted at /config/config.yaml. Key fields:
vram_budget_gb: 24 # Total VRAM budget across all models
ram_budget_gb: 64 # RAM spill tier size
device: "cuda" # cuda | cpu
log_level: "INFO"
# NVMe fast tier for large models
nvme_dir: "/models/nvme"
auto_cache_nvme: true
# Models to preload at startup
models:
  - id: "mistral-7b"
    path: "mistralai/Mistral-7B-Instruct-v0.3"
    vram_gb: 14.0
    dtype: "float16"
  - id: "llama-3-8b"
    path: "meta-llama/Meta-Llama-3-8B-Instruct"
    vram_gb: 16.0
    dtype: "float16"
API Reference
VRAM exposes an OpenAI-compatible REST API. Point any OpenAI SDK at your VRAM instance.
import openai

client = openai.OpenAI(
    base_url="http://localhost:8080/v1",
    api_key="not-used",  # license key is set server-side
)

response = client.chat.completions.create(
    model="mistral-7b",
    messages=[{"role": "user", "content": "Hello!"}],
)
print(response.choices[0].message.content)

| Method | Path | Description |
|---|---|---|
| GET | /health | Health check |
| GET | /v1/models | List registered models |
| POST | /v1/completions | Text completion |
| POST | /v1/chat/completions | Chat completion |
| POST | /v1/embeddings | Text embeddings |
| POST | /admin/models | Register model at runtime |
| DELETE | /admin/models/:id | Remove model at runtime |
| POST | /admin/models/upload | Upload custom model file |
| GET | /metrics | Prometheus metrics |
| GET | /xray/ | X-Ray dashboard UI |
| GET | /xray/api/snapshot | X-Ray GPU snapshot JSON |
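The endpoints in the table can also be called without an SDK. As a minimal sketch, the snippet below lists registered models through GET /v1/models using only the standard library. It assumes the response follows the OpenAI list shape (a top-level "data" array of objects with an "id" field), which the table does not confirm; the helper names are illustrative.

```python
import json
import urllib.request


def model_ids(payload):
    """Extract model ids from an OpenAI-style list response (assumed shape)."""
    return [entry["id"] for entry in payload.get("data", [])]


def list_models(base_url="http://localhost:8080"):
    """GET /v1/models and return the ids of the registered models."""
    with urllib.request.urlopen(f"{base_url}/v1/models", timeout=5) as resp:
        return model_ids(json.load(resp))


# Usage (needs a running gateway):
# print(list_models())
```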
Kubernetes / Helm
Deploy to Kubernetes using the official VRAM Helm chart. The chart includes GPU node affinity, PVC storage for model cache, HPA, and Prometheus ServiceMonitor.
helm install ghostswap \
  oci://registry-1.docker.io/vramai/ghostswap \
  --version 0.1.0 \
  --set license.key="GSW1.your-key-here" \
  --set gateway.device=cuda \
  --set gpu.count=1 \
  -n ghostswap --create-namespace
kubectl get pods -n ghostswap -w
kubectl logs -n ghostswap deploy/ghostswap -f
kubectl port-forward -n ghostswap svc/ghostswap 8080:8080
License Keys
License keys are HMAC-signed tokens that encode your GPU limit, model limit, and expiry. Validation is fully offline: no license server or internet access is required.
Set your license key via environment variable or in config.yaml:
LICENSE_KEY=GSW1.your-key-here
license_key: "GSW1.your-key-here"
The gateway exits immediately with a clear error message if the key is missing or expired, or if the detected GPU or model count exceeds the license limits.
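To illustrate how an HMAC-signed key can be checked without contacting a server, here is a minimal sketch. The token layout (GSW1.payload.signature), the claim names (exp, max_gpus, max_models), and the use of SHA-256 are assumptions made for this example; the actual VRAM key format is not documented here and may differ.

```python
import base64
import hashlib
import hmac
import json
import time

PREFIX = "GSW1"  # prefix seen in the docs; the rest of the layout is hypothetical


def _b64(data):
    return base64.urlsafe_b64encode(data).decode().rstrip("=")


def _unb64(text):
    # Restore the padding that _b64 stripped before decoding.
    return base64.urlsafe_b64decode(text + "=" * (-len(text) % 4))


def sign(claims, secret):
    """Build a token in the hypothetical layout PREFIX.payload.signature."""
    payload = _b64(json.dumps(claims, sort_keys=True).encode())
    sig = hmac.new(secret, payload.encode(), hashlib.sha256).digest()
    return f"{PREFIX}.{payload}.{_b64(sig)}"


def validate(token, secret, gpus, models, now=None):
    """Check signature, expiry, and limits locally; raise ValueError on failure."""
    parts = token.split(".")
    if len(parts) != 3 or parts[0] != PREFIX:
        raise ValueError("malformed license key")
    _, payload, sig = parts
    expected = hmac.new(secret, payload.encode(), hashlib.sha256).digest()
    # compare_digest avoids leaking information through comparison timing.
    if not hmac.compare_digest(expected, _unb64(sig)):
        raise ValueError("invalid signature")
    claims = json.loads(_unb64(payload))
    if claims["exp"] < (time.time() if now is None else now):
        raise ValueError("license expired")
    if gpus > claims["max_gpus"] or models > claims["max_models"]:
        raise ValueError("GPU or model count exceeds license limits")
    return claims
```

Everything needed for the check (the signed payload and the verification secret) is available locally, which is why no network call is involved in this kind of scheme.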