Documentation

VRAM AI Docs

Everything you need to deploy and operate VRAM AI in production.

Quick Start

Get VRAM running in under 5 minutes using Docker. You need an NVIDIA GPU with CUDA 12.1+ and Docker with the NVIDIA Container Toolkit installed.

1. Pull the image
docker pull vramai/ghostswap:latest
2. Run with your license key
docker run -d \
  --gpus all \
  -p 8080:8080 \
  -e LICENSE_KEY="GSW1.your-key-here" \
  -e MODE=ghostswap \
  --name ghostswap \
  vramai/ghostswap:latest
3. Verify it's running
curl http://localhost:8080/health

Open http://localhost:8080/xray in your browser to see the X-Ray GPU dashboard.
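
In scripts or CI, the verification step can be automated by polling the health endpoint until the container is ready. The following is a minimal sketch using only the Python standard library. It assumes /health answers with HTTP 200 once the gateway is up (the response body is not described here, so it is ignored). The helper names is_healthy and wait_until_healthy are illustrative and not part of VRAM.

```python
import time
import urllib.error
import urllib.request


def is_healthy(base_url="http://localhost:8080", timeout=2.0,
               opener=urllib.request.urlopen):
    """Return True if GET {base_url}/health answers with HTTP 200."""
    try:
        with opener(f"{base_url}/health", timeout=timeout) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        # Connection refused, timeout, or an HTTP error status:
        # treat all of them as "not ready yet".
        return False


def wait_until_healthy(base_url="http://localhost:8080", attempts=30,
                       delay=1.0, check=is_healthy):
    """Poll the health endpoint up to `attempts` times, `delay` seconds apart."""
    for _ in range(attempts):
        if check(base_url):
            return True
        time.sleep(delay)
    return False


if __name__ == "__main__":
    ready = wait_until_healthy()
    print("gateway is up" if ready else "gateway did not become healthy")
```

The opener and check parameters exist only so the helpers can be exercised without a running container.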

Configuration

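VRAM is configured via a YAML file mounted into the container at /config/config.yaml (with Docker, for example, by adding -v "$PWD/config.yaml:/config/config.yaml:ro" to the run command). Key fields: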

config.yaml
vram_budget_gb: 24      # Total VRAM budget across all models
ram_budget_gb: 64       # RAM spill tier size
device: "cuda"          # cuda | cpu
log_level: "INFO"

# NVMe fast tier for large models
nvme_dir: "/models/nvme"
auto_cache_nvme: true

# Models to preload at startup
models:
  - id: "mistral-7b"
    path: "mistralai/Mistral-7B-Instruct-v0.3"
    vram_gb: 14.0
    dtype: "float16"

  - id: "llama-3-8b"
    path: "meta-llama/Meta-Llama-3-8B-Instruct"
    vram_gb: 16.0
    dtype: "float16"
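
Note that the two models in this example declare 14.0 + 16.0 = 30.0 GB against a vram_budget_gb of 24. How the gateway handles the difference is not specified in this section, so the sketch below only reports it. The dict mirrors the example values by hand, and the function names are illustrative rather than part of VRAM.

```python
# Values copied from the config.yaml example above. A real script would
# load the file with a YAML parser instead of hard-coding a dict.
EXAMPLE_CONFIG = {
    "vram_budget_gb": 24,
    "ram_budget_gb": 64,
    "models": [
        {"id": "mistral-7b", "vram_gb": 14.0},
        {"id": "llama-3-8b", "vram_gb": 16.0},
    ],
}


def declared_vram_gb(config):
    """Sum the vram_gb declared by every model in the config."""
    return sum(model["vram_gb"] for model in config.get("models", []))


def vram_overcommit_gb(config):
    """Return how far the declared total exceeds vram_budget_gb (0.0 if it fits)."""
    return max(0.0, declared_vram_gb(config) - config["vram_budget_gb"])


if __name__ == "__main__":
    print(f"declared: {declared_vram_gb(EXAMPLE_CONFIG)} GB")
    print(f"over budget by: {vram_overcommit_gb(EXAMPLE_CONFIG)} GB")
```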

API Reference

VRAM exposes an OpenAI-compatible REST API. Point any OpenAI SDK at your VRAM instance.

Python — OpenAI SDK
import openai

client = openai.OpenAI(
    base_url="http://localhost:8080/v1",
    api_key="not-used"  # license key is set server-side
)

response = client.chat.completions.create(
    model="mistral-7b",
    messages=[{"role": "user", "content": "Hello!"}]
)
print(response.choices[0].message.content)
Endpoints
Method  Path                      Description
GET     /health                   Health check
GET     /v1/models                List registered models
POST    /v1/completions           Text completion
POST    /v1/chat/completions      Chat completion
POST    /v1/embeddings            Text embeddings
POST    /admin/models             Register model at runtime
DELETE  /admin/models/:id         Remove model at runtime
POST    /admin/models/upload      Upload custom model file
GET     /metrics                  Prometheus metrics
GET     /xray/                    X-Ray dashboard UI
GET     /xray/api/snapshot        X-Ray GPU snapshot JSON
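
The read-only GET endpoints can also be queried without the OpenAI SDK. The sketch below uses only the Python standard library. It assumes /v1/models follows the OpenAI list shape ({"object": "list", "data": [{"id": ...}]}), which is inferred from the API being OpenAI-compatible rather than stated here. The schema of /xray/api/snapshot is not documented in this section, so that payload is returned unparsed. The helper names are illustrative.

```python
import json
import urllib.request


def get_json(path, base_url="http://localhost:8080", timeout=5.0,
             opener=urllib.request.urlopen):
    """GET {base_url}{path} and decode the response body as JSON."""
    with opener(f"{base_url}{path}", timeout=timeout) as resp:
        return json.load(resp)


def list_model_ids(**kwargs):
    """Return the ids of registered models (assumes the OpenAI list shape)."""
    payload = get_json("/v1/models", **kwargs)
    return [model["id"] for model in payload.get("data", [])]


def xray_snapshot(**kwargs):
    """Return the X-Ray GPU snapshot as parsed JSON, without interpreting it."""
    return get_json("/xray/api/snapshot", **kwargs)


if __name__ == "__main__":
    print("models:", list_model_ids())
    print("snapshot:", json.dumps(xray_snapshot(), indent=2))
```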

Kubernetes / Helm

Deploy to Kubernetes using the official VRAM Helm chart. The chart includes GPU node affinity, a PersistentVolumeClaim (PVC) for the model cache, a HorizontalPodAutoscaler (HPA), and a Prometheus ServiceMonitor.

Install via Helm
helm install ghostswap \
  oci://registry-1.docker.io/vramai/ghostswap \
  --version 0.1.0 \
  --set license.key="GSW1.your-key-here" \
  --set gateway.device=cuda \
  --set gpu.count=1 \
  -n ghostswap --create-namespace
Check status
kubectl get pods -n ghostswap -w
kubectl logs -n ghostswap deploy/ghostswap -f
kubectl port-forward -n ghostswap svc/ghostswap 8080:8080

License Keys

License keys are HMAC-signed tokens that encode your GPU limit, model limit, and expiry date. Validation is fully offline: no license server or internet access is required.

Set your license key via environment variable or in config.yaml:

Environment variable
LICENSE_KEY=GSW1.your-key-here
config.yaml
license_key: "GSW1.your-key-here"

The gateway exits immediately with a clear error message if the key is missing or expired, or if the detected GPU or model count exceeds the license limits.
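
To illustrate how an HMAC-signed token can be checked entirely offline, here is a generic sketch. It is not VRAM's actual key format or validation code. The DEMO1 prefix, the prefix.payload.signature layout, the field names (max_gpus, max_models, expires_at), and the choice of HMAC-SHA256 are all assumptions made for the example.

```python
import base64
import hashlib
import hmac
import json
import time


def sign_token(payload, secret, prefix="DEMO1"):
    """Encode payload as base64 JSON and append an HMAC-SHA256 signature."""
    raw = json.dumps(payload, sort_keys=True).encode()
    body = base64.urlsafe_b64encode(raw).decode()
    sig = hmac.new(secret, body.encode(), hashlib.sha256).hexdigest()
    return f"{prefix}.{body}.{sig}"


def verify_token(token, secret, gpu_count, model_count, now=None):
    """Return the payload if the token is authentic, unexpired, and within limits.

    Raises ValueError otherwise. No network access is involved: the check
    needs only the token, the shared secret, and the local clock.
    """
    try:
        _prefix, body, sig = token.split(".")
    except ValueError:
        raise ValueError("malformed token") from None
    expected = hmac.new(secret, body.encode(), hashlib.sha256).hexdigest()
    # Constant-time comparison avoids leaking how many characters matched.
    if not hmac.compare_digest(sig.encode(), expected.encode()):
        raise ValueError("bad signature")
    payload = json.loads(base64.urlsafe_b64decode(body))
    now = time.time() if now is None else now
    if payload["expires_at"] <= now:
        raise ValueError("license expired")
    if gpu_count > payload["max_gpus"]:
        raise ValueError("GPU count exceeds license limit")
    if model_count > payload["max_models"]:
        raise ValueError("model count exceeds license limit")
    return payload


if __name__ == "__main__":
    secret = b"demo-secret"  # hypothetical shared secret
    limits = {"max_gpus": 2, "max_models": 4, "expires_at": time.time() + 3600}
    token = sign_token(limits, secret)
    print(verify_token(token, secret, gpu_count=1, model_count=2))
```

The signature is verified before the payload is decoded, so a tampered token is rejected without its contents ever being trusted.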
