VRAM AI Docs

Everything you need to deploy and operate VRAM AI in production.

Quick Start

Get VRAM running in under 5 minutes using Docker. You need an NVIDIA GPU with CUDA 12.1+ and Docker with the NVIDIA Container Toolkit installed.

1. Pull the image

docker pull vramai/ghostswap:latest

2. Run with your license key

docker run -d \
  --gpus all \
  -p 8080:8080 \
  -e LICENSE_KEY="GSW1.your-key-here" \
  -e MODE=ghostswap \
  --name ghostswap \
  vramai/ghostswap:latest

3. Verify it's running

curl http://localhost:8080/health

Open http://localhost:8080/xray in your browser to see the X-Ray GPU dashboard.
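The container may need some time to start before step 3 succeeds. The following is a minimal Python sketch that polls the health endpoint until it responds. It assumes that /health answers with HTTP 200 once the gateway is ready, which this page does not state explicitly. The function names, retry count, and delay are illustrative, not part of VRAM.

```python
import time
import urllib.error
import urllib.request


def http_status(url, timeout=2.0):
    """Return the HTTP status code of a GET request, or 0 if the host is unreachable."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status
    except urllib.error.HTTPError as err:
        # The server answered, but with an error status (for example 503).
        return err.code
    except (urllib.error.URLError, OSError):
        # Connection refused, DNS failure, timeout, and similar.
        return 0


def wait_until_healthy(url, attempts=30, delay=2.0, probe=http_status, sleep=time.sleep):
    """Poll `url` until it returns 200 (assumed to mean "ready") or attempts run out."""
    for attempt in range(1, attempts + 1):
        if probe(url) == 200:
            return True
        if attempt < attempts:
            sleep(delay)
    return False


# Example call against the Quick Start container:
# wait_until_healthy("http://localhost:8080/health")
```

The `probe` and `sleep` parameters exist only so the loop can be exercised without a running gateway.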

Configuration

VRAM is configured via a YAML file mounted at /config/config.yaml. Key fields:

config.yaml

vram_budget_gb: 24      # Total VRAM budget across all models
ram_budget_gb: 64       # RAM spill tier size
device: "cuda"          # cuda | cpu
log_level: "INFO"

# NVMe fast tier for large models
nvme_dir: "/models/nvme"
auto_cache_nvme: true

# Models to preload at startup
models:
  - id: "mistral-7b"
    path: "mistralai/Mistral-7B-Instruct-v0.3"
    vram_gb: 14.0
    dtype: "float16"

  - id: "llama-3-8b"
    path: "meta-llama/Meta-Llama-3-8B-Instruct"
    vram_gb: 16.0
    dtype: "float16"
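In the example above, the two preloaded models declare 14.0 + 16.0 = 30 GB against a vram_budget_gb of 24, so their declared sizes cannot both fit in the VRAM budget at once. The sketch below only does that arithmetic on the declared values. It is a hypothetical pre-flight check, not part of VRAM, and it makes no claim about how the gateway actually places or swaps models. The config is written as a plain dict to avoid a YAML dependency.

```python
def vram_overcommit_gb(config):
    """GB requested by preloaded models beyond vram_budget_gb (0.0 if they all fit)."""
    requested = sum(float(m["vram_gb"]) for m in config.get("models", []))
    return max(0.0, requested - float(config["vram_budget_gb"]))


def oversized_models(config):
    """IDs of models whose declared vram_gb alone exceeds the whole budget."""
    budget = float(config["vram_budget_gb"])
    return [m["id"] for m in config.get("models", []) if float(m["vram_gb"]) > budget]


# Relevant fields of the example config.yaml, as a dict.
config = {
    "vram_budget_gb": 24,
    "ram_budget_gb": 64,
    "models": [
        {"id": "mistral-7b", "vram_gb": 14.0},
        {"id": "llama-3-8b", "vram_gb": 16.0},
    ],
}

print(vram_overcommit_gb(config))  # 6.0
print(oversized_models(config))    # []
```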

API Reference

VRAM exposes an OpenAI-compatible REST API. Point any OpenAI SDK at your VRAM instance by setting the SDK's base URL to the instance's /v1 path.

Python — OpenAI SDK

import openai

client = openai.OpenAI(
    base_url="http://localhost:8080/v1",
    api_key="not-used"  # license key is set server-side
)

response = client.chat.completions.create(
    model="mistral-7b",
    messages=[{"role": "user", "content": "Hello!"}]
)
print(response.choices[0].message.content)
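Where the SDK is not available, the same call can be sketched with the Python standard library. This assumes the request and response bodies follow the OpenAI chat completions format, as "OpenAI-compatible" suggests, and that the gateway accepts requests without an Authorization header. Neither is confirmed on this page, so treat both as assumptions. The helper names are illustrative.

```python
import json
import urllib.request


def build_chat_request(base_url, model, prompt):
    """Build (but do not send) a POST request for <base_url>/chat/completions."""
    body = json.dumps({
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
    }).encode("utf-8")
    return urllib.request.Request(
        url=f"{base_url.rstrip('/')}/chat/completions",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def chat(base_url, model, prompt, timeout=60.0):
    """Send the request and return the first choice's text (OpenAI response shape assumed)."""
    req = build_chat_request(base_url, model, prompt)
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        payload = json.load(resp)
    return payload["choices"][0]["message"]["content"]


# Example call against the Quick Start container:
# print(chat("http://localhost:8080/v1", "mistral-7b", "Hello!"))
```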

Endpoints

Method	Path	Description
GET	/health	Health check
GET	/v1/models	List registered models
POST	/v1/completions	Text completion
POST	/v1/chat/completions	Chat completion
POST	/v1/embeddings	Text embeddings
POST	/admin/models	Register model at runtime
DELETE	/admin/models/:id	Remove model at runtime
POST	/admin/models/upload	Upload custom model file
GET	/metrics	Prometheus metrics
GET	/xray/	X-Ray dashboard UI
GET	/xray/api/snapshot	X-Ray GPU snapshot JSON
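The /metrics endpoint is listed as serving Prometheus metrics, which are normally scraped by a Prometheus server. For ad hoc checks, the text exposition format can also be read directly. The sketch below is a deliberately simple parser for that format. It ignores comment lines and optional timestamps. The metric names in the sample are hypothetical, since this page does not list the metrics VRAM actually exports.

```python
def parse_prometheus_text(text):
    """Map each series (name plus optional {labels}) to its sample value as a float."""
    samples = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and # HELP / # TYPE comments
        if "}" in line:
            # Labels present: everything up to the last "}" is the series identifier.
            head, _, tail = line.rpartition("}")
            series = head + "}"
        else:
            series, _, tail = line.partition(" ")
        fields = tail.split()
        if not fields:
            continue  # malformed line without a value
        samples[series] = float(fields[0])  # fields[1], if present, is a timestamp
    return samples


# Hypothetical sample; real metric names will differ.
sample = """\
# HELP example_vram_used_bytes Hypothetical metric, for illustration only.
# TYPE example_vram_used_bytes gauge
example_vram_used_bytes{gpu="0"} 1.5e+10
example_requests_total 42
"""

metrics = parse_prometheus_text(sample)
```

For anything beyond a quick check, a maintained Prometheus client library is the safer choice, since the format has escaping rules this sketch does not handle.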

Kubernetes / Helm

Deploy to Kubernetes using the official VRAM Helm chart. The chart includes GPU node affinity, PVC-backed storage for the model cache, a HorizontalPodAutoscaler (HPA), and a Prometheus ServiceMonitor.

Install via Helm

helm install ghostswap \
  oci://registry-1.docker.io/vramai/ghostswap \
  --version 0.1.0 \
  --set license.key="GSW1.your-key-here" \
  --set gateway.device=cuda \
  --set gpu.count=1 \
  -n ghostswap --create-namespace

Check status

kubectl get pods -n ghostswap -w
kubectl logs -n ghostswap deploy/ghostswap -f
kubectl port-forward -n ghostswap svc/ghostswap 8080:8080

License Keys

License keys are HMAC-signed tokens that encode your GPU limit, model limit, and expiry date. Validation is fully offline: no license server or internet access is required.

Set your license key via an environment variable or in config.yaml:

Environment variable

LICENSE_KEY=GSW1.your-key-here

config.yaml

license_key: "GSW1.your-key-here"

The gateway exits immediately with a clear error message if the key is missing or expired, or if the detected GPU or model count exceeds the license limits.
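To illustrate how an HMAC-signed token can be validated offline in general, here is a minimal Python sketch. It is not VRAM's implementation: the real GSW1 key format is not documented on this page, so the `DEMO1.<payload>.<signature>` layout, the claim names (`gpu_limit`, `model_limit`, `expires_at`), and the use of SHA-256 are all assumptions made for the example.

```python
import base64
import hashlib
import hmac
import json
import time


def _b64(data):
    """URL-safe base64 without padding."""
    return base64.urlsafe_b64encode(data).rstrip(b"=").decode("ascii")


def _unb64(text):
    """Inverse of _b64: restore padding, then decode."""
    return base64.urlsafe_b64decode(text + "=" * (-len(text) % 4))


def sign_token(claims, secret):
    """Issue a hypothetical DEMO1 token: prefix, encoded claims, HMAC-SHA256 signature."""
    payload = _b64(json.dumps(claims, sort_keys=True).encode("utf-8"))
    sig = hmac.new(secret, payload.encode("ascii"), hashlib.sha256).digest()
    return f"DEMO1.{payload}.{_b64(sig)}"


def verify_token(token, secret, now=None):
    """Return the claims if the signature matches and the token is unexpired, else raise ValueError."""
    parts = token.split(".")
    if len(parts) != 3 or parts[0] != "DEMO1":
        raise ValueError("malformed token")
    _, payload, sig = parts
    expected = hmac.new(secret, payload.encode("ascii"), hashlib.sha256).digest()
    # compare_digest avoids leaking how many leading bytes matched.
    if not hmac.compare_digest(expected, _unb64(sig)):
        raise ValueError("bad signature")
    claims = json.loads(_unb64(payload))
    current = time.time() if now is None else now
    if claims["expires_at"] <= current:
        raise ValueError("license expired")
    return claims


def check_limits(claims, gpu_count, model_count):
    """Raise ValueError if detected counts exceed the (hypothetical) licensed limits."""
    if gpu_count > claims["gpu_limit"]:
        raise ValueError("GPU count exceeds license limit")
    if model_count > claims["model_limit"]:
        raise ValueError("model count exceeds license limit")
```

No network call is involved: the verifier only needs the token and the signing secret, which is what makes this style of check work offline.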
