VRAM AI Suite

Three products.
One GPU stack.

VRAM Gateway handles model serving. X-Ray handles diagnostics. Model Compression shrinks your models so they need less GPU to run. Together they cut your GPU bill by up to 80%.

VRAM AI Gateway

Serve unlimited models on a single GPU

VRAM sits in front of your inference stack as an OpenAI-compatible proxy. It maintains a pool of models, swapping them in and out of VRAM based on actual request traffic — not manual management.

You define a VRAM budget. VRAM handles everything else: LRU eviction, NVMe spill, prefetching, LoRA switching, and GPU Direct Storage. Most AI teams run one model per GPU — paying full price for hardware that sits idle between requests. VRAM changes that. By dynamically loading and evicting models based on real traffic, you can run 8–12 models on a single GPU that previously handled one. That's the same throughput at a fraction of the infrastructure cost — cut your GPU spend by up to 60%, serve more models without adding hardware, and plug it in with zero changes to your existing code.

Cut Your GPU Costs

LRU Model Eviction

Least-recently-used models are automatically evicted to RAM or NVMe when VRAM fills up. Reload is transparent to callers.

NVMe Spill Tier

Models too large for RAM are cached on fast NVMe SSDs. GDS (GPU Direct Storage) supported for maximum throughput.

Multi-GPU Support

Distribute models across multiple GPUs. Per-GPU VRAM budgets. Tensor-parallel inference for oversized models.

LoRA Adapter Switching

Hot-swap LoRA adapters on a shared base model without reloading weights. 100ms adapter switch latency.

Offline License Enforcement

GPU count and model count limits baked into the signed license key. Validated at startup and at every API call.

OpenAI-Compatible API

Drop-in replacement for /v1/completions, /v1/chat/completions, /v1/embeddings. No SDK changes required.

Real-Time GPU Metrics

VRAM usage, GPU utilization, temperature, power draw — polled every 2 seconds via NVML.

Waste Pattern Detection

Identifies memory leaks, oversized batches, idle models, and fragmented VRAM automatically.

Dollar-Cost Attribution

Maps each GPU waste pattern to an hourly cloud cost. See exactly how much each idle model costs.

30+ GPU Catalog

Covers A100, H100, RTX 4090, A40, T4, and all major cloud and on-prem GPU types.

Fleet View

Monitor multiple nodes from a single dashboard. Aggregate fleet-level waste and cost.

Streaming Updates

Server-Sent Events push updates to the browser — no polling, no page refresh.

X-Ray Dashboard

Know exactly what your GPU is doing

X-Ray is a real-time GPU waste scanner embedded directly into VRAM. It surfaces the exact models wasting money, the dollar cost per hour, and prescriptive fixes for each waste pattern.

Accessible at /xray inside the VRAM gateway. No separate deployment needed.

Contact Sales

Production-ready from day one

Docker + Helm. Deploys anywhere NVIDIA GPUs run.

Docker

docker pull vramai/ghostswap

docker run -e LICENSE_KEY=...

-e MODE=ghostswap

--gpus all

Kubernetes / Helm

helm install ghostswap

oci://registry-1.docker.io/

vramai/ghostswap

--set license.key=GSW1.xxx

Your App

openai.base_url =

"http://ghostswap:8080/v1"

# No other changes needed

Ready to get started?

Talk to our team to get a license key and onboarding support.

Contact Sales Read the Docs

Three products.One GPU stack.

Serve unlimited models on a single GPU

Know exactly what your GPU is doing

Production-ready from day one

Ready to get started?

Three products.
One GPU stack.