VRAM AI Suite

Three products.
One GPU stack.

VRAM Gateway handles model serving. X-Ray handles diagnostics. Model Compression shrinks your models so they need less GPU to run. Together they cut your GPU bill by up to 80%.

VRAM AI Gateway

Serve unlimited models on a single GPU

VRAM sits in front of your inference stack as an OpenAI-compatible proxy. It maintains a pool of models, swapping them in and out of VRAM based on actual request traffic — not manual management.

You define a VRAM budget. VRAM handles everything else: LRU eviction, NVMe spill, prefetching, LoRA switching, and GPU Direct Storage. Most AI teams run one model per GPU — paying full price for hardware that sits idle between requests. VRAM changes that. By dynamically loading and evicting models based on real traffic, you can run 8–12 models on a single GPU that previously handled one. That's the same throughput at a fraction of the infrastructure cost — cut your GPU spend by up to 60%, serve more models without adding hardware, and plug it in with zero changes to your existing code.

Cut Your GPU Costs
LRU Model Eviction
Least-recently-used models are automatically evicted to RAM or NVMe when VRAM fills up. Reload is transparent to callers.
NVMe Spill Tier
Models too large for RAM are cached on fast NVMe SSDs. GDS (GPU Direct Storage) supported for maximum throughput.
Multi-GPU Support
Distribute models across multiple GPUs. Per-GPU VRAM budgets. Tensor-parallel inference for oversized models.
LoRA Adapter Switching
Hot-swap LoRA adapters on a shared base model without reloading weights. 100ms adapter switch latency.
Offline License Enforcement
GPU count and model count limits baked into the signed license key. Validated at startup and at every API call.
OpenAI-Compatible API
Drop-in replacement for /v1/completions, /v1/chat/completions, /v1/embeddings. No SDK changes required.
Real-Time GPU Metrics
VRAM usage, GPU utilization, temperature, power draw — polled every 2 seconds via NVML.
Waste Pattern Detection
Identifies memory leaks, oversized batches, idle models, and fragmented VRAM automatically.
Dollar-Cost Attribution
Maps each GPU waste pattern to an hourly cloud cost. See exactly how much each idle model costs.
30+ GPU Catalog
Covers A100, H100, RTX 4090, A40, T4, and all major cloud and on-prem GPU types.
Fleet View
Monitor multiple nodes from a single dashboard. Aggregate fleet-level waste and cost.
Streaming Updates
Server-Sent Events push updates to the browser — no polling, no page refresh.
X-Ray Dashboard

Know exactly what your GPU is doing

X-Ray is a real-time GPU waste scanner embedded directly into VRAM. It surfaces the exact models wasting money, the dollar cost per hour, and prescriptive fixes for each waste pattern.

Accessible at /xray inside the VRAM gateway. No separate deployment needed.

Contact Sales

Production-ready from day one

Docker + Helm. Deploys anywhere NVIDIA GPUs run.

Docker
docker pull vramai/ghostswap
docker run -e LICENSE_KEY=...
-e MODE=ghostswap
--gpus all
Kubernetes / Helm
helm install ghostswap
oci://registry-1.docker.io/
vramai/ghostswap
--set license.key=GSW1.xxx
Your App
openai.base_url =
"http://ghostswap:8080/v1"
 
# No other changes needed

Ready to get started?

Talk to our team to get a license key and onboarding support.