Why vLLM?
vLLM (from UC Berkeley) is a high-throughput LLM inference engine that uses PagedAttention for efficient KV-cache management. On the DGX Spark's unified memory architecture, vLLM can outperform Ollama's default engine by 2-3x in throughput when serving multiple concurrent requests.
Installation
# vLLM 0.6+ supports ARM64 natively
pip install vllm
# Verify ARM64 build
python -c "import vllm; print(vllm.__version__)"
# Check that PyTorch sees the GB10 GPU (unified memory shows up as a single CUDA device)
python -c "import torch; print(torch.cuda.is_available(), torch.cuda.get_device_name(0))"
# For GPU access on the GB10, vLLM auto-detects the Blackwell architecture.
# No CUDA_VISIBLE_DEVICES needed: there is a single GPU sharing the unified memory pool.
Core Configuration
The key parameters that matter on the GB10:
from vllm import LLM, SamplingParams
llm = LLM(
    model="Qwen/Qwen3-35B-A3B",
    quantization="fp8",            # or "gguf" for GGUF models
    tensor_parallel_size=1,        # GB10 is a single chip: no multi-GPU parallelism, no Ray
    gpu_memory_utilization=0.85,   # 85% of 128 GB ≈ 109 GB for weights + KV cache
    max_model_len=8192,            # Balance context length vs concurrent requests
    dtype="auto",                  # Use the dtype from the model config
    max_num_batched_tokens=4096,   # Tokens per batch (tune for your workload)
    max_num_seqs=32,               # Concurrent sequences
    kv_cache_dtype="fp8",          # FP8 KV cache halves KV-cache memory
    enable_prefix_caching=True,    # Cache common prompt prefixes
    enforce_eager=False,           # Keep CUDA graphs enabled for speed
)
sampling_params = SamplingParams(
    temperature=0.7,
    top_p=0.9,
    max_tokens=1024,
    repetition_penalty=1.05,
)
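With the engine and sampling parameters configured, generation is a single call. A minimal usage sketch (the prompts are illustrative):
prompts = [
    "Explain PagedAttention in two sentences.",
    "Write a haiku about unified memory.",
]
# generate() batches all prompts through the engine and returns one RequestOutput per prompt
outputs = llm.generate(prompts, sampling_params)
for output in outputs:
    print(output.prompt)
    print(output.outputs[0].text)  # the first (and here only) completion for this prompt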
Key Parameters Explained
| Parameter | Default | Recommended for GB10 | Effect |
|---|---|---|---|
| gpu_memory_utilization | 0.9 | 0.80-0.85 | Leaves ~19-26 GB for OS + other services |
| kv_cache_dtype | auto | fp8 | Cuts KV cache VRAM by 50% for long contexts |
| max_model_len | 8192 | 4096-8192 | Higher = more VRAM per request, fewer concurrent |
| max_num_batched_tokens | 4096 | 2048-8192 | Tune based on avg request length |
| max_num_seqs | 256 | 8-32 | Concurrent users. GB10 sweet spot: 8-16 |
| enable_prefix_caching | False | True | Huge win for chatbots with repeated prompts |
| enforce_eager | False | False | Keep False for speed. Set True only if debugging |
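To see what enable_prefix_caching buys, a rough sketch: issue two requests that share a long system-prompt prefix and compare wall-clock times. The prefix text is illustrative, not a measured GB10 benchmark:
import time
from vllm import LLM, SamplingParams

llm = LLM(model="Qwen/Qwen3-35B-A3B", enable_prefix_caching=True)
params = SamplingParams(max_tokens=64)
shared_prefix = "You are a support assistant for ExampleCorp. " * 100  # long shared prefix

t0 = time.perf_counter()
llm.generate(shared_prefix + "How do I reset my password?", params)
t1 = time.perf_counter()
llm.generate(shared_prefix + "How do I change my email address?", params)
t2 = time.perf_counter()
# The second request reuses the prefix's cached KV blocks, so its prefill is much cheaper
print(f"first: {t1 - t0:.2f}s  second: {t2 - t1:.2f}s")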
ARM64-Specific Gotchas
1. FP8 Support on Blackwell
The GB10's Blackwell GPU natively supports FP8 (E4M3 and E5M2 formats). vLLM's FP8 quantization works out of the box:
# FP8 inference: confirm the engine is actually quantizing the weights to FP8
python -c "
from vllm import LLM
llm = LLM(model='Qwen/Qwen3-35B-A3B', quantization='fp8')
print(llm.llm_engine.model_config.quantization)
# Output: fp8
"
# Memory savings: FP8 weights take roughly half the space of FP16
# FP16 weights: ~70 GB | FP8 weights: ~35 GB
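The back-of-the-envelope arithmetic behind those numbers, assuming a ~35B-parameter model (real checkpoints add a few extra GB for embeddings, norms, and buffers):
def weight_memory_gb(num_params: float, bytes_per_param: float) -> float:
    """Approximate weight memory only; ignores activations and KV cache."""
    return num_params * bytes_per_param / 1e9

params = 35e9                         # ~35B parameters
print(weight_memory_gb(params, 2.0))  # FP16 (2 bytes/param): ~70 GB
print(weight_memory_gb(params, 1.0))  # FP8  (1 byte/param):  ~35 GB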
2. Tensor Parallelism is 1
The GB10 has a single GPU die, not a multi-GPU board, so do not set tensor_parallel_size > 1: tensor parallelism splits a model across multiple GPUs and does not apply here.
3. Memory Split (CPU vs GPU)
With unified memory there is a single physical pool; gpu_memory_utilization controls how much of it vLLM reserves for weights and KV cache, and whatever is left remains available for CPU-side work. Lower it if you need more headroom for preprocessing, raise it if you need room for more concurrent KV caches:
# More headroom for CPU-side preprocessing: shrink the reservation
llm = LLM(model="Qwen/Qwen3-35B-A3B", gpu_memory_utilization=0.75)
# More room for concurrent KV caches: grow the reservation
llm = LLM(model="Qwen/Qwen3-35B-A3B", gpu_memory_utilization=0.92)
4. ARM64 PyTorch Compilation
vLLM uses PTX JIT compilation. On ARM64, this can take 10-30 seconds on first inference:
# Pre-compile to avoid first-request lag
python -c "
from vllm import LLM, SamplingParams
llm = LLM(model='Qwen/Qwen3-35B-A3B')
# Run a dummy request to trigger compilation
llm.generate('test', sampling_params=SamplingParams(max_tokens=1))
print('Compilation complete. Subsequent requests will be fast.')
"
5. NUMA Awareness
The GB10 has a single NUMA node (all memory is coherent), but CPU cores are partitioned into performance and efficiency clusters:
# Bind to the single NUMA node (a no-op here, but harmless)
numactl --cpunodebind=0 --membind=0 python serve.py
# Pin vLLM to the performance cores (8-27 in this example):
taskset -c 8-27 python serve.py
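If you start vLLM from Python rather than through a shell wrapper, the same pinning can be done in-process with os.sched_setaffinity (the 8-27 core range is copied from above; check your actual topology with lscpu):
import os

# Restrict this process (and the workers it spawns) to the performance cores
os.sched_setaffinity(0, set(range(8, 28)))

from vllm import LLM
llm = LLM(model="Qwen/Qwen3-35B-A3B")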
Performance Tuning
Concurrency vs Latency Tradeoff
| max_num_seqs | Approx Throughput (tok/s) | Avg Latency (ms/token) | VRAM Used |
|---|---|---|---|
| 1 | ~50-60 | ~15-25 | ~32 GB |
| 4 | ~150-200 | ~20-30 | ~55 GB |
| 8 | ~250-320 | ~25-35 | ~72 GB |
| 16 | ~350-420 | ~35-50 | ~91 GB |
| 32 | ~400-500 | ~55-80 | ~110 GB |
Numbers vary by model, batch size, and input length. These are rough estimates — run your own benchmarks with your actual workload for accurate figures.
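A minimal offline benchmark sketch for reproducing a table like this on your own workload (it reuses the llm and sampling_params objects from the configuration above; the prompt is a placeholder):
import time

def bench(llm, sampling_params, num_seqs: int) -> float:
    """Generated tokens per second for num_seqs identical concurrent requests."""
    prompts = ["Summarize the benefits of unified memory."] * num_seqs
    start = time.perf_counter()
    outputs = llm.generate(prompts, sampling_params)
    elapsed = time.perf_counter() - start
    generated_tokens = sum(len(o.outputs[0].token_ids) for o in outputs)
    return generated_tokens / elapsed

for n in (1, 4, 8, 16, 32):
    print(f"{n:>3} concurrent: {bench(llm, sampling_params, n):.1f} tok/s")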
Context Length vs VRAM
| max_model_len | VRAM per request | Concurrent (max_num_seqs=8) |
|---|---|---|
| 1024 | 4.2 GB | ~34 GB total |
| 2048 | 6.8 GB | ~55 GB total |
| 4096 | 11.5 GB | ~92 GB total |
| 8192 | 19.2 GB | ~154 GB total (OOM!) |
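The per-request cost is driven largely by the KV cache, which you can estimate from the model architecture with the standard formula below. The layer/head/dim values are placeholders; read the real ones from the model's config.json:
def kv_cache_gb(seq_len: int, num_layers: int, num_kv_heads: int,
                head_dim: int, bytes_per_elem: int) -> float:
    """Per-sequence KV cache: 2 (K and V) * layers * kv_heads * head_dim * tokens * bytes."""
    return 2 * num_layers * num_kv_heads * head_dim * seq_len * bytes_per_elem / 1e9

# Placeholder architecture values; use bytes_per_elem=2 for FP16 KV, 1 for FP8 KV
print(kv_cache_gb(seq_len=8192, num_layers=48, num_kv_heads=8, head_dim=128, bytes_per_elem=2))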
Using FP8 KV Cache
The biggest VRAM hog is the KV cache for long contexts. Switching to FP8 cuts it in half:
llm = LLM(
    model="Qwen/Qwen3-35B-A3B",
    quantization="fp8",
    kv_cache_dtype="fp8",        # <-- the key setting
    max_model_len=8192,
    max_num_seqs=8,
    gpu_memory_utilization=0.85,
)
# With FP8 KV: max_model_len=8192 works at 8 concurrent requests
# Without FP8 KV: same config → OOM
Benchmark: vLLM vs Ollama
| Metric | vLLM (FP8) | Ollama (Q4_K_M) | Winner |
|---|---|---|---|
| Single request throughput | Varies by model | Varies by model | Benchmark yours |
| 8 concurrent throughput | High (PagedAttention + continuous batching) | Limited (set OLLAMA_NUM_PARALLEL) | vLLM |
| VRAM efficiency | FP8 weights + FP8 KV cache | Q4_K_M (GGUF) | Depends on model |
| First-token latency | Varies by model and context | Varies by model and context | Benchmark yours |
| Multi-model serving | One model per server instance | Loads and swaps models on demand | Ollama |
| HTTP API | Built-in OpenAI-compatible | Built-in OpenAI-compatible | Draw |
Production Deployment
# systemd service for vLLM
sudo tee /etc/systemd/system/vllm.service >/dev/null <<'EOF'
[Unit]
Description=vLLM LLM Inference Server
After=network.target
[Service]
Type=simple
User=spark
WorkingDirectory=/opt/vllm
Environment=PATH=/home/spark/miniforge3/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin
ExecStart=/home/spark/miniforge3/bin/python /opt/vllm/server.py
Restart=always
RestartSec=5
OOMScoreAdj=-1000
[Install]
WantedBy=multi-user.target
EOF
sudo systemctl daemon-reload
sudo systemctl enable vllm
sudo systemctl start vllm
# Verify
curl http://localhost:8000/v1/models
# Returns: { "object": "list", "data": [ { "id": "Qwen/Qwen3-35B-A3B", ... } ] }
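Once the service is up, any OpenAI-compatible client can talk to it (this assumes server.py exposes the standard OpenAI-style endpoints, which the /v1/models check above implies). A sketch with the openai Python package:
from openai import OpenAI

# Local vLLM server: the API key is ignored by the server but required by the client
client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")

response = client.chat.completions.create(
    model="Qwen/Qwen3-35B-A3B",
    messages=[{"role": "user", "content": "Give me one sentence about the DGX Spark."}],
    max_tokens=128,
)
print(response.choices[0].message.content)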