Why vLLM?
vLLM (from UC Berkeley) is a high-throughput LLM inference engine that uses PagedAttention for efficient KV-cache management. On the DGX Spark's unified memory architecture, vLLM can outperform Ollama's default engine by 2-3x in throughput when serving multiple concurrent requests.
Installation
# vLLM 0.6+ supports ARM64 natively
pip install vllm
# Verify ARM64 build
python -c "import vllm; print(vllm.__version__)"
# Check that PyTorch sees the GB10 GPU (unified memory shows up as a single CUDA device)
python -c "import torch; print(torch.cuda.is_available(), torch.cuda.get_device_name(0))"
# For GPU access on the GB10, vLLM auto-detects the Blackwell architecture.
# No CUDA_VISIBLE_DEVICES needed: there is a single GPU sharing the unified memory pool.
Core Configuration
The key parameters that matter on the GB10:
from vllm import LLM, SamplingParams
llm = LLM(
    model="Qwen/Qwen3-35B-A3B",
    quantization="fp8",            # or "gguf" for GGUF models
    tensor_parallel_size=1,        # GB10 is a single chip: no multi-GPU parallelism, no Ray
    gpu_memory_utilization=0.85,   # 85% of 128 GB ≈ 109 GB for weights + KV cache
    max_model_len=8192,            # Balance context length vs concurrent requests
    dtype="auto",                  # Use the dtype from the model config
    max_num_batched_tokens=4096,   # Tokens per batch (tune for your workload)
    max_num_seqs=32,               # Concurrent sequences
    kv_cache_dtype="fp8",          # FP8 KV cache halves KV-cache memory
    enable_prefix_caching=True,    # Cache common prompt prefixes
    enforce_eager=False,           # Keep CUDA graphs enabled for speed
)
sampling_params = SamplingParams(
    temperature=0.7,
    top_p=0.9,
    max_tokens=1024,
    repetition_penalty=1.05,
)
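With the engine and sampling parameters configured, generation is a single call. A minimal usage sketch (the prompts are illustrative):
prompts = [
    "Explain PagedAttention in two sentences.",
    "Write a haiku about unified memory.",
]
# generate() batches all prompts through the engine and returns one RequestOutput per prompt
outputs = llm.generate(prompts, sampling_params)
for output in outputs:
    print(output.prompt)
    print(output.outputs[0].text)  # the first (and here only) completion for this prompt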
Key Parameters Explained
| Parameter | Default | Recommended for GB10 | Effect |
|---|---|---|---|
| gpu_memory_utilization | 0.9 | 0.80-0.85 | Leaves ~19-26 GB for OS + other services |
| kv_cache_dtype | auto | fp8 | Cuts KV cache VRAM by 50% for long contexts |
| max_model_len | 8192 | 4096-8192 | Higher = more VRAM per request, fewer concurrent |
| max_num_batched_tokens | 4096 | 2048-8192 | Tune based on avg request length |
| max_num_seqs | 256 | 8-32 | Concurrent users. GB10 sweet spot: 8-16 |
| enable_prefix_caching | False | True | Huge win for chatbots with repeated prompts |
| enforce_eager | False | False | Keep False for speed. Set True only if debugging |
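To see what enable_prefix_caching buys, a rough sketch: issue two requests that share a long system-prompt prefix and compare wall-clock times. The prefix text is illustrative, not a measured GB10 benchmark:
import time
from vllm import LLM, SamplingParams

llm = LLM(model="Qwen/Qwen3-35B-A3B", enable_prefix_caching=True)
params = SamplingParams(max_tokens=64)
shared_prefix = "You are a support assistant for ExampleCorp. " * 100  # long shared prefix

t0 = time.perf_counter()
llm.generate(shared_prefix + "How do I reset my password?", params)
t1 = time.perf_counter()
llm.generate(shared_prefix + "How do I change my email address?", params)
t2 = time.perf_counter()
# The second request reuses the prefix's cached KV blocks, so its prefill is much cheaper
print(f"first: {t1 - t0:.2f}s  second: {t2 - t1:.2f}s")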
ARM64-Specific Gotchas
1. FP8 Support on Blackwell
The GB10's Blackwell GPU natively supports FP8 (E4M3 and E5M2 formats). vLLM's FP8 quantization works out of the box:
# FP8 inference: confirm the engine is actually quantizing the weights to FP8
python -c "
from vllm import LLM
llm = LLM(model='Qwen/Qwen3-35B-A3B', quantization='fp8')
print(llm.llm_engine.model_config.quantization)
# Output: fp8
"
# Memory savings: FP8 weights take roughly half the space of FP16
# FP16 weights: ~70 GB | FP8 weights: ~35 GB
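The back-of-the-envelope arithmetic behind those numbers, assuming a ~35B-parameter model (real checkpoints add a few extra GB for embeddings, norms, and buffers):
def weight_memory_gb(num_params: float, bytes_per_param: float) -> float:
    """Approximate weight memory only; ignores activations and KV cache."""
    return num_params * bytes_per_param / 1e9

params = 35e9                         # ~35B parameters
print(weight_memory_gb(params, 2.0))  # FP16 (2 bytes/param): ~70 GB
print(weight_memory_gb(params, 1.0))  # FP8  (1 byte/param):  ~35 GB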
2. Tensor Parallelism is 1
The GB10 has a single GPU die, not a multi-GPU board, so do not set tensor_parallel_size > 1: tensor parallelism splits a model across multiple GPUs and does not apply here.
3. Memory Split (CPU vs GPU)
With unified memory there is a single physical pool; gpu_memory_utilization controls how much of it vLLM reserves for weights and KV cache, and whatever is left remains available for CPU-side work. Lower it if you need more headroom for preprocessing, raise it if you need room for more concurrent KV caches:
# More headroom for CPU-side preprocessing: shrink the reservation
llm = LLM(model="Qwen/Qwen3-35B-A3B", gpu_memory_utilization=0.75)
# More room for concurrent KV caches: grow the reservation
llm = LLM(model="Qwen/Qwen3-35B-A3B", gpu_memory_utilization=0.92)
4. ARM64 PyTorch Compilation
vLLM uses PTX JIT compilation. On ARM64, this can take 10-30 seconds on first inference:
# Pre-compile to avoid first-request lag
python -c "
from vllm import LLM, SamplingParams
llm = LLM(model='Qwen/Qwen3-35B-A3B')
# Run a dummy request to trigger compilation
llm.generate('test', sampling_params=SamplingParams(max_tokens=1))
print('Compilation complete. Subsequent requests will be fast.')
"
5. NUMA Awareness
The GB10 has a single NUMA node (all memory is coherent), but CPU cores are partitioned into performance and efficiency clusters:
# Bind to the single NUMA node (a no-op here, but harmless)
numactl --cpunodebind=0 --membind=0 python serve.py
# Pin vLLM to the performance cores (8-27 in this example):
taskset -c 8-27 python serve.py
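If you start vLLM from Python rather than through a shell wrapper, the same pinning can be done in-process with os.sched_setaffinity (the 8-27 core range is copied from above; check your actual topology with lscpu):
import os

# Restrict this process (and the workers it spawns) to the performance cores
os.sched_setaffinity(0, set(range(8, 28)))

from vllm import LLM
llm = LLM(model="Qwen/Qwen3-35B-A3B")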
Performance Tuning
Concurrency vs Latency Tradeoff
| max_num_seqs | Approx Throughput (tok/s) | Avg Latency (ms/token) | VRAM Used |
|---|---|---|---|
| 1 | ~50-60 | ~15-25 | ~32 GB |
| 4 | ~150-200 | ~20-30 | ~55 GB |
| 8 | ~250-320 | ~25-35 | ~72 GB |
| 16 | ~350-420 | ~35-50 | ~91 GB |
| 32 | ~400-500 | ~55-80 | ~110 GB |
Numbers vary by model, batch size, and input length. These are rough estimates — run your own benchmarks with your actual workload for accurate figures.
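A minimal offline benchmark sketch for reproducing a table like this on your own workload (it reuses the llm and sampling_params objects from the configuration above; the prompt is a placeholder):
import time

def bench(llm, sampling_params, num_seqs: int) -> float:
    """Generated tokens per second for num_seqs identical concurrent requests."""
    prompts = ["Summarize the benefits of unified memory."] * num_seqs
    start = time.perf_counter()
    outputs = llm.generate(prompts, sampling_params)
    elapsed = time.perf_counter() - start
    generated_tokens = sum(len(o.outputs[0].token_ids) for o in outputs)
    return generated_tokens / elapsed

for n in (1, 4, 8, 16, 32):
    print(f"{n:>3} concurrent: {bench(llm, sampling_params, n):.1f} tok/s")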
Context Length vs VRAM
| max_model_len | VRAM per request | Concurrent (max_num_seqs=8) |
|---|---|---|
| 1024 | 4.2 GB | ~34 GB total |
| 2048 | 6.8 GB | ~55 GB total |
| 4096 | 11.5 GB | ~92 GB total |
| 8192 | 19.2 GB | ~154 GB total (OOM!) |
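The per-request cost is driven largely by the KV cache, which you can estimate from the model architecture with the standard formula below. The layer/head/dim values are placeholders; read the real ones from the model's config.json:
def kv_cache_gb(seq_len: int, num_layers: int, num_kv_heads: int,
                head_dim: int, bytes_per_elem: int) -> float:
    """Per-sequence KV cache: 2 (K and V) * layers * kv_heads * head_dim * tokens * bytes."""
    return 2 * num_layers * num_kv_heads * head_dim * seq_len * bytes_per_elem / 1e9

# Placeholder architecture values; use bytes_per_elem=2 for FP16 KV, 1 for FP8 KV
print(kv_cache_gb(seq_len=8192, num_layers=48, num_kv_heads=8, head_dim=128, bytes_per_elem=2))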
Using FP8 KV Cache
The biggest VRAM hog is the KV cache for long contexts. Switching to FP8 cuts it in half:
llm = LLM(
    model="Qwen/Qwen3-35B-A3B",
    quantization="fp8",
    kv_cache_dtype="fp8",        # <-- the key setting
    max_model_len=8192,
    max_num_seqs=8,
    gpu_memory_utilization=0.85,
)
# With FP8 KV: max_model_len=8192 works at 8 concurrent requests
# Without FP8 KV: same config → OOM
Benchmark: vLLM vs Ollama
| Metric | vLLM (FP8) | Ollama (Q4_K_M) | Winner |
|---|---|---|---|
| Single request throughput | Varies by model | Varies by model | Benchmark yours |
| 8 concurrent throughput | High (PagedAttention + continuous batching) | Limited (set OLLAMA_NUM_PARALLEL) | vLLM |
| VRAM efficiency | FP8 weights + FP8 KV cache | Q4_K_M (GGUF) | Depends on model |
| First-token latency | Varies by model and context | Varies by model and context | Benchmark yours |
| Multi-model serving | One model per server instance | Loads and swaps models on demand | Ollama |
| HTTP API | Built-in OpenAI-compatible | Built-in OpenAI-compatible | Draw |
Production Deployment
# systemd service for vLLM
sudo tee /etc/systemd/system/vllm.service >/dev/null <<'EOF'
[Unit]
Description=vLLM LLM Inference Server
After=network.target
[Service]
Type=simple
User=spark
WorkingDirectory=/opt/vllm
Environment=PATH=/home/spark/miniforge3/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin
ExecStart=/home/spark/miniforge3/bin/python /opt/vllm/server.py
Restart=always
RestartSec=5
OOMScoreAdj=-1000
[Install]
WantedBy=multi-user.target
EOF
sudo systemctl daemon-reload
sudo systemctl enable vllm
sudo systemctl start vllm
# Verify
curl http://localhost:8000/v1/models
# Returns: { "object": "list", "data": [ { "id": "Qwen/Qwen3-35B-A3B", ... } ] }
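Once the service is up, any OpenAI-compatible client can talk to it (this assumes server.py exposes the standard OpenAI-style endpoints, which the /v1/models check above implies). A sketch with the openai Python package:
from openai import OpenAI

# Local vLLM server: the API key is ignored by the server but required by the client
client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")

response = client.chat.completions.create(
    model="Qwen/Qwen3-35B-A3B",
    messages=[{"role": "user", "content": "Give me one sentence about the DGX Spark."}],
    max_tokens=128,
)
print(response.choices[0].message.content)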