Best Models to Run on DGX Spark

Ranked by performance, quality, and VRAM efficiency on the GB10 Blackwell's 128 GB unified memory.

How to Read This List

Models are ranked by their overall utility on this specific hardware — not raw quality alone. A slightly worse model that runs 3x faster with 50% less VRAM often wins. The rating considers: inference speed, quality, VRAM usage, and practical usefulness.

VRAM sizes shown are for the smallest viable quantization that doesn't kill quality. All models fit within the 128 GB unified memory.
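As a rough sanity check, you can estimate a quantized model's footprint from its parameter count: Q4_K_M averages roughly 4.5-5 bits per weight. A minimal sketch of the arithmetic (the bits-per-weight figure is an approximation, not an exact property of the format):

```python
def estimated_size_gb(params_billion: float, bits_per_weight: float = 4.8) -> float:
    """Rough in-memory size of a quantized model's weights, ignoring KV cache."""
    return params_billion * 1e9 * bits_per_weight / 8 / 1e9

# 70B at ~4.8 bits/weight lands near the 42 GB figure quoted below.
print(f"{estimated_size_gb(70):.0f} GB")   # ~42 GB
print(f"{estimated_size_gb(8):.0f} GB")    # ~5 GB
```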

The Rankings

#1 Qwen3:30B-A3B

30B total params (3B active per token) · Q4_K_M ≈ 20 GB · ~45-57 tok/s · Context: 128K

Overall rating: ★★★★★

The sweet spot. Near-70B quality with a tiny VRAM footprint. The sparse mixture-of-experts architecture activates only about 3B parameters per token, so you get better-than-expected quality and speed from a model this size. Best all-rounder for general-purpose use.

Install: ollama pull qwen3:30b

#2 Llama 3.3:70B (Instruct)

70B params · Q4_K_M = 42 GB · 28.4 tok/s · Context: 128K

Overall rating: ★★★★☆

Meta's flagship 70B instruct model. Best-in-class English reasoning and coding. Runs in Q4_K_M at 42 GB, leaving 86 GB for KV cache and the system. Still fast enough for interactive use. Best when you need top-tier quality and can accept ~30 tok/s.

Install: ollama pull llama3.3:70b
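If you push toward the 128K context, remember that the KV cache eats into that headroom. A minimal sketch using the ollama Python client (assumes `pip install ollama` and a local ollama server on the default port), raising the context window per request via the `num_ctx` option:

```python
import ollama

# Request a larger context window; the KV cache grows with num_ctx,
# so watch unified-memory usage as you scale this up.
response = ollama.chat(
    model="llama3.3:70b",
    messages=[{"role": "user", "content": "Summarize the attached report."}],
    options={"num_ctx": 32768},
)
print(response["message"]["content"])
```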

#3 Qwen3:32B

32B params · Q4_K_M = 19 GB · 55.1 tok/s · Context: 128K

Overall rating: ★★★★☆

Slightly leaner than the A3B model above, and dense rather than MoE. Nearly identical quality with comparable throughput. If you prefer a dense model with near-top-tier quality, this is your pick. Great for RAG, code completion, and general assistance.

Install: ollama pull qwen3:32b

#4 Mistral Small 3:24B

24B params · Q4_K_M = 15 GB · 61.3 tok/s · Context: 128K

Overall rating: ★★★★☆

From French lab Mistral AI, with excellent multilingual coverage. The fastest of the mid-size models on this list, with strong quality. At 24B parameters it barely breaks 15 GB in Q4_K_M, leaving huge VRAM headroom. Ideal if you need speed plus multilingual support.

Install: ollama pull mistral-small:24b

#5 Gemma 3:27B

27B params · Q4_K_M = 16 GB · 58.7 tok/s · Context: 128K

Overall rating: ★★★★☆

Google's efficient model family. Surprisingly good at instruction following and structured output. Lightweight and fast. Runs with massive VRAM headroom. Best for structured tasks (JSON generation, API calls, data extraction).

Install: ollama pull gemma3:27b
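A quick way to exploit that structured-output strength is ollama's JSON mode. A minimal sketch with the Python client (the prompt and field names are illustrative):

```python
import json
import ollama

# format="json" constrains the output to valid JSON, which pairs well
# with Gemma's strong instruction following for extraction tasks.
response = ollama.chat(
    model="gemma3:27b",
    messages=[{
        "role": "user",
        "content": "Extract {name, email, company} from: "
                   "'Reach out to Jane Doe (jane@example.com) at Acme Corp.' "
                   "Respond in JSON.",
    }],
    format="json",
)
print(json.loads(response["message"]["content"]))
```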

#6 Llama 3.1:8B (Instruct)

8B params · Q4_K_M = 5 GB · 120+ tok/s · Context: 128K

Overall rating: ★★★★☆

Blistering throughput. Quality has improved dramatically over earlier small models; the 8B is genuinely useful for many tasks. Best for API serving, multi-instance deployments, and anywhere speed matters more than raw intelligence.

Install: ollama pull llama3.1:8b
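A sketch of what multi-user serving looks like in practice: fan several requests at the local ollama server from one process (assumes the Python client; ollama's own parallelism settings, e.g. the OLLAMA_NUM_PARALLEL environment variable, govern how many are decoded simultaneously):

```python
from concurrent.futures import ThreadPoolExecutor

import ollama

PROMPTS = [
    "Summarize the plot of Hamlet in one sentence.",
    "Translate 'good morning' into Spanish.",
    "What is 17 * 23?",
    "Name three uses for a Raspberry Pi.",
]

def ask(prompt: str) -> str:
    # Each call is an independent request; a 5 GB model leaves room
    # for many simultaneous KV caches in 128 GB of unified memory.
    return ollama.generate(model="llama3.1:8b", prompt=prompt)["response"]

with ThreadPoolExecutor(max_workers=4) as pool:
    for answer in pool.map(ask, PROMPTS):
        print(answer)
```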

#7 Yi 1.5:34B

34B params · Q4_K_M = 21 GB · 52.8 tok/s · Context: 4K

Overall rating: ★★★☆☆

Strong reasoning model from 01.AI. Surprisingly capable for its age, though newer models have surpassed it on benchmarks. Still excellent for Chinese-English bilingual tasks. Short context window (4K) is the limitation.

Install: ollama pull yi:34b

#8 Phi-3.5 Mini:3.8B

3.8B params · Q4_K_M = 2.5 GB · 180+ tok/s · Context: 128K

Overall rating: ★★★☆☆

Microsoft's tiny but capable model. Fits in 2.5 GB. You can run 50+ concurrent instances. Quality is limited but useful for routing, classification, and simple Q&A. Best used as a fast pre-processor in a pipeline.

Install: ollama pull phi3.5:3.8b
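A sketch of the pre-processor pattern: let Phi-3.5 classify the request, then hand it to a heavier model only when needed (the routing labels and model choices here are illustrative, not a prescribed pipeline):

```python
import ollama

def route(question: str) -> str:
    # Cheap first pass: ask the small model to label the request.
    label = ollama.generate(
        model="phi3.5:3.8b",
        prompt=f"Answer with exactly one word, 'simple' or 'complex':\n{question}",
    )["response"].strip().lower()

    # Escalate only the hard cases to a larger (slower) model.
    model = "llama3.3:70b" if "complex" in label else "llama3.1:8b"
    return ollama.generate(model=model, prompt=question)["response"]

print(route("What's the capital of France?"))
```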

#9 DeepSeek-R1:8B (Distill)

8B params · Q4_K_M = 5 GB · 110 tok/s · Context: 128K

Overall rating: ★★★☆☆

Distilled reasoning model. Excels at math, logic puzzles, and step-by-step reasoning. Slightly better at chain-of-thought than standard Llama 3.1:8B. Fast enough for interactive use.

Install: ollama pull deepseek-r1:8b
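R1-style models emit their chain of thought before the final answer; depending on your ollama version, that reasoning typically arrives wrapped in <think>…</think> tags inside the message content. A sketch that strips the scratchpad if present (the tag convention is how the distills commonly behave, not a guaranteed contract):

```python
import re

import ollama

response = ollama.chat(
    model="deepseek-r1:8b",
    messages=[{"role": "user", "content": "If 3x + 7 = 25, what is x?"}],
)
content = response["message"]["content"]

# Drop the <think>...</think> scratchpad and keep only the final answer.
answer = re.sub(r"<think>.*?</think>", "", content, flags=re.DOTALL).strip()
print(answer)
```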

#10 Qwen2.5:72B

72B params · Q4_K_M = 43 GB · 27.1 tok/s · Context: 128K

Overall rating: ★★★☆☆

One of the largest models that fits. Runs at 43 GB in Q4_K_M — leaves ~85 GB for KV cache and concurrent requests. Quality is excellent but throughput drops to ~27 tok/s. Use when you need maximum intelligence and can wait.

Install: ollama pull qwen2.5:72b

Models That DON'T Fit (and Why)

Model · FP16 Size · Q4_K_M Size · Why It Fails
Falcon 3:104B · 208 GB · ~61 GB · Only fits in Q4_K_M, and the KV cache pushes total memory past 128 GB at any useful context
Llama 3.1:405B · 810 GB · ~243 GB · Physically impossible on 128 GB unified memory
GPT-4o (OpenAI) · N/A · N/A · Proprietary, not self-hosted
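The "headroom for KV cache" mentioned throughout the rankings comes from the fact that the cache grows linearly with context length and layer count, and weight quantization does not shrink it. A rough sketch of the arithmetic, using a Llama-3-70B-like configuration (80 layers, 8 KV heads, head dim 128) purely for illustration:

```python
def kv_cache_gb(n_layers: int, n_kv_heads: int, head_dim: int,
                context_len: int, bytes_per_elem: int = 2) -> float:
    """Approximate KV cache size for one sequence (keys + values, FP16 by default)."""
    return 2 * n_layers * n_kv_heads * head_dim * context_len * bytes_per_elem / 1e9

# Illustrative 70B-class config at the full 128K context: ~43 GB for a single sequence.
print(f"{kv_cache_gb(n_layers=80, n_kv_heads=8, head_dim=128, context_len=131072):.1f} GB")
```

Models without grouped-query attention, or batches of many long sequences, multiply that figure quickly, which is how a model that fits on paper can still run out of memory.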

Model Selection Decision Matrix

Use Case · Best Model · Quantization · Why
General chatbot · Qwen3:30B-A3B · Q4_K_M · Best quality/speed balance
Max intelligence · Llama 3.3:70B · Q4_K_M · Top benchmarks, still fast enough
Multi-user API · Llama 3.1:8B · Q4_K_M · Massive concurrency, low latency
Multilingual · Mistral Small 3:24B · Q4_K_M · Excellent across 20+ languages
Code generation · Qwen3:30B-A3B · Q6_K · Best coding performance in its class
Reasoning/math · DeepSeek-R1:8B · Q4_K_M · Designed for chain-of-thought
Fast preprocessing · Phi-3.5 Mini:3.8B · Q4_K_M · Runs 50+ concurrent at ~2.5 GB each
Structured output · Gemma 3:27B · Q4_K_M · Best JSON/API response quality
💡 Bottom line: For most DGX Spark owners, Qwen3:30B-A3B at Q4_K_M is your default model. It fits in about 20 GB, runs at roughly 45-57 tok/s, and delivers quality competitive with much larger models. Only step up to Llama 3.3:70B when you specifically need its extra quality on the hardest tasks and can accept ~30 tok/s.
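To check the throughput numbers on your own unit, the ollama API reports token counts and timings with each response; eval_count over eval_duration gives decode tokens per second. A sketch with the Python client (field names as exposed by the ollama REST API):

```python
import ollama

response = ollama.generate(
    model="qwen3:30b",
    prompt="Write a haiku about unified memory.",
)

# eval_count = tokens generated, eval_duration = decode time in nanoseconds.
tok_per_s = response["eval_count"] / (response["eval_duration"] / 1e9)
print(f"{tok_per_s:.1f} tok/s")
```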