Best Models for DGX Spark — DGX Spark Projects

How to Read This List

Models are ranked by their overall utility on this specific hardware — not raw quality alone. A slightly worse model that runs 3x faster with 50% less VRAM often wins. The rating considers: inference speed, quality, VRAM usage, and practical usefulness.

VRAM sizes shown are for the smallest viable quantization that doesn't kill quality. All models fit within the 128 GB unified memory.

The Rankings

#1 Qwen3.6:35B-A3B

35B params · Q4_K_M ≈ 20 GB · ~45-57 tok/s · Context: 128K

Overall rating: ★★★★★

The sweet spot. Near-70B quality with tiny VRAM footprint. The active-speckle architecture means many parameters are computed conditionally, giving you better-than-expected performance from fewer weights. Best all-rounder for general-purpose use.

Install: ollama pull qwen3:35b

#2 Llama 3.3:70B (Instruct)

70B params · Q4_K_M = 42 GB · 28.4 tok/s · Context: 128K

Overall rating: ★★★★☆

Meta's latest large model. Best-in-class English reasoning and coding. Runs in Q4_K_M at 42 GB — leaves 86 GB for KV cache and system. Still fast enough for interactive use. Best when you need top-tier quality and can accept ~30 tok/s.

Install: ollama pull llama3.3:70b

#3 Qwen3:32B

32B params · Q4_K_M = 19 GB · 55.1 tok/s · Context: 128K

Overall rating: ★★★★☆

Slightly leaner than the 35B variant. Nearly identical quality with marginally faster inference. If you need maximum throughput with near-top-tier quality, this is your pick. Great for RAG, code completion, and general assistance.

Install: ollama pull qwen3:32b

#4 Mistral Large 3:24B

24B params · Q4_K_M = 15 GB · 61.3 tok/s · Context: 128K

Overall rating: ★★★★☆

French-native but excellent multilingual. Fastest model on the list with strong quality. The 24B parameter count means it barely breaks 15 GB in Q4_K_M — huge VRAM headroom. Ideal if you need speed + multilingual support.

Install: ollama pull mistral-large:24b

#5 Gemma 3:27B

27B params · Q4_K_M = 16 GB · 58.7 tok/s · Context: 128K

Overall rating: ★★★★☆

Google's efficient model family. Surprisingly good at instruction following and structured output. Lightweight and fast. Runs with massive VRAM headroom. Best for structured tasks (JSON generation, API calls, data extraction).

Install: ollama pull gemma3:27b

#6 Llama 3.1:8B (Instruct)

8B params · Q4_K_M = 5 GB · 120+ tok/s · Context: 128K

Overall rating: ★★★★☆

Lightning fast. Blistering throughput. Quality has improved dramatically — the 8B model is genuinely useful for many tasks. Best for: API serving, multi-instance deployments, and when speed matters more than raw intelligence.

Install: ollama pull llama3.1:8b

#7 Yi:1.5:34B

34B params · Q4_K_M = 21 GB · 52.8 tok/s · Context: 4K

Overall rating: ★★★☆☆

Strong reasoning model from 01.AI. Surprisingly capable for its age, though newer models have surpassed it on benchmarks. Still excellent for Chinese-English bilingual tasks. Short context window (4K) is the limitation.

Install: ollama pull yi:34b

#8 Phi-3.5:mini:3.8B

3.8B params · Q4_K_M = 2.5 GB · 180+ tok/s · Context: 128K

Overall rating: ★★★☆☆

Microsoft's tiny but capable model. Fits in 2.5 GB. You can run 50+ concurrent instances. Quality is limited but useful for routing, classification, and simple Q&A. Best used as a fast pre-processor in a pipeline.

Install: ollama pull phi3.5:3.8b

#9 DeepSeek-R1:8B (Distill)

8B params · Q4_K_M = 5 GB · 110 tok/s · Context: 128K

Overall rating: ★★★☆☆

Distilled reasoning model. Excels at math, logic puzzles, and step-by-step reasoning. Slightly better at chain-of-thought than standard Llama 3.1:8B. Fast enough for interactive use.

Install: ollama pull deepseek-r1:8b

#10 Qwen2.5:72B

72B params · Q4_K_M = 43 GB · 27.1 tok/s · Context: 128K

Overall rating: ★★★☆☆

One of the largest models that fits. Runs at 43 GB in Q4_K_M — leaves ~85 GB for KV cache and concurrent requests. Quality is excellent but throughput drops to ~27 tok/s. Use when you need maximum intelligence and can wait.

Install: ollama pull qwen2.5:72b

Models That DON'T Fit (and Why)

Model	FP16 Size	Q4_K_M Size	Why It Fails
Falcon 3:104B	208 GB	~61 GB	Only fits with Q4_K_M, but VRAM for KV cache pushes past 128 GB at any useful context
Llama 3.1:405B	810 GB	~243 GB	Physically impossible on 128 GB unified memory
GPT-4o (OpenAI)	N/A	N/A	Proprietary, not self-hosted

Model Selection Decision Matrix

Use Case	Best Model	Quantization	Why
General chatbot	Qwen3.6:35B	Q4_K_M	Best quality/speed balance
Max intelligence	Llama 3.3:70B	Q4_K_M	Top benchmarks, still fast enough
Multi-user API	Llama 3.1:8B	Q4_K_M	Massive concurrency, low latency
Multilingual	Mistral Large 3:24B	Q4_K_M	Excellent across 20+ languages
Code generation	Qwen3.6:35B	Q6_K	Best coding performance in its class
Reasoning/math	DeepSeek-R1:8B	Q4_K_M	Designed for chain-of-thought
Fast preprocessing	Phi-3.5:mini:3.8B	Q4_K_M	Runs 50+ concurrent, sub-2 GB
Structured output	Gemma 3:27B	Q4_K_M	Best JSON/API response quality

💡 Bottom line: For most DGX Spark owners, Qwen3.6:35B at Q4_K_M is your default model. It fits in 20 GB, runs at 54+ tok/s, and delivers quality competitive with much larger models. Only go to Llama 3.3:70B when you specifically need its edge-case quality improvement and can accept ~30 tok/s.