Qwen3.6:35B on DGX Spark

Full setup guide + performance benchmarks for Qwen3.6:35B on the GB10 Blackwell. What to expect, what to tune, and how to get the most out of your hardware.

Model Overview

Qwen3.6:35B is Qwen's 35-billion parameter model — excellent reasoning, coding, and multilingual capabilities at a parameter count that fits in the DGX Spark's 128 GB unified memory even without quantization. The GB10's coherent unified memory means both the model weights and the KV cache live in the same pool, accessible by both CPU and GPU cores at full bandwidth.
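
Before pulling anything, it's worth confirming how much of that pool is actually free. Because the memory is unified, free reports the same pool the GPU draws from:

# How much of the 128 GB pool is currently free
free -h

# Confirm the Blackwell GPU is visible to the driver
nvidia-smi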

| Property | Value |
| --- | --- |
| Parameters | 35 billion |
| Architecture | Transformer, GQA (Grouped Query Attention) |
| Context Window | 131,072 tokens (128K) |
| FP16 Weight Size | ~70 GB (fits in unified memory with ~58 GB headroom) |
| Q4_K_M Weight Size | ~20 GB |
| Q8_0 Weight Size | ~37 GB |
| Quantized Format | GGUF (via llama.cpp / Ollama) |
⚠️ Important: Performance numbers below are rough estimates. Actual throughput depends on your model, quantization, batch size, context length, and system load. Always run your own benchmarks with your actual workload.

Setup — Using Ollama

The easiest way to run Qwen3.6:35B on the Spark is via Ollama:

# Pull the model (ARM64 optimized binary)
ollama pull qwen3.6:35b

# Run it
ollama run qwen3.6:35b

# Verify GPU is being used:
ollama ps
# Should show the model running on GPU (Blackwell) not just CPU
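
Ollama also exposes a local HTTP API on port 11434, which is convenient for scripting against the model once it's running:

# Send a single prompt to the local Ollama API (non-streaming)
curl http://localhost:11434/api/generate -d '{
  "model": "qwen3.6:35b",
  "prompt": "Explain unified memory in one paragraph.",
  "stream": false
}'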

Setup — Using vLLM (Higher Throughput)

For serving multiple concurrent requests, vLLM's continuous batching gives much better aggregate throughput than Ollama's defaults:

# Install vLLM for ARM64
pip install vllm

# Run with FP8 quantization
vllm serve Qwen/Qwen3.6-35B-A3B \
    --quantization fp8 \
    --tensor-parallel-size 1 \
    --max-model-len 8192 \
    --gpu-memory-utilization 0.85

# Test it
curl http://localhost:8000/v1/completions \
    -H "Content-Type: application/json" \
    -d '{
        "model": "Qwen/Qwen3.6-35B-A3B",
        "prompt": "Explain quantum computing in simple terms.",
        "max_tokens": 256,
        "temperature": 0.7
    }'
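
The vLLM server is OpenAI-compatible, so the chat endpoint and streaming work the same way as against the OpenAI API. For example:

# Chat-style request with streaming enabled
curl http://localhost:8000/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{
        "model": "Qwen/Qwen3.6-35B-A3B",
        "messages": [{"role": "user", "content": "Give me three uses for 128 GB of unified memory."}],
        "max_tokens": 128,
        "stream": true
    }'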

Performance Benchmarks

All numbers below are rough ranges, not exact measurements. Your results will vary based on model, quantization, context length, batch size, and system load.

Throughput by Quantization

Lower-precision quantization means higher throughput and a smaller memory footprint, at a small cost in quality. The GB10's 128 GB unified memory means you have plenty of room to experiment:

| Quantization | Approx Weight Size | Approx Throughput | VRAM Needed | Quality |
| --- | --- | --- | --- | --- |
| FP16 (native) | ~70 GB | 30-40 tok/s | ~70 GB | Best |
| Q8_0 | ~37 GB | 40-50 tok/s | ~40 GB | Near-native |
| Q6_K | ~28 GB | 45-55 tok/s | ~31 GB | Very good |
| Q4_K_M | ~20 GB | 45-57 tok/s | ~32 GB | Good |
| Q4_0 | ~18 GB | 45-58 tok/s | ~26 GB | Good |
| Q3_K_M | ~15 GB | 45-60 tok/s | ~23 GB | Fair |
💡 Recommendation: Q4_K_M is the sweet spot for most use cases — near-FP16 quality with significantly less VRAM and good throughput. Q6_K narrows the gap to FP16 quality while keeping most of the throughput. Use FP16 only if quality is paramount and VRAM isn't needed for other tasks.
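
If you want to compare quantizations yourself, Ollama usually publishes them as separate tags of the same model. The tag names below are illustrative only; check the model's page in the Ollama library for the real ones:

# Pull specific quantizations as separate tags (tag names are illustrative)
ollama pull qwen3.6:35b-q8_0
ollama pull qwen3.6:35b-q4_K_M

# See what is downloaded and how large each variant is on disk
ollama list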

Context Length Impact

Longer context = more KV cache memory = slightly lower throughput:

| Context Length | Impact on Throughput | Approx VRAM for KV Cache |
| --- | --- | --- |
| Short (<2K tokens) | Baseline (fastest) | ~2-4 GB |
| Medium (4-8K tokens) | Minimal impact (<5%) | ~4-8 GB |
| Long (16-32K tokens) | ~5-10% slower | ~10-18 GB |
| Very long (64K+) | ~10-20% slower | ~20-45 GB |
| Max (128K) | ~20-30% slower | ~45-55 GB |
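
You only pay the KV-cache cost for the context you actually configure. With vLLM that's the --max-model-len flag shown earlier; with Ollama you can bake a context size into a derived model via a Modelfile, roughly like this:

# Create a 32K-context variant of the model (Modelfile syntax)
cat > Modelfile <<'EOF'
FROM qwen3.6:35b
PARAMETER num_ctx 32768
EOF

ollama create qwen3.6-32k -f Modelfile
ollama run qwen3.6-32k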

First-Token Latency

Time from submitting a query to the first token of the response. Affected by whether the model is already loaded and by your system state:

| Condition | Approximate Latency |
| --- | --- |
| Warm model (already loaded in memory) | ~150-250 ms |
| Cold load (model loads from NVMe) | ~2-5 s |
| First request after system boot | ~5-10 s |
💡 Tip: Pre-load the model on startup to avoid cold-load latency on your first request by running a short warm-up prompt in the background: ollama run qwen3.6:35b "READY" > /dev/null &
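
If you want the warm-up to happen automatically at boot, one option is a oneshot systemd unit that runs after the Ollama service. This is a sketch only; the unit name, binary path, and sleep delay are assumptions to adapt to your install:

# /etc/systemd/system/qwen-preload.service  (name and paths are illustrative)
[Unit]
Description=Warm up Qwen3.6:35B after the Ollama server starts
After=ollama.service
Requires=ollama.service

[Service]
Type=oneshot
# Give the Ollama API a moment to start accepting requests
ExecStartPre=/bin/sleep 10
ExecStart=/usr/local/bin/ollama run qwen3.6:35b "READY"

[Install]
WantedBy=multi-user.target

# Enable it:
#   sudo systemctl daemon-reload && sudo systemctl enable --now qwen-preload.service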

Multi-Request Performance

With vLLM, concurrent requests increase total throughput but raise per-request latency:

| Concurrent Requests | Total Throughput | Approx Per-Request Throughput | VRAM Impact |
| --- | --- | --- | --- |
| 1 | ~45-57 tok/s | ~45-57 tok/s | Baseline |
| 4-8 | ~150-300 tok/s | ~20-35 tok/s each | ~+20-40 GB |
| 16-32 | ~250-450 tok/s | ~8-15 tok/s each | ~+40-70 GB |

At high concurrency (32+ concurrent requests), per-request throughput drops significantly but total throughput increases. This is the tradeoff: more requests served per second, but each request takes longer individually.
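
To see where your own workload sits on this curve, a crude check is to fire a batch of identical requests in parallel and time the batch. This is a rough sketch, not a proper load test; the prompt and request count are arbitrary:

# Send 8 identical completions to the vLLM server in parallel and time the batch
time ( for _ in $(seq 8); do
    curl -s http://localhost:8000/v1/completions \
        -H "Content-Type: application/json" \
        -d '{"model": "Qwen/Qwen3.6-35B-A3B", "prompt": "Count to twenty.", "max_tokens": 128}' \
        -o /dev/null &
done; wait )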

Quality Assessment

Beyond raw speed, Qwen3.6:35B is genuinely competitive with larger models in many categories. The quality difference between Q4_K_M and FP16 is typically about 2-3 percentage points across most benchmarks — small for the VRAM savings.

Based on results reported for the Qwen3.6 family at similar parameter counts (your actual results may vary):

| Category | Qwen3.6:35B (Q4_K_M) | Qwen3.6:35B (FP16) |
| --- | --- | --- |
| Coding | ~87-90% | ~90-92% |
| Knowledge | ~77-80% | ~80-83% |
| Math | ~85-88% | ~88-91% |
| Multilingual | Strong | Strong |

The quantization cost is small. For most real-world use cases (chat, code assistance, document analysis), you won't notice the difference between Q4_K_M and FP16.

Tips and Optimizations

  1. Use GGUF format with llama.cpp backend for the best ARM64 performance on this hardware.
  2. Pin the server to the high-performance CPU cores if your system supports core pinning (adjust the core range for your machine): taskset -c 8-27 ollama serve
  3. Set the OOM score so the inference server is protected from being killed under memory pressure (requires root, and assumes a single ollama process): echo -1000 | sudo tee /proc/$(pgrep -x ollama | head -1)/oom_score_adj
  4. Pre-load the model on startup so cold-latency doesn't hit your first request.
  5. Use NVMe storage for model weights. Load times are significantly faster than on SATA.
  6. For batch serving, set --gpu-memory-utilization 0.9 in vLLM to pack more concurrent requests.
  7. Monitor memory: the GB10's 128 GB is generous but not infinite. Keep an eye on total memory usage: model + KV cache + system + other services (a quick way to watch this is shown after this list).
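
A simple way to do the monitoring in item 7 is to watch system and GPU memory together. Note that on a unified-memory part the nvidia-smi figures may not map cleanly to a separate VRAM pool, so treat free as the primary number:

# Refresh overall memory and GPU memory every 2 seconds
watch -n 2 'free -h; echo; nvidia-smi --query-gpu=memory.used,memory.total --format=csv'
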
💡 Bottom line: Qwen3.6:35B at Q4_K_M on the DGX Spark is the sweet spot for most use cases: fast enough for interactive chat, light enough on memory to leave room for other tasks, and quality within 2-3% of FP16. For a production inference server, this is your default model choice.