Model Overview
Qwen3.6:35B is Qwen's 35-billion-parameter model, offering excellent reasoning, coding, and multilingual capabilities at a size that fits in the DGX Spark's 128 GB of unified memory even without quantization. The GB10's coherent unified memory means the model weights and the KV cache live in the same pool, accessible to both CPU and GPU cores at full bandwidth.
| Property | Value |
|---|---|
| Parameters | 35 billion |
| Architecture | Transformer, GQA (Grouped Query Attention) |
| Context Window | 131,072 tokens (128K) |
| FP16 Weight Size | ~70 GB (fits in unified memory with ~58 GB headroom) |
| Q4_K_M Weight Size | ~20 GB |
| Q8_0 Weight Size | ~37 GB |
| Quantized Format | GGUF (via llama.cpp / Ollama) |
Setup — Using Ollama
The easiest way to run Qwen3.6:35B on the Spark is via Ollama:
```bash
# Pull the model (ARM64-optimized binary)
ollama pull qwen3.6:35b

# Run it interactively
ollama run qwen3.6:35b

# Verify the GPU is being used; the model should show as running
# on the GPU (Blackwell), not just the CPU
ollama ps
```
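Ollama also exposes a local REST API (port 11434 by default), which is handy for scripting. A minimal sketch; the prompt and the `num_ctx` option are illustrative:

```bash
# Non-streaming completion via Ollama's local REST API
curl http://localhost:11434/api/generate -d '{
  "model": "qwen3.6:35b",
  "prompt": "Explain quantum computing in simple terms.",
  "stream": false,
  "options": { "num_ctx": 8192 }
}'
```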
Setup — Using vLLM (Higher Throughput)
For serving many concurrent requests, vLLM delivers higher throughput than Ollama's defaults:
```bash
# Install vLLM for ARM64
pip install vllm

# Serve the model with FP8 quantization
vllm serve Qwen/Qwen3.6-35B-A3B \
  --quantization fp8 \
  --tensor-parallel-size 1 \
  --max-model-len 8192 \
  --gpu-memory-utilization 0.85
```

Once the server is up, test it from another terminal:

```bash
curl http://localhost:8000/v1/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "Qwen/Qwen3.6-35B-A3B",
    "prompt": "Explain quantum computing in simple terms.",
    "max_tokens": 256,
    "temperature": 0.7
  }'
```
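vLLM also serves the OpenAI-compatible chat endpoint, so existing OpenAI client code can point at it unchanged. The same assumed model name as above:

```bash
# Chat-style request against vLLM's OpenAI-compatible endpoint
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "Qwen/Qwen3.6-35B-A3B",
    "messages": [{"role": "user", "content": "Explain quantum computing in simple terms."}],
    "max_tokens": 256
  }'
```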
Performance Benchmarks
All numbers below are rough ranges, not exact measurements. Your results will vary based on model, quantization, context length, batch size, and system load.
Throughput by Quantization
Lower-precision quantization generally buys higher throughput at slightly lower quality, and the GB10's 128 GB of unified memory leaves plenty of room to experiment:
| Quantization | Approx Weight Size | Approx Throughput | VRAM Needed | Quality |
|---|---|---|---|---|
| FP16 (native) | ~70 GB | 30-40 tok/s | ~70 GB | Best |
| Q8_0 | ~37 GB | 40-50 tok/s | ~40 GB | Near-native |
| Q6_K | ~28 GB | 45-55 tok/s | ~31 GB | Very good |
| Q4_K_M | ~20 GB | 45-57 tok/s | ~28 GB | Good |
| Q4_0 | ~18 GB | 45-58 tok/s | ~26 GB | Good |
| Q3_K_M | ~15 GB | 45-60 tok/s | ~23 GB | Fair |
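Ollama usually publishes several quantizations of a model under separate tags, so switching quants is just another pull. The tag names below follow Ollama's common convention but are assumptions here; check the model's tag list before pulling:

```bash
# Pull alternative quantizations to compare (tag names assumed)
ollama pull qwen3.6:35b-q8_0
ollama pull qwen3.6:35b-q4_K_M
```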
Context Length Impact
Longer context means a larger KV cache and slightly lower throughput:
| Context Length | Impact on Throughput | Approx VRAM for KV Cache |
|---|---|---|
| Short (<2K tokens) | Baseline (fastest) | ~2-4 GB |
| Medium (4-8K tokens) | Minimal impact (<5%) | ~4-8 GB |
| Long (16-32K tokens) | ~5-10% slower | ~10-18 GB |
| Very long (64K+) | ~10-20% slower | ~20-45 GB |
| Max (128K) | ~20-30% slower | ~45-55 GB |
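The KV-cache column follows the standard sizing rule for a GQA transformer: two tensors (K and V) per layer, each `n_kv_heads × d_head` elements per token. The architecture numbers in this sketch are hypothetical placeholders, since the model's exact layer and head counts aren't listed above:

```bash
# Back-of-envelope KV-cache size:
#   2 (K and V) x layers x KV heads x head dim x bytes/elem x context tokens
# ALL architecture values below are placeholders -- substitute real ones.
LAYERS=64 KV_HEADS=8 HEAD_DIM=128 BYTES=2 CTX=131072
echo "$(( 2 * LAYERS * KV_HEADS * HEAD_DIM * BYTES * CTX / 2**30 )) GiB"
```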
First-Token Latency
Time from sending a request to receiving the first generated token. It depends on whether the model is already loaded and on overall system state:
| Condition | Approximate Latency |
|---|---|
| Warm model (already loaded in memory) | ~150-250 ms |
| Cold load (model loads from NVMe) | ~2-5 s |
| First request after system boot | ~5-10 s |
To avoid the cold-load penalty on the first real request, warm the model in the background right after boot:

```bash
# Load the model into memory ahead of time; the output is discarded
ollama run qwen3.6:35b "READY" > /dev/null &
```
Multi-Request Performance
With vLLM, concurrent requests increase total throughput but raise per-request latency:
| Concurrent Requests | Total Throughput | Approx Per-Request Throughput | VRAM Impact |
|---|---|---|---|
| 1 | ~45-57 tok/s | ~45-57 tok/s | Baseline |
| 4-8 | ~150-300 tok/s | ~20-35 tok/s each | ~+20-40 GB |
| 16-32 | ~250-450 tok/s | ~8-15 tok/s each | ~+40-70 GB |
At high concurrency (32+ simultaneous requests), the tradeoff is explicit: aggregate throughput keeps climbing, but per-request throughput drops sharply, so each individual request takes noticeably longer.
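To reproduce the concurrency behavior on your own box, fire a batch of simultaneous requests at the vLLM endpoint and time the whole batch. A minimal sketch, reusing the model name assumed above; aggregate throughput is then roughly N × max_tokens divided by the wall-clock time:

```bash
# Fire N concurrent completions at vLLM and time the batch as a whole
N=8
time (
  for i in $(seq "$N"); do
    curl -s http://localhost:8000/v1/completions \
      -H "Content-Type: application/json" \
      -d '{"model": "Qwen/Qwen3.6-35B-A3B", "prompt": "Count to twenty.", "max_tokens": 128}' \
      > /dev/null &
  done
  wait
)
```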
Quality Assessment
Beyond raw speed, Qwen3.6:35B is genuinely competitive with larger models in many categories. The quality difference between Q4_K_M and FP16 is typically about 2-3 percentage points across most benchmarks — small for the VRAM savings.
Based on reported results for the Qwen3.6 family at similar parameter counts (your actual results may vary):
| Category | Qwen3.6:35B (Q4_K_M) | Qwen3.6:35B (FP16) |
|---|---|---|
| Coding | ~87-90% | ~90-92% |
| Knowledge | ~77-80% | ~80-83% |
| Math | ~85-88% | ~88-91% |
| Multilingual | Strong | Strong |
The quantization cost is small. For most real-world use cases (chat, code assistance, document analysis), you won't notice the difference between Q4_K_M and FP16.
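If you want to sanity-check the quantization gap yourself, the quickest test is the same prompt against two quantizations side by side. The tags below are assumptions following Ollama's usual naming, so match them to whatever you actually pulled:

```bash
# Same prompt against two quantizations; compare the outputs by eye.
# Tag names are assumed -- verify them with `ollama list` first.
PROMPT="Write a Python function that merges two sorted lists."
ollama run qwen3.6:35b-q4_K_M "$PROMPT" > out_q4.txt
ollama run qwen3.6:35b-fp16 "$PROMPT" > out_fp16.txt
diff out_q4.txt out_fp16.txt
```

Expect the texts to differ superficially even when quality is equivalent; judge correctness and reasoning, not word-for-word agreement.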
Tips and Optimizations
- Use GGUF format with the llama.cpp backend for the best ARM64 performance on this hardware.
- Pin the serving process to the high-performance CPU cores if your system supports core pinning: `taskset -c 8-27 ollama serve`
- Set the OOM score adjustment so the inference process is protected from being killed under memory pressure: `echo -1000 > /proc/$(pgrep ollama)/oom_score_adj`
- Pre-load the model on startup (see the warm-up command under First-Token Latency) so cold-load latency doesn't hit your first request.
- Use NVMe storage for model weights; load times are significantly faster than on SATA.
- For batch serving, set `--gpu-memory-utilization 0.9` in vLLM to pack in more concurrent requests.
- Monitor memory: the GB10's 128 GB is generous but not infinite. Keep an eye on total usage (model + KV cache + system + other services); see the monitoring sketch after this list.
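A simple way to keep an eye on the shared pool while serving, assuming the standard NVIDIA tooling is installed:

```bash
# Refresh system and GPU memory usage every 2 seconds; on the GB10 the
# model weights, KV cache, and everything else share one 128 GB pool
watch -n 2 'free -h; echo; nvidia-smi --query-gpu=memory.used,memory.total --format=csv'
```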