Overview
The GB10 Grace Blackwell superchip is powerful but relatively new. The software stack (Linux kernel, ARM64 CUDA compatibility, NVIDIA drivers, container runtime) is still maturing. Here's what you need to know to keep your Spark running smoothly over months and years of use.
Software Updates
Kernel Updates — Proceed with Caution
The Spark runs on a custom Linux kernel optimized for the ARM64 + unified memory architecture. Kernel updates occasionally break GPU passthrough or thermal management, so treat them with care:
# Check current kernel version
uname -r
# Example output: 6.11.0-nvidia-grace-2024.11
# Before updating, check NVIDIA's compatibility notes
# at docs.nvidia.com — search for your kernel version
# Pin a known-good kernel to avoid surprises
sudo apt-mark hold linux-image-generic linux-headers-generic
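# When you DO update, release the hold first, then re-pin afterwards:
sudo apt-mark unhold linux-image-generic linux-headers-generic
sudo apt update
sudo apt install --only-upgrade linux-image-generic linux-headers-generic
sudo apt-mark hold linux-image-generic linux-headers-generic
# Reboot and verify GPU is recognized
sudo reboot
nvidia-smi
# Re-apply the performance CPU governor if the update reset it
sudo cpupower frequency-set -g performance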
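One way to make the post-reboot check repeatable is to record the kernel release before upgrading and compare it afterwards. The following is a minimal sketch, not NVIDIA tooling: the state file path /var/tmp/known-good-kernel and the function names record_kernel and compare_kernel are hypothetical.

```bash
# Hypothetical location for the recorded kernel release (an assumption, not a standard path).
STATE_FILE="${STATE_FILE:-/var/tmp/known-good-kernel}"

# Save the release string of the kernel you currently trust.
record_kernel() {
    printf '%s\n' "$1" > "$STATE_FILE"
}

# Print "unchanged" or "changed" depending on whether the given release
# matches the recorded one; print "unknown" if nothing was recorded.
compare_kernel() {
    if [ ! -f "$STATE_FILE" ]; then
        echo "unknown"
    elif [ "$(cat "$STATE_FILE")" = "$1" ]; then
        echo "unchanged"
    else
        echo "changed"
    fi
}

# Example usage:
#   record_kernel "$(uname -r)"     # before upgrading
#   compare_kernel "$(uname -r)"    # after rebooting
```

If the result is "changed", that is the moment to run nvidia-smi and confirm the GPU still appears before resuming normal workloads.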
NVIDIA Driver Updates
# Check current driver version
nvidia-smi | head -5
# Update to latest
sudo apt install nvidia-driver-550
sudo apt install nvidia-container-toolkit
# Verify
nvidia-smi
# Should show:
# Driver Version: 550.x.x | CUDA Version: 12.x
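Driver cadence: NVIDIA ships driver updates roughly monthly for the Grace/Blackwell platforms. Containers share the host's kernel driver, so a new driver cannot be trialled inside Docker alone. Note your current working version before updating so you can roll back, then confirm that your GPU containers still start.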
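To track driver versions in scripts, it helps to read the version in a machine-friendly form. nvidia-smi can print just the version string with --query-gpu=driver_version --format=csv,noheader. The sketch below builds on that; the helper names driver_branch and on_expected_branch are hypothetical, and the expected branch is something you supply yourself.

```bash
# Extract the major branch (e.g. 550) from a driver version string such as 550.54.15.
driver_branch() {
    printf '%s\n' "${1%%.*}"
}

# Succeed if the installed version is on the expected branch.
# Both arguments come from the caller; nothing here looks up the "latest" release.
on_expected_branch() {
    [ "$(driver_branch "$1")" = "$2" ]
}

# Example usage on a machine with the driver loaded:
#   installed=$(nvidia-smi --query-gpu=driver_version --format=csv,noheader | head -n 1)
#   on_expected_branch "$installed" 550 || echo "Driver branch changed: $installed"
```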
Ollama and Model Updates
Ollama on ARM64 is updated frequently. Check for updates monthly:
ollama --version
# Update:
curl -fsSL https://ollama.com/install.sh | sh
Thermal Monitoring
The GB10 superchip has multiple thermal zones. Monitor them all:
# Install NVIDIA monitoring tools
sudo apt install nvidia-diag-hook libnvidia-ml1
# Check all thermal zones
nvidia-smi -q -d TEMPERATURE
# Or poll nvidia-smi for real-time readings
watch -n 2 'nvidia-smi --query-gpu=temperature.gpu,power.draw,clocks.gr,clocks.mem,utilization.gpu --format=csv,noheader'
# Check CPU temps (requires the lm-sensors package)
sensors | grep -iE "temp|core"
# Check NVMe storage health (requires smartmontools and root)
sudo smartctl -a /dev/nvme0n1 | grep -i temperature
# Check GPU power draw
nvidia-smi --query-gpu=power.draw --format=csv -i 0
Temperature thresholds:
| Component | Normal | Warning | Throttle | Critical |
|---|---|---|---|---|
| GB10 GPU Junction | 40-65°C | 75°C | 85°C | 95°C |
| GB10 CPU Cores | 35-60°C | 70°C | 80°C | 90°C |
| Memory (LPDDR5x) | 30-55°C | 65°C | 75°C | 85°C |
| Board (PCB) Avg | 35-55°C | 65°C | 75°C | 85°C |
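The GPU row of the table can be turned into a small classifier for use in monitoring scripts. This is a sketch under one assumption: readings that fall between the "Normal" range and the "Warning" threshold are treated as normal. The function name classify_gpu_temp is hypothetical.

```bash
# Classify a GPU junction temperature (whole degrees C) using the GPU row of the table.
# Anything below the 75 C warning threshold is reported as "normal".
classify_gpu_temp() {
    temp=$1
    if [ "$temp" -ge 95 ]; then
        echo "critical"
    elif [ "$temp" -ge 85 ]; then
        echo "throttle"
    elif [ "$temp" -ge 75 ]; then
        echo "warning"
    else
        echo "normal"
    fi
}

# Example usage on the Spark:
#   t=$(nvidia-smi --query-gpu=temperature.gpu --format=csv,noheader | head -n 1)
#   echo "GPU at ${t}C: $(classify_gpu_temp "$t")"
```

The same pattern works for the CPU, memory, and board rows by swapping in their thresholds.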
Known Pain Points
1. ARM64 Compatibility Gaps
Despite being the primary architecture the Spark runs on, ARM64 software support has gaps:
- PyTorch: Full ARM64 CUDA support arrived in 2.1+. Older versions silently fail. Always verify with torch.cuda.is_available().
- Docker images: Many GPU images only publish AMD64. Use the --platform linux/arm64 flag or search for explicit ARM64 builds.
- pip packages: Some popular packages (especially those with C extensions) don't publish ARM64 wheels. You'll need to compile from source.
- VS Code remote: Works but extensions may be missing ARM64 builds. Stick to the core extensions.
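For the Docker case, a script can derive the platform string from the machine architecture instead of hard-coding it. A minimal sketch follows; docker_platform is a hypothetical helper, and it only knows the two architectures discussed here.

```bash
# Map a machine string from `uname -m` to the Docker platform name.
# Only ARM64 and AMD64 are handled; anything else yields "unknown".
docker_platform() {
    case "$1" in
        aarch64|arm64) echo "linux/arm64" ;;
        x86_64|amd64)  echo "linux/amd64" ;;
        *)             echo "unknown" ;;
    esac
}

# Example usage (the image name is a placeholder):
#   docker pull --platform "$(docker_platform "$(uname -m)")" your-image:tag
```

If the pull fails with a manifest error, the image most likely has no ARM64 build.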
2. Unified Memory OOM Crashes
With 128 GB shared between CPU and GPU, it's easy to exhaust memory without realizing it:
# Monitor shared memory usage
free -h
# Look at "total" and "available" — not just "free"
# Check GPU-side allocation
nvidia-smi --query-gpu=memory.used,memory.total --format=csv
# Set memory pressure monitoring
sudo tee /usr/local/bin/memory-monitor.sh > /dev/null <<'EOF'
#!/bin/bash
# Alert when MemAvailable drops below 4 GB. Replace discord-webhook-url with your webhook's full URL.
AVAIL=$(awk '/MemAvailable/ {print int($2/1024)}' /proc/meminfo)
if [ "$AVAIL" -lt 4096 ]; then
  curl -fsS -H "Content-Type: application/json" \
    -d "{\"content\": \"⚠️ Low memory: ${AVAIL}MB available on $(hostname)\"}" \
    "discord-webhook-url"
fi
EOF
sudo chmod +x /usr/local/bin/memory-monitor.sh
# Add to crontab (crontab -e) - runs every 5 minutes
*/5 * * * * /usr/local/bin/memory-monitor.sh
3. Network Adapter Maintenance
The DGX Spark ships with Mellanox 200GbE networking. Keep firmware updated and check link status periodically:
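# Check that the network adapter is detected
lspci | grep -iE "mellanox|connectx"
# Check link status (replace eth0 with your interface name; list them with: ip -br link)
ethtool eth0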
# Check for driver updates
# See NVIDIA's official docs for the latest driver releases
# https://docs.nvidia.com/
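For periodic link checks, the two ethtool fields that matter most are "Speed" and "Link detected". The sketch below condenses them into one line; link_summary is a hypothetical helper, and eth0 may not be the interface name on your system.

```bash
# Read `ethtool <iface>` output on stdin and print "up <speed>" or "down".
link_summary() {
    awk -F': ' '
        /Speed:/         { speed = $2 }
        /Link detected:/ { link = $2 }
        END {
            if (link == "yes") print "up " speed
            else               print "down"
        }
    '
}

# Example usage:
#   ethtool eth0 | link_summary
```

A link that reports "up" at a lower speed than expected usually points to a cable or negotiation problem rather than a driver fault.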
4. LPDDR5x Memory Scrubbing
The GB10's LPDDR5x memory uses hardware error correction and periodically runs scrub cycles. The system does this automatically, but during intensive workloads you may see brief stuttering while the memory controller runs a scrub pass:
- Normal: occasional 10-50ms pauses in inference (unnoticeable during generation)
- Concerning: persistent stuttering during idle → check memory health with nmemctl --health
Routine Maintenance Schedule
| Frequency | Task | Command / Method |
|---|---|---|
| Weekly | Check temps and power | nvidia-smi + PDU monitoring |
| Weekly | Check disk space | df -h — model files grow fast |
| Monthly | Update Ollama + drivers | Re-run the Ollama install script + nvidia-smi version check |
| Monthly | Clear Ollama cache | ollama list, then ollama rm <model> to prune unused models |
| Quarterly | Kernel update review | Check NVIDIA docs for kernel compatibility |
| Quarterly | Check for dust buildup around vents | Compressed air around chassis vents |
| Annually | Verify thermal performance | Check temps haven't drifted upward |
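The weekly disk check is easy to automate. The sketch below reads the POSIX-format output of df -P and reports how much space is still free; the helper names free_percent and disk_ok are hypothetical, and the 20% threshold is the one used in the summary checklist.

```bash
# Read `df -P <path>` output on stdin and print the percentage of space still free.
free_percent() {
    awk 'NR == 2 { sub(/%/, "", $5); print 100 - $5 }'
}

# Succeed if more than 20% of the disk is free.
disk_ok() {
    [ "$1" -gt 20 ]
}

# Example usage:
#   pct=$(df -P / | free_percent)
#   disk_ok "$pct" || echo "Only ${pct}% free on /"
```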
Backup Strategy
The Spark's 128 GB of RAM is volatile. Your important data lives on NVMe storage. Back up:
- System configuration: tar czf spark-config-backup.tar.gz /etc /var/lib/docker /home/spark
- Model weights: Mirror your model library to an external NVMe or S3 bucket. Models are heavy; 10-20 GB each adds up.
- Database dumps: If running RAG, vector stores, or any local database, dump daily.
- nvram / firmware settings: Use nvramctl save for BIOS/firmware state.
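A dated archive name keeps successive configuration backups from overwriting each other. The following is a minimal sketch: make_backup is a hypothetical helper, the destination directory is whatever you pass in, and reading /etc or /var/lib/docker requires root.

```bash
# Create a dated archive of the given paths inside a destination directory.
# Usage: make_backup DEST_DIR PATH...
# Prints the archive path on success.
make_backup() {
    dest=$1
    shift
    archive="$dest/spark-config-$(date +%Y-%m-%d).tar.gz"
    mkdir -p "$dest" && tar czf "$archive" "$@" && echo "$archive"
}

# Example usage (run as root; /mnt/backup is a placeholder path):
#   make_backup /mnt/backup /etc /home/spark
```

Running it twice on the same day overwrites that day's archive, which is usually acceptable for a daily job.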
Emergency Procedures
Spark Won't Boot
# If the Spark won't boot, check the power button and PSU connection.
# The Spark has no remote management (IPMI/Redfish), so the power button is your only control.
# If the system is frozen: hold the power button for 10 seconds, wait 30 seconds, then power on again
GPU Not Recognized
# Check if nvidia-smi works at all
nvidia-smi
# If it fails, check which NVIDIA driver package is installed
dpkg -l | grep -i nvidia-driver
# If the driver is not responding, reboot:
sudo reboot
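Before rebooting, it can be worth checking whether the kernel module loaded at all, since that separates a packaging problem from a hardware one. The sketch below assumes the core module is named nvidia in lsmod output; nvidia_module_loaded is a hypothetical helper.

```bash
# Read `lsmod` output on stdin; succeed if a module named exactly "nvidia" is listed.
nvidia_module_loaded() {
    awk '$1 == "nvidia" { found = 1 } END { exit !found }'
}

# Example usage:
#   if lsmod | nvidia_module_loaded; then
#       echo "Module loaded; nvidia-smi failure is likely elsewhere"
#   else
#       echo "Module not loaded; check the driver install and kernel version"
#   fi
```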
Summary Checklist
- ☐ Temperatures monitored daily (automated via cron)
- ☐ NVIDIA driver within 2 versions of latest stable
- ☐ Kernel pinned to a known-good version (unpin only when you verify compatibility)
- ☐ Disk space > 20% free (model downloads eat space)
- ☐ Memory available > 16 GB (OOM kills inference services)
- ☐ Backup of configs and model library up to date
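The memory item on the checklist can be checked mechanically. This sketch parses /proc/meminfo-style text and compares the result against the 16 GB threshold above; the helper names mem_available_gib and check_memory are hypothetical.

```bash
# Read /proc/meminfo-style text on stdin and print MemAvailable in whole GiB.
mem_available_gib() {
    awk '/^MemAvailable:/ { print int($2 / 1024 / 1024) }'
}

# Print a pass/fail line for the checklist's 16 GB memory threshold.
check_memory() {
    if [ "$1" -gt 16 ]; then
        echo "PASS memory"
    else
        echo "FAIL memory"
    fi
}

# Example usage:
#   check_memory "$(mem_available_gib < /proc/meminfo)"
```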