Overview
The GB10 Grace Blackwell superchip is powerful but relatively new. The software stack (Linux kernel, ARM64 CUDA compatibility, NVIDIA drivers, container runtime) is still maturing. Here's what you need to know to keep your Spark running smoothly over months and years of use.
Software Updates
Kernel Updates — Proceed with Caution
The Spark runs on a custom Linux kernel optimized for the ARM64 + unified memory architecture. Kernel updates can occasionally break GPU passthrough or thermal management:
# Check current kernel version
uname -r
# Example output: 6.11.0-nvidia-grace-2024.11
# Before updating, check NVIDIA's compatibility notes
# at docs.nvidia.com — search for your kernel version
# Pin a known-good kernel to avoid surprises
sudo apt-mark hold linux-image-generic linux-headers-generic
# When you DO update (remove the hold first, then install the new packages):
sudo apt-mark unhold linux-image-generic linux-headers-generic
sudo apt install linux-image-generic linux-headers-generic
# Reboot and verify GPU is recognized
sudo reboot
nvidia-smi
# Re-apply the performance CPU governor if you use one (may reset after updates)
sudo cpupower frequency-set -g performance
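The reboot-and-verify steps above can be collected into one quick sanity check. A minimal sketch (assumes standard lsmod output; run it after every reboot that follows a kernel update):

```shell
# Post-update sanity check: report the running kernel and whether the
# NVIDIA kernel module loaded.
kernel=$(uname -r)
echo "running kernel: $kernel"
if lsmod 2>/dev/null | grep -q '^nvidia'; then
  echo "nvidia module: loaded"
else
  echo "nvidia module: NOT loaded - check 'dkms status' and dmesg"
fi
```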
NVIDIA Driver Updates
# Check current driver version
nvidia-smi | head -5
# Update to latest
sudo apt install nvidia-driver-550
sudo apt install nvidia-container-toolkit
# Verify
nvidia-smi
# Should show:
# Driver Version: 550.x.x | CUDA Version: 12.x
Driver cadence: NVIDIA ships driver updates roughly monthly for the Grace/Blackwell platforms. Note that containers share the host's kernel driver, so they cannot sandbox a driver update itself; record your current working driver version for rollback, and use containers to test new CUDA toolkit versions against the updated driver before committing workloads to it.
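Recording the exact driver version in a script-friendly way makes before/after comparisons easy. A small parsing helper (the sample banner line below is illustrative; on the Spark you would pipe real nvidia-smi output into it):

```shell
# Extract "Driver Version: X.Y.Z" from nvidia-smi's banner so scripts can
# compare versions before and after an update.
parse_driver_version() {
  sed -n 's/.*Driver Version: \([0-9.]*\).*/\1/p' | head -1
}
# Real usage on the Spark:
#   nvidia-smi | parse_driver_version
# Demonstration against a sample banner line:
sample='| NVIDIA-SMI 550.54.15    Driver Version: 550.54.15    CUDA Version: 12.4 |'
echo "$sample" | parse_driver_version
```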
Ollama and Model Updates
Ollama on ARM64 is updated frequently. Check for updates monthly:
ollama --version
# Update:
curl -fsSL https://ollama.com/install.sh | sh
Thermal Monitoring
The GB10 superchip has multiple thermal zones. Monitor them all:
# Install the monitoring tools used below (sensors, ipmitool)
sudo apt install lm-sensors ipmitool
# Check all thermal zones
nvidia-smi -q | grep -A5 "GPU Thermal"
# Or watch real-time readings (refreshes every 2 seconds)
watch -n 2 'nvidia-smi --query-gpu=temperature.gpu,power.draw,clocks.gr,clocks.mem,utilization.gpu --format=csv,noheader'
# Check CPU temps
sensors | grep -E "temp|core"
# Check fan speeds
ipmitool sdr | grep -i fan
# Or read from /sys/class/hwmon/
cat /sys/class/hwmon/hwmon*/fan*_input 2>/dev/null
Temperature thresholds:
| Component | Normal | Warning | Throttle | Critical |
|---|---|---|---|---|
| GB10 GPU Junction | 40-65°C | 75°C | 85°C | 95°C |
| GB10 CPU Cores | 35-60°C | 70°C | 80°C | 90°C |
| Memory (LPDDR5x) | 30-55°C | 65°C | 75°C | 85°C |
| Board (PCB) Avg | 35-55°C | 65°C | 75°C | 85°C |
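The warning thresholds in the table can drive a simple cron-style alert. A minimal sketch (threshold values taken from the table; wiring to nvidia-smi is shown in the comments):

```shell
# Compare a temperature reading against a warning threshold from the table.
check_temp() {  # usage: check_temp <label> <reading_C> <warn_C>
  if [ "$2" -ge "$3" ]; then
    echo "WARN: $1 at ${2}C (warning threshold ${3}C)"
  else
    echo "OK: $1 at ${2}C"
  fi
}
# On the Spark, feed it live readings, e.g.:
#   gpu=$(nvidia-smi --query-gpu=temperature.gpu --format=csv,noheader,nounits)
#   check_temp "GPU junction" "$gpu" 75
check_temp "GPU junction" 62 75
```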
Known Pain Points
1. ARM64 Compatibility Gaps
Despite being the primary architecture the Spark runs on, ARM64 software support has gaps:
- PyTorch: Full ARM64 CUDA support arrived in 2.1+. Older versions silently fail. Always verify with torch.cuda.is_available().
- Docker images: Many GPU images only publish AMD64. Use the --platform linux/arm64 flag or search for explicit ARM64 builds.
- pip packages: Some popular packages (especially those with C extensions) don't publish ARM64 wheels. You'll need to compile from source.
- VS Code remote: Works, but extensions may be missing ARM64 builds. Stick to the core extensions.
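A quick preflight exercises the first gap above: it reports the machine architecture and, when PyTorch is importable, whether CUDA is visible. It degrades gracefully when torch is absent, so it is safe to run anywhere:

```shell
# ARM64/CUDA preflight: expect "aarch64" on the Spark, and cuda: True once
# a >=2.1 ARM64 build of PyTorch is installed.
preflight() {
  echo "arch: $(uname -m)"
  if python3 -c 'import torch' 2>/dev/null; then
    python3 -c 'import torch; print("torch:", torch.__version__, "cuda:", torch.cuda.is_available())'
  else
    echo "torch: not importable (install an ARM64 wheel, or build from source)"
  fi
}
preflight
```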
2. Unified Memory OOM Crashes
With 128 GB shared between CPU and GPU, it's easy to exhaust memory without realizing it:
# Monitor shared memory usage
free -h
# Look at "total" and "available" — not just "free"
# Check GPU-side allocation
nvidia-smi --query-gpu=memory.used,memory.total --format=csv
# Set memory pressure monitoring
cat > /usr/local/bin/memory-monitor.sh <<'EOF'
#!/bin/bash
AVAIL=$(awk '/MemAvailable/ {print int($2/1024)}' /proc/meminfo)
if [ "$AVAIL" -lt 4096 ]; then
  # Discord webhooks expect an HTTP POST with a JSON body, not raw TCP
  curl -fsS -H 'Content-Type: application/json' \
    -d "{\"content\": \"⚠️ Low memory: ${AVAIL}MB available on $(hostname)\"}" \
    "discord-webhook-url"
fi
EOF
chmod +x /usr/local/bin/memory-monitor.sh
# Add this line via crontab -e (runs every 5 minutes)
*/5 * * * * /usr/local/bin/memory-monitor.sh
3. Network Adapter Maintenance
The DGX Spark ships with Mellanox 200GbE networking. Keep firmware updated and check link status periodically:
# Check network adapter status
ip -br link show
# Check link status
ethtool eth0
# Check for driver updates
# See NVIDIA's official docs for the latest driver releases
# https://docs.nvidia.com/
4. LPDDR5x Memory Scrubbing
The GB10's LPDDR5x memory uses hardware error correction but occasionally requires scrub cycles. The system does this automatically, but during intensive workloads you may see brief stuttering as the memory controller cleans itself:
- Normal: occasional 10-50ms pauses in inference (unnoticeable during generation)
- Concerning: persistent stuttering during idle → check the kernel log for memory errors with dmesg | grep -iE 'ecc|edac'
Routine Maintenance Schedule
| Frequency | Task | Command / Method |
|---|---|---|
| Weekly | Check temps and power | nvidia-smi + PDU monitoring |
| Weekly | Check disk space | df -h — model files grow fast |
| Monthly | Update Ollama + drivers | Re-run the Ollama install script + nvidia-smi version check |
| Monthly | Clear Ollama cache | ollama list + ollama rm — prune unused models |
| Quarterly | Kernel update review | Check NVIDIA docs for kernel compatibility |
| Quarterly | Clean fans / dust | Compressed air on intake filters |
| Annually | Thermal paste inspection | Replace if temps run >5°C above your baseline |
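The weekly disk check from the schedule is easy to script: df's "Use%" column drives a 20%-free rule of thumb. A minimal sketch (the model directory path is an assumption; Ollama defaults to ~/.ollama):

```shell
# Weekly disk check: warn when the filesystem holding the model directory
# is less than 20% free.
disk_free_pct() {
  # df -P column 5 is "Use%", e.g. "63%"; awk coerces it to a number
  df -P "$1" | awk 'NR==2 {print 100 - $5}'
}
free=$(disk_free_pct "${HOME:-/}")
if [ "$free" -lt 20 ]; then
  echo "WARN: only ${free}% free under ${HOME:-/}"
else
  echo "OK: ${free}% free under ${HOME:-/}"
fi
```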
Backup Strategy
The Spark's 128GB of RAM is volatile. Your important data lives on NVMe storage. Back up:
- System configuration: tar czf spark-config-backup.tar.gz /etc /var/lib/docker /home/spark
- Model weights: Mirror your model library to an external NVMe or S3 bucket. Models are heavy; 10-20 GB each adds up.
- Database dumps: If running RAG, vector stores, or any local database — dump daily.
- NVRAM / firmware settings: Use nvramctl save for BIOS/firmware state.
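The first bullet can be turned into a dated, scriptable backup. A minimal sketch (the destination path is an assumption; point DEST at an external NVMe mount, and add /var/lib/docker and /home/spark as above once running as root):

```shell
# Dated config backup; GNU tar assumed for --ignore-failed-read.
set -u
DEST=${DEST:-/tmp/spark-backups}
STAMP=$(date +%Y%m%d)
mkdir -p "$DEST"
# --ignore-failed-read keeps the archive going past unreadable files
tar czf "$DEST/spark-config-$STAMP.tar.gz" --ignore-failed-read /etc 2>/dev/null
echo "wrote $DEST/spark-config-$STAMP.tar.gz"
```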
Emergency Procedures
Spark Won't Boot
# If you can reach a serial console or recovery shell, check PCIe enumeration
lspci | grep -i nvidia
# If equipped with IPMI/redfish (check your specific config):
ipmitool power cycle
# If all else fails: hold power button 10 seconds, wait 30s, power on
GPU Not Recognized
# Check PCIe enumeration
lspci | grep -i nvidia
# If visible but nvidia-smi fails:
sudo rmmod nvidia_uvm nvidia_drm nvidia
sudo modprobe nvidia
nvidia-smi
# If not visible: check BIOS for PCIe slot configuration
Summary Checklist
- ☐ Temperatures monitored daily (automated via cron)
- ☐ NVIDIA driver within 2 versions of latest stable
- ☐ Kernel pinned to a known-good version (unpin only when you verify compatibility)
- ☐ Disk space > 20% free (model downloads eat space)
- ☐ Memory available > 16 GB (OOM kills inference services)
- ☐ Network adapter firmware current
- ☐ Backup of configs and model library up to date