DGX Spark Maintenance

Software updates, driver issues, thermal monitoring, and the known pain points that NVIDIA documentation conveniently omits.

Overview

The GB10 Grace Blackwell superchip is powerful but relatively new. The software stack (Linux kernel, ARM64 CUDA compatibility, NVIDIA drivers, container runtime) is still maturing. Here's what you need to know to keep your Spark running smoothly over months and years of use.

Software Updates

Kernel Updates — Proceed with Caution

The Spark runs on a custom Linux kernel optimized for the ARM64 + unified memory architecture. Kernel updates can and will occasionally break GPU passthrough or thermal management:

# Check current kernel version
uname -r

# Example output: 6.11.0-nvidia-grace-2024.11

# Before updating, check NVIDIA's compatibility notes
# at docs.nvidia.com — search for your kernel version

# Pin a known-good kernel to avoid surprises
sudo apt-mark hold linux-image-generic linux-headers-generic

# When you DO update:
sudo apt upgrade linux-image-generic linux-headers-generic

# Reboot and verify GPU is recognized
sudo reboot
nvidia-smi

⚠️ Known issue (Dec 2024 – Apr 2025): Kernel 6.8.x through 6.10.x on early Spark units had inconsistent CPU frequency scaling under the GB10's big.LITTLE configuration. The 8 high-performance cores would occasionally lock to minimum frequency, killing inference throughput by 40%. Workaround: pin the performance cores with cpupower frequency-set -g performance.
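The cpupower workaround can also be scripted directly against sysfs so it survives reboots (e.g., from a systemd unit or rc.local). A minimal sketch — the sysfs root is parameterized here purely so the logic can be exercised without real hardware; on a live system you would pass /sys/devices/system/cpu:

```shell
# set_governor: write a cpufreq governor to every policy under a sysfs root.
# Equivalent in effect to `cpupower frequency-set -g <governor>` for all cores.
set_governor() {
  local root="$1" gov="$2" policy
  for policy in "$root"/cpufreq/policy*; do
    [ -d "$policy" ] || continue           # skip if no policies matched
    echo "$gov" > "$policy/scaling_governor"
  done
}

# On real hardware, as root:
# set_governor /sys/devices/system/cpu performance
```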

NVIDIA Driver Updates

# Check current driver version
nvidia-smi | head -5

# Update to latest
sudo apt install nvidia-driver-550
sudo apt install nvidia-container-toolkit

# Verify
nvidia-smi
# Should show:
# Driver Version: 550.x.x  |  CUDA Version: 12.x

Driver cadence: NVIDIA ships driver updates roughly monthly for the Grace/Blackwell platforms. Note that containers share the host's kernel driver, so a driver update can't be sandboxed inside Docker — instead, keep the previous driver package cached for rollback and verify your container workloads immediately after updating.
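For update scripts, it helps to extract just the version number rather than eyeballing the nvidia-smi header. A small sketch — this only parses the header text, and the version shown in the test is illustrative:

```shell
# driver_version: pull the "Driver Version: X.Y.Z" value out of nvidia-smi's
# header text on stdin. Prints nothing if the pattern is absent.
driver_version() {
  sed -n 's/.*Driver Version: \([0-9.]*\).*/\1/p' | head -1
}

# Usage on a live system:
# nvidia-smi | driver_version
```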

Ollama and Model Updates

Ollama on ARM64 is updated frequently. Check for updates monthly:

ollama --version
# Update:
curl -fsSL https://ollama.com/install.sh | sh
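Because the install script always fetches the latest release, it's worth logging the version before and after each update so you have an audit trail if a regression appears. A minimal sketch (the log path is an example, not a convention):

```shell
# log_version: append a UTC-timestamped version line to a log file, so you can
# see exactly when each Ollama version landed on this machine.
log_version() {
  local logfile="$1" version="$2"
  printf '%s %s\n' "$(date -u +%Y-%m-%dT%H:%M:%SZ)" "$version" >> "$logfile"
}

# Typical update flow (assumes ollama is installed):
# log_version ~/ollama-versions.log "$(ollama --version)"
# curl -fsSL https://ollama.com/install.sh | sh
# log_version ~/ollama-versions.log "$(ollama --version)"
```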

Thermal Monitoring

The GB10 superchip has multiple thermal zones. Monitor them all:

# Install the monitoring tools used below
sudo apt install lm-sensors ipmitool

# Check GPU temperature readings and thresholds
nvidia-smi -q -d TEMPERATURE

# Or watch real-time readings every 2 seconds
watch -n 2 'nvidia-smi --query-gpu=temperature.gpu,power.draw,clocks.gr,clocks.mem,utilization.gpu --format=csv,noheader'

# Check CPU temps
sensors | grep -E "temp|core"

# Check fan speeds
ipmitool sdr | grep -i fan
# Or read from /sys/class/hwmon/
cat /sys/class/hwmon/hwmon*/fan*_input 2>/dev/null

Temperature thresholds:

| Component         | Normal  | Warning | Throttle | Critical |
|-------------------|---------|---------|----------|----------|
| GB10 GPU Junction | 40-65°C | 75°C    | 85°C     | 95°C     |
| GB10 CPU Cores    | 35-60°C | 70°C    | 80°C     | 90°C     |
| Memory (LPDDR5x)  | 30-55°C | 65°C    | 75°C     | 85°C     |
| Board (PCB) Avg   | 35-55°C | 65°C    | 75°C     | 85°C     |
💡 Set up automated alerts: Create a cron job that checks temps and sends you a Discord/Telegram message if anything exceeds safe levels.
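A minimal sketch of that alert check. The reading is passed in as an argument so the threshold logic is testable on its own; the 85°C limit and the Discord webhook URL are placeholders you'd set yourself:

```shell
# check_temp: print an alert message when a temperature reading meets or
# exceeds a limit; print nothing otherwise.
check_temp() {
  local temp="$1" limit="$2"
  if [ "$temp" -ge "$limit" ]; then
    echo "ALERT: GPU at ${temp}C (limit ${limit}C) on $(hostname)"
  fi
}

# In a cron job (webhook URL and limit are yours to choose):
# TEMP=$(nvidia-smi --query-gpu=temperature.gpu --format=csv,noheader,nounits)
# MSG=$(check_temp "$TEMP" 85)
# [ -n "$MSG" ] && curl -s -H 'Content-Type: application/json' \
#     -d "{\"content\": \"$MSG\"}" "$DISCORD_WEBHOOK_URL"
```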

Known Pain Points

1. ARM64 Compatibility Gaps

Despite being the primary architecture the Spark runs on, ARM64 still has software support gaps: some projects ship prebuilt binaries and wheels for x86-64 only, so expect the occasional source build or delayed release.

2. Unified Memory OOM Crashes

With 128 GB shared between CPU and GPU, it's easy to exhaust memory without realizing it:

# Monitor shared memory usage
free -h
# Look at "total" and "available" — not just "free"

# Check GPU-side allocation
nvidia-smi --query-gpu=memory.used,memory.total --format=csv

# Set memory pressure monitoring (replace the webhook URL with your own)
cat > /usr/local/bin/memory-monitor.sh <<'EOF'
#!/bin/bash
AVAIL=$(awk '/MemAvailable/ {print int($2/1024)}' /proc/meminfo)
if [ "$AVAIL" -lt 4096 ]; then
    curl -s -H 'Content-Type: application/json' \
        -d "{\"content\": \"⚠️ Low memory: ${AVAIL}MB available on $(hostname)\"}" \
        "discord-webhook-url"
fi
EOF
chmod +x /usr/local/bin/memory-monitor.sh

# Add to crontab (crontab -e) to run every 5 minutes:
# */5 * * * * /usr/local/bin/memory-monitor.sh

3. Network Adapter Maintenance

The DGX Spark ships with Mellanox 200GbE networking. Keep firmware updated and check link status periodically:

# Check that the adapter enumerates on the PCIe bus
lspci | grep -i mellanox

# Check link status (substitute your actual interface name; see ip link)
ethtool eth0

# Check for driver updates
# See NVIDIA's official docs for the latest driver releases
# https://docs.nvidia.com/

4. LPDDR5x Memory Scrubbing

The GB10's LPDDR5x memory uses hardware error correction but occasionally requires scrub cycles. The system does this automatically, but during intensive workloads you may see brief stuttering as the memory controller cleans itself.

Routine Maintenance Schedule

| Frequency | Task                      | Command / Method                                                  |
|-----------|---------------------------|-------------------------------------------------------------------|
| Weekly    | Check temps and power     | nvidia-smi + PDU monitoring                                       |
| Weekly    | Check disk space          | df -h — model files grow fast                                     |
| Monthly   | Update Ollama + drivers   | Re-run the Ollama install script; check nvidia-smi driver version |
| Monthly   | Prune unused models       | ollama list, then ollama rm for anything stale                    |
| Quarterly | Kernel update review      | Check NVIDIA docs for kernel compatibility                        |
| Quarterly | Clean fans / dust         | Compressed air on intake filters                                  |
| Annually  | Thermal paste inspection  | Replace if temps rise more than 5°C over baseline                 |

Backup Strategy

The Spark's 128 GB of RAM is volatile; your important data lives on NVMe storage. Back up:

  1. System configuration: tar czf spark-config-backup.tar.gz /etc /var/lib/docker /home/spark
  2. Model weights: Mirror your model library to an external NVMe or S3 bucket. Models are heavy; 10-20 GB each adds up.
  3. Database dumps: If running RAG, vector stores, or any local database — dump daily.
  4. Firmware settings: Record your BIOS/firmware configuration (notes or photos of the setup screens) so it can be restored after a board swap or reset.
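Item 1 can be wrapped in a small helper that timestamps each archive, which makes rotation and retention easier. A sketch — the destination and source paths below are examples, not a fixed layout:

```shell
# backup_config: tar the given directories into a date-stamped archive under a
# destination directory, and print the archive path. Warnings (e.g. tar
# stripping leading slashes) are suppressed.
backup_config() {
  local dest="$1"; shift
  local stamp archive
  stamp=$(date -u +%Y%m%d)
  archive="$dest/spark-config-$stamp.tar.gz"
  tar czf "$archive" "$@" 2>/dev/null
  echo "$archive"
}

# Usage (as root, assuming /mnt/backup is your external NVMe mount):
# backup_config /mnt/backup /etc /var/lib/docker /home/spark
```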

Emergency Procedures

Spark Won't Boot

# If you can still reach a shell (e.g., over serial console), check device enumeration
lspci | grep -i nvidia

# If equipped with IPMI/redfish (check your specific config):
ipmitool power cycle

# If all else fails: hold power button 10 seconds, wait 30s, power on

GPU Not Recognized

# Check PCIe enumeration
lspci | grep -i nvidia

# If visible but nvidia-smi fails:
sudo rmmod nvidia_uvm nvidia_drm nvidia
sudo modprobe nvidia
nvidia-smi

# If not visible: check BIOS for PCIe slot configuration

Summary

💡 Bottom line: The Spark is reliable hardware; roughly 90% of issues are software-related (driver mismatches, memory exhaustion, kernel updates). Automate monitoring, pin a known-good kernel, and verify your workloads immediately after every driver update. Don't update the kernel on a production Spark without keeping the previous known-good kernel installed for rollback.