Fine-tuning on consumer GPUs means choosing between "run it fully on GPU and OOM on anything beyond a handful of billion parameters" or "offload to CPU and wait all night." The DGX Spark changes that equation with its 128GB of coherent unified memory: not VRAM, not system RAM, but a single memory pool that both the CPU and GPU address directly.
Memory Budget: What Fits?
| Model | Fine-Tunable? | Method | Peak Memory | Estimated Time (1 epoch) |
|---|---|---|---|---|
| Llama 3.1 8B | ✅ Easy | Full LoRA | ~24 GB | Varies (dataset size, batch size) |
| Llama 3.1 70B | ✅ Yes | QLoRA-4bit | ~48 GB | Varies (dataset size, batch size) |
| Llama 3.1 405B | ❌ No | — | ~256 GB needed | — |
| Qwen2.5 32B | ✅ Easy | Full LoRA | ~28 GB | Varies (dataset size, batch size) |
| Qwen2.5 72B | ✅ Yes | QLoRA-4bit | ~56 GB | Varies (dataset size, batch size) |
| Mistral-7B | ✅ Easy | Full LoRA | ~20 GB | Varies (dataset size, batch size) |
| Mixtral-8x7B | ✅ Yes | QLoRA-4bit | ~40 GB | Varies (dataset size, batch size) |
| Gemma-2-27B | ✅ Easy | Full LoRA | ~26 GB | Varies (dataset size, batch size) |
Note: Fine-tuning time depends heavily on dataset size, batch size, sequence length, and whether you enable gradient checkpointing. These estimates are rough — test with a small subset of your data first to calibrate.
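One way to do that calibration, as a rough sketch: time a fixed number of optimizer steps on a small slice of the training set and extrapolate. The slice size and step count below are arbitrary, and the code assumes the `model`, `tokenized` dataset, and `data_collator` defined in the QLoRA script in the next section.
# Rough timing calibration: run a fixed number of optimizer steps on a small
# slice, then extrapolate to a full epoch. Assumes `model`, `tokenized`, and
# `data_collator` from the QLoRA script below.
import time
from transformers import Trainer, TrainingArguments

calib_args = TrainingArguments(
    output_dir="./calib",
    per_device_train_batch_size=4,
    max_steps=30,             # enough steps for a stable seconds-per-step estimate
    logging_steps=10,
    report_to="none",
)
calib_trainer = Trainer(
    model=model,
    args=calib_args,
    train_dataset=tokenized["train"].select(range(200)),
    data_collator=data_collator,
)

start = time.time()
calib_trainer.train()
sec_per_step = (time.time() - start) / 30
steps_per_epoch = len(tokenized["train"]) / 4   # batches of 4, no gradient accumulation here
print(f"Estimated time per epoch: {sec_per_step * steps_per_epoch / 3600:.1f} h")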
QLoRA Setup on the Spark
QLoRA (Quantized LoRA) is the sweet spot for the GB10. You quantize the base model to 4-bit (or 8-bit) and fine-tune only the LoRA adapters:
pip install bitsandbytes accelerate peft transformers datasets
# Python fine-tuning script
import torch
from transformers import (
    AutoTokenizer,
    AutoModelForCausalLM,
    TrainingArguments,
    Trainer,
    DataCollatorForLanguageModeling,
    BitsAndBytesConfig
)
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training
from datasets import load_dataset
# 1. Load tokenizer and base model (quantized)
model_name = "meta-llama/Meta-Llama-3.1-8B-Instruct"
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token  # Llama defines no pad token by default

quantization_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.float16,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True
)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    quantization_config=quantization_config,
    device_map={"": "cuda"}  # GB10: unified memory, but 4-bit weights must live on the GPU path
)
model = prepare_model_for_kbit_training(model)  # standard prep step for QLoRA training
# 2. Configure LoRA
lora_config = LoraConfig(
r=16,
lora_alpha=32,
target_modules=["q_proj", "k_proj", "v_proj", "o_proj", "gate_proj", "up_proj", "down_proj"],
lora_dropout=0.05,
bias="none",
task_type="CAUSAL_LM"
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()
# Output: trainable params: ~42M || all params: ~8.0B || trainable%: ~0.5
# 3. Load and prepare dataset
dataset = load_dataset("json", data_files={"train": "train.json", "validation": "val.json"})

def format_example(example):
    return {
        "text": f"### Instruction:\n{example['instruction']}\n\n### Response:\n{example['response']}"
    }

dataset = dataset.map(format_example)  # build the prompt/response text field
tokenized = dataset.map(lambda x: tokenizer(x["text"], truncation=True, max_length=2048), batched=True)
# The collator pads batches and copies input_ids into labels for the causal-LM loss
data_collator = DataCollatorForLanguageModeling(tokenizer, mlm=False)
# 4. Training arguments — optimized for GB10
training_args = TrainingArguments(
output_dir="./llama3-fine-tuned",
per_device_train_batch_size=4, # Adjust based on memory
gradient_accumulation_steps=8, # Effective batch size = 32
learning_rate=2e-4,
num_train_epochs=3,
fp16=True, # Use float16 for faster training
logging_steps=10,
save_steps=100,
save_total_limit=3,
report_to="none",
dataloader_num_workers=8, # Use multiple CPU threads
optim="paged_adamw_8bit", # Save more memory during training
)
# 5. Train
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized["train"],
    eval_dataset=tokenized["validation"],
    data_collator=data_collator,  # supplies padded batches with labels
)
trainer.train()
model.save_pretrained("./llama3-fine-tuned-lora")   # saves only the LoRA adapter weights
tokenizer.save_pretrained("./llama3-fine-tuned-lora")
CPU vs GPU Placement on Unified Memory
On the GB10, "CPU vs GPU" doesn't mean separate memory pools; it means different access paths into the same 128GB. Both sides can address all of it, but bandwidth and latency differ depending on which path does the access. Here's how to optimize placement:
# On GB10, use the device_map to hint placement
# The framework will auto-route between CPU and GPU paths
# For fine-tuning: keep the model weights on the GPU path
# (faster access for weight reads during backward pass)
model = AutoModelForCausalLM.from_pretrained(
model_name,
quantization_config=quantization_config,
device_map={"": "cuda"} # Uses GPU memory path
)
# For inference-only workloads: CPU path is fine
# (lower power, sufficient bandwidth for sequential access)
model = AutoModelForCausalLM.from_pretrained(
model_name,
device_map="auto" # Let the framework decide
)
# The GB10's key advantage: you can keep BOTH an embedding
# model and a generation model resident simultaneously (see the
# sketch below), something that rarely fits in consumer-GPU VRAM
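A minimal sketch of that pattern, assuming a RAG-style setup with sentence-transformers for embeddings alongside a quantized generator; the model names and memory figures are illustrative, not measurements:
# Sketch: keep an embedding model and a quantized generator resident at once.
# Assumes sentence-transformers is installed; model choices are examples.
import torch
from sentence_transformers import SentenceTransformer
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

embedder = SentenceTransformer("BAAI/bge-large-en-v1.5", device="cuda")  # roughly 1-2 GB

bnb = BitsAndBytesConfig(load_in_4bit=True, bnb_4bit_quant_type="nf4",
                         bnb_4bit_compute_dtype=torch.float16)
tok = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-72B-Instruct")
generator = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen2.5-72B-Instruct",
    quantization_config=bnb,
    device_map={"": "cuda"},      # ~40 GB of 4-bit weights; both models fit in the 128 GB pool
)

query = "What does unified memory change for fine-tuning?"
q_vec = embedder.encode(query)    # embeddings and generation share the same memory pool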
Fine-Tuning Hyperparameters for the GB10
| Parameter | Recommended Value | Notes |
|---|---|---|
| Learning Rate | 2e-4 (QLoRA) / 5e-5 (full) | LoRA adapters train from scratch, so they tolerate a higher LR than full fine-tuning |
| Batch Size | 2-4 per device | Use gradient_accumulation_steps to scale effective batch size |
| Max Length | 2048-4096 | Longer context = more activation memory, but better quality for long documents |
| Warmup Steps | 10% of total steps | Prevents early training instability |
| Optimizer | paged_adamw_8bit | Significantly reduces peak memory vs standard AdamW |
| Gradient Checkpointing | Enabled | Trading compute for memory — ~30% slower but ~40% less memory |
| fp16 / bf16 | bf16 preferred | The GB10's Blackwell GPU has native bf16 support |
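Put together, the table's recommendations translate into TrainingArguments roughly like this; a sketch with illustrative values, not tuned numbers:
training_args = TrainingArguments(
    output_dir="./llama3-fine-tuned",
    per_device_train_batch_size=2,
    gradient_accumulation_steps=16,     # effective batch size = 32
    learning_rate=2e-4,                 # QLoRA; use ~5e-5 for full fine-tuning
    warmup_ratio=0.1,                   # warmup over 10% of total steps
    bf16=True,                          # preferred over fp16 on the GB10
    gradient_checkpointing=True,        # ~30% slower, ~40% less activation memory
    optim="paged_adamw_8bit",
    num_train_epochs=3,
    report_to="none",
)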
Quantization Strategies
# 4-bit QLoRA — maximum model size, good quality
bnb_config = BitsAndBytesConfig(
load_in_4bit=True,
bnb_4bit_compute_dtype=torch.float16,
bnb_4bit_quant_type="nf4", # Normal Float 4 — best quality/size ratio
bnb_4bit_use_double_quant=True # Quantize the quantization params
)
# 8-bit: less quality loss, roughly 2x the memory of 4-bit
bnb_config = BitsAndBytesConfig(
    load_in_8bit=True   # int8 loading has no separate compute-dtype option
)
# GGUF via llama.cpp: inference only (no fine-tuning), very memory efficient,
# and supports more aggressive quantization levels (Q3/Q2) for models that are tight even at 4-bit
from llama_cpp import Llama
llm = Llama(
    model_path="llama-3.1-8b-instruct.Q4_K_M.gguf",
    n_gpu_layers=-1,   # Offload all layers to the GPU path
    n_ctx=8192,
    n_threads=20,      # CPU threads for any layers left on the CPU (GB10 has 20 cores)
    n_batch=512
)
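A quick usage example with the object above; the prompt and sampling values are arbitrary:
output = llm(
    "### Instruction:\nSummarize what unified memory means for fine-tuning.\n\n### Response:\n",
    max_tokens=256,
    temperature=0.7,
    stop=["### Instruction:"],   # stop before the model invents a new instruction
)
print(output["choices"][0]["text"])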
What You Can't Fine-Tune on 128GB
- Llama 3.1 405B: needs ~256GB even at 4-bit. The quantized base weights alone far exceed 128GB, so QLoRA can't save it.
- Full (non-LoRA) fine-tuning of any 70B+ model: 16-bit weights plus gradients plus optimizer states run to several hundred GB.
- Very long context training (8K+ tokens) on 70B-class models: the activations for full sequences consume too much memory, even with gradient checkpointing.
For these cases, consider distributed fine-tuning across multiple machines, or trim the memory footprint with a smaller LoRA rank, fewer target modules, and shorter sequences (a sketch follows below). Or just use the Spark for what it's great at: 8B to 72B fine-tuning, where it genuinely outperforms consumer GPUs.
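A minimal sketch of a lower-footprint adapter config; the specific rank and module choices are illustrative, not tuned values:
from peft import LoraConfig

low_mem_lora = LoraConfig(
    r=8,                                  # half the rank of the earlier config
    lora_alpha=16,
    target_modules=["q_proj", "v_proj"],  # attention-only adapters, far fewer trainable params
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM",
)
# Pair this with max_length=1024 and gradient_checkpointing=True to cut activation memory further.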
Evaluating Your Fine-Tune
# After training, evaluate on your test set
import json
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig
from peft import PeftModel

tokenizer = AutoTokenizer.from_pretrained(model_name)

# Load the base model once, then attach the trained LoRA adapter on top of it
base_model = AutoModelForCausalLM.from_pretrained(
    model_name, quantization_config=quantization_config, device_map={"": "cuda"}
)
model = PeftModel.from_pretrained(base_model, "./llama3-fine-tuned-lora")

# Quick smoke test
inputs = tokenizer("Tell me about quantum computing.", return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=256, do_sample=True, temperature=0.7)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))

# Compare fine-tuned output against the base model on a fixed prompt set
prompts = json.load(open("test_prompts.json"))
results = {"original": [], "fine_tuned": []}
for p in prompts:
    batch = tokenizer(p, return_tensors="pt").to(model.device)
    with model.disable_adapter():                          # base-model behaviour
        orig_out = model.generate(**batch, max_new_tokens=256)
    ft_out = model.generate(**batch, max_new_tokens=256)   # adapter active
    results["original"].append(tokenizer.decode(orig_out[0], skip_special_tokens=True))
    results["fine_tuned"].append(tokenizer.decode(ft_out[0], skip_special_tokens=True))

# Save comparison for review
with open("fine_tune_comparison.json", "w") as f:
    json.dump(results, f, indent=2)
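If the comparison looks good, you can optionally merge the adapter into the base weights for simpler deployment. A sketch using peft's merge_and_unload; it assumes you reload the base model in 16-bit first (easy for an 8B model on 128GB), since merging into quantized weights is more fiddly:
from transformers import AutoModelForCausalLM
from peft import PeftModel
import torch

# Reload the base in bf16, attach the adapter, merge, and save a standalone checkpoint
base_fp16 = AutoModelForCausalLM.from_pretrained(
    model_name, torch_dtype=torch.bfloat16, device_map={"": "cuda"}
)
merged = PeftModel.from_pretrained(base_fp16, "./llama3-fine-tuned-lora").merge_and_unload()
merged.save_pretrained("./llama3-fine-tuned-merged")
tokenizer.save_pretrained("./llama3-fine-tuned-merged")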
Bottom Line
The DGX Spark is one of the few consumer-priced machines that can fine-tune 70B-class models without a distributed setup, cloud costs, or waiting for GPU availability. The unified memory isn't a nice-to-have here; it's the reason this works at all. 128GB of coherent memory means you can hold a quantized 72B model, its LoRA adapters, the optimizer state, and your training data in memory simultaneously, without swapping.