Fine-Tuning on DGX Spark

Fine-tuning Llama 3 and other models on the GB10's 128GB unified memory — what fits, what doesn't, and how to split CPU and GPU for maximum throughput.

Fine-tuning on consumer GPUs means choosing between "run it fully on GPU and OOM on anything much bigger than a 7B model" or "CPU offload and wait all night." The DGX Spark changes that equation with its 128GB of coherent unified memory: not VRAM, not system RAM, but a single memory pool accessible by both CPU and GPU at full bandwidth.

Memory Budget: What Fits?

| Model | Fine-Tunable? | Method | Peak Memory | Estimated Time (1 epoch) |
|---|---|---|---|---|
| Llama 3.1 8B | ✅ Easy | Full LoRA | ~24 GB | Varies (dataset size, batch size) |
| Llama 3.1 70B | ✅ Yes | QLoRA-4bit | ~48 GB | Varies (dataset size, batch size) |
| Llama 3.1 405B | ❌ No | — | ~256 GB needed | — |
| Qwen2.5 32B | ✅ Easy | Full LoRA | ~28 GB | Varies (dataset size, batch size) |
| Qwen2.5 72B | ✅ Yes | QLoRA-4bit | ~56 GB | Varies (dataset size, batch size) |
| Mistral-7B | ✅ Easy | Full LoRA | ~20 GB | Varies (dataset size, batch size) |
| Mixtral-8x7B | ✅ Yes | QLoRA-4bit | ~40 GB | Varies (dataset size, batch size) |
| Gemma-2-27B | ✅ Easy | Full LoRA | ~26 GB | Varies (dataset size, batch size) |

Note: Fine-tuning time depends heavily on dataset size, batch size, sequence length, and whether you enable gradient checkpointing. These estimates are rough — test with a small subset of your data first to calibrate.

🔑 Key insight: The GB10 doesn't have discrete VRAM. All 128GB is shared. A "48 GB peak" number means you have 80GB left for your system, other processes, and training overhead. That's a lot of room — far more than a 24GB RTX 4090 can give you.
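The table's peak-memory numbers can be sanity-checked with a back-of-the-envelope estimate: quantized weight storage scales linearly with bits per parameter, and everything else (activations, LoRA adapters, gradients, optimizer state) can be folded into a rough multiplier. This is a sketch, not a formula; the 1.35x overhead factor is an assumption to calibrate against your own runs.

```python
def qlora_peak_gb(params_billions: float, weight_bits: int = 4,
                  overhead: float = 1.35) -> float:
    """Rough peak-memory estimate for LoRA/QLoRA fine-tuning.

    overhead is an assumed multiplier covering activations, LoRA
    adapters, gradients, and 8-bit optimizer state; calibrate it
    on a small run before trusting it.
    """
    weights_gb = params_billions * weight_bits / 8  # 4-bit = 0.5 bytes/param
    return weights_gb * overhead

# Roughly in line with the table above (~48 GB and ~24 GB)
print(f"Llama 3.1 70B, 4-bit QLoRA: ~{qlora_peak_gb(70, weight_bits=4):.0f} GB")
print(f"Llama 3.1 8B, 16-bit LoRA:  ~{qlora_peak_gb(8, weight_bits=16):.0f} GB")
```

The useful part is the weights term; the overhead term is where batch size, sequence length, and gradient checkpointing move the number around.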

QLoRA Setup on the Spark

QLoRA (Quantized LoRA) is the sweet spot for the GB10. You quantize the base model to 4-bit (or 8-bit) and fine-tune only the LoRA adapters:

pip install bitsandbytes accelerate peft transformers datasets

# Python fine-tuning script
import os
import torch
from transformers import (
    AutoTokenizer,
    AutoModelForCausalLM,
    TrainingArguments,
    Trainer,
    BitsAndBytesConfig
)
from peft import LoraConfig, get_peft_model
from datasets import load_dataset

# 1. Load tokenizer and base model (quantized)
model_name = "meta-llama/Meta-Llama-3.1-8B-Instruct"
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token  # Llama ships without a pad token

quantization_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.bfloat16,  # bf16 is well supported on GB10
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True
)

model = AutoModelForCausalLM.from_pretrained(
    model_name,
    quantization_config=quantization_config,
    device_map="auto"  # GB10: one unified pool — let accelerate place the layers
)

# 2. Configure LoRA
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj", "gate_proj", "up_proj", "down_proj"],
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM"
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()
# Output: trainable params: ~42M / 8.0B (~0.5%)

# 3. Load and prepare dataset
dataset = load_dataset("json", data_files={"train": "train.json", "validation": "val.json"})

def format_example(example):
    return {
        "text": f"### Instruction:\n{example['instruction']}\n\n### Response:\n{example['response']}"
    }

dataset = dataset.map(format_example)  # Build the 'text' field the tokenizer expects
tokenized = dataset.map(lambda x: tokenizer(x["text"], truncation=True, max_length=2048), batched=True)

# 4. Training arguments — optimized for GB10
training_args = TrainingArguments(
    output_dir="./llama3-fine-tuned",
    per_device_train_batch_size=4,      # Adjust based on memory
    gradient_accumulation_steps=8,       # Effective batch size = 32
    learning_rate=2e-4,
    num_train_epochs=3,
    bf16=True,                           # bf16 preferred over fp16 on GB10
    gradient_checkpointing=True,         # ~30% slower, ~40% less activation memory
    logging_steps=10,
    save_steps=100,
    save_total_limit=3,
    report_to="none",
    dataloader_num_workers=8,            # Use multiple CPU threads
    optim="paged_adamw_8bit",            # Save more memory during training
)

# 5. Train
from transformers import DataCollatorForLanguageModeling

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized["train"],
    eval_dataset=tokenized["validation"],
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),  # creates causal-LM labels
)

trainer.train()
model.save_pretrained("./llama3-fine-tuned-lora")

CPU vs GPU Placement on Unified Memory

On the GB10, "CPU vs GPU" doesn't mean different memory pools — it means different access paths. The GPU can access the full 128GB, but at different bandwidths. Here's how to optimize placement:

# On GB10, use the device_map to hint placement
# The framework will auto-route between CPU and GPU paths

# For fine-tuning: keep the model weights on the GPU path
# (faster access for weight reads during backward pass)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    quantization_config=quantization_config,
    device_map={"": 0}  # Pin every layer to the GPU memory path
)

# For inference-only workloads: CPU path is fine
# (lower power, sufficient bandwidth for sequential access)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    device_map="auto"  # Let the framework decide
)

# The GB10's key advantage: you can keep BOTH the embedding
# model and the generation model in memory simultaneously
# — something impossible on consumer GPUs with limited VRAM
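To see what the GPU path actually reports, `torch.cuda.mem_get_info` is a quick check; on the GB10 the total should reflect the shared pool rather than a discrete VRAM size. A small sketch, guarded so it also runs on machines without CUDA:

```python
import torch

def gpu_memory_report():
    """Return (free_gb, total_gb) as seen from the GPU path, or None without CUDA."""
    if not torch.cuda.is_available():
        return None
    free_bytes, total_bytes = torch.cuda.mem_get_info()
    return free_bytes / 1e9, total_bytes / 1e9

report = gpu_memory_report()
if report is None:
    print("No CUDA device visible")
else:
    free_gb, total_gb = report
    print(f"GPU path sees {free_gb:.1f} GB free of {total_gb:.1f} GB total")
```

Running this before and during training is the simplest way to verify the peak-memory estimates in the table above.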

Fine-Tuning Hyperparameters for the GB10

| Parameter | Recommended Value | Notes |
|---|---|---|
| Learning Rate | 2e-4 (QLoRA) / 5e-5 (full) | QLoRA needs a higher LR to compensate for quantization |
| Batch Size | 2-4 per device | Use `gradient_accumulation_steps` to scale effective batch size |
| Max Length | 2048-4096 | Larger context = more memory, but better quality for long docs |
| Warmup Steps | 10% of total steps | Prevents early training instability |
| Optimizer | `paged_adamw_8bit` | Significantly reduces peak memory vs standard AdamW |
| Gradient Checkpointing | Enabled | Trades compute for memory: ~30% slower but ~40% less memory |
| fp16 / bf16 | bf16 preferred | The Grace Blackwell platform supports bf16 natively |
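The "10% of total steps" warmup rule has to be computed from your dataset size and the effective batch size. A small helper, using hypothetical numbers (10k examples) with the batch settings from the script above:

```python
def schedule(num_examples: int, per_device_bs: int, grad_accum: int,
             epochs: int, warmup_frac: float = 0.10) -> tuple[int, int]:
    """Return (total_optimizer_steps, warmup_steps) for a training run."""
    effective_batch = per_device_bs * grad_accum
    steps_per_epoch = num_examples // effective_batch
    total_steps = steps_per_epoch * epochs
    return total_steps, max(1, int(total_steps * warmup_frac))

# e.g. 10,000 examples, batch 4, accumulation 8, 3 epochs
total, warmup = schedule(10_000, per_device_bs=4, grad_accum=8, epochs=3)
print(total, warmup)  # → 936 93
```

Pass `warmup` to `TrainingArguments(warmup_steps=...)`; it also tells you whether `save_steps=100` gives you a sensible number of checkpoints for the run.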

Quantization Strategies

# 4-bit QLoRA — maximum model size, good quality
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.float16,
    bnb_4bit_quant_type="nf4",  # Normal Float 4 — best quality/size ratio
    bnb_4bit_use_double_quant=True  # Quantize the quantization params
)

# 8-bit — less quality loss, more memory
bnb_config = BitsAndBytesConfig(
    load_in_8bit=True,
    bnb_8bit_compute_dtype=torch.float16
)

# GGUF via llama.cpp: splits layers between CPU and GPU, very memory efficient
# (llama.cpp is inference-only — use it to serve a fine-tuned model after
# merging the LoRA adapter, or when even 4-bit transformers loading is tight:
# aggressive GGUF quants like Q3_K shrink weights further than NF4)
from llama_cpp import Llama
llm = Llama(
    model_path="llama-3.1-8b-instruct.Q4_K_M.gguf",
    n_gpu_layers=-1,       # Offload all layers to the GPU path
    n_ctx=8192,
    n_threads=20,          # CPU threads for any layers not offloaded
    n_batch=512
)

What You Can't Fine-Tune on 128GB

Full-parameter fine-tuning of 70B-class models and anything at the 405B scale won't fit: Llama 3.1 405B needs roughly 256 GB even with 4-bit quantization. For these cases, consider distributed fine-tuning across multiple machines, or use parameter-efficient methods with smaller LoRA ranks. Or just use the Spark for what it's great at: 8B to 72B fine-tuning, where it genuinely outperforms consumer GPUs.

Evaluating Your Fine-Tune

# After training, evaluate on your test set
import json
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

tokenizer = AutoTokenizer.from_pretrained(model_name)

# The 128GB pool is big enough to hold the base model and the
# fine-tuned variant side by side for comparison
original_model = AutoModelForCausalLM.from_pretrained(model_name, device_map="auto")
model = PeftModel.from_pretrained(
    AutoModelForCausalLM.from_pretrained(model_name, device_map="auto"),
    "./llama3-fine-tuned-lora"
)

# Test it
inputs = tokenizer("Tell me about quantum computing.", return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=256, do_sample=True, temperature=0.7)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))

# Compare against the base model for quality assessment
# Save test prompts and compare both models
with open("test_prompts.json") as f:
    prompts = json.load(f)
results = {"original": [], "fine_tuned": []}

for p in prompts:
    batch = tokenizer(p, return_tensors="pt").to(model.device)
    orig_out = original_model.generate(**batch, max_new_tokens=256)
    ft_out = model.generate(**batch, max_new_tokens=256)
    results["original"].append(tokenizer.decode(orig_out[0], skip_special_tokens=True))
    results["fine_tuned"].append(tokenizer.decode(ft_out[0], skip_special_tokens=True))

# Save comparison for review
with open("fine_tune_comparison.json", "w") as f:
    json.dump(results, f, indent=2)
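Once the comparison file exists, even a crude summary helps you spot obvious regressions before reading every pair by hand. A minimal sketch; the length statistic is just a first-pass signal (e.g. catching a model that starts rambling or truncating), not a quality metric:

```python
import json

def summarize_comparison(path: str) -> dict:
    """Average response length per model from the comparison JSON above."""
    with open(path) as f:
        results = json.load(f)
    return {
        name: sum(len(r) for r in responses) / max(1, len(responses))
        for name, responses in results.items()
    }

# Usage:
# print(summarize_comparison("fine_tune_comparison.json"))
```

For anything beyond a sanity check, pair this with task-specific evaluation (exact match, LLM-as-judge, or a held-out benchmark relevant to your fine-tuning data).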

Bottom Line

The DGX Spark is one of the few desk-side machines that can fine-tune 70B-class models without a distributed setup, cloud costs, or waiting for GPU availability. The unified memory isn't just a nice-to-have here; it's the reason it works at all. 128GB of coherent memory means you can load a quantized 72B model, its LoRA adapters, the optimizer state, and your training dataset simultaneously, all without swapping.

← Back to DGX Spark Projects