Computer Vision on DGX Spark

Running vision models on the GB10 Blackwell: CLIP, SAM, video analysis, and where the unified memory architecture actually shines.

When people talk about the DGX Spark, they talk about LLMs. But the Blackwell GPU in the GB10 is genuinely powerful for vision workloads too — and the unified memory architecture gives it a specific advantage over consumer GPUs that most people don't realize until they try it.

Why Unified Memory Matters for Vision

On a consumer GPU (RTX 4090 with 24GB VRAM), vision models face a fundamental constraint: the image, the model weights, and intermediate activations all compete for the same VRAM pool. Batch size is the casualty.

On the GB10 with 128GB of coherent memory, the weights, activations, and image batches all draw from one pool — the ceiling on batch size becomes throughput, not a hard VRAM limit.
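To make that constraint concrete, here's a back-of-envelope sketch (plain Python, no GPU required) of how much memory just the input tensors consume as batch size grows, assuming uncompressed float32 RGB frames — illustrative arithmetic, not measured figures:

```python
def batch_input_gb(batch_size, height=2160, width=3840, channels=3, bytes_per_elem=4):
    """Memory for a batch of float32 RGB 4K frames, in GB.
    This is inputs only — weights and activations come on top."""
    return batch_size * height * width * channels * bytes_per_elem / 1e9

print(f"{batch_input_gb(1):.2f} GB")    # a single 4K frame → 0.10 GB
print(f"{batch_input_gb(128):.1f} GB")  # a 128-frame batch → 12.7 GB
```

A 128-frame float32 batch already eats more than half of a 24GB card before the model itself has allocated anything, which is why batch size is the first casualty on consumer GPUs.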

CLIP — Image Classification & Embedding

CLIP (Contrastive Language-Image Pretraining) is the workhorse of vision-language models. On the GB10:

pip install transformers torch

from transformers import CLIPModel, CLIPProcessor
from PIL import Image
import torch

# Load CLIP model (ViT-L/14, ~1.7GB in fp32 — fits trivially)
model_name = "openai/clip-vit-large-patch14"
processor = CLIPProcessor.from_pretrained(model_name)
device = "cuda" if torch.cuda.is_available() else "cpu"
model = CLIPModel.from_pretrained(model_name).to(device)

# Process images
def classify_images(image_paths, class_names):
    """Zero-shot image classification with CLIP."""
    images = [Image.open(p).convert("RGB") for p in image_paths]
    inputs = processor(text=class_names, images=images, return_tensors="pt", padding=True).to(device)
    with torch.no_grad():
        outputs = model(**inputs)
    probs = outputs.logits_per_image.softmax(dim=1).cpu().numpy()

    for i, path in enumerate(image_paths):
        top_class = class_names[probs[i].argmax()]
        confidence = probs[i].max()
        print(f"{path}: {top_class} ({confidence:.2%})")

    return probs

# Zero-shot on the GB10
# Process hundreds of images in one batch — batch size is the only constraint
image_paths = ["cat.jpg", "dog.jpg"]  # substitute your own paths
classifications = classify_images(image_paths, ["cat", "dog", "bird", "fish", "car", "person"])
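Under the hood, logits_per_image is just a scaled cosine-similarity matrix between L2-normalized image and text embeddings. A minimal NumPy sketch of that computation with synthetic embeddings (no model download needed; the logit scale of 100.0 approximates CLIP's learned temperature):

```python
import numpy as np

def clip_logits(image_embeds, text_embeds, logit_scale=100.0):
    """Scaled cosine similarity between image and text embeddings,
    mirroring how CLIP produces logits_per_image."""
    image_embeds = image_embeds / np.linalg.norm(image_embeds, axis=-1, keepdims=True)
    text_embeds = text_embeds / np.linalg.norm(text_embeds, axis=-1, keepdims=True)
    return logit_scale * image_embeds @ text_embeds.T

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

# Synthetic stand-ins: 4 images, 6 class prompts, ViT-L/14's 768-dim space
rng = np.random.default_rng(0)
img = rng.standard_normal((4, 768))
txt = rng.standard_normal((6, 768))
probs = softmax(clip_logits(img, txt))  # shape (4, 6), rows sum to 1
```

The high logit scale is what makes the softmax sharp: raw cosine similarities between normalized embeddings live in [-1, 1], far too flat to act as class logits on their own.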

SAM — Segment Anything

SAM (Segment Anything Model) is a large vision model. The GB10 handles it well: the ViT-H checkpoint plus its high-resolution activations fit comfortably in unified memory.

pip install segment-anything torch

import cv2
from segment_anything import SamPredictor, sam_model_registry

# SAM-H (Huge) model — ~2.4GB checkpoint, easily fits in unified memory
sam = sam_model_registry["vit_h"](checkpoint="sam_vit_h_4b8939.pth")
sam.to("cuda")
predictor = SamPredictor(sam)

def segment_image(image_path, prompts):
    """Segment an image with multiple point/box prompts."""
    image = cv2.cvtColor(cv2.imread(image_path), cv2.COLOR_BGR2RGB)  # SAM expects RGB
    predictor.set_image(image)
    
    results = []
    for prompt_type, coords in prompts.items():
        if prompt_type == "points":
            masks, scores, _ = predictor.predict(
                point_coords=coords['points'],
                point_labels=coords['labels'],
                multimask_output=True
            )
        elif prompt_type == "boxes":
            masks, scores, _ = predictor.predict(
                box=coords['boxes'],
                multimask_output=True
            )
        results.append((masks, scores))
    
    return results

# SAM on the GB10 advantage:
# You can process multiple high-res images in parallel
# (24GB VRAM on a 4090 would OOM with SAM-H + batch processing)
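With multimask_output=True, SAM returns several candidate masks per prompt, each with a quality score. A small helper (plain NumPy, shown here on synthetic data — not part of the segment-anything API) for keeping the best candidate and reporting its coverage:

```python
import numpy as np

def best_mask(masks, scores):
    """Pick the highest-scoring mask and its area fraction.
    masks: (N, H, W) boolean array; scores: (N,) floats."""
    i = int(np.argmax(scores))
    mask = masks[i]
    coverage = mask.mean()  # fraction of image pixels the mask covers
    return mask, float(scores[i]), float(coverage)

# Synthetic example: three 8x8 candidate masks
masks = np.zeros((3, 8, 8), dtype=bool)
masks[0, :2, :2] = True   # covers 4/64 pixels
masks[1, :4, :4] = True   # covers 16/64 pixels
masks[2, :1, :1] = True
scores = np.array([0.7, 0.9, 0.5])
mask, score, coverage = best_mask(masks, scores)
print(score, coverage)  # 0.9 0.25
```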

Video Analysis Pipeline

Video is where the GB10's unified memory really flexes. Load frames, process them with a vision model, and keep the buffer — all without VRAM fragmentation:

import cv2
import torch
from transformers import AutoProcessor, AutoModel

# Frame embeddings with a ViT image encoder
processor = AutoProcessor.from_pretrained("google/vit-base-patch16-224")
model = AutoModel.from_pretrained("google/vit-base-patch16-224")

def analyze_video(video_path, frame_interval=30):
    """Embed every Nth video frame with the ViT encoder."""
    cap = cv2.VideoCapture(video_path)
    frames = []
    frame_idx = 0

    while cap.isOpened():
        ret, frame = cap.read()
        if not ret:
            break

        # Keep every Nth frame, converting OpenCV's BGR to RGB
        if frame_idx % frame_interval == 0:
            frames.append(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB))
        frame_idx += 1

    cap.release()
    print(f"Extracted {len(frames)} frames")
    
    # Process all frames at once (possible because of 128GB unified memory)
    # A raw 4K RGB frame is ~25MB — 100 frames ≈ 2.5GB of pixel data
    # A 4090 with 24GB VRAM struggles with batch sizes beyond 8-16 for 4K
    inputs = processor(images=frames, return_tensors="pt")
    with torch.no_grad():
        outputs = model(**inputs)
    
    return outputs.last_hidden_state

# Real-world: video surveillance analysis
# Run object detection on 10+ cameras simultaneously
# Each camera at 1080p, processing at 5fps
# Total: 50 frames/sec throughput on GB10
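One practical use of those per-frame embeddings is cheap scene-change detection: flag any frame whose embedding drifts far from the previous one. A sketch with synthetic NumPy stand-ins for the model outputs above (the 0.5 threshold is an illustrative choice, not a tuned value):

```python
import numpy as np

def scene_changes(embeddings, threshold=0.5):
    """Indices where cosine similarity to the previous frame drops
    below the threshold. embeddings: (num_frames, dim) array."""
    normed = embeddings / np.linalg.norm(embeddings, axis=-1, keepdims=True)
    sims = (normed[1:] * normed[:-1]).sum(axis=-1)  # pairwise cosine similarity
    return [i + 1 for i, s in enumerate(sims) if s < threshold]

# Synthetic: frames 0-2 are near-duplicates, frame 3 is a hard cut
emb = np.array([[1.0, 0.0], [0.99, 0.1], [0.98, 0.15], [0.0, 1.0]])
print(scene_changes(emb))  # [3]
```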

Object Detection — YOLO on the Spark

pip install ultralytics

from ultralytics import YOLO

# Load YOLO11 (nano — smallest, fastest variant)
model = YOLO("yolo11n.pt")

# Run on image
results = model.predict("image.jpg", conf=0.25, iou=0.45)

# Run on video stream
results = model.predict("video.mp4", stream=True)
for result in results:
    boxes = result.boxes
    for box in boxes:
        x1, y1, x2, y2 = box.xyxy[0].cpu().numpy()
        conf = box.conf[0].item()
        cls_name = model.names[int(box.cls)]
        print(f"{cls_name}: {conf:.2f} at [{x1:.0f},{y1:.0f},{x2:.0f},{y2:.0f}]")

# Multi-camera setup
# Process several camera feeds concurrently — one thread per stream
import threading

cameras = [
    "rtsp://cam1:8080/stream",
    "rtsp://cam2:8080/stream",
    "rtsp://cam3:8080/stream",
    "rtsp://cam4:8080/stream",
]

def run_feed(url):
    # stream=True returns a generator that must be consumed,
    # so each feed runs its own model instance in its own thread
    feed_model = YOLO("yolo11n.pt")
    for result in feed_model.predict(url, stream=True, device="cuda", conf=0.3):
        pass  # handle detections per frame here

threads = [threading.Thread(target=run_feed, args=(url,), daemon=True) for url in cameras]
for t in threads:
    t.start()

Performance Notes on the GB10

Actual performance depends on model, batch size, and input dimensions. The GB10's advantage is capacity, not raw speed — it can handle workloads that OOM on consumer GPUs:

| Capability | GB10 | Typical consumer GPU | Notes |
| --- | --- | --- | --- |
| CLIP batch size | 128+ images | 8-16 images | GB10 handles much larger batches without VRAM pressure |
| SAM-H resolution | 4096x4096+ | ~1024x1024 | SAM-H OOMs on 24GB VRAM at high resolution |
| Simultaneous video feeds | 10+ @ 1080p | 3-4 @ 1080p | Unified memory eliminates VRAM fragmentation |
| Per-frame latency | Varies | Often faster | Consumer GPUs may beat the GB10 on raw per-frame speed |
⚠️ Note: The GB10's GPU is powerful, but for raw frame-by-frame inference on standard models (YOLO, EfficientNet), consumer GPUs like the 4090 can still have a per-frame speed advantage in some cases. The GB10's edge is batch capacity and model size — not per-frame latency.

Best Vision Use Cases for the Spark

Drawing the sections above together: large-batch CLIP embedding and classification, high-resolution SAM segmentation, multi-camera video analytics, and pipelines that combine vision models with language models in the same memory pool.

The Bottom Line

For single-image classification or YOLO detection on a 1080p stream, a consumer GPU is fine. But when you need to process many images at once, high-resolution images simultaneously, or combine vision models with language models in the same pipeline, the GB10's unified memory architecture is a significant advantage.
