Local RAG Server on DGX Spark

Build a private, offline retrieval-augmented generation system for your documents, research, and knowledge bases — powered by the GB10's unified memory.

RAG (Retrieval-Augmented Generation) is where a local AI system becomes genuinely useful. Instead of asking a model to hallucinate answers, you give it your own documents and let it retrieve the relevant pieces before generating a response. Your DGX Spark is overqualified for this — and that's exactly what makes it great.

Why the Spark?

The unified memory architecture solves the biggest problem with local RAG: loading embedding models and vector stores without swapping. On a consumer GPU with 24GB VRAM, you're constantly fighting between the embedding model, the retriever, and the generation model. On the GB10 with 128GB coherent memory, they all coexist at full speed.

Stack Selection

| Component | Choice | Why |
| --- | --- | --- |
| Embedding Model | nomic-embed-text (GGUF) | Open, ~270MB, excellent quality, runs on CPU |
| Vector Database | ChromaDB (local) | Lightweight, no Docker needed, Python-native |
| Generation Model | Qwen3.6:35B (GGUF via Ollama) | Runs in ~32GB of unified memory, strong reasoning |
| Framework | LangChain or LiteLLM | LangChain for full features, LiteLLM for simplicity |
| Document Loader | PyPDF, markdown-it, csv | Handles PDFs, Markdown, CSV, JSON, DOCX |
| Chunking | RecursiveCharacterTextSplitter | 512-1024 character chunks with 50-100 character overlap |

Step 1: Embedding Model

Pull an embedding model via Ollama. The nomic-embed-text model is ~270MB and punches above its weight:

# Pull the embedding model
ollama pull nomic-embed-text

# Verify it's running
ollama list
# nomic-embed-text:latest  270MB
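
To confirm the model actually serves embeddings, you can hit Ollama's embeddings endpoint directly. A minimal sanity check in Python (assumes Ollama is listening on its default port, 11434):

# Quick sanity check: request one embedding straight from Ollama
import requests

resp = requests.post(
    "http://localhost:11434/api/embeddings",
    json={"model": "nomic-embed-text", "prompt": "unified memory architecture"},
    timeout=30,
)
resp.raise_for_status()
vector = resp.json()["embedding"]
print(f"Embedding dimension: {len(vector)}")  # should be 768 for nomic-embed-text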

Step 2: Document Ingestion Pipeline

# Note: recent LangChain releases ship these integrations in the langchain-community package
from langchain_community.document_loaders import PyPDFLoader, DirectoryLoader, TextLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_community.embeddings import OllamaEmbeddings
from langchain_community.vectorstores import Chroma
from langchain_community.llms import Ollama
from langchain.chains import RetrievalQA
import os

# Configuration
CHUNK_SIZE = 1024
CHUNK_OVERLAP = 100
EMBEDDING_MODEL = "nomic-embed-text"
GENERATION_MODEL = "qwen3.6:35b"
VECTOR_STORE_PATH = "/home/casper/dgx-spark-rag/vector-store"

# Load documents from directory
def load_documents(directory):
    # DirectoryLoader's glob doesn't expand {pdf,md,...} sets, so load the
    # text-like formats with TextLoader and PDFs with PyPDFLoader separately
    docs = []
    for pattern in ("**/*.md", "**/*.txt", "**/*.json", "**/*.csv"):
        docs += DirectoryLoader(
            directory,
            glob=pattern,
            loader_cls=TextLoader,
            loader_kwargs={'encoding': 'utf-8'}
        ).load()
    docs += DirectoryLoader(directory, glob="**/*.pdf", loader_cls=PyPDFLoader).load()
    print(f"Loaded {len(docs)} documents")
    return docs

# Split into chunks
def split_documents(documents):
    splitter = RecursiveCharacterTextSplitter(
        chunk_size=CHUNK_SIZE,
        chunk_overlap=CHUNK_OVERLAP,
        length_function=len
    )
    chunks = splitter.split_documents(documents)
    print(f"Split into {len(chunks)} chunks")
    return chunks

# Create embeddings and vector store
def create_vector_store(chunks):
    embeddings = OllamaEmbeddings(
        model=EMBEDDING_MODEL,
        base_url="http://localhost:11434"
    )
    vectorstore = Chroma.from_documents(
        chunks,
        embedding=embeddings,
        persist_directory=VECTOR_STORE_PATH
    )
    print(f"Vector store created with {vectorstore._collection.count()} documents")
    return vectorstore

# Create the RAG chain
def create_rag_chain(vectorstore):
    retriever = vectorstore.as_retriever(
        search_type="similarity",
        search_kwargs={"k": 5}  # Return top 5 relevant chunks
    )
    # RetrievalQA expects an LLM object, not a dict of Ollama API parameters
    llm = Ollama(model=GENERATION_MODEL, base_url="http://localhost:11434", temperature=0)
    qa = RetrievalQA.from_chain_type(
        llm=llm,
        chain_type="stuff",
        retriever=retriever,
        return_source_documents=True
    )
    return qa

# Build the pipeline
docs = load_documents("/home/casper/dgx-spark-rag/documents")
chunks = split_documents(docs)
vectorstore = create_vector_store(chunks)
rag_chain = create_rag_chain(vectorstore)

# Query it
result = rag_chain.invoke("What does the research say about transformer efficiency?")
print(result['result'])
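
Because Chroma persists everything under VECTOR_STORE_PATH, later sessions can reload the existing index instead of re-embedding. A minimal sketch, assuming the constants and create_rag_chain from the script above are in scope:

# Reopen the persisted index in a new session (no re-embedding needed)
from langchain_community.embeddings import OllamaEmbeddings
from langchain_community.vectorstores import Chroma

embeddings = OllamaEmbeddings(model=EMBEDDING_MODEL, base_url="http://localhost:11434")
vectorstore = Chroma(
    persist_directory=VECTOR_STORE_PATH,
    embedding_function=embeddings
)
rag_chain = create_rag_chain(vectorstore)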

Step 3: Handling Large Documents

With 128GB of memory, the Spark can handle document sets that would crash smaller machines. Here's a batch ingestion script for large repositories:

import os
from langchain_community.document_loaders import PyPDFLoader, Docx2txtLoader

# Same persist directory as the pipeline above (this script can run standalone)
VECTOR_STORE_PATH = "/home/casper/dgx-spark-rag/vector-store"

def ingest_large_collection(root_dir, chunk_size=512):
    """Ingest a large collection of documents with intelligent chunking."""
    from langchain.text_splitter import RecursiveCharacterTextSplitter
    from langchain_community.embeddings import OllamaEmbeddings
    from langchain_community.vectorstores import Chroma
    
    splitter = RecursiveCharacterTextSplitter(
        chunk_size=chunk_size,
        chunk_overlap=50,
        separators=["\n\n", "\n", ".", " ", ""]
    )
    
    all_chunks = []
    metadata = []
    
    for dirpath, _, filenames in os.walk(root_dir):
        for filename in filenames:
            filepath = os.path.join(dirpath, filename)
            
            if filename.endswith('.pdf'):
                loader = PyPDFLoader(filepath)
                docs = loader.load()
                chunks = splitter.split_documents(docs)
                all_chunks.extend(chunks)
                metadata.append({
                    'source': filepath,
                    'type': 'pdf',
                    'num_pages': len(docs)  # PyPDFLoader yields one Document per page
                })
            elif filename.endswith('.docx'):
                loader = Docx2txtLoader(filepath)
                docs = loader.load()
                chunks = splitter.split_documents(docs)
                all_chunks.extend(chunks)
                metadata.append({'source': filepath, 'type': 'docx'})
    
    # Ingest all at once
    embeddings = OllamaEmbeddings(model="nomic-embed-text", base_url="http://localhost:11434")
    db = Chroma.from_documents(
        all_chunks,
        embeddings,
        persist_directory=VECTOR_STORE_PATH
    )
    
    print(f"Total chunks: {len(all_chunks)}")
    print(f"Total size: {len(all_chunks) * 1024 / 1024 / 1024:.1f} GB (at {chunk_size} tokens/chunk)")
    return db

# With 128GB RAM, you can easily ingest 100K+ documents
db = ingest_large_collection("/home/casper/documents")
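
If a single from_documents() call over a huge corpus feels unwieldy, you can also feed the store in batches and watch progress as you go. A sketch along those lines (the batch size of 500 is an arbitrary assumption, tune it for your corpus):

# Optional: add chunks to the persisted store in batches
from langchain_community.embeddings import OllamaEmbeddings
from langchain_community.vectorstores import Chroma

def ingest_in_batches(all_chunks, batch_size=500, persist_dir=VECTOR_STORE_PATH):
    embeddings = OllamaEmbeddings(model="nomic-embed-text", base_url="http://localhost:11434")
    db = Chroma(persist_directory=persist_dir, embedding_function=embeddings)
    for i in range(0, len(all_chunks), batch_size):
        db.add_documents(all_chunks[i:i + batch_size])
        print(f"Embedded {min(i + batch_size, len(all_chunks))}/{len(all_chunks)} chunks")
    return db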

Step 4: Query API

Expose your RAG system as an HTTP API so other apps can query it:

from fastapi import FastAPI
from pydantic import BaseModel
from typing import List, Optional

app = FastAPI(title="DGX Spark RAG API")

class QueryRequest(BaseModel):
    question: str
    k: int = 5
    temperature: float = 0.0
    include_sources: bool = True

class QueryResponse(BaseModel):
    answer: str
    sources: List[str]
    query_time_ms: float

@app.post("/query", response_model=QueryResponse)
async def query_rag(req: QueryRequest):
    import time
    start = time.time()
    
    # vectorstore, GENERATION_MODEL, Ollama and RetrievalQA are defined in the
    # ingestion code above (same module, e.g. rag_api.py)
    retriever = vectorstore.as_retriever(search_type="similarity", search_kwargs={"k": req.k})
    llm = Ollama(model=GENERATION_MODEL, base_url="http://localhost:11434", temperature=req.temperature)
    qa = RetrievalQA.from_chain_type(
        llm=llm,
        retriever=retriever,
        return_source_documents=True  # required for the sources field below
    )
    
    result = qa.invoke(req.question)
    
    sources = [doc.metadata.get('source', 'unknown') 
               for doc in result.get('source_documents', [])]
    
    elapsed = (time.time() - start) * 1000
    
    return QueryResponse(
        answer=result['result'],
        sources=sources if req.include_sources else [],
        query_time_ms=round(elapsed, 1)
    )

# Run with: uvicorn rag_api:app --host 0.0.0.0 --port 8000
# Access at: http://your-spark-ip:8000/docs
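
Once the server is up, any app on your network can query it. A minimal Python client, assuming the API is reachable at localhost:8000:

# Minimal client for the /query endpoint above
import requests

resp = requests.post(
    "http://localhost:8000/query",
    json={"question": "What does the research say about transformer efficiency?", "k": 5},
    timeout=120,
)
resp.raise_for_status()
data = resp.json()
print(data["answer"])
print("Sources:", data["sources"])
print(f"Answered in {data['query_time_ms']} ms")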

Step 5: Optimizing Query Performance

Even on the Spark, RAG latency depends on several factors. Here's how to optimize:

# 1. Use a larger embedding model for better retrieval accuracy
ollama pull mxbai-embed-large  # Better quality, ~400MB

# 2. Use metadata filtering for targeted retrieval
retriever = vectorstore.as_retriever(
    search_type="mmr",  # Maximum Marginal Relevance (diverse results)
    search_kwargs={
        "k": 8,
        "fetch_k": 20,
        "filter": {"source_type": "research_paper"}
    }
)

# 3. Increase chunk size for dense documents
# (legal docs, technical specs benefit from larger chunks)
splitter = RecursiveCharacterTextSplitter(
    chunk_size=2048,  # Larger chunks for dense content
    chunk_overlap=200
)

# 4. Use the GPU for generation to keep response times low
# The GB10's GPU can generate text in real-time while the embedding model
# handles retrieval on the CPU — true parallelism thanks to unified memory
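
One caveat for tip #1: embeddings from different models aren't comparable, so switching to mxbai-embed-large means re-embedding your chunks into a separate collection rather than mixing vectors in the existing one. A sketch (the vector-store-mxbai path is just an illustrative choice):

# Re-embed into a separate store when changing embedding models
from langchain_community.embeddings import OllamaEmbeddings
from langchain_community.vectorstores import Chroma

better_embeddings = OllamaEmbeddings(model="mxbai-embed-large", base_url="http://localhost:11434")
vectorstore_v2 = Chroma.from_documents(
    chunks,  # the chunks produced by the ingestion pipeline in Step 2
    embedding=better_embeddings,
    persist_directory="/home/casper/dgx-spark-rag/vector-store-mxbai"
)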

Step 6: Adding a Web Dashboard

A simple web UI lets you query the system without writing code:

from fastapi import FastAPI, Request
from fastapi.responses import HTMLResponse
from fastapi.staticfiles import StaticFiles

app = FastAPI(title="DGX Spark RAG")
app.mount("/static", StaticFiles(directory="/home/casper/dgx-spark-rag/static"), name="static")

@app.get("/", response_class=HTMLResponse)
async def home():
    return open("/home/casper/dgx-spark-rag/static/index.html").read()

💡 Pro tip: The GB10's unified memory means the embedding model, vector store, and generation model all live in the same memory pool. There's no separate VRAM budget to juggle; everything shares that 128GB pool. This is the architecture's killer advantage for RAG.

Performance Expectations

Exact numbers depend on your corpus, chunk size, and quantization, but the shape of the latency is consistent: retrieval (embedding the query plus the vector search) is fast, and end-to-end response time is dominated by generation. Because every component stays resident in unified memory, there is no model reloading or swapping between queries.

The Bottom Line

A local RAG server on the DGX Spark is something most people never get to build. It's not just "private" — it's massively overpowered for the task. The same hardware that runs a 35B parameter model can also process embeddings, manage vector stores, and generate responses simultaneously, all in memory, all in real-time.
