RAG (Retrieval-Augmented Generation) is where a local AI system becomes genuinely useful. Instead of asking a model to hallucinate answers, you give it your own documents and let it retrieve the relevant pieces before generating a response. Your DGX Spark is overqualified for this — and that's exactly what makes it great.
Why the Spark?
The unified memory architecture solves the biggest problem with local RAG: loading embedding models and vector stores without swapping. On a consumer GPU with 24GB VRAM, you're constantly fighting between the embedding model, the retriever, and the generation model. On the GB10 with 128GB coherent memory, they all coexist at full speed.
Stack Selection
| Component | Choice | Why |
|---|---|---|
| Embedding Model | nomic-embed-text (GGUF) | Open, ~270MB, excellent quality, runs on CPU |
| Vector Database | ChromaDB (local) | Lightweight, no Docker needed, Python-native |
| Generation Model | Qwen3.6:35B (GGUF via Ollama) | Fits in ~32GB of unified memory, strong reasoning |
| Framework | LangChain or LiteLLM | LangChain for full features, LiteLLM for simplicity |
| Document Loader | PyPDFLoader, TextLoader, Docx2txtLoader | Handle PDFs, Markdown, text, CSV, JSON, DOCX |
| Chunking | RecursiveCharacterTextSplitter | 512-1024 character chunks with 100-character overlap (the splitter counts characters, not tokens, by default) |
Step 1: Embedding Model
Pull an embedding model via Ollama. The nomic-embed-text model is ~270MB and punches above its weight:
# Pull the embedding model
ollama pull nomic-embed-text
# Verify it's running
ollama list
# nomic-embed-text:latest 270MB
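Before wiring it into a pipeline, it's worth confirming the model actually returns vectors. A minimal sanity check against Ollama's embeddings endpoint (assuming the default port 11434 and the requests package):
# Quick sanity check: ask Ollama for a single embedding
import requests

resp = requests.post(
    "http://localhost:11434/api/embeddings",
    json={"model": "nomic-embed-text", "prompt": "unified memory architecture"},
)
vector = resp.json()["embedding"]
print(f"Embedding dimension: {len(vector)}")  # nomic-embed-text produces 768-dim vectors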
Step 2: Document Ingestion Pipeline
from langchain.document_loaders import PyPDFLoader, DirectoryLoader, TextLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.embeddings import OllamaEmbeddings
from langchain.llms import Ollama
from langchain.vectorstores import Chroma
from langchain.chains import RetrievalQA
import os
# Configuration
CHUNK_SIZE = 1024
CHUNK_OVERLAP = 100
EMBEDDING_MODEL = "nomic-embed-text"
GENERATION_MODEL = "qwen3.6:35b"
VECTOR_STORE_PATH = "/home/casper/dgx-spark-rag/vector-store"
# Load documents from directory
def load_documents(directory):
    # DirectoryLoader's glob doesn't expand {a,b} patterns, so load each
    # text-like extension separately and give PDFs their own loader
    docs = []
    for pattern in ("**/*.md", "**/*.txt", "**/*.json", "**/*.csv"):
        loader = DirectoryLoader(
            directory,
            glob=pattern,
            loader_cls=TextLoader,
            loader_kwargs={'encoding': 'utf-8'}
        )
        docs.extend(loader.load())
    docs.extend(DirectoryLoader(directory, glob="**/*.pdf", loader_cls=PyPDFLoader).load())
    print(f"Loaded {len(docs)} documents")
    return docs
# Split into chunks
def split_documents(documents):
splitter = RecursiveCharacterTextSplitter(
chunk_size=CHUNK_SIZE,
chunk_overlap=CHUNK_OVERLAP,
length_function=len
)
chunks = splitter.split_documents(documents)
print(f"Split into {len(chunks)} chunks")
return chunks
# Create embeddings and vector store
def create_vector_store(chunks):
embeddings = OllamaEmbeddings(
model=EMBEDDING_MODEL,
base_url="http://localhost:11434"
)
vectorstore = Chroma.from_documents(
chunks,
embedding=embeddings,
persist_directory=VECTOR_STORE_PATH
)
print(f"Vector store created with {vectorstore._collection.count()} documents")
return vectorstore
# Create the RAG chain
def create_rag_chain(vectorstore):
    retriever = vectorstore.as_retriever(
        search_type="similarity",
        search_kwargs={"k": 5}  # Return top 5 relevant chunks
    )
    llm = Ollama(
        model=GENERATION_MODEL,
        base_url="http://localhost:11434",
        temperature=0
    )
    qa = RetrievalQA.from_chain_type(
        llm=llm,
        chain_type="stuff",  # Stuff all retrieved chunks into one prompt
        retriever=retriever,
        return_source_documents=True
    )
    return qa
# Build the pipeline
docs = load_documents("/home/casper/dgx-spark-rag/documents")
chunks = split_documents(docs)
vectorstore = create_vector_store(chunks)
rag_chain = create_rag_chain(vectorstore)
# Query it
result = rag_chain.invoke("What does the research say about transformer efficiency?")
print(result['result'])
Step 3: Handling Large Documents
With 128GB of memory, the Spark can handle document sets that would crash smaller machines. Here's a batch ingestion script for large repositories:
import os
from langchain.document_loaders import Docx2txtLoader
def ingest_large_collection(root_dir, chunk_size=512):
"""Ingest a large collection of documents with intelligent chunking."""
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.embeddings import OllamaEmbeddings
from langchain.vectorstores import Chroma
from langchain.document_loaders import PyPDFLoader
splitter = RecursiveCharacterTextSplitter(
chunk_size=chunk_size,
chunk_overlap=50,
separators=["\n\n", "\n", ".", " ", ""]
)
all_chunks = []
metadata = []
for dirpath, _, filenames in os.walk(root_dir):
for filename in filenames:
filepath = os.path.join(dirpath, filename)
if filename.endswith('.pdf'):
loader = PyPDFLoader(filepath)
docs = loader.load()
chunks = splitter.split_documents(docs)
all_chunks.extend(chunks)
                metadata.append({
                    'source': filepath,
                    'type': 'pdf',
                    'num_pages': len(docs)  # PyPDFLoader returns one Document per page
                })
elif filename.endswith('.docx'):
loader = Docx2txtLoader(filepath)
docs = loader.load()
chunks = splitter.split_documents(docs)
all_chunks.extend(chunks)
metadata.append({'source': filepath, 'type': 'docx'})
# Ingest all at once
embeddings = OllamaEmbeddings(model="nomic-embed-text", base_url="http://localhost:11434")
db = Chroma.from_documents(
all_chunks,
embeddings,
persist_directory=VECTOR_STORE_PATH
)
print(f"Total chunks: {len(all_chunks)}")
print(f"Total size: {len(all_chunks) * 1024 / 1024 / 1024:.1f} GB (at {chunk_size} tokens/chunk)")
return db
# With 128GB RAM, you can easily ingest 100K+ documents
db = ingest_large_collection("/home/casper/documents")
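Because the store is persisted to disk, you don't re-ingest on every run. Reopening it in a later session is just a constructor call; a sketch, assuming the same embedding model and VECTOR_STORE_PATH as above:
# Reopen the persisted Chroma store without re-embedding anything
from langchain.embeddings import OllamaEmbeddings
from langchain.vectorstores import Chroma

embeddings = OllamaEmbeddings(model="nomic-embed-text", base_url="http://localhost:11434")
db = Chroma(persist_directory=VECTOR_STORE_PATH, embedding_function=embeddings)

# Spot-check retrieval
for doc in db.similarity_search("transformer efficiency", k=3):
    print(doc.metadata.get("source", "unknown"), "->", doc.page_content[:80])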
Step 4: Query API
Expose your RAG system as an HTTP API so other apps can query it:
from fastapi import FastAPI
from pydantic import BaseModel
from typing import List

# Reuses vectorstore, GENERATION_MODEL, and the Ollama LLM class from the Step 2 script
app = FastAPI(title="DGX Spark RAG API")
class QueryRequest(BaseModel):
question: str
k: int = 5
temperature: float = 0.0
include_sources: bool = True
class QueryResponse(BaseModel):
answer: str
sources: List[str]
query_time_ms: float
@app.post("/query", response_model=QueryResponse)
async def query_rag(req: QueryRequest):
import time
start = time.time()
retriever = vectorstore.as_retriever(search_type="similarity", search_kwargs={"k": req.k})
    qa = RetrievalQA.from_chain_type(
        llm=Ollama(model=GENERATION_MODEL, base_url="http://localhost:11434",
                   temperature=req.temperature),
        retriever=retriever,
        return_source_documents=True  # Needed so sources can be returned to the client
    )
result = qa.invoke(req.question)
sources = [doc.metadata.get('source', 'unknown')
for doc in result.get('source_documents', [])]
elapsed = (time.time() - start) * 1000
return QueryResponse(
answer=result['result'],
sources=sources if req.include_sources else [],
query_time_ms=round(elapsed, 1)
)
# Run with: uvicorn rag_api:app --host 0.0.0.0 --port 8000
# Access at: http://your-spark-ip:8000/docs
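Any HTTP client can hit the endpoint. A sketch of a client call with requests (the host below is a placeholder for your Spark's IP):
import requests

resp = requests.post(
    "http://your-spark-ip:8000/query",
    json={"question": "What does the research say about transformer efficiency?", "k": 5},
)
data = resp.json()
print(data["answer"])
print("Sources:", data["sources"])
print(f"Query time: {data['query_time_ms']} ms")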
Step 5: Optimizing Query Performance
Even on the Spark, RAG latency depends on several factors. Here's how to optimize:
# 1. Use a larger embedding model for better retrieval accuracy
#    (re-embed your documents after switching; vectors from different embedding models aren't comparable)
ollama pull mxbai-embed-large  # Higher quality, larger download
# 2. Use metadata filtering for targeted retrieval
retriever = vectorstore.as_retriever(
search_type="mmr", # Maximum Marginal Relevance (diverse results)
search_kwargs={
"k": 8,
"fetch_k": 20,
"filter": {"source_type": "research_paper"}
}
)
# 3. Increase chunk size for dense documents
# (legal docs, technical specs benefit from larger chunks)
splitter = RecursiveCharacterTextSplitter(
chunk_size=2048, # Larger chunks for dense content
chunk_overlap=200
)
# 4. Use the GPU for generation to keep response times low
# The GB10's GPU can generate text in real-time while the embedding model
# handles retrieval on the CPU — true parallelism thanks to unified memory
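One more latency lever: Ollama unloads idle models after a few minutes, so the first query after a quiet period pays the full model-load cost. A small sketch that preloads the generation model and keeps it resident (keep_alive=-1 means "never unload"):
# Preload the generation model and keep it resident in memory
import requests

requests.post(
    "http://localhost:11434/api/generate",
    json={"model": "qwen3.6:35b", "prompt": "", "keep_alive": -1},
)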
Step 6: Adding a Web Dashboard
A simple web UI lets you query the system without writing code:
from fastapi import FastAPI, Request
from fastapi.responses import HTMLResponse
from fastapi.staticfiles import StaticFiles
app = FastAPI(title="DGX Spark RAG")
app.mount("/static", StaticFiles(directory="/home/casper/dgx-spark-rag/static"), name="static")
@app.get("/", response_class=HTMLResponse)
async def home():
return open("/home/casper/dgx-spark-rag/static/index.html").read()
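The index.html itself isn't shown here. If you'd rather not manage a static directory at all, a minimal inline page works as a starting point; this sketch replaces the file-based handler above and assumes the /query endpoint from Step 4 lives on the same app:
# Minimal inline dashboard: a textarea that POSTs to /query via fetch()
MINIMAL_PAGE = """
<!doctype html>
<title>DGX Spark RAG</title>
<textarea id="q" rows="3" cols="80" placeholder="Ask your documents..."></textarea><br>
<button onclick="ask()">Ask</button>
<pre id="out"></pre>
<script>
async function ask() {
  const r = await fetch('/query', {
    method: 'POST',
    headers: {'Content-Type': 'application/json'},
    body: JSON.stringify({question: document.getElementById('q').value})
  });
  const data = await r.json();
  document.getElementById('out').textContent =
    data.answer + '\\n\\nSources:\\n' + data.sources.join('\\n');
}
</script>
"""

@app.get("/", response_class=HTMLResponse)
async def home():
    return MINIMAL_PAGE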
Performance Expectations
- Embedding 1 document (100 pages): ~2-3 seconds (CPU, parallel)
- Vector search over 10K docs: ~50-200ms (ChromaDB's HNSW index, runs on CPU)
- Full RAG query (retrieve + generate): ~3-8 seconds on Qwen3.6:35B
- Indexing 100K documents: ~15-30 minutes (depends on file types)
- Simultaneous queries: the GB10 can handle 5-10 concurrent queries without noticeable degradation (see the load-test sketch below)
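These numbers will vary with your corpus and models. The concurrency claim in particular is easy to verify with a small load test against the /query endpoint (a sketch with a placeholder host and 8 parallel requests):
# Fire a handful of concurrent queries and report per-request latency
import time
import requests
from concurrent.futures import ThreadPoolExecutor

def timed_query(question):
    start = time.time()
    requests.post("http://your-spark-ip:8000/query", json={"question": question})
    return time.time() - start

questions = [f"Summarize document set {i}" for i in range(8)]
with ThreadPoolExecutor(max_workers=8) as pool:
    latencies = list(pool.map(timed_query, questions))

print([f"{t:.1f}s" for t in sorted(latencies)])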
The Bottom Line
A local RAG server on the DGX Spark is something most people never get to build. It's not just "private" — it's massively overpowered for the task. The same hardware that runs a 35B parameter model can also process embeddings, manage vector stores, and generate responses simultaneously, all in memory, all in real-time.