Why Build a Private API?
Instead of sending your prompts to OpenAI, Anthropic, or Google, run everything locally on the DGX Spark. Benefits:
- No cloud round-trip — local inference starts responding in ~200ms vs 500-2000ms for a hosted API
- Complete privacy — no data leaves your network
- Unlimited usage — no per-token billing, no rate limits
- OpenRouter-compatible — any app that talks to OpenRouter or OpenAI can use your Spark
- Lower marginal cost — cloud APIs charge per million tokens; locally you pay only for electricity and hardware, never per request
Architecture
┌─────────────┐     ┌───────────────┐     ┌──────────────────┐     ┌──────────────────┐
│   Clients   │────▶│ Nginx/Traefik │────▶│  Ollama Server   │────▶│     GB10 GPU     │
│ (any device)│     │ (HTTPS + auth)│     │  :11434 (HTTP)   │     │ (128GB unified)  │
└─────────────┘     └───────────────┘     └──────────────────┘     └──────────────────┘
Step 1: Install and Configure Ollama
# Install Ollama (if not already)
curl -fsSL https://ollama.com/install.sh | sh
# Pull your models
ollama pull qwen3:32b
ollama pull phi3.5:3.8b # for multi-model routing
# Configure Ollama for remote access (systemctl edit creates the override file)
sudo systemctl edit ollama
# Add:
# [Service]
# Environment="OLLAMA_HOST=0.0.0.0:11434"
# Environment="OLLAMA_ORIGINS=*"
# Environment="OLLAMA_KEEP_ALIVE=-1"
# (KEEP_ALIVE=-1 keeps the model loaded in memory indefinitely; systemd unit
# files do not allow trailing comments, so keep each Environment line clean)
sudo systemctl daemon-reload
sudo systemctl restart ollama
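Note that OLLAMA_HOST=0.0.0.0 exposes the port on every interface. If the Spark shares a network with untrusted devices, restrict the port to your LAN first; a minimal sketch with ufw, assuming a 192.168.1.0/24 subnet (adjust to your network):
# Allow the Ollama port only from the local subnet
sudo ufw allow from 192.168.1.0/24 to any port 11434 proto tcp
sudo ufw enable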
# Test locally
curl http://localhost:11434/api/generate \
  -d '{"model": "qwen3:32b", "prompt": "Hello", "stream": false}'
# Output (durations are reported in nanoseconds):
# {
#   "model": "qwen3:32b",
#   "response": "Hello! How can I help you today?",
#   "done": true,
#   "total_duration": 3325000000,
#   "load_duration": 3200000000
# }
# (the first call includes the one-time model load; subsequent calls are much faster)
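Then verify the server is reachable from another machine on the LAN (this assumes the hostname spark.local resolves; substitute the Spark's IP address otherwise):
# From a laptop on the same network — lists the installed models
curl http://spark.local:11434/api/tags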
Step 2: OpenRouter-Compatible API Endpoint
Ollama already exposes an OpenAI-compatible API under /v1, which is the same request format OpenRouter speaks, so most clients work unmodified. A lightweight gateway in front is only needed for extras such as per-key accounting or custom routing:
# Using Docker for the API gateway (image name is a placeholder for your own build;
# on Linux, host.docker.internal requires the --add-host mapping below)
docker run -d \
  --name ai-gateway \
  -p 8443:8443 \
  --add-host=host.docker.internal:host-gateway \
  -e OLLAMA_URL=http://host.docker.internal:11434 \
  ghcr.io/your-repo/ai-gateway:latest
# Or use the built-in Ollama API directly — it's already OpenAI-compatible:
# Chat endpoint (OpenAI format)
curl http://localhost:11434/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "qwen3:35b",
"messages": [
{"role": "system", "content": "You are a helpful assistant."},
{"role": "user", "content": "Explain quantum entanglement."}
],
"temperature": 0.7,
"max_tokens": 512
}'
# List models
curl http://localhost:11434/v1/models
# This returns the same format as OpenAI's API, so any OpenAI SDK client works:
# Python example:
from openai import OpenAI
client = OpenAI(base_url="http://spark.local:11434/v1", api_key="ollama")
response = client.chat.completions.create(
    model="qwen3:32b",
    messages=[{"role": "user", "content": "What is 2+2?"}],
)
print(response.choices[0].message.content)  # e.g. "2 + 2 = 4"
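Streaming works through the same endpoint; a short sketch with the same client:
# Stream tokens as they are generated (stream=True returns an iterator of chunks)
stream = client.chat.completions.create(
    model="qwen3:32b",
    messages=[{"role": "user", "content": "Write a haiku about GPUs."}],
    stream=True,
)
for chunk in stream:
    delta = chunk.choices[0].delta.content
    if delta:
        print(delta, end="", flush=True)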
Step 3: HTTPS with Nginx + Let's Encrypt
For production use, terminate HTTPS at a reverse proxy:
# Install Nginx
sudo apt install nginx certbot python3-certbot-nginx
# Create Nginx config
sudo nano /etc/nginx/sites-available/ai-gateway
# Content:
# NOTE: limit_req_zone must live at this level (the http context, which
# sites-enabled files are included into), never inside a server block
limit_req_zone $binary_remote_addr zone=api:10m rate=30r/s;

server {
    listen 80;
    server_name ai.yourdomain.com;
    return 301 https://$server_name$request_uri;
}

server {
    listen 443 ssl http2;
    server_name ai.yourdomain.com;

    ssl_certificate /etc/letsencrypt/live/ai.yourdomain.com/fullchain.pem;
    ssl_certificate_key /etc/letsencrypt/live/ai.yourdomain.com/privkey.pem;
    ssl_protocols TLSv1.3;

    # Security headers
    add_header X-Frame-Options DENY;
    add_header X-Content-Type-Options nosniff;

    location / {
        proxy_pass http://127.0.0.1:11434;
        proxy_set_header Host $host;
        proxy_set_header X-Real-IP $remote_addr;
        proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
        proxy_set_header X-Forwarded-Proto $scheme;

        # Streaming responses are chunked; disable buffering so tokens
        # flush to the client as they are generated
        proxy_buffering off;
        proxy_http_version 1.1;
        proxy_set_header Upgrade $http_upgrade;
        proxy_set_header Connection "upgrade";

        # Timeouts for long inference
        proxy_read_timeout 300s;
        proxy_send_timeout 300s;
    }

    # Rate limiting on the chat endpoint
    location /v1/chat/completions {
        limit_req zone=api burst=50 nodelay;
        proxy_pass http://127.0.0.1:11434;
        proxy_set_header Host $host;
        proxy_buffering off;
        proxy_read_timeout 300s;
    }
}
sudo ln -s /etc/nginx/sites-available/ai-gateway /etc/nginx/sites-enabled/
sudo rm /etc/nginx/sites-enabled/default
# Get the SSL certificate (if the certificate paths above don't exist yet,
# nginx -t will fail; comment out the ssl_ lines until certbot has created them)
sudo certbot --nginx -d ai.yourdomain.com
# Reload
sudo nginx -t && sudo systemctl reload nginx
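Verify end to end from any machine (assumes DNS for ai.yourdomain.com points at the Spark):
# Should return the model list over HTTPS
curl https://ai.yourdomain.com/v1/models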
Step 4: Authentication and Access Control
# Method 1: Simple API key via Nginx
sudo nano /etc/nginx/sites-available/ai-gateway
# Add inside server block:
auth_basic "AI Gateway";
auth_basic_user_file /etc/nginx/.htpasswd;
# Create the password file
sudo apt install apache2-utils
sudo htpasswd -c /etc/nginx/.htpasswd spark-admin
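Clients then authenticate with standard Basic credentials. Note that most OpenAI SDKs only support Bearer tokens, so Basic auth is best suited to curl and browser access:
# Test with the user created above (curl prompts for the password)
curl -u spark-admin https://ai.yourdomain.com/v1/models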
# Method 2: Bearer token (more flexible)
# Add to Nginx config:
map $http_authorization $api_key {
    default "";
    "~*^Bearer\s+(?<key>.+)$" $key;
}
# Then validate $api_key, either in your proxy logic or directly in Nginx:
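A minimal sketch that checks the extracted key against a static allow-list entirely in Nginx (the key shown is a placeholder; generate your own):
# In the http context, next to the map above:
map $api_key $key_ok {
    default 0;
    "sk-spark-demo-123" 1;   # placeholder; add one line per client key
}
# Inside the 443 server block, before the location blocks:
if ($key_ok = 0) { return 401; }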
# Method 3: OpenRouter-compatible API key rotation
# Generate per-client keys and validate in a middleware
docker run -d \
  --name api-key-validator \
  -p 9090:9090 \
  --add-host=host.docker.internal:host-gateway \
  -e BACKEND_URL=http://host.docker.internal:11434 \
  your-api-key-validator
Step 5: Multi-Model Routing
Use a lightweight proxy to route requests to different models based on the task:
# Install FastAPI-based router
pip install fastapi uvicorn httpx
cat > /opt/ai/router.py <<'PYEOF'
from fastapi import FastAPI, Request
from fastapi.responses import StreamingResponse
import httpx, json
app = FastAPI()
MODELS = {
"qwen3:35b": "http://localhost:11434", # general purpose
"phi3.5:3.8b": "http://localhost:11434", # fast classification
"llama3.3:70b": "http://localhost:11434", # max quality
}
# Route by prompt length or keyword
def route_model(messages):
first_msg = messages[0]["content"] if messages else ""
if len(first_msg) < 50:
return "phi3.5:3.8b" # short → fast model
if any(k in first_msg.lower() for k in ["math", "code", "algorithm"]):
return "qwen3:35b" # reasoning → capable model
return "llama3.3:70b" # everything else → max quality
@app.post("/v1/chat/completions")
async def chat(request: Request):
data = await request.json()
model = route_model(data.get("messages", []))
data["model"] = model
# Forward to Ollama
async with httpx.AsyncClient() as client:
resp = await client.post(
"http://localhost:11434/v1/chat/completions",
json=data,
timeout=120.0
)
return resp.json()
PYEOF
# Run the router (from the directory containing router.py)
cd /opt/ai && uvicorn router:app --host 0.0.0.0 --port 8000 --workers 4
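A quick sanity check: the client omits the "model" field since the router picks it, and a short prompt like this should be served by phi3.5 (the "model" field in the response shows which backend answered):
curl -s http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"messages": [{"role": "user", "content": "Hi!"}]}' | python3 -m json.tool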
Step 6: Monitoring and Alerting
# Track API usage: sqlite3 provides a lightweight usage log (see the sketch below)
sudo apt install sqlite3
# Enable verbose Ollama logging: add to the systemd override from Step 1
#   Environment="OLLAMA_DEBUG=1"
# Follow API activity — Ollama runs as a systemd service and logs to the journal
journalctl -u ollama -f | tee /opt/ai/api-logs.txt
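For per-request accounting, here is a minimal sketch of a SQLite usage log the Step 5 router could call; usage_log.py and its schema are illustrative, not part of the router above, and you would wire log_usage() into the chat handler yourself:
# /opt/ai/usage_log.py (hypothetical helper)
import sqlite3, time

DB = "/opt/ai/usage.db"

def init_db():
    with sqlite3.connect(DB) as db:
        db.execute("""CREATE TABLE IF NOT EXISTS usage (
            ts REAL, model TEXT, prompt_tokens INTEGER, completion_tokens INTEGER)""")

def log_usage(model: str, usage: dict):
    # `usage` is the OpenAI-format usage object in the /v1 response
    with sqlite3.connect(DB) as db:
        db.execute("INSERT INTO usage VALUES (?, ?, ?, ?)",
                   (time.time(), model,
                    usage.get("prompt_tokens", 0),
                    usage.get("completion_tokens", 0)))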
# Set up Grafana dashboard for API metrics
docker run -d \
--name grafana \
-p 3000:3000 \
-v grafana-data:/var/lib/grafana \
grafana/grafana:latest
# Access at http://spark.local:3000 (admin/admin)
# Add Prometheus as a data source. Ollama has no built-in metrics endpoint,
# so export metrics from the FastAPI router instead (see below)
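One lightweight way to get those metrics, assuming the Step 5 router is in use and you install the prometheus-fastapi-instrumentator package:
pip install prometheus-fastapi-instrumentator
# Then add two lines to router.py, after `app = FastAPI()`:
#   from prometheus_fastapi_instrumentator import Instrumentator
#   Instrumentator().instrument(app).expose(app)
# Prometheus can now scrape http://spark.local:8000/metrics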
💡 Bottom line: Ollama gives you an OpenAI-compatible API out of the box on the Spark. Add Nginx for HTTPS, rate limiting, and authentication. The GB10's 128 GB unified memory means you can host multiple models and route between them. Total cost: electricity (roughly $7/month at typical rates), with no per-token fees, no vendor lock-in, and complete privacy.