Production Deployment

Deploy Apollo RAG to production with Docker Compose orchestration, GPU acceleration, and production-grade monitoring.

Production Prerequisites: Ensure you have Docker 24+, Docker Compose 2.20+, NVIDIA Container Toolkit, and at least 48GB RAM + 16GB GPU VRAM.
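
Before proceeding, it is worth confirming the host actually meets these requirements. A quick pre-flight check (standard Linux tooling assumed; the CUDA base image tag is only an example for exercising the NVIDIA Container Toolkit):

# Verify Docker and Compose versions
docker --version          # expect Docker 24+
docker compose version    # expect Compose v2.20+

# Verify the NVIDIA Container Toolkit can expose the GPU to containers
docker run --rm --gpus all nvidia/cuda:12.1.0-base-ubuntu22.04 nvidia-smi

# Verify host memory (expect at least 48GB)
free -g | awk '/Mem:/ {print $2 " GB total RAM"}'
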


Deployment Overview

Apollo’s production deployment uses a multi-container architecture orchestrated via Docker Compose:

  • atlas-backend: FastAPI + RAG pipeline with GPU acceleration
  • qdrant: Vector database for semantic search
  • redis: Multi-tier caching layer
  • prometheus: Metrics collection and alerting
  • grafana: Visualization dashboards
  • cadvisor: Container resource monitoring

Architecture Benefits

High Availability:
  - Health checks with auto-restart
  - Graceful degradation on component failure
  - Service dependency management
 
Performance:
  - GPU-accelerated inference (80-100 tok/s)
  - Multi-stage caching (98% latency reduction)
  - Resource isolation and limits
 
Observability:
  - Prometheus metrics collection
  - Grafana dashboards
  - Structured JSON logging
  - Container resource tracking

Docker Compose Setup

Service Architecture

The complete stack is defined in backend/docker-compose.atlas.yml:

# Network Configuration
networks:
  atlas-network:
    driver: bridge
    ipam:
      config:
        - subnet: 172.28.0.0/16
 
# Persistent Storage
volumes:
  qdrant_storage:     # Vector database
  redis_data:         # Cache persistence
  prometheus_data:    # Metrics history
  grafana_data:       # Dashboards
 
services:
  # Core RAG Service
  atlas-backend:
    build:
      context: ..
      dockerfile: backend/Dockerfile.atlas
    ports:
      - "8000:8000"
    depends_on:
      - redis
      - qdrant
    deploy:
      resources:
        limits:
          cpus: '14'
          memory: 48G
        reservations:
          devices:
            - driver: nvidia
              count: 1
              capabilities: [gpu]

Starting the Stack

# Production deployment
cd backend
docker-compose -f docker-compose.atlas.yml up -d
 
# View logs
docker-compose logs -f atlas-backend
 
# Check service health
docker-compose ps
curl http://localhost:8000/api/health
 
# Stop all services
docker-compose down
 
# Stop with volume cleanup
docker-compose down -v

First Startup: Initial startup takes 20-30 seconds for model loading and component initialization. Subsequent restarts are faster due to cached models.
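
Because of this warm-up period, deployment scripts should poll the health endpoint rather than assume immediate readiness. A minimal wait loop (the 120-second budget and 5-second interval are illustrative):

# Wait up to 120s for the backend to report healthy
for i in $(seq 1 24); do
  if curl -sf http://localhost:8000/api/health > /dev/null; then
    echo "atlas-backend is healthy"
    break
  fi
  echo "Waiting for atlas-backend... ($i/24)"
  sleep 5
done
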


Multi-Stage Docker Builds

Apollo uses a 5-stage Dockerfile for optimal caching and minimal image size (9GB, 52% reduction from baseline).

Build Stages

# STAGE 1: Base Dependencies
FROM python:3.11-slim-bookworm AS base
# - System packages (gcc-12, CUDA toolkit)
# - NVIDIA CUDA 12.1 repository
# - CUDA stubs for Docker linking
 
# STAGE 2: Python Dependencies
FROM base AS python-deps
# - PyTorch with CUDA 12.1 support
# - llama-cpp-python (built from source with GPU)
# - Application dependencies
 
# STAGE 3: Model Pre-Caching
FROM python-deps AS model-cache
# - BGE-large-en-v1.5 (embeddings)
# - BGE-reranker-large (reranking)
# - Saves 15-20s startup time
 
# STAGE 4: Application Code
FROM model-cache AS app
# - FastAPI application
# - RAG engine and retrievers
# - GGUF models (5.4GB + 771MB)
 
# STAGE 5: Runtime Configuration
FROM app AS runtime
# - Python bytecode compilation
# - Health checks
# - Uvicorn ASGI server

Build Performance

# Build the image
docker build -f backend/Dockerfile.atlas \
  -t atlas-protocol-backend:v4.0 \
  --build-arg BUILDKIT_INLINE_CACHE=1 \
  .
 
# Build metrics:
# First build:  ~10 minutes (downloads models)
# Rebuild:      ~30 seconds (cached layers)
# Image size:   ~9GB (includes pre-cached models)
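
Because the Dockerfile is staged, individual stages can also be built in isolation with --target, which is handy when iterating on dependencies without re-running the model downloads. The stage names below match the excerpt above; the tags are examples:

# Build only through the Python dependency stage
docker build -f backend/Dockerfile.atlas \
  --target python-deps \
  -t atlas-protocol-backend:deps \
  .

# Build through model pre-caching (stops before copying application code)
docker build -f backend/Dockerfile.atlas \
  --target model-cache \
  -t atlas-protocol-backend:models \
  .
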

CUDA Configuration

Critical optimizations for GPU acceleration:

# Install CUDA components for compilation
RUN apt-get update && apt-get install -y \
    cuda-nvcc-12-1 \
    cuda-cudart-dev-12-1 \
    libcublas-dev-12-1
 
# Configure stub linking (build-time only)
RUN ln -s /usr/local/cuda/lib64/stubs/libcuda.so \
          /usr/local/cuda/lib64/stubs/libcuda.so.1
 
# Build llama-cpp-python with CUDA
ENV CMAKE_ARGS="-DGGML_CUDA=ON \
                -DCMAKE_CUDA_ARCHITECTURES=all-major \
                -DCMAKE_CUDA_COMPILER=/usr/local/cuda/bin/nvcc"
RUN pip install llama-cpp-python==0.3.2 --no-binary llama-cpp-python
 
# WSL2 driver path for runtime
ENV LD_LIBRARY_PATH=/usr/lib/wsl/drivers:$LD_LIBRARY_PATH

CUDA Compatibility: Requires gcc-12 or older. Newer gcc versions (13+) are incompatible with CUDA 12.1. The Dockerfile automatically configures gcc-12 via update-alternatives.
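
The Dockerfile handles this automatically; for reference, an equivalent update-alternatives setup looks roughly like the following sketch (not the exact lines from Dockerfile.atlas):

# Install gcc-12 and make it the default compiler seen by nvcc
apt-get update && apt-get install -y gcc-12 g++-12
update-alternatives --install /usr/bin/gcc gcc /usr/bin/gcc-12 100
update-alternatives --install /usr/bin/g++ g++ /usr/bin/g++-12 100
gcc --version   # should report gcc 12.x
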


Environment Configuration

Core Configuration

# .env file for docker-compose
# LLM Configuration
LLM_BACKEND=llamacpp
MODEL_PATH=/app/models/llama-3.1-8b-instruct.Q5_K_M.gguf
GPU_LAYERS=33
CONTEXT_LENGTH=8192
TEMPERATURE=0.7
 
# Vector Store
VECTOR_STORE=qdrant
QDRANT_HOST=qdrant
QDRANT_PORT=6333
QDRANT_COLLECTION=atlas_knowledge_base
 
# Cache Configuration
RAG_CACHE__USE_REDIS=true
RAG_CACHE__REDIS_HOST=redis
RAG_CACHE__REDIS_PORT=6379
RAG_CACHE__TTL=3600
 
# Embeddings & Reranking
EMBEDDING_MODEL=BAAI/bge-large-en-v1.5
EMBEDDING_DIMENSION=1024
RERANKER_MODEL=BAAI/bge-reranker-large
RERANKER_TOP_K=10
 
# Performance
WORKERS=4
MAX_CONCURRENT_REQUESTS=100
REQUEST_TIMEOUT=120
 
# Monitoring
ENABLE_METRICS=true
LOG_LEVEL=INFO
LOG_FORMAT=json

Security Configuration

# Production Security Settings
CORS_ORIGINS=https://yourdomain.com,https://www.yourdomain.com
RATE_LIMIT_QUERIES=30  # per minute
RATE_LIMIT_GENERAL=100 # per minute
ENABLE_AUTH=true
JWT_SECRET_KEY=<your-secure-secret>
REDIS_PASSWORD=<your-redis-password>
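
Secrets should be generated, not hand-written. One way to produce suitable values (openssl is assumed to be available on the host):

# Generate a 256-bit JWT signing key and a Redis password
JWT_SECRET_KEY=$(openssl rand -hex 32)
REDIS_PASSWORD=$(openssl rand -base64 24)

# Append them to the .env file (keep .env out of version control)
echo "JWT_SECRET_KEY=$JWT_SECRET_KEY" >> backend/.env
echo "REDIS_PASSWORD=$REDIS_PASSWORD" >> backend/.env
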

Volume Management

Data Persistence

services:
  atlas-backend:
    volumes:
      # Document storage (read-only)
      - ./documents:/app/documents:ro
 
      # Model files (read-only)
      - ../models:/app/models:ro
 
      # Configuration (read-only)
      - ./config.yml:/app/config.yml:ro
 
      # Application logs (read-write)
      - ./logs:/app/logs
 
      # Temporary files (tmpfs, 4GB)
      - type: tmpfs
        target: /tmp
        tmpfs:
          size: 4G
 
  qdrant:
    volumes:
      # Vector database storage
      - qdrant_storage:/qdrant/storage
 
      # Database snapshots
      - ./data/qdrant:/qdrant/snapshots
 
  redis:
    volumes:
      # Cache persistence
      - redis_data:/data
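
To confirm the mounts came up with the intended modes (read-only for documents, models, and config), inspect the running container; the Go template below is one way to list them:

# List each mount's destination and whether it is writable
docker inspect atlas-backend \
  --format '{{range .Mounts}}{{.Destination}} rw={{.RW}}{{"\n"}}{{end}}'
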

Backup Strategy

# Backup Qdrant snapshots
docker exec atlas-qdrant \
  curl -X POST http://localhost:6333/collections/atlas_knowledge_base/snapshots
 
# Copy snapshot
docker cp atlas-qdrant:/qdrant/snapshots ./backups/qdrant/
 
# Backup Redis RDB
docker exec atlas-redis redis-cli BGSAVE
docker cp atlas-redis:/data/dump.rdb ./backups/redis/
 
# Restore Qdrant from snapshot
docker cp ./backups/qdrant/snapshot.tar atlas-qdrant:/qdrant/snapshots/
curl -X POST http://localhost:6333/collections/atlas_knowledge_base/snapshots/upload \
  -F "snapshot=@./backups/qdrant/snapshot.tar"
 
# Restore Redis
docker cp ./backups/redis/dump.rdb atlas-redis:/data/
docker restart atlas-redis

GPU Container Configuration

NVIDIA Runtime Setup

# docker-compose.atlas.yml
services:
  atlas-backend:
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: 1
              capabilities: [gpu]
 
    environment:
      - NVIDIA_VISIBLE_DEVICES=0
      - NVIDIA_DRIVER_CAPABILITIES=compute,utility
      - FORCE_TORCH_CPU=1  # Force embeddings to CPU (sm_120 workaround)

GPU Verification

# Check GPU access inside container
docker exec atlas-backend nvidia-smi
 
# Expected output:
# +-----------------------------------------------------------------------------+
# | NVIDIA-SMI 550.54       Driver Version: 550.54       CUDA Version: 12.1   |
# |-------------------------------+----------------------+----------------------+
# | GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
# |   0  RTX 5080           On   | 00000000:01:00.0  On |                  N/A |

CPU Fallback: If embeddings fail to use GPU (PyTorch sm_120 incompatibility), they automatically fall back to CPU. LLM inference via llama.cpp still uses GPU.
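
To confirm that llama.cpp inference is actually running on the GPU rather than silently falling back to CPU, list the processes currently holding GPU memory (note that on some WSL2 setups the compute-process list can appear empty even when the GPU is in use):

# Show processes currently using the GPU and their VRAM usage
docker exec atlas-backend nvidia-smi \
  --query-compute-apps=pid,process_name,used_memory --format=csv
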


Network Configuration

Service Communication

networks:
  atlas-network:
    driver: bridge
    ipam:
      config:
        - subnet: 172.28.0.0/16
 
# Static IP Assignment
services:
  qdrant:
    networks:
      atlas-network:
        ipv4_address: 172.28.0.2
 
  redis:
    networks:
      atlas-network:
        ipv4_address: 172.28.0.3
 
  atlas-backend:
    networks:
      atlas-network:
        ipv4_address: 172.28.0.10
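
Service names resolve over atlas-network, so inter-service connectivity can be verified from inside the backend container (curl is available there because the health check relies on it):

# Reach Qdrant by service name from the backend container
docker exec atlas-backend curl -s http://qdrant:6333/collections

# Check the Redis port the same way the Qdrant health check does
docker exec atlas-backend timeout 2 bash -c '</dev/tcp/redis/6379' && echo "redis reachable"
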

Port Mapping

services:
  atlas-backend:    8000:8000  # REST API
  qdrant:          6333:6333  # HTTP API
  redis:           6379:6379  # Cache
  prometheus:      9090:9090  # Metrics
  grafana:         3001:3000  # Dashboards
  cadvisor:        8081:8080  # Container metrics
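
A quick smoke test of the externally mapped ports from the host (the Prometheus, Grafana, and cAdvisor paths below are those services' standard health endpoints):

curl -sf http://localhost:8000/api/health && echo "backend OK"
curl -sf http://localhost:6333/collections && echo "qdrant OK"
curl -sf http://localhost:9090/-/healthy   && echo "prometheus OK"
curl -sf http://localhost:3001/api/health  && echo "grafana OK"
curl -sf http://localhost:8081/healthz     && echo "cadvisor OK"
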

Health Checks & Readiness

Service Health Configuration

# Backend Health Check
atlas-backend:
  healthcheck:
    test: ["CMD", "curl", "-f", "http://localhost:8000/api/health"]
    interval: 15s
    timeout: 5s
    retries: 3
    start_period: 60s
 
# Qdrant Health Check
qdrant:
  healthcheck:
    test: ["CMD-SHELL", "timeout 2 bash -c '</dev/tcp/localhost/6333'"]
    interval: 15s
    timeout: 5s
    retries: 3
    start_period: 30s
 
# Redis Health Check
redis:
  healthcheck:
    test: ["CMD", "redis-cli", "ping"]
    interval: 10s
    timeout: 3s
    retries: 3
    start_period: 10s

Health Check Response

// GET /api/health
{
  "status": "healthy",
  "components": {
    "vectorstore": "ready",
    "llm": "ready",
    "bm25_retriever": "ready",
    "cache": "ready",
    "conversation_memory": "ready"
  },
  "version": "4.0.0",
  "uptime_seconds": 1234
}
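
The structured response makes it easy to gate deployment steps or external probes on overall status. For example, with jq (assumed installed on the host):

# Fail (non-zero exit) unless the backend reports healthy
curl -s http://localhost:8000/api/health | jq -e '.status == "healthy"'

# List any components that are not ready
curl -s http://localhost:8000/api/health \
  | jq -r '.components | to_entries[] | select(.value != "ready") | .key'
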

Scaling Strategies

Vertical Scaling

# Resource Allocation (RTX 5080 + 96GB RAM)
services:
  atlas-backend:
    deploy:
      resources:
        limits:
          cpus: '14'
          memory: 48G
        reservations:
          cpus: '8'
          memory: 24G
          devices:
            - driver: nvidia
              count: 1
              capabilities: [gpu]
 
  qdrant:
    deploy:
      resources:
        limits:
          cpus: '4'
          memory: 16G
        reservations:
          cpus: '2'
          memory: 8G

Horizontal Scaling

GPU Limitation: Each atlas-backend instance requires dedicated GPU access. Scale horizontally by deploying multiple GPU nodes behind a load balancer, as in the nginx example below.

# nginx.conf - Load Balancer
upstream apollo_backend {
    least_conn;
    server 192.168.1.10:8000 weight=1;  # GPU Node 1
    server 192.168.1.11:8000 weight=1;  # GPU Node 2
    server 192.168.1.12:8000 weight=1;  # GPU Node 3
}
 
server {
    listen 80;
    server_name api.yourdomain.com;
 
    location / {
        proxy_pass http://apollo_backend;
        proxy_http_version 1.1;
        proxy_set_header Upgrade $http_upgrade;
        proxy_set_header Connection 'upgrade';
        proxy_set_header Host $host;
        proxy_cache_bypass $http_upgrade;
    }
}

Backup & Recovery

Automated Backup Script

#!/bin/bash
# backup.sh - Daily backup automation
set -euo pipefail

BACKUP_DIR=/backups/apollo/$(date +%Y%m%d)
mkdir -p "$BACKUP_DIR"/{qdrant,redis,logs}

# Backup Qdrant
echo "Backing up Qdrant..."
docker exec atlas-qdrant \
  curl -X POST http://localhost:6333/collections/atlas_knowledge_base/snapshots
docker cp atlas-qdrant:/qdrant/snapshots/. "$BACKUP_DIR/qdrant/"

# Backup Redis
echo "Backing up Redis..."
docker exec atlas-redis redis-cli BGSAVE
sleep 5
docker cp atlas-redis:/data/dump.rdb "$BACKUP_DIR/redis/"

# Backup application logs
echo "Backing up logs..."
cp -r backend/logs/. "$BACKUP_DIR/logs/"

# Backup configuration
echo "Backing up configuration..."
cp backend/config.yml "$BACKUP_DIR/"
cp backend/.env "$BACKUP_DIR/"

# Compress and archive
tar -czf "$BACKUP_DIR.tar.gz" -C "$(dirname "$BACKUP_DIR")" "$(basename "$BACKUP_DIR")"
rm -rf "$BACKUP_DIR"

echo "Backup complete: $BACKUP_DIR.tar.gz"

Disaster Recovery

# 1. Stop all services
docker-compose down

# 2. Restore the Qdrant volume from backup
#    (`down` removes the containers, so copy into the named volume
#    via a temporary container rather than into a stopped atlas-qdrant)
tar -xzf backup-20250128.tar.gz
docker volume rm atlas-protocol_qdrant_storage
docker volume create atlas-protocol_qdrant_storage
docker run --rm \
  -v atlas-protocol_qdrant_storage:/qdrant/storage \
  -v "$PWD/backup-20250128/qdrant:/backup:ro" \
  alpine cp -a /backup/. /qdrant/storage/

# 3. Restart stack
docker-compose up -d

# 4. Verify health
curl http://localhost:8000/api/health

Production Checklist

Before deploying to production:

Security

  • Configure CORS to production domains only
  • Enable authentication middleware (JWT/OAuth)
  • Set Redis password in docker-compose
  • Enable TLS/SSL via reverse proxy
  • Configure rate limiting with Redis backend
  • Review and harden security headers
  • Enable audit logging

Performance

  • Optimize GPU layers based on VRAM
  • Configure Redis maxmemory policy
  • Set appropriate worker count
  • Enable model pre-caching
  • Configure query timeout limits
  • Tune Qdrant HNSW parameters

Monitoring

  • Configure Prometheus scraping
  • Set up Grafana dashboards
  • Configure alerting rules
  • Enable structured logging
  • Set up log aggregation (ELK/Loki)
  • Monitor GPU utilization

Reliability

  • Configure health checks
  • Set resource limits and reservations
  • Enable automatic restarts
  • Configure backup automation
  • Test disaster recovery procedures
  • Document runbook procedures

Compliance

  • Review data retention policies
  • Configure log rotation
  • Enable encryption at rest
  • Document security controls
  • Conduct security audit

Next Steps