Production Deployment
Deploy Apollo RAG to production with Docker Compose orchestration, GPU acceleration, and production-grade monitoring.
Production Prerequisites: Ensure you have Docker 24+, Docker Compose 2.20+, NVIDIA Container Toolkit, and at least 48GB RAM + 16GB GPU VRAM.
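Before deploying, you can sanity-check these prerequisites from the host shell (the CUDA base image tag below is only an example for testing GPU passthrough):
# Verify Docker, Compose, and GPU passthrough before deploying
docker --version            # expect 24.x or newer
docker-compose version      # expect 2.20 or newer
nvidia-smi                  # host driver installed and GPU visible
docker run --rm --gpus all nvidia/cuda:12.1.0-base-ubuntu22.04 nvidia-smi   # NVIDIA Container Toolkit works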
Deployment Overview
Apollo’s production deployment uses a multi-container architecture orchestrated via Docker Compose:
- atlas-backend: FastAPI + RAG pipeline with GPU acceleration
- qdrant: Vector database for semantic search
- redis: Multi-tier caching layer
- prometheus: Metrics collection and alerting
- grafana: Visualization dashboards
- cadvisor: Container resource monitoring
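Once the stack is running, each of these services answers on a well-known endpoint. A quick smoke test might look like this (host ports as mapped later in this page; the endpoints are the services' standard health routes):
# One-line smoke test per service
curl -sf http://localhost:8000/api/health     # atlas-backend
curl -sf http://localhost:6333/collections    # qdrant
docker exec atlas-redis redis-cli ping        # redis
curl -sf http://localhost:9090/-/healthy      # prometheus
curl -sf http://localhost:3001/api/health     # grafana
curl -sf http://localhost:8081/healthz        # cadvisor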
Architecture Benefits
High Availability:
- Health checks with auto-restart
- Graceful degradation on component failure
- Service dependency management
Performance:
- GPU-accelerated inference (80-100 tok/s)
- Multi-stage caching (98% latency reduction)
- Resource isolation and limits
Observability:
- Prometheus metrics collection
- Grafana dashboards
- Structured JSON logging
- Container resource tracking
Docker Compose Setup
Service Architecture
The complete stack is defined in backend/docker-compose.atlas.yml:
# Network Configuration
networks:
atlas-network:
driver: bridge
ipam:
config:
- subnet: 172.28.0.0/16
# Persistent Storage
volumes:
qdrant_storage: # Vector database
redis_data: # Cache persistence
prometheus_data: # Metrics history
grafana_data: # Dashboards
services:
# Core RAG Service
atlas-backend:
build:
context: ..
dockerfile: backend/Dockerfile.atlas
ports:
- "8000:8000"
depends_on:
- redis
- qdrant
deploy:
resources:
limits:
cpus: '14'
memory: 48G
reservations:
devices:
- driver: nvidia
count: 1
capabilities: [gpu]
Starting the Stack
# Production deployment
cd backend
docker-compose -f docker-compose.atlas.yml up -d
# View logs
docker-compose logs -f atlas-backend
# Check service health
docker-compose ps
curl http://localhost:8000/api/health
# Stop all services
docker-compose down
# Stop with volume cleanup
docker-compose down -v
First Startup: Initial startup takes 20-30 seconds for model loading and component initialization. Subsequent restarts are faster due to cached models.
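Because of that cold start, scripts that run right after docker-compose up -d should wait for readiness rather than assume it. A minimal polling sketch:
# Poll the health endpoint until the backend reports ready (give up after ~3 minutes)
for i in $(seq 1 36); do
  if curl -sf http://localhost:8000/api/health >/dev/null; then
    echo "atlas-backend is healthy"
    break
  fi
  echo "waiting for atlas-backend ($i/36)..."
  sleep 5
done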
Multi-Stage Docker Builds
Apollo uses a 5-stage Dockerfile for optimal caching and minimal image size (9GB, 52% reduction from baseline).
Build Stages
# STAGE 1: Base Dependencies
FROM python:3.11-slim-bookworm AS base
# - System packages (gcc-12, CUDA toolkit)
# - NVIDIA CUDA 12.1 repository
# - CUDA stubs for Docker linking
# STAGE 2: Python Dependencies
FROM base AS python-deps
# - PyTorch with CUDA 12.1 support
# - llama-cpp-python (built from source with GPU)
# - Application dependencies
# STAGE 3: Model Pre-Caching
FROM python-deps AS model-cache
# - BGE-large-en-v1.5 (embeddings)
# - BGE-reranker-large (reranking)
# - Saves 15-20s startup time
# STAGE 4: Application Code
FROM model-cache AS app
# - FastAPI application
# - RAG engine and retrievers
# - GGUF models (5.4GB + 771MB)
# STAGE 5: Runtime Configuration
FROM app AS runtime
# - Python bytecode compilation
# - Health checks
# - Uvicorn ASGI server
Build Performance
# Build the image
docker build -f backend/Dockerfile.atlas \
-t atlas-protocol-backend:v4.0 \
--build-arg BUILDKIT_INLINE_CACHE=1 \
.
# Build metrics:
# First build: ~10 minutes (downloads models)
# Rebuild: ~30 seconds (cached layers)
# Image size: ~9GB (includes pre-cached models)
CUDA Configuration
Critical optimizations for GPU acceleration:
# Install CUDA components for compilation
RUN apt-get install -y \
cuda-nvcc-12-1 \
cuda-cudart-dev-12-1 \
libcublas-dev-12-1
# Configure stub linking (build-time only)
RUN ln -s /usr/local/cuda/lib64/stubs/libcuda.so \
/usr/local/cuda/lib64/stubs/libcuda.so.1
# Build llama-cpp-python with CUDA
ENV CMAKE_ARGS="-DGGML_CUDA=ON \
-DCMAKE_CUDA_ARCHITECTURES=all-major \
-DCMAKE_CUDA_COMPILER=/usr/local/cuda/bin/nvcc"
RUN pip install llama-cpp-python==0.3.2 --no-binary llama-cpp-python
# WSL2 driver path for runtime
ENV LD_LIBRARY_PATH=/usr/lib/wsl/drivers:$LD_LIBRARY_PATH
CUDA Compatibility: Requires gcc-12 or older. Newer gcc versions (13+) are incompatible with CUDA 12.1. The Dockerfile automatically configures gcc-12 via update-alternatives.
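The update-alternatives step mentioned above generally looks like this (an illustrative sketch; the exact lines in Dockerfile.atlas may differ):
# Make gcc-12/g++-12 the default compilers so nvcc accepts the host compiler
apt-get install -y gcc-12 g++-12
update-alternatives --install /usr/bin/gcc gcc /usr/bin/gcc-12 100
update-alternatives --install /usr/bin/g++ g++ /usr/bin/g++-12 100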
Environment Configuration
Core Configuration
# .env file for docker-compose
# LLM Configuration
LLM_BACKEND=llamacpp
MODEL_PATH=/app/models/llama-3.1-8b-instruct.Q5_K_M.gguf
GPU_LAYERS=33
CONTEXT_LENGTH=8192
TEMPERATURE=0.7
# Vector Store
VECTOR_STORE=qdrant
QDRANT_HOST=qdrant
QDRANT_PORT=6333
QDRANT_COLLECTION=atlas_knowledge_base
# Cache Configuration
RAG_CACHE__USE_REDIS=true
RAG_CACHE__REDIS_HOST=redis
RAG_CACHE__REDIS_PORT=6379
RAG_CACHE__TTL=3600
# Embeddings & Reranking
EMBEDDING_MODEL=BAAI/bge-large-en-v1.5
EMBEDDING_DIMENSION=1024
RERANKER_MODEL=BAAI/bge-reranker-large
RERANKER_TOP_K=10
# Performance
WORKERS=4
MAX_CONCURRENT_REQUESTS=100
REQUEST_TIMEOUT=120
# Monitoring
ENABLE_METRICS=true
LOG_LEVEL=INFO
LOG_FORMAT=json
Security Configuration
# Production Security Settings
CORS_ORIGINS=https://yourdomain.com,https://www.yourdomain.com
RATE_LIMIT_QUERIES=30 # per minute
RATE_LIMIT_GENERAL=100 # per minute
ENABLE_AUTH=true
JWT_SECRET_KEY=<your-secure-secret>
REDIS_PASSWORD=<your-redis-password>
Volume Management
Data Persistence
services:
atlas-backend:
volumes:
# Document storage (read-only)
- ./documents:/app/documents:ro
# Model files (read-only)
- ../models:/app/models:ro
# Configuration (read-only)
- ./config.yml:/app/config.yml:ro
# Application logs (read-write)
- ./logs:/app/logs
# Temporary files (tmpfs, 4GB)
- type: tmpfs
target: /tmp
tmpfs:
size: 4G
qdrant:
volumes:
# Vector database storage
- qdrant_storage:/qdrant/storage
# Database snapshots
- ./data/qdrant:/qdrant/snapshots
redis:
volumes:
# Cache persistence
- redis_data:/data
Backup Strategy
# Backup Qdrant snapshots
docker exec atlas-qdrant \
curl -X POST http://localhost:6333/collections/atlas_knowledge_base/snapshots
# Copy snapshot
docker cp atlas-qdrant:/qdrant/snapshots ./backups/qdrant/
# Backup Redis RDB
docker exec atlas-redis redis-cli BGSAVE
docker cp atlas-redis:/data/dump.rdb ./backups/redis/
# Restore Qdrant from snapshot
docker cp ./backups/qdrant/snapshot.tar atlas-qdrant:/qdrant/snapshots/
curl -X PUT http://localhost:6333/collections/atlas_knowledge_base/snapshots/upload \
-F "snapshot=@snapshot.tar"
# Restore Redis
docker cp ./backups/redis/dump.rdb atlas-redis:/data/
docker restart atlas-redis
GPU Container Configuration
NVIDIA Runtime Setup
# docker-compose.atlas.yml
services:
atlas-backend:
deploy:
resources:
reservations:
devices:
- driver: nvidia
count: 1
capabilities: [gpu]
environment:
- NVIDIA_VISIBLE_DEVICES=0
- NVIDIA_DRIVER_CAPABILITIES=compute,utility
- FORCE_TORCH_CPU=1 # Force embeddings to CPU (sm_120 workaround)
GPU Verification
# Check GPU access inside container
docker exec atlas-backend nvidia-smi
# Expected output:
# +-----------------------------------------------------------------------------+
# | NVIDIA-SMI 550.54 Driver Version: 550.54 CUDA Version: 12.1 |
# |-------------------------------+----------------------+----------------------+
# | GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
# | 0 RTX 5080 On | 00000000:01:00.0 On | N/A |
CPU Fallback: If embeddings fail to use GPU (PyTorch sm_120 incompatibility), they automatically fall back to CPU. LLM inference via llama.cpp still uses GPU.
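To see whether PyTorch inside the container can reach the GPU at all (independent of the FORCE_TORCH_CPU override for embeddings), a quick check from the host:
# Prints "cuda" if PyTorch can use the GPU inside the container, otherwise "cpu"
docker exec atlas-backend python -c "import torch; print('cuda' if torch.cuda.is_available() else 'cpu')"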
Network Configuration
Service Communication
networks:
atlas-network:
driver: bridge
ipam:
config:
- subnet: 172.28.0.0/16
# Static IP Assignment
services:
qdrant:
networks:
atlas-network:
ipv4_address: 172.28.0.2
redis:
networks:
atlas-network:
ipv4_address: 172.28.0.3
atlas-backend:
networks:
atlas-network:
ipv4_address: 172.28.0.10
Port Mapping
services:
atlas-backend: 8000:8000 # REST API
qdrant: 6333:6333 # HTTP API
redis: 6379:6379 # Cache
prometheus: 9090:9090 # Metrics
grafana: 3001:3000 # Dashboards
cadvisor: 8081:8080 # Container metrics
Health Checks & Readiness
Service Health Configuration
# Backend Health Check
atlas-backend:
healthcheck:
test: ["CMD", "curl", "-f", "http://localhost:8000/api/health"]
interval: 15s
timeout: 5s
retries: 3
start_period: 60s
# Qdrant Health Check
qdrant:
healthcheck:
test: ["CMD-SHELL", "timeout 2 bash -c '</dev/tcp/localhost/6333'"]
interval: 15s
timeout: 5s
retries: 3
start_period: 30s
# Redis Health Check
redis:
healthcheck:
test: ["CMD", "redis-cli", "ping"]
interval: 10s
timeout: 3s
retries: 3
start_period: 10s
Health Check Response
// GET /api/health
{
"status": "healthy",
"components": {
"vectorstore": "ready",
"llm": "ready",
"bm25_retriever": "ready",
"cache": "ready",
"conversation_memory": "ready"
},
"version": "4.0.0",
"uptime_seconds": 1234
}
Scaling Strategies
Vertical Scaling
# Resource Allocation (RTX 5080 + 96GB RAM)
services:
atlas-backend:
deploy:
resources:
limits:
cpus: '14'
memory: 48G
reservations:
cpus: '8'
memory: 24G
devices:
- driver: nvidia
count: 1
capabilities: [gpu]
qdrant:
deploy:
resources:
limits:
cpus: '4'
memory: 16G
reservations:
cpus: '2'
memory: 8G
Horizontal Scaling
GPU Limitation: Each atlas-backend instance requires dedicated GPU access. Scale horizontally by deploying multiple GPU nodes behind a load balancer.
# nginx.conf - Load Balancer
upstream apollo_backend {
least_conn;
server 192.168.1.10:8000 weight=1; # GPU Node 1
server 192.168.1.11:8000 weight=1; # GPU Node 2
server 192.168.1.12:8000 weight=1; # GPU Node 3
}
server {
listen 80;
server_name api.yourdomain.com;
location / {
proxy_pass http://apollo_backend;
proxy_http_version 1.1;
proxy_set_header Upgrade $http_upgrade;
proxy_set_header Connection 'upgrade';
proxy_set_header Host $host;
proxy_cache_bypass $http_upgrade;
}
}
Backup & Recovery
Automated Backup Script
#!/bin/bash
# backup.sh - Daily backup automation
BACKUP_DIR=/backups/apollo/$(date +%Y%m%d)
mkdir -p $BACKUP_DIR/qdrant $BACKUP_DIR/redis  # docker cp needs existing target directories
# Backup Qdrant
echo "Backing up Qdrant..."
docker exec atlas-qdrant \
curl -X POST http://localhost:6333/collections/atlas_knowledge_base/snapshots
docker cp atlas-qdrant:/qdrant/snapshots/ $BACKUP_DIR/qdrant/
# Backup Redis
echo "Backing up Redis..."
docker exec atlas-redis redis-cli BGSAVE
sleep 5
docker cp atlas-redis:/data/dump.rdb $BACKUP_DIR/redis/
# Backup application logs
echo "Backing up logs..."
cp -r backend/logs/ $BACKUP_DIR/logs/
# Backup configuration
echo "Backing up configuration..."
cp backend/config.yml $BACKUP_DIR/
cp backend/.env $BACKUP_DIR/
# Compress and archive
tar -czf $BACKUP_DIR.tar.gz $BACKUP_DIR
rm -rf $BACKUP_DIR
echo "Backup complete: $BACKUP_DIR.tar.gz"Disaster Recovery
# 1. Stop all services
docker-compose down
# 2. Unpack the backup archive
tar -xzf backup-20250128.tar.gz
# 3. Restart the stack
docker-compose up -d
# 4. Restore Qdrant from the snapshot (same commands as Backup Strategy above)
docker cp backup-20250128/qdrant/snapshot.tar atlas-qdrant:/qdrant/snapshots/
curl -X PUT http://localhost:6333/collections/atlas_knowledge_base/snapshots/upload \
-F "snapshot=@backup-20250128/qdrant/snapshot.tar"
# 5. Restore Redis and restart it
docker cp backup-20250128/redis/dump.rdb atlas-redis:/data/
docker restart atlas-redis
# 6. Verify health
curl http://localhost:8000/api/health
Production Checklist
Before deploying to production:
Security
- Configure CORS to production domains only
- Enable authentication middleware (JWT/OAuth)
- Set Redis password in docker-compose (see the secret-generation sketch after this list)
- Enable TLS/SSL via reverse proxy
- Configure rate limiting with Redis backend
- Review and harden security headers
- Enable audit logging
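One way to generate the JWT_SECRET_KEY and REDIS_PASSWORD values referenced above (any cryptographically strong random source works):
# Generate strong random secrets for the .env file
openssl rand -hex 32    # JWT_SECRET_KEY
openssl rand -hex 24    # REDIS_PASSWORD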
Performance
- Optimize GPU layers based on VRAM
- Configure Redis maxmemory policy (see the sketch after this list)
- Set appropriate worker count
- Enable model pre-caching
- Configure query timeout limits
- Tune Qdrant HNSW parameters
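For the Redis maxmemory item above, a runtime sketch (apply via redis-cli, persist in redis.conf or the compose command for restarts, and add -a <password> if REDIS_PASSWORD is set):
# Cap Redis memory and evict least-recently-used keys under pressure
docker exec atlas-redis redis-cli CONFIG SET maxmemory 8gb
docker exec atlas-redis redis-cli CONFIG SET maxmemory-policy allkeys-lru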
Monitoring
- Configure Prometheus scraping
- Set up Grafana dashboards
- Configure alerting rules
- Enable structured logging
- Set up log aggregation (ELK/Loki)
- Monitor GPU utilization
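For GPU utilization, nvidia-smi's query mode gives a lightweight feed you can tail or ship to your log pipeline:
# Sample GPU utilization and memory every 5 seconds from inside the backend container
docker exec atlas-backend nvidia-smi \
--query-gpu=utilization.gpu,memory.used,memory.total \
--format=csv -l 5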
Reliability
- Configure health checks
- Set resource limits and reservations
- Enable automatic restarts
- Configure backup automation (see the cron sketch after this list)
- Test disaster recovery procedures
- Document runbook procedures
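For backup automation, a cron entry invoking the backup.sh script above is usually enough (paths are placeholders):
# Run the daily backup at 02:00 and keep a log of each run
(crontab -l 2>/dev/null; echo "0 2 * * * /path/to/apollo/backup.sh >> /var/log/apollo-backup.log 2>&1") | crontab -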
Compliance
- Review data retention policies
- Configure log rotation (see the logrotate sketch at the end of this checklist)
- Enable encryption at rest
- Document security controls
- Conduct security audit
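For the log rotation item, a hypothetical logrotate policy for the host-mounted backend/logs directory (adjust the path to your install):
# /etc/logrotate.d/apollo-rag - rotate container-written logs daily, keep two weeks
cat <<'EOF' | sudo tee /etc/logrotate.d/apollo-rag
/path/to/apollo/backend/logs/*.log {
    daily
    rotate 14
    compress
    missingok
    notifempty
    copytruncate
}
EOF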