Reproduce Benchmarks

This guide provides detailed instructions to reproduce Apollo RAG’s benchmark results independently. Follow these steps to verify the reported performance metrics on your own hardware.

Prerequisites

Hardware Requirements: Benchmarks require an NVIDIA GPU with CUDA support. Results will vary by GPU model. The reference benchmarks use an NVIDIA A100 40GB, but any CUDA-capable GPU (RTX 3090, 4090, 5090, A100, H100) will work.

Minimum Requirements

  • GPU: NVIDIA GPU with 8GB+ VRAM (CUDA 12.1+)
  • CPU: 8+ cores recommended
  • RAM: 32GB minimum, 64GB recommended
  • Storage: 100GB free space (NVMe SSD preferred)
  • OS: Ubuntu 22.04 / Debian Bookworm / Windows 11 with WSL2
  • Software: Docker 24+, Docker Compose 2.20+, Git

Reference Configuration

  • GPU: NVIDIA A100 40GB or RTX 5090 24GB
  • CPU: AMD EPYC 7763 / Intel Xeon Gold 6348 (16+ cores)
  • RAM: 128GB DDR4
  • Storage: 500GB NVMe SSD
  • Network: 10Gbps for dataset downloads

Step 1: Clone Repository

Clone the Apollo RAG repository with benchmark scripts:

# Clone main repository
git clone https://github.com/yourusername/apollo-rag.git
cd apollo-rag
 
# Checkout v4.2 production branch
git checkout v4.2-production
 
# Verify Docker Compose files
ls backend/docker-compose.atlas.yml

Step 2: Prepare Test Data

Download 100K Document Corpus

The benchmark uses a standardized corpus of 100,000 documents (15GB):

# Create data directory
mkdir -p backend/documents/benchmark
 
# Download benchmark dataset (Wikipedia subset)
wget https://benchmark-data.apollo-rag.io/wiki-100k.tar.gz
 
# Extract to documents directory
tar -xzf wiki-100k.tar.gz -C backend/documents/benchmark/
 
# Verify extraction (should show 100,000 files)
ls -1 backend/documents/benchmark/*.txt | wc -l

Generate Synthetic Queries

Create a realistic query workload:

# Run query generator (creates 10,000 test queries)
python scripts/generate_benchmark_queries.py \
  --corpus backend/documents/benchmark/ \
  --output backend/benchmark_queries.json \
  --count 10000 \
  --complexity-mix simple:50,medium:30,complex:20
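
For reference, here is a minimal Python sketch of how a complexity-weighted generator like this can work. The helper names and the number of terms per complexity level are illustrative assumptions, not the actual internals of generate_benchmark_queries.py:

# Hypothetical sketch of complexity-mix sampling; the real script
# mines its vocabulary from the corpus rather than a fixed list.
import json
import random

def parse_mix(spec):
    """Turn 'simple:50,medium:30,complex:20' into levels and weights."""
    pairs = [part.split(":") for part in spec.split(",")]
    return [p[0] for p in pairs], [int(p[1]) for p in pairs]

def sample_queries(terms, count, spec="simple:50,medium:30,complex:20"):
    levels, weights = parse_mix(spec)
    terms_per_level = {"simple": 1, "medium": 3, "complex": 5}  # assumed
    return [
        {"query": " ".join(random.sample(terms, terms_per_level[level])),
         "complexity": level}
        for level in random.choices(levels, weights=weights, k=count)
    ]

# Placeholder vocabulary for illustration only.
print(json.dumps(sample_queries(
    ["apollo", "rag", "vector", "cache", "llm", "qdrant"], 3), indent=2))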

Step 3: Configure Environment

Set Hardware-Specific Settings

Create benchmark configuration file:

# Copy template
cp backend/.env.benchmark.template backend/.env.benchmark
 
# Edit for your GPU
nano backend/.env.benchmark

Configuration example for RTX 5080:

# Model settings
MODEL_PATH=/models/Meta-Llama-3.1-8B-Instruct-Q5_K_M.gguf
GPU_LAYERS=33  # Adjust based on your VRAM
N_BATCH=512    # Increase for more VRAM
N_CTX=8192

# Performance settings
FORCE_TORCH_CPU=1  # Required for RTX 50-series
RAG_CACHE__ENABLED=true
RAG_CACHE__REDIS_HOST=redis

# Vector store
VECTOR_STORE=qdrant
QDRANT_HOST=qdrant
QDRANT_PORT=6333

# Benchmark-specific
BENCHMARK_MODE=true
LOG_LEVEL=INFO
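
GPU_LAYERS controls how many transformer layers llama.cpp offloads to the GPU. As a rough sizing aid, here is a small sketch; the ~220MB-per-layer figure for a Q5_K_M 8B model is an assumption to tune against nvidia-smi output, not a measured constant:

# Rough GPU_LAYERS estimator for a Q5_K_M 8B model.
# The 220 MB/layer cost is an assumption; verify with nvidia-smi
# during a test run and adjust.
def estimate_gpu_layers(vram_gb, layer_mb=220, reserve_gb=2.0, total_layers=33):
    """Return how many of the model's layers should fit in VRAM."""
    usable_mb = (vram_gb - reserve_gb) * 1024  # headroom for KV cache etc.
    return min(total_layers, int(usable_mb // layer_mb))

print(estimate_gpu_layers(16))  # e.g. a 16GB card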

Download Required Models

# Create models directory
mkdir -p models
 
# Download Llama 3.1 8B Q5_K_M (5.4GB)
wget -P models/ \
  https://huggingface.co/TheBloke/Meta-Llama-3.1-8B-Instruct-GGUF/resolve/main/Meta-Llama-3.1-8B-Instruct-Q5_K_M.gguf
 
# Verify download
ls -lh models/*.gguf

Step 4: Build Docker Images

Build the multi-stage Docker image with CUDA support:

cd backend
 
# Build with CUDA 12.1 support
docker build -f Dockerfile.atlas -t apollo-backend:benchmark .
 
# Verify build (should be ~9GB)
docker images apollo-backend:benchmark

Expected build time: 10-15 minutes (first build)

Step 5: Start Docker Compose Stack

Launch all services (Qdrant, Redis, Apollo backend):

# Start services in detached mode
docker-compose -f docker-compose.atlas.yml up -d
 
# Verify all containers are healthy
docker-compose -f docker-compose.atlas.yml ps
 
# Check backend logs for initialization
docker-compose -f docker-compose.atlas.yml logs -f atlas-backend

Wait for the initialization message in the logs (“RAG Engine initialized”); startup typically takes 20-30 seconds.

Verify System Health

# Health check
curl http://localhost:8000/api/health
 
# Expected response:
# {
#   "status": "healthy",
#   "components": {
#     "vectorstore": "ready",
#     "llm": "ready",
#     "cache": "ready"
#   }
# }
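
Instead of polling by hand, a small wait-for-ready loop can gate the rest of the run. This sketch uses only the /api/health endpoint and the "status" field shown above:

# Poll /api/health until the backend reports healthy.
import time
import requests

def wait_for_healthy(url="http://localhost:8000/api/health", timeout=300):
    deadline = time.time() + timeout
    while time.time() < deadline:
        try:
            body = requests.get(url, timeout=5).json()
            if body.get("status") == "healthy":
                return body
        except requests.RequestException:
            pass  # backend still starting up
        time.sleep(5)
    raise TimeoutError("backend did not become healthy in time")

print(wait_for_healthy())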

Step 6: Index Benchmark Corpus

Process and index all 100K documents:

# Trigger reindexing via API
curl -X POST http://localhost:8000/api/documents/reindex
 
# Monitor indexing progress (logs)
docker-compose -f docker-compose.atlas.yml logs -f atlas-backend | grep "Processing"

Expected indexing time: 45-90 minutes depending on GPU

Progress indicators:

  • Processed: 1000/100000 documents (1%)
  • Processed: 10000/100000 documents (10%)
  • Indexing complete: 100000 chunks in 87 minutes
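
If you want a time-remaining estimate, these progress lines are easy to parse. A sketch, assuming the exact "Processed: X/Y documents" format shown above:

# Estimate indexing ETA from "Processed: X/Y documents" log lines.
# Usage: docker-compose -f docker-compose.atlas.yml logs -f atlas-backend | python eta.py
import re
import sys
import time

start = time.time()
pattern = re.compile(r"Processed: (\d+)/(\d+) documents")
for line in sys.stdin:
    m = pattern.search(line)
    if not m:
        continue
    done, total = int(m.group(1)), int(m.group(2))
    rate = done / max(time.time() - start, 1)  # docs/sec since we started watching
    eta_min = (total - done) / max(rate, 1e-9) / 60
    print(f"{done}/{total} ({done/total:.0%}), ETA ~{eta_min:.0f} min")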

Step 7: Run Benchmark Suite

Latency Benchmark

Measure P50, P95, P99 latency across query complexity levels:

# Run latency benchmark (5,000 queries)
python scripts/benchmark_latency.py \
  --api-url http://localhost:8000 \
  --queries backend/benchmark_queries.json \
  --iterations 5000 \
  --modes simple,adaptive \
  --output results/latency_results.json
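
P50/P95/P99 are plain percentiles over per-query latencies. They can be recomputed from the raw output with a few lines like the following; the "latencies_ms" key is an assumption about the results file's layout:

# Recompute latency percentiles from raw samples (file format assumed).
import json
import statistics

with open("results/latency_results.json") as f:
    samples_ms = json.load(f)["latencies_ms"]  # assumed key

for p in (50, 95, 99):
    # quantiles(n=100) yields 99 cut points; index p-1 is the p-th percentile
    value = statistics.quantiles(samples_ms, n=100)[p - 1]
    print(f"P{p}: {value:.1f} ms")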

Throughput Benchmark

Test concurrent query handling:

# Run throughput benchmark (50 concurrent users)
python scripts/benchmark_throughput.py \
  --api-url http://localhost:8000 \
  --queries backend/benchmark_queries.json \
  --concurrent-users 50 \
  --duration 600 \
  --output results/throughput_results.json
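
Conceptually, the throughput script is a closed-loop load generator: N workers each issue queries back-to-back for the test duration. A minimal asyncio sketch; the /api/query endpoint path and payload shape are assumptions:

# Minimal closed-loop load generator: 50 workers, each issuing queries
# back-to-back until the duration elapses. Endpoint/payload are assumed.
import asyncio
import json
import time
import aiohttp

async def worker(session, queries, stop_at, counts):
    i = 0
    while time.time() < stop_at:
        q = queries[i % len(queries)]
        async with session.post("http://localhost:8000/api/query",
                                json={"query": q["query"]}) as resp:
            await resp.read()
            counts[0] += 1
        i += 1

async def main(concurrency=50, duration=600):
    with open("backend/benchmark_queries.json") as f:
        queries = json.load(f)
    counts = [0]
    stop_at = time.time() + duration
    async with aiohttp.ClientSession() as session:
        await asyncio.gather(*[worker(session, queries, stop_at, counts)
                               for _ in range(concurrency)])
    print(f"{counts[0] / duration:.1f} queries/second")

asyncio.run(main())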

Accuracy Benchmark

Evaluate retrieval quality with ground truth dataset:

# Run accuracy benchmark (1,000 queries with known answers)
python scripts/benchmark_accuracy.py \
  --api-url http://localhost:8000 \
  --ground-truth backend/ground_truth.json \
  --output results/accuracy_results.json
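
A common way to score context relevance is to count a query as a hit when the retrieved context contains the ground-truth answer. A simplified scorer along those lines; the ground-truth schema and the "contexts" response field are assumptions:

# Simplified relevance scorer: a query counts as a hit if any retrieved
# chunk contains the expected answer string. Schemas are assumptions.
import json
import requests

def score(ground_truth_path="backend/ground_truth.json",
          api="http://localhost:8000/api/query"):
    with open(ground_truth_path) as f:
        items = json.load(f)
    hits = 0
    for item in items:
        resp = requests.post(api, json={"query": item["query"]}).json()
        contexts = resp.get("contexts", [])  # assumed response field
        if any(item["answer"].lower() in c.lower() for c in contexts):
            hits += 1
    return hits / len(items)

print(f"context relevance: {score():.1%}")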

GPU Utilization Monitor

Track GPU metrics during benchmarks:

# Monitor in separate terminal
nvidia-smi dmon -s pucvmet -d 1 > results/gpu_utilization.log
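
dmon writes fixed-width columns behind '#'-prefixed header lines, so average SM utilization can be recovered from the log afterwards. This parser locates the 'sm' column from the header rather than hard-coding its position:

# Average SM utilization from an `nvidia-smi dmon` log.
sm_idx = None
values = []
with open("results/gpu_utilization.log") as f:
    for line in f:
        cols = line.split()
        if line.startswith("#") and "sm" in cols:
            sm_idx = cols.index("sm") - 1  # drop the leading '#' token
        elif sm_idx is not None and not line.startswith("#"):
            try:
                values.append(float(cols[sm_idx]))
            except (IndexError, ValueError):
                pass  # skip '-' placeholders and malformed lines
print(f"avg GPU utilization: {sum(values) / len(values):.0f}%")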

Step 8: Collect Metrics

Export Results

Generate comprehensive benchmark report:

# Consolidate all results
python scripts/generate_report.py \
  --latency results/latency_results.json \
  --throughput results/throughput_results.json \
  --accuracy results/accuracy_results.json \
  --gpu-log results/gpu_utilization.log \
  --output report.html
 
# View report
open report.html  # macOS
xdg-open report.html  # Linux

Key Metrics to Verify

Expected results on NVIDIA A100:

  • P95 Latency: 127ms (simple), 250ms (adaptive)
  • Throughput: 450 queries/second
  • Accuracy: 94.2% context relevance
  • GPU Utilization: 88% average
  • Cache Hit Rate: 60-80%

Step 9: Compare Results

Baseline Comparison

Compare your results against reference benchmarks:

# Run comparison tool
python scripts/compare_benchmarks.py \
  --your-results report.html \
  --baseline benchmarks/reference_a100.json \
  --output comparison.html

Result Interpretation

Results will vary based on hardware. RTX 5080 achieves ~80% of A100 performance, RTX 4090 ~65%, RTX 3090 ~50%.

Performance scaling factors:

  • A100 40GB: 1.0x (baseline)
  • RTX 5090 24GB: 0.9x
  • RTX 5080 16GB: 0.8x
  • RTX 4090 24GB: 0.65x
  • RTX 3090 24GB: 0.50x
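
These factors make quick sanity checks easy: expected latency scales inversely with the factor, and expected throughput scales directly with it. For example, using the A100 numbers from Step 8:

# Derive expected numbers from the scaling table above.
A100_P95_MS = 127   # simple queries, from "Key Metrics to Verify"
A100_QPS = 450

for gpu, factor in {"RTX 5090": 0.9, "RTX 5080": 0.8,
                    "RTX 4090": 0.65, "RTX 3090": 0.50}.items():
    # Latency divides by the factor; throughput multiplies by it.
    print(f"{gpu}: ~{A100_P95_MS / factor:.0f} ms P95, "
          f"~{A100_QPS * factor:.0f} q/s")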

Troubleshooting

Issue: Slow Indexing

Symptom: Indexing takes more than 2 hours

Solution:

# Increase embedding batch size (uses more VRAM, speeds up indexing)
export EMBEDDING_BATCH_SIZE=128
 
# Check CUDA availability
docker exec atlas-backend python -c "import torch; print(torch.cuda.is_available())"

Issue: Out of Memory Errors

Symptom: CUDA OOM during queries

Solution:

# Reduce GPU layers
export GPU_LAYERS=25  # From 33
 
# Reduce batch size
export N_BATCH=256    # From 512
 
# Restart stack
docker-compose -f docker-compose.atlas.yml down
docker-compose -f docker-compose.atlas.yml up -d

Issue: Low GPU Utilization

Symptom: GPU utilization less than 50%

Solution:

# Check if embeddings are on CPU (RTX 50-series workaround)
docker-compose -f docker-compose.atlas.yml logs atlas-backend | grep "FORCE_TORCH_CPU"
 
# Verify CUDA version match
nvidia-smi  # Should show CUDA 12.1+

Issue: Cache Misses

Symptom: Cache hit rate less than 30%

Solution:

# Verify Redis connection
docker-compose -f docker-compose.atlas.yml exec redis redis-cli ping
 
# Check cache TTL settings
curl http://localhost:8000/api/cache/stats
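
If the stats endpoint exposes raw counters, the hit rate is a one-liner to derive; the "hits"/"misses" field names below are assumptions about the response shape:

# Compute cache hit rate from the stats endpoint (field names assumed).
import requests

stats = requests.get("http://localhost:8000/api/cache/stats").json()
hits, misses = stats.get("hits", 0), stats.get("misses", 0)
print(f"hit rate: {hits / max(hits + misses, 1):.1%}")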

Next Steps

Successfully reproduced benchmarks? Share your results with the community! We track performance across different GPU configurations.

Additional Resources

  • Docker Compose Reference: backend/docker-compose.atlas.yml
  • Benchmark Scripts: scripts/benchmark_*.py
  • Configuration Guide: /guides/configuration
  • Performance Tuning: /guides/performance-optimization