# Reproduce Benchmarks
This guide provides detailed instructions to reproduce Apollo RAG’s benchmark results independently. Follow these steps to verify the reported performance metrics on your own hardware.
## Prerequisites
**Hardware Requirements:** Benchmarks require an NVIDIA GPU with CUDA support. Results will vary based on GPU model. Reference benchmarks use an NVIDIA A100 40GB, but any CUDA-capable GPU (RTX 3090, 4090, 5090, A100, H100) will work. A preflight script you can adapt follows the minimum requirements below.
### Minimum Requirements
- GPU: NVIDIA GPU with 8GB+ VRAM (CUDA 12.1+)
- CPU: 8+ cores recommended
- RAM: 32GB minimum, 64GB recommended
- Storage: 100GB free space (NVMe SSD preferred)
- OS: Ubuntu 22.04 / Debian Bookworm / Windows 11 with WSL2
- Software: Docker 24+, Docker Compose 2.20+, Git
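If you want to sanity-check a machine against these minimums before starting, a short preflight script can help. This is a minimal sketch, assuming Python 3 with PyTorch available on the host; the thresholds mirror the list above.

```python
# preflight_check.py — quick host sanity check against the minimums above.
# Minimal sketch: assumes Python 3 with PyTorch installed on the host.
import shutil
import subprocess

def check(label: str, ok: bool, detail: str = "") -> None:
    print(f"[{'OK' if ok else 'FAIL'}] {label} {detail}")

# GPU: CUDA-capable device with 8GB+ VRAM
try:
    import torch
    has_cuda = torch.cuda.is_available()
    vram_gb = torch.cuda.get_device_properties(0).total_memory / 1e9 if has_cuda else 0
    check("CUDA GPU", has_cuda and vram_gb >= 8, f"({vram_gb:.1f} GB VRAM)")
except ImportError:
    check("CUDA GPU", False, "(PyTorch not installed)")

# Storage: 100GB free in the current directory
free_gb = shutil.disk_usage(".").free / 1e9
check("Disk space", free_gb >= 100, f"({free_gb:.0f} GB free)")

# Software: Docker, Docker Compose, and Git on PATH
for tool in (["docker", "--version"], ["docker", "compose", "version"], ["git", "--version"]):
    try:
        out = subprocess.run(tool, capture_output=True, text=True, check=True).stdout.strip()
        check(" ".join(tool), True, f"({out})")
    except (OSError, subprocess.CalledProcessError):
        check(" ".join(tool), False)
```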
### Recommended Configuration
- GPU: NVIDIA A100 40GB or RTX 5090 32GB
- CPU: AMD EPYC 7763 / Intel Xeon Gold 6348 (16+ cores)
- RAM: 128GB DDR4
- Storage: 500GB NVMe SSD
- Network: 10Gbps for dataset downloads
## Step 1: Clone Repository
Clone the Apollo RAG repository with benchmark scripts:
```bash
# Clone main repository
git clone https://github.com/yourusername/apollo-rag.git
cd apollo-rag

# Checkout v4.2 production branch
git checkout v4.2-production

# Verify Docker Compose files
ls backend/docker-compose.atlas.yml
```

## Step 2: Prepare Test Data
### Download 100K Document Corpus
The benchmark uses a standardized corpus of 100,000 documents (15GB):
```bash
# Create data directory
mkdir -p backend/documents/benchmark

# Download benchmark dataset (Wikipedia subset)
wget https://benchmark-data.apollo-rag.io/wiki-100k.tar.gz

# Extract to documents directory
tar -xzf wiki-100k.tar.gz -C backend/documents/benchmark/

# Verify extraction (should show 100,000 files)
ls -1 backend/documents/benchmark/*.txt | wc -l
```

### Generate Synthetic Queries
Create a realistic query workload:
```bash
# Run query generator (creates 10,000 test queries)
python scripts/generate_benchmark_queries.py \
  --corpus backend/documents/benchmark/ \
  --output backend/benchmark_queries.json \
  --count 10000 \
  --complexity-mix simple:50,medium:30,complex:20
```
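Before running benchmarks, it is worth confirming the workload file looks right. The sketch below assumes the generator writes a JSON list of objects with `query` and `complexity` fields; that schema is an assumption for illustration, so adapt the keys to the script's real output.

```python
# inspect_queries.py — sanity-check the generated workload.
# The schema is an assumption: a JSON list of objects with "query"
# and "complexity" fields; adapt to the generator's real output.
import json
from collections import Counter

with open("backend/benchmark_queries.json") as f:
    queries = json.load(f)

print(f"Total queries: {len(queries)}")  # expect 10000

# Verify the mix roughly matches simple:50,medium:30,complex:20
mix = Counter(q["complexity"] for q in queries)
for level, count in mix.most_common():
    print(f"{level}: {count} ({100 * count / len(queries):.0f}%)")
```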
## Step 3: Configure Environment

### Set Hardware-Specific Settings

Create a benchmark configuration file:
```bash
# Copy template
cp backend/.env.benchmark.template backend/.env.benchmark

# Edit for your GPU
nano backend/.env.benchmark
```

Configuration example for RTX 5080:
```bash
# Model settings
MODEL_PATH=/models/Meta-Llama-3.1-8B-Instruct-Q5_K_M.gguf
GPU_LAYERS=33  # Adjust based on your VRAM
N_BATCH=512    # Increase for more VRAM
N_CTX=8192

# Performance settings
FORCE_TORCH_CPU=1  # Required for RTX 50-series
RAG_CACHE__ENABLED=true
RAG_CACHE__REDIS_HOST=redis

# Vector store
VECTOR_STORE=qdrant
QDRANT_HOST=qdrant
QDRANT_PORT=6333

# Benchmark-specific
BENCHMARK_MODE=true
LOG_LEVEL=INFO
```
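To pick sensible `GPU_LAYERS` and `N_CTX` values for your VRAM, a back-of-envelope estimate helps. The sketch below uses the ~5.4GB Q5_K_M file size from the download step and Llama 3.1 8B's architecture (32 transformer layers, 8 KV heads, head dimension 128); real usage also depends on the llama.cpp build, batch size, and CUDA overhead, so treat the output as a rough floor.

```python
# vram_estimate.py — rough sizing for GPU_LAYERS and N_CTX.
# Back-of-envelope only; actual usage depends on the llama.cpp build,
# N_BATCH, and CUDA overhead. Llama 3.1 8B: 32 layers, 8 KV heads,
# head dim 128; Q5_K_M weights are ~5.4 GB (see download step below).
MODEL_GB = 5.4
N_LAYERS = 32
KV_HEADS, HEAD_DIM = 8, 128

def estimate_vram_gb(gpu_layers: int, n_ctx: int) -> float:
    # Fraction of the weights offloaded to the GPU
    # (GPU_LAYERS=33 covers all 32 blocks plus the output layer)
    weights = MODEL_GB * min(gpu_layers, N_LAYERS) / N_LAYERS
    # fp16 KV cache: 2 (K and V) * layers * ctx * kv_heads * head_dim * 2 bytes
    kv_cache = 2 * N_LAYERS * n_ctx * KV_HEADS * HEAD_DIM * 2 / 1e9
    return weights + kv_cache

for layers in (25, 33):
    for ctx in (4096, 8192):
        print(f"GPU_LAYERS={layers}, N_CTX={ctx}: ~{estimate_vram_gb(layers, ctx):.1f} GB")
```

At `GPU_LAYERS=33` and `N_CTX=8192` this works out to roughly 6.5 GB, consistent with the 8GB+ VRAM minimum above.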
### Download Required Models

```bash
# Create models directory
mkdir -p models

# Download Llama 3.1 8B Q5_K_M (5.4GB)
wget -P models/ \
  https://huggingface.co/TheBloke/Meta-Llama-3.1-8B-Instruct-GGUF/resolve/main/Meta-Llama-3.1-8B-Instruct-Q5_K_M.gguf

# Verify download
ls -lh models/*.gguf
```

## Step 4: Build Docker Images
Build the multi-stage Docker image with CUDA support:
```bash
cd backend

# Build with CUDA 12.1 support
docker build -f Dockerfile.atlas -t apollo-backend:benchmark .

# Verify build (should be ~9GB)
docker images apollo-backend:benchmark
```

Expected build time: 10-15 minutes (first build).
## Step 5: Start Docker Compose Stack
Launch all services (Qdrant, Redis, Apollo backend):
```bash
# Start services in detached mode
docker-compose -f docker-compose.atlas.yml up -d

# Verify all containers are healthy
docker-compose -f docker-compose.atlas.yml ps

# Check backend logs for initialization
docker-compose -f docker-compose.atlas.yml logs -f atlas-backend
```

Wait for the "RAG Engine initialized" message in the logs; initialization typically takes 20-30 seconds.
### Verify System Health
```bash
# Health check
curl http://localhost:8000/api/health

# Expected response:
# {
#   "status": "healthy",
#   "components": {
#     "vectorstore": "ready",
#     "llm": "ready",
#     "cache": "ready"
#   }
# }
```
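If you script the setup end to end, it is convenient to block until the stack reports healthy rather than polling by hand. Below is a minimal sketch using only the Python standard library; the endpoint and response shape are taken from the expected output above.

```python
# wait_healthy.py — poll /api/health until the stack reports ready.
# Standard library only; response shape taken from the expected
# output shown above.
import json
import time
import urllib.request

URL = "http://localhost:8000/api/health"
DEADLINE = time.time() + 300  # give up after 5 minutes

while time.time() < DEADLINE:
    try:
        with urllib.request.urlopen(URL, timeout=5) as resp:
            body = json.load(resp)
        if body.get("status") == "healthy":
            print("All components ready:", body.get("components"))
            break
    except OSError:
        pass  # backend not accepting connections yet
    time.sleep(5)
else:
    raise SystemExit("Stack did not become healthy within 5 minutes")
```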
## Step 6: Index Benchmark Corpus

Process and index all 100K documents:
```bash
# Trigger reindexing via API
curl -X POST http://localhost:8000/api/documents/reindex

# Monitor indexing progress (logs)
docker-compose -f docker-compose.atlas.yml logs -f atlas-backend | grep "Processing"
```

Expected indexing time: 45-90 minutes depending on GPU.
Progress indicators (a sketch for tracking these programmatically follows the list):
- Processed: 1000/100000 documents (1%)
- Processed: 10000/100000 documents (10%)
- …
- Indexing complete: 100000 chunks in 87 minutes
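To track these lines programmatically rather than watching the terminal, a small log-following sketch works. The regex assumes the log format shown in the sample lines above; adjust it if your build logs differently.

```python
# index_progress.py — follow compose logs and surface progress lines.
# The regex matches the sample log format above; adapt as needed.
import re
import subprocess

CMD = ["docker-compose", "-f", "docker-compose.atlas.yml", "logs", "-f", "atlas-backend"]
PATTERN = re.compile(r"Processed: (\d+)/(\d+) documents")

proc = subprocess.Popen(CMD, stdout=subprocess.PIPE, text=True)
try:
    for line in proc.stdout:
        m = PATTERN.search(line)
        if m:
            done, total = map(int, m.groups())
            print(f"\r{done}/{total} ({100 * done / total:.1f}%)", end="", flush=True)
        elif "Indexing complete" in line:
            print("\n" + line.strip())
            break
finally:
    proc.terminate()  # stop following the logs
```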
## Step 7: Run Benchmark Suite
### Latency Benchmark

Measure P50, P95, and P99 latency across query complexity levels:
```bash
# Run latency benchmark (5,000 queries)
python scripts/benchmark_latency.py \
  --api-url http://localhost:8000 \
  --queries backend/benchmark_queries.json \
  --iterations 5000 \
  --modes simple,adaptive \
  --output results/latency_results.json
```
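To verify the reported percentiles independently of the report generator, you can recompute them from the raw samples. This sketch assumes `latency_results.json` holds per-query records with a `latency_ms` field; the field name is an assumption, so check the script's actual output.

```python
# percentiles.py — recompute P50/P95/P99 from raw latency samples.
# Assumes latency_results.json holds per-query records with a
# "latency_ms" field; adapt the key to the script's real output.
import json
import statistics

with open("results/latency_results.json") as f:
    records = json.load(f)

samples = sorted(r["latency_ms"] for r in records)

# statistics.quantiles with n=100 yields the 1st..99th percentile cut points
pcts = statistics.quantiles(samples, n=100)
for label, value in (("P50", pcts[49]), ("P95", pcts[94]), ("P99", pcts[98])):
    print(f"{label}: {value:.0f} ms")
```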
### Throughput Benchmark

Test concurrent query handling:
```bash
# Run throughput benchmark (50 concurrent users)
python scripts/benchmark_throughput.py \
  --api-url http://localhost:8000 \
  --queries backend/benchmark_queries.json \
  --concurrent-users 50 \
  --duration 600 \
  --output results/throughput_results.json
```
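For intuition about what the throughput benchmark does, the sketch below shows the shape of a concurrent load test: N simulated users issuing requests in parallel for a fixed duration. The `/api/query` endpoint and request payload are assumptions for illustration (this guide does not show the query API), and `aiohttp` must be installed.

```python
# mini_load_test.py — the shape of a concurrent load test, for illustration.
# The /api/query endpoint and payload are assumptions; check the real
# benchmark script for the actual request format. Requires aiohttp.
import asyncio
import json
import time

import aiohttp

CONCURRENT_USERS = 50
DURATION_S = 60

async def user(session: aiohttp.ClientSession, queries: list, counts: dict) -> None:
    end = time.monotonic() + DURATION_S
    i = 0
    while time.monotonic() < end:
        payload = {"query": queries[i % len(queries)]["query"]}
        async with session.post("http://localhost:8000/api/query", json=payload) as resp:
            await resp.read()
            counts["ok" if resp.status == 200 else "err"] += 1
        i += 1

async def main() -> None:
    with open("backend/benchmark_queries.json") as f:
        queries = json.load(f)
    counts = {"ok": 0, "err": 0}
    async with aiohttp.ClientSession() as session:
        await asyncio.gather(*(user(session, queries, counts) for _ in range(CONCURRENT_USERS)))
    print(f"{counts['ok'] / DURATION_S:.1f} queries/second ({counts['err']} errors)")

asyncio.run(main())
```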
### Accuracy Benchmark

Evaluate retrieval quality with a ground-truth dataset:
```bash
# Run accuracy benchmark (1,000 queries with known answers)
python scripts/benchmark_accuracy.py \
  --api-url http://localhost:8000 \
  --ground-truth backend/ground_truth.json \
  --output results/accuracy_results.json
```
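One common way to score retrieval quality is a hit rate against known-relevant documents: a query counts as a hit if at least one expected document appears in the retrieved context. The sketch below illustrates that calculation; the JSON schemas (`relevant_doc_ids`, `retrieved_doc_ids`) are assumptions, not the actual file formats.

```python
# relevance_sketch.py — one common way to score retrieval quality.
# The schemas are assumed for illustration: each ground-truth entry
# lists the document IDs expected in the retrieved context.
import json

with open("backend/ground_truth.json") as f:
    ground_truth = json.load(f)
with open("results/accuracy_results.json") as f:
    results = json.load(f)

hits = 0
for truth, result in zip(ground_truth, results):
    expected = set(truth["relevant_doc_ids"])
    retrieved = set(result["retrieved_doc_ids"])
    if expected & retrieved:  # at least one relevant document retrieved
        hits += 1

print(f"Context relevance (hit rate): {100 * hits / len(ground_truth):.1f}%")
```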
### GPU Utilization Monitor

Track GPU metrics during benchmarks:
```bash
# Monitor in separate terminal
nvidia-smi dmon -s pucvmet -d 1 > results/gpu_utilization.log
```
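After a run, you can summarize the dmon log without extra tooling. dmon's column layout varies with driver version and the `-s` flags, so this sketch resolves the `sm` (GPU utilization) column from the header line rather than hard-coding an index.

```python
# gpu_log_summary.py — average SM utilization from the dmon log.
# dmon's column layout varies by driver version, so the column index
# is resolved from the header line rather than hard-coded.
with open("results/gpu_utilization.log") as f:
    lines = f.readlines()

# Header lines start with '#'; the first one names the columns
header = next(l for l in lines if l.startswith("#")).lstrip("#").split()
sm_idx = header.index("sm")

samples = []
for line in lines:
    if line.startswith("#"):
        continue  # skip repeated header rows
    fields = line.split()
    if len(fields) > sm_idx and fields[sm_idx].isdigit():
        samples.append(int(fields[sm_idx]))

print(f"Average GPU (SM) utilization: {sum(samples) / len(samples):.1f}%")
```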
## Step 8: Collect Metrics

### Export Results

Generate a comprehensive benchmark report:
```bash
# Consolidate all results
python scripts/generate_report.py \
  --latency results/latency_results.json \
  --throughput results/throughput_results.json \
  --accuracy results/accuracy_results.json \
  --gpu-log results/gpu_utilization.log \
  --output report.html

# View report
open report.html      # macOS
xdg-open report.html  # Linux
```

### Key Metrics to Verify
Expected results on an NVIDIA A100:
- P95 Latency: 127ms (simple), 250ms (adaptive)
- Throughput: 450 queries/second
- Accuracy: 94.2% context relevance
- GPU Utilization: 88% average
- Cache Hit Rate: 60-80%
## Step 9: Compare Results
### Baseline Comparison
Compare your results against reference benchmarks:
```bash
# Run comparison tool
python scripts/compare_benchmarks.py \
  --your-results report.html \
  --baseline benchmarks/reference_a100.json \
  --output comparison.html
```

### Result Interpretation
Results will vary based on hardware: an RTX 5080 achieves roughly 80% of A100 performance, an RTX 4090 about 65%, and an RTX 3090 about 50%.
Performance scaling factors (a quick estimator sketch follows the list):
- A100 40GB: 1.0x (baseline)
- RTX 5090 32GB: 0.9x
- RTX 5080 16GB: 0.8x
- RTX 4090 24GB: 0.65x
- RTX 3090 24GB: 0.50x
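To translate the A100 reference numbers into expectations for your own card, apply the scaling factor directly. The sketch below assumes latency scales roughly inversely with throughput, which is an approximation rather than a measured relationship.

```python
# expected_results.py — scale the A100 reference numbers to your GPU.
# Simple arithmetic on the factors above; treating latency as inversely
# proportional to the scaling factor is an approximation.
A100_BASELINE = {"p95_simple_ms": 127, "p95_adaptive_ms": 250, "throughput_qps": 450}
SCALING = {"A100 40GB": 1.0, "RTX 5090 32GB": 0.9, "RTX 5080 16GB": 0.8,
           "RTX 4090 24GB": 0.65, "RTX 3090 24GB": 0.50}

for gpu, factor in SCALING.items():
    print(f"{gpu}: ~{A100_BASELINE['throughput_qps'] * factor:.0f} qps, "
          f"P95 ~{A100_BASELINE['p95_simple_ms'] / factor:.0f} ms (simple), "
          f"~{A100_BASELINE['p95_adaptive_ms'] / factor:.0f} ms (adaptive)")
```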
## Troubleshooting
### Issue: Slow Indexing
**Symptom:** Indexing takes more than 2 hours

**Solution:**
```bash
# Increase batch size for more VRAM
export EMBEDDING_BATCH_SIZE=128

# Check CUDA availability
docker exec atlas-backend python -c "import torch; print(torch.cuda.is_available())"
```

### Issue: Out of Memory Errors
**Symptom:** CUDA OOM during queries

**Solution:**
```bash
# Reduce GPU layers
export GPU_LAYERS=25  # From 33

# Reduce batch size
export N_BATCH=256  # From 512

# Restart stack
docker-compose -f docker-compose.atlas.yml down
docker-compose -f docker-compose.atlas.yml up -d
```

### Issue: Low GPU Utilization
**Symptom:** GPU utilization less than 50%

**Solution:**
```bash
# Check if embeddings are on CPU (RTX 50-series workaround)
docker-compose -f docker-compose.atlas.yml logs atlas-backend | grep "FORCE_TORCH_CPU"

# Verify CUDA version match
nvidia-smi  # Should show CUDA 12.1+
```

### Issue: Cache Misses
**Symptom:** Cache hit rate less than 30%

**Solution:**
```bash
# Verify Redis connection
docker-compose -f docker-compose.atlas.yml exec redis redis-cli ping

# Check cache TTL settings
curl http://localhost:8000/api/cache/stats
```
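You can also compute the hit rate directly from the stats endpoint above. The response schema (`hits` and `misses` counters) is an assumption; adjust the field names to whatever `/api/cache/stats` actually returns.

```python
# cache_hit_rate.py — compute the hit rate from the stats endpoint above.
# The response schema ("hits"/"misses" fields) is an assumption; adjust
# to the actual payload returned by /api/cache/stats.
import json
import urllib.request

with urllib.request.urlopen("http://localhost:8000/api/cache/stats", timeout=5) as resp:
    stats = json.load(resp)

hits, misses = stats.get("hits", 0), stats.get("misses", 0)
total = hits + misses
if total:
    print(f"Cache hit rate: {100 * hits / total:.1f}% ({hits}/{total})")
else:
    print("No cache traffic recorded yet")
```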
## Next Steps

- Analyze Results: Review the benchmark methodology to understand the metrics
- Optimize Performance: See the configuration guide for tuning tips
- Share Results: Submit your benchmark results via GitHub Discussions
Successfully reproduced benchmarks? Share your results with the community! We track performance across different GPU configurations.
## Additional Resources
- Docker Compose Reference: `backend/docker-compose.atlas.yml`
- Benchmark Scripts: `scripts/benchmark_*.py`
- Configuration Guide: `/guides/configuration`
- Performance Tuning: `/guides/performance-optimization`