Benchmark Methodology
This document outlines the comprehensive methodology used to benchmark Apollo RAG’s performance characteristics, ensuring transparent, reproducible, and fair measurements.
Overview
Apollo’s benchmarking methodology is designed to provide transparent, reproducible performance measurements across key RAG system dimensions: latency, throughput, accuracy, and resource utilization. All benchmarks are conducted in a controlled environment with standardized test data and clearly documented hardware specifications.
Benchmark Objectives
- Performance Characterization: Establish baseline metrics for query latency, inference speed, and system throughput
- Optimization Validation: Quantify the impact of architectural optimizations (KV cache preservation, parallel initialization, multi-tier caching)
- Resource Profiling: Measure GPU, CPU, and memory utilization patterns under various load conditions
- Accuracy Baselines: Establish retrieval quality and answer relevance metrics
Test Environment
All benchmarks were conducted in a controlled environment with consistent hardware and software configurations.
Hardware Specifications
| Component | Specification | Details |
|---|---|---|
| GPU | NVIDIA RTX 5080 | 16GB VRAM, Blackwell architecture |
| CPU | Intel Core i9-13900K | 24 cores (8P + 16E), 32 threads |
| RAM | 64GB DDR5-6000 | CL36 latency |
| Storage | Samsung 990 Pro 2TB | NVMe Gen 4, 7450 MB/s read |
| OS | Windows 11 Pro (WSL2) | Kernel 5.15.133.1-microsoft-standard-WSL2 |
| Docker | Docker Desktop 4.25.0 | With NVIDIA Container Toolkit |
GPU Note: The RTX 5080’s Blackwell sm_120 compute capability is not supported by the installed PyTorch build, so embedding models run on CPU. LLM inference remains fully GPU-accelerated via llama.cpp with CUDA 12.1.
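A minimal sketch of how that fallback can be selected at startup is shown below; the helper name is illustrative rather than Apollo’s actual code:

```python
import torch

def pick_embedding_device() -> str:
    """Illustrative helper: fall back to CPU when the installed PyTorch build
    ships no kernels for the GPU's compute capability (e.g. sm_120)."""
    if not torch.cuda.is_available():
        return "cpu"
    major, minor = torch.cuda.get_device_capability(0)
    if f"sm_{major}{minor}" in torch.cuda.get_arch_list():
        return "cuda"
    return "cpu"  # unsupported arch: embeddings on CPU, LLM stays on GPU via llama.cpp
```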
Software Configuration
| Component | Version | Configuration |
|---|---|---|
| Python | 3.11.7 | Slim Bookworm base image |
| FastAPI | 0.115.0 | Uvicorn ASGI server |
| llama.cpp | 0.3.2 | CUDA 12.1 build, 33 GPU layers |
| PyTorch | 2.5.1+cu121 | CPU mode for embeddings |
| Qdrant | 1.15.0 | HNSW index, cosine distance |
| Redis | 7.2-alpine | 8GB max memory, LRU eviction |
| CUDA Toolkit | 12.1 | With cuBLAS, cuDNN |
Model Configurations
Primary Model (llama.cpp):
- Model: Meta-Llama-3.1-8B-Instruct
- Quantization: Q5_K_M (5-bit mixed quantization)
- Context Window: 8192 tokens
- GPU Layers: 33 (full offload)
- Batch Size: 512
- Temperature: 0.0 (deterministic)
- VRAM Usage: ~5.4GB
Embedding Model (CPU):
- Model: BAAI/bge-large-en-v1.5
- Dimensions: 1024
- Device: CPU (PyTorch sm_120 workaround)
- Batch Size: 64 documents
- Latency: ~50ms per query
Reranker Model (GPU):
- Model: BAAI/bge-reranker-v2-m3
- Device: CUDA
- Batch Size: 32 documents
- Latency: ~60ms for 32 docs (85% faster than LLM reranking)
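For orientation, the three configurations above map onto loader calls roughly as follows. This is a hedged sketch using llama-cpp-python and sentence-transformers; the model path and call shapes mirror the tables, not Apollo’s exact code:

```python
from llama_cpp import Llama
from sentence_transformers import SentenceTransformer, CrossEncoder

# Primary model: full GPU offload, deterministic decoding
llm = Llama(
    model_path="models/Meta-Llama-3.1-8B-Instruct-Q5_K_M.gguf",  # path is illustrative
    n_gpu_layers=33,   # full offload
    n_ctx=8192,        # context window
    n_batch=512,       # batch size
)

# Embedding model: forced onto CPU (sm_120 workaround)
embedder = SentenceTransformer("BAAI/bge-large-en-v1.5", device="cpu")

# Reranker: cross-encoder on CUDA
reranker = CrossEncoder("BAAI/bge-reranker-v2-m3", device="cuda")

# Deterministic generation (temperature 0.0) at query time
out = llm("Answer briefly: what is RAG?", temperature=0.0, max_tokens=128)
```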
Data Corpus
Document Collection
The benchmark corpus consists of 100,000 diverse documents spanning multiple domains and complexity levels, designed to simulate real-world RAG workloads.
| Category | Document Count | Average Size | Source |
|---|---|---|---|
| Technical Documentation | 25,000 | 8-15 pages | Software manuals, API docs |
| Research Papers | 20,000 | 12-25 pages | arXiv, academic journals |
| Business Reports | 15,000 | 20-50 pages | Financial statements, quarterly reports |
| Legal Documents | 10,000 | 30-100 pages | Contracts, case law, regulations |
| Knowledge Articles | 30,000 | 2-5 pages | Wikipedia, technical blogs |
Total Corpus Statistics:
- Total Documents: 100,000
- Total Chunks: ~1.2 million (avg 1024 tokens, 128 overlap)
- Index Size: 4.8GB (dense + sparse vectors)
- Embedding Dimension: 1024 (BGE-large)
- Vector Database: Qdrant with HNSW indexing
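The chunking parameters (1024-token windows, 128-token overlap) reduce to a simple sliding window. The sketch below uses whitespace tokens as a stand-in for the real tokenizer:

```python
def chunk_tokens(tokens: list[str], size: int = 1024, overlap: int = 128) -> list[list[str]]:
    """Split a token sequence into overlapping windows (sketch, not Apollo's splitter)."""
    step = size - overlap
    return [tokens[i:i + size] for i in range(0, max(len(tokens) - overlap, 1), step)]

chunks = chunk_tokens("some long document ...".split())
```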
Query Set
Benchmark Query Collection: 1,000 representative queries across three complexity tiers:
- Simple Queries (40%): Direct factual questions requiring single-document retrieval
  - Example: “What is the capital of France?”
  - Expected Latency: 8-15 seconds
- Moderate Queries (40%): Multi-faceted questions requiring hybrid search
  - Example: “Compare Python and JavaScript for web development”
  - Expected Latency: 10-18 seconds
- Complex Queries (20%): Ambiguous or multi-step reasoning requiring advanced retrieval
  - Example: “What are the implications of quantum computing on current encryption standards?”
  - Expected Latency: 15-25 seconds
Query Diversity: The benchmark query set intentionally includes ambiguous, multi-faceted, and edge-case questions to stress-test retrieval and generation quality, not just measure optimal-path performance.
Metrics Definitions
1. Latency Metrics
End-to-End Query Latency: Total time from API request reception to response completion.
| Metric | Definition | Target | Measurement Method |
|---|---|---|---|
| P50 Latency | Median query time | less than 15s (simple mode) | StageTimer per-query tracking |
| P95 Latency | 95th percentile query time | less than 25s (adaptive mode) | Histogram distribution |
| P99 Latency | 99th percentile query time | less than 30s | Long-tail analysis |
| TTFT | Time to first token | less than 500ms | SSE streaming timestamp |
Stage Breakdown:
Total Latency = Security + Cache Lookup + Retrieval + Embedding +
                Vector Search + Reranking + Generation + Confidence Scoring

Example breakdown for a cache miss (simple mode):
- Security Checks: less than 5ms
- Cache Lookup: 0.86ms (Redis query cache)
- Query Embedding: 50ms (CPU, L2 cache miss)
- Vector Search: 100ms (Qdrant HNSW)
- LLM Generation: 8-12s (80-100 tok/s)
- Confidence Scoring: 500ms (parallel)
- Total: ~9-13s
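The per-stage figures above come from per-query stage timing. The sketch below shows what such a timer can look like; it is illustrative, not the actual StageTimer implementation:

```python
import time
from contextlib import contextmanager

class StageTimer:
    """Collects wall-clock durations per pipeline stage (illustrative sketch)."""
    def __init__(self):
        self.stages: dict[str, float] = {}

    @contextmanager
    def stage(self, name: str):
        start = time.perf_counter()
        try:
            yield
        finally:
            self.stages[name] = (time.perf_counter() - start) * 1000  # milliseconds

timer = StageTimer()
with timer.stage("cache_lookup"):
    ...  # Redis lookup goes here
with timer.stage("vector_search"):
    ...  # Qdrant HNSW query goes here
print(timer.stages)  # e.g. {"cache_lookup": 0.9, "vector_search": 101.3} (values illustrative)
```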
2. Throughput Metrics
Inference Throughput: Token generation rate during LLM inference.
| Metric | Definition | Target | Measurement |
|---|---|---|---|
| Tokens/Second | LLM generation speed | 80-100 tok/s | llama.cpp performance counter |
| Queries/Minute | Sustained query rate | 4-6 queries/min | Concurrent load testing |
| Cache Hit Rate | Query cache effectiveness | 60-80% | Redis hit/miss ratio |
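Cache hit rate can be approximated directly from Redis’ server statistics. A sketch using redis-py, with assumed connection settings:

```python
import redis

r = redis.Redis(host="localhost", port=6379)  # assumed connection settings
stats = r.info("stats")
hits, misses = stats["keyspace_hits"], stats["keyspace_misses"]
hit_rate = hits / (hits + misses) if (hits + misses) else 0.0
print(f"Query cache hit rate: {hit_rate:.1%}")
```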
3. Accuracy Metrics
Retrieval Quality: Relevance of retrieved documents to user queries.
| Metric | Definition | Target | Measurement |
|---|---|---|---|
| Recall@K | Proportion of relevant docs in top-K | more than 0.90 @ K=3 | Manual relevance judgments |
| Precision@K | Proportion of top-K docs that are relevant | more than 0.80 @ K=3 | Human evaluation |
| MRR | Mean Reciprocal Rank of first relevant doc | more than 0.85 | Position tracking |
| Confidence Score | System’s self-assessed answer quality | Calibrated | Weighted signal aggregation |
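Given per-query relevance judgments, the retrieval metrics in the table reduce to a few lines. The functions below are a sketch; `retrieved` and `relevant` are assumed inputs from the human evaluation step:

```python
def recall_at_k(retrieved: list[str], relevant: set[str], k: int = 3) -> float:
    hits = sum(1 for doc_id in retrieved[:k] if doc_id in relevant)
    return hits / len(relevant) if relevant else 0.0

def precision_at_k(retrieved: list[str], relevant: set[str], k: int = 3) -> float:
    hits = sum(1 for doc_id in retrieved[:k] if doc_id in relevant)
    return hits / k

def reciprocal_rank(retrieved: list[str], relevant: set[str]) -> float:
    for rank, doc_id in enumerate(retrieved, start=1):
        if doc_id in relevant:
            return 1.0 / rank
    return 0.0

# MRR = mean of reciprocal_rank over the 1,000-query benchmark set
```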
Confidence Scoring Formula:
confidence = 0.30 * retrieval_quality +
             0.40 * answer_relevance +
             0.30 * source_consistency

4. Resource Utilization
| Resource | Metric | Measurement | Target |
|---|---|---|---|
| VRAM | GPU memory usage | nvidia-smi | less than 8GB @ idle, less than 12GB @ peak |
| RAM | System memory | Docker stats | less than 12GB @ idle, less than 20GB @ peak |
| CPU | Processor utilization | top/htop | less than 80% avg, single core bottleneck |
| Disk I/O | Read/write throughput | iostat | less than 500 MB/s (NVMe headroom) |
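The VRAM and memory figures are sampled with the tools named in the table. A small polling sketch via nvidia-smi and docker stats; the container name is an assumption:

```python
import subprocess

def gpu_memory_mib() -> int:
    """Current GPU memory usage in MiB, as reported by nvidia-smi."""
    out = subprocess.check_output(
        ["nvidia-smi", "--query-gpu=memory.used", "--format=csv,noheader,nounits"]
    )
    return int(out.decode().splitlines()[0])

def container_memory(container: str = "apollo-api") -> str:  # container name is illustrative
    out = subprocess.check_output(
        ["docker", "stats", container, "--no-stream", "--format", "{{.MemUsage}}"]
    )
    return out.decode().strip()

print(gpu_memory_mib(), container_memory())
```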
Testing Approach
Benchmark Phases
Phase 1: Cold Start Testing
- Measure system initialization time (parallel component loading)
- Target: less than 30 seconds from container start to first query ready
- Components: Embeddings, LLM, Vector DB, Cache, Reranker
Phase 2: Warm System Testing
- Execute 100 diverse queries with empty cache
- Measure baseline latency without cache optimization
- Record per-stage timing breakdowns
Phase 3: Cache Effectiveness Testing
- Re-run Phase 2 queries to measure cache hit benefits
- Target: 98% latency reduction on exact matches (less than 1ms vs 50-100ms)
- Validate semantic cache matching (0.95 similarity threshold); a lookup sketch follows below
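A sketch of the semantic-match path exercised in Phase 3, assuming query embeddings are compared with cosine similarity against cached entries; names and storage layout are illustrative, not Apollo’s cache code:

```python
import numpy as np

SIM_THRESHOLD = 0.95  # semantic cache similarity threshold from Phase 3

def semantic_lookup(query_vec: np.ndarray, cached: list[tuple[np.ndarray, str]]) -> str | None:
    """Return a cached answer whose query embedding is close enough, else None."""
    q = query_vec / np.linalg.norm(query_vec)
    best_sim, best_answer = -1.0, None
    for vec, answer in cached:
        sim = float(q @ (vec / np.linalg.norm(vec)))
        if sim > best_sim:
            best_sim, best_answer = sim, answer
    return best_answer if best_sim >= SIM_THRESHOLD else None
```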
Phase 4: Load Testing
- Ramp up concurrent users (1 → 5 → 10 → 20)
- Measure throughput degradation and latency increase
- Identify bottlenecks (GPU inference, Redis contention); see the load-test sketch below
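One way Phase 4’s ramp-up could be driven, using asyncio and httpx; the /query endpoint and payload shape are assumptions, not the documented API:

```python
import asyncio, time
import httpx

async def worker(client: httpx.AsyncClient, query: str) -> float:
    start = time.perf_counter()
    await client.post("http://localhost:8000/query", json={"query": query}, timeout=120)
    return time.perf_counter() - start

async def run_level(concurrency: int, queries: list[str]) -> list[float]:
    async with httpx.AsyncClient() as client:
        tasks = [worker(client, q) for q in queries[:concurrency]]
        return await asyncio.gather(*tasks)

# Ramp: 1 -> 5 -> 10 -> 20 concurrent users
for level in (1, 5, 10, 20):
    latencies = asyncio.run(run_level(level, ["What is RAG?"] * level))
    print(level, sorted(latencies))
```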
Phase 5: Optimization Validation
- A/B test individual optimizations:
  - KV cache preservation: validate the 40-60% speedup
  - Parallel initialization: validate the 3.4x startup speedup
  - BGE reranker: validate the 85% speedup over LLM reranking
Reproducibility: All benchmark scripts and configurations are available in the benchmarks/ directory. Docker Compose ensures consistent environment setup across runs.
Baseline Measurements
Startup Performance
| Metric | Baseline (Sequential) | Optimized (Parallel) | Improvement |
|---|---|---|---|
| Startup Time | 78 seconds | 23 seconds | 3.4x faster |
| Component Init | Sequential loading | 3-group parallelization | 55s saved |
| First Query Ready | 80+ seconds | 25-30 seconds | 3x faster |
Parallel Initialization Groups:
- Group 1 (parallel): embedding_cache, embeddings, bm25_data, llm → ~12s
- Group 2 (parallel): cache_manager, vectorstore, bm25_retriever, llm_warmup → ~8s
- Group 3 (parallel): metadata, conversation_memory, confidence_scorer → ~3s
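The three-group startup amounts to running independent loaders concurrently and joining between groups. Below is a sketch of the pattern with asyncio.gather; the stand-in loader only sleeps, and real loaders would build the actual components (which may equally be dispatched to threads):

```python
import asyncio

async def _load(name: str, seconds: float) -> str:
    """Stand-in loader; real loaders would construct the named component."""
    await asyncio.sleep(seconds)
    return name

async def init_components() -> None:
    # Group 1: heavyweight, mutually independent loads (~12s wall clock)
    await asyncio.gather(*(_load(n, 0.1) for n in
        ("embedding_cache", "embeddings", "bm25_data", "llm")))
    # Group 2: components that build on Group 1 outputs (~8s)
    await asyncio.gather(*(_load(n, 0.1) for n in
        ("cache_manager", "vectorstore", "bm25_retriever", "llm_warmup")))
    # Group 3: lightweight finishing steps (~3s)
    await asyncio.gather(*(_load(n, 0.1) for n in
        ("metadata", "conversation_memory", "confidence_scorer")))

asyncio.run(init_components())
```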
Query Latency Baselines
| Mode | Cache Status | Median (P50) | P95 | P99 |
|---|---|---|---|---|
| Simple | Cold (miss) | 12.5s | 15.2s | 18.4s |
| Simple | Warm (hit) | 0.9ms | 1.2ms | 2.1ms |
| Adaptive | Cold (miss) | 18.3s | 24.7s | 29.1s |
| Adaptive | Warm (hit) | 1.1ms | 1.5ms | 2.4ms |
Cache Hit Rate Over Time:
- Initial 100 queries: 0% hit rate (cold cache)
- After 500 queries: 45% hit rate
- Steady state: 60-80% hit rate (depends on query diversity)
Retrieval Accuracy Baselines
| Strategy | Recall@3 | Precision@3 | MRR | Avg Confidence |
|---|---|---|---|---|
| Dense Only | 0.87 | 0.78 | 0.82 | 0.71 |
| Hybrid (Dense + BM25) | 0.92 | 0.85 | 0.88 | 0.78 |
| Advanced (Multi-query + HyDE + Rerank) | 0.94 | 0.89 | 0.91 | 0.83 |
Fairness Considerations
Transparency Commitment: These benchmarks represent real-world performance on a specific hardware configuration. Results will vary based on GPU model, document corpus characteristics, and query complexity distribution.
Hardware Limitations
- RTX 5080 Constraint: PyTorch sm_120 incompatibility forces CPU embeddings (50ms overhead per query)
- Single GPU: No multi-GPU benchmarks; results reflect single-device throughput
- WSL2 Overhead: ~5-10% performance penalty vs native Linux (Docker/WSL2 virtualization)
Benchmark Design Choices
| Choice | Rationale | Impact on Results |
|---|---|---|
| Q5_K_M Quantization | Balances speed (80-100 tok/s) and quality | Faster than Q8, slightly lower accuracy than FP16 |
| 8K Context Window | Standard for Llama 3.1 8B | Shorter context = faster, but limited for long docs |
| 0.0 Temperature | Deterministic generation | Reproducible but less creative answers |
| Top-K=3 Retrieval | Fast, focused retrieval | May miss relevant docs in top-5 or top-10 |
Excluded Scenarios
The following scenarios are not included in baseline benchmarks but may be added in future testing:
- Multi-modal queries (images + text)
- Multi-language retrieval (non-English corpora)
- Streaming vs batch processing (SSE vs synchronous)
- Multi-turn conversations (10+ exchange history)
- Cross-document reasoning (requires more than 5 sources)
Limitations
Known Bottlenecks
- CPU Embeddings: RTX 5080 PyTorch incompatibility adds 50ms per query (embedding generation)
- Single-threaded Inference: the llama.cpp context is not thread-safe, so generation is serialized through a single-worker executor (ThreadPoolExecutor with max_workers=1); a sketch follows after this list
- Redis Single-instance: Distributed caching not tested; production may require Redis Cluster
- No Multi-GPU: Benchmarks reflect single-GPU throughput; model sharding not implemented
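A sketch of the single-worker executor pattern referenced above; the llm call is a placeholder, not Apollo’s actual generation code:

```python
import asyncio
from concurrent.futures import ThreadPoolExecutor

# One worker: the llama.cpp context is not thread-safe, so all generation
# requests are serialized through this executor.
LLM_EXECUTOR = ThreadPoolExecutor(max_workers=1)

def run_llm(prompt: str) -> str:
    return f"(generated answer for: {prompt})"  # placeholder for the real llm(...) call

async def generate(prompt: str) -> str:
    loop = asyncio.get_running_loop()
    # Blocking llama.cpp call runs in the single worker thread
    return await loop.run_in_executor(LLM_EXECUTOR, run_llm, prompt)
```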
Query Complexity Bias
- Simple queries (40% of test set) are over-represented vs real-world distribution
- Complex reasoning queries (20%) may underestimate tail latencies
- Benchmark queries are pre-vetted; production includes typos, ambiguity, adversarial inputs
Document Corpus Limitations
- 100K documents is mid-scale; enterprise deployments may exceed 1M+ documents
- Primarily English-language corpus (no multilingual benchmarking)
- No code repositories or structured data (CSV, JSON) included
Continuous Improvement: Benchmark methodology is versioned (v4.0) and will be expanded with additional test cases, hardware configurations, and accuracy evaluations in future releases.
Conclusion
Apollo’s benchmark methodology prioritizes transparency, reproducibility, and fairness in performance reporting. By documenting hardware specs, test data, metrics definitions, and known limitations, we aim to provide an honest assessment of RAG system capabilities.
Key Takeaways:
- Benchmarks conducted on RTX 5080 with 100K document corpus
- Query latency: 8-15s (simple mode), 10-25s (adaptive mode)
- Cache optimization: 98% latency reduction on hits (less than 1ms)
- Inference speed: 80-100 tokens/sec (Q5_K_M quantization)
- Retrieval accuracy: 0.94 Recall@3 (advanced mode with reranking)
Next Steps
Explore detailed benchmark results and performance analysis:
- View Results - Comprehensive performance data and visualizations
- Configuration Guide - Learn how to tune Apollo for your workload
- Deployment Guide - Production deployment strategies
For questions about methodology or to request additional benchmarks, please open an issue on GitHub.