Benchmark Methodology
This document outlines the comprehensive methodology used to benchmark Apollo RAG’s performance characteristics, ensuring transparent, reproducible, and fair measurements.
Overview
Apollo’s benchmarking methodology is designed to provide transparent, reproducible performance measurements across key RAG system dimensions: latency, throughput, accuracy, and resource utilization. All benchmarks are conducted in a controlled environment with standardized test data and clearly documented hardware specifications.
Benchmark Objectives
- Performance Characterization: Establish baseline metrics for query latency, inference speed, and system throughput
- Optimization Validation: Quantify the impact of architectural optimizations (KV cache preservation, parallel initialization, multi-tier caching)
- Resource Profiling: Measure GPU, CPU, and memory utilization patterns under various load conditions
- Accuracy Baselines: Establish retrieval quality and answer relevance metrics
Test Environment
All benchmarks were conducted in a controlled environment with consistent hardware and software configurations.
Hardware Specifications
| Component | Specification | Details |
|---|---|---|
| GPU | NVIDIA RTX 5080 | 16GB VRAM, Blackwell architecture |
| CPU | Intel Core i9-13900K | 24 cores (8P + 16E), 32 threads |
| RAM | 64GB DDR5-6000 | CL36 latency |
| Storage | Samsung 990 Pro 2TB | NVMe Gen 4, 7450 MB/s read |
| OS | Windows 11 Pro (WSL2) | Kernel 5.15.133.1-microsoft-standard-WSL2 |
| Docker | Docker Desktop 4.25.0 | With NVIDIA Container Toolkit |
GPU Note: The RTX 5080’s Blackwell sm_120 compute capability is not supported by the installed PyTorch build, so embedding models run on CPU. LLM inference remains fully GPU-accelerated via llama.cpp with CUDA 12.1.
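A minimal sketch of how that fallback can be selected at startup is shown below; the helper name is illustrative rather than Apollo’s actual code:

```python
import torch

def pick_embedding_device() -> str:
    """Illustrative helper: fall back to CPU when the installed PyTorch build
    ships no kernels for the GPU's compute capability (e.g. sm_120)."""
    if not torch.cuda.is_available():
        return "cpu"
    major, minor = torch.cuda.get_device_capability(0)
    if f"sm_{major}{minor}" in torch.cuda.get_arch_list():
        return "cuda"
    return "cpu"  # unsupported arch: embeddings on CPU, LLM stays on GPU via llama.cpp
```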
Software Configuration
| Component | Version | Configuration |
|---|---|---|
| Python | 3.11.7 | Slim Bookworm base image |
| FastAPI | 0.115.0 | Uvicorn ASGI server |
| llama.cpp | 0.3.2 | CUDA 12.1 build, 33 GPU layers |
| PyTorch | 2.5.1+cu121 | CPU mode for embeddings |
| Qdrant | 1.15.0 | HNSW index, cosine distance |
| Redis | 7.2-alpine | 8GB max memory, LRU eviction |
| CUDA Toolkit | 12.1 | With cuBLAS, cuDNN |
Model Configurations
Primary Model (llama.cpp):
- Model: Meta-Llama-3.1-8B-Instruct
- Quantization: Q5_K_M (5-bit mixed quantization)
- Context Window: 8192 tokens
- GPU Layers: 33 (full offload)
- Batch Size: 512
- Temperature: 0.0 (deterministic)
- VRAM Usage: ~5.4GB
Embedding Model (CPU):
- Model: BAAI/bge-large-en-v1.5
- Dimensions: 1024
- Device: CPU (PyTorch sm_120 workaround)
- Batch Size: 64 documents
- Latency: ~50ms per query
Reranker Model (GPU):
- Model: BAAI/bge-reranker-v2-m3
- Device: CUDA
- Batch Size: 32 documents
- Latency: ~60ms for 32 docs (85% faster than LLM reranking)
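For orientation, the three configurations above map onto loader calls roughly as follows. This is a hedged sketch using llama-cpp-python and sentence-transformers; the model path and call shapes mirror the tables, not Apollo’s exact code:

```python
from llama_cpp import Llama
from sentence_transformers import SentenceTransformer, CrossEncoder

# Primary model: full GPU offload, deterministic decoding
llm = Llama(
    model_path="models/Meta-Llama-3.1-8B-Instruct-Q5_K_M.gguf",  # path is illustrative
    n_gpu_layers=33,   # full offload
    n_ctx=8192,        # context window
    n_batch=512,       # batch size
)

# Embedding model: forced onto CPU (sm_120 workaround)
embedder = SentenceTransformer("BAAI/bge-large-en-v1.5", device="cpu")

# Reranker: cross-encoder on CUDA
reranker = CrossEncoder("BAAI/bge-reranker-v2-m3", device="cuda")

# Deterministic generation (temperature 0.0) at query time
out = llm("Answer briefly: what is RAG?", temperature=0.0, max_tokens=128)
```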
Data Corpus
Document Collection
The benchmark corpus consists of 100,000 diverse documents spanning multiple domains and complexity levels, designed to simulate real-world RAG workloads.
| Category | Document Count | Average Size | Source |
|---|---|---|---|
| Technical Documentation | 25,000 | 8-15 pages | Software manuals, API docs |
| Research Papers | 20,000 | 12-25 pages | arXiv, academic journals |
| Business Reports | 15,000 | 20-50 pages | Financial statements, quarterly reports |
| Legal Documents | 10,000 | 30-100 pages | Contracts, case law, regulations |
| Knowledge Articles | 30,000 | 2-5 pages | Wikipedia, technical blogs |
Total Corpus Statistics:
- Total Documents: 100,000
- Total Chunks: ~1.2 million (avg 1024 tokens, 128 overlap)
- Index Size: 4.8GB (dense + sparse vectors)
- Embedding Dimension: 1024 (BGE-large)
- Vector Database: Qdrant with HNSW indexing
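The chunking parameters (1024-token windows, 128-token overlap) reduce to a simple sliding window. The sketch below uses whitespace tokens as a stand-in for the real tokenizer:

```python
def chunk_tokens(tokens: list[str], size: int = 1024, overlap: int = 128) -> list[list[str]]:
    """Split a token sequence into overlapping windows (sketch, not Apollo's splitter)."""
    step = size - overlap
    return [tokens[i:i + size] for i in range(0, max(len(tokens) - overlap, 1), step)]

chunks = chunk_tokens("some long document ...".split())
```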
Query Set
Benchmark Query Collection: 1,000 representative queries across three complexity tiers:
- Simple Queries (40%): Direct factual questions requiring single-document retrieval
  - Example: “What is the capital of France?”
  - Expected Latency: 8-15 seconds
- Moderate Queries (40%): Multi-faceted questions requiring hybrid search
  - Example: “Compare Python and JavaScript for web development”
  - Expected Latency: 10-18 seconds
- Complex Queries (20%): Ambiguous or multi-step reasoning requiring advanced retrieval
  - Example: “What are the implications of quantum computing on current encryption standards?”
  - Expected Latency: 15-25 seconds
Query Diversity: The benchmark query set intentionally includes ambiguous, multi-faceted, and edge-case questions to stress-test retrieval and generation quality, not just measure optimal-path performance.
Metrics Definitions
1. Latency Metrics
End-to-End Query Latency: Total time from API request reception to response completion.
| Metric | Definition | Target | Measurement Method |
|---|---|---|---|
| P50 Latency | Median query time | less than 15s (simple mode) | StageTimer per-query tracking |
| P95 Latency | 95th percentile query time | less than 25s (adaptive mode) | Histogram distribution |
| P99 Latency | 99th percentile query time | less than 30s | Long-tail analysis |
| TTFT | Time to first token | less than 500ms | SSE streaming timestamp |
Stage Breakdown:
Total Latency = Security + Cache Lookup + Retrieval + Embedding +
                Vector Search + Reranking + Generation + Confidence Scoring

Example breakdown for a cache miss (simple mode):
- Security Checks: less than 5ms
- Cache Lookup: 0.86ms (Redis query cache)
- Query Embedding: 50ms (CPU, L2 cache miss)
- Vector Search: 100ms (Qdrant HNSW)
- LLM Generation: 8-12s (80-100 tok/s)
- Confidence Scoring: 500ms (parallel)
- Total: ~9-13s
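The per-stage figures above come from per-query stage timing. The sketch below shows what such a timer can look like; it is illustrative, not the actual StageTimer implementation:

```python
import time
from contextlib import contextmanager

class StageTimer:
    """Collects wall-clock durations per pipeline stage (illustrative sketch)."""
    def __init__(self):
        self.stages: dict[str, float] = {}

    @contextmanager
    def stage(self, name: str):
        start = time.perf_counter()
        try:
            yield
        finally:
            self.stages[name] = (time.perf_counter() - start) * 1000  # milliseconds

timer = StageTimer()
with timer.stage("cache_lookup"):
    ...  # Redis lookup goes here
with timer.stage("vector_search"):
    ...  # Qdrant HNSW query goes here
print(timer.stages)  # e.g. {"cache_lookup": 0.9, "vector_search": 101.3} (values illustrative)
```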
2. Throughput Metrics
Inference Throughput: Token generation rate during LLM inference.
| Metric | Definition | Target | Measurement |
|---|---|---|---|
| Tokens/Second | LLM generation speed | 80-100 tok/s | llama.cpp performance counter |
| Queries/Minute | Sustained query rate | 4-6 queries/min | Concurrent load testing |
| Cache Hit Rate | Query cache effectiveness | 60-80% | Redis hit/miss ratio |
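Cache hit rate can be approximated directly from Redis’ server statistics. A sketch using redis-py, with assumed connection settings:

```python
import redis

r = redis.Redis(host="localhost", port=6379)  # assumed connection settings
stats = r.info("stats")
hits, misses = stats["keyspace_hits"], stats["keyspace_misses"]
hit_rate = hits / (hits + misses) if (hits + misses) else 0.0
print(f"Query cache hit rate: {hit_rate:.1%}")
```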
3. Accuracy Metrics
Retrieval Quality: Relevance of retrieved documents to user queries.
| Metric | Definition | Target | Measurement |
|---|---|---|---|
| Recall@K | Proportion of relevant docs in top-K | more than 0.90 @ K=3 | Manual relevance judgments |
| Precision@K | Proportion of top-K docs that are relevant | more than 0.80 @ K=3 | Human evaluation |
| MRR | Mean Reciprocal Rank of first relevant doc | more than 0.85 | Position tracking |
| Confidence Score | System’s self-assessed answer quality | Calibrated | Weighted signal aggregation |
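Given per-query relevance judgments, the retrieval metrics in the table reduce to a few lines. The functions below are a sketch; `retrieved` and `relevant` are assumed inputs from the human evaluation step:

```python
def recall_at_k(retrieved: list[str], relevant: set[str], k: int = 3) -> float:
    hits = sum(1 for doc_id in retrieved[:k] if doc_id in relevant)
    return hits / len(relevant) if relevant else 0.0

def precision_at_k(retrieved: list[str], relevant: set[str], k: int = 3) -> float:
    hits = sum(1 for doc_id in retrieved[:k] if doc_id in relevant)
    return hits / k

def reciprocal_rank(retrieved: list[str], relevant: set[str]) -> float:
    for rank, doc_id in enumerate(retrieved, start=1):
        if doc_id in relevant:
            return 1.0 / rank
    return 0.0

# MRR = mean of reciprocal_rank over the 1,000-query benchmark set
```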
Confidence Scoring Formula:
confidence = 0.30 * retrieval_quality +
             0.40 * answer_relevance +
             0.30 * source_consistency

4. Resource Utilization
| Resource | Metric | Measurement | Target |
|---|---|---|---|
| VRAM | GPU memory usage | nvidia-smi | less than 8GB @ idle, less than 12GB @ peak |
| RAM | System memory | Docker stats | less than 12GB @ idle, less than 20GB @ peak |
| CPU | Processor utilization | top/htop | less than 80% avg, single core bottleneck |
| Disk I/O | Read/write throughput | iostat | less than 500 MB/s (NVMe headroom) |
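The VRAM and memory figures are sampled with the tools named in the table. A small polling sketch via nvidia-smi and docker stats; the container name is an assumption:

```python
import subprocess

def gpu_memory_mib() -> int:
    """Current GPU memory usage in MiB, as reported by nvidia-smi."""
    out = subprocess.check_output(
        ["nvidia-smi", "--query-gpu=memory.used", "--format=csv,noheader,nounits"]
    )
    return int(out.decode().splitlines()[0])

def container_memory(container: str = "apollo-api") -> str:  # container name is illustrative
    out = subprocess.check_output(
        ["docker", "stats", container, "--no-stream", "--format", "{{.MemUsage}}"]
    )
    return out.decode().strip()

print(gpu_memory_mib(), container_memory())
```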
Testing Approach
Benchmark Phases
Phase 1: Cold Start Testing
- Measure system initialization time (parallel component loading)
- Target: less than 30 seconds from container start to first query ready
- Components: Embeddings, LLM, Vector DB, Cache, Reranker
Phase 2: Warm System Testing
- Execute 100 diverse queries with empty cache
- Measure baseline latency without cache optimization
- Record per-stage timing breakdowns
Phase 3: Cache Effectiveness Testing
- Re-run Phase 2 queries to measure cache hit benefits
- Target: 98% latency reduction on exact matches (less than 1ms vs 50-100ms)
- Validate semantic cache matching (0.95 similarity threshold); a lookup sketch follows below
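A sketch of the semantic-match path exercised in Phase 3, assuming query embeddings are compared with cosine similarity against cached entries; names and storage layout are illustrative, not Apollo’s cache code:

```python
import numpy as np

SIM_THRESHOLD = 0.95  # semantic cache similarity threshold from Phase 3

def semantic_lookup(query_vec: np.ndarray, cached: list[tuple[np.ndarray, str]]) -> str | None:
    """Return a cached answer whose query embedding is close enough, else None."""
    q = query_vec / np.linalg.norm(query_vec)
    best_sim, best_answer = -1.0, None
    for vec, answer in cached:
        sim = float(q @ (vec / np.linalg.norm(vec)))
        if sim > best_sim:
            best_sim, best_answer = sim, answer
    return best_answer if best_sim >= SIM_THRESHOLD else None
```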
Phase 4: Load Testing
- Ramp up concurrent users (1 → 5 → 10 → 20)
- Measure throughput degradation and latency increase
- Identify bottlenecks (GPU inference, Redis contention); see the load-test sketch below
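One way Phase 4’s ramp-up could be driven, using asyncio and httpx; the /query endpoint and payload shape are assumptions, not the documented API:

```python
import asyncio, time
import httpx

async def worker(client: httpx.AsyncClient, query: str) -> float:
    start = time.perf_counter()
    await client.post("http://localhost:8000/query", json={"query": query}, timeout=120)
    return time.perf_counter() - start

async def run_level(concurrency: int, queries: list[str]) -> list[float]:
    async with httpx.AsyncClient() as client:
        tasks = [worker(client, q) for q in queries[:concurrency]]
        return await asyncio.gather(*tasks)

# Ramp: 1 -> 5 -> 10 -> 20 concurrent users
for level in (1, 5, 10, 20):
    latencies = asyncio.run(run_level(level, ["What is RAG?"] * level))
    print(level, sorted(latencies))
```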
Phase 5: Optimization Validation
- A/B test individual optimizations:
  - KV cache preservation: validate the 40-60% speedup
  - Parallel initialization: validate the 3.4x startup speedup
  - BGE reranker: validate the 85% speedup over LLM reranking
Reproducibility: All benchmark scripts and configurations are available in the benchmarks/ directory. Docker Compose ensures consistent environment setup across runs.
Baseline Measurements
Startup Performance
| Metric | Baseline (Sequential) | Optimized (Parallel) | Improvement |
|---|---|---|---|
| Startup Time | 78 seconds | 23 seconds | 3.4x faster |
| Component Init | Sequential loading | 3-group parallelization | 55s saved |
| First Query Ready | 80+ seconds | 25-30 seconds | 3x faster |
Parallel Initialization Groups:
- Group 1 (parallel): embedding_cache, embeddings, bm25_data, llm → ~12s
- Group 2 (parallel): cache_manager, vectorstore, bm25_retriever, llm_warmup → ~8s
- Group 3 (parallel): metadata, conversation_memory, confidence_scorer → ~3s
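The three-group startup amounts to running independent loaders concurrently and joining between groups. Below is a sketch of the pattern with asyncio.gather; the stand-in loader only sleeps, and real loaders would build the actual components (which may equally be dispatched to threads):

```python
import asyncio

async def _load(name: str, seconds: float) -> str:
    """Stand-in loader; real loaders would construct the named component."""
    await asyncio.sleep(seconds)
    return name

async def init_components() -> None:
    # Group 1: heavyweight, mutually independent loads (~12s wall clock)
    await asyncio.gather(*(_load(n, 0.1) for n in
        ("embedding_cache", "embeddings", "bm25_data", "llm")))
    # Group 2: components that build on Group 1 outputs (~8s)
    await asyncio.gather(*(_load(n, 0.1) for n in
        ("cache_manager", "vectorstore", "bm25_retriever", "llm_warmup")))
    # Group 3: lightweight finishing steps (~3s)
    await asyncio.gather(*(_load(n, 0.1) for n in
        ("metadata", "conversation_memory", "confidence_scorer")))

asyncio.run(init_components())
```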
Query Latency Baselines
| Mode | Cache Status | Median (P50) | P95 | P99 |
|---|---|---|---|---|
| Simple | Cold (miss) | 12.5s | 15.2s | 18.4s |
| Simple | Warm (hit) | 0.9ms | 1.2ms | 2.1ms |
| Adaptive | Cold (miss) | 18.3s | 24.7s | 29.1s |
| Adaptive | Warm (hit) | 1.1ms | 1.5ms | 2.4ms |
Cache Hit Rate Over Time:
- Initial 100 queries: 0% hit rate (cold cache)
- After 500 queries: 45% hit rate
- Steady state: 60-80% hit rate (depends on query diversity)
Retrieval Accuracy Baselines
| Strategy | Recall@3 | Precision@3 | MRR | Avg Confidence |
|---|---|---|---|---|
| Dense Only | 0.87 | 0.78 | 0.82 | 0.71 |
| Hybrid (Dense + BM25) | 0.92 | 0.85 | 0.88 | 0.78 |
| Advanced (Multi-query + HyDE + Rerank) | 0.94 | 0.89 | 0.91 | 0.83 |
Fairness Considerations
Transparency Commitment: These benchmarks represent real-world performance on a specific hardware configuration. Results will vary based on GPU model, document corpus characteristics, and query complexity distribution.
Hardware Limitations
- RTX 5080 Constraint: PyTorch sm_120 incompatibility forces CPU embeddings (50ms overhead per query)
- Single GPU: No multi-GPU benchmarks; results reflect single-device throughput
- WSL2 Overhead: ~5-10% performance penalty vs native Linux (Docker/WSL2 virtualization)
Benchmark Design Choices
| Choice | Rationale | Impact on Results |
|---|---|---|
| Q5_K_M Quantization | Balances speed (80-100 tok/s) and quality | Faster than Q8, slightly lower accuracy than FP16 |
| 8K Context Window | Standard for Llama 3.1 8B | Shorter context = faster, but limited for long docs |
| 0.0 Temperature | Deterministic generation | Reproducible but less creative answers |
| Top-K=3 Retrieval | Fast, focused retrieval | May miss relevant docs in top-5 or top-10 |
Excluded Scenarios
The following scenarios are not included in baseline benchmarks but may be added in future testing:
- Multi-modal queries (images + text)
- Multi-language retrieval (non-English corpora)
- Streaming vs batch processing (SSE vs synchronous)
- Multi-turn conversations (10+ exchange history)
- Cross-document reasoning (requires more than 5 sources)
Limitations
Known Bottlenecks
- CPU Embeddings: RTX 5080 PyTorch incompatibility adds 50ms per query (embedding generation)
- Single-threaded Inference: the llama.cpp context is not thread-safe, so generation is serialized through a single-worker executor (ThreadPoolExecutor with max_workers=1); a sketch follows after this list
- Redis Single-instance: Distributed caching not tested; production may require Redis Cluster
- No Multi-GPU: Benchmarks reflect single-GPU throughput; model sharding not implemented
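A sketch of the single-worker executor pattern referenced above; the llm call is a placeholder, not Apollo’s actual generation code:

```python
import asyncio
from concurrent.futures import ThreadPoolExecutor

# One worker: the llama.cpp context is not thread-safe, so all generation
# requests are serialized through this executor.
LLM_EXECUTOR = ThreadPoolExecutor(max_workers=1)

def run_llm(prompt: str) -> str:
    return f"(generated answer for: {prompt})"  # placeholder for the real llm(...) call

async def generate(prompt: str) -> str:
    loop = asyncio.get_running_loop()
    # Blocking llama.cpp call runs in the single worker thread
    return await loop.run_in_executor(LLM_EXECUTOR, run_llm, prompt)
```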
Query Complexity Bias
- Simple queries (40% of test set) are over-represented vs real-world distribution
- Complex reasoning queries (20%) may underestimate tail latencies
- Benchmark queries are pre-vetted; production includes typos, ambiguity, adversarial inputs
Document Corpus Limitations
- 100K documents is mid-scale; enterprise deployments may exceed 1M+ documents
- Primarily English-language corpus (no multilingual benchmarking)
- No code repositories or structured data (CSV, JSON) included
Continuous Improvement: Benchmark methodology is versioned (v4.0) and will be expanded with additional test cases, hardware configurations, and accuracy evaluations in future releases.
Conclusion
Apollo’s benchmark methodology prioritizes transparency, reproducibility, and fairness in performance reporting. By documenting hardware specs, test data, metrics definitions, and known limitations, we aim to provide an honest assessment of RAG system capabilities.
Key Takeaways:
- Benchmarks conducted on RTX 5080 with 100K document corpus
- Query latency: 8-15s (simple mode), 10-25s (adaptive mode)
- Cache optimization: 98% latency reduction on hits (less than 1ms)
- Inference speed: 80-100 tokens/sec (Q5_K_M quantization)
- Retrieval accuracy: 0.94 Recall@3 (advanced mode with reranking)
Next Steps
Explore detailed benchmark results and performance analysis:
- View Results - Comprehensive performance data and visualizations
- Configuration Guide - Learn how to tune Apollo for your workload
- Deployment Guide - Production deployment strategies
For questions about methodology or to request additional benchmarks, please open an issue on GitHub.