Benchmark Methodology

This document outlines the comprehensive methodology used to benchmark Apollo RAG’s performance characteristics, ensuring transparent, reproducible, and fair measurements.

Overview

Apollo’s benchmarking methodology is designed to provide transparent, reproducible performance measurements across key RAG system dimensions: latency, throughput, accuracy, and resource utilization. All benchmarks are conducted in a controlled environment with standardized test data and clearly documented hardware specifications.

Benchmark Objectives

  • Performance Characterization: Establish baseline metrics for query latency, inference speed, and system throughput
  • Optimization Validation: Quantify the impact of architectural optimizations (KV cache preservation, parallel initialization, multi-tier caching)
  • Resource Profiling: Measure GPU, CPU, and memory utilization patterns under various load conditions
  • Accuracy Baselines: Establish retrieval quality and answer relevance metrics

Test Environment

All benchmarks were conducted in a controlled environment with consistent hardware and software configurations.

Hardware Specifications

| Component | Specification | Details |
|---|---|---|
| GPU | NVIDIA RTX 5080 | 16GB VRAM, Blackwell architecture (sm_120) |
| CPU | Intel Core i9-13900K | 24 cores (8P + 16E), 32 threads |
| RAM | 64GB DDR5-6000 | CL36 latency |
| Storage | Samsung 990 Pro 2TB | NVMe Gen 4, 7450 MB/s read |
| OS | Windows 11 Pro (WSL2) | Kernel 5.15.133.1-microsoft-standard-WSL2 |
| Docker | Docker Desktop 4.25.0 | With NVIDIA Container Toolkit |

GPU Note: The RTX 5080 is built on NVIDIA’s Blackwell architecture (compute capability sm_120), which the PyTorch 2.5.1+cu121 build does not support, so embeddings run on the CPU. LLM inference remains fully GPU-accelerated via llama.cpp with CUDA 12.1.

Software Configuration

| Component | Version | Configuration |
|---|---|---|
| Python | 3.11.7 | Slim Bookworm base image |
| FastAPI | 0.115.0 | Uvicorn ASGI server |
| llama.cpp | 0.3.2 | CUDA 12.1 build, 33 GPU layers |
| PyTorch | 2.5.1+cu121 | CPU mode for embeddings |
| Qdrant | 1.15.0 | HNSW index, cosine distance |
| Redis | 7.2-alpine | 8GB max memory, LRU eviction |
| CUDA Toolkit | 12.1 | With cuBLAS, cuDNN |
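
The Qdrant row (HNSW index, cosine distance) corresponds to a collection configured roughly as follows. This is a sketch against the qdrant-client API; the collection name and HNSW parameter values are illustrative, not Apollo’s actual settings:

```python
from qdrant_client import QdrantClient
from qdrant_client.models import Distance, HnswConfigDiff, VectorParams

client = QdrantClient(host="localhost", port=6333)
client.create_collection(
    collection_name="apollo_docs",                       # illustrative name
    vectors_config=VectorParams(size=1024, distance=Distance.COSINE),
    hnsw_config=HnswConfigDiff(m=16, ef_construct=128),  # illustrative HNSW settings
)
```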

Model Configurations

Primary Model (llama.cpp):

  • Model: Meta-Llama-3.1-8B-Instruct
  • Quantization: Q5_K_M (5-bit mixed quantization)
  • Context Window: 8192 tokens
  • GPU Layers: 33 (full offload)
  • Batch Size: 512
  • Temperature: 0.0 (deterministic)
  • VRAM Usage: ~5.4GB
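
These settings map directly onto the loader parameters of the llama-cpp-python bindings listed in the software table. A minimal sketch, with an illustrative model path:

```python
from llama_cpp import Llama

llm = Llama(
    model_path="models/Meta-Llama-3.1-8B-Instruct-Q5_K_M.gguf",  # illustrative path
    n_ctx=8192,       # context window (tokens)
    n_gpu_layers=33,  # full offload to the GPU
    n_batch=512,      # prompt-processing batch size
)

out = llm("Briefly explain retrieval-augmented generation.", temperature=0.0, max_tokens=128)
print(out["choices"][0]["text"])
```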

Embedding Model (CPU):

  • Model: BAAI/bge-large-en-v1.5
  • Dimensions: 1024
  • Device: CPU (PyTorch sm_120 workaround)
  • Batch Size: 64 documents
  • Latency: ~50ms per query
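
A minimal sketch of the CPU embedding path using sentence-transformers; the model and batch size match the configuration above, while the loading code itself is illustrative rather than Apollo’s exact implementation:

```python
from sentence_transformers import SentenceTransformer

embedder = SentenceTransformer("BAAI/bge-large-en-v1.5", device="cpu")  # sm_120 workaround
vectors = embedder.encode(
    ["Compare Python and JavaScript for web development"],
    batch_size=64,              # documented document batch size
    normalize_embeddings=True,  # unit vectors for cosine distance
)
print(vectors.shape)  # (1, 1024)
```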

Reranker Model (GPU):

  • Model: BAAI/bge-reranker-v2-m3
  • Device: CUDA
  • Batch Size: 32 documents
  • Latency: ~60ms for 32 docs (85% faster than LLM reranking)
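
A hedged sketch of GPU cross-encoder reranking via sentence-transformers’ CrossEncoder; the query and candidate documents below are placeholders:

```python
from sentence_transformers import CrossEncoder

reranker = CrossEncoder("BAAI/bge-reranker-v2-m3", device="cuda")

query = "Compare Python and JavaScript for web development"     # placeholder query
candidates = ["Python is a general-purpose language...",        # placeholder documents
              "JavaScript runs natively in the browser..."]

scores = reranker.predict([(query, doc) for doc in candidates], batch_size=32)
reranked = [doc for _, doc in sorted(zip(scores, candidates), reverse=True)]
```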

Data Corpus

Document Collection

The benchmark corpus consists of 100,000 diverse documents spanning multiple domains and complexity levels, designed to simulate real-world RAG workloads.

| Category | Document Count | Average Size | Source |
|---|---|---|---|
| Technical Documentation | 25,000 | 8-15 pages | Software manuals, API docs |
| Research Papers | 20,000 | 12-25 pages | arXiv, academic journals |
| Business Reports | 15,000 | 20-50 pages | Financial statements, quarterly reports |
| Legal Documents | 10,000 | 30-100 pages | Contracts, case law, regulations |
| Knowledge Articles | 30,000 | 2-5 pages | Wikipedia, technical blogs |

Total Corpus Statistics:

  • Total Documents: 100,000
  • Total Chunks: ~1.2 million (avg 1024 tokens, 128 overlap)
  • Index Size: 4.8GB (dense + sparse vectors)
  • Embedding Dimension: 1024 (BGE-large)
  • Vector Database: Qdrant with HNSW indexing
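
The chunking parameters above (1024-token chunks, 128-token overlap) correspond to a simple sliding window over tokenized documents. The sketch below is purely illustrative; Apollo’s actual splitter may differ:

```python
def chunk_tokens(tokens: list[int], size: int = 1024, overlap: int = 128) -> list[list[int]]:
    """Split a token sequence into overlapping windows (illustrative only)."""
    step = size - overlap  # 896 tokens of new content per chunk
    return [tokens[i:i + size] for i in range(0, max(len(tokens) - overlap, 1), step)]

chunks = chunk_tokens(list(range(5000)))
print(len(chunks), len(chunks[0]))  # 6 1024
```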

Query Set

Benchmark Query Collection: 1,000 representative queries across three complexity tiers:

  • Simple Queries (40%): Direct factual questions requiring single-document retrieval
    • Example: “What is the capital of France?”
    • Expected Latency: 8-15 seconds
  • Moderate Queries (40%): Multi-faceted questions requiring hybrid search
    • Example: “Compare Python and JavaScript for web development”
    • Expected Latency: 10-18 seconds
  • Complex Queries (20%): Ambiguous or multi-step reasoning requiring advanced retrieval
    • Example: “What are the implications of quantum computing on current encryption standards?”
    • Expected Latency: 15-25 seconds

Query Diversity: The benchmark query set intentionally includes ambiguous, multi-faceted, and edge-case questions to stress-test retrieval and generation quality, not just measure optimal-path performance.

Metrics Definitions

1. Latency Metrics

End-to-End Query Latency: Total time from API request reception to response completion.

| Metric | Definition | Target | Measurement Method |
|---|---|---|---|
| P50 Latency | Median query time | < 15s (simple mode) | StageTimer per-query tracking |
| P95 Latency | 95th percentile query time | < 25s (adaptive mode) | Histogram distribution |
| P99 Latency | 99th percentile query time | < 30s | Long-tail analysis |
| TTFT | Time to first token | < 500ms | SSE streaming timestamp |

Stage Breakdown:

Total Latency = Security + Cache Lookup + Retrieval + Embedding +
                Vector Search + Reranking + Generation + Confidence Scoring

Example breakdown for cache miss (simple mode):

  • Security Checks: less than 5ms
  • Cache Lookup: 0.86ms (Redis query cache)
  • Query Embedding: 50ms (CPU, L2 cache miss)
  • Vector Search: 100ms (Qdrant HNSW)
  • LLM Generation: 8-12s (80-100 tok/s)
  • Confidence Scoring: 500ms (parallel)
  • Total: ~9-13s
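
Per-stage timings like the breakdown above are attributed to a StageTimer in the latency table. The sketch below shows one way such a timer could be built; the real class’s interface is not documented here:

```python
import time
from contextlib import contextmanager

class StageTimer:
    """Records wall-clock time per pipeline stage, in milliseconds (illustrative)."""

    def __init__(self) -> None:
        self.stages: dict[str, float] = {}

    @contextmanager
    def stage(self, name: str):
        start = time.perf_counter()
        try:
            yield
        finally:
            self.stages[name] = (time.perf_counter() - start) * 1000  # ms

timer = StageTimer()
with timer.stage("vector_search"):
    pass  # Qdrant query would run here
with timer.stage("generation"):
    pass  # llama.cpp inference would run here
print(timer.stages)
```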

2. Throughput Metrics

Inference Throughput: Token generation rate during LLM inference.

| Metric | Definition | Target | Measurement |
|---|---|---|---|
| Tokens/Second | LLM generation speed | 80-100 tok/s | llama.cpp performance counter |
| Queries/Minute | Sustained query rate | 4-6 queries/min | Concurrent load testing |
| Cache Hit Rate | Query cache effectiveness | 60-80% | Redis hit/miss ratio |
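
The cache hit rate can be derived directly from Redis’s own counters. A minimal sketch using redis-py, with host and port assumed for a local Docker setup:

```python
import redis

r = redis.Redis(host="localhost", port=6379)  # assumed local instance
stats = r.info("stats")
hits, misses = stats["keyspace_hits"], stats["keyspace_misses"]
hit_rate = hits / (hits + misses) if (hits + misses) else 0.0
print(f"cache hit rate: {hit_rate:.1%}")
```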

3. Accuracy Metrics

Retrieval Quality: Relevance of retrieved documents to user queries.

| Metric | Definition | Target | Measurement |
|---|---|---|---|
| Recall@K | Proportion of relevant docs in top-K | > 0.90 @ K=3 | Manual relevance judgments |
| Precision@K | Proportion of top-K docs that are relevant | > 0.80 @ K=3 | Human evaluation |
| MRR | Mean Reciprocal Rank of first relevant doc | > 0.85 | Position tracking |
| Confidence Score | System’s self-assessed answer quality | Calibrated | Weighted signal aggregation |

Confidence Scoring Formula:

confidence = 0.30 * retrieval_quality +
             0.40 * answer_relevance +
             0.30 * source_consistency
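
The formula translates one-to-one into code. A small sketch, assuming all three signals are already normalized to [0, 1]:

```python
def confidence_score(retrieval_quality: float, answer_relevance: float,
                     source_consistency: float) -> float:
    """Weighted aggregation of the three confidence signals (weights from the formula above)."""
    return (0.30 * retrieval_quality
            + 0.40 * answer_relevance
            + 0.30 * source_consistency)

print(confidence_score(0.92, 0.85, 0.88))  # 0.88
```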

4. Resource Utilization

| Resource | Metric | Measurement | Target |
|---|---|---|---|
| VRAM | GPU memory usage | nvidia-smi | < 8GB idle, < 12GB peak |
| RAM | System memory | Docker stats | < 12GB idle, < 20GB peak |
| CPU | Processor utilization | top/htop | < 80% avg, single-core bottleneck |
| Disk I/O | Read/write throughput | iostat | < 500 MB/s (NVMe headroom) |
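
VRAM readings come from nvidia-smi. An illustrative polling helper; Apollo’s actual resource profiler may collect this differently:

```python
import subprocess

def gpu_memory_used_mb() -> int:
    """Return current GPU memory usage in MiB for the first GPU, via nvidia-smi."""
    out = subprocess.run(
        ["nvidia-smi", "--query-gpu=memory.used", "--format=csv,noheader,nounits"],
        capture_output=True, text=True, check=True)
    return int(out.stdout.strip().splitlines()[0])

print(f"VRAM in use: {gpu_memory_used_mb()} MiB")
```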

Testing Approach

Benchmark Phases

Phase 1: Cold Start Testing

  • Measure system initialization time (parallel component loading)
  • Target: less than 30 seconds from container start to first query ready
  • Components: Embeddings, LLM, Vector DB, Cache, Reranker

Phase 2: Warm System Testing

  • Execute 100 diverse queries with empty cache
  • Measure baseline latency without cache optimization
  • Record per-stage timing breakdowns

Phase 3: Cache Effectiveness Testing

  • Re-run Phase 2 queries to measure cache hit benefits
  • Target: 98% latency reduction on exact matches (less than 1ms vs 50-100ms)
  • Validate semantic cache matching (0.95 similarity threshold)
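
The semantic-match criterion in Phase 3 reduces to a cosine-similarity threshold on query embeddings. A sketch with illustrative names (not Apollo’s actual cache API):

```python
import numpy as np

def is_semantic_cache_hit(query_vec: np.ndarray, cached_vec: np.ndarray,
                          threshold: float = 0.95) -> bool:
    """Reuse a cached answer only when cosine similarity meets the 0.95 threshold."""
    cos = float(np.dot(query_vec, cached_vec) /
                (np.linalg.norm(query_vec) * np.linalg.norm(cached_vec)))
    return cos >= threshold

v = np.random.rand(1024)
print(is_semantic_cache_hit(v, v))  # True: identical queries always hit
```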

Phase 4: Load Testing

  • Ramp up concurrent users (1 → 5 → 10 → 20)
  • Measure throughput degradation and latency increase
  • Identify bottlenecks (GPU inference, Redis contention)
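
A hedged sketch of the Phase 4 ramp using httpx and asyncio; the endpoint path and payload shape are assumptions, not Apollo’s documented API:

```python
import asyncio
import time

import httpx

BENCHMARK_QUERIES = ["What is the capital of France?"] * 20  # stand-in query set

async def one_query(client: httpx.AsyncClient, q: str) -> float:
    """Send a single query and return its end-to-end latency in seconds."""
    start = time.perf_counter()
    await client.post("http://localhost:8000/query", json={"query": q}, timeout=120.0)
    return time.perf_counter() - start

async def ramp(concurrency: int) -> None:
    """Fire `concurrency` queries at once and report an approximate P50."""
    async with httpx.AsyncClient() as client:
        latencies = sorted(await asyncio.gather(
            *(one_query(client, q) for q in BENCHMARK_QUERIES[:concurrency])))
        print(f"{concurrency} users: p50={latencies[len(latencies) // 2]:.1f}s")

async def main() -> None:
    for users in (1, 5, 10, 20):
        await ramp(users)

# asyncio.run(main())  # requires a running Apollo instance
```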

Phase 5: Optimization Validation

  • A/B test individual optimizations:
    • KV cache preservation: 40-60% speedup validation
    • Parallel initialization: 3.4x startup speedup validation
    • BGE reranker: 85% faster than LLM reranking validation

Reproducibility: All benchmark scripts and configurations are available in the benchmarks/ directory. Docker Compose ensures consistent environment setup across runs.

Baseline Measurements

Startup Performance

| Metric | Baseline (Sequential) | Optimized (Parallel) | Improvement |
|---|---|---|---|
| Startup Time | 78 seconds | 23 seconds | 3.4x faster |
| Component Init | Sequential loading | 3-group parallelization | 55s saved |
| First Query Ready | 80+ seconds | 25-30 seconds | 3x faster |

Parallel Initialization Groups:

  • Group 1 (parallel): embedding_cache, embeddings, bm25_data, llm → ~12s
  • Group 2 (parallel): cache_manager, vectorstore, bm25_retriever, llm_warmup → ~8s
  • Group 3 (parallel): metadata, conversation_memory, confidence_scorer → ~3s
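
The three groups map naturally onto successive asyncio.gather calls, one per dependency tier. The sketch below uses stand-in loaders (asyncio.sleep placeholders); only the component names and grouping mirror the list above:

```python
import asyncio

async def _load(name: str) -> str:
    # Stand-in for a real component loader (model load, DB connect, warm-up, ...).
    await asyncio.sleep(0.1)
    return name

async def init_all() -> None:
    # Group 1 (~12s in production): heavyweight loads run concurrently
    await asyncio.gather(_load("embedding_cache"), _load("embeddings"),
                         _load("bm25_data"), _load("llm"))
    # Group 2 (~8s): components that depend on Group 1
    await asyncio.gather(_load("cache_manager"), _load("vectorstore"),
                         _load("bm25_retriever"), _load("llm_warmup"))
    # Group 3 (~3s): lightweight finalization
    await asyncio.gather(_load("metadata"), _load("conversation_memory"),
                         _load("confidence_scorer"))

asyncio.run(init_all())
```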

Query Latency Baselines

| Mode | Cache Status | Median (P50) | P95 | P99 |
|---|---|---|---|---|
| Simple | Cold (miss) | 12.5s | 15.2s | 18.4s |
| Simple | Warm (hit) | 0.9ms | 1.2ms | 2.1ms |
| Adaptive | Cold (miss) | 18.3s | 24.7s | 29.1s |
| Adaptive | Warm (hit) | 1.1ms | 1.5ms | 2.4ms |

Cache Hit Rate Over Time:

  • Initial 100 queries: 0% hit rate (cold cache)
  • After 500 queries: 45% hit rate
  • Steady state: 60-80% hit rate (depends on query diversity)

Retrieval Accuracy Baselines

| Strategy | Recall@3 | Precision@3 | MRR | Avg Confidence |
|---|---|---|---|---|
| Dense Only | 0.87 | 0.78 | 0.82 | 0.71 |
| Hybrid (Dense + BM25) | 0.92 | 0.85 | 0.88 | 0.78 |
| Advanced (Multi-query + HyDE + Rerank) | 0.94 | 0.89 | 0.91 | 0.83 |

Fairness Considerations

Transparency Commitment: These benchmarks represent real-world performance on a specific hardware configuration. Results will vary based on GPU model, document corpus characteristics, and query complexity distribution.

Hardware Limitations

  • RTX 5080 Constraint: PyTorch sm_120 incompatibility forces CPU embeddings (50ms overhead per query)
  • Single GPU: No multi-GPU benchmarks; results reflect single-device throughput
  • WSL2 Overhead: ~5-10% performance penalty vs native Linux (Docker/WSL2 virtualization)

Benchmark Design Choices

| Choice | Rationale | Impact on Results |
|---|---|---|
| Q5_K_M quantization | Balances speed (80-100 tok/s) and quality | Faster than Q8, slightly lower accuracy than FP16 |
| 8K context window | Standard for Llama 3.1 8B deployments here | Shorter context is faster, but limited for long docs |
| Temperature 0.0 | Deterministic generation | Reproducible but less creative answers |
| Top-K=3 retrieval | Fast, focused retrieval | May miss relevant docs in top-5 or top-10 |

Excluded Scenarios

The following scenarios are not included in baseline benchmarks but may be added in future testing:

  • Multi-modal queries (images + text)
  • Multi-language retrieval (non-English corpora)
  • Streaming vs batch processing (SSE vs synchronous)
  • Multi-turn conversations (10+ exchange history)
  • Cross-document reasoning (requires more than 5 sources)

Limitations

Known Bottlenecks

  • CPU Embeddings: RTX 5080 PyTorch incompatibility adds 50ms per query (embedding generation)
  • Single-threaded Inference: llama.cpp inference is serialized through a single-worker executor (ThreadPoolExecutor with max_workers=1) to keep it thread-safe
  • Redis Single-instance: Distributed caching not tested; production may require Redis Cluster
  • No Multi-GPU: Benchmarks reflect single-GPU throughput; model sharding not implemented

Query Complexity Bias

  • Simple queries (40% of the test set) are over-represented relative to real-world query distributions
  • With only 20% complex reasoning queries, the benchmark may underestimate tail latencies
  • Benchmark queries are pre-vetted; production includes typos, ambiguity, adversarial inputs

Document Corpus Limitations

  • 100K documents is mid-scale; enterprise deployments may exceed 1M+ documents
  • Primarily English-language corpus (no multilingual benchmarking)
  • No code repositories or structured data (CSV, JSON) included

Continuous Improvement: Benchmark methodology is versioned (v4.0) and will be expanded with additional test cases, hardware configurations, and accuracy evaluations in future releases.

Conclusion

Apollo’s benchmark methodology prioritizes transparency, reproducibility, and fairness in performance reporting. By documenting hardware specs, test data, metrics definitions, and known limitations, we aim to provide an honest assessment of RAG system capabilities.

Key Takeaways:

  • Benchmarks conducted on RTX 5080 with 100K document corpus
  • Query latency: 8-15s (simple mode), 10-25s (adaptive mode)
  • Cache optimization: 98% latency reduction on hits (less than 1ms)
  • Inference speed: 80-100 tokens/sec (Q5_K_M quantization)
  • Retrieval accuracy: 0.94 Recall@3 (advanced mode with reranking)

Next Steps

Explore detailed benchmark results and performance analysis in the rest of the Benchmarks section.

For questions about methodology or to request additional benchmarks, please open an issue on GitHub.