GPU Acceleration
Apollo RAG achieves 80-100 tokens/sec inference speeds by leveraging GPU acceleration through CUDA 12.1 and llama.cpp. This page explains how the system utilizes GPU resources for optimal performance.
Why GPU Acceleration Matters
GPU acceleration provides several critical advantages for RAG systems:
- 10x Faster Inference: GPU-accelerated inference achieves 80-100 tok/s vs. 8-12 tok/s on CPU
- Lower Latency: Time to first token (TTFT) reduced from 5-7s to less than 500ms
- Better User Experience: Real-time streaming with 60fps UI updates
- Higher Throughput: Parallel processing of embeddings and reranking
Performance Comparison
| Operation | CPU (Baseline) | GPU (CUDA) | Speedup |
|---|---|---|---|
| LLM Inference | 8-12 tok/s | 80-100 tok/s | 8-10x |
| Time to First Token | 5-7s | less than 500ms | 10-14x |
| Embedding Generation | 50ms | 5ms | 10x |
| BGE Reranking | 400ms | 60ms | 6.7x |
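The figures in the table above can be spot-checked with a short timing harness. The sketch below is illustrative only: it assumes llama-cpp-python built with CUDA and a local GGUF model at the path shown (adjust to your setup), and it approximates token counts by counting streamed chunks.

```python
# Minimal sketch: measure time-to-first-token (TTFT) and decode throughput.
# Assumes llama-cpp-python was built with CUDA and MODEL_PATH exists locally.
import time
from llama_cpp import Llama

MODEL_PATH = "./models/Meta-Llama-3.1-8B-Instruct-Q5_K_M.gguf"  # adjust as needed

llm = Llama(model_path=MODEL_PATH, n_gpu_layers=33, n_ctx=8192, verbose=False)

start = time.perf_counter()
first_token_at = None
n_chunks = 0

# Stream the completion so TTFT can be measured separately from throughput.
for _chunk in llm("Explain retrieval-augmented generation in one paragraph.",
                  max_tokens=256, stream=True):
    if first_token_at is None:
        first_token_at = time.perf_counter()
    n_chunks += 1  # roughly one chunk per generated token

elapsed = time.perf_counter() - start
ttft = first_token_at - start
print(f"TTFT: {ttft * 1000:.0f} ms")
print(f"Decode speed: {n_chunks / (elapsed - ttft):.1f} tok/s (approximate)")
```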
CUDA Integration
Apollo RAG requires CUDA 12.1 for GPU acceleration. The system is optimized for modern NVIDIA GPUs with compute capability 8.0+.
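A quick way to confirm the GPU meets this floor is to query it through PyTorch, which Apollo already uses for embeddings. This is a minimal standalone check, not part of the Apollo codebase:

```python
# Sketch: verify CUDA is visible and the GPU is compute capability 8.0+.
import torch

if not torch.cuda.is_available():
    raise SystemExit("CUDA not available - check the driver and CUDA 12.1 install")

major, minor = torch.cuda.get_device_capability(0)
print(f"{torch.cuda.get_device_name(0)}: compute capability {major}.{minor}")

if (major, minor) < (8, 0):
    raise SystemExit("GPU is below compute capability 8.0 and is not supported")
```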
CUDA Configuration
# backend/_src/llm_engine_llamacpp.py
CUDA_HOME = "/usr/local/cuda"
PATH = "/usr/local/cuda/bin:$PATH"
LD_LIBRARY_PATH = "/usr/lib/wsl/drivers:/usr/local/cuda/lib64"
# CUDA compilation flags
CMAKE_ARGS = "-DGGML_CUDA=ON"
CMAKE_CUDA_ARCHITECTURES = "all-major"
Build Configuration
The Docker image includes CUDA toolkit and libraries:
# backend/Dockerfile.atlas
FROM python:3.11-slim-bookworm AS base
# Install CUDA Toolkit 12.1
RUN apt-get update && apt-get install -y \
gcc-12 g++-12 cmake \
cuda-nvcc-12-1 \
cuda-cudart-dev-12-1 \
libcublas-dev-12-1
# Set CUDA environment variables
ENV CUDA_HOME=/usr/local/cuda \
PATH=/usr/local/cuda/bin:$PATH \
LD_LIBRARY_PATH=/usr/local/cuda/lib64
# Build llama.cpp with CUDA support
RUN CMAKE_ARGS="-DGGML_CUDA=ON" \
    pip install llama-cpp-python==0.3.2
GCC Version Requirement: CUDA 12.1 requires gcc-12/g++-12. Using gcc-13 or higher will cause compilation errors.
llama.cpp GPU Backend
Apollo uses llama.cpp via the llama-cpp-python bindings for direct GPU acceleration, bypassing Ollama’s HTTP overhead.
GPU Offloading Strategy
The system fully offloads model layers to GPU VRAM:
# backend/_src/config.py
LLM_CONFIG = {
    "model_path": "./models/Meta-Llama-3.1-8B-Instruct-Q5_K_M.gguf",
    "n_gpu_layers": 33,     # Full offload for 8B models
    "n_ctx": 8192,
    "n_batch": 512,         # Optimal for RTX 5080
    "use_mlock": True,      # Lock model in RAM
    "use_mmap": True,       # Memory-map model file
    "temperature": 0.0,
    "max_tokens": 512
}
GPU Layer Configuration
The n_gpu_layers parameter controls how many transformer layers are offloaded to GPU:
# backend/_src/llm_engine_llamacpp.py
from llama_cpp import Llama
class LlamaCppEngine:
    def __init__(self, config: LLMConfig):
        self.llm = Llama(
            model_path=config.model_path,
            n_gpu_layers=33,   # Full GPU offload
            n_ctx=8192,
            n_batch=512,
            use_mlock=True,
            use_mmap=True,
            verbose=False
        )
Layer Configuration Guide:
- 0 layers: CPU-only (8-12 tok/s)
- 16 layers: Hybrid (40-50 tok/s, 3GB VRAM)
- 33 layers: Full GPU (80-100 tok/s, 5.4GB VRAM)
Model Size Impact: Larger models require more VRAM. For example, Qwen 2.5 14B Q8 needs ~14.8GB VRAM with 40 GPU layers.
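One way to apply these tiers automatically is to derive the layer count from free VRAM. The helper below is an illustrative sketch, not Apollo code; the per-layer estimate simply divides the 5.4GB full-offload figure for the 8B Q5_K_M model by its 33 layers and should be adjusted for other models and quantizations.

```python
# Hypothetical helper: pick n_gpu_layers from currently free VRAM.
import torch

def suggest_gpu_layers(total_layers: int = 33,
                       gb_per_layer: float = 5.4 / 33,   # rough 8B Q5_K_M estimate
                       headroom_gb: float = 2.0) -> int:
    if not torch.cuda.is_available():
        return 0  # CPU-only fallback
    free_bytes, _total_bytes = torch.cuda.mem_get_info(0)
    usable_gb = free_bytes / 1e9 - headroom_gb
    if usable_gb <= 0:
        return 0
    return min(total_layers, int(usable_gb / gb_per_layer))

print(f"Suggested n_gpu_layers: {suggest_gpu_layers()}")
```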
Performance Characteristics
Throughput Metrics
With RTX 5080 GPU and Q5_K_M quantization:
Target Speed: 80-100 tokens/sec
Time to First Token: less than 500ms
Context Window: 8192 tokens
Batch Size: 512 tokens
VRAM Usage: 5.4GB (8B model)
Latency Breakdown
Query processing with GPU acceleration:
Total Query Time: 8-15 seconds (simple mode)
Breakdown:
├─ Security Checks: 5ms
├─ Cache Lookup: less than 1ms (98% hit rate)
├─ Query Embedding: 5ms (GPU-accelerated BGE)
├─ Vector Search: 100ms (HNSW index)
├─ BGE Reranking: 60ms (GPU)
└─ LLM Generation: 8-12s (80-100 tok/s)
   ├─ TTFT: less than 500ms
   └─ Token Generation: 7-11s
GPU vs CPU Comparison
CPU Configuration (Baseline):
Backend: Ollama HTTP
Speed: 8-12 tok/s
TTFT: 5-7 seconds
Context Window: 8192 tokens
RAM Usage: 8GB
GPU Configuration (Optimized):
Backend: llama.cpp CUDA
Speed: 80-100 tok/s
TTFT: less than 500ms
Context Window: 8192 tokens
VRAM Usage: 5.4GB
Speedup: 8-10x
GPU Memory Management
VRAM Allocation
The system carefully manages VRAM to prevent out-of-memory errors:
# backend/_src/model_manager.py
import torch
import gc
class ModelManager:
    def _unload_current_model(self):
        """Clean unload with VRAM cleanup"""
        if self.current_llm is not None:
            # Step 1: Delete model reference
            del self.current_llm
            self.current_llm = None
            # Step 2: Force garbage collection
            gc.collect()
            # Step 3: Clear CUDA cache
            if torch.cuda.is_available():
                torch.cuda.empty_cache()
                torch.cuda.synchronize()
            # Step 4: Allow VRAM release (500ms)
            import time
            time.sleep(0.5)
VRAM Requirements by Model
| Model | Quantization | GPU Layers | VRAM Required | Speed |
|---|---|---|---|---|
| Llama 3.1 8B | Q5_K_M | 33 | 5.4GB | 80-100 tok/s |
| Llama 3.1 8B | Q8_0 | 33 | 8.2GB | 75-95 tok/s |
| Qwen 2.5 14B | Q5_K_M | 40 | 9.8GB | 60-80 tok/s |
| Qwen 2.5 14B | Q8_0 | 40 | 14.8GB | 40-50 tok/s |
VRAM Headroom: Always maintain 2-3GB of free VRAM for embeddings, reranking, and peak usage during generation.
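A model swap can enforce this headroom rule with a simple pre-load guard. The sketch below hard-codes the table values for illustration; the function and model keys are assumptions, not part of model_manager.py.

```python
# Illustrative pre-load guard: only load a model if its VRAM requirement
# plus 2GB of headroom fits in the VRAM that is currently free.
import torch

VRAM_REQUIRED_GB = {               # values from the table above
    "llama-3.1-8b-q5_k_m": 5.4,
    "llama-3.1-8b-q8_0": 8.2,
    "qwen-2.5-14b-q5_k_m": 9.8,
    "qwen-2.5-14b-q8_0": 14.8,
}
HEADROOM_GB = 2.0

def can_load(model_key: str) -> bool:
    free_bytes, _total = torch.cuda.mem_get_info(0)
    return free_bytes / 1e9 >= VRAM_REQUIRED_GB[model_key] + HEADROOM_GB

if not can_load("qwen-2.5-14b-q8_0"):
    print("Not enough free VRAM - use the 8B model or reduce n_gpu_layers")
```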
Batching Strategies
Optimal Batch Size
The n_batch parameter controls token processing parallelism:
# backend/_src/config.py
BATCH_CONFIG = {
    "n_batch": 512,    # Optimal for RTX 5080
    "n_ubatch": 512    # Micro-batch size
}
Batch Size Guidelines:
- RTX 3080/3090: 256-384 tokens
- RTX 4080/4090: 512 tokens
- RTX 5080/5090: 512-1024 tokens
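When the target GPU is not known in advance, a lookup keyed on the detected card name can apply these guidelines; the mapping below is illustrative only and not part of Apollo's configuration code.

```python
# Sketch: derive n_batch from the detected GPU, following the guidelines above.
import torch

BATCH_BY_SERIES = {
    "RTX 30": 384,   # RTX 3080/3090
    "RTX 40": 512,   # RTX 4080/4090
    "RTX 50": 512,   # RTX 5080/5090 (raise toward 1024 if VRAM allows)
}

def suggest_n_batch() -> int:
    if not torch.cuda.is_available():
        return 128  # conservative CPU-side default
    name = torch.cuda.get_device_name(0)   # e.g. "NVIDIA GeForce RTX 5080"
    for series, n_batch in BATCH_BY_SERIES.items():
        if series in name:
            return n_batch
    return 256  # unknown GPU: safe default

print(f"n_batch = {suggest_n_batch()}")
```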
Concurrent Processing
The system uses thread-safe execution for concurrent requests:
# backend/_src/llm_engine_llamacpp.py
from concurrent.futures import ThreadPoolExecutor
import asyncio
class LlamaCppEngine:
    def __init__(self):
        # Single-thread executor for thread safety
        self.executor = ThreadPoolExecutor(max_workers=1)
        self.semaphore = asyncio.Semaphore(1)

    async def generate_async(self, prompt: str, **kwargs):
        async with self.semaphore:
            # Run blocking llama.cpp call in executor
            loop = asyncio.get_running_loop()
            response = await loop.run_in_executor(
                self.executor,
                self.llm,
                prompt
            )
            return response
Thread Safety: llama.cpp is NOT thread-safe. Apollo uses a single-thread executor with a semaphore to prevent race conditions.
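Because of the semaphore, concurrent callers queue rather than race. A usage sketch (assuming a fully initialized engine with a loaded model, as in the snippet above):

```python
# Sketch: two concurrent requests are serialized by the engine's semaphore,
# so llama.cpp only ever runs one generation at a time.
import asyncio

async def main(engine):  # engine: an initialized LlamaCppEngine
    results = await asyncio.gather(
        engine.generate_async("Summarize document A."),
        engine.generate_async("Summarize document B."),
    )
    for response in results:
        print(response["choices"][0]["text"][:80])

# asyncio.run(main(engine))
```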
Hardware Requirements
Minimum Requirements
GPU: NVIDIA RTX 3060 (12GB VRAM)
CUDA: 12.1 or higher
Compute Capability: 8.0+
Driver: 525.x or higher
RAM: 16GB system memory
Recommended Configuration
GPU: NVIDIA RTX 5080 (16GB VRAM)
CUDA: 12.1
Compute Capability: 12.0 (sm_120)
Driver: 550.x or higher
RAM: 32GB system memory
CPU: 12+ cores for document processing
GPU Compatibility
Supported GPUs (Compute Capability 8.0 or higher):
- RTX 30 Series: 3060, 3070, 3080, 3090
- RTX 40 Series: 4070, 4080, 4090
- RTX 50 Series: RTX 5080, RTX 5090
- A Series: A100, A6000, A5000
Not Supported (Compute Capability less than 8.0):
- GTX 16 Series
- RTX 20 Series
- Quadro RTX 4000/5000
PyTorch Compatibility: The RTX 5080's sm_120 architecture is not supported by PyTorch 2.5.1 builds, so Apollo forces embeddings to run on CPU as a workaround.
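The fallback can be expressed as a small device-selection check. The sketch below compares the GPU's compute capability against the architectures compiled into the installed PyTorch build; it is illustrative only, deliberately conservative (it ignores PTX forward compatibility), and may differ from Apollo's actual logic.

```python
# Sketch: send embeddings to CPU when the installed PyTorch build has no
# kernels for the GPU's architecture (e.g. sm_120 on RTX 5080 + PyTorch 2.5.1).
import torch

def embedding_device() -> str:
    if not torch.cuda.is_available():
        return "cpu"
    major, minor = torch.cuda.get_device_capability(0)
    compiled_archs = torch.cuda.get_arch_list()   # e.g. ['sm_80', 'sm_86', 'sm_90']
    if f"sm_{major}{minor}" not in compiled_archs:
        return "cpu"   # conservative: exact architecture not compiled in
    return "cuda"

print(f"Embeddings device: {embedding_device()}")
```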
Configuration Examples
Development Setup (8B Model)
# backend/.env
LLM_BACKEND=llamacpp
MODEL_PATH=/models/Meta-Llama-3.1-8B-Instruct-Q5_K_M.gguf
GPU_LAYERS=33
N_CTX=8192
N_BATCH=512
USE_MLOCK=true
USE_MMAP=true
Production Setup (14B Model)
# backend/.env
LLM_BACKEND=llamacpp
MODEL_PATH=/models/qwen2.5-14b-instruct-q8_0.gguf
GPU_LAYERS=40
N_CTX=8192
N_BATCH=512
USE_MLOCK=true
USE_MMAP=true
DRAFT_MODEL_PATH=/models/Llama-3.2-1B-Instruct-Q4_K_M.gguf  # Speculative decoding
Hybrid CPU/GPU Setup
# backend/.env
LLM_BACKEND=llamacpp
MODEL_PATH=/models/Meta-Llama-3.1-8B-Instruct-Q5_K_M.gguf
GPU_LAYERS=16 # Half GPU, half CPU
N_CTX=8192
N_BATCH=256
USE_MLOCK=false  # Reduce RAM usage
Monitoring GPU Usage
Real-Time Monitoring
Use nvidia-smi to monitor GPU utilization:
# Watch GPU stats every 1 second
watch -n 1 nvidia-smi
# Query specific GPU metrics
nvidia-smi --query-gpu=utilization.gpu,memory.used,memory.total \
    --format=csv,noheader,nounits
Docker Container Monitoring
# Monitor GPU usage inside container
docker exec atlas-backend nvidia-smi
# Stream GPU stats
docker exec atlas-backend watch -n 1 nvidia-smi
Performance Metrics API
Apollo exposes GPU metrics via the health endpoint:
curl http://localhost:8000/api/health | jq .components.llm
# Response:
{
"status": "ready",
"backend": "llamacpp",
"model": "Meta-Llama-3.1-8B-Instruct-Q5_K_M",
"gpu_layers": 33,
"vram_used": "5.4GB",
"tokens_per_second": 94.2
}
Troubleshooting
CUDA Not Found
Symptom: RuntimeError: CUDA not available
Solutions:
- Verify CUDA installation: `nvcc --version`
- Check GPU detection: `nvidia-smi`
- Verify Docker GPU passthrough: `docker run --gpus all nvidia/cuda:12.1.0-base-ubuntu22.04 nvidia-smi`
- Rebuild llama-cpp-python with CUDA support: `CMAKE_ARGS="-DGGML_CUDA=ON" pip install --force-reinstall llama-cpp-python==0.3.2`
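CUDA visibility can also be confirmed from inside the backend's Python environment. The sketch below assumes a recent llama-cpp-python build, which exposes `llama_supports_gpu_offload` through its low-level bindings; verify the call against your installed version.

```python
# Sketch: confirm both PyTorch and llama.cpp can see the GPU.
import torch
import llama_cpp

print("torch.cuda.is_available():", torch.cuda.is_available())
if torch.cuda.is_available():
    print("GPU:", torch.cuda.get_device_name(0))

# Returns True only if llama-cpp-python was compiled with -DGGML_CUDA=ON
# (availability of this binding depends on the installed version).
print("llama.cpp GPU offload:", llama_cpp.llama_supports_gpu_offload())
```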
Out of Memory Errors
Symptom: torch.cuda.OutOfMemoryError: CUDA out of memory
Solutions:
- Reduce `n_gpu_layers` (e.g., 33 → 24)
- Lower `n_batch` (e.g., 512 → 256)
- Use a smaller model (14B → 8B)
- Use more aggressive quantization (Q8 → Q5 → Q4)
- Clear VRAM: `torch.cuda.empty_cache()`
Slow Inference Speed
Symptom: Less than 40 tokens/sec on GPU
Solutions:
- Verify all layers are on GPU: check that `n_gpu_layers` matches the model's layer count
- Increase batch size: try `n_batch=512` or `1024`
- Enable memory locking: `use_mlock=true`
- Check GPU utilization: it should stay above 90% during generation (see the monitoring sketch below)
- Verify no throttling: check GPU clocks with `nvidia-smi`
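GPU utilization can also be polled programmatically while a query is running. The sketch below uses the NVML bindings (`pip install nvidia-ml-py`), which are not part of Apollo's dependency set.

```python
# Sketch: sample GPU utilization and VRAM once per second for ~10 seconds.
import time
import pynvml

pynvml.nvmlInit()
handle = pynvml.nvmlDeviceGetHandleByIndex(0)

for _ in range(10):
    util = pynvml.nvmlDeviceGetUtilizationRates(handle)
    mem = pynvml.nvmlDeviceGetMemoryInfo(handle)
    print(f"GPU {util.gpu:3d}%  VRAM {mem.used / 1e9:.1f}/{mem.total / 1e9:.1f} GB")
    time.sleep(1)

pynvml.nvmlShutdown()
```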
CUDA Compilation Errors
Symptom: error: no suitable 'sm' architecture found
Solutions:
- Pin the gcc version: `export CC=gcc-12 CXX=g++-12`
- Use a compatible compiler: CUDA 12.1 requires gcc-12 (not gcc-13+)
- Set the architecture explicitly: `CMAKE_CUDA_ARCHITECTURES=all-major`
Next Steps
- Learn about Multi-Level Caching for 98% latency reduction
- Explore Adaptive Retrieval strategies
- Understand Model Management for hot-swapping