GPU Acceleration
Apollo RAG achieves 80-100 tokens/sec inference speeds by leveraging GPU acceleration through CUDA 12.1 and llama.cpp. This page explains how the system utilizes GPU resources for optimal performance.
Why GPU Acceleration Matters
GPU acceleration provides several critical advantages for RAG systems:
- 10x Faster Inference: GPU-accelerated inference achieves 80-100 tok/s vs. 8-12 tok/s on CPU
- Lower Latency: Time to first token (TTFT) reduced from 5-7s to less than 500ms
- Better User Experience: Real-time streaming with 60fps UI updates
- Higher Throughput: Parallel processing of embeddings and reranking
Performance Comparison
| Operation | CPU (Baseline) | GPU (CUDA) | Speedup |
|---|---|---|---|
| LLM Inference | 8-12 tok/s | 80-100 tok/s | 8-10x |
| Time to First Token | 5-7s | less than 500ms | 10-14x |
| Embedding Generation | 50ms | 5ms | 10x |
| BGE Reranking | 400ms | 60ms | 6.7x |
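The figures in the table above can be spot-checked with a short timing harness. The sketch below is illustrative only: it assumes llama-cpp-python built with CUDA and a local GGUF model at the path shown (adjust to your setup), and it approximates token counts by counting streamed chunks.

```python
# Minimal sketch: measure time-to-first-token (TTFT) and decode throughput.
# Assumes llama-cpp-python was built with CUDA and MODEL_PATH exists locally.
import time
from llama_cpp import Llama

MODEL_PATH = "./models/Meta-Llama-3.1-8B-Instruct-Q5_K_M.gguf"  # adjust as needed

llm = Llama(model_path=MODEL_PATH, n_gpu_layers=33, n_ctx=8192, verbose=False)

start = time.perf_counter()
first_token_at = None
n_chunks = 0

# Stream the completion so TTFT can be measured separately from throughput.
for _chunk in llm("Explain retrieval-augmented generation in one paragraph.",
                  max_tokens=256, stream=True):
    if first_token_at is None:
        first_token_at = time.perf_counter()
    n_chunks += 1  # roughly one chunk per generated token

elapsed = time.perf_counter() - start
ttft = first_token_at - start
print(f"TTFT: {ttft * 1000:.0f} ms")
print(f"Decode speed: {n_chunks / (elapsed - ttft):.1f} tok/s (approximate)")
```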
CUDA Integration
Apollo RAG requires CUDA 12.1 for GPU acceleration. The system is optimized for modern NVIDIA GPUs with compute capability 8.0+.
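A quick way to confirm the GPU meets this floor is to query it through PyTorch, which Apollo already uses for embeddings. This is a minimal standalone check, not part of the Apollo codebase:

```python
# Sketch: verify CUDA is visible and the GPU is compute capability 8.0+.
import torch

if not torch.cuda.is_available():
    raise SystemExit("CUDA not available - check the driver and CUDA 12.1 install")

major, minor = torch.cuda.get_device_capability(0)
print(f"{torch.cuda.get_device_name(0)}: compute capability {major}.{minor}")

if (major, minor) < (8, 0):
    raise SystemExit("GPU is below compute capability 8.0 and is not supported")
```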
CUDA Configuration
# backend/_src/llm_engine_llamacpp.py
CUDA_HOME = "/usr/local/cuda"
PATH = "/usr/local/cuda/bin:$PATH"
LD_LIBRARY_PATH = "/usr/lib/wsl/drivers:/usr/local/cuda/lib64"
# CUDA compilation flags
CMAKE_ARGS = "-DGGML_CUDA=ON"
CMAKE_CUDA_ARCHITECTURES = "all-major"
Build Configuration
The Docker image includes CUDA toolkit and libraries:
# backend/Dockerfile.atlas
FROM python:3.11-slim-bookworm AS base
# Install CUDA Toolkit 12.1
RUN apt-get update && apt-get install -y \
gcc-12 g++-12 cmake \
cuda-nvcc-12-1 \
cuda-cudart-dev-12-1 \
libcublas-dev-12-1
# Set CUDA environment variables
ENV CUDA_HOME=/usr/local/cuda \
PATH=/usr/local/cuda/bin:$PATH \
LD_LIBRARY_PATH=/usr/local/cuda/lib64
# Build llama.cpp with CUDA support
RUN CMAKE_ARGS="-DGGML_CUDA=ON" \
    pip install llama-cpp-python==0.3.2
GCC Version Requirement: CUDA 12.1 requires gcc-12/g++-12. Using gcc-13 or higher will cause compilation errors.
llama.cpp GPU Backend
Apollo uses llama.cpp via the llama-cpp-python bindings for direct GPU acceleration, bypassing Ollama’s HTTP overhead.
GPU Offloading Strategy
The system fully offloads model layers to GPU VRAM:
# backend/_src/config.py
LLM_CONFIG = {
    "model_path": "./models/Meta-Llama-3.1-8B-Instruct-Q5_K_M.gguf",
    "n_gpu_layers": 33,     # Full offload for 8B models
    "n_ctx": 8192,
    "n_batch": 512,         # Optimal for RTX 5080
    "use_mlock": True,      # Lock model in RAM
    "use_mmap": True,       # Memory-map model file
    "temperature": 0.0,
    "max_tokens": 512
}
GPU Layer Configuration
The n_gpu_layers parameter controls how many transformer layers are offloaded to GPU:
# backend/_src/llm_engine_llamacpp.py
from llama_cpp import Llama
class LlamaCppEngine:
    def __init__(self, config: LLMConfig):
        self.llm = Llama(
            model_path=config.model_path,
            n_gpu_layers=33,   # Full GPU offload
            n_ctx=8192,
            n_batch=512,
            use_mlock=True,
            use_mmap=True,
            verbose=False
        )
Layer Configuration Guide:
- 0 layers: CPU-only (8-12 tok/s)
- 16 layers: Hybrid (40-50 tok/s, 3GB VRAM)
- 33 layers: Full GPU (80-100 tok/s, 5.4GB VRAM)
Model Size Impact: Larger models require more VRAM. For example, Qwen 2.5 14B Q8 needs ~14.8GB VRAM with 40 GPU layers.
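One way to apply these tiers automatically is to derive the layer count from free VRAM. The helper below is an illustrative sketch, not Apollo code; the per-layer estimate simply divides the 5.4GB full-offload figure for the 8B Q5_K_M model by its 33 layers and should be adjusted for other models and quantizations.

```python
# Hypothetical helper: pick n_gpu_layers from currently free VRAM.
import torch

def suggest_gpu_layers(total_layers: int = 33,
                       gb_per_layer: float = 5.4 / 33,   # rough 8B Q5_K_M estimate
                       headroom_gb: float = 2.0) -> int:
    if not torch.cuda.is_available():
        return 0  # CPU-only fallback
    free_bytes, _total_bytes = torch.cuda.mem_get_info(0)
    usable_gb = free_bytes / 1e9 - headroom_gb
    if usable_gb <= 0:
        return 0
    return min(total_layers, int(usable_gb / gb_per_layer))

print(f"Suggested n_gpu_layers: {suggest_gpu_layers()}")
```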
Performance Characteristics
Throughput Metrics
With RTX 5080 GPU and Q5_K_M quantization:
Target Speed: 80-100 tokens/sec
Time to First Token: less than 500ms
Context Window: 8192 tokens
Batch Size: 512 tokens
VRAM Usage: 5.4GB (8B model)
Latency Breakdown
Query processing with GPU acceleration:
Total Query Time: 8-15 seconds (simple mode)
Breakdown:
├─ Security Checks: 5ms
├─ Cache Lookup: less than 1ms (98% hit rate)
├─ Query Embedding: 5ms (GPU-accelerated BGE)
├─ Vector Search: 100ms (HNSW index)
├─ BGE Reranking: 60ms (GPU)
└─ LLM Generation: 8-12s (80-100 tok/s)
   ├─ TTFT: less than 500ms
   └─ Token Generation: 7-11s
GPU vs CPU Comparison
CPU Configuration (Baseline):
Backend: Ollama HTTP
Speed: 8-12 tok/s
TTFT: 5-7 seconds
Context Window: 8192 tokens
RAM Usage: 8GB
GPU Configuration (Optimized):
Backend: llama.cpp CUDA
Speed: 80-100 tok/s
TTFT: less than 500ms
Context Window: 8192 tokens
VRAM Usage: 5.4GB
Speedup: 8-10x
GPU Memory Management
VRAM Allocation
The system carefully manages VRAM to prevent out-of-memory errors:
# backend/_src/model_manager.py
import torch
import gc
class ModelManager:
    def _unload_current_model(self):
        """Clean unload with VRAM cleanup"""
        if self.current_llm is not None:
            # Step 1: Delete model reference
            del self.current_llm
            self.current_llm = None
            # Step 2: Force garbage collection
            gc.collect()
            # Step 3: Clear CUDA cache
            if torch.cuda.is_available():
                torch.cuda.empty_cache()
                torch.cuda.synchronize()
            # Step 4: Allow VRAM release (500ms)
            import time
            time.sleep(0.5)
VRAM Requirements by Model
| Model | Quantization | GPU Layers | VRAM Required | Speed |
|---|---|---|---|---|
| Llama 3.1 8B | Q5_K_M | 33 | 5.4GB | 80-100 tok/s |
| Llama 3.1 8B | Q8_0 | 33 | 8.2GB | 75-95 tok/s |
| Qwen 2.5 14B | Q5_K_M | 40 | 9.8GB | 60-80 tok/s |
| Qwen 2.5 14B | Q8_0 | 40 | 14.8GB | 40-50 tok/s |
VRAM Headroom: Always maintain 2-3GB of free VRAM for embeddings, reranking, and peak usage during generation.
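A model swap can enforce this headroom rule with a simple pre-load guard. The sketch below hard-codes the table values for illustration; the function and model keys are assumptions, not part of model_manager.py.

```python
# Illustrative pre-load guard: only load a model if its VRAM requirement
# plus 2GB of headroom fits in the VRAM that is currently free.
import torch

VRAM_REQUIRED_GB = {               # values from the table above
    "llama-3.1-8b-q5_k_m": 5.4,
    "llama-3.1-8b-q8_0": 8.2,
    "qwen-2.5-14b-q5_k_m": 9.8,
    "qwen-2.5-14b-q8_0": 14.8,
}
HEADROOM_GB = 2.0

def can_load(model_key: str) -> bool:
    free_bytes, _total = torch.cuda.mem_get_info(0)
    return free_bytes / 1e9 >= VRAM_REQUIRED_GB[model_key] + HEADROOM_GB

if not can_load("qwen-2.5-14b-q8_0"):
    print("Not enough free VRAM - use the 8B model or reduce n_gpu_layers")
```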
Batching Strategies
Optimal Batch Size
The n_batch parameter controls token processing parallelism:
# backend/_src/config.py
BATCH_CONFIG = {
    "n_batch": 512,    # Optimal for RTX 5080
    "n_ubatch": 512    # Micro-batch size
}
Batch Size Guidelines:
- RTX 3080/3090: 256-384 tokens
- RTX 4080/4090: 512 tokens
- RTX 5080/5090: 512-1024 tokens
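When the target GPU is not known in advance, a lookup keyed on the detected card name can apply these guidelines; the mapping below is illustrative only and not part of Apollo's configuration code.

```python
# Sketch: derive n_batch from the detected GPU, following the guidelines above.
import torch

BATCH_BY_SERIES = {
    "RTX 30": 384,   # RTX 3080/3090
    "RTX 40": 512,   # RTX 4080/4090
    "RTX 50": 512,   # RTX 5080/5090 (raise toward 1024 if VRAM allows)
}

def suggest_n_batch() -> int:
    if not torch.cuda.is_available():
        return 128  # conservative CPU-side default
    name = torch.cuda.get_device_name(0)   # e.g. "NVIDIA GeForce RTX 5080"
    for series, n_batch in BATCH_BY_SERIES.items():
        if series in name:
            return n_batch
    return 256  # unknown GPU: safe default

print(f"n_batch = {suggest_n_batch()}")
```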
Concurrent Processing
The system uses thread-safe execution for concurrent requests:
# backend/_src/llm_engine_llamacpp.py
from concurrent.futures import ThreadPoolExecutor
import asyncio
class LlamaCppEngine:
    def __init__(self):
        # Single-thread executor for thread safety
        self.executor = ThreadPoolExecutor(max_workers=1)
        self.semaphore = asyncio.Semaphore(1)

    async def generate_async(self, prompt: str, **kwargs):
        async with self.semaphore:
            # Run blocking llama.cpp call in executor
            loop = asyncio.get_running_loop()
            response = await loop.run_in_executor(
                self.executor,
                self.llm,
                prompt
            )
            return response
Thread Safety: llama.cpp is NOT thread-safe. Apollo uses a single-thread executor with a semaphore to prevent race conditions.
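Because of the semaphore, concurrent callers queue rather than race. A usage sketch (assuming a fully initialized engine with a loaded model, as in the snippet above):

```python
# Sketch: two concurrent requests are serialized by the engine's semaphore,
# so llama.cpp only ever runs one generation at a time.
import asyncio

async def main(engine):  # engine: an initialized LlamaCppEngine
    results = await asyncio.gather(
        engine.generate_async("Summarize document A."),
        engine.generate_async("Summarize document B."),
    )
    for response in results:
        print(response["choices"][0]["text"][:80])

# asyncio.run(main(engine))
```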
Hardware Requirements
Minimum Requirements
GPU: NVIDIA RTX 3060 (12GB VRAM)
CUDA: 12.1 or higher
Compute Capability: 8.0+
Driver: 525.x or higher
RAM: 16GB system memory
Recommended Configuration
GPU: NVIDIA RTX 5080 (16GB VRAM)
CUDA: 12.1
Compute Capability: 12.0 (sm_120)
Driver: 550.x or higher
RAM: 32GB system memory
CPU: 12+ cores for document processing
GPU Compatibility
Supported GPUs (Compute Capability 8.0 or higher):
- RTX 30 Series: 3060, 3070, 3080, 3090
- RTX 40 Series: 4070, 4080, 4090
- RTX 50 Series: RTX 5080, RTX 5090
- A Series: A100, A6000, A5000
Not Supported (Compute Capability less than 8.0):
- GTX 16 Series
- RTX 20 Series
- Quadro RTX 4000/5000
PyTorch Compatibility: The RTX 5080's sm_120 architecture is not supported by PyTorch 2.5.1 builds, so Apollo forces embeddings to run on CPU as a workaround.
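The fallback can be expressed as a small device-selection check. The sketch below compares the GPU's compute capability against the architectures compiled into the installed PyTorch build; it is illustrative only, deliberately conservative (it ignores PTX forward compatibility), and may differ from Apollo's actual logic.

```python
# Sketch: send embeddings to CPU when the installed PyTorch build has no
# kernels for the GPU's architecture (e.g. sm_120 on RTX 5080 + PyTorch 2.5.1).
import torch

def embedding_device() -> str:
    if not torch.cuda.is_available():
        return "cpu"
    major, minor = torch.cuda.get_device_capability(0)
    compiled_archs = torch.cuda.get_arch_list()   # e.g. ['sm_80', 'sm_86', 'sm_90']
    if f"sm_{major}{minor}" not in compiled_archs:
        return "cpu"   # conservative: exact architecture not compiled in
    return "cuda"

print(f"Embeddings device: {embedding_device()}")
```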
Configuration Examples
Development Setup (8B Model)
# backend/.env
LLM_BACKEND=llamacpp
MODEL_PATH=/models/Meta-Llama-3.1-8B-Instruct-Q5_K_M.gguf
GPU_LAYERS=33
N_CTX=8192
N_BATCH=512
USE_MLOCK=true
USE_MMAP=true
Production Setup (14B Model)
# backend/.env
LLM_BACKEND=llamacpp
MODEL_PATH=/models/qwen2.5-14b-instruct-q8_0.gguf
GPU_LAYERS=40
N_CTX=8192
N_BATCH=512
USE_MLOCK=true
USE_MMAP=true
DRAFT_MODEL_PATH=/models/Llama-3.2-1B-Instruct-Q4_K_M.gguf  # Speculative decoding
Hybrid CPU/GPU Setup
# backend/.env
LLM_BACKEND=llamacpp
MODEL_PATH=/models/Meta-Llama-3.1-8B-Instruct-Q5_K_M.gguf
GPU_LAYERS=16 # Half GPU, half CPU
N_CTX=8192
N_BATCH=256
USE_MLOCK=false  # Reduce RAM usage
Monitoring GPU Usage
Real-Time Monitoring
Use nvidia-smi to monitor GPU utilization:
# Watch GPU stats every 1 second
watch -n 1 nvidia-smi
# Query specific GPU metrics
nvidia-smi --query-gpu=utilization.gpu,memory.used,memory.total \
    --format=csv,noheader,nounits
Docker Container Monitoring
# Monitor GPU usage inside container
docker exec atlas-backend nvidia-smi
# Stream GPU stats
docker exec atlas-backend watch -n 1 nvidia-smi
Performance Metrics API
Apollo exposes GPU metrics via the health endpoint:
curl http://localhost:8000/api/health | jq .components.llm
# Response:
{
"status": "ready",
"backend": "llamacpp",
"model": "Meta-Llama-3.1-8B-Instruct-Q5_K_M",
"gpu_layers": 33,
"vram_used": "5.4GB",
"tokens_per_second": 94.2
}
Troubleshooting
CUDA Not Found
Symptom: RuntimeError: CUDA not available
Solutions:
- Verify CUDA installation: `nvcc --version`
- Check GPU detection: `nvidia-smi`
- Verify Docker GPU passthrough: `docker run --gpus all nvidia/cuda:12.1.0-base-ubuntu22.04 nvidia-smi`
- Rebuild llama-cpp-python with CUDA support: `CMAKE_ARGS="-DGGML_CUDA=ON" pip install --force-reinstall llama-cpp-python==0.3.2`
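CUDA visibility can also be confirmed from inside the backend's Python environment. The sketch below assumes a recent llama-cpp-python build, which exposes `llama_supports_gpu_offload` through its low-level bindings; verify the call against your installed version.

```python
# Sketch: confirm both PyTorch and llama.cpp can see the GPU.
import torch
import llama_cpp

print("torch.cuda.is_available():", torch.cuda.is_available())
if torch.cuda.is_available():
    print("GPU:", torch.cuda.get_device_name(0))

# Returns True only if llama-cpp-python was compiled with -DGGML_CUDA=ON
# (availability of this binding depends on the installed version).
print("llama.cpp GPU offload:", llama_cpp.llama_supports_gpu_offload())
```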
Out of Memory Errors
Symptom: torch.cuda.OutOfMemoryError: CUDA out of memory
Solutions:
- Reduce `n_gpu_layers` (e.g., 33 → 24)
- Lower `n_batch` (e.g., 512 → 256)
- Use a smaller model (14B → 8B)
- Use more aggressive quantization (Q8 → Q5 → Q4)
- Clear VRAM: `torch.cuda.empty_cache()`
Slow Inference Speed
Symptom: Less than 40 tokens/sec on GPU
Solutions:
- Verify all layers are on GPU: check that `n_gpu_layers` matches the model's layer count
- Increase batch size: try `n_batch=512` or `1024`
- Enable memory locking: `use_mlock=true`
- Check GPU utilization: it should stay above 90% during generation (see the monitoring sketch below)
- Verify no throttling: check GPU clocks with `nvidia-smi`
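GPU utilization can also be polled programmatically while a query is running. The sketch below uses the NVML bindings (`pip install nvidia-ml-py`), which are not part of Apollo's dependency set.

```python
# Sketch: sample GPU utilization and VRAM once per second for ~10 seconds.
import time
import pynvml

pynvml.nvmlInit()
handle = pynvml.nvmlDeviceGetHandleByIndex(0)

for _ in range(10):
    util = pynvml.nvmlDeviceGetUtilizationRates(handle)
    mem = pynvml.nvmlDeviceGetMemoryInfo(handle)
    print(f"GPU {util.gpu:3d}%  VRAM {mem.used / 1e9:.1f}/{mem.total / 1e9:.1f} GB")
    time.sleep(1)

pynvml.nvmlShutdown()
```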
CUDA Compilation Errors
Symptom: error: no suitable 'sm' architecture found
Solutions:
- Pin the gcc version: `export CC=gcc-12 CXX=g++-12`
- Use a compatible compiler: CUDA 12.1 requires gcc-12 (not gcc-13+)
- Set the architecture explicitly: `CMAKE_CUDA_ARCHITECTURES=all-major`
Next Steps
- Learn about Multi-Level Caching for 98% latency reduction
- Explore Adaptive Retrieval strategies
- Understand Model Management for hot-swapping