Model Management
Apollo RAG supports hot-swappable LLM models, letting you switch between models at runtime without restarting the backend. This allows you to balance speed, quality, and VRAM usage based on your current needs.
Model Management Overview
The Model Management system (_src/model_manager.py) provides:
- Runtime model switching without server restart
- Hot-swap process completing in 15-30 seconds
- Multi-backend support (llama.cpp, Ollama)
- VRAM optimization with explicit cleanup
- Automatic validation before switching
- Fallback mechanisms if new model fails
Model hot-swapping is designed for experimentation and production deployment flexibility. You can switch from a fast 8B model during development to a higher-quality 14B model for production use cases, without any downtime.
Hot Model Swapping
Zero Downtime Architecture
The hot-swap system uses a mutex-based locking mechanism to ensure thread-safe model transitions:
# backend/_src/model_manager.py
async def select_model(self, model_id: str):
    # 1. Acquire switching lock (blocks concurrent switches)
    async with self._switching_lock:
        # 2. Validate model ID and configuration
        model_config = self._get_model_config(model_id)
 
        # 3. Unload current model
        await self._unload_current_model()
 
        # 4. VRAM cleanup
        torch.cuda.empty_cache()
        gc.collect()
        await asyncio.sleep(0.5)  # Allow cleanup
 
        # 5. Load new model
        await self._load_model(model_config)
 
        # 6. Test generation (5 tokens)
        await self._test_model()
 
        # 7. Update RAGEngine reference
        # 8. Clear cache (old model answers incompatible)
        # 9. Release lock
Timing Breakdown
The hot-swap process typically completes in 15-30 seconds:
| Stage | Duration | Description | 
|---|---|---|
| Lock acquisition | <1ms | Prevents concurrent switches |
| Model unload | 2-3s | Releases model from memory |
| VRAM cleanup | 0.5s | torch.cuda.empty_cache() + gc.collect() |
| Model load | 10-25s | Loads GGUF model to GPU |
| Test generation | 2s | Validates model works (5 tokens) |
| Cache clearing | 100ms | Invalidates old model answers |
| RAG sync | <1ms | Updates engine references |
During the 15-30 second switch window:
- New queries return HTTP 503 (Service Unavailable)
- In-flight queries continue processing with the old model
- The frontend displays “Switching models…” status
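For illustration, here is a minimal sketch of how such a 503 guard might look in an async endpoint. This assumes a FastAPI backend, and the is_switching flag and endpoint shape are hypothetical, not Apollo's actual API:
# Illustrative sketch only: reject new queries while a hot-swap is in progress.
from fastapi import FastAPI, HTTPException

app = FastAPI()

class ModelManagerStub:
    """Stand-in for the real ModelManager; only the switching flag matters here."""
    is_switching: bool = False

model_manager = ModelManagerStub()

@app.post("/api/query")
async def query(payload: dict) -> dict:
    if model_manager.is_switching:
        # Mirrors the behavior described above: new queries get HTTP 503.
        raise HTTPException(status_code=503, detail="Model switch in progress; retry shortly")
    return {"answer": "..."}  # normally answered by the currently loaded model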
Model Registry
Apollo includes a curated registry of production-tested models optimized for RAG workloads:
Available Models
# backend/_src/model_manager.py
MODEL_PROFILES = [
    {
        "id": "llama-8b-q5",
        "name": "Llama 3.1 8B Q5_K_M",
        "backend": "llamacpp",
        "path": "./models/Meta-Llama-3.1-8B-Instruct-Q5_K_M.gguf",
        "vram": "5.4GB",
        "speed": "80-100 tok/s",
        "context_window": 8192,
        "gpu_layers": 33,
        "description": "Fast, balanced quality. Default model."
    },
    {
        "id": "qwen-14b-q8",
        "name": "Qwen 2.5 14B Q8_0",
        "backend": "llamacpp",
        "path": "./models/qwen2.5-14b-instruct-q8_0.gguf",
        "vram": "14.8GB",
        "speed": "40-50 tok/s",
        "context_window": 8192,
        "gpu_layers": 33,
        "description": "High quality, slower. Best for complex queries."
    },
    {
        "id": "ollama",
        "name": "Ollama (qwen2.5:14b)",
        "backend": "ollama",
        "vram": "Variable",
        "speed": "8-12 tok/s",
        "description": "Fallback HTTP backend. Slower but more compatible."
    }
]
Model Comparison Table
| Model | Size | VRAM | Speed | Context | Quality | Use Case | 
|---|---|---|---|---|---|---|
| Llama 3.1 8B Q5 | 5.6GB | 5.4GB | 80-100 tok/s | 8K | Good | Fast queries, high throughput | 
| Qwen 2.5 14B Q8 | 15GB | 14.8GB | 40-50 tok/s | 8K | Excellent | Complex queries, high accuracy | 
| Ollama Backend | Variable | Variable | 8-12 tok/s | Variable | Variable | Fallback, easy setup | 
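The registry can also be inspected at runtime via the GET /api/models endpoint listed under Debug Commands below. A small Python sketch (the response shape is assumed to mirror MODEL_PROFILES, which may differ from the actual API):
# List the available model profiles over HTTP (response shape assumed).
import requests

resp = requests.get("http://localhost:8000/api/models", timeout=10)
resp.raise_for_status()
for model in resp.json():
    print(f"{model['id']:<14} {model.get('vram', '?'):>9}  {model.get('speed', '?')}")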
Switching Models
API Endpoint
Switch models using the /api/models/select endpoint:
POST http://localhost:8000/api/models/select
Content-Type: application/json
 
{
  "model_id": "qwen-14b-q8"
}
Response (Success):
{
  "success": true,
  "current_model": {
    "id": "qwen-14b-q8",
    "name": "Qwen 2.5 14B Q8_0",
    "backend": "llamacpp",
    "vram": "14.8GB",
    "speed": "40-50 tok/s"
  },
  "previous_model": "llama-8b-q5",
  "switch_time": 18.3
}
Response (Error):
{
  "success": false,
  "error": "Model 'invalid-model' not found in registry"
}
Frontend Integration
The frontend provides a Settings Panel for model switching:
// src/components/Settings/ModelSelector.tsx
const handleModelSwitch = async (modelId: string) => {
  setIsSwitching(true);
 
  try {
    const response = await fetch('http://localhost:8000/api/models/select', {
      method: 'POST',
      headers: { 'Content-Type': 'application/json' },
      body: JSON.stringify({ model_id: modelId })
    });
 
    const result = await response.json();
 
    if (result.success) {
      toast.success(`Switched to ${result.current_model.name}`);
    } else {
      toast.error(`Switch failed: ${result.error}`);
    }
  } catch (error) {
    toast.error('Failed to switch model');
  } finally {
    setIsSwitching(false);
  }
};
Tauri IPC Integration
You can also use Tauri IPC commands for native integration:
// src-tauri/src/commands.rs
#[tauri::command]
async fn switch_model(model_id: String) -> Result<ModelInfo, String> {
    let client = reqwest::Client::new();
 
    let response = client
        .post("http://localhost:8000/api/models/select")
        .json(&serde_json::json!({ "model_id": model_id }))
        .send()
        .await
        .map_err(|e| format!("Switch failed: {}", e))?;
 
    let result: ModelSwitchResponse = response
        .json()
        .await
        .map_err(|e| format!("Failed to parse response: {}", e))?;
 
    Ok(result.current_model)
}
Model Configuration
Quantization Levels
Apollo supports various GGUF quantization levels (tradeoff between size/speed and quality):
| Quantization | File Size | VRAM | Speed | Quality | Best For | 
|---|---|---|---|---|---|
| Q4_K_M | ~4.5GB | 4.8GB | 100-120 tok/s | Fair | Fast inference, resource-constrained | 
| Q5_K_M | ~5.6GB | 5.4GB | 80-100 tok/s | Good | Recommended balanced option | 
| Q6_K | ~6.8GB | 7.2GB | 60-80 tok/s | Very Good | Higher quality, moderate speed | 
| Q8_0 | ~9.5GB | 10GB | 40-60 tok/s | Excellent | Best quality, slower | 
Quantization Format: Apollo uses GGUF (GPT-Generated Unified Format), the modern replacement for GGML. GGUF models are:
- More portable across platforms
- Faster to load (metadata in header)
- Better memory-mapped I/O support
- Compatible with llama-cpp-python 0.3.2+
Model Parameters
When configuring models, these parameters control behavior:
# backend/_src/llm_engine_llamacpp.py
LLM_CONFIG = {
    "model_path": "./models/llama-3.1-8b-instruct.Q5_K_M.gguf",
    "n_gpu_layers": 33,        # Offload all layers to GPU (faster)
    "n_ctx": 8192,             # Context window size
    "n_batch": 512,            # Batch size for prompt processing
    "temperature": 0.0,        # Deterministic (no randomness)
    "max_tokens": 512,         # Max response length
    "use_mlock": True,         # Lock model in RAM (prevents swapping)
    "use_mmap": True,          # Memory-map model file (faster load)
    "verbose": False           # Disable llama.cpp logging
}
Parameter Explanations:
- n_gpu_layers: Number of model layers offloaded to GPU. Set to 33 for full GPU offload on 8B models.
- n_ctx: Context window size (8192 tokens = ~6000 words). Larger = more context but slower.
- n_batch: Batch size for prompt processing. 512 is optimal for RTX 5080.
- temperature: Controls randomness. 0.0 = deterministic (recommended for RAG).
- use_mlock: Locks the model in RAM to prevent OS swapping (improves consistency).
- use_mmap: Memory-maps the model file instead of loading it into RAM (faster startup).
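For reference, here is roughly how the LLM_CONFIG shown above maps onto the llama-cpp-python API. This is a simplified sketch; the actual llm_engine_llamacpp.py wraps this with additional logic:
# Simplified sketch: wiring LLM_CONFIG into llama-cpp-python directly.
from llama_cpp import Llama

llm = Llama(
    model_path=LLM_CONFIG["model_path"],
    n_gpu_layers=LLM_CONFIG["n_gpu_layers"],
    n_ctx=LLM_CONFIG["n_ctx"],
    n_batch=LLM_CONFIG["n_batch"],
    use_mlock=LLM_CONFIG["use_mlock"],
    use_mmap=LLM_CONFIG["use_mmap"],
    verbose=LLM_CONFIG["verbose"],
)

output = llm(
    "Summarize the retrieved context in one sentence.",
    max_tokens=LLM_CONFIG["max_tokens"],
    temperature=LLM_CONFIG["temperature"],
)
print(output["choices"][0]["text"])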
Supported Formats
GGUF (Primary Format)
Apollo exclusively uses GGUF quantized models for optimal performance:
Advantages:
- GPU acceleration via llama.cpp (80-100 tok/s on RTX 5080)
- Memory efficient (5-15GB VRAM depending on quantization)
- Fast loading (less than 10s for 8B models)
- Portable (runs on CPU or GPU)
Where to Download: Pre-quantized GGUF models can be downloaded from Hugging Face (for example, the repositories used below).
Example GGUF Models:
# Llama 3.1 8B (Recommended)
wget https://huggingface.co/bartowski/Meta-Llama-3.1-8B-Instruct-GGUF/resolve/main/Meta-Llama-3.1-8B-Instruct-Q5_K_M.gguf
 
# Qwen 2.5 14B (High Quality)
wget https://huggingface.co/Qwen/Qwen2.5-14B-Instruct-GGUF/resolve/main/qwen2.5-14b-instruct-q8_0.gguf
Unsupported Formats
Apollo does not support:
- PyTorch (.bin, .pt, .safetensors): Too large, slow, requires transformers library
- ONNX (.onnx): No llama.cpp support
- TensorFlow (.pb): Incompatible runtime
Memory Considerations
VRAM Requirements by Model Size
VRAM Planning: Always allocate 10-15% overhead for activations, KV cache, and system overhead.
| Model Size | Quantization | Base VRAM | Overhead | Total VRAM | RTX GPU | 
|---|---|---|---|---|---|
| 7-8B | Q4_K_M | 4.5GB | +0.8GB | 5.3GB | RTX 3060 (12GB) | 
| 7-8B | Q5_K_M | 5.4GB | +1.0GB | 6.4GB | RTX 3060 Ti (8GB) | 
| 7-8B | Q8_0 | 7.5GB | +1.2GB | 8.7GB | RTX 4060 Ti (16GB) | 
| 13-14B | Q4_K_M | 8.5GB | +1.5GB | 10GB | RTX 4070 (12GB) | 
| 13-14B | Q5_K_M | 10.5GB | +2.0GB | 12.5GB | RTX 4070 Ti (12GB) | 
| 13-14B | Q8_0 | 14.8GB | +2.5GB | 17.3GB | RTX 5080 (16GB) | 
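A back-of-the-envelope helper for the 10-15% rule is sketched below. It is illustrative arithmetic only; the table above reflects measured values with fixed overheads:
# Rough VRAM planning: base model size plus 10-15% overhead for KV cache/activations.
def estimate_total_vram_gb(base_gb: float, overhead_fraction: float = 0.15) -> float:
    """Return an estimated total VRAM requirement in GB."""
    return round(base_gb * (1 + overhead_fraction), 1)

for name, base_gb in [("Llama 8B Q5_K_M", 5.4), ("Qwen 14B Q8_0", 14.8)]:
    print(f"{name}: ~{estimate_total_vram_gb(base_gb)} GB total VRAM needed")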
Out-of-Memory Handling
If VRAM is exhausted, Apollo falls back gracefully:
# backend/_src/llm_engine_llamacpp.py
try:
    self.llm = Llama(
        model_path=config.model_path,
        n_gpu_layers=config.n_gpu_layers,  # Try full GPU offload
        n_ctx=config.n_ctx
    )
except Exception as e:
    logger.warning(f"GPU offload failed: {e}. Falling back to CPU.")
 
    # Fallback: Reduce GPU layers
    self.llm = Llama(
        model_path=config.model_path,
        n_gpu_layers=0,  # CPU-only
        n_ctx=config.n_ctx
    )
Critical: If you see CUDA out of memory errors, try the following (a fallback sketch follows this list):
- Switch to a smaller quantization (Q8 → Q5 → Q4)
- Reduce n_gpu_layers (33 → 20 → 0)
- Close other GPU applications
- Consider a smaller model (14B → 8B → 3B)
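These remediation steps can also be automated. The sketch below (not Apollo's actual fallback logic) steps down n_gpu_layers until the model loads:
# Illustrative graduated fallback: apply the 33 → 20 → 0 advice programmatically.
from llama_cpp import Llama

def load_with_fallback(model_path: str, n_ctx: int = 8192) -> Llama:
    for gpu_layers in (33, 20, 0):  # 0 = CPU-only last resort
        try:
            return Llama(model_path=model_path, n_gpu_layers=gpu_layers, n_ctx=n_ctx)
        except Exception as exc:  # allocation failures surface as exceptions
            print(f"Load failed with n_gpu_layers={gpu_layers}: {exc}")
    raise RuntimeError("Unable to load model even in CPU-only mode")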
Performance Characteristics
Inference Speed by Model Size
Performance measured on RTX 5080 (16GB VRAM) with CUDA 12.1:
| Model | Quantization | VRAM | Speed (tok/s) | TTFT | Use Case | 
|---|---|---|---|---|---|
| Llama 3.1 8B | Q5_K_M | 5.4GB | 80-100 | <500ms | Fast, general-purpose |
| Llama 3.1 8B | Q8_0 | 7.5GB | 60-80 | <600ms | Higher quality |
| Qwen 2.5 14B | Q5_K_M | 10.5GB | 50-60 | <800ms | Complex queries |
| Qwen 2.5 14B | Q8_0 | 14.8GB | 40-50 | <1000ms | Best quality |
TTFT: Time to First Token (latency before streaming starts)
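To reproduce these numbers on your own hardware, a rough benchmark using llama-cpp-python streaming looks like the sketch below. Each streamed chunk is treated as roughly one token; the model path and prompt are placeholders:
# Rough TTFT / throughput measurement with llama-cpp-python streaming.
import time
from llama_cpp import Llama

llm = Llama(model_path="./models/Meta-Llama-3.1-8B-Instruct-Q5_K_M.gguf",
            n_gpu_layers=33, n_ctx=8192, verbose=False)

start = time.perf_counter()
first_token_at = None
n_tokens = 0
for _chunk in llm("Explain retrieval-augmented generation.", max_tokens=256, stream=True):
    if first_token_at is None:
        first_token_at = time.perf_counter()
    n_tokens += 1

elapsed = time.perf_counter() - start
ttft = first_token_at - start
print(f"TTFT: {ttft * 1000:.0f} ms")
print(f"Throughput: {n_tokens / max(elapsed - ttft, 1e-6):.1f} tok/s")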
Speed vs Quality Tradeoff
Fast (100 tok/s) ←─────────────────→ Quality (40 tok/s)
        ↑                                    ↑
   Llama 8B Q4                         Qwen 14B Q8
        |                                    |
        └─ Best for: High throughput        └─ Best for: Accuracy-critical
           Simple queries                       Complex reasoning
            Real-time chat                       Production RAG
Best Practices
When to Switch Models
Use Llama 3.1 8B Q5 (Fast) for:
- Simple factual queries
- High throughput requirements (many concurrent users)
- Real-time chat experiences
- Development and testing
Use Qwen 2.5 14B Q8 (Quality) for:
- Complex reasoning tasks
- Multi-step analysis
- Production RAG with accuracy requirements
- Domain-specific questions
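If you want to automate this choice, a hypothetical routing heuristic (not part of Apollo) could pick a registry model ID based on query complexity:
# Hypothetical heuristic: route short factual questions to the fast 8B model,
# longer or multi-step queries to the 14B model. Thresholds are illustrative.
def choose_model(question: str) -> str:
    complex_markers = ("why", "compare", "analyze", "step by step", "explain how")
    q = question.lower()
    if len(question.split()) > 40 or any(marker in q for marker in complex_markers):
        return "qwen-14b-q8"
    return "llama-8b-q5"

print(choose_model("What port does the backend use?"))        # llama-8b-q5
print(choose_model("Compare Q5_K_M and Q8_0 step by step."))  # qwen-14b-q8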
Production Tips
- Pre-load models during Docker build to avoid startup delays:
  # backend/Dockerfile.atlas
  COPY models/*.gguf /models/
  RUN python -c "from llama_cpp import Llama; Llama('/models/llama-3.1-8b.gguf', n_gpu_layers=0)"
- Use environment variables for model selection:
  # docker-compose.atlas.yml
  environment:
    MODEL_PATH: /models/Meta-Llama-3.1-8B-Instruct-Q5_K_M.gguf
    GPU_LAYERS: 33
- Monitor VRAM usage with health checks:
  # backend/app/api/health.py
  vram_used = torch.cuda.memory_allocated() / 1024**3  # GB
  vram_total = torch.cuda.get_device_properties(0).total_memory / 1024**3
  if vram_used / vram_total > 0.9:
      logger.warning(f"VRAM usage high: {vram_used:.1f}/{vram_total:.1f} GB")
- Test model switches in staging before production:
  # Test switch to Qwen 14B
  curl -X POST http://localhost:8000/api/models/select \
    -H "Content-Type: application/json" \
    -d '{"model_id": "qwen-14b-q8"}'
Troubleshooting
Common Issues
Issue: Model switch fails with “CUDA out of memory”
Solution:
- Verify VRAM available: nvidia-smi
- Switch to smaller quantization (Q8 → Q5)
- Reduce GPU layers in config
Issue: Model loads but generation is very slow (<10 tok/s)
Solution:
- Check GPU layers: Should be 33 for full offload
- Verify CUDA build: llama-cpp-python must be built with CUDA
- Check batch size: Increase n_batch to 512
Issue: Hot-swap takes more than 60 seconds
Solution:
- This is normal for 14B+ models
- Pre-cache models in Docker to speed up first load
- Use SSD for model storage (not HDD)
Issue: Model switch succeeds but old model still responds
Solution:
- Cache was not cleared. Manually clear:
   POST /api/conversation/clear
- Restart frontend to reset state
Debug Commands
# List available models
curl http://localhost:8000/api/models
 
# Get current model
curl http://localhost:8000/api/models/current
 
# Check VRAM usage
nvidia-smi --query-gpu=memory.used,memory.total --format=csv
 
# Monitor model switching
docker-compose -f backend/docker-compose.atlas.yml logs -f atlas-backend
Next Steps
Learn about Streaming to understand how Apollo delivers real-time token-by-token responses.