Model Management

Apollo RAG supports hot-swappable LLM models: you can switch between models at runtime without restarting the backend, letting you balance speed, quality, and VRAM usage to match your current needs.

Model Management Overview

The Model Management system (_src/model_manager.py) provides:

  • Runtime model switching without server restart
  • Hot-swap process completing in 15-30 seconds
  • Multi-backend support (llama.cpp, Ollama)
  • VRAM optimization with explicit cleanup
  • Automatic validation before switching
  • Fallback mechanisms if new model fails

Model hot-swapping was designed for experimentation and production deployment flexibility. You can switch from a fast 8B model during development to a high-quality 14B model for production use cases, without any downtime.

Hot Model Swapping

Zero Downtime Architecture

The hot-swap system uses a mutex-based locking mechanism to ensure thread-safe model transitions:

# backend/_src/model_manager.py
async def select_model(self, model_id: str):
    # 1. Acquire switching lock (blocks concurrent switches)
    async with self._switching_lock:
        # 2. Validate model ID and configuration
        model_config = self._get_model_config(model_id)
 
        # 3. Unload current model
        await self._unload_current_model()
 
        # 4. VRAM cleanup
        torch.cuda.empty_cache()
        gc.collect()
        await asyncio.sleep(0.5)  # Allow cleanup
 
        # 5. Load new model
        await self._load_model(model_config)
 
        # 6. Test generation (5 tokens)
        await self._test_model()
 
        # 7. Update RAGEngine reference
        # 8. Clear cache (old model answers incompatible)
        # 9. Release lock

Timing Breakdown

The hot-swap process typically completes in 15-30 seconds:

| Stage | Duration | Description |
|---|---|---|
| Lock acquisition | < 1ms | Prevents concurrent switches |
| Model unload | 2-3s | Releases model from memory |
| VRAM cleanup | 0.5s | torch.cuda.empty_cache() + gc.collect() |
| Model load | 10-25s | Loads GGUF model to GPU |
| Test generation | 2s | Validates model works (5 tokens) |
| Cache clearing | 100ms | Invalidates old model answers |
| RAG sync | < 1ms | Updates engine references |
⚠️ During the 15-30 second switch window:

  • New queries return HTTP 503 (Service Temporarily Unavailable); see the sketch after this list
  • In-flight queries continue processing with the old model
  • The frontend displays “Switching models…” status
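
For reference, here is a minimal sketch of the 503 behavior, assuming a FastAPI-style route and an illustrative is_switching flag on the manager (these names are not Apollo's exact internals):

# Illustrative sketch: reject new queries while a switch is in progress.
# Assumes a FastAPI backend; ModelManager.is_switching is a hypothetical flag.
from fastapi import APIRouter, HTTPException
from pydantic import BaseModel

router = APIRouter()

class QueryRequest(BaseModel):
    question: str

class ModelManager:
    def __init__(self) -> None:
        self.is_switching = False  # set True while select_model() holds the lock

model_manager = ModelManager()

@router.post("/api/query")
async def query(req: QueryRequest):
    if model_manager.is_switching:
        # New queries are refused until the new model passes its test generation
        raise HTTPException(status_code=503, detail="Model switch in progress")
    # ...otherwise the query proceeds against the currently loaded model
    return {"status": "ok"}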

Model Registry

Apollo includes a curated registry of production-tested models optimized for RAG workloads:

Available Models

# backend/_src/model_manager.py
MODEL_PROFILES = [
    {
        "id": "llama-8b-q5",
        "name": "Llama 3.1 8B Q5_K_M",
        "backend": "llamacpp",
        "path": "./models/Meta-Llama-3.1-8B-Instruct-Q5_K_M.gguf",
        "vram": "5.4GB",
        "speed": "80-100 tok/s",
        "context_window": 8192,
        "gpu_layers": 33,
        "description": "Fast, balanced quality. Default model."
    },
    {
        "id": "qwen-14b-q8",
        "name": "Qwen 2.5 14B Q8_0",
        "backend": "llamacpp",
        "path": "./models/qwen2.5-14b-instruct-q8_0.gguf",
        "vram": "14.8GB",
        "speed": "40-50 tok/s",
        "context_window": 8192,
        "gpu_layers": 33,
        "description": "High quality, slower. Best for complex queries."
    },
    {
        "id": "ollama",
        "name": "Ollama (qwen2.5:14b)",
        "backend": "ollama",
        "vram": "Variable",
        "speed": "8-12 tok/s",
        "description": "Fallback HTTP backend. Slower but more compatible."
    }
]
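
The _get_model_config() call used in the hot-swap sketch above can be a simple lookup against this registry. An illustrative version (Apollo's actual implementation may differ):

# Illustrative registry lookup backing _get_model_config() in select_model().
def _get_model_config(model_id: str) -> dict:
    for profile in MODEL_PROFILES:
        if profile["id"] == model_id:
            return profile
    # Unknown IDs surface as the error response shown under "Switching Models"
    raise ValueError(f"Model '{model_id}' not found in registry")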

Model Comparison Table

| Model | Size | VRAM | Speed | Context | Quality | Use Case |
|---|---|---|---|---|---|---|
| Llama 3.1 8B Q5 | 5.6GB | 5.4GB | 80-100 tok/s | 8K | Good | Fast queries, high throughput |
| Qwen 2.5 14B Q8 | 15GB | 14.8GB | 40-50 tok/s | 8K | Excellent | Complex queries, high accuracy |
| Ollama Backend | Variable | Variable | 8-12 tok/s | Variable | Variable | Fallback, easy setup |

Switching Models

API Endpoint

Switch models using the /api/models/select endpoint:

POST http://localhost:8000/api/models/select
Content-Type: application/json
 
{
  "model_id": "qwen-14b-q8"
}

Response (Success):

{
  "success": true,
  "current_model": {
    "id": "qwen-14b-q8",
    "name": "Qwen 2.5 14B Q8_0",
    "backend": "llamacpp",
    "vram": "14.8GB",
    "speed": "40-50 tok/s"
  },
  "previous_model": "llama-8b-q5",
  "switch_time": 18.3
}

Response (Error):

{
  "success": false,
  "error": "Model 'invalid-model' not found in registry"
}
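
For scripting or testing, you can call the same endpoint from Python. A small sketch using the requests library (payload and response fields as documented above):

# Sketch: trigger a model switch from a Python script.
import requests

resp = requests.post(
    "http://localhost:8000/api/models/select",
    json={"model_id": "qwen-14b-q8"},
    timeout=120,  # allow for the 15-30 second (or longer) switch window
)
result = resp.json()

if result.get("success"):
    print(f"Switched to {result['current_model']['name']} in {result['switch_time']:.1f}s")
else:
    print(f"Switch failed: {result.get('error')}")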

Frontend Integration

The frontend provides a Settings Panel for model switching:

// src/components/Settings/ModelSelector.tsx
const handleModelSwitch = async (modelId: string) => {
  setIsSwitching(true);
 
  try {
    const response = await fetch('http://localhost:8000/api/models/select', {
      method: 'POST',
      headers: { 'Content-Type': 'application/json' },
      body: JSON.stringify({ model_id: modelId })
    });
 
    const result = await response.json();
 
    if (result.success) {
      toast.success(`Switched to ${result.current_model.name}`);
    } else {
      toast.error(`Switch failed: ${result.error}`);
    }
  } catch (error) {
    toast.error('Failed to switch model');
  } finally {
    setIsSwitching(false);
  }
};

Tauri IPC Integration

You can also use Tauri IPC commands for native integration:

// src-tauri/src/commands.rs
#[tauri::command]
async fn switch_model(model_id: String) -> Result<ModelInfo, String> {
    let client = reqwest::Client::new();
 
    let response = client
        .post("http://localhost:8000/api/models/select")
        .json(&serde_json::json!({ "model_id": model_id }))
        .send()
        .await
        .map_err(|e| format!("Switch failed: {}", e))?;
 
    let result: ModelSwitchResponse = response
        .json()
        .await
        .map_err(|e| format!("Failed to parse response: {}", e))?;
 
    Ok(result.current_model)
}

Model Configuration

Quantization Levels

Apollo supports various GGUF quantization levels (tradeoff between size/speed and quality):

| Quantization | File Size | VRAM | Speed | Quality | Best For |
|---|---|---|---|---|---|
| Q4_K_M | ~4.5GB | 4.8GB | 100-120 tok/s | Fair | Fast inference, resource-constrained |
| Q5_K_M | ~5.6GB | 5.4GB | 80-100 tok/s | Good | Recommended balanced option |
| Q6_K | ~6.8GB | 7.2GB | 60-80 tok/s | Very Good | Higher quality, moderate speed |
| Q8_0 | ~9.5GB | 10GB | 40-60 tok/s | Excellent | Best quality, slower |

Quantization Format: Apollo uses GGUF (GPT-Generated Unified Format), the modern replacement for GGML. GGUF models are:

  • More portable across platforms
  • Faster to load (metadata in header)
  • Better memory-mapped I/O support
  • Compatible with llama.cpp 0.3.2+

Model Parameters

When configuring models, these parameters control behavior:

# backend/_src/llm_engine_llamacpp.py
LLM_CONFIG = {
    "model_path": "./models/llama-3.1-8b-instruct.Q5_K_M.gguf",
    "n_gpu_layers": 33,        # Offload all layers to GPU (faster)
    "n_ctx": 8192,             # Context window size
    "n_batch": 512,            # Batch size for prompt processing
    "temperature": 0.0,        # Deterministic (no randomness)
    "max_tokens": 512,         # Max response length
    "use_mlock": True,         # Lock model in RAM (prevents swapping)
    "use_mmap": True,          # Memory-map model file (faster load)
    "verbose": False           # Disable llama.cpp logging
}

Parameter Explanations:

  • n_gpu_layers: Number of model layers offloaded to GPU. Set to 33 for full GPU offload on 8B models.
  • n_ctx: Context window size (8192 tokens = ~6000 words). Larger = more context but slower.
  • n_batch: Batch size for prompt processing. 512 is optimal for RTX 5080.
  • temperature: Controls randomness. 0.0 = deterministic (recommended for RAG).
  • use_mlock: Locks model in RAM to prevent OS swapping (improves consistency).
  • use_mmap: Memory-maps model file instead of loading into RAM (faster startup).
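
Note that in llama-cpp-python the load-time options (n_gpu_layers, n_ctx, n_batch, use_mlock, use_mmap, verbose) are passed to the Llama constructor, while temperature and max_tokens apply per generation call. A minimal sketch mapping LLM_CONFIG onto the library:

# Sketch: load-time vs generation-time parameters in llama-cpp-python.
from llama_cpp import Llama

llm = Llama(
    model_path=LLM_CONFIG["model_path"],
    n_gpu_layers=LLM_CONFIG["n_gpu_layers"],
    n_ctx=LLM_CONFIG["n_ctx"],
    n_batch=LLM_CONFIG["n_batch"],
    use_mlock=LLM_CONFIG["use_mlock"],
    use_mmap=LLM_CONFIG["use_mmap"],
    verbose=LLM_CONFIG["verbose"],
)

output = llm(
    "Summarize the retrieved context in one sentence.",
    max_tokens=LLM_CONFIG["max_tokens"],
    temperature=LLM_CONFIG["temperature"],
)
print(output["choices"][0]["text"])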

Supported Formats

GGUF (Primary Format)

Apollo exclusively uses GGUF quantized models for optimal performance:

Advantages:

  • GPU acceleration via llama.cpp (80-100 tok/s on RTX 5080)
  • Memory efficient (5-15GB VRAM depending on quantization)
  • Fast loading (less than 10s for 8B models)
  • Portable (runs on CPU or GPU)

Where to Download: GGUF builds of these models are published on Hugging Face (see the example URLs below).

Example GGUF Models:

# Llama 3.1 8B (Recommended)
wget https://huggingface.co/bartowski/Meta-Llama-3.1-8B-Instruct-GGUF/resolve/main/Meta-Llama-3.1-8B-Instruct-Q5_K_M.gguf
 
# Qwen 2.5 14B (High Quality)
wget https://huggingface.co/Qwen/Qwen2.5-14B-Instruct-GGUF/resolve/main/qwen2.5-14b-instruct-q8_0.gguf

Unsupported Formats

Apollo does not support:

  • PyTorch (.bin, .pt, .safetensors): Too large, slow, and requires the transformers library
  • ONNX (.onnx): No llama.cpp support
  • TensorFlow (.pb): Incompatible runtime
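
A simple guard that rejects unsupported formats before attempting to load (illustrative; Apollo may validate differently):

# Illustrative format check before loading a model file.
from pathlib import Path

SUPPORTED_SUFFIXES = {".gguf"}

def validate_model_path(path: str) -> Path:
    p = Path(path)
    if p.suffix.lower() not in SUPPORTED_SUFFIXES:
        raise ValueError(
            f"Unsupported model format '{p.suffix}'. Apollo only loads GGUF models."
        )
    if not p.exists():
        raise FileNotFoundError(f"Model file not found: {p}")
    return p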

Memory Considerations

VRAM Requirements by Model Size

⚠️ VRAM Planning: Always allocate 10-15% headroom for activations, the KV cache, and other system overhead.

| Model Size | Quantization | Base VRAM | Overhead | Total VRAM | RTX GPU |
|---|---|---|---|---|---|
| 7-8B | Q4_K_M | 4.5GB | +0.8GB | 5.3GB | RTX 3060 (12GB) |
| 7-8B | Q5_K_M | 5.4GB | +1.0GB | 6.4GB | RTX 3060 Ti (8GB) |
| 7-8B | Q8_0 | 7.5GB | +1.2GB | 8.7GB | RTX 4060 Ti (16GB) |
| 13-14B | Q4_K_M | 8.5GB | +1.5GB | 10GB | RTX 4070 (12GB) |
| 13-14B | Q5_K_M | 10.5GB | +2.0GB | 12.5GB | RTX 4070 Ti (12GB) |
| 13-14B | Q8_0 | 14.8GB | +2.5GB | 17.3GB | RTX 5080 (16GB) |
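
As a rule of thumb from the table above, total VRAM is roughly the base model size plus 10-15% overhead. A small hypothetical helper (not part of Apollo) to sanity-check a model against your GPU:

# Hypothetical helper: estimate whether a model fits in VRAM using the
# 10-15% overhead rule of thumb from the table above.
def fits_in_vram(base_model_gb: float, gpu_vram_gb: float,
                 overhead_fraction: float = 0.15) -> bool:
    estimated_total = base_model_gb * (1 + overhead_fraction)
    return estimated_total <= gpu_vram_gb

# Example: Llama 8B Q5_K_M (~5.4GB base) on an 8GB GPU -> True (~6.2GB total)
print(fits_in_vram(5.4, 8.0))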

Out-of-Memory Handling

If VRAM is exhausted, Apollo falls back gracefully:

# backend/_src/llm_engine_llamacpp.py
try:
    self.llm = Llama(
        model_path=config.model_path,
        n_gpu_layers=config.n_gpu_layers,  # Try full GPU offload
        n_ctx=config.n_ctx
    )
except Exception as e:
    logger.warning(f"GPU offload failed: {e}. Falling back to CPU.")
 
    # Fallback: Reduce GPU layers
    self.llm = Llama(
        model_path=config.model_path,
        n_gpu_layers=0,  # CPU-only
        n_ctx=config.n_ctx
    )
🚫 Critical: If you see CUDA out of memory errors:

  • Switch to a smaller quantization (Q8 → Q5 → Q4)
  • Reduce n_gpu_layers (33 → 20 → 0)
  • Close other GPU applications
  • Consider a smaller model (14B → 8B → 3B)
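
These mitigations can also be automated. A hedged sketch of a progressive fallback that lowers n_gpu_layers step by step (Apollo's built-in fallback shown above drops straight to CPU):

# Sketch: progressive GPU-layer fallback (33 -> 20 -> 0) on load failure.
import logging
from llama_cpp import Llama

logger = logging.getLogger(__name__)

def load_with_fallback(model_path: str, n_ctx: int = 8192) -> Llama:
    for gpu_layers in (33, 20, 0):
        try:
            return Llama(model_path=model_path,
                         n_gpu_layers=gpu_layers,
                         n_ctx=n_ctx)
        except Exception as e:
            logger.warning("Load with n_gpu_layers=%d failed: %s", gpu_layers, e)
    raise RuntimeError(f"Could not load {model_path} even with CPU-only settings")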

Performance Characteristics

Inference Speed by Model Size

Performance measured on RTX 5080 (16GB VRAM) with CUDA 12.1:

| Model | Quantization | VRAM | Speed (tok/s) | TTFT | Use Case |
|---|---|---|---|---|---|
| Llama 3.1 8B | Q5_K_M | 5.4GB | 80-100 | < 500ms | Fast, general-purpose |
| Llama 3.1 8B | Q8_0 | 7.5GB | 60-80 | < 600ms | Higher quality |
| Qwen 2.5 14B | Q5_K_M | 10.5GB | 50-60 | < 800ms | Complex queries |
| Qwen 2.5 14B | Q8_0 | 14.8GB | 40-50 | < 1000ms | Best quality |

TTFT: Time to First Token (latency before streaming starts)

Speed vs Quality Tradeoff

Fast (100 tok/s) ←─────────────────→ Quality (40 tok/s)
        ↑                                    ↑
   Llama 8B Q4                         Qwen 14B Q8
        |                                    |
        └─ Best for: High throughput        └─ Best for: Accuracy-critical
           Simple queries                       Complex reasoning
           Real-time chat                       Production RAG

Best Practices

When to Switch Models

Use Llama 3.1 8B Q5 (Fast) for:

  • Simple factual queries
  • High throughput requirements (many concurrent users)
  • Real-time chat experiences
  • Development and testing

Use Qwen 2.5 14B Q8 (Quality) for:

  • Complex reasoning tasks
  • Multi-step analysis
  • Production RAG with accuracy requirements
  • Domain-specific questions
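
If you want to automate this choice, a simple routing heuristic can pick a registry model ID per query. This is purely illustrative (Apollo does not ship such a router); the IDs match the registry above:

# Hypothetical routing heuristic: choose a model ID based on a rough
# query-complexity signal. Not part of Apollo.
COMPLEX_HINTS = ("compare", "analyze", "explain why", "step by step", "trade-off")

def pick_model_id(question: str) -> str:
    q = question.lower()
    if len(q.split()) > 40 or any(hint in q for hint in COMPLEX_HINTS):
        return "qwen-14b-q8"   # quality model for complex reasoning
    return "llama-8b-q5"       # fast default for simple factual queries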

Production Tips

  • Pre-load models during Docker build to avoid startup delays:

    # backend/Dockerfile.atlas
    COPY models/*.gguf /models/
    RUN python -c "from llama_cpp import Llama; Llama('/models/llama-3.1-8b.gguf', n_gpu_layers=0)"
  • Use environment variables for model selection:

    # docker-compose.atlas.yml
    environment:
      MODEL_PATH: /models/Meta-Llama-3.1-8B-Instruct-Q5_K_M.gguf
      GPU_LAYERS: 33
  • Monitor VRAM usage with health checks:

    # backend/app/api/health.py
    vram_used = torch.cuda.memory_allocated() / 1024**3  # GB
    vram_total = torch.cuda.get_device_properties(0).total_memory / 1024**3
     
    if vram_used / vram_total > 0.9:
        logger.warning(f"VRAM usage high: {vram_used:.1f}/{vram_total:.1f} GB")
  • Test model switches in staging before production:

    # Test switch to Qwen 14B
    curl -X POST http://localhost:8000/api/models/select \
      -H "Content-Type: application/json" \
      -d '{"model_id": "qwen-14b-q8"}'

Troubleshooting

Common Issues

Issue: Model switch fails with “CUDA out of memory”

Solution:
- Verify VRAM available: nvidia-smi
- Switch to smaller quantization (Q8 → Q5)
- Reduce GPU layers in config

Issue: Model loads but generation is very slow (< 10 tok/s)

Solution:
- Check GPU layers: Should be 33 for full offload
- Verify CUDA build: llama-cpp-python must be built with CUDA
- Check batch size: Increase n_batch to 512

Issue: Hot-swap takes more than 60 seconds

Solution:
- This is normal for 14B+ models
- Pre-cache models in Docker to speed up first load
- Use SSD for model storage (not HDD)

Issue: Model switch succeeds but old model still responds

Solution:
- The cache was not cleared. Clear it manually:
   POST /api/conversation/clear
- Restart frontend to reset state

Debug Commands

# List available models
curl http://localhost:8000/api/models
 
# Get current model
curl http://localhost:8000/api/models/current
 
# Check VRAM usage
nvidia-smi --query-gpu=memory.used,memory.total --format=csv
 
# Monitor model switching
docker-compose -f backend/docker-compose.atlas.yml logs -f atlas-backend

Next Steps

Learn about Streaming to understand how Apollo delivers real-time, token-by-token responses.