Model Management

Apollo RAG supports hot-swappable LLM models: you can switch between models at runtime without restarting the backend, letting you balance speed, quality, and VRAM usage to match your current needs.

Model Management Overview

The Model Management system (_src/model_manager.py) provides:

  • Runtime model switching without server restart
  • Hot-swap process completing in 15-30 seconds
  • Multi-backend support (llama.cpp, Ollama)
  • VRAM optimization with explicit cleanup
  • Automatic validation before switching
  • Fallback mechanisms if new model fails

Model hot-swapping was designed for experimentation and production deployment flexibility. You can switch from a fast 8B model during development to a high-quality 14B model for production use cases, without any downtime.

Hot Model Swapping

Zero Downtime Architecture

The hot-swap system uses a mutex-based locking mechanism to ensure thread-safe model transitions:

# backend/_src/model_manager.py
async def select_model(self, model_id: str):
    # 1. Acquire switching lock (blocks concurrent switches)
    async with self._switching_lock:
        # 2. Validate model ID and configuration
        model_config = self._get_model_config(model_id)
 
        # 3. Unload current model
        await self._unload_current_model()
 
        # 4. VRAM cleanup
        torch.cuda.empty_cache()
        gc.collect()
        await asyncio.sleep(0.5)  # Allow cleanup
 
        # 5. Load new model
        await self._load_model(model_config)
 
        # 6. Test generation (5 tokens)
        await self._test_model()
 
        # 7. Update RAGEngine reference
        # 8. Clear cache (old model answers incompatible)
        # 9. Release lock

Timing Breakdown

The hot-swap process typically completes in 15-30 seconds:

| Stage | Duration | Description |
|---|---|---|
| Lock acquisition | < 1ms | Prevents concurrent switches |
| Model unload | 2-3s | Releases model from memory |
| VRAM cleanup | 0.5s | torch.cuda.empty_cache() + gc.collect() |
| Model load | 10-25s | Loads GGUF model to GPU |
| Test generation | 2s | Validates model works (5 tokens) |
| Cache clearing | 100ms | Invalidates old model answers |
| RAG sync | < 1ms | Updates engine references |
⚠️ During the 15-30 second switch window:

  • New queries return HTTP 503 (Service Temporarily Unavailable); see the sketch after this list
  • In-flight queries continue processing with the old model
  • The frontend displays “Switching models…” status
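
For reference, here is a minimal sketch of the 503 behavior, assuming a FastAPI-style route and an illustrative is_switching flag on the manager (these names are not Apollo's exact internals):

# Illustrative sketch: reject new queries while a switch is in progress.
# Assumes a FastAPI backend; ModelManager.is_switching is a hypothetical flag.
from fastapi import APIRouter, HTTPException
from pydantic import BaseModel

router = APIRouter()

class QueryRequest(BaseModel):
    question: str

class ModelManager:
    def __init__(self) -> None:
        self.is_switching = False  # set True while select_model() holds the lock

model_manager = ModelManager()

@router.post("/api/query")
async def query(req: QueryRequest):
    if model_manager.is_switching:
        # New queries are refused until the new model passes its test generation
        raise HTTPException(status_code=503, detail="Model switch in progress")
    # ...otherwise the query proceeds against the currently loaded model
    return {"status": "ok"}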

Model Registry

Apollo includes a curated registry of production-tested models optimized for RAG workloads:

Available Models

# backend/_src/model_manager.py
MODEL_PROFILES = [
    {
        "id": "llama-8b-q5",
        "name": "Llama 3.1 8B Q5_K_M",
        "backend": "llamacpp",
        "path": "./models/Meta-Llama-3.1-8B-Instruct-Q5_K_M.gguf",
        "vram": "5.4GB",
        "speed": "80-100 tok/s",
        "context_window": 8192,
        "gpu_layers": 33,
        "description": "Fast, balanced quality. Default model."
    },
    {
        "id": "qwen-14b-q8",
        "name": "Qwen 2.5 14B Q8_0",
        "backend": "llamacpp",
        "path": "./models/qwen2.5-14b-instruct-q8_0.gguf",
        "vram": "14.8GB",
        "speed": "40-50 tok/s",
        "context_window": 8192,
        "gpu_layers": 33,
        "description": "High quality, slower. Best for complex queries."
    },
    {
        "id": "ollama",
        "name": "Ollama (qwen2.5:14b)",
        "backend": "ollama",
        "vram": "Variable",
        "speed": "8-12 tok/s",
        "description": "Fallback HTTP backend. Slower but more compatible."
    }
]
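
The _get_model_config() call used in the hot-swap sketch above can be a simple lookup against this registry. An illustrative version (Apollo's actual implementation may differ):

# Illustrative registry lookup backing _get_model_config() in select_model().
def _get_model_config(model_id: str) -> dict:
    for profile in MODEL_PROFILES:
        if profile["id"] == model_id:
            return profile
    # Unknown IDs surface as the error response shown under "Switching Models"
    raise ValueError(f"Model '{model_id}' not found in registry")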

Model Comparison Table

| Model | Size | VRAM | Speed | Context | Quality | Use Case |
|---|---|---|---|---|---|---|
| Llama 3.1 8B Q5 | 5.6GB | 5.4GB | 80-100 tok/s | 8K | Good | Fast queries, high throughput |
| Qwen 2.5 14B Q8 | 15GB | 14.8GB | 40-50 tok/s | 8K | Excellent | Complex queries, high accuracy |
| Ollama Backend | Variable | Variable | 8-12 tok/s | Variable | Variable | Fallback, easy setup |

Switching Models

API Endpoint

Switch models using the /api/models/select endpoint:

POST http://localhost:8000/api/models/select
Content-Type: application/json
 
{
  "model_id": "qwen-14b-q8"
}

Response (Success):

{
  "success": true,
  "current_model": {
    "id": "qwen-14b-q8",
    "name": "Qwen 2.5 14B Q8_0",
    "backend": "llamacpp",
    "vram": "14.8GB",
    "speed": "40-50 tok/s"
  },
  "previous_model": "llama-8b-q5",
  "switch_time": 18.3
}

Response (Error):

{
  "success": false,
  "error": "Model 'invalid-model' not found in registry"
}
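
For scripting or testing, you can call the same endpoint from Python. A small sketch using the requests library (payload and response fields as documented above):

# Sketch: trigger a model switch from a Python script.
import requests

resp = requests.post(
    "http://localhost:8000/api/models/select",
    json={"model_id": "qwen-14b-q8"},
    timeout=120,  # allow for the 15-30 second (or longer) switch window
)
result = resp.json()

if result.get("success"):
    print(f"Switched to {result['current_model']['name']} in {result['switch_time']:.1f}s")
else:
    print(f"Switch failed: {result.get('error')}")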

Frontend Integration

The frontend provides a Settings Panel for model switching:

// src/components/Settings/ModelSelector.tsx
const handleModelSwitch = async (modelId: string) => {
  setIsSwitching(true);
 
  try {
    const response = await fetch('http://localhost:8000/api/models/select', {
      method: 'POST',
      headers: { 'Content-Type': 'application/json' },
      body: JSON.stringify({ model_id: modelId })
    });
 
    const result = await response.json();
 
    if (result.success) {
      toast.success(`Switched to ${result.current_model.name}`);
    } else {
      toast.error(`Switch failed: ${result.error}`);
    }
  } catch (error) {
    toast.error('Failed to switch model');
  } finally {
    setIsSwitching(false);
  }
};

Tauri IPC Integration

You can also use Tauri IPC commands for native integration:

// src-tauri/src/commands.rs
#[tauri::command]
async fn switch_model(model_id: String) -> Result<ModelInfo, String> {
    let client = reqwest::Client::new();
 
    let response = client
        .post("http://localhost:8000/api/models/select")
        .json(&serde_json::json!({ "model_id": model_id }))
        .send()
        .await
        .map_err(|e| format!("Switch failed: {}", e))?;
 
    let result: ModelSwitchResponse = response
        .json()
        .await
        .map_err(|e| format!("Failed to parse response: {}", e))?;
 
    Ok(result.current_model)
}

Model Configuration

Quantization Levels

Apollo supports various GGUF quantization levels (tradeoff between size/speed and quality):

| Quantization | File Size | VRAM | Speed | Quality | Best For |
|---|---|---|---|---|---|
| Q4_K_M | ~4.5GB | 4.8GB | 100-120 tok/s | Fair | Fast inference, resource-constrained |
| Q5_K_M | ~5.6GB | 5.4GB | 80-100 tok/s | Good | Recommended balanced option |
| Q6_K | ~6.8GB | 7.2GB | 60-80 tok/s | Very Good | Higher quality, moderate speed |
| Q8_0 | ~9.5GB | 10GB | 40-60 tok/s | Excellent | Best quality, slower |

Quantization Format: Apollo uses GGUF (GPT-Generated Unified Format), the modern replacement for GGML. GGUF models are:

  • More portable across platforms
  • Faster to load (metadata in header)
  • Better memory-mapped I/O support
  • Compatible with llama.cpp 0.3.2+

Model Parameters

When configuring models, these parameters control behavior:

# backend/_src/llm_engine_llamacpp.py
LLM_CONFIG = {
    "model_path": "./models/llama-3.1-8b-instruct.Q5_K_M.gguf",
    "n_gpu_layers": 33,        # Offload all layers to GPU (faster)
    "n_ctx": 8192,             # Context window size
    "n_batch": 512,            # Batch size for prompt processing
    "temperature": 0.0,        # Deterministic (no randomness)
    "max_tokens": 512,         # Max response length
    "use_mlock": True,         # Lock model in RAM (prevents swapping)
    "use_mmap": True,          # Memory-map model file (faster load)
    "verbose": False           # Disable llama.cpp logging
}

Parameter Explanations:

  • n_gpu_layers: Number of model layers offloaded to GPU. Set to 33 for full GPU offload on 8B models.
  • n_ctx: Context window size (8192 tokens = ~6000 words). Larger = more context but slower.
  • n_batch: Batch size for prompt processing. 512 is optimal for RTX 5080.
  • temperature: Controls randomness. 0.0 = deterministic (recommended for RAG).
  • use_mlock: Locks model in RAM to prevent OS swapping (improves consistency).
  • use_mmap: Memory-maps model file instead of loading into RAM (faster startup).
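
Note that in llama-cpp-python the load-time options (n_gpu_layers, n_ctx, n_batch, use_mlock, use_mmap, verbose) are passed to the Llama constructor, while temperature and max_tokens apply per generation call. A minimal sketch mapping LLM_CONFIG onto the library:

# Sketch: load-time vs generation-time parameters in llama-cpp-python.
from llama_cpp import Llama

llm = Llama(
    model_path=LLM_CONFIG["model_path"],
    n_gpu_layers=LLM_CONFIG["n_gpu_layers"],
    n_ctx=LLM_CONFIG["n_ctx"],
    n_batch=LLM_CONFIG["n_batch"],
    use_mlock=LLM_CONFIG["use_mlock"],
    use_mmap=LLM_CONFIG["use_mmap"],
    verbose=LLM_CONFIG["verbose"],
)

output = llm(
    "Summarize the retrieved context in one sentence.",
    max_tokens=LLM_CONFIG["max_tokens"],
    temperature=LLM_CONFIG["temperature"],
)
print(output["choices"][0]["text"])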

Supported Formats

GGUF (Primary Format)

Apollo exclusively uses GGUF quantized models for optimal performance:

Advantages:

  • GPU acceleration via llama.cpp (80-100 tok/s on RTX 5080)
  • Memory efficient (5-15GB VRAM depending on quantization)
  • Fast loading (less than 10s for 8B models)
  • Portable (runs on CPU or GPU)

Where to Download: GGUF builds of these models are published on Hugging Face (see the example URLs below).

Example GGUF Models:

# Llama 3.1 8B (Recommended)
wget https://huggingface.co/bartowski/Meta-Llama-3.1-8B-Instruct-GGUF/resolve/main/Meta-Llama-3.1-8B-Instruct-Q5_K_M.gguf
 
# Qwen 2.5 14B (High Quality)
wget https://huggingface.co/Qwen/Qwen2.5-14B-Instruct-GGUF/resolve/main/qwen2.5-14b-instruct-q8_0.gguf

Unsupported Formats

Apollo does not support:

  • PyTorch (.bin, .pt, .safetensors): Too large, slow, and requires the transformers library
  • ONNX (.onnx): No llama.cpp support
  • TensorFlow (.pb): Incompatible runtime
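
A simple guard that rejects unsupported formats before attempting to load (illustrative; Apollo may validate differently):

# Illustrative format check before loading a model file.
from pathlib import Path

SUPPORTED_SUFFIXES = {".gguf"}

def validate_model_path(path: str) -> Path:
    p = Path(path)
    if p.suffix.lower() not in SUPPORTED_SUFFIXES:
        raise ValueError(
            f"Unsupported model format '{p.suffix}'. Apollo only loads GGUF models."
        )
    if not p.exists():
        raise FileNotFoundError(f"Model file not found: {p}")
    return p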

Memory Considerations

VRAM Requirements by Model Size

⚠️ VRAM Planning: Always allocate 10-15% headroom for activations, the KV cache, and other system overhead.

| Model Size | Quantization | Base VRAM | Overhead | Total VRAM | RTX GPU |
|---|---|---|---|---|---|
| 7-8B | Q4_K_M | 4.5GB | +0.8GB | 5.3GB | RTX 3060 (12GB) |
| 7-8B | Q5_K_M | 5.4GB | +1.0GB | 6.4GB | RTX 3060 Ti (8GB) |
| 7-8B | Q8_0 | 7.5GB | +1.2GB | 8.7GB | RTX 4060 Ti (16GB) |
| 13-14B | Q4_K_M | 8.5GB | +1.5GB | 10GB | RTX 4070 (12GB) |
| 13-14B | Q5_K_M | 10.5GB | +2.0GB | 12.5GB | RTX 4070 Ti (12GB) |
| 13-14B | Q8_0 | 14.8GB | +2.5GB | 17.3GB | RTX 5080 (16GB) |
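
As a rule of thumb from the table above, total VRAM is roughly the base model size plus 10-15% overhead. A small hypothetical helper (not part of Apollo) to sanity-check a model against your GPU:

# Hypothetical helper: estimate whether a model fits in VRAM using the
# 10-15% overhead rule of thumb from the table above.
def fits_in_vram(base_model_gb: float, gpu_vram_gb: float,
                 overhead_fraction: float = 0.15) -> bool:
    estimated_total = base_model_gb * (1 + overhead_fraction)
    return estimated_total <= gpu_vram_gb

# Example: Llama 8B Q5_K_M (~5.4GB base) on an 8GB GPU -> True (~6.2GB total)
print(fits_in_vram(5.4, 8.0))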

Out-of-Memory Handling

If VRAM is exhausted, Apollo falls back gracefully:

# backend/_src/llm_engine_llamacpp.py
try:
    self.llm = Llama(
        model_path=config.model_path,
        n_gpu_layers=config.n_gpu_layers,  # Try full GPU offload
        n_ctx=config.n_ctx
    )
except Exception as e:
    logger.warning(f"GPU offload failed: {e}. Falling back to CPU.")
 
    # Fallback: Reduce GPU layers
    self.llm = Llama(
        model_path=config.model_path,
        n_gpu_layers=0,  # CPU-only
        n_ctx=config.n_ctx
    )
🚫 Critical: If you see CUDA out of memory errors:

  • Switch to a smaller quantization (Q8 → Q5 → Q4)
  • Reduce n_gpu_layers (33 → 20 → 0)
  • Close other GPU applications
  • Consider a smaller model (14B → 8B → 3B)
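
These mitigations can also be automated. A hedged sketch of a progressive fallback that lowers n_gpu_layers step by step (Apollo's built-in fallback shown above drops straight to CPU):

# Sketch: progressive GPU-layer fallback (33 -> 20 -> 0) on load failure.
import logging
from llama_cpp import Llama

logger = logging.getLogger(__name__)

def load_with_fallback(model_path: str, n_ctx: int = 8192) -> Llama:
    for gpu_layers in (33, 20, 0):
        try:
            return Llama(model_path=model_path,
                         n_gpu_layers=gpu_layers,
                         n_ctx=n_ctx)
        except Exception as e:
            logger.warning("Load with n_gpu_layers=%d failed: %s", gpu_layers, e)
    raise RuntimeError(f"Could not load {model_path} even with CPU-only settings")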

Performance Characteristics

Inference Speed by Model Size

Performance measured on RTX 5080 (16GB VRAM) with CUDA 12.1:

| Model | Quantization | VRAM | Speed (tok/s) | TTFT | Use Case |
|---|---|---|---|---|---|
| Llama 3.1 8B | Q5_K_M | 5.4GB | 80-100 | < 500ms | Fast, general-purpose |
| Llama 3.1 8B | Q8_0 | 7.5GB | 60-80 | < 600ms | Higher quality |
| Qwen 2.5 14B | Q5_K_M | 10.5GB | 50-60 | < 800ms | Complex queries |
| Qwen 2.5 14B | Q8_0 | 14.8GB | 40-50 | < 1000ms | Best quality |

TTFT: Time to First Token (latency before streaming starts)

Speed vs Quality Tradeoff

Fast (100 tok/s) ←─────────────────→ Quality (40 tok/s)
        ↑                                    ↑
   Llama 8B Q4                         Qwen 14B Q8
        |                                    |
        └─ Best for: High throughput        └─ Best for: Accuracy-critical
           Simple queries                       Complex reasoning
           Real-time chat                       Production RAG

Best Practices

When to Switch Models

Use Llama 3.1 8B Q5 (Fast) for:

  • Simple factual queries
  • High throughput requirements (many concurrent users)
  • Real-time chat experiences
  • Development and testing

Use Qwen 2.5 14B Q8 (Quality) for:

  • Complex reasoning tasks
  • Multi-step analysis
  • Production RAG with accuracy requirements
  • Domain-specific questions
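
If you want to automate this choice, a simple routing heuristic can pick a registry model ID per query. This is purely illustrative (Apollo does not ship such a router); the IDs match the registry above:

# Hypothetical routing heuristic: choose a model ID based on a rough
# query-complexity signal. Not part of Apollo.
COMPLEX_HINTS = ("compare", "analyze", "explain why", "step by step", "trade-off")

def pick_model_id(question: str) -> str:
    q = question.lower()
    if len(q.split()) > 40 or any(hint in q for hint in COMPLEX_HINTS):
        return "qwen-14b-q8"   # quality model for complex reasoning
    return "llama-8b-q5"       # fast default for simple factual queries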

Production Tips

  • Pre-load models during Docker build to avoid startup delays:

    # backend/Dockerfile.atlas
    COPY models/*.gguf /models/
    RUN python -c "from llama_cpp import Llama; Llama('/models/llama-3.1-8b.gguf', n_gpu_layers=0)"
  • Use environment variables for model selection:

    # docker-compose.atlas.yml
    environment:
      MODEL_PATH: /models/Meta-Llama-3.1-8B-Instruct-Q5_K_M.gguf
      GPU_LAYERS: 33
  • Monitor VRAM usage with health checks:

    # backend/app/api/health.py
    vram_used = torch.cuda.memory_allocated() / 1024**3  # GB
    vram_total = torch.cuda.get_device_properties(0).total_memory / 1024**3
     
    if vram_used / vram_total > 0.9:
        logger.warning(f"VRAM usage high: {vram_used:.1f}/{vram_total:.1f} GB")
  • Test model switches in staging before production:

    # Test switch to Qwen 14B
    curl -X POST http://localhost:8000/api/models/select \
      -H "Content-Type: application/json" \
      -d '{"model_id": "qwen-14b-q8"}'

Troubleshooting

Common Issues

Issue: Model switch fails with “CUDA out of memory”

Solution:
- Verify VRAM available: nvidia-smi
- Switch to smaller quantization (Q8 → Q5)
- Reduce GPU layers in config

Issue: Model loads but generation is very slow (< 10 tok/s)

Solution:
- Check GPU layers: Should be 33 for full offload
- Verify CUDA build: llama-cpp-python must be built with CUDA
- Check batch size: Increase n_batch to 512

Issue: Hot-swap takes more than 60 seconds

Solution:
- This is normal for 14B+ models
- Pre-cache models in Docker to speed up first load
- Use SSD for model storage (not HDD)

Issue: Model switch succeeds but old model still responds

Solution:
- The cache was not cleared. Clear it manually:
   POST /api/conversation/clear
- Restart frontend to reset state

Debug Commands

# List available models
curl http://localhost:8000/api/models
 
# Get current model
curl http://localhost:8000/api/models/current
 
# Check VRAM usage
nvidia-smi --query-gpu=memory.used,memory.total --format=csv
 
# Monitor model switching
docker-compose -f backend/docker-compose.atlas.yml logs -f atlas-backend

Next Steps

Learn about Streaming to understand how Apollo delivers real-time, token-by-token responses.