Architecture Overview
Apollo RAG is built as a GPU-accelerated, offline-capable RAG system packaged as a modern desktop application. The architecture combines three layers (frontend, desktop bridge, and backend) that work together to deliver high-performance document intelligence.
Key Design Goal: Achieve 80-100 tokens/sec inference with 8-15 second query latency while maintaining production-grade reliability and security.
System Overview
Apollo RAG implements a three-tier hybrid architecture that separates concerns while optimizing for performance:
- Frontend Layer: Modern React 19 UI with real-time streaming and performance optimization
- Desktop Bridge Layer: Tauri 2.9 native runtime providing OS integration and IPC
- Backend Layer: FastAPI-based RAG engine with GPU-accelerated inference and vector search
This separation enables:
- Independent scaling and optimization of each layer
- Hot-swappable components without full system restart
- Clear security boundaries and type-safe communication
- Native desktop capabilities combined with web technology flexibility
Three-Tier Architecture
+---------------------------------------------------------------+
| FRONTEND LAYER |
| |
| React 19 + TypeScript + Zustand |
| Runtime: WebView2 (Windows) / WebKit (macOS/Linux) |
| |
| Features: |
| * Real-time SSE streaming with 60fps token buffering |
| * Performance-optimized rendering (React.memo) |
| * Document management with drag-drop upload |
| * Live performance metrics dashboard |
| * Settings panel with model hot-swap UI |
| * Dark mode with localStorage persistence |
| |
+----------------+----------------------------------------------+
|
| +---------------------------------------+
+--| Tauri IPC (JSON-RPC) |
| | * Native features |
| | * Health monitoring |
| | * System notifications |
| +---------------------------------------+
|
| +---------------------------------------+
+--| Direct HTTP (Fetch API) |
| * Query endpoints |
| * Document upload |
| * SSE streaming |
+---------------------------------------+
|
+-------------------v---------------------------------------+
| DESKTOP BRIDGE LAYER |
| |
| Tauri 2.9 (Rust Native Binary) |
| Runtime: Native system process |
| |
| Responsibilities: |
| * IPC command handlers (type-safe bidirectional) |
| * Backend health monitoring (background task) |
| * HTTP proxy to FastAPI backend |
| * Native OS integration (file dialogs, notifications) |
| * WebView management and security policies |
| * Application lifecycle management |
| |
+----------------+------------------------------------------+
|
| HTTP/REST (reqwest crate)
| Target: http://localhost:8000
|
+----------------v------------------------------------------+
| BACKEND LAYER |
| |
| FastAPI + Python 3.11 |
| Runtime: Docker containers with CUDA support |
| |
| Core Components: |
| * RAG orchestration engine (parallel initialization) |
| * llama.cpp GPU inference (80-100 tokens/sec) |
| * Vector database - Qdrant (dense + sparse search) |
| * Multi-tier caching - Redis (L1-L5 ATLAS protocol) |
| * Document processing pipeline (PDF/DOCX/TXT) |
| * Adaptive retrieval strategies (simple/hybrid/advanced) |
| * Model management (hot-swappable LLMs) |
| * Confidence scoring and source citation |
| |
| Infrastructure: |
| * Qdrant: Vector storage and similarity search |
| * Redis: Multi-level caching and session management |
| * CUDA Toolkit: GPU acceleration for inference |
| |
+-----------------------------------------------------------+

Dual Communication Pattern: Apollo uses both Tauri IPC and direct HTTP for optimal performance. IPC handles lightweight native features, while HTTP handles data-intensive operations such as streaming queries.
Frontend Layer
Technology Stack
- Framework: React 19.1.1 (cutting-edge concurrent features)
- Language: TypeScript 5.7 with strict mode
- State Management: Zustand 5.0.8 with persistence middleware
- Styling: TailwindCSS 3.4.18 with custom design system
- Build Tool: Vite 7.1.7 targeting ESNext
- Desktop Integration: Tauri 2.9.0 API bindings
Key Features
Performance Optimizations:
- Token buffering with 60fps throttling prevents UI freezes during streaming
- React.memo with custom comparison functions for efficient re-renders
- Memoized performance store computations with checksum validation
- Lazy loading for heavy components (PDF viewer, charts)
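The 60fps token-buffering idea can be sketched in a few lines. Apollo implements this in the React frontend; the Python version below is only an illustration of the throttling logic, with all names invented for the example: tokens accumulate in a buffer and are flushed to the renderer at most once per ~16.7ms frame, so a 100 tokens/sec stream triggers far fewer re-renders than tokens.

```python
import time

class TokenBuffer:
    """Illustrative sketch of 60fps token batching (Apollo does this in
    React; Python is used here only to show the throttling logic)."""
    FRAME = 1.0 / 60  # ~16.7ms per frame

    def __init__(self, render):
        self._render = render      # callback that updates the UI
        self._pending = []
        self._last_flush = 0.0

    def push(self, token, now=None):
        # Accumulate tokens; only flush when a full frame has elapsed
        now = time.monotonic() if now is None else now
        self._pending.append(token)
        if now - self._last_flush >= self.FRAME:
            self._render("".join(self._pending))
            self._pending.clear()
            self._last_flush = now

frames = []
buf = TokenBuffer(frames.append)
# Simulate 5 tokens arriving 5ms apart: most pushes land inside one frame
for i, tok in enumerate(["a", "b", "c", "d", "e"]):
    buf.push(tok, now=i * 0.005)
assert len(frames) < 5  # fewer renders than tokens
```

The same pattern maps directly onto a `requestAnimationFrame` loop in the real React implementation.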
User Experience:
- Real-time SSE streaming with visual token buffering
- Drag-and-drop document upload with validation
- Live performance metrics and timing history
- Comprehensive error boundaries with field operator logging
- Offline detection with automatic recovery
- Dark mode with system preference detection
Accessibility:
- Full keyboard navigation support
- ARIA labels and semantic HTML
- Screen reader compatibility
- Focus management and tab order
Critical Files
src/
├── App.tsx // Main application entry with error boundary
├── hooks/
│ ├── useChat.ts // Chat orchestration with token batching
│ └── useStreamingChat.ts // SSE streaming implementation
├── store/
│ ├── useStore.ts // Main Zustand store with persistence
│ └── performanceStore.ts // Memoized performance calculations
├── services/
│ └── api.ts // API client with retry logic
└── components/
├── Chat/
│ ├── ChatWindow.tsx // Main chat interface
│ └── ChatMessage.tsx // Optimized message rendering (React.memo)
└── ErrorBoundary/
    └── ErrorBoundary.tsx // Production error recovery

Desktop Bridge Layer
Technology Stack
- Framework: Tauri 2.9.0 (stable)
- Language: Rust 2021 edition with tokio async runtime
- HTTP Client: reqwest 0.11 with connection pooling
- Plugins:
- tauri-plugin-fs - File system access
- tauri-plugin-dialog - Native dialogs
- tauri-plugin-store - Key-value storage
- tauri-plugin-shell - System integration
- tauri-plugin-log - Structured logging
Responsibilities
IPC Bridge:
- Type-safe command handlers with serde serialization
- Bidirectional communication (Frontend ↔ Rust ↔ Backend)
- Async/await support for non-blocking operations
- Error propagation with detailed context
Backend Management:
- Health monitoring with 10-second polling interval
- HTTP proxy for backend API calls
- Connection pooling and retry logic
- Status tracking (running, healthy, error states)
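The status-tracking portion of this loop is a small state machine: consecutive failures escalate the reported state, and a single success resets it. Apollo implements the monitor in Rust with tokio; the sketch below uses Python purely for illustration, and the threshold and state names are assumptions, not Apollo's actual values.

```python
class HealthMonitor:
    """Illustrative sketch of the bridge's health-status tracking
    (Rust/tokio in Apollo; Python here). Consecutive failed polls
    escalate the state; one success resets it."""

    def __init__(self, error_threshold=3):
        self.failures = 0
        self.threshold = error_threshold
        self.status = "unknown"

    def record(self, ok: bool) -> str:
        # Called once per polling interval (10s in Apollo)
        self.failures = 0 if ok else self.failures + 1
        if ok:
            self.status = "healthy"
        elif self.failures >= self.threshold:
            self.status = "error"
        else:
            self.status = "degraded"
        return self.status

mon = HealthMonitor()
assert mon.record(True) == "healthy"
assert mon.record(False) == "degraded"
assert mon.record(False) == "degraded"
assert mon.record(False) == "error"   # third consecutive failure
```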
Native Integration:
- File picker dialogs
- System notifications
- URL opening in default browser
- Window management and lifecycle
Security Model
// Command allowlisting prevents arbitrary Rust execution
#[tauri::command]
async fn check_atlas_health(
state: State<'_, Arc<tokio::sync::Mutex<AppState>>>
) -> Result<HealthStatus, String> {
// Type-safe, validated, allowlisted command
}

Security Boundaries:
- Command allowlisting (only explicitly registered functions)
- HTTP client isolated from frontend (prevents SSRF)
- Shell access restricted to open::that() for URLs only
- State management via IPC prevents direct access
- WebView sandboxing with capability system
Distribution
Packaging:
- Windows: NSIS installer with WebView2 bootstrapper (~25-30MB)
- macOS: DMG with code signing
- Linux: AppImage with required dependencies bundled
Build Process:
- Vite builds React frontend → dist/
- Cargo compiles Rust binary → native executable
- Tauri packages frontend + binary + resources
- Platform-specific installer creation
Backend Layer
Technology Stack
- Framework: FastAPI 0.115.6 with async/await
- Language: Python 3.11 with type hints
- LLM Backend: llama.cpp 0.3.2 with CUDA 12.1
- Vector Database: Qdrant 1.15.0 (production)
- Cache Layer: Redis 7.2 with LRU eviction
- Document Processing: PyPDF2, python-docx, unstructured
- Embeddings: sentence-transformers (BAAI/bge-large-en-v1.5)
- Reranking: BGE reranker (GPU-accelerated)
Core Architecture
Apollo’s backend implements 8 architectural layers:
- API Layer: REST endpoints with rate limiting, CORS, validation
- RAG Engine: Master orchestrator with parallel initialization
- Inference Engine: llama.cpp with thread-safe executor
- Retrieval Engine: Adaptive strategies (simple/hybrid/advanced)
- Vector Stores: Qdrant (dense + sparse search)
- Document Processor: Multi-format support with semantic chunking
- Caching System: L1-L5 ATLAS protocol
- Model Manager: Hot-swappable LLM runtime
Performance Characteristics
Startup Time: 20-30 seconds (optimized from 78s via parallel init)
Query Latency:
Cache Hit: less than 1ms (98% reduction from 50-100ms)
Simple Mode: 8-15 seconds
Adaptive Mode: 10-25 seconds
Inference Speed: 80-100 tokens/sec (RTX 5080, Q5_K_M quantization)
Cache Hit Rate: 60-80% in production workloads
Resource Usage:
RAM: 8GB idle, 12GB active, 20GB during reindexing
VRAM: 6GB loaded model, 10GB with reranker
CPU: less than 5% idle, 100% single-core during generation

Critical Innovations
Parallel Component Initialization
# Groups independent components for concurrent loading
# Speedup: 3.4x (78s → 23s)
async def initialize():
# GROUP 1: Fully independent
await asyncio.gather(
init_embedding_cache(),
init_embeddings(),
load_bm25_data(),
init_llm()
)
# GROUP 2: Depends on GROUP 1
await asyncio.gather(
init_cache_manager(),
init_vectorstore(),
init_bm25_retriever(),
llm_warmup()
)
# GROUP 3: Final components
await asyncio.gather(
load_metadata(),
init_conversation_memory(),
init_confidence_scorer()
)Multi-Level Caching (ATLAS Protocol)
- L1 - Query Cache: Exact → Normalized → Semantic matching (Redis hash)
- L2 - Embedding Cache: 98% latency reduction (50ms → less than 1ms)
- L3 - Conversation Memory: Ring buffer with automatic summarization
- L4 - Model Cache: Pre-cached in Docker image (15-20s saved)
- L5 - Query Prefetcher: Predictive pattern-based prefetching
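The L1 lookup cascade (exact → normalized → semantic) can be sketched with a plain dict standing in for the Redis hash. This is a minimal illustration, not Apollo's implementation: the class and key scheme are invented for the example, and the semantic-matching tier (which needs embeddings) is omitted.

```python
import hashlib

class QueryCache:
    """Minimal sketch of the L1 query cache: exact match first, then a
    normalized match. A dict stands in for the Redis hash; the semantic
    tier is omitted because it requires an embedding model."""

    def __init__(self):
        self._store = {}

    @staticmethod
    def _normalize(query: str) -> str:
        # Collapse case and whitespace so trivially different
        # phrasings share a cache key
        return " ".join(query.lower().split())

    def _key(self, query: str) -> str:
        return hashlib.sha256(query.encode()).hexdigest()

    def get(self, query: str):
        # Tier 1: exact match
        hit = self._store.get(self._key(query))
        if hit is not None:
            return hit
        # Tier 2: normalized match
        return self._store.get(self._key(self._normalize(query)))

    def put(self, query: str, answer: str):
        self._store[self._key(query)] = answer
        self._store[self._key(self._normalize(query))] = answer

cache = QueryCache()
cache.put("What is RAG?", "Retrieval-augmented generation ...")
assert cache.get("what is  RAG?") is not None  # normalized hit
assert cache.get("unrelated query") is None    # miss, falls through to retrieval
```

A miss at every tier is what triggers the full retrieval pipeline described below.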
KV Cache Preservation
# Trade-off: Minor context bleed for 40-60% speedup
# Does NOT clear cache between requests
def generate(self, prompt, **kwargs):
response = self.llm(prompt, max_tokens=kwargs.get('max_tokens', 512))
return response['choices'][0]['text']

Hot Model Swapping
Workflow:
- POST /api/models/switch with model_id
- Validate and acquire switching lock (blocks queries)
- Unload current model + VRAM cleanup (torch.cuda.empty_cache())
- Load new model from /models/ directory
- Test generation (5 tokens)
- Sync RAGEngine LLM reference
- Clear cache (old model answers incompatible)
- Release lock
Timing: 15-30 seconds total downtime
Safety: Mutex prevents concurrent switches, automatic fallback on failure
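The mutex-plus-fallback safety described above can be sketched as follows. This is a hedged illustration of the pattern, not Apollo's model manager: the class, loader callable, and warmup call are invented names, and the real implementation additionally frees VRAM and clears the answer cache.

```python
import threading

class ModelManager:
    """Sketch of the hot-swap workflow: a mutex blocks concurrent
    switches, and a failed load falls back to the previous model.
    Names are illustrative, not Apollo's actual API."""

    def __init__(self, loader):
        self._lock = threading.Lock()
        self._loader = loader          # callable: model_id -> model object
        self.current = None
        self.current_id = None

    def switch(self, model_id: str) -> bool:
        with self._lock:               # acquire switching lock
            previous, previous_id = self.current, self.current_id
            self.current = None        # unload (VRAM cleanup in production)
            try:
                model = self._loader(model_id)  # load from the models directory
                model("warmup")                 # short test generation
            except Exception:
                # Automatic fallback: restore the previous model
                self.current, self.current_id = previous, previous_id
                return False
            self.current, self.current_id = model, model_id
            return True                # lock released on exit

def fake_loader(model_id):
    if model_id == "broken":
        raise RuntimeError("load failed")
    return lambda prompt: f"{model_id}: ok"

mgr = ModelManager(fake_loader)
assert mgr.switch("model-a") is True
assert mgr.switch("broken") is False
assert mgr.current_id == "model-a"  # fell back to the previous model
```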
Communication Patterns
Dual Communication Architecture
Apollo uses two parallel communication channels for optimization:
Tauri IPC (JSON-RPC):
- Use Case: Native features, health checks, system integration
- Protocol: JSON serialization over IPC bridge
- Latency: less than 10ms for lightweight operations
- Examples:
check_atlas_health(), switch_model(), show_notification()
Direct HTTP (Fetch API):
- Use Case: Data-intensive operations, streaming
- Protocol: REST + Server-Sent Events
- Latency: Varies by operation (ms for health, seconds for queries)
- Examples:
/api/query/stream, /api/documents/upload
Why Dual Channels? IPC adds serialization overhead. For high-throughput operations like streaming tokens (100+/sec), direct HTTP bypasses IPC, reducing latency by 40-60%.
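On the wire, the streaming channel is plain Server-Sent Events: each token is one `data:` frame. The sketch below shows only the SSE frame format; the JSON payload shape and the `[DONE]` sentinel are common conventions assumed for illustration, not confirmed details of Apollo's `/api/query/stream` endpoint.

```python
import json

def sse_frames(tokens):
    """Format generated tokens as Server-Sent Events frames, the wire
    format behind streaming endpoints like /api/query/stream. The
    payload shape and end-of-stream sentinel are assumptions."""
    for tok in tokens:
        # Each SSE event is "data: <payload>" followed by a blank line
        yield f"data: {json.dumps({'token': tok})}\n\n"
    yield "data: [DONE]\n\n"

frames = list(sse_frames(["Hel", "lo"]))
assert frames[0] == 'data: {"token": "Hel"}\n\n'
assert frames[-1] == "data: [DONE]\n\n"
```

Because each frame is self-delimiting, the frontend can feed frames straight into its 60fps token buffer without waiting for the full response.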
IPC Bridge Pattern
// Frontend: Type-safe Tauri invoke
import { invoke } from '@tauri-apps/api/core';
interface HealthStatus {
status: 'healthy' | 'unhealthy' | 'degraded';
components: {
vectorstore: 'ready' | 'error';
llm: 'ready' | 'error';
cache: 'ready' | 'error';
};
}
export const checkHealth = async (): Promise<HealthStatus> => {
return await invoke<HealthStatus>('check_atlas_health');
};

// Backend: Rust command handler
#[tauri::command]
async fn check_atlas_health(
state: State<'_, Arc<tokio::sync::Mutex<AppState>>>
) -> Result<HealthStatus, String> {
let app_state = state.lock().await;
let response = app_state.http_client
.get(format!("{}/api/health", app_state.backend_url))
.timeout(Duration::from_secs(5))
.send()
.await
.map_err(|e| format!("Health check failed: {}", e))?;
response.json().await.map_err(|e| format!("Parse error: {}", e))
}

Data Flow Overview
Query Processing Pipeline (High-Level)
User Query --> Frontend Validation --> Cache Lookup (L1-L2)
|
+--> [Cache Hit] --> Return cached result (<1ms)
|
+--> [Cache Miss] --> Continue to retrieval
|
v
Security Checks (rate limit, injection detection)
|
v
Query Classification (simple/moderate/complex)
|
v
Document Retrieval (mode-specific strategy)
|
v
Answer Generation (llama.cpp GPU)
|
v
Confidence Scoring (parallel)
|
v
Response Formatting
|
v
Background Tasks (cache update, prefetch)

Timing Breakdown:
- Security checks: 5ms
- Classification (if adaptive): 200ms
- Embedding generation: 50ms (CPU) or less than 1ms (cache hit)
- Vector search: 100ms (Qdrant HNSW @ 1M documents)
- Reranking (if advanced): 60ms (BGE GPU)
- LLM generation: 8-12s (simple) or 10-15s (adaptive)
- Confidence scoring: 500ms (parallel with generation)
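The note that confidence scoring runs "parallel with generation" is the key latency trick here: its ~500ms cost is hidden behind the multi-second LLM call. A minimal asyncio sketch, with stub coroutines standing in for the real generation and scoring steps (all names invented for the example):

```python
import asyncio

async def generate_answer(query, docs):
    # Stand-in for the llama.cpp generation call (8-15s in production)
    await asyncio.sleep(0)
    return f"answer to: {query}"

async def score_confidence(query, docs):
    # Stand-in for the confidence scorer (~500ms in production)
    await asyncio.sleep(0)
    return 0.87

async def answer(query, docs):
    # Run generation and scoring concurrently; total latency is
    # max(generation, scoring), not their sum
    answer_text, confidence = await asyncio.gather(
        generate_answer(query, docs),
        score_confidence(query, docs),
    )
    return {"answer": answer_text, "confidence": confidence}

result = asyncio.run(answer("what is apollo?", []))
assert result["answer"].startswith("answer to:")
assert result["confidence"] == 0.87
```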
Document Upload Pipeline (High-Level)
File Selection --> Frontend Validation (size, type)
|
v
POST multipart/form-data --> Backend /api/documents/upload
|
v
Filename sanitization --> SHA256 hash --> Duplicate check
|
v
Save to documents/ directory
|
v
User triggers reindexing
|
v
Acquire mutex lock (prevent concurrent reindex)
|
v
Text extraction --> Semantic chunking (1024/128)
|
v
Generate embeddings (BGE-large on CPU)
|
v
Index in Qdrant (dense + sparse BM25)
|
v
Atomic swap (temp --> production)
|
v
Release lock --> Return stats

Timing: 15-25 seconds for a typical document (depending on size and complexity)
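The two safety mechanisms in this pipeline, the mutex and the atomic swap, can be sketched together. This is an illustration only: a non-blocking lock rejects concurrent reindex requests, and the new index is built off to the side and swapped in with a single assignment so readers never observe a half-built index. In production the swap is performed at the Qdrant collection level; here a dict stands in, and the chunking/embedding step is reduced to a placeholder.

```python
import threading

class Indexer:
    """Sketch of reindex safety: a mutex prevents concurrent reindexing,
    and a temp index is swapped into production atomically. A dict
    stands in for the Qdrant collection."""

    def __init__(self):
        self._lock = threading.Lock()
        self.index = {}            # the "production" index readers query

    def reindex(self, documents):
        # Non-blocking acquire: a second reindex request fails fast
        if not self._lock.acquire(blocking=False):
            raise RuntimeError("reindex already in progress")
        try:
            temp = {}              # build into a temp structure first
            for doc_id, text in documents.items():
                # Chunking (1024/128) and embedding would happen here;
                # lowercasing is just a placeholder transform
                temp[doc_id] = text.lower()
            self.index = temp      # atomic swap: temp --> production
        finally:
            self._lock.release()

ix = Indexer()
ix.reindex({"a.pdf": "Hello World"})
assert ix.index == {"a.pdf": "hello world"}
```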
Technology Stack Summary
Layer-by-Layer Breakdown
| Layer | Component | Technology | Purpose |
|---|---|---|---|
| Frontend | UI Framework | React 19.1.1 | Modern concurrent rendering |
| | Language | TypeScript 5.7 | Type-safe development |
| | State | Zustand 5.0.8 | Lightweight state management |
| | Styling | TailwindCSS 3.4.18 | Utility-first CSS |
| | Build | Vite 7.1.7 | Fast HMR and bundling |
| Bridge | Runtime | Tauri 2.9.0 | Native desktop capabilities |
| | Language | Rust 2021 | Performance and safety |
| | HTTP Client | reqwest 0.11 | Async HTTP requests |
| | Serialization | serde + serde_json | Type-safe IPC |
| Backend | Framework | FastAPI 0.115.6 | Async REST API |
| | Language | Python 3.11 | ML/AI ecosystem |
| | LLM | llama.cpp 0.3.2 | GPU inference |
| | Embeddings | sentence-transformers | Dense vectors |
| | Vector DB | Qdrant 1.15.0 | Similarity search |
| | Cache | Redis 7.2 | Multi-tier caching |
| | Queue | asyncio | Background tasks |
Infrastructure
| Component | Version | Configuration | Purpose |
|---|---|---|---|
| Docker | 24+ | Compose v2 | Container orchestration |
| CUDA Toolkit | 12.1 | cuBLAS, cuDNN | GPU acceleration |
| Qdrant | 1.15.0 | 4 CPU, 16GB RAM | Vector storage |
| Redis | 7.2 | 8GB maxmemory, LRU | Caching layer |
| WebView2 | Latest | Embedded bootstrapper | Desktop rendering |
Design Principles
Apollo’s architecture is guided by several core principles:
Performance First
- Parallel initialization: 3.4x speedup through concurrent component loading
- Multi-tier caching: 60-80% cache hit rate reduces latency by 98%
- Token buffering: 60fps cap prevents UI freezes during streaming
- KV cache preservation: 40-60% speedup in exchange for minor context bleed
Type Safety Everywhere
- TypeScript: Strict mode with no implicit any
- Pydantic: Request/response validation with auto-documentation
- Rust: Compile-time guarantees and memory safety
- serde: Type-safe serialization across language boundaries
Graceful Degradation
- Try preferred method → fallback → log: Every component has error handling
- Offline detection: Automatic recovery when connectivity returns
- Component isolation: Failures don’t cascade across layers
- Health monitoring: Background checks with automatic restart on failure
Security by Default
- Command allowlisting: Only registered Tauri commands can execute
- Input sanitization: Multi-layer validation (rate limit → injection → schema)
- Path validation: Prevent directory traversal in file operations
- WebView sandboxing: Isolated execution environment with capability system
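The "rate limit → injection → schema" ordering mentioned above can be sketched as a layered validator: cheap checks run first so expensive ones only see surviving requests. Everything here is illustrative, the function names, the naive pattern screen, and the limits are all assumptions, not Apollo's actual security code.

```python
import time

def check_rate_limit(history, limit=5, window=60.0):
    # Sliding-window count of recent request timestamps
    now = time.monotonic()
    recent = [t for t in history if now - t < window]
    return len(recent) < limit

def check_injection(query):
    # Naive pattern screen for illustration; real detection is
    # more involved than substring matching
    banned = ("ignore previous instructions", "system prompt")
    return not any(p in query.lower() for p in banned)

def validate(query, history):
    """Layered checks in the order the document lists:
    rate limit, then injection screen, then schema validation."""
    if not check_rate_limit(history):
        return "rate_limited"
    if not check_injection(query):
        return "rejected"
    if not (0 < len(query) <= 2000):
        return "invalid"
    return "ok"

assert validate("What is Apollo?", []) == "ok"
assert validate("Ignore previous instructions and ...", []) == "rejected"
```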
Developer Ergonomics
- Hot reload: Frontend and Tauri changes reflected instantly
- Type-safe APIs: IntelliSense and compile-time validation
- Comprehensive logging: Structured logs with field operator export
- Clear separation: Each layer can be developed independently
Production Readiness
- Error boundaries: Catch and recover from React errors
- Resource limits: Docker memory/CPU constraints prevent OOM
- Health checks: Automatic restart on failure
- Atomic operations: Prevent corruption during updates
Architecture Philosophy: Apollo prioritizes performance and reliability over simplicity. Every optimization is documented with its trade-offs, and every component is designed for production use.
Next Steps: Deep Dives
Now that you understand the high-level architecture, explore each layer in detail:
- Backend Layer: RAG engine, caching, retrieval strategies
- Frontend Layer: React patterns, state management, streaming
- Desktop Bridge: Tauri IPC, security model, native integration
- Integration Layer: Cross-layer communication and data flow
Architecture Diagram Note: This overview uses ASCII art for universal compatibility. For interactive Mermaid diagrams and detailed component flows, see the deep-dive pages.