Architecture Overview

Apollo RAG is a GPU-accelerated, offline-capable RAG system packaged as a modern desktop application. Its architecture combines three cooperating layers to deliver high-performance document intelligence.

Key Design Goal: Achieve 80-100 tokens/sec inference with 8-15 second query latency while maintaining production-grade reliability and security.

System Overview

Apollo RAG implements a three-tier hybrid architecture that separates concerns while optimizing for performance:

  • Frontend Layer: Modern React 19 UI with real-time streaming and performance optimization
  • Desktop Bridge Layer: Tauri 2.9 native runtime providing OS integration and IPC
  • Backend Layer: FastAPI-based RAG engine with GPU-accelerated inference and vector search

This separation enables:

  • Independent scaling and optimization of each layer
  • Hot-swappable components without full system restart
  • Clear security boundaries and type-safe communication
  • Native desktop capabilities combined with web technology flexibility

Three-Tier Architecture

+---------------------------------------------------------------+
|                     FRONTEND LAYER                            |
|                                                               |
|  React 19 + TypeScript + Zustand                              |
|  Runtime: WebView2 (Windows) / WebKit (macOS/Linux)           |
|                                                               |
|  Features:                                                    |
|  * Real-time SSE streaming with 60fps token buffering         |
|  * Performance-optimized rendering (React.memo)               |
|  * Document management with drag-drop upload                  |
|  * Live performance metrics dashboard                         |
|  * Settings panel with model hot-swap UI                      |
|  * Dark mode with localStorage persistence                    |
|                                                               |
+----------------+----------------------------------------------+
                 |
                 |  +---------------------------------------+
                 +--| Tauri IPC (JSON-RPC)                  |
                 |  | * Native features                     |
                 |  | * Health monitoring                   |
                 |  | * System notifications                |
                 |  +---------------------------------------+
                 |
                 |  +---------------------------------------+
                 +--| Direct HTTP (Fetch API)               |
                    | * Query endpoints                     |
                    | * Document upload                     |
                    | * SSE streaming                       |
                    +---------------------------------------+
                    |
+-------------------v---------------------------------------+
|                DESKTOP BRIDGE LAYER                       |
|                                                           |
|  Tauri 2.9 (Rust Native Binary)                           |
|  Runtime: Native system process                           |
|                                                           |
|  Responsibilities:                                        |
|  * IPC command handlers (type-safe bidirectional)         |
|  * Backend health monitoring (background task)            |
|  * HTTP proxy to FastAPI backend                          |
|  * Native OS integration (file dialogs, notifications)    |
|  * WebView management and security policies               |
|  * Application lifecycle management                       |
|                                                           |
+----------------+------------------------------------------+
                 |
                 | HTTP/REST (reqwest crate)
                 | Target: http://localhost:8000
                 |
+----------------v------------------------------------------+
|                   BACKEND LAYER                           |
|                                                           |
|  FastAPI + Python 3.11                                    |
|  Runtime: Docker containers with CUDA support             |
|                                                           |
|  Core Components:                                         |
|  * RAG orchestration engine (parallel initialization)     |
|  * llama.cpp GPU inference (80-100 tokens/sec)            |
|  * Vector database - Qdrant (dense + sparse search)       |
|  * Multi-tier caching - Redis (L1-L5 ATLAS protocol)      |
|  * Document processing pipeline (PDF/DOCX/TXT)            |
|  * Adaptive retrieval strategies (simple/hybrid/advanced) |
|  * Model management (hot-swappable LLMs)                  |
|  * Confidence scoring and source citation                 |
|                                                           |
|  Infrastructure:                                          |
|  * Qdrant: Vector storage and similarity search           |
|  * Redis: Multi-level caching and session management      |
|  * CUDA Toolkit: GPU acceleration for inference           |
|                                                           |
+-----------------------------------------------------------+

Dual Communication Pattern: Apollo uses both Tauri IPC and direct HTTP for optimal performance. IPC handles lightweight native features, while HTTP handles data-intensive operations like streaming queries.

Frontend Layer

Technology Stack

  • Framework: React 19.1.1 (cutting-edge concurrent features)
  • Language: TypeScript 5.7 with strict mode
  • State Management: Zustand 5.0.8 with persistence middleware
  • Styling: TailwindCSS 3.4.18 with custom design system
  • Build Tool: Vite 7.1.7 targeting ESNext
  • Desktop Integration: Tauri 2.9.0 API bindings

Key Features

Performance Optimizations:

  • Token buffering with 60fps throttling prevents UI freezes during streaming
  • React.memo with custom comparison functions for efficient re-renders
  • Memoized performance store computations with checksum validation
  • Lazy loading for heavy components (PDF viewer, charts)

User Experience:

  • Real-time SSE streaming with visual token buffering
  • Drag-and-drop document upload with validation
  • Live performance metrics and timing history
  • Comprehensive error boundaries with field operator logging
  • Offline detection with automatic recovery
  • Dark mode with system preference detection

Accessibility:

  • Full keyboard navigation support
  • ARIA labels and semantic HTML
  • Screen reader compatibility
  • Focus management and tab order

Critical Files

src/
├── App.tsx                    // Main application entry with error boundary
├── hooks/
│   ├── useChat.ts            // Chat orchestration with token batching
│   └── useStreamingChat.ts   // SSE streaming implementation
├── store/
│   ├── useStore.ts           // Main Zustand store with persistence
│   └── performanceStore.ts   // Memoized performance calculations
├── services/
│   └── api.ts                // API client with retry logic
└── components/
    ├── Chat/
    │   ├── ChatWindow.tsx    // Main chat interface
    │   └── ChatMessage.tsx   // Optimized message rendering (React.memo)
    └── ErrorBoundary/
        └── ErrorBoundary.tsx // Production error recovery

Desktop Bridge Layer

Technology Stack

  • Framework: Tauri 2.9.0 (stable)
  • Language: Rust 2021 edition with tokio async runtime
  • HTTP Client: reqwest 0.11 with connection pooling
  • Plugins:
    • tauri-plugin-fs - File system access
    • tauri-plugin-dialog - Native dialogs
    • tauri-plugin-store - Key-value storage
    • tauri-plugin-shell - System integration
    • tauri-plugin-log - Structured logging

Responsibilities

IPC Bridge:

  • Type-safe command handlers with serde serialization
  • Bidirectional communication (Frontend ↔ Rust ↔ Backend)
  • Async/await support for non-blocking operations
  • Error propagation with detailed context

Backend Management:

  • Health monitoring with 10-second polling interval
  • HTTP proxy for backend API calls
  • Connection pooling and retry logic
  • Status tracking (running, healthy, error states)

Native Integration:

  • File picker dialogs
  • System notifications
  • URL opening in default browser
  • Window management and lifecycle

Security Model

// Command allowlisting prevents arbitrary Rust execution
#[tauri::command]
async fn check_atlas_health(
    state: State<'_, Arc<tokio::sync::Mutex<AppState>>>
) -> Result<HealthStatus, String> {
    // Type-safe, validated, allowlisted command
}

Security Boundaries:

  • Command allowlisting (only explicitly registered functions)
  • HTTP client isolated from frontend (prevents SSRF)
  • Shell access restricted to open::that() for URLs only
  • State management via IPC prevents direct access
  • WebView sandboxing with capability system

Distribution

Packaging:

  • Windows: NSIS installer with WebView2 bootstrapper (~25-30MB)
  • macOS: DMG with code signing
  • Linux: AppImage with required dependencies bundled

Build Process:

  • Vite builds React frontend → dist/
  • Cargo compiles Rust binary → native executable
  • Tauri packages frontend + binary + resources
  • Platform-specific installer creation

Backend Layer

Technology Stack

  • Framework: FastAPI 0.115.6 with async/await
  • Language: Python 3.11 with type hints
  • LLM Backend: llama.cpp 0.3.2 with CUDA 12.1
  • Vector Database: Qdrant 1.15.0 (production)
  • Cache Layer: Redis 7.2 with LRU eviction
  • Document Processing: PyPDF2, python-docx, unstructured
  • Embeddings: sentence-transformers (BAAI/bge-large-en-v1.5)
  • Reranking: BGE reranker (GPU-accelerated)

Core Architecture

Apollo’s backend implements 8 architectural layers:

  • API Layer: REST endpoints with rate limiting, CORS, validation
  • RAG Engine: Master orchestrator with parallel initialization
  • Inference Engine: llama.cpp with thread-safe executor
  • Retrieval Engine: Adaptive strategies (simple/hybrid/advanced); see the fusion sketch after this list
  • Vector Stores: Qdrant (dense + sparse search)
  • Document Processor: Multi-format support with semantic chunking
  • Caching System: L1-L5 ATLAS protocol
  • Model Manager: Hot-swappable LLM runtime
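
To make the hybrid strategy concrete: it combines dense vector similarity with sparse BM25 term matching. Below is a minimal Python sketch of one common fusion approach (weighted reciprocal-rank fusion); the retriever interfaces, the hit.id attribute, and the alpha weight are illustrative assumptions, not Apollo's actual internals.

def hybrid_search(query, dense_retriever, bm25_retriever, k=10, alpha=0.7):
    # Hypothetical retriever interfaces: each returns ranked hits with an .id
    dense_hits = dense_retriever.search(query, limit=k)    # cosine similarity
    sparse_hits = bm25_retriever.search(query, limit=k)    # BM25 term matching

    # Weighted reciprocal-rank fusion: earlier ranks contribute more;
    # alpha balances dense vs. sparse influence
    scores = {}
    for rank, hit in enumerate(dense_hits):
        scores[hit.id] = scores.get(hit.id, 0.0) + alpha / (rank + 1)
    for rank, hit in enumerate(sparse_hits):
        scores[hit.id] = scores.get(hit.id, 0.0) + (1 - alpha) / (rank + 1)

    # Return the top-k document ids by fused score
    return sorted(scores, key=scores.get, reverse=True)[:k]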

Performance Characteristics

Startup Time: 20-30 seconds (optimized from 78s via parallel init)

Query Latency:
  Cache Hit: <1ms (98% reduction from 50-100ms)
  Simple Mode: 8-15 seconds
  Adaptive Mode: 10-25 seconds

Inference Speed: 80-100 tokens/sec (RTX 5080, Q5_K_M quantization)

Cache Hit Rate: 60-80% in production workloads

Resource Usage:
  RAM: 8GB idle, 12GB active, 20GB during reindexing
  VRAM: 6GB loaded model, 10GB with reranker
  CPU: <5% idle, 100% single-core during generation

Critical Innovations

Parallel Component Initialization

# Groups independent components for concurrent loading
# Speedup: 3.4x (78s → 23s)
import asyncio

async def initialize():
    # GROUP 1: Fully independent components load concurrently
    await asyncio.gather(
        init_embedding_cache(),
        init_embeddings(),
        load_bm25_data(),
        init_llm()
    )

    # GROUP 2: Depends on GROUP 1 (cache, stores, warmup need the above)
    await asyncio.gather(
        init_cache_manager(),
        init_vectorstore(),
        init_bm25_retriever(),
        llm_warmup()
    )

    # GROUP 3: Final components
    await asyncio.gather(
        load_metadata(),
        init_conversation_memory(),
        init_confidence_scorer()
    )

Multi-Level Caching (ATLAS Protocol)

  • L1 - Query Cache: Exact → Normalized → Semantic matching (Redis hash); see the sketch after this list
  • L2 - Embedding Cache: 98% latency reduction (50ms → <1ms)
  • L3 - Conversation Memory: Ring buffer with automatic summarization
  • L4 - Model Cache: Pre-cached in Docker image (15-20s saved)
  • L5 - Query Prefetcher: Predictive pattern-based prefetching
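
A minimal sketch of how the L1 tiers could be layered, assuming one Redis hash per tier and a hypothetical embed_fn; the key names and the 0.95 similarity threshold are illustrative, not Apollo's actual schema:

import json
import numpy as np
import redis

r = redis.Redis(decode_responses=True)

def _normalize(q: str) -> str:
    # Collapse case and whitespace so trivially different phrasings share a key
    return " ".join(q.lower().split())

def l1_lookup(query: str, embed_fn, threshold: float = 0.95):
    # Tier 1: exact match on the raw query string
    hit = r.hget("atlas:l1:exact", query)
    if hit is not None:
        return json.loads(hit)

    # Tier 2: match on the normalized form
    hit = r.hget("atlas:l1:normalized", _normalize(query))
    if hit is not None:
        return json.loads(hit)

    # Tier 3: semantic match against embeddings of recently cached queries
    # (a linear scan for clarity; a real implementation would index these)
    q_vec = np.asarray(embed_fn(query))
    for entry in r.hgetall("atlas:l1:semantic").values():
        cached = json.loads(entry)
        c_vec = np.asarray(cached["embedding"])
        cos = float(q_vec @ c_vec /
                    (np.linalg.norm(q_vec) * np.linalg.norm(c_vec)))
        if cos >= threshold:
            return cached["answer"]

    return None  # miss on all three tiers: fall through to retrieval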

KV Cache Preservation

# Trade-off: minor context bleed in exchange for a 40-60% speedup.
# The KV cache is intentionally NOT cleared between requests, so shared
# prompt prefixes are not re-evaluated on subsequent generations.
def generate(self, prompt, **kwargs):
    response = self.llm(prompt, max_tokens=kwargs.get('max_tokens', 512))
    return response['choices'][0]['text']

Hot Model Swapping

Workflow:

  • POST /api/models/switch with model_id
  • Validate and acquire switching lock (blocks queries)
  • Unload current model + VRAM cleanup (torch.cuda.empty_cache())
  • Load new model from /models/ directory
  • Test generation (5 tokens)
  • Sync RAGEngine LLM reference
  • Clear cache (old model answers incompatible)
  • Release lock

Timing: 15-30 seconds total downtime. Safety: a mutex prevents concurrent switches, with automatic fallback on failure.
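
The workflow above amounts to a lock-guarded swap. In this simplified sketch, _unload_current, _load, and _test_generation are hypothetical stand-ins for Apollo's actual loader internals:

import asyncio

class ModelManager:
    def __init__(self):
        self._switch_lock = asyncio.Lock()   # mutex: one switch at a time
        self.llm = None
        self.current_model_id = None

    async def switch_model(self, model_id: str, rag_engine, cache):
        async with self._switch_lock:            # blocks concurrent switches
            previous = self.current_model_id
            try:
                self._unload_current()           # free VRAM (empty_cache etc.)
                self.llm = self._load(model_id)  # load from /models/ directory
                self._test_generation()          # smoke test (~5 tokens)
                rag_engine.llm = self.llm        # sync RAGEngine reference
                await cache.clear()              # old model answers are stale
                self.current_model_id = model_id
            except Exception:
                # Automatic fallback: restore the previous model on failure
                self.llm = self._load(previous)
                rag_engine.llm = self.llm
                raise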

Communication Patterns

Dual Communication Architecture

Apollo uses two parallel communication channels for optimization:

Tauri IPC (JSON-RPC):

  • Use Case: Native features, health checks, system integration
  • Protocol: JSON serialization over IPC bridge
  • Latency: <10ms for lightweight operations
  • Examples: check_atlas_health(), switch_model(), show_notification()

Direct HTTP (Fetch API):

  • Use Case: Data-intensive operations, streaming
  • Protocol: REST + Server-Sent Events
  • Latency: Varies by operation (ms for health, seconds for queries)
  • Examples: /api/query/stream, /api/documents/upload

Why Dual Channels? IPC adds serialization overhead on every message. For high-throughput operations like token streaming (100+ tokens/sec), direct HTTP bypasses the IPC bridge, reducing latency by 40-60%.
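
On the backend side, SSE streaming maps naturally onto FastAPI's StreamingResponse. A minimal sketch, assuming a query parameter and a stand-in token generator; the real endpoint's request shape and payload format may differ:

from fastapi import FastAPI
from fastapi.responses import StreamingResponse

app = FastAPI()

async def generate_tokens(q: str):
    # Stand-in for the llama.cpp streaming generator
    for token in ("Apollo ", "streams ", "tokens."):
        yield token

@app.get("/api/query/stream")
async def stream_query(q: str):
    async def event_stream():
        async for token in generate_tokens(q):
            # Each SSE frame is "data: <payload>" followed by a blank line
            yield f"data: {token}\n\n"
        yield "data: [DONE]\n\n"

    return StreamingResponse(event_stream(), media_type="text/event-stream")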

IPC Bridge Pattern

// Frontend: Type-safe Tauri invoke
import { invoke } from '@tauri-apps/api/core';
 
interface HealthStatus {
  status: 'healthy' | 'unhealthy' | 'degraded';
  components: {
    vectorstore: 'ready' | 'error';
    llm: 'ready' | 'error';
    cache: 'ready' | 'error';
  };
}
 
export const checkHealth = async (): Promise<HealthStatus> => {
  return await invoke<HealthStatus>('check_atlas_health');
};

// Backend: Rust command handler
#[tauri::command]
async fn check_atlas_health(
    state: State<'_, Arc<tokio::sync::Mutex<AppState>>>
) -> Result<HealthStatus, String> {
    let app_state = state.lock().await;
 
    let response = app_state.http_client
        .get(format!("{}/api/health", app_state.backend_url))
        .timeout(Duration::from_secs(5))
        .send()
        .await
        .map_err(|e| format!("Health check failed: {}", e))?;
 
    response.json().await.map_err(|e| format!("Parse error: {}", e))
}

Data Flow Overview

Query Processing Pipeline (High-Level)

User Query --> Frontend Validation --> Cache Lookup (L1-L2)
    |
    +--> [Cache Hit] --> Return cached result (<1ms)
    |
    +--> [Cache Miss] --> Continue to retrieval
                |
                v
        Security Checks (rate limit, injection detection)
                |
                v
        Query Classification (simple/moderate/complex)
                |
                v
        Document Retrieval (mode-specific strategy)
                |
                v
        Answer Generation (llama.cpp GPU)
                |
                v
        Confidence Scoring (parallel)
                |
                v
        Response Formatting
                |
                v
        Background Tasks (cache update, prefetch)
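
In Python terms, the pipeline reduces to roughly the following orchestration. This is a simplified sketch; the engine attribute names (cache, security, classifier, retriever, llm, scorer) are hypothetical stand-ins for Apollo's components:

import asyncio

async def answer(query: str, engine):
    # 1. Cache lookup (L1/L2): a hit returns immediately (<1ms path)
    cached = await engine.cache.lookup(query)
    if cached is not None:
        return cached

    # 2. Security checks, then classification picks a retrieval strategy
    engine.security.check(query)              # rate limit + injection detection
    mode = engine.classifier.classify(query)  # simple / moderate / complex
    docs = await engine.retriever.retrieve(query, mode=mode)

    # 3. Generation and confidence scoring run concurrently,
    #    matching the "parallel" note in the timing breakdown below
    answer_text, confidence = await asyncio.gather(
        engine.llm.generate(query, docs),
        engine.scorer.score(query, docs),
    )

    # 4. Background tasks (cache update, prefetch) stay off the hot path
    asyncio.create_task(engine.cache.store(query, answer_text))
    return {"answer": answer_text, "confidence": confidence, "sources": docs}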

Timing Breakdown:

  • Security checks: 5ms
  • Classification (if adaptive): 200ms
  • Embedding generation: 50ms (CPU) or <1ms (cache hit)
  • Vector search: 100ms (Qdrant HNSW @ 1M documents)
  • Reranking (if advanced): 60ms (BGE GPU)
  • LLM generation: 8-12s (simple) or 10-15s (adaptive)
  • Confidence scoring: 500ms (parallel with generation)

Document Upload Pipeline (High-Level)

File Selection --> Frontend Validation (size, type)
    |
    v
POST multipart/form-data --> Backend /api/documents/upload
    |
    v
Filename sanitization --> SHA256 hash --> Duplicate check
    |
    v
Save to documents/ directory
    |
    v
User triggers reindexing
    |
    v
Acquire mutex lock (prevent concurrent reindex)
    |
    v
Text extraction --> Semantic chunking (1024/128)
    |
    v
Generate embeddings (BGE-large on CPU)
    |
    v
Index in Qdrant (dense + sparse BM25)
    |
    v
Atomic swap (temp --> production)
    |
    v
Release lock --> Return stats

Timing: 15-25 seconds for a typical document, depending on size and complexity
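
The sanitization and duplicate-check steps at the top of the flow can be illustrated in a few lines. A hypothetical sketch; the directory layout and return shape are assumptions:

import hashlib
from pathlib import Path

DOCS_DIR = Path("documents")

def save_upload(filename: str, data: bytes) -> dict:
    # Sanitize: keep only the basename so "../" can't escape DOCS_DIR
    safe_name = Path(filename).name
    digest = hashlib.sha256(data).hexdigest()

    # Duplicate check: identical content hash means the file is already stored
    for existing in DOCS_DIR.glob("*"):
        if existing.is_file():
            if hashlib.sha256(existing.read_bytes()).hexdigest() == digest:
                return {"status": "duplicate", "sha256": digest}

    # New document: persist it; reindexing (chunk -> embed -> index -> atomic
    # swap) is triggered separately by the user, as shown in the flow above
    (DOCS_DIR / safe_name).write_bytes(data)
    return {"status": "saved", "sha256": digest}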

Technology Stack Summary

Layer-by-Layer Breakdown

Layer    | Component     | Technology            | Purpose
---------+---------------+-----------------------+------------------------------
Frontend | UI Framework  | React 19.1.1          | Modern concurrent rendering
         | Language      | TypeScript 5.7        | Type-safe development
         | State         | Zustand 5.0.8         | Lightweight state management
         | Styling       | TailwindCSS 3.4.18    | Utility-first CSS
         | Build         | Vite 7.1.7            | Fast HMR and bundling
Bridge   | Runtime       | Tauri 2.9.0           | Native desktop capabilities
         | Language      | Rust 2021             | Performance and safety
         | HTTP Client   | reqwest 0.11          | Async HTTP requests
         | Serialization | serde + serde_json    | Type-safe IPC
Backend  | Framework     | FastAPI 0.115.6       | Async REST API
         | Language      | Python 3.11           | ML/AI ecosystem
         | LLM           | llama.cpp 0.3.2       | GPU inference
         | Embeddings    | sentence-transformers | Dense vectors
         | Vector DB     | Qdrant 1.15.0         | Similarity search
         | Cache         | Redis 7.2             | Multi-tier caching
         | Queue         | asyncio               | Background tasks

Infrastructure

Component    | Version | Configuration         | Purpose
-------------+---------+-----------------------+-------------------------
Docker       | 24+     | Compose v2            | Container orchestration
CUDA Toolkit | 12.1    | cuBLAS, cuDNN         | GPU acceleration
Qdrant       | 1.15.0  | 4 CPU, 16GB RAM       | Vector storage
Redis        | 7.2     | 8GB maxmemory, LRU    | Caching layer
WebView2     | Latest  | Embedded bootstrapper | Desktop rendering

Design Principles

Apollo’s architecture is guided by several core principles:

Performance First

  • Parallel initialization: 3.4x speedup through concurrent component loading
  • Multi-tier caching: 60-80% cache hit rate reduces latency by 98%
  • Token buffering: 60fps cap prevents UI freezes during streaming
  • KV cache preservation: 40-60% speedup in exchange for minor context bleed

Type Safety Everywhere

  • TypeScript: Strict mode with no implicit any
  • Pydantic: Request/response validation with auto-documentation
  • Rust: Compile-time guarantees and memory safety
  • serde: Type-safe serialization across language boundaries

Graceful Degradation

  • Try preferred method → fallback → log: Every component has error handling
  • Offline detection: Automatic recovery when connectivity returns
  • Component isolation: Failures don’t cascade across layers
  • Health monitoring: Background checks with automatic restart on failure

Security by Default

  • Command allowlisting: Only registered Tauri commands can execute
  • Input sanitization: Multi-layer validation (rate limit → injection → schema); see the sketch below
  • Path validation: Prevent directory traversal in file operations
  • WebView sandboxing: Isolated execution environment with capability system
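
As a rough illustration of the rate limit → injection → schema ordering; the limits, patterns, and bounds here are invented for the sketch, not Apollo's real configuration:

import re
import time
from collections import defaultdict, deque

WINDOW_SECONDS, MAX_REQUESTS = 60, 30        # invented limits for illustration
_requests: dict[str, deque] = defaultdict(deque)

INJECTION_PATTERNS = [
    re.compile(r"ignore (all )?previous instructions", re.I),
    re.compile(r"system prompt", re.I),
]

def validate(client_id: str, query: str) -> str:
    # Layer 1: sliding-window rate limit per client
    now, window = time.monotonic(), _requests[client_id]
    while window and now - window[0] > WINDOW_SECONDS:
        window.popleft()
    if len(window) >= MAX_REQUESTS:
        raise PermissionError("rate limit exceeded")
    window.append(now)

    # Layer 2: prompt-injection heuristics
    if any(p.search(query) for p in INJECTION_PATTERNS):
        raise ValueError("query rejected by injection filter")

    # Layer 3: schema-level constraints (length bounds after trimming)
    query = query.strip()
    if not 1 <= len(query) <= 4000:
        raise ValueError("query length out of bounds")
    return query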

Developer Ergonomics

  • Hot reload: Frontend and Tauri changes reflected instantly
  • Type-safe APIs: IntelliSense and compile-time validation
  • Comprehensive logging: Structured logs with field operator export
  • Clear separation: Each layer can be developed independently

Production Readiness

  • Error boundaries: Catch and recover from React errors
  • Resource limits: Docker memory/CPU constraints prevent OOM
  • Health checks: Automatic restart on failure
  • Atomic operations: Prevent corruption during updates

Architecture Philosophy: Apollo prioritizes performance and reliability over simplicity. Every optimization is documented with its trade-offs, and every component is designed for production use.

Next Steps: Deep Dives

Now that you understand the high-level architecture, explore each layer in detail in the dedicated deep-dive pages.

Architecture Diagram Note: This overview uses ASCII art for universal compatibility. For interactive Mermaid diagrams and detailed component flows, see the deep-dive pages.