
API Reference

Apollo RAG provides a REST API for document retrieval and question answering. All endpoints return JSON and support CORS.

Base URL

Local Development:

http://localhost:8000

Production:

https://apollo.onyxlab.ai

All code examples in this documentation use localhost:8000 for local development. Replace this with your deployment URL in production.

Authentication

Currently, Apollo runs without authentication for development. For production deployments, implement authentication middleware or use a reverse proxy like Nginx.

Security: Enable authentication before deploying to production. Apollo's built-in rate limiting (30 requests/minute per IP) protects against abuse, but it does not control access on its own.
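
For example, a minimal API-key check can be added as middleware. The sketch below assumes a FastAPI backend, which these docs do not confirm, and uses a hypothetical APOLLO_API_KEY environment variable:

import os

from fastapi import FastAPI, Request
from fastapi.responses import JSONResponse

app = FastAPI()
API_KEY = os.environ["APOLLO_API_KEY"]  # hypothetical variable; set it in your deployment

@app.middleware("http")
async def require_api_key(request: Request, call_next):
    # Let load balancers reach /health without credentials
    if request.url.path != "/health" and request.headers.get("X-API-Key") != API_KEY:
        return JSONResponse(status_code=401, content={"detail": "Invalid or missing API key"})
    return await call_next(request)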


Endpoints

Health Check

GET /health

Check system status and configuration.

Response:

{
  "status": "healthy",
  "version": "4.1.0",
  "gpu_enabled": true,
  "gpu_count": 1,
  "gpu_name": "NVIDIA RTX 4090",
  "vector_store": "chroma",
  "document_count": 142589,
  "cache_enabled": true,
  "conversation_memory_enabled": true
}

Response Fields:

| Field | Type | Description |
|---|---|---|
| status | string | System health status: healthy or unhealthy |
| version | string | Apollo version number |
| gpu_enabled | boolean | Whether GPU acceleration is active |
| gpu_count | number | Number of available GPUs |
| gpu_name | string | GPU model name (if available) |
| vector_store | string | Active vector store: chroma or qdrant |
| document_count | number | Total indexed documents |
| cache_enabled | boolean | Whether caching layer is active |
| conversation_memory_enabled | boolean | Whether conversation memory is enabled |

cURL Example:

curl http://localhost:8000/health
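
Python Example (readiness polling):

For scripted deployments, a small helper can poll /health until the system reports healthy. A sketch using the requests library:

import time

import requests

def wait_until_healthy(base_url: str = "http://localhost:8000", timeout_s: float = 60.0) -> bool:
    """Poll /health until status is 'healthy' or the timeout elapses."""
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        try:
            data = requests.get(f"{base_url}/health", timeout=5).json()
            if data.get("status") == "healthy":
                return True
        except requests.RequestException:
            pass  # the server may still be starting up
        time.sleep(2)
    return False

print("Apollo ready:", wait_until_healthy())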

Query

POST /api/query

Process a question using the RAG system.

Request Body:

{
  "question": "Can I grow a beard?",
  "mode": "simple",
  "use_context": false,
  "rerank_preset": "balanced"
}

Parameters:

| Parameter | Type | Required | Default | Description |
|---|---|---|---|---|
| question | string | ✅ Yes | - | The question to answer (max 10,000 chars) |
| mode | string | ❌ No | "simple" | Retrieval mode: simple or adaptive |
| use_context | boolean | ❌ No | false | Use conversation history |
| rerank_preset | string | ❌ No | "balanced" | Re-ranking preset: speed, balanced, or quality |

Retrieval Modes:

Simple Mode

  • Use for: Direct questions, quick lookups
  • Speed: 8-15s (GPU) / 30-60s (CPU)
  • Strategy: Pure vector search with top-k retrieval
  • Accuracy: High for straightforward questions

Example:

{
  "question": "What are the fitness standards?",
  "mode": "simple"
}

Adaptive Mode

  • Use for: Complex questions, multi-hop reasoning, research
  • Speed: 15-90s (varies by complexity)
  • Strategy: Query classification → Hybrid search → Re-ranking
  • Accuracy: Superior for complex, multi-faceted questions

Example:

{
  "question": "Compare beard policies across different regulations",
  "mode": "adaptive"
}

Re-ranking Presets:

| Preset | Speed | Accuracy | Best For |
|---|---|---|---|
| speed | Fastest | Good | High-volume queries, simple questions |
| balanced | Medium | Better | General use (recommended) |
| quality | Slowest | Best | Critical accuracy, research tasks |

Response:

{
  "answer": "According to AFI 36-2903, Air Force members are generally prohibited from growing beards except for approved medical or religious accommodations...",
  "sources": [
    {
      "content": "3.1.2. Beards. Beards are not authorized except for...",
      "metadata": {
        "source": "AFI_36-2903_Dress_and_Appearance.pdf",
        "page": 12,
        "chunk_id": "chunk_1234"
      },
      "relevance_score": 0.9234
    }
  ],
  "metadata": {
    "processing_time_ms": 12456,
    "mode": "simple",
    "cache_hit": false,
    "query_type": "simple",
    "strategy": "vector_search",
    "retrieved_chunks": 7,
    "model": "llama3.1:8b",
    "gpu_accelerated": true
  },
  "explanation": null
}

Response Fields:

| Field | Type | Description |
|---|---|---|
| answer | string | Generated answer to the question |
| sources | array | Retrieved source documents with relevance scores |
| sources[].content | string | Relevant text excerpt from source document |
| sources[].metadata | object | Document metadata (source file, page, chunk ID) |
| sources[].relevance_score | number | Relevance score (0-1, higher is more relevant) |
| metadata | object | Processing metadata and performance metrics |
| metadata.processing_time_ms | number | Total processing time in milliseconds |
| metadata.mode | string | Retrieval mode used |
| metadata.cache_hit | boolean | Whether answer was served from cache |
| metadata.query_type | string | Classified query complexity |
| metadata.strategy | string | Retrieval strategy applied |
| metadata.retrieved_chunks | number | Number of chunks retrieved |
| metadata.model | string | LLM model used for generation |
| metadata.gpu_accelerated | boolean | Whether GPU was used |
| explanation | string or null | Detailed explanation (adaptive mode only) |

cURL Example:

curl -X POST http://localhost:8000/api/query \
  -H "Content-Type: application/json" \
  -d '{
    "question": "Can I grow a beard?",
    "mode": "simple",
    "use_context": false
  }'

Python Example:

import requests
 
response = requests.post(
    "http://localhost:8000/api/query",
    json={
        "question": "Can I grow a beard?",
        "mode": "simple",
        "use_context": False,
        "rerank_preset": "balanced"
    }
)
 
if response.status_code == 200:
    result = response.json()
    print(f"Answer: {result['answer']}")
    print(f"\nSources ({len(result['sources'])}):")
    for i, source in enumerate(result['sources'], 1):
        print(f"{i}. {source['metadata']['source']} (score: {source['relevance_score']:.3f})")
    print(f"\nProcessing time: {result['metadata']['processing_time_ms']}ms")
    print(f"GPU accelerated: {result['metadata']['gpu_accelerated']}")
else:
    print(f"Error: {response.status_code} - {response.text}")

JavaScript/TypeScript Example:

interface QueryResponse {
  answer: string;
  sources: Array<{
    content: string;
    metadata: {
      source: string;
      page?: number;
      chunk_id: string;
    };
    relevance_score: number;
  }>;
  metadata: {
    processing_time_ms: number;
    mode: string;
    cache_hit: boolean;
    gpu_accelerated: boolean;
  };
}
 
async function queryApollo(question: string): Promise<QueryResponse> {
  const response = await fetch('http://localhost:8000/api/query', {
    method: 'POST',
    headers: { 'Content-Type': 'application/json' },
    body: JSON.stringify({
      question,
      mode: 'simple',
      use_context: false
    })
  });
 
  if (!response.ok) {
    throw new Error(`HTTP ${response.status}: ${await response.text()}`);
  }
 
  return response.json();
}
 
// Usage
const result = await queryApollo("Can I grow a beard?");
console.log(`Answer: ${result.answer}`);
console.log(`Processing time: ${result.metadata.processing_time_ms}ms`);

Error Responses:

| Status Code | Meaning | Example |
|---|---|---|
| 400 | Bad Request | Invalid parameters, empty query |
| 429 | Rate Limit Exceeded | More than 30 requests/minute |
| 500 | Internal Server Error | RAG engine error, model failure |
| 503 | Service Unavailable | System not initialized |

Example Error:

{
  "detail": "Query cannot be empty or whitespace-only"
}

Query Stream

POST /api/query/stream

Process a query with streaming response using Server-Sent Events (SSE).

Request Body: Same as /api/query

Response Format: text/event-stream

Each event is a JSON object:

data: {"type":"token","content":"The"}
data: {"type":"token","content":" Air"}
data: {"type":"token","content":" Force"}
...
data: {"type":"sources","content":[...]}
data: {"type":"metadata","content":{...}}
data: {"type":"done"}

Event Types:

| Type | Description | Content |
|---|---|---|
| token | Generated text token | String: next word/phrase |
| sources | Retrieved sources | Array: source documents |
| metadata | Processing metadata | Object: performance metrics |
| done | Generation complete | null |
| error | Error occurred | String: error message |

cURL Example:

curl -N -X POST http://localhost:8000/api/query/stream \
  -H "Content-Type: application/json" \
  -d '{
    "question": "Explain the chain of command",
    "mode": "simple"
  }'

JavaScript Example (fetch):

The browser EventSource API only supports GET requests, so this POST endpoint is read with fetch and a streamed response body instead. This version runs under Node 18+ (swap process.stdout.write for DOM updates in a browser):

const response = await fetch('http://localhost:8000/api/query/stream', {
  method: 'POST',
  headers: { 'Content-Type': 'application/json' },
  body: JSON.stringify({
    question: 'Explain the chain of command',
    mode: 'simple'
  })
});

const reader = response.body.getReader();
const decoder = new TextDecoder();
let buffer = '';

while (true) {
  const { done, value } = await reader.read();
  if (done) break;
  buffer += decoder.decode(value, { stream: true });

  // SSE events are separated by blank lines
  const events = buffer.split('\n\n');
  buffer = events.pop();  // keep any incomplete event for the next chunk

  for (const event of events) {
    if (!event.startsWith('data: ')) continue;
    const data = JSON.parse(event.slice(6));  // skip "data: " prefix

    switch (data.type) {
      case 'token':
        process.stdout.write(data.content);  // stream text as it arrives
        break;
      case 'sources':
        console.log('\n\nSources:', data.content);
        break;
      case 'metadata':
        console.log('Metadata:', data.content);
        break;
      case 'done':
        console.log('\n\nGeneration complete');
        break;
      case 'error':
        console.error('Error:', data.content);
        break;
    }
  }
}

Python Example (with requests):

import requests
import json
 
response = requests.post(
    'http://localhost:8000/api/query/stream',
    json={
        'question': 'Explain the chain of command',
        'mode': 'simple'
    },
    stream=True  # Enable streaming
)
 
for line in response.iter_lines():
    if line:
        # SSE format: "data: {json}\n\n"
        if line.startswith(b'data: '):
            event = json.loads(line[6:])  # Skip "data: " prefix
 
            if event['type'] == 'token':
                print(event['content'], end='', flush=True)
            elif event['type'] == 'sources':
                print(f"\n\nSources: {len(event['content'])} documents")
            elif event['type'] == 'metadata':
                print(f"\nProcessing time: {event['content']['processing_time_ms']}ms")
            elif event['type'] == 'done':
                print("\n\nComplete")
                break

Clear Conversation

POST /api/conversation/clear

Clear conversation history and reset context.

Response:

{
  "success": true,
  "message": "Conversation memory cleared successfully"
}

cURL Example:

curl -X POST http://localhost:8000/api/conversation/clear

Python Example:

import requests
 
response = requests.post('http://localhost:8000/api/conversation/clear')
result = response.json()
print(result['message'])

Rate Limiting

Apollo enforces rate limiting to prevent abuse:

  • Limit: 30 requests per 60 seconds per IP address
  • Response when exceeded: HTTP 429 with error message

Rate Limit Headers:

Currently not implemented. Consider adding in production:

X-RateLimit-Limit: 30
X-RateLimit-Remaining: 25
X-RateLimit-Reset: 1640995200
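
Until such headers exist, clients can treat HTTP 429 as a signal to back off and retry. A sketch with exponential backoff:

import time

import requests

def query_with_backoff(question: str, max_retries: int = 5) -> dict:
    delay = 2.0
    for _ in range(max_retries):
        response = requests.post(
            "http://localhost:8000/api/query",
            json={"question": question, "mode": "simple"},
            timeout=120,
        )
        if response.status_code != 429:
            response.raise_for_status()
            return response.json()
        time.sleep(delay)  # no Retry-After header yet, so wait a fixed delay
        delay *= 2         # exponential backoff: 2s, 4s, 8s, ...
    raise RuntimeError("Rate limit persisted after retries")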

Security Features

Prompt Injection Detection

Apollo automatically detects and logs potential prompt injection attempts:

Detected patterns:

  • ignore previous instructions
  • reveal your system prompt
  • you are now...
  • Role markers: [INST], ###system, etc.

Behavior: Logged but not blocked (configurable in production)
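
For illustration, a detector in this spirit can be a single regex pass over the query. The sketch below mirrors the patterns listed above; it is not Apollo's actual detection code:

import logging
import re

INJECTION_PATTERNS = [
    r"ignore\s+previous\s+instructions",
    r"reveal\s+your\s+system\s+prompt",
    r"you\s+are\s+now",
    r"\[INST\]|###\s*system",  # role markers
]

def flag_injection(query: str) -> bool:
    for pattern in INJECTION_PATTERNS:
        if re.search(pattern, query, re.IGNORECASE):
            logging.warning("Possible prompt injection: %r", query[:100])
            return True  # logged, not blocked, matching the default behavior
    return False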

Input Sanitization

All queries are sanitized:

  • Null bytes removed
  • Control characters stripped
  • Length validated (max 10,000 characters)
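
A minimal sketch of these steps (illustrative, not Apollo's actual implementation):

import re

MAX_QUERY_LEN = 10_000

def sanitize_query(raw: str) -> str:
    text = raw.replace("\x00", "")  # remove null bytes
    # Strip control characters while keeping tabs and newlines
    text = re.sub(r"[\x01-\x08\x0b\x0c\x0e-\x1f\x7f]", "", text).strip()
    if not text:
        raise ValueError("Query cannot be empty or whitespace-only")
    if len(text) > MAX_QUERY_LEN:
        raise ValueError(f"Query exceeds {MAX_QUERY_LEN} characters")
    return text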

CORS Configuration

For production, configure CORS in docker-compose.yml:

environment:
  - CORS_ORIGINS=https://yourdomain.com,https://app.yourdomain.com

Performance Optimization

Caching

Apollo includes multi-layer caching:

  • Embedding Cache: Stores vector embeddings for repeated queries
  • Response Cache: Caches complete answers for identical questions
  • Collection Metadata Cache: Optimized document metadata loading

Cache Hit Example:

{
  "metadata": {
    "processing_time_ms": 124,  // <-- 100x faster!
    "cache_hit": true,
    "cached_at": "2025-01-15T10:30:00Z"
  }
}
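
The same idea is easy to approximate on the client side by keying responses on the normalized question. A sketch (Apollo's internal cache keys are not documented here):

import hashlib

import requests

_cache: dict[str, dict] = {}

def cached_query(question: str) -> dict:
    # Normalize so trivial whitespace/case differences still hit the cache
    key = hashlib.sha256(question.strip().lower().encode()).hexdigest()
    if key not in _cache:
        response = requests.post(
            "http://localhost:8000/api/query",
            json={"question": question, "mode": "simple"},
            timeout=120,
        )
        response.raise_for_status()
        _cache[key] = response.json()
    return _cache[key]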

GPU Acceleration

When a GPU is available, Apollo automatically:

  • Uses CUDA for embedding generation (10x faster)
  • Accelerates vector search operations
  • Batches embedding computations
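
As an illustration of the batching pattern (not Apollo's actual embedding stack, which these docs do not specify), a sentence-transformers model can pick a device and encode queries in batches:

import torch
from sentence_transformers import SentenceTransformer

device = "cuda" if torch.cuda.is_available() else "cpu"
model = SentenceTransformer("all-MiniLM-L6-v2", device=device)  # hypothetical model choice

texts = ["Can I grow a beard?", "What are the fitness standards?"]
# Batching amortizes host-to-GPU transfer overhead across many inputs
embeddings = model.encode(texts, batch_size=64)
print(embeddings.shape)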

GPU vs CPU Performance:

| Operation | GPU Time | CPU Time | Speedup |
|---|---|---|---|
| Embedding (512 tokens) | 0.5s | 5.2s | 10.4x |
| Vector search (100k docs) | 0.1s | 3.4s | 34x |
| End-to-end query | 8-15s | 30-60s | 3-4x |

Code Examples

Conversation Flow

import requests
 
BASE_URL = "http://localhost:8000"
 
# Start conversation with context
response1 = requests.post(
    f"{BASE_URL}/api/query",
    json={
        "question": "What are the beard grooming standards?",
        "mode": "simple",
        "use_context": True  # Enable context
    }
)
print(response1.json()['answer'])
 
# Follow-up question (references previous context)
response2 = requests.post(
    f"{BASE_URL}/api/query",
    json={
        "question": "What about mustaches?",  # Implicit reference
        "mode": "simple",
        "use_context": True  # Use previous context
    }
)
print(response2.json()['answer'])
 
# Clear when switching topics
requests.post(f"{BASE_URL}/api/conversation/clear")

Batch Queries

questions = [
    "Can I grow a beard?",
    "What are the fitness standards?",
    "Explain the chain of command"
]
 
results = []
for question in questions:
    response = requests.post(
        "http://localhost:8000/api/query",
        json={"question": question, "mode": "simple"}
    )
    results.append(response.json())
 
# Process results
for i, result in enumerate(results):
    print(f"\nQ{i+1}: {questions[i]}")
    print(f"A: {result['answer'][:200]}...")
    print(f"Time: {result['metadata']['processing_time_ms']}ms")

Error Handling

import requests
from requests.exceptions import RequestException
 
def query_apollo_safe(question: str, mode: str = "simple"):
    try:
        response = requests.post(
            "http://localhost:8000/api/query",
            json={"question": question, "mode": mode},
            timeout=120  # 2 minute timeout
        )
        response.raise_for_status()
        return response.json()
 
    except requests.exceptions.Timeout:
        print("Error: Request timed out after 2 minutes")
        return None
 
    except requests.exceptions.HTTPError as e:
        if e.response.status_code == 429:
            print("Error: Rate limit exceeded. Wait before retrying.")
        elif e.response.status_code == 503:
            print("Error: Service unavailable. System may be starting up.")
        else:
            print(f"HTTP Error {e.response.status_code}: {e.response.text}")
        return None
 
    except RequestException as e:
        print(f"Network error: {e}")
        return None
 
# Usage
result = query_apollo_safe("Can I grow a beard?")
if result:
    print(result['answer'])

Next Steps

Ready to integrate Apollo? Check out the interactive demos to explore Apollo’s capabilities.